Similar documents
20 similar records found (search time: 31 ms)
1.
Chen PC. Biosystems 2005, 81(2): 155-163
This article presents an approach for synthesizing target strings in a class of computational models of DNA recombination. The computational models are formalized as splicing systems in the context of formal languages. Given a splicing system (of a restricted type) and a target string to be synthesized, we construct (i) a rule-embedded splicing automaton that recognizes languages containing strings embedded with symbols representing splicing rules, and (ii) an automaton that implicitly recognizes the target string. By manipulating these two automata, we extract all rule sequences that lead to the production of the target string (if that string belongs to the splicing language). An algorithm for synthesizing a certain type of target string based on such rule sequences is presented.

2.
MOTIVATION: Comparison of nucleic acid and protein sequences is a fundamental tool of modern bioinformatics. A dominant method of such string matching is the 'seed-and-extend' approach, in which occurrences of short subsequences called 'seeds' are used to search for potentially longer matches in a large database of sequences. Each such potential match is then checked to see if it extends beyond the seed. To be effective, the seed-and-extend approach needs to catalogue seeds from virtually every substring in the database of search strings. Projects such as mammalian genome assemblies and large-scale protein matching, however, have such large sequence databases that the resulting list of seeds cannot be stored in RAM on a single computer. This significantly slows the matching process. RESULTS: We present a simple and elegant method in which only a small fraction of seeds, called 'minimizers', needs to be stored. Using minimizers can speed up string-matching computations by a large factor while missing only a small fraction of the matches found using all seeds.
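The minimizer idea can be illustrated in a few lines (a sketch, not the authors' implementation; the seed length k, window length w, and lexicographic ordering are arbitrary illustrative choices):

```python
def minimizers(seq, k=4, w=5):
    """Return the set of (position, kmer) minimizers of seq.

    For each window of w consecutive k-mers, keep only the
    lexicographically smallest one (the 'minimizer'). Adjacent
    windows usually share their minimizer, so far fewer seeds
    than k-mers need to be stored.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    chosen = set()
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        best = min(range(w), key=lambda j: window[j])
        chosen.add((start + best, window[best]))
    return chosen
```

Because any two sequences sharing a long exact match also share at least one minimizer of that match, the stored fraction can be small while still anchoring most true matches.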

3.
Saurin W, Marlière P. Biochimie 1985, 67(5): 517-521
A set of sequences can be defined by their common subsequences, and the length of these is a measure of the overall resemblance of the set. Each subsequence corresponds to a succession of symbols embedded in every sequence, following the same order but not necessarily contiguous. Determining the longest common subsequence (LCS) requires the exhaustive testing of all possible common subsequences, which sum up to about 2^L, if L is the length of the shortest sequence. We present a polynomial algorithm (O(n × L^4), where n is the number of sequences) for generating strings related to the LCS and constructed with the sequence alphabet and an indetermination symbol. Such strings are iteratively improved by deleting indetermination symbols and concomitantly introducing the greatest number of alphabet symbols. Processed accordingly, nucleic acid and protein sequences lead to key-words encompassing the salient positions of homologous chains, which can be used for aligning or classifying them, as well as for finding related sequences in data banks.
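For two sequences the exhaustive search is avoided by the standard dynamic-programming LCS algorithm, which heuristics for the multi-sequence case build on; a minimal sketch:

```python
def lcs(a, b):
    """One longest common subsequence of two strings by dynamic
    programming, O(len(a) * len(b)) time and space."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    # Backtrack through the table to recover one optimal subsequence.
    out, i, j = [], m, n
    while i and j:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1]); i -= 1; j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))
```

The exponential blow-up discussed above appears only when all sequences in a set must share the subsequence simultaneously.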

4.
In this paper, we present an approach based on the Burrows–Wheeler transform for comparing protein sequences. The character strings representing amino acid sequences do not directly reflect the physicochemical properties of the residues, and it is very hard to extract key features by reading these long strings directly. The Burrows–Wheeler similarity distribution therefore needs a suitable representation that reflects the relevant properties of the proteins. For the comparison of primary protein sequences, we convert the sequences into digital codes using the Ponnuswamy hydrophobicity index; for the comparison of protein structures, we adapt topology of protein structure strings, a simple but useful representation of protein secondary structure, to the Burrows–Wheeler similarity distribution. Finally, experiments show that the proposed approach is a powerful and useful tool for the comparison of proteins.
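The Burrows–Wheeler transform itself is easy to sketch for short strings (the '$' sentinel and the naive sorted-rotations construction are illustrative choices; practical implementations build the transform from a suffix array):

```python
def bwt(s):
    """Burrows-Wheeler transform via sorted rotations.

    '$' marks the end of the string and is assumed absent from s;
    it sorts before every letter, making the transform invertible.
    """
    s = s + "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(row[-1] for row in rotations)
```

The transform groups symbols with similar right-contexts together, which is what makes its character distribution usable as a similarity signature.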

5.
6.
We present a tool suited for searching for many short nucleotide sequences in large databases, allowing for a predefined number of gaps and mismatches. The commandline-driven program implements a non-deterministic automata matching algorithm on a keyword tree of the search strings. Both queries with and without ambiguity codes can be searched. Search time is short for perfect matches, and retrieval time rises exponentially with the number of edits allowed. AVAILABILITY: The C++ source code for PatMaN is distributed under the GNU General Public License and has been tested on the GNU/Linux operating system. It is available from http://bioinf.eva.mpg.de/patman. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

7.
Word matches are widely used to compare genomic sequences. Complete genome alignment methods often rely on the use of matches as anchors for building their alignments, and various alignment-free approaches that characterize similarities between large sequences are based on word matches. Among matches that are retrieved from the comparison of two genomic sequences, a part of them may correspond to spurious matches (SMs), which are matches obtained by chance rather than by homologous relationships. The number of SMs depends on the minimal match length (ℓ) that has to be set in the algorithm used to retrieve them. Indeed, if ℓ is too small, a lot of matches are recovered but most of them are SMs. Conversely, if ℓ is too large, fewer matches are retrieved but many smaller significant matches are certainly ignored. To date, the choice of ℓ mostly depends on empirical threshold values rather than robust statistical methods. To overcome this problem, we propose a statistical approach based on the use of a mixture model of geometric distributions to characterize the distribution of the length of matches obtained from the comparison of two genomic sequences.
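A mixture of geometric distributions over match lengths can be written down directly (a sketch of the model's form only; the weights and parameters below are arbitrary illustrative values, not fitted ones):

```python
def geometric_pmf(ell, p):
    """P(L = ell) for a geometric distribution on 1, 2, ...:
    a match extends with probability 1 - p and ends with probability p."""
    return (1 - p) ** (ell - 1) * p

def mixture_pmf(ell, weights, ps):
    """Mixture of geometric components: intuitively, spurious matches
    follow a short-tailed component and homologous matches a
    longer-tailed one."""
    return sum(w * geometric_pmf(ell, p) for w, p in zip(weights, ps))
```

Fitting the weights and the per-component parameters (e.g. by EM) then gives a principled cutoff ℓ where the spurious component stops dominating.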

8.
When conducting field studies, it is common for ecologists to choose the locations of sampling units arbitrarily at the time sampling occurs, rather than using a properly randomised sampling design. Unfortunately, this ‘haphazard’ sampling approach cannot provide formal statistical inference from the sample to the population without making untestable assumptions. Here, we argue that two recent technological developments remove the need for haphazard sampling in many situations. A general approach to simple randomised sampling designs is outlined, and some examples demonstrate that even complicated designs can be implemented easily using software that is widely used among ecologists. We consider that more rigorous, randomised sampling designs would strengthen the validity of the conclusions drawn from ecological studies, to the benefit of the discipline as a whole.
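A simple randomised design of the kind advocated here takes only a few lines in any scripting language (a sketch with hypothetical site dimensions; plots are drawn with replacement, so they may overlap):

```python
import random

def random_plot_corners(x_max, y_max, plot_size, n, seed=0):
    """Draw n plot locations uniformly at random: each plot is a
    square of side plot_size whose lower-left corner falls anywhere
    that keeps the whole plot inside the x_max-by-y_max site, so
    every admissible plot position has equal inclusion probability."""
    rng = random.Random(seed)
    return [(rng.uniform(0, x_max - plot_size),
             rng.uniform(0, y_max - plot_size)) for _ in range(n)]
```

Recording the seed makes the design reproducible and auditable, which is part of what distinguishes it from haphazard placement.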

9.
Gene splicing by overlap extension is a new approach for recombining DNA molecules at precise junctions irrespective of nucleotide sequences at the recombination site and without the use of restriction endonucleases or ligase. Fragments from the genes that are to be recombined are generated in separate polymerase chain reactions (PCRs). The primers are designed so that the ends of the products contain complementary sequences. When these PCR products are mixed, denatured, and reannealed, the strands having the matching sequences at their 3' ends overlap and act as primers for each other. Extension of this overlap by DNA polymerase produces a molecule in which the original sequences are 'spliced' together. This technique is used to construct a gene encoding a mosaic fusion protein comprised of parts of two different class-I major histocompatibility genes. This simple and widely applicable approach has significant advantages over standard recombinant DNA techniques.
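The string-level effect of the overlap-extension step can be simulated in a toy sketch that treats sequences as plain strings (real primer design must also consider melting temperature and strand complementarity, which are ignored here):

```python
def splice_by_overlap(left, right, min_overlap=10):
    """Join two fragments whose ends share an identical overlap,
    mimicking splicing by overlap extension: the longest suffix of
    `left` that equals a prefix of `right` acts as the mutual primer."""
    for k in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left[-k:] == right[:k]:
            return left + right[k:]
    raise ValueError("no overlap of at least %d nt" % min_overlap)
```

The junction sequence appears once in the product, exactly as a single extension of the annealed overlap would produce.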

10.
A simple procedure is described for finding similarities between proteins using nucleotide sequence databases. The approach is illustrated by several examples of previously unknown correspondences with important biological implications: Drosophila elongation factor Tu is shown to be encoded by two genes that are differently expressed during development; a cluster of three Drosophila genes likely encode maltases; a flesh-fly fat body protein resembles the hypothesized Drosophila alcohol dehydrogenase ancestral protein; an unknown protein encoded at the multifunctional E. coli hisT locus resembles aspartate beta-semialdehyde dehydrogenase; and the E. coli tyrR protein is related to nitrogen regulatory proteins. These and other matches were discovered using a personal computer of the type available in most laboratories collecting DNA sequence data. As relatively few sequences were sampled to find these matches, it is likely that much of the existing data has not been adequately examined.

11.
12.
Messenger RNA sequences possess specific nucleotide patterns distinguishing them from non-coding genomic sequences. In this study, we explore the utilization of modified Markov models to analyze sequences up to 44 bp, far beyond the 8-bp limit of conventional Markov models, for exon/intron discrimination. In order to analyze nucleotide sequences of this length, their information content is first reduced by conversion into shorter binary patterns via the application of numerous abstraction schemes. After the conversion of genomic sequences to binary strings, homogeneous Markov models trained on the binary sequences are used to discriminate between exons and introns. We term this approach the Binary Abstraction Markov Model (BAMM). High-quality abstraction schemes for exon/intron discrimination are selected using optimization algorithms on supercomputers. The best MM classifiers are then combined using support vector machines into a single classifier. With this approach, over 95% classification accuracy is achieved without taking reading frame into account. With further development, the BAMM approach can be applied to sequences lacking the genetic code such as ncRNAs and 5'-untranslated regions.
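The core of the pipeline, abstraction to binary followed by a first-order Markov model, can be sketched as follows (the purine/pyrimidine scheme below is one illustrative abstraction among the many the study searches over):

```python
import math
from collections import defaultdict

# One illustrative abstraction scheme: purine (A/G) -> '1',
# pyrimidine (C/T) -> '0'.
PURINE = {"A": "1", "G": "1", "C": "0", "T": "0"}

def abstract(seq):
    """Reduce a nucleotide sequence to a binary string."""
    return "".join(PURINE[base] for base in seq)

def train_markov(binary_seqs):
    """Estimate first-order Markov transition probabilities
    from binary training strings."""
    counts = defaultdict(lambda: defaultdict(int))
    for s in binary_seqs:
        for a, b in zip(s, s[1:]):
            counts[a][b] += 1
    return {a: {b: c / sum(nxt.values()) for b, c in nxt.items()}
            for a, nxt in counts.items()}

def log_likelihood(model, s):
    """Score a binary string under the model; unseen transitions
    get a tiny floor probability instead of -infinity."""
    return sum(math.log(model.get(a, {}).get(b, 1e-9))
               for a, b in zip(s, s[1:]))
```

Classification then compares the log-likelihoods of a candidate window under an exon-trained model versus an intron-trained one.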

13.
A genetic algorithm optimization approach for designing treatment plans in intensity-modulated radiotherapy is proposed. The approach determines the beam intensities of the pencil-beam dose model such that the optimized dose distribution closely matches the prescribed dose distribution. The approach indirectly inverts the ill-conditioned dose-projection matrix, which can be very large and extremely sparse. The beam intensities are treated as chromosomes that are encoded as binary strings. The approach was used to design treatment plans for two deceptive clinical test cases. In both cases, cancerous tissues in the planning target region received at least 98% of the prescribed dose level while dose levels delivered to the organs at risk were well within safe limits, with a maximum exposure of 2.5 and 52.5% of the prescribed tolerance level for the brain and prostate cancer cases, respectively. Dose levels delivered to the healthy tissues were small with a mean exposure of 22.8 and 23.5% of the prescribed tolerance level.
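The binary-string chromosome encoding reduces to a standard genetic algorithm loop; a minimal generic sketch (tournament selection, one-point crossover, per-bit mutation; the fitness function passed in is a stand-in, not the pencil-beam dose model):

```python
import random

def evolve(fitness, n_bits=16, pop_size=30, generations=60,
           p_mut=0.05, seed=1):
    """Minimal generational GA on binary strings."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)]
           for _ in range(pop_size)]
    for _ in range(generations):
        def tourney():
            # Binary tournament: the fitter of two random parents wins.
            a, b = rng.sample(pop, 2)
            return a if fitness(a) >= fitness(b) else b
        nxt = []
        while len(nxt) < pop_size:
            p1, p2 = tourney(), tourney()
            cut = rng.randrange(1, n_bits)        # one-point crossover
            child = p1[:cut] + p2[cut:]
            child = [bit ^ (rng.random() < p_mut) for bit in child]
            nxt.append(child)
        pop = nxt
    return max(pop, key=fitness)
```

In the treatment-planning setting, fitness would measure how closely the dose distribution implied by the decoded intensities matches the prescription.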

14.
In this paper we consider networks of evolutionary processors with splicing rules and permitting context (NEPPS) as language generating and computational devices. Such a network consists of several processors placed on the nodes of a virtual graph, each able to perform splicing (a biologically motivated operation) on the words present in that node, according to the splicing rules present there. Before applying the splicing operation on words, we check for the presence of certain symbols (permitting context) in the strings on which the rule is applied. Each node is associated with an input and output filter. When the filters are based on random context conditions, one gets the computational power of Turing machines with networks of size two. We also show how these networks can be used to solve NP-complete problems in linear time.
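A single splicing step with a permitting-context check can be sketched as follows (one common formalization of a rule (u1, u2; u3, u4); the words and rules in the usage below are illustrative):

```python
def splice(x, y, rule, context=None):
    """Apply one splicing rule (u1, u2, u3, u4): if x contains the
    site u1u2 and y contains the site u3u4, cut between the pairs
    and recombine, producing prefix-of-x + u1 + u4 + suffix-of-y.
    `context` is an optional permitting context: symbols that must
    all occur in both words before the rule may fire."""
    u1, u2, u3, u4 = rule
    if context and not (set(context) <= set(x) and set(context) <= set(y)):
        return None
    i = x.find(u1 + u2)
    j = y.find(u3 + u4)
    if i < 0 or j < 0:
        return None
    return x[:i] + u1 + u4 + y[j + len(u3) + len(u4):]
```

In a NEPPS, each node would iterate such steps over its word multiset, with input/output filters deciding which words migrate along the graph edges.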

15.
16.
A test for nucleotide sequence homology
Two macromolecular sequences which have evolved from a common ancestor sequence will tend to include a large number of elements unaffected by replacement mutations in both sequences, as long as the evolutionary rate is not too high or the divergence time is not too great. The positions of corresponding elements may have changed in either daughter sequence due to deletion/insertion mutations involving other sequence elements, but their order can be expected to be the same in both sequences. These sets of correspondences, called matches, may be computed by a recursive algorithm which incorporates constraints on the number of deletion/insertion mutations hypothesized to have occurred. A test is developed which computes the significance of each deletion/insertion hypothesized, based on Monte-Carlo sampling of random sequences with the same base composition as the experimental sequences being tested. Applying the test to 5 S RNAs confirms the relation of Escherichia coli and KB carcinoma 5 S RNAs and establishes the previously undetected homology between Pseudomonas fluorescens and KB 5 S RNAs.
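The Monte-Carlo component, sampling random sequences with the same base composition, amounts to a permutation test; a sketch with a deliberately simple match statistic (positional identities rather than the paper's deletion/insertion-aware matching):

```python
import random

def match_score(a, b):
    """Toy statistic: number of identical aligned positions."""
    return sum(x == y for x, y in zip(a, b))

def permutation_p_value(a, b, trials=1000, seed=0):
    """Shuffle one sequence (preserving its base composition) and
    count how often the shuffled score reaches the observed one."""
    rng = random.Random(seed)
    observed = match_score(a, b)
    hits = 0
    for _ in range(trials):
        shuffled = list(a)
        rng.shuffle(shuffled)
        hits += match_score(shuffled, b) >= observed
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (trials + 1)
```

A small p-value indicates the observed resemblance is unlikely under the composition-preserving null, which is the logic behind the significance test for each hypothesized deletion/insertion.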

17.
The accurate partitioning of Ig H chain V(H)DJ(H) junctions and L chain V(L)J(L) junctions is problematic. We have developed a statistical approach for the partitioning of such sequences, by analyzing the distribution of point mutations between a determined V gene segment and putative Ig regions. The establishment of objective criteria for the partitioning of sequences between V(H), D, and J(H) gene segments has allowed us to more carefully analyze intervening putative nontemplated (N) nucleotides. An analysis of 225 IgM H chain sequences, with five or fewer V mutations, led to the alignment of 199 sequences. Only 5.0% of sequences lacked N nucleotides at the V(H)D junction (N1), and 10.6% at the DJ(H) junction (N2). Long N regions (>9 nt) were seen in 20.6% of N1 regions and 17.1% of N2 regions. Using a statistical analysis based upon known features of N addition, and mutation analysis, two of these N regions aligned with D gene segments, and a third aligned with an inverted D gene segment. Nine additional sequences included possible alignments with a second D segment. Four of the remaining 40 long N1 regions included 5' sequences having six or more matches to V gene end motifs, which may be the result of V gene replacement. Such sequences were not seen in long N2 regions. The long N regions frequently seen in the expressed repertoire of human Ig gene rearrangements can therefore only partly be explained by V gene replacement and D-D fusion.

18.
Fuglsang A. Genetics 2006, 172(2): 1301-1307
In 1990, Frank Wright introduced a method for measuring synonymous codon usage bias in a gene by estimation of the "effective number of codons," N(c). Several attempts have been made recently to improve Wright's estimate of N(c), but the methods that work in cases where a gene encodes a protein not containing all amino acids with degenerate codons have not been tested against each other. In this article I derive five new estimators of N(c) and test them together with the two published estimators, using resampling under rigorous testing conditions. Estimation of codon homozygosity, F, turns out to be a key to the estimation of N(c). F can be estimated in two closely related ways, corresponding to sampling with or without replacement, the latter being what Wright used. The N(c) methods that are based on sampling without replacement showed much better accuracy at short gene lengths than those based on sampling with replacement, indicating that Wright's homozygosity method is superior. Surprisingly, the methods based on sampling with replacement displayed a superior correlation with mRNA levels in Escherichia coli.
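The two homozygosity estimators at issue differ only in whether the two codons are imagined drawn with or without replacement; for a single synonymous family with usage counts n_i they take the following form (a sketch of the estimators only, not the full N(c) machinery):

```python
def homozygosity_with_replacement(counts):
    """F = sum p_i^2: probability two codons drawn with replacement
    from the family are identical."""
    n = sum(counts)
    return sum((c / n) ** 2 for c in counts)

def homozygosity_without_replacement(counts):
    """Wright's estimator, F = (n * sum p_i^2 - 1) / (n - 1):
    probability two codons drawn without replacement are identical."""
    n = sum(counts)
    s = sum((c / n) ** 2 for c in counts)
    return (n * s - 1) / (n - 1)
```

For short genes the without-replacement form removes the upward bias of sum p_i^2, which is why it behaves better at short gene lengths.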

19.
Bootstrap confidence intervals for adaptive cluster sampling
Consider a collection of spatially clustered objects where the clusters are geographically rare. Of interest is estimation of the total number of objects on the site from a sample of plots of equal size. Under these spatial conditions, adaptive cluster sampling of plots is generally useful in improving efficiency in estimation over simple random sampling without replacement (SRSWOR). In adaptive cluster sampling, when a sampled plot meets some predefined condition, neighboring plots are added to the sample. When populations are rare and clustered, the usual unbiased estimators based on small samples are often highly skewed and discrete in distribution. Thus, confidence intervals based on asymptotic normal theory may not be appropriate. We investigated several nonparametric bootstrap methods for constructing confidence intervals under adaptive cluster sampling. To perform bootstrapping, we transformed the initial sample in order to include the information from the adaptive portion of the sample yet maintain a fixed sample size. In general, coverages of bootstrap percentile methods were closer to nominal coverage than the normal approximation.
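The percentile method itself is straightforward once a resampling scheme produces replicate samples; a generic sketch (ordinary i.i.d. resampling, not the adaptive-cluster transformation described above):

```python
import random

def bootstrap_percentile_ci(data, stat, level=0.95, n_boot=2000, seed=0):
    """Nonparametric percentile bootstrap confidence interval:
    resample the data with replacement, recompute the statistic,
    and take empirical quantiles of the replicates."""
    rng = random.Random(seed)
    reps = sorted(stat([rng.choice(data) for _ in data])
                  for _ in range(n_boot))
    alpha = (1 - level) / 2
    lo = reps[int(alpha * n_boot)]
    hi = reps[min(n_boot - 1, int((1 - alpha) * n_boot))]
    return lo, hi
```

Because the interval is read off the empirical distribution of replicates, it inherits any skewness of the estimator, which is exactly why it outperforms the normal approximation for rare, clustered populations.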

20.
Exemplar longest common subsequence
In this paper, we investigate the computational and approximation complexity of the Exemplar Longest Common Subsequence of a set of sequences (ELCS problem), a generalization of the Longest Common Subsequence problem, where the input sequences are over the union of two disjoint sets of symbols, a set of mandatory symbols and a set of optional symbols. We show that different versions of the problem are APX-hard even for instances with two sequences. Moreover, we show that the related problem of determining the existence of a feasible solution of the Exemplar Longest Common Subsequence of two sequences is NP-hard. On the positive side, we first present an efficient algorithm for the ELCS problem over instances of two sequences where each mandatory symbol can appear in total at most three times in the sequences. Furthermore, we present two fixed-parameter algorithms for the ELCS problem over instances of two sequences where the parameter is the number of mandatory symbols.


Copyright © Beijing Qinyun Technology Development Co., Ltd.  京ICP备09084417号