Similar literature
1.
The explosive growth of biological data in recent years has led to the development of new methods to identify DNA sequences. Many recently developed algorithms search DNA sequences for unique subsequences. This paper considers the application of the Burrows-Wheeler transform (BWT) to the problem of unique DNA sequence identification. The BWT transforms a block of data into a format that is extremely well suited for compression. This paper presents a time-efficient algorithm to search for unique DNA sequences in a set of genes. The algorithm is applicable to the identification of yeast species and to other DNA sequence sets.
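
The transform named in this abstract can be illustrated with a minimal sketch. It builds the BWT by sorting all rotations, which is quadratic and only suitable for short strings; it is not the paper's time-efficient algorithm. The function names and the `$` sentinel are assumptions made for illustration.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Burrows-Wheeler transform via sorted rotations (naive, for short inputs)."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the input")
    s = text + sentinel
    # Sort every cyclic rotation, then read off the last column.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column: str, sentinel: str = "$") -> str:
    """Invert the transform by repeatedly prepending the last column and sorting."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        table = sorted(last_column[i] + table[i] for i in range(len(last_column)))
    # The original string is the row that ends with the sentinel.
    row = next(r for r in table if r.endswith(sentinel))
    return row[:-1]


if __name__ == "__main__":
    transformed = bwt("GATTACA")
    print(transformed, inverse_bwt(transformed))
```

The transform tends to group equal characters that share a right-hand context, which is why the output compresses well.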

2.
General-purpose graphics processing units (GPGPUs) are an inexpensive resource for compute-intensive applications that can exploit intrinsic fine-grained parallelism. This paper presents the design and GPGPU implementation of an exact alignment tool for nucleotide sequences based on the Burrows-Wheeler transform. We compare this algorithm with state-of-the-art implementations of the same algorithm on standard CPUs under the same I/O conditions. Excluding disk transfers, the GPU implementation shows a speedup of more than 12 over CPU execution. The implementation exploits parallelism by concurrently searching different sequences on the same reference search tree, maximizing memory locality and ensuring symmetric access to the data. The paper describes the behavior of the algorithm on the GPU and shows good performance scalability, limited only by the size of the GPU's internal memory.
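
Exact matching on a BWT index is commonly done with backward search. The sketch below is a small single-threaded CPU version, assuming a `$` sentinel and a full occurrence table; it does not reproduce the paper's GPU implementation, and all names are illustrative.

```python
def build_fm_index(text: str, sentinel: str = "$"):
    """Build the C array and occurrence table needed for backward search."""
    s = text + sentinel
    # Naive suffix array; fine for a demonstration, too slow for genomes.
    suffix_array = sorted(range(len(s)), key=lambda i: s[i:])
    # s[i - 1] with i == 0 wraps to the sentinel, as the BWT requires.
    bwt = "".join(s[i - 1] for i in suffix_array)
    alphabet = sorted(set(s))
    # first[c] = number of characters in s that are strictly smaller than c.
    first, total = {}, 0
    for c in alphabet:
        first[c] = total
        total += s.count(c)
    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return first, occ, len(s)


def count_occurrences(pattern: str, index) -> int:
    """Count exact occurrences of pattern by scanning it right to left."""
    first, occ, n = index
    lo, hi = 0, n  # half-open range of suffix-array rows
    for ch in reversed(pattern):
        if ch not in first:
            return 0
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    index = build_fm_index("GATTACAGATTACA")
    print(count_occurrences("TTA", index))
```

Each query touches only the shared index, so many reads can be searched independently, which is the property a parallel implementation would rely on.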

3.
4.
Algorithms for exact string matching have substantial application in computational biology. Time-efficient data structures which support a variety of exact string matching queries, such as the suffix tree and the suffix array, have been applied to such problems. As sequence databases grow, more space-efficient approaches to exact matching are becoming more important. One such data structure, the compressed suffix array (CSA), based on the Burrows-Wheeler transform, has been shown to require memory which is nearly equal to the memory requirements of the original database, while supporting common sorts of query problems time efficiently. However, building a CSA from a sequence in efficient space and time is challenging. In 2002, the first space-efficient CSA construction algorithm was presented. That implementation used (1 + 2 log2 |Σ|)(1 + ε) bits per character (where ε is a small fraction). The construction algorithm ran in as much as twice that space, in O(|Σ| n log(n)) time. We have created an implementation which can also achieve these asymptotic bounds, but for small alphabets, and only uses 1/2 (1 + |Σ|)(1 + ε) bits per character, a factor of 2 less space for nucleotide alphabets. We present time and space results for the CSA construction and querying of our implementation on publicly available genome data which demonstrate the practicality of this approach.
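
For orientation, the kind of query these structures answer can be sketched with a plain, uncompressed suffix array and binary search. This is not the compressed suffix array described in the abstract and makes no attempt at space efficiency; the function names are assumptions.

```python
def build_suffix_array(text: str) -> list:
    """Return suffix start positions in lexicographic order (naive construction)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_all(text: str, suffix_array: list, pattern: str) -> list:
    """Return all start positions of pattern, found with two binary searches."""
    m = len(pattern)

    def prefix(rank: int) -> str:
        start = suffix_array[rank]
        return text[start:start + m]

    # Lower bound: first suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # Upper bound: first suffix whose length-m prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])


if __name__ == "__main__":
    genome = "GATTACAGATTACA"
    sa = build_suffix_array(genome)
    print(find_all(genome, sa, "TTA"))
```

A plain suffix array stores one integer per character, which is the memory cost that compressed variants are designed to reduce.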

5.
We discuss the statistical significance of local similarities found between DNA sequences, and illustrate the procedure with reference to the Queen and Korn algorithm. If the longest similarity found for two sequences has length L, this length is said to be significant at the 5% level if there is a probability of no more than 0.05 of finding a length of L or greater between a pair of sequences consisting of randomly chosen bases with the same overall base frequencies. The distribution of longest lengths is related to that of lengths from any particular pair of starting positions on the two sequences. For our implementation of the Queen and Korn algorithm, this latter distribution is constructed by combining the five different blocks of bases that may be added to extend a similarity. A table is given to assess the significance of longest similarities in sequences of length up to 1000 bases. Quite long similarities are expected to occur by chance alone. The critical values we calculate for assessing significance are preferable to expected numbers of similarities used by some commercial computer packages.

6.
Making sense of score statistics for sequence alignments   Total citations: 1, self-citations: 0, citations by others: 1
The search for similarity between two biological sequences lies at the core of many applications in bioinformatics. This paper aims to highlight a few of the principles that should be kept in mind when evaluating the statistical significance of alignments between sequences. The extreme value distribution is first introduced, which in most cases describes the distribution of alignment scores between a query and a database. The effects of the similarity matrix and gap penalty values on the score distribution are then examined, and it is shown that the alignment statistics can undergo an abrupt phase transition. A few types of random sequence databases used in the estimation of statistical significance are presented, and the statistics employed by the BLAST, FASTA and PRSS programs are compared. Finally, the different strategies used to assess the statistical significance of the matches produced by profiles and hidden Markov models are presented.
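
The extreme value statistics mentioned here are often written in the Karlin-Altschul form, E = K·m·n·exp(−λS) and P = 1 − exp(−E). The sketch below evaluates those two formulas; the values of `lam` and `K` in the usage lines are illustrative placeholders, not figures taken from the paper.

```python
import math


def alignment_evalue(score: float, m: int, n: int, lam: float, K: float) -> float:
    """Expected number of chance alignments scoring at least `score`."""
    return K * m * n * math.exp(-lam * score)


def alignment_pvalue(score: float, m: int, n: int, lam: float, K: float) -> float:
    """Probability of at least one chance alignment scoring at least `score`."""
    # -expm1(-E) equals 1 - exp(-E) but stays accurate when E is very small.
    return -math.expm1(-alignment_evalue(score, m, n, lam, K))


if __name__ == "__main__":
    # Hypothetical query of 300 residues against a database of 1,000,000 residues.
    for s in (40, 60, 80, 100):
        e = alignment_evalue(s, 300, 1_000_000, lam=0.267, K=0.041)
        p = alignment_pvalue(s, 300, 1_000_000, lam=0.267, K=0.041)
        print(s, e, p)
```

In practice λ and K depend on the scoring matrix and gap penalties, which is why the abstract treats those choices as central to the score distribution.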

7.
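An investigation of the origin and significance of introns in animal actin genes   Total citations: 5, self-citations: 0, citations by others: 5
Wu Jiajin, Wu Xiaoxia 《Acta Genetica Sinica》1998,25(5):409-415
We analyzed how the distribution of intron insertion positions in the actin family changed during the evolution of the animal kingdom, and compared intron sequences at the same insertion position both within the same isoform and across different isoforms. The high conservation of exon sequences across the whole actin family suggests that the family may have evolved from a common ancestral protein. Among actins of the same isoform, intron sequence similarity varies with evolutionary distance but is consistently high between species separated by a short evolutionary distance. Intron sequence similarity between different isoforms is consistently low, even within a single species (such as human), where it is far lower than between same-isoform actins of closely related species. We infer that the intron sequences of a given actin isoform may have evolved from a common ancestor, whereas those of different isoforms evolved from different ancestors. Taken together, these inferences suggest that introns may have been acquired during the divergence of the proteins. We also found that the distribution of intron positions in the genes encoding the actin family gradually formed two distinctly different patterns along different evolutionary directions. We therefore propose that there may be a necessary connection between the distribution of intron positions and the direction of animal evolution, which offers some basis for the existence of introns.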

8.
The concept of nucleic acid sequence base alterations is presented. The number of base alterations for sequences of different length is established. The definition of "enlarged similarity" of nucleic acid sequences on the basis of sequence base alterations is introduced. Mutual information between sequences is used as a quantitative measure of enlarged similarity for two compared sequences. The method of mutual information calculation is developed considering the correlation of bases in compared sequences. The definitions of correlated similarity and evolution similarity between compared sequences are given. Results of the use of the enlarged similarity approach for DNA sequence analysis are discussed.
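
As a rough illustration of mutual information as a similarity measure, the sketch below computes it position by position for two equal-length sequences. It assumes the sequences are already aligned and ignores the base correlations that the abstract's method accounts for; the function name is hypothetical.

```python
import math
from collections import Counter


def mutual_information(seq_a: str, seq_b: str) -> float:
    """Mutual information in bits between two equal-length, pre-aligned sequences."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(seq_a)
    joint = Counter(zip(seq_a, seq_b))   # counts of aligned base pairs
    count_a = Counter(seq_a)             # marginal counts in seq_a
    count_b = Counter(seq_b)             # marginal counts in seq_b
    total = 0.0
    for (a, b), c in joint.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written with raw counts.
        total += (c / n) * math.log2(c * n / (count_a[a] * count_b[b]))
    return total


if __name__ == "__main__":
    print(mutual_information("ACGT" * 5, "ACGT" * 5))
    print(mutual_information("ACGT" * 5, "TGCA" * 5))
```

Both calls give the same value because a consistent base-for-base substitution preserves all information, which matches the idea that similarity can extend beyond literal identity.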

9.
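Based on two-dimensional graphical representations of biological sequences, we use the Balaban index and an information distribution index to compare the similarity of biological sequences. We illustrate the method with DNA sequences from nine species, including human, and with six proteins, including yar029w.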

10.
It is commonly believed that similarities between the sequences of two proteins imply similarities between their structures. Sequence alignments reliably recognize pairs of proteins of similar structure provided that the percentage sequence identity between their two sequences is sufficiently high. This distinction, however, is statistically less reliable when the percentage sequence identity is lower than 30%, and little is then known about the detailed relationship between the two measures of similarity. Here, we investigate the inverse correlation between structural similarity and sequence similarity on 12 protein structure families. We define the structure similarity between two proteins as the cRMS distance between their structures. The sequence similarity for a pair of proteins is measured as the mean distance between the sequences in the subsets of sequence space compatible with their structures. We obtain an approximation of the sequence space compatible with a protein by designing a collection of protein sequences both stable and specific to the structure of that protein. Using these measures of sequence and structure similarities, we find that structural changes within a protein family are linearly related to changes in sequence similarity.

11.
In this paper we report a novel mathematical method to transform DNA sequences into distribution vectors, which correspond to points in sixty-dimensional space. Each component of the distribution vector represents the distribution of one kind of nucleotide in k segments of the DNA sequences. The mathematical and statistical properties of the distribution vectors are demonstrated and examined with huge datasets of human DNA sequences and random sequences. The determined expectation and standard deviation can make the mapping stable and practicable. Moreover, we apply the distribution vectors to the clustering of the Haemagglutinin (HA) gene of 60 H1N1 viruses from Human, Swine and Avian, the complete mitochondrial genomes from 80 placental mammals and the complete genomes from 50 bacteria. The 60 H1N1 viruses, 80 placental mammals and 50 bacteria are classified accurately and rapidly compared to the multiple sequence alignment methods. The results indicate that the distribution vectors can reveal the similarity and evolutionary relationship among homologous DNA sequences based on the distances between any two of these distribution vectors. The advantage of fast computation offers the distribution vectors the opportunity to deal with a huge amount of DNA sequences efficiently.

12.
In this paper, we present an approach based on the Burrows–Wheeler transform to compare protein sequences. The strings representing amino acid sequences do not reflect physicochemical properties well, and it is very hard to extract any key features by reading these long character strings directly. The use of the Burrows–Wheeler similarity distribution needs a suitable representation which can reflect some interesting properties of the proteins. For the comparison of primary protein sequences we convert the protein sequences into digital codes by the Ponnuswamy hydrophobicity index, and for the comparison of the structure of the proteins we adjust the topology of protein structure strings, which are a simple but useful representation of the secondary structure of proteins, to match the Burrows–Wheeler similarity distribution. Finally, some experiments show that the approach proposed in this paper is a powerful and useful tool for the comparison of proteins.

13.
G von Heijne 《The EMBO journal》1984,3(10):2315-2318
A statistical analysis of the distribution of charged residues in the N-terminal region of 39 prokaryotic and 134 eukaryotic signal sequences reveals a remarkable similarity between the two samples, both in terms of net charge and in terms of the position of charged residues within the N-terminal region, and suggests that the formyl group on Metf is not removed in prokaryotic signal sequences.

14.
Bowtie is an ultrafast, memory-efficient alignment program for aligning short DNA sequence reads to large genomes. For the human genome, Burrows-Wheeler indexing allows Bowtie to align more than 25 million reads per CPU hour with a memory footprint of approximately 1.3 gigabytes. Bowtie extends previous Burrows-Wheeler techniques with a novel quality-aware backtracking algorithm that permits mismatches. Multiple processor cores can be used simultaneously to achieve even greater alignment speeds. Bowtie is open source.

15.
The statistical distribution of nucleic acid similarities.   Total citations: 18, self-citations: 6, citations by others: 12
All pairs of a large set of known vertebrate DNA sequences were searched by computer for most similar segments. Analysis of these data shows that the computed similarity scores are distributed proportionally to the logarithm of the product of the lengths of the sequences involved. This distribution is closely related to recent results of Erdős and others on the longest run of heads in coin tossing. A simple rule is derived for determination of statistical significance of the similarity scores and to assist in relating statistical and biological significance.
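
The dependence on the logarithm of the product of the lengths can be illustrated with the classical first-order approximation for the longest exact match between two random sequences, log(mn) / log(1/p), where p is the chance that two random bases agree. This is a textbook approximation used here as an assumption; it is not the rule derived in the paper.

```python
import math


def expected_longest_match(m: int, n: int, base_freqs: dict) -> float:
    """First-order estimate of the longest exact match between random sequences."""
    # Probability that two independently drawn bases are identical.
    p = sum(f * f for f in base_freqs.values())
    return math.log(m * n) / math.log(1.0 / p)


if __name__ == "__main__":
    uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    for length in (100, 1000, 10000):
        print(length, round(expected_longest_match(length, length, uniform), 2))
```

Under uniform base frequencies, doubling both sequence lengths raises the estimate by exactly one base, which shows how slowly chance similarity grows with length.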

16.
Locally optimal subalignments using nonlinear similarity functions   Total citations: 2, self-citations: 0, citations by others: 2
Nonlinear similarity functions are often better than linear functions at distinguishing interesting subalignments from those due to chance. Nonlinear similarity functions useful for comparing biological sequences are developed. Several new algorithms are presented for finding locally optimal subalignments of two sequences. Unlike previous algorithms, they may use any reasonable similarity function as a selection criterion. Among these algorithms are VV-1, which finds all and only the locally optimal subalignments of two sequences, and CC-1, which finds all and only the weakly locally optimal subalignments of two sequences. The VV-1 algorithm is slow and interesting only for theoretical reasons. In contrast, the CC-1 algorithm has average time complexity O(MN) when used to find only very good subalignments.

17.
The nucleotide sequences from fresh-water Dugesia japonica and marine Planocera reticulata have been determined. The similarity between these two species is only 69%. The Planocera sequence reveals nearly 80% similarity (72-81%) to the sequences of multicellular animals, while the Dugesia sequences are considerably different from them (66-73%).

18.
Comparing DNA or protein sequences plays an important role in the functional analysis of genomes. Despite the many methods available for sequence comparison, few retain the information content of sequences. We propose a new approach, the Yau-Hausdorff method, which considers all translations and rotations when seeking the best match of graphical curves of DNA or protein sequences. The complexity of this method is lower than that of any other two-dimensional minimum Hausdorff algorithm. The Yau-Hausdorff method can be used for measuring the similarity of DNA sequences based on two important tools: the Yau-Hausdorff distance and graphical representation of DNA sequences. The graphical representations of DNA sequences conserve all sequence information, and the Yau-Hausdorff distance is mathematically proved to be a true metric. Therefore, the proposed distance can precisely measure the similarity of DNA sequences. The phylogenetic analyses of DNA sequences by the Yau-Hausdorff distance show the accuracy and stability of our approach in similarity comparison of DNA or protein sequences. This study demonstrates that the Yau-Hausdorff distance is a natural metric for DNA and protein sequences with a high level of stability. The approach can also be applied to similarity analysis of protein sequences by graphical representations, as well as to general two-dimensional shape matching.

19.
Mishra P  Pandey PN 《Bioinformation》2011,6(10):372-374
The number of amino acid sequences is increasing very rapidly in protein databases such as Swiss-Prot, Uniprot, PIR and others, but the structures of only some amino acid sequences are found in the Protein Data Bank. Thus, an important problem in genomics is automatically clustering homologous protein sequences when only sequence information is available. Here, we use graph theoretic techniques for clustering amino acid sequences. A similarity graph is defined and clusters in that graph correspond to connected subgraphs. Cluster analysis seeks grouping of amino acid sequences into subsets based on distance or similarity score between pairs of sequences. Our goal is to find disjoint subsets, called clusters, such that two criteria are satisfied: homogeneity: sequences in the same cluster are highly similar to each other; and separation: sequences in different clusters have low similarity to each other. We tested our method on several subsets of the SCOP (Structural Classification of Proteins) database, a gold standard for protein structure classification. The results show that for a given set of proteins the number of clusters we obtained is close to the number of superfamilies in that set; there are fewer singletons; and the method correctly groups most remote homologs.
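
The idea of clusters as connected subgraphs of a similarity graph can be sketched as follows: link two sequences when their similarity reaches a threshold, then take connected components with a union-find structure. The threshold, the score dictionary, and the sequence names are hypothetical, and the paper's actual graph construction may differ.

```python
def cluster_by_similarity(names, similarity, threshold):
    """Group names into connected components of the thresholded similarity graph."""
    parent = {name: name for name in names}

    def find(x):
        # Follow parent links to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Add an edge, and merge components, whenever a pair is similar enough.
    for (a, b), score in similarity.items():
        if score >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


if __name__ == "__main__":
    proteins = ["p1", "p2", "p3", "p4", "p5"]
    scores = {("p1", "p2"): 0.9, ("p2", "p3"): 0.8,
              ("p3", "p4"): 0.2, ("p4", "p5"): 0.7}
    print(cluster_by_similarity(proteins, scores, threshold=0.5))
```

Raising the threshold improves homogeneity within clusters at the cost of more singletons, which is the trade-off between the two criteria named in the abstract.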

20.
Genetic polymorphism of 83 isolates of E. coli, derived from 4 species of artiodactyla animals living in relatively close contact on the grounds of a theme park ZOO Safarii Swierkocin (Poland), was determined using the rep-PCR fingerprinting method, which utilizes oligonucleotide primers matching interspersed repetitive DNA sequences in a PCR reaction to yield DNA fingerprints of individual bacterial isolates based on repetitive extragenic palindrome (REP) primers. The fingerprint patterns demonstrated the essential polymorphism of distribution of REP sequences in genomes of the examined isolates. Statistical analysis of the fingerprints with the arithmetic averages clustering algorithm (UPGMA) and the Jaccard similarity coefficient differentiated the E. coli isolates into three similarity groups containing various numbers of isolates. The groups comprised isolates derived from two, three and four species of the source animals. The isolates derived from each source segregated in the dendrogram in a different way, both within the similarity groups and among them, indicating an individual repertoire of E. coli in the examined species of animals. The similarity relations among E. coli derived from the same source, illustrated in a dendrogram with a number of subclusters of low mutual similarity (≤ 20%), indicated an essential interstrain differentiation in terms of the distribution of REP sequences. Our results confirmed the hypothesis of the oligoclonal character of populations obtained from particular sources. The rep-PCR fingerprinting method with REP primers is simple and highly differentiating and can be recommended for use in explorations of large groups of animals and monitoring the variability of strains.
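
The Jaccard similarity coefficient used on the fingerprints can be sketched by treating each isolate as the set of band positions present in its pattern. The band sizes and isolate names below are made up for illustration, and the UPGMA clustering step is not shown.

```python
from itertools import combinations


def jaccard(bands_a, bands_b) -> float:
    """Jaccard coefficient: shared bands divided by bands present in either pattern."""
    a, b = set(bands_a), set(bands_b)
    if not a and not b:
        return 1.0  # convention chosen here for two empty patterns
    return len(a & b) / len(a | b)


def pairwise_similarity(fingerprints: dict) -> dict:
    """Jaccard coefficient for every unordered pair of isolates."""
    return {
        (x, y): jaccard(fingerprints[x], fingerprints[y])
        for x, y in combinations(sorted(fingerprints), 2)
    }


if __name__ == "__main__":
    # Hypothetical band sizes (base pairs) for three isolates.
    patterns = {
        "isolate_1": [300, 450, 800, 1200],
        "isolate_2": [300, 450, 800, 1500],
        "isolate_3": [250, 600, 1200],
    }
    for pair, value in pairwise_similarity(patterns).items():
        print(pair, round(value, 2))
```

A matrix of such coefficients, converted to distances as one minus the similarity, is the usual input to a hierarchical method such as UPGMA.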
