首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
Digital signal processing methods for biosequence comparison.   总被引:1,自引:1,他引:0       下载免费PDF全文
A method is discussed for DNA or protein sequence comparison using a finite field fast Fourier transform, a digital signal processing technique; and statistical methods are discussed for analyzing the output of this algorithm. This method compares two sequences of length N in computing time proportional to N log N compared to N2 for methods currently used. This method makes it feasible to compare very long sequences. An example is given to show that the method correctly identifies sites of known homology.  相似文献   

2.
Comparing DNA or protein sequences plays an important role in the functional analysis of genomes. Despite many methods available for sequences comparison, few methods retain the information content of sequences. We propose a new approach, the Yau-Hausdorff method, which considers all translations and rotations when seeking the best match of graphical curves of DNA or protein sequences. The complexity of this method is lower than that of any other two dimensional minimum Hausdorff algorithm. The Yau-Hausdorff method can be used for measuring the similarity of DNA sequences based on two important tools: the Yau-Hausdorff distance and graphical representation of DNA sequences. The graphical representations of DNA sequences conserve all sequence information and the Yau-Hausdorff distance is mathematically proved as a true metric. Therefore, the proposed distance can preciously measure the similarity of DNA sequences. The phylogenetic analyses of DNA sequences by the Yau-Hausdorff distance show the accuracy and stability of our approach in similarity comparison of DNA or protein sequences. This study demonstrates that Yau-Hausdorff distance is a natural metric for DNA and protein sequences with high level of stability. The approach can be also applied to similarity analysis of protein sequences by graphic representations, as well as general two dimensional shape matching.  相似文献   

3.
《Genomics》2019,111(6):1574-1582
Given the vast amount of genomic data, alignment-free sequence comparison methods are required due to their low computational complexity. k-mer based methods can improve comparison accuracy by extracting an effective feature of the genome sequences. The aim of this paper is to extract k-mer intervals of a sequence as a feature of a genome for high comparison accuracy. In the proposed method, we calculated the distance between genome sequences by comparing the distribution of k-mer intervals. Then, we identified the classification results using phylogenetic trees. We used viral, mitochondrial (MT), microbial and mammalian genome sequences to perform classification for various genome sets. We confirmed that the proposed method provides a better classification result than other k-mer based methods. Furthermore, the proposed method could efficiently be applied to long sequences such as human and mouse genomes.  相似文献   

4.
A survey of multiple sequence comparison methods   总被引:7,自引:0,他引:7  
Multiple sequence comparison refers to the search for similarity in three or more sequences. This article presents a survey of the exhaustive (optimal) and heuristic (possibly sub-optimal) methods developed for the comparison of multiple macromolecular sequences. Emphasis is given to the different approaches of the heuristic methods. Four distance measures derived from information engineering and genetic studies are introduced for the comparison between two alignments of sequences. The use ofentropy, which plays a central role in information theory as measures of information, choice and uncertainty, is proposed as a simple measure for the evaluation of the optimality of an alignment in the absence of anya priori knowledge about the structures of the sequences being compared. This article also gives two examples of comparison between alternative alignments of the same set of 5SRNAs as obtained by several different heuristic methods.  相似文献   

5.
Protein structure alignment   总被引:22,自引:0,他引:22  
A new method of comparing protein structures is described, based on distance plot analysis. It is relatively insensitive to insertions and deletions in sequence and is tolerant of the displacement of equivalent substructures between the two molecules being compared. When presented with the co-ordinate sets of two structures, the method will produce automatically an alignment of their sequences based on structural criteria. The method uses the dynamic programming optimization technique, which is widely used in the comparison of protein sequences and thus unifies the techniques of protein structure and sequence comparison. Typical structure comparison problems were examined and the results of the new method compared to the published results obtained using conventional methods. In most examples, the new method produced a result that was equivalent, and in some cases superior, to those reported in the literature.  相似文献   

6.
Many proteins need recognition of specific DNA sequences for functioning. The number of recognition sites and their distribution along the DNA might be of biological importance. For example, the number of restriction sites is often reduced in prokaryotic and phage genomes to decrease the probability of DNA cleavage by restriction endonucleases. We call a sequence an exceptional one if its frequency in a genome significantly differs from one predicted by some mathematical model. An exceptional sequence could be either under- or over-represented, depending on its frequency in comparison with the predicted one. Exceptional sequences could be considered biologically meaningful, for example, as targets of DNA-binding proteins or as parts of abundant repetitive elements. Several methods to predict frequency of a short sequence in a genome, based on actual frequencies of certain its subsequences, are used. The most popular are methods based on Markov chain models. But any rigorous comparison of the methods has not previously been performed. We compared three methods for the prediction of short sequence frequencies: the maximum-order Markov chain model-based method, the method that uses geometric mean of extended Markovian estimates, and the method that utilizes frequencies of all subsequences including discontiguous ones. We applied them to restriction sites in complete genomes of 2500 prokaryotic species and demonstrated that the results depend greatly on the method used: lists of 5% of the most under-represented sites differed by up to 50%. The method designed by Burge and coauthors in 1992, which utilizes all subsequences of the sequence, showed a higher precision than the other two methods both on prokaryotic genomes and randomly generated sequences after computational imitation of selective pressure. We propose this method as the first choice for detection of exceptional sequences in prokaryotic genomes.  相似文献   

7.
MOTIVATION: Dot-matrix plots are widely used for similarity analysis of biological sequences. Many algorithms and computer software tools have been developed for this purpose. Though some of these tools have been reported to handle sequences of a few 100 kb, analysis of genome sequences with a length of >10 Mb on a microcomputer is still impractical due to long execution time and computer memory requirement. RESULTS: Two dot-matrix comparison methods have been developed for analysis of large sequences. The methods initially locate similarity regions between two sequences using a fast word search algorithm, followed with an explicit comparison on these regions. Since the initial screening removes most of random matches, the computing time is substantially reduced. The methods produce high quality dot-matrix plots with low background noise. Space requirements are linear, so the algorithms can be used for comparison of genome size sequences. Computing speed may be affected by highly repetitive sequence structures of eukaryote genomes. A dot-matrix plot of Yeast genome (12 Mb) with both strands was generated in 80 s with a 1 GHz personal computer.  相似文献   

8.
MOTIVATION: Studies of efficient and sensitive sequence comparison methods are driven by a need to find homologous regions of weak similarity between large genomes. RESULTS: We describe an improved method for finding similar regions between two sets of DNA sequences. The new method generalizes existing methods by locating word matches between sequences under two or more word models and extending word matches into high-scoring segment pairs (HSPs). The method is implemented as a computer program named DDS2. Experimental results show that DDS2 can find more HSPs by using several word models than by using one word model. AVAILABILITY: The DDS2 program is freely available for academic use in binary code form at http://bioinformatics.iastate.edu/aat/align/align.html and in source code form from the corresponding author.  相似文献   

9.
A probabilistic measure for alignment-free sequence comparison   总被引:3,自引:0,他引:3  
MOTIVATION: Alignment-free sequence comparison methods are still in the early stages of development compared to those of alignment-based sequence analysis. In this paper, we introduce a probabilistic measure of similarity between two biological sequences without alignment. The method is based on the concept of comparing the similarity/dissimilarity between two constructed Markov models. RESULTS: The method was tested against six DNA sequences, which are the thrA, thrB and thrC genes of the threonine operons from Escherichia coli K-12 and from Shigella flexneri; and one random sequence having the same base composition as thrA from E.coli. These results were compared with those obtained from CLUSTAL W algorithm (alignment-based) and the chaos game representation (alignment-free). The method was further tested against a more complex set of 40 DNA sequences and compared with other existing sequence similarity measures (alignment-free). AVAILABILITY: All datasets and computer codes written in MATLAB are available upon request from the first author.  相似文献   

10.
A new method to compare two (or several) symbol sequences is developed. The method is based on the comparison of the frequencies of the small fragments of the compared sequences; it requires neither string editing, nor other transformations of the compared objects. The comparison is executed through a calculation of the specific entropy of a frequency dictionary against the special dictionary called the hybrid one; this latter is the statistical ancestor of the group of sequences under comparison. Some applications of the developed method in the fields of genetics and bioinformatics are discussed.  相似文献   

11.
We propose two approximate methods (one based on parsimony and one on pairwise sequence comparison) for estimating the pattern of nucleotide substitution and a parsimony-based method for estimating the gamma parameter for variable substitution rates among sites. The matrix of substitution rates that represents the substitution pattern can be recovered through its relationship with the observable matrix of site pattern frequences in pairwise sequence comparisons. In the parsimony approach, the ancestral sequences reconstructed by the parsimony algorithm were used, and the two sequences compared are those at the ends of a branch in the phylogenetic tree. The method for estimating the gamma parameter was based on a reinterpretation of the numbers of changes at sites inferred by parsimony. Three data sets were analyzed to examine the utility of the approximate methods compared with the more reliable likelihood methods. The new methods for estimating the substitution pattern were found to produce estimates quite similar to those obtained from the likelihood analyses. The new method for estimating the gamma parameter was effective in reducing the bias in conventional parsimony estimates, although it also overestimated the parameter. The approximate methods are computationally very fast and appear useful for analyzing large data sets, for which use of the likelihood method requires excessive computation.   相似文献   

12.
The increasing throughput of sequencing raises growing needs for methods of sequence analysis and comparison on a genomic scale, notably, in connection with phylogenetic tree reconstruction. Such needs are hardly fulfilled by the more traditional measures of sequence similarity and distance, like string edit and gene rearrangement, due to a mixture of epistemological and computational problems. Alternative measures, based on the subword composition of sequences, have emerged in recent years and proved to be both fast and effective in a variety of tested cases. The common denominator of such measures is an underlying information theoretic notion of relative compressibility. Their viability depends critically on computational cost. The present paper describes as a paradigm the extension and efficient implementation of one of the methods in this class. The method is based on the comparison of the frequencies of all subwords in the two input sequences, where frequencies are suitably adjusted to take into account the statistical background.  相似文献   

13.
Pairwise sequence alignments aim to decide whether two sequences are related and, if so, to exhibit their related domains. Recent works have pointed out that a significant number of true homologous sequences are missed when using classical comparison algorithms. This is the case when two homologous sequences share several little blocks of homology, too small to lead to a significant score. On the other hand, classical alignment algorithms, when detecting homologies, may fail to recognize all the significant biological signals. The aim of the paper is to give a solution to these two problems. We propose a new scoring method which tends to increase the score of an alignment when "blocks" are detected. This so-called Block-Scoring algorithm, which makes use of dynamic programming, is worth being used as a complementary tool to classical exact alignments methods. We validate our approach by applying it on a large set of biological data. Finally, we give a limit theorem for the score statistics of the algorithm.  相似文献   

14.
Slippage is an important sequencing problem that can occur in EST projects. However, very few studies have addressed this. We propose three new methods to detect slippage artifacts: arithmetic mean method, geometric mean method, and echo coverage method. Each method is simple and has two different strategies for processing sequences: suffix and subsequence. Using the 291,689 EST sequences produced in the SUCEST project, we performed comparative tests between our proposed methods and the SUCEST method. The subsequence strategy is better than the suffix strategy, because it is not anchored at the end of the sequence, so it is more flexible to find slippage at the beginning of the EST. In a comparison with the SUCEST method, the advantage of our methods is that they do not discard the majority of the sequences marked as slippage, but instead only remove the slipped artifact from the sequence. Based on our tests the echo coverage method with subsequence strategy shows the best compromise between slippage detection and ease of calibration.  相似文献   

15.
Word matches are widely used to compare genomic sequences. Complete genome alignment methods often rely on the use of matches as anchors for building their alignments, and various alignment-free approaches that characterize similarities between large sequences are based on word matches. Among matches that are retrieved from the comparison of two genomic sequences, a part of them may correspond to spurious matches (SMs), which are matches obtained by chance rather than by homologous relationships. The number of SMs depends on the minimal match length (?) that has to be set in the algorithm used to retrieve them. Indeed, if ? is too small, a lot of matches are recovered but most of them are SMs. Conversely, if ? is too large, fewer matches are retrieved but many smaller significant matches are certainly ignored. To date, the choice of ? mostly depends on empirical threshold values rather than robust statistical methods. To overcome this problem, we propose a statistical approach based on the use of a mixture model of geometric distributions to characterize the distribution of the length of matches obtained from the comparison of two genomic sequences.  相似文献   

16.
Among the fundamental problems in molecular evolution and in the analysis of homologous sequences are alignment, phylogeny reconstruction, and the reconstruction of ancestral sequences. This paper presents a fast, combined solution to these problems. The new algorithm gives an approximation to the minimal history in terms of a distance function on sequences. The distance function on sequences is a minimal weighted path length constructed from substitutions and insertions-deletions of segments of any length. Substitutions are weighted with an arbitrary metric on the set of nucleotides or amino acids, and indels are weighted with a gap penalty function of the form gk = a + (bxk), where k is the length of the indel and a and b are two positive numbers. A novel feature is the introduction of the concept of sequence graphs and a generalization of the traditional dynamic sequence comparison algorithm to the comparison of sequence graphs. Sequence graphs ease several computational problems. They are used to represent large sets of sequences that can then be compared simultaneously. Furthermore, they allow the handling of multiple, equally good, alignments, where previous methods were forced to make arbitrary choices. A program written in C implemented this method; it was tested first on 22 5S RNA sequences.   相似文献   

17.
Gilbert PB  Wu C  Jobes DV 《Biometrics》2008,64(1):198-207
Summary .   Consider a placebo-controlled preventive HIV vaccine efficacy trial. An HIV amino acid sequence is measured from each volunteer who acquires HIV, and these sequences are aligned together with the reference HIV sequence represented in the vaccine. We develop genome scanning methods to identify positions at which the amino acids in infected vaccine recipient sequences either (A) are more divergent from the reference amino acid than the amino acids in infected placebo recipient sequences or (B) have a different frequency distribution than the placebo sequences, irrespective of a reference amino acid. We consider t -test-type statistics for problem A and Euclidean, Mahalanobis, and Kullback–Leibler-type statistics for problem B. The test statistics incorporate weights to reflect biological information contained in different amino acid positions and mismatches. Position-specific p -values are obtained by approximating the null distribution of the statistics either by a permutation procedure or by a nonparametric estimation. A permutation method is used to estimate a cut-off p -value to control the per comparison error rate at a prespecified level. The methods are examined in simulations and are applied to two HIV examples. The methods for problem B address the general problem of comparing discrete frequency distributions between groups in a high-dimensional data setting.  相似文献   

18.
A new method for comparing and aligning protein sequences is described. This method, hydrophobic cluster analysis (HCA), relies upon a two-dimensional (2D) representation of the sequences. Hydrophobic clusters are determined in this 2D pattern and then used for the sequence comparisons. The method does not require powerful computer resources and can deal with distantly related proteins, even if no 3D data are available. This is illustrated in the present report by a comparison of human haemoglobin with leghaemoglobin, a comparison of the two domains of liver rhodanese (thiosulphate sulphurtransferase) and a comparison of plastocyanin and azurin.  相似文献   

19.
MOTIVATION: The development of powerful automatic methods for the comparison of protein sequences has become increasingly important. Profile-to-profile comparisons allow for the use of broader information about protein families, resulting in more sensitive and accurate comparisons of distantly related sequences. A key part in the comparison of two profiles is the method for the calculation of scores for the position matches. A number of methods based on various theoretical considerations have been proposed. We implemented several previously reported scoring functions as well as our own functions, and compared them on the basis of their ability to produce accurate short ungapped alignments of a given length. RESULTS: Our results suggest that the family of the probabilistic methods (log-odds based methods and prof_sim) may be the more appropriate choice for the generation of initial 'seeds' as the first step to produce local profile-profile alignments. The most effective scoring systems were the closely related modifications of functions previously implemented in the COMPASS and Picasso methods.  相似文献   

20.
Bilateral similarity function is designed for analyzing the similarities of biological sequences such as DNA, RNA secondary structure or protein in this paper. The defined function can perform comprehensive comparison between sequences remarkably well, both in terms of the Hamming distance of two compared sequences and the corresponding location difference. Compared with the existing methods for similarity analysis, the examination of similarities/dissimilarities illustrates that the proposed method with the computational complexity of O(N) is effective for these three kinds of biological sequences, and bears the universality for them.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号