Similar articles
1.
MOTIVATION: Comparison of multimegabase genomic DNA sequences is a popular technique for finding and annotating conserved genome features. Performing such comparisons entails finding many short local alignments between sequences up to tens of megabases in length. To process such long sequences efficiently, existing algorithms find alignments by expanding around short runs of matching bases with no substitutions or other differences. Unfortunately, exact matches that are short enough to occur often in significant alignments also occur frequently by chance in the background sequence. Thus, these algorithms must trade off between efficiency and sensitivity to features without long exact matches. RESULTS: We introduce a new algorithm, LSH-ALL-PAIRS, to find ungapped local alignments in genomic sequence with up to a specified fraction of substitutions. The length and substitution rate of these alignments can be chosen so that they appear frequently in significant similarities yet still remain rare in the background sequence. The algorithm finds ungapped alignments efficiently using a randomized search technique, locality-sensitive hashing. We have found LSH-ALL-PAIRS to be both efficient and sensitive for finding local similarities with as little as 63% identity in mammalian genomic sequences up to tens of megabases in length.
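The locality-sensitive hashing idea in this abstract can be illustrated with a minimal Python sketch. It is not the LSH-ALL-PAIRS implementation: the function names `lsh_candidate_pairs` and `similar_pairs` and the parameters `length`, `k`, `rounds`, and `max_subs` are assumptions made for illustration. In each round, `k` positions are sampled inside a fixed-length window and all windows are bucketed by the bases at those positions, so windows that differ by only a few substitutions are likely to collide in at least one round.

```python
import random
from collections import defaultdict


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def lsh_candidate_pairs(seq, length, k, rounds, seed=0):
    """Return pairs (i, j), i < j, of window starts that share a bucket
    in at least one round. Hypothetical helper, not the published code."""
    rng = random.Random(seed)
    n_windows = len(seq) - length + 1
    pairs = set()
    for _ in range(rounds):
        # Sample k of the window's positions; the hash key is the bases there.
        positions = sorted(rng.sample(range(length), k))
        buckets = defaultdict(list)
        for start in range(n_windows):
            key = "".join(seq[start + p] for p in positions)
            buckets[key].append(start)
        # Every pair of windows in the same bucket is a candidate.
        for starts in buckets.values():
            for a in range(len(starts)):
                for b in range(a + 1, len(starts)):
                    pairs.add((starts[a], starts[b]))
    return pairs


def similar_pairs(seq, length, k, rounds, max_subs, seed=0):
    """Keep only candidates whose windows differ by at most max_subs bases."""
    return sorted(
        (i, j)
        for i, j in lsh_candidate_pairs(seq, length, k, rounds, seed)
        if hamming(seq[i:i + length], seq[j:j + length]) <= max_subs
    )
```

Because collisions are only candidates, the sketch verifies each one with an explicit Hamming-distance check. Increasing `rounds` raises sensitivity at the cost of time, while increasing `k` reduces chance collisions in the background sequence.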

2.
Word matches are widely used to compare genomic sequences. Complete genome alignment methods often rely on the use of matches as anchors for building their alignments, and various alignment-free approaches that characterize similarities between large sequences are based on word matches. Among the matches retrieved from the comparison of two genomic sequences, some may be spurious matches (SMs), which are matches obtained by chance rather than by homologous relationships. The number of SMs depends on the minimal match length (ℓ) that has to be set in the algorithm used to retrieve them. Indeed, if ℓ is too small, a lot of matches are recovered but most of them are SMs. Conversely, if ℓ is too large, fewer matches are retrieved but many smaller significant matches are certainly ignored. To date, the choice of ℓ mostly depends on empirical threshold values rather than robust statistical methods. To overcome this problem, we propose a statistical approach based on the use of a mixture model of geometric distributions to characterize the distribution of the length of matches obtained from the comparison of two genomic sequences.

3.
MOTIVATION: Studies of efficient and sensitive sequence comparison methods are driven by a need to find homologous regions of weak similarity between large genomes. RESULTS: We describe an improved method for finding similar regions between two sets of DNA sequences. The new method generalizes existing methods by locating word matches between sequences under two or more word models and extending word matches into high-scoring segment pairs (HSPs). The method is implemented as a computer program named DDS2. Experimental results show that DDS2 can find more HSPs by using several word models than by using one word model. AVAILABILITY: The DDS2 program is freely available for academic use in binary code form at http://bioinformatics.iastate.edu/aat/align/align.html and in source code form from the corresponding author.

4.
In this article, we propose a new method for computing rare maximal exact matches between multiple sequences. A rare match between k sequences S(1), ... , S(k) is a string that occurs at most t(i) times in the sequence S(i), where the t(i) > 0 are user-defined thresholds. First, the suffix tree of one of the sequences (the reference sequence) is built, and then the other sequences are matched separately against this suffix tree. Second, the resulting pairwise exact matches are combined into multiple exact matches. A clever implementation of this method yields a very fast and space-efficient program. This program can be applied in several comparative genomics tasks, such as the identification of synteny blocks between whole genomes.
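The definition of a rare maximal exact match can be made concrete with a small Python sketch for the pairwise case (k = 2). This is a brute-force illustration under stated assumptions, not the suffix-tree method of the abstract: the function names and the `min_len` parameter are hypothetical, and the quadratic scan stands in for the suffix-tree matching step.

```python
def count_occurrences(text, pattern):
    """Count occurrences of pattern in text, including overlapping ones."""
    count, start = 0, text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def maximal_exact_matches(s1, s2, min_len):
    """Yield (i, j, length) for exact matches s1[i:i+length] == s2[j:j+length]
    that cannot be extended to the left or to the right."""
    for i in range(len(s1)):
        for j in range(len(s2)):
            # Skip starts that could be extended one base to the left.
            if i > 0 and j > 0 and s1[i - 1] == s2[j - 1]:
                continue
            length = 0
            while (i + length < len(s1) and j + length < len(s2)
                   and s1[i + length] == s2[j + length]):
                length += 1
            if length >= min_len:
                yield (i, j, length)


def rare_mems(s1, s2, min_len, t1, t2):
    """Keep maximal exact matches whose string occurs at most t1 times
    in s1 and at most t2 times in s2 (the 'rare' condition)."""
    result = []
    for i, j, length in maximal_exact_matches(s1, s2, min_len):
        word = s1[i:i + length]
        if count_occurrences(s1, word) <= t1 and count_occurrences(s2, word) <= t2:
            result.append((i, j, word))
    return result
```

For example, `rare_mems("TTACGTAA", "GGACGTCC", 4, 1, 1)` returns `[(2, 2, "ACGT")]`, whereas a reference in which `ACGT` occurs twice is filtered out when `t1` is 1.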

5.
From an evolutionary point of view, the complementarity-determining regions of antibodies are distinct from other proteins including the framework regions of antibodies. A search for identical nucleotide sequences of eighty-four 15 consecutive bp in the complementarity-determining regions of human antibody heavy chains with other known sequences yielded four matches: two sequential 15-bp matches, or one 16-bp match, with the coding region of a sea-urchin testis histone H2b-2, one 15-bp match with the promoter region of a cauliflower mosaic virus inclusion body protein, and a 15-bp match with an intron between exons 1 and 2 of human factor IX. As a control, an identical search of eighty-four 15 consecutive bp in the framework regions of human antibody heavy chains yielded no matches with other sequences except those from other antibody framework regions. Since the currently available nucleotide sequence database used in the search consisted of about 1 × 10^7 bp, finding such matches in the complementarity-determining regions might not be random.

6.
Karp PD, Paley S, Zhu J. Bioinformatics (Oxford, England), 2001, 17(6):526-32; discussion 533-4
PROBLEM STATEMENT: We have studied the relationships among SWISS-PROT, TrEMBL, and GenBank with two goals. First is to determine whether users can reliably identify those proteins in SWISS-PROT whose functions were determined experimentally, as opposed to proteins whose functions were predicted computationally. If this information was present in reasonable quantities, it would allow researchers to decrease the propagation of incorrect function predictions during sequence annotation, and to assemble training sets for developing the next generation of sequence-analysis algorithms. Second is to assess the consistency between translated GenBank sequences and sequences in SWISS-PROT and TrEMBL. RESULTS: (1) Contrary to claims by the SWISS-PROT authors, we conclude that SWISS-PROT does not identify a significant number of experimentally characterized proteins. (2) SWISS-PROT is more incomplete than we expected in that version 38.0 from July 1999 lacks many proteins from the full genomes of important organisms that were sequenced years earlier. (3) Even if we combine SWISS-PROT and TrEMBL, some sequences from the full genomes are missing from the combined dataset. (4) In many cases, translated GenBank genes do not exactly match the corresponding SWISS-PROT sequences, for reasons that include missing or removed methionines, differing translation start positions, individual amino-acid differences, and inclusion of sequence data from multiple sequencing projects. For example, results show that for Escherichia coli, 80.6% of the proteins in the GenBank entry for the complete genome have identical sequence matches with SWISS-PROT/TrEMBL sequences, 13.4% have exact substring matches, and matches for 4.1% can be found using BLAST search; the remaining 2.0% of E.coli protein sequences (most of which are ORFs) have no clear matches to SWISS-PROT/TrEMBL. 
Although many of these differences can be explained by the complexity of the database, and by the curation processes used to create it, the scale of the differences is notable.

7.
Vallat BK, Pillardy J, Elber R. Proteins, 2008, 72(3):910-928
The first step in homology modeling is to identify a template protein for the target sequence. The template structure is used in later phases of the calculation to construct an atomically detailed model for the target. We have built from the Protein Data Bank (PDB) a large-scale learning set that includes tens of millions of pair matches that can be either a true template or a false one. Discriminatory learning (learning from positive and negative examples) is used to train a decision tree. Each branch of the tree is a mathematical programming model. The decision tree is tested on an independent set from PDB entries and on the sequences of CASP7. It provides significant enrichment of true templates (between 50 and 100%) when compared to PSI-BLAST. The model is further verified by building atomically detailed structures for each of the tentative true templates with modeller. The probability that a true match does not yield an acceptable structural model (within 6 Å RMSD from the native structure) decays linearly as a function of the TM structural-alignment score.

8.
An efficient method for matching nucleic acid sequences.
A method of computing the fraction of matches between two nucleic acid sequences at all possible alignments is described. It makes use of the Fast Fourier Transform. It should be particularly efficient for very long sequences, achieving its result in a number of operations proportional to n ln n, where n is the length of the longer of the two sequences. Though the objective achieved is of limited interest, this method will complement algorithms for efficiently finding the longest matching parts of two sequences, and is faster than existing algorithms for finding matches allowing deletions and insertions. A variety of economies can be achieved by this Fast Fourier Transform technique in matching multiple sequences, looking for complementarity rather than identity, and matching the same sequences both in forward and reversed orientations.
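A small Python sketch can show how match counts at every alignment follow from Fourier transforms. It is an illustration under assumptions rather than the authors' program: the names `fft` and `match_counts`, the four-letter alphabet, and the convention that shift `d` pairs `a[i]` with `b[i + d]` are all choices made here. For each base, an indicator vector of one sequence is correlated with that of the other by multiplying transforms, and the per-base results are summed.

```python
import cmath


def fft(values, invert=False):
    """Recursive radix-2 FFT; len(values) must be a power of two.
    With invert=True the result is the unnormalized inverse transform."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], invert)
    odd = fft(values[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def match_counts(a, b, alphabet="ACGT"):
    """Map each shift d to the number of positions i with a[i] == b[i + d]."""
    size = 1
    while size < len(a) + len(b) - 1:
        size *= 2
    totals = [0.0] * size
    for base in alphabet:
        # Reversing a turns the convolution into a cross-correlation.
        ind_a = [1.0 if ch == base else 0.0 for ch in reversed(a)]
        ind_b = [1.0 if ch == base else 0.0 for ch in b]
        ind_a += [0.0] * (size - len(ind_a))
        ind_b += [0.0] * (size - len(ind_b))
        product = [x * y for x, y in zip(fft(ind_a), fft(ind_b))]
        conv = fft(product, invert=True)
        for k in range(size):
            totals[k] += conv[k].real / size
    offset = len(a) - 1  # index k of the convolution corresponds to shift k - offset
    return {k - offset: round(totals[k]) for k in range(len(a) + len(b) - 1)}
```

Dividing each count by the length of the overlap at that shift gives the fraction of matches discussed in the abstract. Replacing the indicator of one sequence with that of its complement would count complementary pairs instead of identities.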

9.
Computational analysis of small RNA cloning data
Cloning and sequencing is the method of choice for small regulatory RNA identification. Using deep sequencing technologies one can now obtain up to a billion nucleotides--and tens of millions of small RNAs--from a single library. Careful computational analyses of such libraries enabled the discovery of miRNAs, rasiRNAs, piRNAs, and 21U RNAs. Given the large number of sequences that can be obtained from each individual sample, deep sequencing may soon become an alternative to oligonucleotide microarray technology for mRNA expression profiling. In this report we present the methods that we developed for the annotation and expression profiling of small RNAs obtained through large-scale sequencing. These include a fast algorithm for finding nearly perfect matches of small RNAs in sequence databases, a web-accessible software system for the annotation of small RNA libraries, and a Bayesian method for comparing small RNA expression across samples.

10.
11.
When two strings of symbols are aligned it is important to know whether the observed number of matches is better than that expected between two independent sequences with the same frequency of symbols. When strings are of different lengths, nulls need to be inserted in order to align the sequences. One approach is to use simple approximations of sampling with replacement. We describe an algorithm for exactly determining the frequencies of given numbers of matches, sampling without replacement. This does not lead to a simple closed-form expression. However, we show examples where sampling with, or without, replacement gives very similar results and the simple approach may be adequate for all but the smallest cases.

12.
Algorithms for identifying local molecular sequence features
Efficient algorithms are described for identifying local molecular sequence features including repeats, dyad symmetry pairings and aligned matches between sequences, while allowing for errors. Specific applications are given to the genomic sequences of the Epstein-Barr virus, Varicella-Zoster virus and the bacteriophages λ and T7. Received on October 6, 1987; accepted on December 13, 1987.

13.
We used both cultivation and direct recovery of bacterial 16S rRNA gene (rDNA) sequences to investigate the structure of the bacterial community in anoxic rice paddy soil. Isolation and phenotypic characterization of 19 saccharolytic and cellulolytic strains are described in the accompanying paper (K.-J. Chin, D. Hahn, U. Hengstmann, W. Liesack, and P. H. Janssen, Appl. Environ. Microbiol. 65:5042-5049, 1999). Here we describe the phylogenetic positions of these strains in relation to 57 environmental 16S rDNA clone sequences. Close matches between the two data sets were obtained for isolates from the culturable populations determined by the most-probable-number counting method to be large (3 × 10^7 to 2.5 × 10^8 cells per g [dry weight] of soil). This included matches with 16S rDNA similarity values greater than 98% within distinct lineages of the division Verrucomicrobia (strain PB90-1) and the Cytophaga-Flavobacterium-Bacteroides group (strains XB45 and PB90-2), as well as matches with similarity values greater than 95% within distinct lines of descent of clostridial cluster XIVa (strain XB90) and the family Bacillaceae (strain SB45). In addition, close matches with similarity values greater than 95% were obtained for cloned 16S rDNA sequences and bacteria (strains DR1/8 and RPec1) isolated from the same type of rice paddy soil during previous investigations. The correspondence between culture methods and direct recovery of environmental 16S rDNA suggests that the isolates obtained are representative geno- and phenotypes of predominant bacterial groups which account for 5 to 52% of the total cells in the anoxic rice paddy soil. Furthermore, our findings clearly indicate that a dual approach results in a more objective view of the structural and functional composition of a soil bacterial community than either cultivation or direct recovery of 16S rDNA sequences alone.

14.
MOTIVATION: In proteomics, reverse database searching is used to control the false match frequency for tandem mass spectrum/peptide sequence matches, but reversal creates sequences devoid of patterns that usually challenge database-search software. RESULTS: We designed an unsupervised pattern recognition algorithm for detecting patterns with various lengths from large sequence datasets. The patterns found in a protein sequence database were used to create decoy databases using a Monte Carlo sampling algorithm. Searching these decoy databases led to the prediction of false positive rates for spectrum/peptide sequence matches. We show examples where this method, independent of instrumentation, database-search software and samples, provides better estimation of false positive identification rates than a prevailing reverse database searching method. The pattern detection algorithm can also be used to analyze sequences for other purposes in biology or cryptology. AVAILABILITY: On request from the authors. SUPPLEMENTARY INFORMATION: http://bioinformatics.psb.ugent.be/.

15.
MOTIVATION: Dot-matrix plots are widely used for similarity analysis of biological sequences. Many algorithms and computer software tools have been developed for this purpose. Though some of these tools have been reported to handle sequences of a few 100 kb, analysis of genome sequences with a length of >10 Mb on a microcomputer is still impractical due to long execution time and computer memory requirement. RESULTS: Two dot-matrix comparison methods have been developed for analysis of large sequences. The methods initially locate similarity regions between two sequences using a fast word search algorithm, followed with an explicit comparison on these regions. Since the initial screening removes most of random matches, the computing time is substantially reduced. The methods produce high quality dot-matrix plots with low background noise. Space requirements are linear, so the algorithms can be used for comparison of genome size sequences. Computing speed may be affected by highly repetitive sequence structures of eukaryote genomes. A dot-matrix plot of Yeast genome (12 Mb) with both strands was generated in 80 s with a 1 GHz personal computer.

16.
We report the derivation of scores that are based on the analysis of residue-residue contact matrices from 443 3-dimensional structures aligned structurally as 96 families, which can be used to evaluate sequence-structure matches. Residue-residue contacts and the more than 3 × 10^6 amino acid substitutions that take place between pairs of these contacts at aligned positions within each family of structures have been tabulated and segregated according to the solvent accessibility of the residues involved. Contact maps within a family of structures are shown to be highly conserved (approximately 75%) even when the sequence identity is approaching 10%. In a comparison involving a globin structure and the search of a sequence databank (> 21,000 sequences), the contact probability scores are shown to provide a very powerful secondary screen for the top scoring sequence-structure matches, where between 69% and 84% of the unrelated matches are eliminated. The search of an aligned set of 2 globins against a sequence databank and the subsequent residue contact-based evaluation of matches locates all 618 globin sequences before the first non-globin match. From a single bacterial serine proteinase structure, the structural template approach coupled with residue-residue contact substitution data leads to the detection of the mammalian serine proteinase family among the top matches in the search of a sequence databank.

17.
Accelerated off-target search algorithm for siRNA
MOTIVATION: Designing highly effective short interfering RNA (siRNA) sequences with maximum target-specificity for mammalian RNA interference (RNAi) is one of the hottest topics in molecular biology. The relationship between siRNA sequences and RNAi activity has been studied extensively to establish rules for selecting highly effective sequences. However, there is a pressing need to compute siRNA sequences that minimize off-target silencing effects efficiently and to match any non-targeted sequences with mismatches. RESULTS: The enumeration of potential cross-hybridization candidates is non-trivial, because siRNA sequences are short, ca. 19 nt in length, and at least three mismatches with non-targets are required. With at least three mismatches, there are typically four or five contiguous matches, so that a BLAST search frequently overlooks off-target candidates. By contrast, existing accurate approaches are expensive to execute; thus we need to develop an accurate, efficient algorithm that uses seed hashing, the pigeonhole principle, and combinatorics to identify mismatch patterns. Tests show that our method can list potential cross-hybridization candidates for any siRNA sequence of selected human gene rapidly, outperforming traditional methods by orders of magnitude in terms of computational performance. AVAILABILITY: http://design.RNAi.jp CONTACT: yamada@cb.k.u-tokyo.ac.jp.
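The pigeonhole argument mentioned in this abstract can be sketched in Python as follows. This is a simplified illustration, not the published algorithm: `block_bounds` and `find_near_matches` are hypothetical names, and the sketch assumes substitutions only and a query at least `max_mismatches + 1` bases long. If a query is cut into `max_mismatches + 1` blocks, any window within `max_mismatches` substitutions of it must match at least one block exactly, so hashing block-aligned seeds finds every such window.

```python
from collections import defaultdict


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def block_bounds(length, parts):
    """Split range(length) into `parts` contiguous blocks of near-equal size."""
    base, extra = divmod(length, parts)
    bounds, start = [], 0
    for p in range(parts):
        end = start + base + (1 if p < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds


def find_near_matches(query, target, max_mismatches):
    """Return sorted (position, mismatches) for every window of target
    that differs from query by at most max_mismatches substitutions."""
    n = len(query)
    bounds = block_bounds(n, max_mismatches + 1)
    # Seed hashing: index each window of the target under each of its blocks.
    index = defaultdict(list)
    for pos in range(len(target) - n + 1):
        for b, (s, e) in enumerate(bounds):
            index[(b, target[pos + s:pos + e])].append(pos)
    # Pigeonhole: a true hit shares at least one block exactly with the query.
    candidates = set()
    for b, (s, e) in enumerate(bounds):
        candidates.update(index.get((b, query[s:e]), []))
    hits = []
    for pos in sorted(candidates):
        d = hamming(query, target[pos:pos + n])
        if d <= max_mismatches:
            hits.append((pos, d))
    return hits
```

For a 19-nt query and three allowed mismatches the blocks have lengths 5, 5, 5, and 4, which is consistent with the abstract's remark that only four or five contiguous matches may remain. A production tool would index the transcriptome once rather than per query.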

18.
Classifications of proteins into groups of related sequences are in some respects like a periodic table for biology, allowing us to understand the underlying molecular biology of any organism. Pfam is a large collection of protein domains and families. Its scientific goal is to provide a complete and accurate classification of protein families and domains. The next release of the database will contain over 10,000 entries, which leads us to reflect on how far we are from completing this work. Currently Pfam matches 72% of known protein sequences, but for proteins with known structure Pfam matches 95%, which we believe represents the likely upper bound. Based on our analysis a further 28,000 families would be required to achieve this level of coverage for the current sequence database. We also show that as more sequences are added to the sequence databases the fraction of sequences that Pfam matches is reduced, suggesting that continued addition of new families is essential to maintain its relevance.

19.
MABS-AUSTIN, 2013, 5(7):1197-1205
Recently it has become possible to query the great diversity of natural antibody repertoires using next-generation sequencing (NGS). These methods are capable of producing millions of sequences in a single experiment. Here we compare clinical-stage therapeutic antibodies to the ~1 billion sequences from 60 independent sequencing studies in the Observed Antibody Space database, which includes antibody sequences from NGS analysis of immunoglobulin gene repertoires. Of 242 post-Phase 1 antibodies, we found 16 with sequence identity matches of 95% or better for both heavy and light chains. There are also 54 perfect matches to therapeutic CDR-H3 regions in the NGS outputs, suggesting a nontrivial amount of convergence between naturally observed sequences and those developed artificially. This has potential implications for both the legal protection of commercial antibodies and the discovery of antibody therapeutics.

20.
MOTIVATION: RNA secondary structure analysis often requires searching for potential helices in large sequence data. RESULTS: We present a utility program GUUGle that efficiently locates potential helical regions under RNA base pairing rules, which include Watson-Crick as well as G-U pairs. It accepts a positive and a negative set of sequences, and determines all exact matches under RNA rules between positive and negative sequences that exceed a specified length. The GUUGle algorithm can also be adapted to use a precomputed suffix array of the positive sequence set. We show how this program can be effectively used as a filter preceding a more computationally expensive task such as miRNA target prediction. AVAILABILITY: GUUGle is available via the Bielefeld Bioinformatics Server at http://bibiserv.techfak.uni-bielefeld.de/guugle
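What "exact matches under RNA rules" means can be shown with a brute-force Python sketch. It does not reproduce GUUGle's own algorithm or its suffix-array variant: the names `PAIRS`, `can_pair`, and `helical_matches` are assumptions for illustration, and both sequences are assumed to be written 5' to 3' so that a helix pairs them in antiparallel orientation.

```python
# Watson-Crick pairs plus the G-U wobble pair, as named in the abstract.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def can_pair(x, y):
    """True if bases x and y can pair under the RNA rules above."""
    return (x, y) in PAIRS


def helical_matches(pos_seq, neg_seq, min_len):
    """Return (i, j, length) for maximal runs in which pos_seq[i + k]
    pairs with neg_seq[j - k] for every k in range(length)."""
    hits = []
    for i in range(len(pos_seq)):
        for j in range(len(neg_seq)):
            if not can_pair(pos_seq[i], neg_seq[j]):
                continue
            # Only start a run where it cannot be extended backwards.
            if i > 0 and j + 1 < len(neg_seq) and can_pair(pos_seq[i - 1], neg_seq[j + 1]):
                continue
            length = 0
            while (i + length < len(pos_seq) and j - length >= 0
                   and can_pair(pos_seq[i + length], neg_seq[j - length])):
                length += 1
            if length >= min_len:
                hits.append((i, j, length))
    return hits
```

For instance, `helical_matches("GGGG", "UCCU", 4)` returns `[(0, 3, 4)]`, a four-base helix made of two G-U and two G-C pairs. The quadratic scan is only suitable for short inputs, which is why an indexed approach is used for large sequence sets.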
