Similar literature
20 similar articles found (search time: 31 ms)
1.
N Tolstrup, P Rouzé, S Brunak. Nucleic Acids Research, 1997, 25(15): 3159-3163
Little knowledge exists about branch points in plants; it has even been claimed that plant introns lack conserved branch point sequences similar to those found in vertebrate introns. A putative branch point consensus sequence for Arabidopsis thaliana resembling the well known metazoan consensus sequence has been proposed, but this is based on a search for sequences similar to those in yeast and metazoa. Here we present a novel consensus sequence found by a non-circular approach. A hidden Markov model with a fixed A nucleotide was trained on sequences upstream of the acceptor site. The consensus found by the Markov model shares features with the metazoan consensus, but differs in its details from the consensus proposed earlier. Despite the fact that branch point consensus sequences in plants are weak, we show that a prediction scheme incorporating them leads to a substantial improvement in the recognition of true acceptor sites, with the false positive rate reduced by a factor of 2. We take this as an indication that the consensus found here is the genuine one and that the branch point does play a role in the proper recognition of the acceptor site in plants.

2.
MOTIVATION: A consensus sequence for a family of related sequences is, as the name suggests, a sequence that captures the features common to most members of the family. Consensus sequences are important in various DNA sequencing applications and are a convenient way to characterize a family of molecules. RESULTS: This paper describes a new algorithm for finding a consensus sequence, using the popular optimization method known as simulated annealing. Unlike the conventional approach of finding a consensus sequence by first forming a multiple sequence alignment, this algorithm searches for a sequence that minimises the sum of pairwise distances to each of the input sequences. The resulting consensus sequence can then be used to induce a multiple sequence alignment. The time required by the algorithm scales linearly with the number of input sequences and quadratically with the length of the consensus sequence. We present results demonstrating the high quality of the consensus sequences and alignments produced by the new algorithm. For comparison, we also present similar results obtained using ClustalW. The new algorithm outperforms ClustalW in many cases.
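
The search described in this abstract can be illustrated roughly as follows. This is a minimal sketch, not the authors' implementation: the use of Levenshtein distance, the move set (substitution, insertion, deletion), the linear cooling schedule, and every function and parameter name are assumptions made for illustration, since the abstract does not specify them.

```python
import math
import random


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


def total_cost(candidate, seqs):
    """Sum of distances from the candidate to every input sequence."""
    return sum(edit_distance(candidate, s) for s in seqs)


def mutate(candidate, alphabet, rng):
    """Apply one random substitution, insertion or deletion (assumed move set)."""
    move = rng.choice(("sub", "ins", "del"))
    if move == "ins" or not candidate:
        pos = rng.randrange(len(candidate) + 1)
        return candidate[:pos] + rng.choice(alphabet) + candidate[pos:]
    pos = rng.randrange(len(candidate))
    if move == "del" and len(candidate) > 1:
        return candidate[:pos] + candidate[pos + 1:]
    return candidate[:pos] + rng.choice(alphabet) + candidate[pos + 1:]


def anneal_consensus(seqs, alphabet="ACGT", steps=2000,
                     t_start=2.0, t_end=0.01, seed=0):
    """Simulated annealing over candidate strings; returns (best, best_cost)."""
    rng = random.Random(seed)
    current = seqs[0]
    current_cost = total_cost(current, seqs)
    best, best_cost = current, current_cost
    for step in range(steps):
        # Linear cooling from t_start towards t_end (an assumed schedule).
        temp = t_start + (t_end - t_start) * step / steps
        proposal = mutate(current, alphabet, rng)
        cost = total_cost(proposal, seqs)
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if cost <= current_cost or rng.random() < math.exp((current_cost - cost) / temp):
            current, current_cost = proposal, cost
            if cost < best_cost:
                best, best_cost = proposal, cost
    return best, best_cost
```

Each cost evaluation runs one dynamic programme per input sequence, which is at least consistent with the scaling the abstract reports (linear in the number of sequences, quadratic in consensus length). A call such as `anneal_consensus(["ACGTAC", "ACGAAC", "ACGTTC"], steps=500)` returns a candidate whose cost is never worse than that of the starting sequence.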

3.
MOTIVATION: Alignment of RNA has a wide range of applications, for example in phylogeny inference, consensus structure prediction and homology searches. Yet aligning structural or non-coding RNAs (ncRNAs) correctly is notoriously difficult as these RNA sequences may evolve by compensatory mutations, which maintain base pairing but destroy sequence homology. Ideally, alignment programs would take RNA structure into account. The Sankoff algorithm for the simultaneous solution of RNA structure prediction and RNA sequence alignment was proposed 20 years ago but suffers from its exponential complexity. A number of programs implement lightweight versions of the Sankoff algorithm by restricting its application to a limited type of structure and/or only pairwise alignment. Thus, despite recent advances, the proper alignment of multiple structural RNA sequences remains a problem. RESULTS: Here we present StrAl, a heuristic method for alignment of ncRNA that reduces sequence-structure alignment to a two-dimensional problem similar to standard multiple sequence alignment. The scoring function takes into account sequence similarity as well as up- and downstream pairing probability. To test the robustness of the algorithm and the performance of the program, we scored alignments produced by StrAl against a large set of published reference alignments. The quality of alignments predicted by StrAl is far better than that obtained by standard sequence alignment programs, especially when sequence homologies drop below approximately 65%; nevertheless StrAl's runtime is comparable to that of ClustalW.

4.
An automated algorithm is presented that delineates protein sequence fragments which display similarity. The method incorporates a selection of a number of local nonoverlapping sequence alignments with the highest similarity scores and a graph-theoretical approach to elucidate the consistent start and end points of the fragments comprising one or more ensembles of related subsequences. The procedure allows the simultaneous identification of different types of repeats within one sequence. A multiple alignment of the resulting fragments is performed and a consensus sequence derived from the ensemble(s). Finally, a profile is constructed from the multiple alignment to detect possible and more distant members within the sequence. The method tolerates mutations in the repeats as well as insertions and deletions. The sequence spans between the various repeats or repeat clusters may be of different lengths. The technique has been applied to a number of proteins where the repeating fragments have been derived from information additional to the protein sequences. © 1993 Wiley-Liss, Inc.

5.
MOTIVATION: The well-known Sankoff algorithm for simultaneous RNA sequence alignment and folding is currently considered an ideal, but computationally over-expensive method. Available tools implement this algorithm under various pragmatic restrictions. They are still expensive to use, and it is difficult to judge whether the moderate quality of results is due to the underlying model or to its imperfect implementation. RESULTS: We propose to redefine the consensus structure prediction problem in a way that does not imply a multiple sequence alignment step. For a family of RNA sequences, our method explicitly and independently enumerates the near-optimal abstract shape space, and predicts as the consensus an abstract shape common to all sequences. For each sequence, it delivers the thermodynamically best structure which has this common shape. Since the shape space is much smaller than the structure space, and identification of common shapes can be done in linear time (in the number of shapes considered), the method is essentially linear in the number of sequences. Our evaluation shows that the new method compares favorably with available alternatives. AVAILABILITY: The new method has been implemented in the program RNAcast and is available on the Bielefeld Bioinformatics Server. CONTACT: jreeder@TechFak.Uni-Bielefeld.DE, robert@TechFak.Uni-Bielefeld.DE SUPPLEMENTARY INFORMATION: Available at http://bibiserv.techfak.uni-bielefeld.de/rnacast/supplementary.html

6.
Sensitivity analysis provides a way to measure robustness of clades in sequence‐based phylogenetic analyses to variation in alignment parameters rather than measuring their branch support. We compared three different approaches to multiple sequence alignment in the context of sensitivity analysis: progressive pairwise alignment, as implemented in MUSCLE; simultaneous multiple alignment of sequence fragments, as implemented in DCA; and direct optimization followed by generation of the implied alignment(s), as implemented in POY. We set out to determine the relative sensitivity of these three alignment methods using rDNA sequences and randomly generated sequences. A total of 36 parameter sets were used to create the alignments, varying the transition, transversion, and gap costs. Tree searches were performed using four different character‐coding and weighting approaches: the cost function used for alignment or equally weighted parsimony with gap positions treated as missing data, separate characters, or as fifth states. POY was found to be as sensitive, or more sensitive, to variation in alignment parameters than DCA and MUSCLE for the three empirical datasets, and POY was found to be more sensitive than MUSCLE, which in turn was found to be as sensitive, or more sensitive, than DCA when applied to the randomly generated sequences when sensitivity was measured using the averaged jackknife values. When significant differences in relative sensitivity were found between the different ways of weighting character‐state changes, equally weighted parsimony, for all three ways of treating gapped positions, was less sensitive than applying the same cost function used in alignment for phylogenetic analysis. When branch support is incorporated into the sensitivity criterion, our results favour the use of simultaneous alignment and progressive pairwise alignment using the similarity criterion over direct optimization followed by using the implied alignment(s) to calculate branch support.

7.
MOTIVATION: Consensus sequence generation is important in many kinds of sequence analysis ranging from sequence assembly to profile-based iterative search methods. However, how can a consensus be constructed when its inherent assumption, that the aligned sequences form a single linear consensus, is not true? RESULTS: Partial Order Alignment (POA) enables construction and analysis of multiple sequence alignments as directed acyclic graphs containing complex branching structure. Here we present a dynamic programming algorithm (heaviest_bundle) for generating multiple consensus sequences from such complex alignments. The number and relationships of these consensus sequences reveal the degree of structural complexity of the source alignment. This is a powerful and general approach for analyzing and visualizing complex alignment structures, and can be applied to any alignment. We illustrate its value for analyzing expressed sequence alignments to detect alternative splicing, reconstruct full length mRNA isoform sequences from EST fragments, and separate paralog mixtures that can cause incorrect SNP predictions. AVAILABILITY: The heaviest_bundle source code is available at http://www.bioinformatics.ucla.edu/poa
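
The abstract gives no details of heaviest_bundle beyond it being a dynamic programme over a directed acyclic graph. As a hedged illustration of that general idea only, the sketch below finds a single maximum-weight path in a DAG whose edge weights are assumed to count the sequences that traverse each edge. The function name `heaviest_path`, the input layout, and the requirement that nodes arrive in topological order are all assumptions, and the step of extracting several consensus sequences is not attempted.

```python
from collections import defaultdict


def heaviest_path(nodes, edges):
    """Return (weight, path) for the maximum-weight path in a DAG.

    nodes: node ids, assumed to be listed in topological order.
    edges: dict mapping (u, v) -> weight, e.g. how many sequences use the edge.
    """
    incoming = defaultdict(list)
    for (u, v), w in edges.items():
        incoming[v].append((u, w))

    score = {n: 0 for n in nodes}    # best path weight ending at each node
    back = {n: None for n in nodes}  # predecessor on that best path
    for v in nodes:
        for u, w in incoming[v]:
            if score[u] + w > score[v]:
                score[v] = score[u] + w
                back[v] = u

    # Trace back from the best-scoring end node.
    end = max(nodes, key=lambda n: score[n])
    path = []
    node = end
    while node is not None:
        path.append(node)
        node = back[node]
    return score[end], path[::-1]


# Toy graph: three sequences share node 0 -> 1, then two take node 2 and one takes node 3.
labels = {0: "A", 1: "C", 2: "G", 3: "T", 4: "A"}
toy_edges = {(0, 1): 3, (1, 2): 2, (1, 3): 1, (2, 4): 2, (3, 4): 1}
weight, path = heaviest_path([0, 1, 2, 3, 4], toy_edges)
consensus = "".join(labels[n] for n in path)
```

In this toy graph the majority branch through node 2 carries the most weight, so the traced path spells the consensus of the better-supported branch.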

8.
Multiple sequence alignment by consensus. Cited by: 5 (self-citations: 3, citations by others: 2)
An algorithm for multiple sequence alignment is given that matches words of length and degree of mismatch chosen by the user. The alignment maximizes an alignment scoring function. The method is based on a novel extension of our consensus sequence methods. The algorithm works for both DNA and protein sequences, and from earlier work on consensus sequences, it is possible to estimate statistical significance.

9.
Splice junction and possible branch point sequences have been collected from 177 plant introns. Consensus sequences for the 5' and 3' splice junctions and for possible branch points have been derived. The splice junction consensus sequences were virtually identical to those of animal introns except that the polypyrimidine stretch at the 3' splice junction was less pronounced in the plant introns. A search for possible branch points with sequences related to the yeast, vertebrate and fungal consensus sequences revealed a similar sequence in plant introns.

10.
An Eulerian path approach to global multiple alignment for DNA sequences. Cited by: 3 (self-citations: 0, citations by others: 3)
With the rapid increase in the dataset of genome sequences, the multiple sequence alignment problem is increasingly important and frequently involves the alignment of a large number of sequences. Many heuristic algorithms have been proposed to improve the speed of computation and the quality of alignment. We introduce a novel approach that is fundamentally different from all currently available methods. Our motivation comes from the Eulerian method for fragment assembly in DNA sequencing that transforms all DNA fragments into a de Bruijn graph and then reduces sequence assembly to a Eulerian path problem. The paper focuses on global multiple alignment of DNA sequences, where entire sequences are aligned into one configuration. Our main result is an algorithm with almost linear computational speed with respect to the total size (number of letters) of sequences to be aligned. Five hundred simulated sequences (averaging 500 bases per sequence and as low as 70% pairwise identity) have been aligned within three minutes on a personal computer, and the quality of alignment is satisfactory. As a result, accurate and simultaneous alignment of thousands of long sequences within a reasonable amount of time becomes possible. Data from an Arabidopsis sequencing project is used to demonstrate the performance.
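
The de Bruijn graph mentioned in this abstract can be sketched in a few lines. The code below covers only the standard graph construction (nodes are (k-1)-mers, edges are k-mers, with a count of how often each edge is seen); it does not attempt the paper's alignment procedure, and the function name and the choice of `k` are illustrative assumptions.

```python
from collections import Counter


def de_bruijn_edges(seqs, k):
    """Count k-mer edges between (k-1)-mer nodes across all input sequences."""
    edges = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            kmer = s[i:i + k]
            # Each k-mer links its prefix node to its suffix node.
            edges[(kmer[:-1], kmer[1:])] += 1
    return edges


# Two sequences that agree on "ACG" and then diverge.
graph = de_bruijn_edges(["ACGT", "ACGA"], k=3)
```

Because the graph is built in one pass over every letter, its construction cost grows linearly with the total input size, which fits the near-linear behaviour the abstract describes. Edges shared by many sequences accumulate high counts, while divergent positions show up as branches.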

11.
Alignment of nucleotide and/or amino acid sequences is a fundamental component of sequence‐based molecular phylogenetic studies. Here we examined how different alignment methods affect the phylogenetic trees that are inferred from the alignments. We used simulations to determine how alignment errors can lead to systematic biases that affect phylogenetic inference from those sequences. We compared four approaches to sequence alignment: progressive pairwise alignment, simultaneous multiple alignment of sequence fragments, local pairwise alignment and direct optimization. When taking into account branch support, implied alignments produced by direct optimization were found to show the most extreme behaviour (based on the alignment programs for which nearly equivalent alignment parameters could be set) in that they provided the strongest support for the correct tree in the simulations in which it was easy to resolve the correct tree and the strongest support for the incorrect tree in our long‐branch‐attraction simulations. When applied to alignment‐sensitive process partitions with different histories, direct optimization showed the strongest mutual influence between the process partitions when they were aligned and phylogenetically analysed together, which makes detecting recombination more difficult. Simultaneous alignment performed well relative to direct optimization and progressive pairwise alignment across all simulations. Rather than relying upon methods that integrate alignment and tree search into a single step without accounting for alignment uncertainty, as with implied alignments, we suggest that simultaneous alignment using the similarity criterion, within the context of information available on biological processes and function, be applied whenever possible for sequence‐based phylogenetic analyses.

12.
The conformation of RNA sequences spanning five 3' splice sites and two 5' splice sites in adenovirus mRNA was probed by partial digestion with single-strand specific nucleases. Although cleavage of nucleotides near both 3' and 5' splice sites was observed, most striking was the preferential digestion of sequences near the 3' splice site. At each 3' splice site a region of very strong cleavage is observed at low concentrations of enzyme near the splice site consensus sequence or the upstream branch point consensus sequence. Additional sites of moderately strong cutting near the branch point consensus sequence were observed in those sequences where the splice site was the preferred target. Since recognition of the 3' splice site and recognition of the branch site appear to be early events in mRNA splicing, these observations may indicate that the local conformation of the splice site sequences plays a direct or indirect role in enhancing the accessibility of sequences important for splicing.

13.
Summary: Various measures of sequence dissimilarity have been evaluated by how well the additive least squares estimation of edges (branch lengths) of an unrooted evolutionary tree fit the observed pairwise dissimilarity measures and by how consistent the trees are for different data sets derived from the same set of sequences. This evaluation provided sensitive discrimination among dissimilarity measures and among possible trees. Dissimilarity measures not requiring prior sequence alignment did about as well as did the traditional mismatch counts requiring prior sequence alignment. Application of Jukes-Cantor correction to singlet mismatch counts worsened the results. Measures not requiring alignment had the advantage of being applicable to sequences too different to be critically alignable. Two different measures of pairwise dissimilarity not requiring alignment have been used: (1) multiplet distribution distance (MDD), the square of the Euclidean distance between vectors of the fractions of base singlets (or doublets, or triplets, or…) in the respective sequences, and (2) complements of long words (CLW), the count of bases not occurring in significantly long common words. MDD was applicable to sequences more different than was CLW (noncoding), but the latter often gave better results where both measures were available (coding). MDD results were improved by using longer multiplets and, if the sequences were coding, by using the larger amino acid and codon alphabets rather than the nucleotide alphabet. The additive least squares method could be used to provide a reasonable consensus of different trees for the same set of species (or related genes).
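
The multiplet distribution distance (MDD) defined in this abstract translates fairly directly into code. The sketch below is one possible reading, under stated assumptions: multiplets are counted in overlapping windows, the alphabet defaults to nucleotides, and windows containing characters outside the alphabet contribute to the total but not to any vector entry. The function names are illustrative and do not come from the paper.

```python
from collections import Counter
from itertools import product


def multiplet_fractions(seq, k, alphabet="ACGT"):
    """Fractions of each length-k multiplet, counted in overlapping windows (assumed)."""
    if len(seq) < k:
        raise ValueError("sequence is shorter than the multiplet length")
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    # Fixed ordering over all possible multiplets so that vectors are comparable.
    return [counts["".join(p)] / total for p in product(alphabet, repeat=k)]


def mdd(seq_a, seq_b, k=1, alphabet="ACGT"):
    """Squared Euclidean distance between the two multiplet-fraction vectors."""
    fa = multiplet_fractions(seq_a, k, alphabet)
    fb = multiplet_fractions(seq_b, k, alphabet)
    return sum((x - y) ** 2 for x, y in zip(fa, fb))
```

For example, `mdd("AAAA", "CCCC")` compares the singlet vectors (1, 0, 0, 0) and (0, 1, 0, 0), giving a distance of 2, while any sequence compared with itself gives 0. Passing a larger `k` uses doublets or triplets, and a protein alphabet can be supplied through `alphabet`.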

14.
Xiong F, Jiang J, Han Z, Zhong R, He L, Zhuang W, Tang R. Biochemical Genetics, 2011, 49(5-6): 352-363
A novel method is introduced for producing molecular markers in plants using single 15- to 18-mer PCR primers designed from the short conserved consensus branch point signal sequences and standard agarose gel electrophoresis. This method was tested on cultivated peanut and verified to give good fingerprinting results in other plant species (mango, banana, and longan). These single primers, designed from relatively conserved branch point signal sequences within gene introns, should be universal across other plant species. The method is rapid, simple, and efficient, and it requires no sequence information of the plant genome of interest. It could be used in conjunction with, or as a substitute for, conventional RAPD or ISSR techniques for applications including genetic diversity analysis, phylogenetic tree construction, and quantitative trait locus mapping. This technique provides a new way to develop molecular markers for assessing genetic diversity of germplasm in diverse species based on conserved branch point signal sequences.

15.
A flexible method to align large numbers of biological sequences. Cited by: 5 (self-citations: 0, citations by others: 5)
Summary: A method for the alignment of two or more biological sequences is described. The method is a direct extension of the method of Taylor (1987) incorporating a consensus sequence approach and allows considerable freedom in the control of the clustering of the sequences. At one extreme this is equivalent to the earlier method (Taylor 1987), whereas at the other, the clustering approaches the binary method of Feng and Doolittle (1987). Such freedom allows the program to be adapted to particular problems, which has the important advantage of resulting in considerable savings in computer time, allowing very large problems to be tackled. Besides a detailed analysis of the alignment of the cytochrome c superfamily, the clustering and alignment of the PIR sequence data bank (approx. 3500 sequences) is described.

16.
17.
We present a computational scheme to locally align a collection of RNA sequences using sequence and structure constraints. In addition, the method searches for the resulting alignments with the most significant common motifs, among all possible collections. The first part utilizes a simplified version of the Sankoff algorithm for simultaneous folding and alignment of RNA sequences, but maintains tractability by constructing multi-sequence alignments from pairwise comparisons. The algorithm finds the multiple alignments using a greedy approach and has similarities to both CLUSTAL and CONSENSUS, but the core algorithm assures that the pairwise alignments are optimized for both sequence and structure conservation. The choice of scoring system and the method of progressively constructing the final solution are important considerations that are discussed. Example solutions, and comparisons with other approaches, are provided. The solutions include finding consensus structures identical to published ones.

18.
An intermediate stage in the process of eukaryotic RNA splicing is the formation of a lariat structure. It is anchored at an adenosine residue in the intron between 10 and 50 nucleotides upstream of the 3' splice site. A short conserved sequence (the branch point sequence) functions as the recognition signal for the site of lariat formation. It has been generally assumed that the branch point is recognized mainly by the presence of its unique sequence where the lariat is formed. However, the known branch point consensus sequence is found to be distributed nearly randomly throughout the gene sequence with only a slightly higher frequency in the expected lariat region. Further, the known consensus sequence is found to be clearly inadequate to specify branch points. These observations have implications for understanding the mechanism of branch point recognition in the process of splicing, and the possible evolution of the branch point signal.

19.
MOTIVATION: A large, high-quality database of homologous sequence alignments with good estimates of their corresponding phylogenetic trees will be a valuable resource to those studying phylogenetics. It will allow researchers to compare current and new models of sequence evolution across a large variety of sequences. The large quantity of data may provide inspiration for new models and methodology to study sequence evolution and may allow general statements about the relative effect of different molecular processes on evolution. RESULTS: The Pandit 7.6 database contains 4341 families of sequences derived from the seed alignments of the Pfam database of amino acid alignments of families of homologous protein domains (Bateman et al., 2002). Each family in Pandit includes an alignment of amino acid sequences that matches the corresponding Pfam family seed alignment, an alignment of DNA sequences that contain the coding sequence of the Pfam alignment when they can be recovered (overall, 82.9% of sequences taken from Pfam) and the alignment of amino acid sequences restricted to only those sequences for which a DNA sequence could be recovered. Each of the alignments has an estimate of the phylogenetic tree associated with it. The tree topologies were obtained using the neighbor joining method based on maximum likelihood estimates of the evolutionary distances, with branch lengths then calculated using a standard maximum likelihood approach.

20.
Certain thalassemic human beta-globin pre-mRNAs carry mutations that generate aberrant splice sites and/or activate cryptic splice sites, providing a convenient and clinically relevant system to study splice site selection. Antisense 2'-O-methyl oligoribonucleotides were used to block a number of sequences in these pre-mRNAs and were tested for their ability to inhibit splicing in vitro or to affect the ratio between aberrantly and correctly spliced products. By this approach, it was found that (i) up to 19 nucleotides upstream from the branch point adenosine are involved in proper recognition and functioning of the branch point sequence; (ii) whereas at least 25 nucleotides of exon sequences at both 3' and 5' ends are required for splicing, this requirement does not extend past the 5' splice site sequence of the intron; and (iii) improving the 5' splice site of the internal exon to match the consensus sequence strongly decreases the accessibility of the upstream 3' splice site to antisense 2'-O-methyl oligoribonucleotides. This result most likely reflects changes in the strength of interactions near the 3' splice site in response to improvement of the 5' splice site and further supports the existence of communication between these sites across the exon.
