Similar literature
 20 similar articles found
1.
Sequence alignment programs such as BLAST and PSI-BLAST are used routinely in pairwise, profile-based, or intermediate-sequence-search (ISS) methods to detect remote homologies for the purposes of fold assignment and comparative modeling. Yet, the sequence alignment quality of these methods at low sequence identity is not known. We have used the CE structure alignment program (Shindyalov and Bourne, Prot Eng 1998;11:739) to derive sequence alignments for all superfamily and family-level related proteins in the SCOP domain database. CE aligns structures and their sequences based on distances within each protein, rather than on interprotein distances. We compared BLAST, PSI-BLAST, CLUSTALW, and ISS alignments with the CE structural alignments. We found that global alignments with CLUSTALW were very poor at low sequence identity (<25%), as judged by the CE alignments. We used PSI-BLAST to search the nonredundant sequence database (nr) with every sequence in SCOP using up to four iterations. The resulting matrix was used to search a database of SCOP sequences. PSI-BLAST is only slightly better than BLAST in alignment accuracy on a per-residue basis, but PSI-BLAST matrix alignments are much longer than BLAST's, and so align correctly a larger fraction of the total number of aligned residues in the structure alignments. Any two SCOP sequences in the same superfamily that shared a hit or hits in the nr PSI-BLAST searches were identified as linked by the shared intermediate sequence. We examined the quality of the longest SCOP-query/ SCOP-hit alignment via an intermediate sequence, and found that ISS produced longer alignments than PSI-BLAST searches alone, of nearly comparable per-residue quality. At 10-15% sequence identity, BLAST correctly aligns 28%, PSI-BLAST 40%, and ISS 46% of residues according to the structure alignments. We also compared CE structure alignments with FSSP structure alignments generated by the DALI program. 
In contrast to the sequence methods, CE and structure alignments from the FSSP database identically align 75% of residue pairs at the 10-15% level of sequence identity, indicating that there is substantial room for improvement in these sequence alignment methods. BLAST produced alignments for 8% of the 10,665 nonimmunoglobulin SCOP superfamily sequence pairs (nearly all <25% sequence identity), PSI-BLAST matched 17% and the double-PSI-BLAST ISS method aligned 38% with E-values <10.0. The results indicate that intermediate sequences may be useful not only in fold assignment but also in achieving more complete sequence alignments for comparative modeling.
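
The percentages quoted in the abstract above compare a sequence alignment against a structure-derived reference. As a minimal sketch of one plausible way to compute such a figure (the function names and the exact definition are assumptions for illustration, not the authors' published procedure), the fraction of reference residue pairs reproduced by a test alignment can be written as:

```python
def aligned_pairs(row_a, row_b):
    """Return the set of (i, j) residue-index pairs aligned in a gapped pairwise alignment."""
    pairs = set()
    i = j = 0  # running residue indices in the two ungapped sequences
    for a, b in zip(row_a, row_b):
        if a != "-" and b != "-":
            pairs.add((i, j))
        if a != "-":
            i += 1
        if b != "-":
            j += 1
    return pairs


def fraction_correct(test, reference):
    """Fraction of the reference alignment's residue pairs that the test alignment reproduces.

    Both arguments are (row_a, row_b) tuples of gapped strings over the same two sequences.
    """
    ref_pairs = aligned_pairs(*reference)
    if not ref_pairs:
        return 0.0
    return len(aligned_pairs(*test) & ref_pairs) / len(ref_pairs)
```

Dividing by the number of pairs in the test alignment instead of the reference would give a per-residue accuracy, which is the other kind of measure the abstract distinguishes.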

2.
Rai BK  Fiser A 《Proteins》2006,63(3):644-661
A major bottleneck in comparative protein structure modeling is the quality of input alignment between the target sequence and the template structure. A number of alignment methods are available, but none of these techniques produce consistently good solutions for all cases. Alignments produced by alternative methods may be superior in certain segments but inferior in others when compared to each other; therefore, an accurate solution often requires an optimal combination of them. To address this problem, we have developed a new approach, Multiple Mapping Method (MMM). The algorithm first identifies the alternatively aligned regions from a set of input alignments. These alternatively aligned segments are scored using a composite scoring function, which determines their fitness within the structural environment of the template. The best scoring regions from a set of alternative segments are combined with the core part of the alignments to produce the final MMM alignment. The algorithm was tested on a dataset of 1400 protein pairs using 11 combinations of two to four alignment methods. In all cases MMM showed statistically significant improvement by reducing alignment errors in the range of 3 to 17%. MMM also compared favorably with two alignment meta-servers. The algorithm is computationally efficient; therefore, it is a suitable tool for genome scale modeling studies.

3.
Reddy BV  Li WW  Shindyalov IN  Bourne PE 《Proteins》2001,42(2):148-163
An all-against-all protein structure comparison using the Combinatorial Extension (CE) algorithm applied to a representative set of PDB structures revealed a gallery of common substructures in proteins (http://cl.sdsc.edu/ce.html). These substructures represent commonly identified folds, domains, or components thereof. Most of the subsequences forming these similar substructures have no significant sequence similarity. We present a method to identify conserved amino acid positions and residue-dependent property clusters within these subsequences starting with structure alignments. Each of the subsequences is aligned to its homologues in SWALL, a nonredundant protein sequence database. The most similar sequences are purged into a common frequency matrix, and weighted homologues of each one of the subsequences are used in scoring for conserved key amino acid positions (CKAAPs). We have set the top 20% of the high-scoring positions in each substructure to be CKAAPs. It is hypothesized that CKAAPs may be responsible for the common folding patterns in either a local or global view of the protein-folding pathway. Where a significant number of structures exist, CKAAPs have also been identified in structure alignments of complete polypeptide chains from the same protein family or superfamily. Evidence to support the presence of CKAAPs comes from other computational approaches and experimental studies of mutation and protein-folding experiments, notably the Paracelsus challenge. Finally, the structural environment of CKAAPs versus non-CKAAPs is examined for solvent accessibility, hydrogen bonding, and secondary structure. The identification of CKAAPs has important implications for protein engineering, fold recognition, modeling, and structure prediction studies and is dependent on the availability of structures and an accurate structure alignment methodology. Proteins 2001;42:148-163.

4.
An algorithm is presented for the multiple alignment of protein sequences that is both accurate and rapid computationally. The approach is based on the conventional dynamic-programming method of pairwise alignment. Initially, two sequences are aligned, then the third sequence is aligned against the alignment of both sequences one and two. Similarly, the fourth sequence is aligned against one, two and three. This is repeated until all sequences have been aligned. Iteration is then performed to yield a final alignment. The accuracy of sequence alignment is evaluated from alignment of the secondary structures in a family of proteins. For the globins, the multiple alignment was on average 99% accurate compared to 90% for pairwise comparison of sequences. For the alignment of immunoglobulin constant and variable domains, the use of many sequences yielded an alignment of 63% average accuracy compared to 41% average for individual variable/constant alignments. The multiple alignment algorithm yields an assignment of disulphide connectivity in mammalian serotransferrin that is consistent with crystallographic data, whereas pairwise alignments give an alternative assignment.
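
The building block named in the abstract above is conventional dynamic-programming pairwise alignment. The following is a minimal sketch of that pairwise step only (a textbook Needleman-Wunsch global alignment), assuming a simple match/mismatch score and a linear gap penalty; the scoring values are illustrative assumptions, and the sequence-against-alignment and iteration stages of the published method are not reproduced here.

```python
def needleman_wunsch(s, t, match=1, mismatch=-1, gap=-2):
    """Global pairwise alignment; returns (score, aligned_s, aligned_t)."""
    n, m = len(s), len(t)
    # score[i][j] = best score for aligning s[:i] with t[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_s, out_t = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and s[i - 1] == t[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_s.append(s[i - 1]); out_t.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_s.append(s[i - 1]); out_t.append("-"); i -= 1
        else:
            out_s.append("-"); out_t.append(t[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_s)), "".join(reversed(out_t))
```

A progressive scheme in the spirit of the abstract would call a generalization of this routine in which one side is an existing alignment (scored column by column) rather than a single sequence.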

5.
Alignment of protein sequences is a key step in most computational methods for prediction of protein function and homology-based modeling of three-dimensional (3D)-structure. We investigated correspondence between "gold standard" alignments of 3D protein structures and the sequence alignments produced by the Smith-Waterman algorithm, currently the most sensitive method for pair-wise alignment of sequences. The results of this analysis enabled development of a novel method to align a pair of protein sequences. The comparison of the Smith-Waterman and structure alignments focused on their inner structure and especially on the continuous ungapped alignment segments, "islands" between gaps. Approximately one third of the islands in the gold standard alignments have negative or low positive score, and their recognition is below the sensitivity limit of the Smith-Waterman algorithm. From the alignment accuracy perspective, the time spent by the algorithm while working in these unalignable regions is unnecessary. We considered features of the standard similarity scoring function responsible for this phenomenon and suggested an alternative hierarchical algorithm, which explicitly addresses high scoring regions. This algorithm is considerably faster than the Smith-Waterman algorithm, whereas resulting alignments are on average of the same quality with respect to the gold standard. This finding shows that the decrease of alignment accuracy is not necessarily a price for the computational efficiency.
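
The "islands" discussed in the abstract above are the continuous ungapped segments of a gapped alignment. As a rough sketch of how they could be extracted and scored, the helpers below split an alignment at gap columns and sum a per-column score; the function names are hypothetical, and the match/mismatch values stand in for the substitution matrix a real analysis would use.

```python
def ungapped_islands(row_a, row_b):
    """Split a gapped pairwise alignment into its continuous ungapped segments."""
    islands, current = [], []
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            # A gap column closes the island that was being built, if any.
            if current:
                islands.append(current)
                current = []
        else:
            current.append((a, b))
    if current:
        islands.append(current)
    return [("".join(a for a, _ in seg), "".join(b for _, b in seg))
            for seg in islands]


def island_score(seg_a, seg_b, match=2, mismatch=-1):
    """Sum of per-column scores; match/mismatch are assumed placeholder values."""
    return sum(match if a == b else mismatch for a, b in zip(seg_a, seg_b))
```

Islands whose score is negative or only slightly positive under the chosen scoring are the ones the abstract describes as falling below the sensitivity limit of Smith-Waterman.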

6.
Multiple sequence alignments (MSAs) have become one of the most studied approaches in bioinformatics to perform other outstanding tasks such as structure prediction, biological function analysis or next-generation sequencing. However, current MSA algorithms do not always provide consistent solutions, since alignments become increasingly difficult when dealing with low similarity sequences. As is widely known, these algorithms directly depend on specific features of the sequences, which strongly influence the alignment accuracy. Many MSA tools have been recently designed but it is not possible to know in advance which one is the most suitable for a particular set of sequences. In this work, we analyze some of the most used algorithms presented in the bibliography and their dependences on several features. A novel intelligent algorithm based on least square support vector machine is then developed to predict how accurate each alignment could be, depending on its analyzed features. This algorithm is run on a dataset of 2180 MSAs. The proposed system first estimates the accuracy of possible alignments. The most promising methodologies are then selected in order to align each set of sequences. Since only one selected algorithm is run, the computational time is not excessively increased.

7.
MOTIVATION: Recent studies have revealed the importance of considering quality scores of reads generated by next-generation sequence (NGS) platforms in various downstream analyses. It is also known that probabilistic alignments based on marginal probabilities (e.g. aligned-column and/or gap probabilities) provide more accurate alignment than conventional maximum score-based alignment. There exists, however, no study about probabilistic alignment that considers quality scores explicitly, although the method is expected to be useful in SNP/indel callers and bisulfite mapping, because accurate estimation of aligned columns or gaps is important in those analyses. RESULTS: In this study, we propose methods of probabilistic alignment that consider quality scores of (one of) the sequences as well as a usual score matrix. The method is based on posterior decoding techniques in which various marginal probabilities are computed from a probabilistic model of alignments with quality scores, and can arbitrarily trade off sensitivity and positive predictive value (PPV) of prediction (aligned columns and gaps). The method is directly applicable to read mapping (alignment) toward accurate detection of SNPs and indels. Several computational experiments indicated that probabilistic alignments can estimate aligned columns and gaps accurately, compared with other mapping algorithms such as SHRiMP2, Stampy, BWA and Novoalign. The study also suggested that our approach yields favorable precision for SNP/indel calling.

8.
MOTIVATION: Protein sequence alignment plays a critical role in computational biology as it is an integral part in many analysis tasks designed to solve problems in comparative genomics, structure and function prediction, and homology modeling. METHODS: We have developed novel sequence alignment algorithms that compute the alignment between a pair of sequences based on short fixed- or variable-length high-scoring subsequences. Our algorithms build the alignments by repeatedly selecting the highest scoring pairs of subsequences and using them to construct small portions of the final alignment. We utilize PSI-BLAST generated sequence profiles and employ a profile-to-profile scoring scheme derived from PICASSO. RESULTS: We evaluated the performance of the computed alignments on two recently published benchmark datasets and compared them against the alignments computed by existing state-of-the-art dynamic programming-based profile-to-profile local and global sequence alignment algorithms. Our results show that the new algorithms achieve alignments that are comparable with or better than those achieved by existing algorithms. Moreover, our results also showed that these algorithms can be used to provide better information as to which of the aligned positions are more reliable--a critical piece of information for comparative modeling applications.

9.
Rasmussen TK  Krink T 《Bio Systems》2003,72(1-2):5-17
Multiple sequence alignment (MSA) is one of the basic problems in computational biology. Realistic problem instances of MSA are computationally intractable for exact algorithms. One way to tackle MSA is to use Hidden Markov Models (HMMs), which are known to be very powerful in the related problem domain of speech recognition. However, the training of HMMs is computationally hard and there is no known exact method that can guarantee optimal training within reasonable computing time. Perhaps the most powerful training method is the Baum-Welch algorithm, which is fast, but bears the problem of stagnation at local optima. In the study reported in this paper, we used a hybrid algorithm combining particle swarm optimization with evolutionary algorithms to train HMMs for the alignment of protein sequences. Our experiments show that our approach yields better alignments for a set of benchmark protein sequences than the most commonly applied HMM training methods, such as Baum-Welch and Simulated Annealing.

10.
A hidden Markov model for progressive multiple alignment   Total citations: 4, self-citations: 0, citations by others: 4
MOTIVATION: Progressive algorithms are widely used heuristics for the production of alignments among multiple nucleic-acid or protein sequences. Probabilistic approaches providing measures of global and/or local reliability of individual solutions would constitute valuable developments. RESULTS: We present here a new method for multiple sequence alignment that combines an HMM approach, a progressive alignment algorithm, and a probabilistic evolution model describing the character substitution process. Our method works by iterating pairwise alignments according to a guide tree and defining each ancestral sequence from the pairwise alignment of its child nodes, thus progressively constructing a multiple alignment. Our method allows for the computation of each column's minimum posterior probability and we show that this value correlates with the correctness of the result, hence providing an efficient means by which unreliably aligned columns can be filtered out from a multiple alignment.

11.
Several recent publications illustrated advantages of using sequence profiles in recognizing distant homologies between proteins. At the same time, the practical usefulness of distant homology recognition depends not only on the sensitivity of the algorithm, but also on the quality of the alignment between a prediction target and the template from the database of known proteins. Here, we study this question for several supersensitive protein algorithms that were previously compared in their recognition sensitivity (Rychlewski et al., 2000). A database of protein pairs with similar structures, but low sequence similarity is used to rate the alignments obtained with several different methods, which included sequence-sequence, sequence-profile, and profile-profile alignment methods. We show that incorporation of evolutionary information encoded in sequence profiles into alignment calculation methods significantly increases the alignment accuracy, bringing them closer to the alignments obtained from structure comparison. In general, alignment quality is correlated with recognition and alignment score significance. For every alignment method, alignments with statistically significant scores correlate with both correct structural templates and good quality alignments. At the same time, average alignment lengths differ in various methods, making the comparison between them difficult. For instance, the alignments obtained by FFAS, the profile-profile alignment algorithm developed in our group, are always longer than the alignments obtained with the PSI-BLAST algorithms. To address this problem, we develop methods to truncate or extend alignments to cover a specified percentage of protein lengths. In most cases, the elongation of the alignment by profile-profile methods is reasonable, adding fragments of similar structure. The examples of erroneous alignment are examined and it is shown that they can be identified based on the model quality.

12.
When aligning RNAs, it is important to consider both the secondary structure similarity and primary sequence similarity to find an accurate alignment. However, algorithms that can handle RNA secondary structures typically have high computational complexity that limits their utility. For this reason, there have been a number of attempts to find useful alignment constraints that can reduce the computations without sacrificing the alignment accuracy. In this paper, we propose a new method for finding effective alignment constraints for fast and accurate structural alignment of RNAs, including pseudoknots. In the proposed method, we use a profile-HMM to identify the “seed” regions that can be aligned with high confidence. We also estimate the position range of the aligned bases that are located outside the seed regions. The location of the seed regions and the estimated range of the alignment positions are then used to establish the sequence alignment constraints. We incorporated the proposed constraints into the profile context-sensitive HMM (profile-csHMM) based RNA structural alignment algorithm. Experiments indicate that the proposed method can make the alignment speed up to 11 times faster without degrading the accuracy of the RNA alignment.

13.
Aligning gene expression time series with time warping algorithms   Total citations: 1, self-citations: 0, citations by others: 1
MOTIVATION: Increasingly, biological processes are being studied through time series of RNA expression data collected for large numbers of genes. Because common processes may unfold at varying rates in different experiments or individuals, methods are needed that will allow corresponding expression states in different time series to be mapped to one another. RESULTS: We present implementations of time warping algorithms applicable to RNA and protein expression data and demonstrate their application to published yeast RNA expression time series. Programs executing two warping algorithms are described, a simple warping algorithm and an interpolative algorithm, along with programs that generate graphics that visually present alignment information. We show time warping to be superior to simple clustering at mapping corresponding time states. We document the impact of statistical measurement noise and sample size on the quality of time alignments, and present issues related to statistical assessment of alignment quality through alignment scores. We also discuss directions for algorithm improvement including development of multiple time series alignments and possible applications to causality searches and non-temporal processes ('concentration warping').
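
Time warping maps time points of one series onto those of another when the underlying process runs at different rates. The sketch below is a generic dynamic time warping routine on two scalar series using absolute difference as the local cost; it is offered as an illustration of the general technique under those assumptions and is not the simple or interpolative algorithm implemented by the authors, which operate on expression vectors for many genes.

```python
def dtw(x, y):
    """Dynamic time warping of two numeric series.

    Returns (total_cost, path), where path lists the matched index pairs (i, j).
    """
    n, m = len(x), len(y)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of warping x[:i] onto y[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(x[i - 1] - y[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1],  # advance both series
                                 cost[i - 1][j],      # y[j-1] matched again
                                 cost[i][j - 1])      # x[i-1] matched again
    # Trace back the cheapest path, preferring the diagonal on ties.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [(cost[i - 1][j - 1], i - 1, j - 1),
                 (cost[i - 1][j], i - 1, j),
                 (cost[i][j - 1], i, j - 1)]
        _, i, j = min(steps, key=lambda step: step[0])
    path.reverse()
    return cost[n][m], path
```

For expression data, the scalar difference would be replaced by a distance between expression vectors at the two time points, for example a Euclidean distance across genes.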

14.
Newly determined protein structures are classified as belonging to a new fold if the structures are sufficiently dissimilar from all other so far known protein structures. To analyze structural similarities of proteins, structure alignment tools are used. We demonstrate that the usage of nonsequential structure alignment tools, which neglect the polypeptide chain connectivity, can yield structure alignments with significant similarities between proteins of known three-dimensional structure and newly determined protein structures that possess a new fold. The recently introduced protein structure alignment tool, GANGSTA, is specialized to perform nonsequential alignments with proper assignment of the secondary structure types by focusing on helices and strands only. In the new version, GANGSTA+, the underlying algorithms were completely redesigned, yielding enhanced quality of structure alignments, offering alignment against a larger database of protein structures, and being more efficient. We applied DaliLite, TM-align, and GANGSTA+ on three protein crystal structures considered to be novel folds. Applying GANGSTA+ to these novel folds, we find proteins in the ASTRAL40 database, which possess significant structural similarities, although the alignments are nonsequential and in some cases involve secondary structure elements aligned in reverse orientation. A web server is available at http://agknapp.chemie.fu-berlin.de/gplus for pairwise alignment, visualization, and database comparison.

15.
SUMMARY: We introduce an algorithm that uses the information gained from simultaneous consideration of an entire group of related proteins to create multiple structure alignments (MSTAs). Consistency-based alignment (CBA) first harnesses the information contained within regions that are consistently aligned among a set of pairwise superpositions in order to realign pairs of proteins through both global and local refinement methods. It then constructs a multiple alignment that is maximally consistent with the improved pairwise alignments. We validate CBA's alignments by assessing their accuracy in regions where at least two of the aligned structures contain the same conserved sequence motif. RESULTS: CBA correctly aligns well over 90% of motif residues in superpositions of proteins belonging to the same family or superfamily, and it outperforms a number of previously reported MSTA algorithms.

16.
MOTIVATION: Multiple structure alignments are becoming important tools in many aspects of structural bioinformatics. The current explosion in the number of available protein structures demands multiple structural alignment algorithms with an adequate balance of accuracy and speed, for large scale applications in structural genomics, protein structure prediction and protein classification. RESULTS: A new multiple structural alignment program, MAMMOTH-mult, is described. It is demonstrated that the alignments obtained with the new method are an improvement over previous manual or automatic alignments available in several widely used databases at all structural levels. Detailed analysis of the structural alignments for a few representative cases indicates that MAMMOTH-mult delivers biologically meaningful trees and conservation at the sequence and structural levels of functional motifs in the alignments. An important improvement over previous methods is the reduction in computational cost. Typical alignments take only a median time of 5 CPU seconds on a single R12000 processor. MAMMOTH-mult is particularly useful for large scale applications. AVAILABILITY: http://ub.cbm.uam.es/mammoth/mult.

17.
The characterization of biological function among newly determined protein structures is a central challenge in structural genomics. One class of computational solutions to this problem is based on the similarity of protein structure. Here, we implement a simple yet efficient measure of protein structure similarity, the contact metric. Even though its computation avoids structural alignments and is therefore nearly instantaneous, we find that small values correlate with geometrical root mean square deviations obtained from structural alignments. To test whether the contact metric detects functional similarity, as defined by Gene Ontology (GO) terms, it was compared in large-scale computational experiments to four other measures of structural similarity, including alignment algorithms as well as alignment independent approaches. The contact metric was the fastest method and its sensitivity, at any given specificity level, was a close second only to Fast Alignment and Search Tool--a structural alignment method that is slower by three orders of magnitude. Critically, nearly 40% of correct functional inferences by the contact metric were not identified by any other approach, which shows that the contact metric is complementary and computationally efficient in detecting functional relationships between proteins. A public 'Contact Metric Internet Server' is provided.

18.
Previously proposed methods for protein secondary structure prediction from multiple sequence alignments do not efficiently extract the evolutionary information that these alignments contain. The predictions of these methods are less accurate than they could be, because of their failure to consider explicitly the phylogenetic tree that relates aligned protein sequences. As an alternative, we present a hidden Markov model approach to secondary structure prediction that more fully uses the evolutionary information contained in protein sequence alignments. A representative example is presented, and three experiments are performed that illustrate how the appropriate representation of evolutionary relatedness can improve inferences. We explain why similar improvement can be expected in other secondary structure prediction methods and indeed any comparative sequence analysis method.

19.
Even when there is agreement on what measure a protein multiple structure alignment should be optimizing, finding the optimal alignment is computationally prohibitive. One approach used by many previous methods is aligned fragment pair chaining, where short structural fragments from all the proteins are aligned against each other optimally, and the final alignment chains these together in geometrically consistent ways. Ye and Godzik have recently suggested that adding geometric flexibility may help better model protein structures in a variety of contexts. We introduce the program Matt (Multiple Alignment with Translations and Twists), an aligned fragment pair chaining algorithm that, in intermediate steps, allows local flexibility between fragments: small translations and rotations are temporarily allowed to bring sets of aligned fragments closer, even if they are physically impossible under rigid body transformations. After a dynamic programming assembly guided by these “bent” alignments, geometric consistency is restored in the final step before the alignment is output. Matt is tested against other recent multiple protein structure alignment programs on the popular Homstrad and SABmark benchmark datasets. Matt's global performance is competitive with the other programs on Homstrad, but outperforms the other programs on SABmark, a benchmark of multiple structure alignments of proteins with more distant homology. On both datasets, Matt demonstrates an ability to better align the ends of α-helices and β-strands, an important characteristic of any structure alignment program intended to help construct a structural template library for threading approaches to the inverse protein-folding problem. The related question of whether Matt alignments can be used to distinguish distantly homologous structure pairs from pairs of proteins that are not homologous is also considered. 
For this purpose, a p-value score based on the length of the common core and average root mean squared deviation (RMSD) of Matt alignments is shown to largely separate decoys from homologous protein structures in the SABmark benchmark dataset. We postulate that Matt's strong performance comes from its ability to model proteins in different conformational states and, perhaps even more important, its ability to model backbone distortions in more distantly related proteins.

20.

Background  

A structure alignment method based on a local geometric property is presented and its performance is tested in pairwise and multiple structure alignments. In this approach, the writhing number, a quantity originating from integral formulas of Vassiliev knot invariants, is used as a local geometric measure. This measure is used in a sliding window to calculate the local writhe down the length of the protein chain. By encoding the distribution of writhing numbers across all the structures in the protein databank (PDB), protein geometries are represented in a 20-letter alphabet. This encoding transforms the structure alignment problem into a sequence alignment problem and allows the well-established algorithms of sequence alignment to be employed. Such geometric alignments offer distinct advantages over structural alignments in Cartesian coordinates as they better handle structural subtleties associated with slight twists and bends that distort one structure relative to another.
