Similar articles
1.
A multiple alignment program for protein sequences
A program for the multiple alignment of protein sequences is presented. The program is an extension of the fast alignment program by Wilbur et al. (1984) into higher dimensions. The use of hash procedures on fragments of the protein sequences increases the speed of calculation. Thereby we also take into account fragments which are present in some, but not in all, sequences considered. The results of some multiple alignments are given. Received on September 11, 1986; accepted on March 18, 1987
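
The fragment-hashing idea in this abstract can be illustrated with a minimal Python sketch. It indexes every fixed-length fragment of each sequence in a hash table and keeps the fragments found in at least a chosen number of sequences, so fragments present in some but not all sequences are retained. The function name `shared_fragments`, the fragment length `k`, and the threshold `min_seqs` are assumptions made for illustration; this is not the published program.

```python
from collections import defaultdict


def shared_fragments(seqs, k=3, min_seqs=2):
    """Map each k-length fragment to {sequence index: [start positions]}.

    Only fragments occurring in at least `min_seqs` distinct sequences
    are kept. `k` and `min_seqs` are illustrative parameters.
    """
    # Hash table: fragment -> sequence index -> list of start positions.
    index = defaultdict(lambda: defaultdict(list))
    for i, seq in enumerate(seqs):
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]][i].append(pos)
    # Convert to plain dicts and drop fragments seen in too few sequences.
    return {
        frag: dict(hits)
        for frag, hits in index.items()
        if len(hits) >= min_seqs
    }


if __name__ == "__main__":
    toy = ["MKVLAT", "AKVLGT", "MKQLAT"]  # hypothetical toy sequences
    for frag, hits in sorted(shared_fragments(toy).items()):
        print(frag, hits)
```

Lookup in the hash table takes roughly constant time per fragment, which is the source of the speed-up the abstract attributes to hashing.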

2.
MOTIVATION: Structural RNA genes exhibit unique evolutionary patterns that are designed to conserve their secondary structures; these patterns should be taken into account while constructing accurate multiple alignments of RNA genes. The Sankoff algorithm is a natural alignment algorithm that includes the effect of base-pair covariation in the alignment model. However, the extremely high computational cost of the Sankoff algorithm precludes its application to most RNA sequences. RESULTS: We propose an efficient algorithm for the multiple alignment of structural RNA sequences. Our algorithm is a variant of the Sankoff algorithm, and it uses an efficient scoring system that reduces the time and space requirements considerably without compromising on the alignment quality. First, our algorithm computes the match probability matrix that measures the alignability of each position pair between sequences as well as the base pairing probability matrix for each sequence. These probabilities are then combined to score the alignment using the Sankoff algorithm. By itself, our algorithm does not predict the consensus secondary structure of the alignment but uses external programs for the prediction. We demonstrate that both the alignment quality and the accuracy of the consensus secondary structure prediction from our alignment are the highest among the programs examined. We also demonstrate that our algorithm can align relatively long RNA sequences such as the eukaryotic-type signal recognition particle RNA that is approximately 300 nt in length; multiple alignment of such sequences has not been possible by using other Sankoff-based algorithms. The algorithm is implemented in the software named 'Murlet'. AVAILABILITY: The C++ source code of the Murlet software and the test dataset used in this study are available at http://www.ncrna.org/papers/Murlet/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

3.
MOTIVATION: We consider the problem of multiple alignment of protein sequences with the goal of achieving a large SP (Sum-of-Pairs) score. RESULTS: We introduce a new graph-based method. We name our method QOMA (Quasi-Optimal Multiple Alignment). QOMA starts with an initial alignment. It represents this alignment using a K-partite graph. It then improves the SP score of the initial alignment through local optimizations within a window that moves greedily on the alignment. QOMA uses two parameters to permit flexibility in the time/accuracy trade-off: (1) The size of the window for local optimization. (2) The sparsity of the K-partite graph. Unlike traditional progressive methods, QOMA is independent of the order of sequences. The experimental results on BAliBASE benchmarks show that QOMA produces higher SP scores than the existing tools including ClustalW, Probcons, Muscle, T-Coffee and DCA. The difference is more significant for distant proteins. AVAILABILITY: The software is available from the authors upon request.
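
For readers unfamiliar with the SP objective named in this abstract, a minimal Python sketch of a sum-of-pairs score follows. It sums a simple column score over every pair of rows in an alignment. The function name `sp_score` and the match, mismatch, and gap values are assumptions chosen for illustration; they are not QOMA's scoring parameters, and practical tools typically use a substitution matrix instead.

```python
from itertools import combinations


def sp_score(alignment, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of an alignment given as equal-length rows.

    Columns where both characters are gaps contribute 0. The scoring
    values are illustrative assumptions, not taken from the abstract.
    """
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all rows must have the same length")
    total = 0
    # Visit every unordered pair of rows once.
    for a, b in combinations(alignment, 2):
        for x, y in zip(a, b):
            if x == "-" and y == "-":
                continue  # gap against gap is ignored
            if x == "-" or y == "-":
                total += gap
            elif x == y:
                total += match
            else:
                total += mismatch
    return total


if __name__ == "__main__":
    print(sp_score(["AC-T", "ACGT", "A-GT"]))
```

Because the score is a sum over all row pairs, it does not depend on the order in which the rows are listed.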

4.
A multiple alignment methodology that can produce high-quality alignment is extremely important for predicting the structure of unknown proteins. Nearly all the methodologies developed so far have employed two-way alignment only. Although these methods are fast, the alignments they produce lose reliability as the similarity of sequences reduces. We developed the MASCOT multiple alignment system. MASCOT can sustain the reliability of alignment even when the similarity of sequences is low. MASCOT achieves high-quality alignment by employing three-way alignment in addition to two-way alignment. The resultant alignments are refined by simulated annealing to higher quality. We also use a cluster analysis of sequences to produce highly reliable alignments.

5.
A novel interactive method for generating multiple protein sequence alignments is described. The program has no internal limit to the number or length of sequences it can handle and is designed for use with DEC VAX processors running the VMS operating system. The approach used is essentially one of manual sequence manipulation, aided by built-in symbolic displays of identities and similarities, and strict and ‘fuzzy’ (ambiguous) pattern-matching facilities. Additional flexibility is provided by means of an interface to a publicly available automatic alignment system and to a comprehensive sequence analysis package. Received on August 28, 1990; accepted on November 20, 1990

6.
7.
SUMMARY: MuSiC is a web server to perform the constrained alignment of a set of sequences, such that the user-specified residues/nucleotides are aligned with each other. The input of the MuSiC system consists of a set of protein/DNA/RNA sequences and a set of user-specified constraints, each with a fragment of residues/nucleotides that (approximately) appears in all input sequences. The output of MuSiC is a constrained multiple sequence alignment in which the fragments of the input sequences whose residues/nucleotides exhibit a given degree of similarity to a constraint are aligned together. The current MuSiC system is implemented in Java and can be accessed via a simple web interface. AVAILABILITY: http://genome.life.nctu.edu.tw/MUSIC

8.
We investigated and optimized a method for structure comparison which is based on rigid body superimposition. The method maximizes the number of structurally equivalent residues while keeping the root mean square deviation constant. The resulting number of equivalent residues then provides an adequate similarity measure, which is easy to interpret. We demonstrate that the approach is able to detect remote structural similarity. We show that the number of equivalent residues is a suitable measure for ranking database searches and that the results are in good agreement with expert knowledge protein structure classification. Structure comparison frequently has multiple solutions. The approach that we use provides a range of alternative alignments rather than a single solution. We discuss the nature of alternative solutions using several examples.

9.
In the growing field of genomics, multiple alignment programs are confronted with ever increasing amounts of data. To address this growing issue we have dramatically improved the running time and memory requirement of Kalign, while maintaining its high alignment accuracy. Kalign version 2 also supports nucleotide alignment, and a newly introduced extension allows for external sequence annotation to be included into the alignment procedure. We demonstrate that Kalign2 is exceptionally fast and memory-efficient, permitting accurate alignment of very large numbers of sequences. The accuracy of Kalign2 compares well to the best methods in the case of protein alignments while its accuracy on nucleotide alignments is generally superior. In addition, we demonstrate the potential of using known or predicted sequence annotation to improve the alignment accuracy. Kalign2 is freely available for download from the Kalign web site (http://msa.sbc.su.se/).

10.
SUMMARY: Improving and ascertaining the quality of a multiple sequence alignment is a very challenging step in protein sequence analysis. This is particularly the case when dealing with sequences in the 'twilight zone', i.e. sharing < 30% identity. Here we describe INTERALIGN, a dedicated user-friendly alignment editor including a view of secondary structures and a synchronized display of carbon alpha traces of corresponding protein structures. Profile alignment, using CLUSTALW, is implemented to improve the alignment of a sequence of unknown structure with the visually optimized structural alignment as compared with a standard multiple sequence alignment. Tree-based ordering further helps in identifying the structure closest to a given sequence.

11.
We describe a multiple alignment program named MAP2 based on a generalized pairwise global alignment algorithm for handling long, different intergenic and intragenic regions in genomic sequences. The MAP2 program produces an ordered list of local multiple alignments of similar regions among sequences, where different regions between local alignments are indicated by reporting only similar regions. We propose two similarity measures for the evaluation of the performance of MAP2 and existing multiple alignment programs. Experimental results produced by MAP2 on four real sets of orthologous genomic sequences show that MAP2 rarely missed a block of transitively similar regions and that MAP2 never produced a block of regions that are not transitively similar. Experimental results by MAP2 on six simulated data sets show that MAP2 found the boundaries between similar and different regions precisely. This feature is useful for finding conserved functional elements in genomic sequences. The MAP2 program is freely available in source code form at http://bioinformatics.iastate.edu/aat/sas.html for academic use.

12.
In this paper we present a branch and bound algorithm for local gapless multiple sequence alignment (motif alignment) and its implementation. The algorithm uses both score-based bounding and a novel bounding technique based on the "consistency" of the alignment. A sequence order independent search tree is used in conjunction with a technique for avoiding redundant calculations inherent in the structure of the tree. This is the first program to exploit the fact that the motif alignment problem is easier for short motifs. Indeed, for a short fixed motif width, the running time of the algorithm is asymptotically linear in the size of the input. We tested the performance of the program on a dataset of 300 E. coli promoter sequences and a dataset of 85 lipocalin protein sequences. For a motif width of 4, the optimal alignment of the entire set of sequences can be found. For the more natural motif width of 6, the program can align 21 sequences of length 100, more than twice the number of sequences which can be aligned by the best previous exact algorithm. The algorithm can relax the constraint of requiring each sequence to be aligned, and align 105 of the 300 promoter sequences with a motif width of 6. For the lipocalin dataset, we introduce a technique for reducing the effective alphabet size with a minimal loss of useful information. With this technique, we show that the program can find meaningful motifs in a reasonable amount of time by optimizing the score over three motif positions.

13.
14.
Membrane proteins play a crucial role in various cellular processes and are essential components of cell membranes. Computational methods have emerged as a powerful tool for studying membrane proteins due to their complex structures and properties that make them difficult to analyze experimentally. Traditional features for protein sequence analysis based on amino acid types, composition, and pair composition have limitations in capturing higher-order sequence patterns. Recently, multiple sequence alignment (MSA) and pre-trained language models (PLMs) have been used to generate features from protein sequences. However, the significant computational resources required for MSA-based feature generation can be a major bottleneck for many applications. Several methods and tools have been developed to accelerate the generation of MSAs and reduce their computational cost, including heuristics and approximate algorithms. Additionally, the use of PLMs such as BERT has shown great potential in generating informative embeddings for protein sequence analysis. In this review, we provide an overview of traditional and more recent methods for generating features from protein sequences, with a particular focus on MSAs and PLMs. We highlight the advantages and limitations of these approaches and discuss the methods and tools developed to address the computational challenges associated with feature generation. Overall, the advancements in computational methods and tools provide a promising avenue for gaining deeper insights into the function and properties of membrane proteins, which can have significant implications in drug discovery and personalized medicine.

15.
MOTIVATION: Multiple alignment of highly divergent sequences is a challenging problem for which available programs tend to show poor performance. Generally, this is due to a scoring function that does not describe biological reality accurately enough or a heuristic that cannot explore solution space efficiently enough. In this respect, we present a new program, Align-m, that uses a non-progressive local approach to guide a global alignment. RESULTS: Two large test sets were used that represent the entire SCOP classification and cover sequence similarities between 0 and 50% identity. Performance was compared with the publicly available algorithms ClustalW, T-Coffee and DiAlign. In general, Align-m has comparable or slightly higher accuracy in terms of correctly aligned residues, especially for distantly related sequences. Importantly, it aligns far fewer residues incorrectly, with average differences of over 15% compared with some of the other algorithms. AVAILABILITY: Align-m and the test sets are available at http://bioinformatics.vub.ac.be

16.

Background

Genes and gene products are frequently annotated with Gene Ontology concepts based on the evidence provided in genomics articles. Manually locating and curating information about a genomic entity from the biomedical literature requires vast amounts of human effort. Hence, there is clearly a need for automated computational tools to annotate the genes and gene products with Gene Ontology concepts by computationally capturing the related knowledge embedded in textual data.

Results

In this article, we present an automated genomic entity annotation system, GEANN, which extracts information about the characteristics of genes and gene products in article abstracts from PubMed, and translates the discovered knowledge into Gene Ontology (GO) concepts, a widely-used standardized vocabulary of genomic traits. GEANN utilizes textual "extraction patterns", and a semantic matching framework to locate phrases matching a pattern and produce Gene Ontology annotations for genes and gene products. In our experiments, GEANN reached a precision of 78% at a recall of 61%. On a select set of Gene Ontology concepts, GEANN either outperforms or is comparable to two other automated annotation studies. Use of WordNet for semantic pattern matching improves the precision and recall by 24% and 15%, respectively, and the improvement due to semantic pattern matching becomes more apparent as the Gene Ontology terms become more general.

Conclusion

GEANN is useful for two distinct purposes: (i) automating the annotation of genomic entities with Gene Ontology concepts, and (ii) providing existing annotations with additional "evidence articles" from the literature. The use of textual extraction patterns that are constructed based on the existing annotations achieves high precision. The semantic pattern matching framework provides a more flexible pattern matching scheme with respect to "exact matching" with the advantage of locating approximate pattern occurrences with similar semantics. Relatively low recall performance of our pattern-based approach may be enhanced either by employing a probabilistic annotation framework based on the annotation neighbourhoods in textual data, or, alternatively, the statistical enrichment threshold may be adjusted to lower values for applications that put more value on achieving higher recall values.

17.
An algorithm is presented for the multiple alignment of protein sequences that is both accurate and rapid computationally. The approach is based on the conventional dynamic-programming method of pairwise alignment. Initially, two sequences are aligned, then the third sequence is aligned against the alignment of both sequences one and two. Similarly, the fourth sequence is aligned against one, two and three. This is repeated until all sequences have been aligned. Iteration is then performed to yield a final alignment. The accuracy of sequence alignment is evaluated from alignment of the secondary structures in a family of proteins. For the globins, the multiple alignment was on average 99% accurate compared to 90% for pairwise comparison of sequences. For the alignment of immunoglobulin constant and variable domains, the use of many sequences yielded an alignment of 63% average accuracy compared to 41% average for individual variable/constant alignments. The multiple alignment algorithm yields an assignment of disulphide connectivity in mammalian serotransferrin that is consistent with crystallographic data, whereas pairwise alignments give an alternative assignment.
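
The build-up step described in this abstract (align two sequences, then align each further sequence against the alignment so far) can be sketched in Python as below. This is a minimal illustration under stated assumptions, not the published algorithm: the function names, the match/mismatch/gap values, and the averaging of a residue's score over a column are all choices made here, and the final iteration step mentioned in the abstract is omitted.

```python
def align_to_alignment(rows, seq, match=1, mismatch=-1, gap=-2):
    """Globally align `seq` against an existing alignment `rows`.

    A residue is scored against a column by averaging its score against
    every character in that column (an illustrative assumption).
    Returns the old rows, padded with gaps where needed, plus the new row.
    """
    n_rows, width, n = len(rows), len(rows[0]), len(seq)
    cols = ["".join(row[j] for row in rows) for j in range(width)]

    def col_score(col, ch):
        s = sum(gap if c == "-" else (match if c == ch else mismatch)
                for c in col)
        return s / len(col)

    # score[i][j]: best score for the first i columns and first j residues.
    score = [[0.0] * (n + 1) for _ in range(width + 1)]
    move = [[""] * (n + 1) for _ in range(width + 1)]
    for i in range(1, width + 1):
        score[i][0], move[i][0] = i * gap, "up"
    for j in range(1, n + 1):
        score[0][j], move[0][j] = j * gap, "left"
    for i in range(1, width + 1):
        for j in range(1, n + 1):
            options = [
                (score[i - 1][j - 1] + col_score(cols[i - 1], seq[j - 1]), "diag"),
                (score[i - 1][j] + gap, "up"),    # gap in the new sequence
                (score[i][j - 1] + gap, "left"),  # gap column in old rows
            ]
            score[i][j], move[i][j] = max(options, key=lambda t: t[0])

    # Trace back from the bottom-right corner using the stored moves.
    out_cols, new_row = [], []
    i, j = width, n
    while i > 0 or j > 0:
        step = move[i][j]
        if step == "diag":
            out_cols.append(cols[i - 1]); new_row.append(seq[j - 1])
            i, j = i - 1, j - 1
        elif step == "up":
            out_cols.append(cols[i - 1]); new_row.append("-")
            i -= 1
        else:
            out_cols.append("-" * n_rows); new_row.append(seq[j - 1])
            j -= 1
    out_cols.reverse(); new_row.reverse()
    old = ["".join(col[r] for col in out_cols) for r in range(n_rows)]
    return old + ["".join(new_row)]


def progressive_align(seqs, **scoring):
    """Add sequences one at a time, in the order given."""
    alignment = [seqs[0]]
    for seq in seqs[1:]:
        alignment = align_to_alignment(alignment, seq, **scoring)
    return alignment


if __name__ == "__main__":
    for row in progressive_align(["ACGT", "AGT", "ACT"]):
        print(row)
```

In this sketch the result depends on the order in which sequences are added, which is one reason a later refinement pass, such as the iteration the abstract mentions, can be useful.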

18.
This paper presents a method for the multiple alignment of a sequence set. The MASH algorithm uses a non-redundant database of common motifs and an ‘alignment priority’ criterion that depends on the length and the occurrence frequency of the patterns in the set of sequences. This user-defined criterion allows the determination of the series of the patterns to be aligned. This program is applied to a fragment of envelope gene env gp120 for 20 isolates of the immunodeficiency virus. The multiplicity of alignments obtained by modifying the criterion parameters reveals different aspects of similarity between the sequences. Received on June 4, 1990; accepted on December 14, 1990

19.
Four algorithms, A–D, were developed to align two groups of biological sequences. Algorithm A is equivalent to the conventional dynamic programming method widely used for aligning ordinary sequences, whereas algorithms B–D are designed to evaluate the cost for a deletion/insertion more accurately when internal gaps are present in either or both groups of sequences. Rigorous optimization of the ‘sum of pairs’ (SP) score is achieved by algorithm D, whose average performance is close to O(MNL^2) where M and N are numbers of sequences included in the two groups and L is the mean length of the sequences. Algorithm B uses some approximations to cope with profile-based operations, whereas algorithm C is a simpler variant of algorithm D. These group-to-group alignment algorithms were applied to multiple sequence alignment with two iterative strategies: a progressive method based on a given binary tree and a randomized grouping-realignment method. The advantages and disadvantages of the four algorithms are discussed on the basis of the results of examinations of several protein families.

20.
A novel algorithm for multiple alignment of biological sequences is suggested. At the first step the DotHelix procedure is employed for construction of motifs, i.e. continuous fragments of local similarity of various “thickness” and strength, and then these motifs are concatenated into chains consistent with the order of letters in the sequences. The algorithm is implemented in the MA-Tools program of the GeneBee package. An example illustrating the effectiveness of the algorithm is presented.
