Similar literature
 20 similar documents found (search time: 831 ms)
1.
This paper describes a multiple alignment method using a workstation and supercomputer. The method is based on the alignment of a set of aligned sequences with the new sequence, and uses a recursive procedure of such alignment. The alignment is executed in a reasonable computation time on diverse levels from a workstation to a supercomputer, from the viewpoint of alignment results and computational speed by parallel processing. The application of the algorithm is illustrated by several examples of multiple alignment of 12 amino acid and DNA sequences of HIV (human immunodeficiency virus) env genes. Colour graphic programs on a workstation and parallel processing on a supercomputer are discussed. Received on April 26, 1988; accepted on July 7, 1988

2.
Fast and sensitive multiple sequence alignments on a microcomputer   Total citations: 99 (self-citations: 0, citations by others: 99)
A strategy is described for the rapid alignment of many long nucleic acid or protein sequences on a microcomputer. The program described can handle up to 100 sequences of 1200 residues each. The approach is based on progressively aligning sequences according to the branching order in an initial phylogenetic tree. The results obtained using the package appear to be as sensitive as those from any other available method. Received on October 7, 1988; accepted on December 6, 1988

3.
A program for template matching of protein sequences   Total citations: 1 (self-citations: 0, citations by others: 1)
The matching of a template to a protein sequence is simplified by treating it as a special case of sequence alignment. Restriction of the distances between motifs in the template controls against spurious matches within very long sequences. The program using this algorithm is fast enough to be used in scanning large databases for sequences matching a complex template. Received on August 17, 1987; accepted on January 11, 1988

4.
Multiple sequence alignment with hierarchical clustering.   Total citations: 155 (self-citations: 8, citations by others: 147)
F Corpet 《Nucleic acids research》1988,16(22):10881-10890
An algorithm is presented for the multiple alignment of sequences, either proteins or nucleic acids, that is both accurate and easy to use on microcomputers. The approach is based on the conventional dynamic-programming method of pairwise alignment. Initially, a hierarchical clustering of the sequences is performed using the matrix of the pairwise alignment scores. The closest sequences are aligned creating groups of aligned sequences. Then close groups are aligned until all sequences are aligned in one group. The pairwise alignments included in the multiple alignment form a new matrix that is used to produce a hierarchical clustering. If it is different from the first one, iteration of the process can be performed. The method is illustrated by an example: a global alignment of 39 sequences of cytochrome c.
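
The clustering step described in this abstract can be sketched as follows. This is a minimal illustration, not the published implementation: the abstract does not say which linkage rule is used, so average linkage is an assumption here, and `guide_order`, the `score` dictionary and the example values are all hypothetical.

```python
from itertools import combinations


def guide_order(names, score):
    """Return the order in which clusters are merged.

    `score[(a, b)]` is a hypothetical pairwise alignment score
    (higher means more similar). Average linkage is an assumption.
    """
    def sim(a, b):
        return score[(a, b)] if (a, b) in score else score[(b, a)]

    clusters = [(n,) for n in names]
    merges = []
    while len(clusters) > 1:
        # Pick the two clusters with the highest mean pairwise score.
        best = max(
            combinations(clusters, 2),
            key=lambda pair: sum(sim(a, b) for a in pair[0] for b in pair[1])
            / (len(pair[0]) * len(pair[1])),
        )
        merges.append(best)
        # Replace the two merged clusters by their union.
        clusters = [c for c in clusters if c not in best] + [best[0] + best[1]]
    return merges


# Made-up scores for four sequences A-D.
scores = {
    ("A", "B"): 90, ("A", "C"): 40, ("A", "D"): 20,
    ("B", "C"): 50, ("B", "D"): 30, ("C", "D"): 80,
}
order = guide_order(["A", "B", "C", "D"], scores)
```

Each entry of `order` names the two groups that would be aligned next; in the method described above, the closest sequences are aligned first and the resulting groups are then aligned with each other.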

5.
Algorithms often align sequences by minimizing a cost. Such algorithms usually operate by aligning successively longer subsequences until they finish the alignment. Efficient algorithms, such as those of Fickett and Ukkonen, speed the computation by ignoring bad subalignments. A general principle underlies the efficiency of these two algorithms: inequalities can direct computations to promising subalignments. Hence inequalities can be used to suggest alignment algorithms. Inequalities for unweighted end-gaps, affine and concave gap weights, etc., are discussed, and empirical results evaluating new algorithms for single indel costs and weighted end-gaps are presented. Empirical results show the new algorithms are, under certain circumstances, much faster than known algorithms. Received on September 23, 1988; accepted on February 2, 1990
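
To make the idea of an inequality steering the computation concrete, here is a minimal sketch of a banded edit-distance computation in the spirit of the Ukkonen-style approach the abstract mentions. It assumes unit costs for substitutions and indels, and it is not one of the new algorithms the paper evaluates; the function names are hypothetical. The inequality used is that any cell with |i - j| > k already costs more than k, so it can be skipped.

```python
def banded_edit_distance(a, b, k):
    """Return the edit distance if it is <= k, otherwise None.

    Only cells with |i - j| <= k are filled; all others are treated as k + 1.
    """
    n, m = len(a), len(b)
    if abs(n - m) > k:
        return None
    inf = k + 1  # any value above k is "too expensive"
    prev = [j if j <= k else inf for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [inf] * (m + 1)
        lo, hi = max(0, i - k), min(m, i + k)
        if lo == 0:
            cur[0] = i
        for j in range(max(1, lo), hi + 1):
            cur[j] = min(
                prev[j - 1] + (a[i - 1] != b[j - 1]),  # match or substitution
                prev[j] + 1,                            # delete a[i - 1]
                cur[j - 1] + 1,                         # insert b[j - 1]
                inf,                                    # cap at k + 1
            )
        prev = cur
    return prev[m] if prev[m] <= k else None


def edit_distance(a, b):
    """Double the band width until the distance fits inside the band."""
    k = 1
    while True:
        d = banded_edit_distance(a, b, k)
        if d is not None:
            return d
        k *= 2
```

For similar sequences the band stays narrow, so far fewer cells are filled than in the full dynamic-programming table.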

6.
PCMA (profile consistency multiple sequence alignment) is a progressive multiple sequence alignment program that combines two different alignment strategies. Highly similar sequences are aligned in a fast way as in ClustalW, forming pre-aligned groups. The T-Coffee strategy is applied to align the relatively divergent groups based on profile-profile comparison and consistency. The scoring function for local alignments of pre-aligned groups is based on a novel profile-profile comparison method that is a generalization of the PSI-BLAST approach to profile-sequence comparison. PCMA balances speed and accuracy in a flexible way and is suitable for aligning large numbers of sequences. AVAILABILITY: PCMA is freely available for non-commercial use. Pre-compiled versions for several platforms can be downloaded from ftp://iole.swmed.edu/pub/PCMA/.

7.
MELDB: a database for microbial esterases and lipases   Total citations: 1 (self-citations: 0, citations by others: 1)
Kang HY  Kim JF  Kim MH  Park SH  Oh TK  Hur CG 《FEBS letters》2006,580(11):2736-2740
MELDB is a comprehensive protein database of microbial esterases and lipases, which are hydrolytic enzymes important in modern industry. Proteins in MELDB are clustered into groups according to their sequence similarities based on a local pairwise alignment algorithm and a graph clustering algorithm (TribeMCL). This differs from traditional approaches that use global pairwise alignment and joining methods. Our procedure was able to reduce the noise caused by dubious alignment in the distantly related or unrelated regions in the sequences. In the database, 883 esterase and lipase sequences derived from microbial sources are deposited and conserved parts of each protein are identified. HMM profiles of each cluster were generated to classify unknown sequences. Contents of the database can be keyword-searched and query sequences can be aligned to sequence profiles and sequences themselves.

8.
Page RD 《Nucleic acids research》2000,28(20):3839-3845
Comparative analysis is the preferred method of inferring RNA secondary structure, but its use requires considerable expertise and manual effort. As the importance of secondary structure for accurate sequence alignment and phylogenetic analysis becomes increasingly realised, the need for secondary structure models for diverse taxonomic groups becomes more pressing. The number of available structures bears little relation to the relative diversity or importance of the different taxonomic groups. Insects, for example, comprise the largest group of animals and yet are very poorly represented in secondary structure databases. This paper explores the utility of maximum weighted matching (MWM) to help automate the process of comparative analysis by inferring secondary structure for insect mitochondrial small subunit (12S) rRNA sequences. By combining information on correlated changes in substitutions and helix dot plots, MWM can rapidly generate plausible models of secondary structure. These models can be further refined using standard comparative techniques. This paper presents a secondary structure model for insect 12S rRNA based on an alignment of 225 insect sequences and an alignment for 16 exemplar insect sequences. This alignment is used as a template for a web server that automatically generates secondary structures for insect sequences.

9.
MOTIVATION: Biclustering is a clustering method that simultaneously clusters both the domain and range of a relation. A challenge in multiple sequence alignment (MSA) is that the alignment of sequences is often intended to reveal groups of conserved functional subsequences. Simultaneously, the grouping of the sequences can impact the alignment; precisely the kind of dual situation biclustering is intended to address. RESULTS: We define a representation of the MSA problem enabling the application of biclustering algorithms. We develop a computer program for local MSA, BlockMSA, that combines biclustering with divide-and-conquer. BlockMSA simultaneously finds groups of similar sequences and locally aligns subsequences within them. Further alignment is accomplished by dividing both the set of sequences and their contents. The net result is both a multiple sequence alignment and a hierarchical clustering of the sequences. BlockMSA was tested on the subsets of the BRAliBase 2.1 benchmark suite that display high variability and on an extension to that suite to larger problem sizes. Also, alignments were evaluated of two large datasets of current biological interest, T box sequences and Group IC1 Introns. The results were compared with alignments computed by the ClustalW, MAFFT, MUSCLE and PROBCONS alignment programs using Sum of Pairs (SPS) and Consensus Count. Results for the benchmark suite are sensitive to problem size. On problems of 15 or more sequences, BlockMSA is consistently the best. On none of the problems in the test suite are there appreciable differences in scores among BlockMSA, MAFFT and PROBCONS. On the T box sequences, BlockMSA does the most faithful job of reproducing known annotations. MAFFT and PROBCONS do not. On the Intron sequences, BlockMSA, MAFFT and MUSCLE are comparable at identifying conserved regions. AVAILABILITY: BlockMSA is implemented in Java. Source code and supplementary datasets are available at http://aug.csres.utexas.edu/msa/

10.
Poxvirus Orthologous Clusters (POCs) is a JAVA client-server application which accesses an updated database containing all complete poxvirus genomes; it automatically groups orthologous genes into families based on BLASTP scores for assessment by a human database curator. POCs has a user-friendly interface permitting complex SQL queries to retrieve interesting groups of DNA and protein sequences as well as gene families for subsequent interrogation by a variety of integrated tools: BLASTP, BLASTX, TBLASTN, Jalview (multiple alignment), Dotlet (Dotplot), Laj (local alignment), and NAP (nucleotide to amino acid alignment).

11.
Sequence alignment is a common method for finding protein structurally conserved/similar regions. However, sequence alignment is often not accurate if sequence identities between to-be-aligned sequences are less than 30%. This is because, for these sequences, different residues may play similar structural roles and they are incorrectly aligned during the sequence alignment using a substitution matrix consisting of 20 types of residues. Based on the similarity of physicochemical features, residues can be clustered into a few groups. Using such simplified alphabets, the complexity of protein sequences is reduced and at the same time the key information encoded in the sequences remains. As a result, the accuracy of sequence alignment might be improved if the residues are properly clustered. Here, by using a database of aligned protein structures (DAPS), a new clustering method based on the substitution scores is proposed for the grouping of residues, and substitution matrices of residues at different levels of simplification are constructed. The validity of the reduced alphabets is confirmed by relative entropy analysis. The reduced alphabets are applied to recognition of protein structurally conserved/similar regions by sequence alignment. The results indicate that the accuracy or efficiency of sequence alignment can be improved with the optimal reduced alphabet with N around 9.
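
The effect of a reduced alphabet can be illustrated with a short sketch. The nine groups below are a hypothetical grouping by rough physicochemical class, chosen only for illustration; they are not the clusters derived from DAPS in the paper, and `reduce_seq` and `identity` are made-up helper names.

```python
# Hypothetical 9-letter alphabet; each group is represented by its first residue.
GROUPS = ["AVLIMC", "FWY", "ST", "NQ", "DE", "KR", "H", "G", "P"]
REDUCE = {res: group[0] for group in GROUPS for res in group}


def reduce_seq(seq):
    """Rewrite a protein sequence in the reduced alphabet."""
    return "".join(REDUCE[r] for r in seq)


def identity(a, b):
    """Fraction of identical positions in two equal-length, gap-free strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Two made-up fragments with no identical residues.
s1, s2 = "LKDFS", "IREYT"
raw = identity(s1, s2)
reduced = identity(reduce_seq(s1), reduce_seq(s2))
```

In this toy case `raw` is 0.0 while `reduced` is 1.0: residues that differ in the 20-letter alphabet fall into the same group, which is the kind of signal a reduced alphabet is meant to expose at low sequence identity.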

12.
The sequences of the ubiquitous and phylogenetically diversified cyclophilin family of proteins were divided into six groups, namely, vertebrates, invertebrates, other metazoa, plants, fungi, and prokaryotes. These groups of sequences were aligned with the multiple sequence alignment program Clustal-W. The variations of amino acid substitutions and amino acid compositions for these six groups of cyclophilins were calculated using a novel suite of multiple-sequence alignment analysis routines. The cyclophilins from vertebrates can be divided into at least two distinct structural classes that differ from each other by a variable-length amino acid insert within the loop that links alpha-helix II and beta-strand III. A similar structural feature is also present in the other groups of cyclophilins, namely, those from invertebrates, other metazoa, plants, and fungi. The sequences of cyclophilins from fungi and prokaryotes are more diversified than those from vertebrates, and their alterations involve structures other than the amino acid inserts within the loops. Variations of the hydrophobicity and bulkiness of amino acid substitutions of the aligned sequences were calculated for each group of cyclophilins and for the alignment of all the sequences. The variations have clear asymmetry that may signify the need for modification of the physical properties of certain fragments of cyclophilins that are involved in interactions with various cellular components in the evolving environment.

13.
An algorithm is presented for the multiple alignment of protein sequences that is both accurate and rapid computationally. The approach is based on the conventional dynamic-programming method of pairwise alignment. Initially, two sequences are aligned, then the third sequence is aligned against the alignment of both sequences one and two. Similarly, the fourth sequence is aligned against one, two and three. This is repeated until all sequences have been aligned. Iteration is then performed to yield a final alignment. The accuracy of sequence alignment is evaluated from alignment of the secondary structures in a family of proteins. For the globins, the multiple alignment was on average 99% accurate compared to 90% for pairwise comparison of sequences. For the alignment of immunoglobulin constant and variable domains, the use of many sequences yielded an alignment of 63% average accuracy compared to 41% average for individual variable/constant alignments. The multiple alignment algorithm yields an assignment of disulphide connectivity in mammalian serotransferrin that is consistent with crystallographic data, whereas pairwise alignments give an alternative assignment.

14.
Sequence alignment is a common method for finding protein structurally conserved/similar regions. However, sequence alignment is often not accurate if sequence identities between to-be-aligned sequences are less than 30%. This is because, for these sequences, different residues may play similar structural roles and they are incorrectly aligned during the sequence alignment using a substitution matrix consisting of 20 types of residues. Based on the similarity of physicochemical features, residues can be clustered into a few groups. Using such simplified alphabets, the complexity of protein sequences is reduced and at the same time the key information encoded in the sequences remains. As a result, the accuracy of sequence alignment might be improved if the residues are properly clustered. Here, by using a database of aligned protein structures (DAPS), a new clustering method based on the substitution scores is proposed for the grouping of residues, and substitution matrices of residues at different levels of simplification are constructed. The validity of the reduced alphabets is confirmed by relative entropy analysis. The reduced alphabets are applied to recognition of protein structurally conserved/similar regions by sequence alignment. The results indicate that the accuracy or efficiency of sequence alignment can be improved with the optimal reduced alphabet with N around 9. Supported by the National Natural Science Foundation of China (Grant Nos. 90403120, 10474041 and 10021001) and the Nonlinear Project (973) of the NSM

15.
The chaperonin HSP60 (GroEL) proteins are essential in eubacterial genomes and in eukaryotic organelles. Functional regions inferred from mutation studies and the Escherichia coli GroEL 3D crystal complexes are evaluated in a multiple alignment across 43 diverse HSP60 sequences, centering on ATP/ADP and Mg2+ binding sites, on residues interacting with substrate, on GroES contact positions, on interface regions between monomers and domains, and on residues important in allosteric conformational changes. The most evolutionarily conserved residues relate to the ATP/ADP and Mg2+ binding sites. Hydrophobic residues that contribute to substrate binding are also significantly conserved. A large number of charged residues line the central cavity of the GroEL-GroES complex in the substrate-releasing conformation. These span statistically significant intra- and inter-monomer three-dimensional (3D) charge clusters that are highly conserved among sequences and presumably play an important role interacting with the substrate. Unaligned short segments between blocks of alignment are generally exposed at the outside wall of the Anfinsen cage complex. The multiple alignment reveals regions of divergence common to specific evolutionary groups. For example, rickettsial sequences diverge in the ATP/ADP binding domain and gram-positive sequences diverge in the allosteric transition domain. The evolutionary information of the multiple alignment proffers attractive sites for mutational studies.

16.
A software package that allows one to carry out multiple alignment of protein and nucleic acid sequences of almost unlimited length and number of sequences is developed on the C-DAC parallel computer, a transputer-based machine. The farming approach is used for data parallelization. The speed gains are almost linear when the number of transputers is increased from 4 to 64. The software is used to carry out multiple alignment of 100 sequences each of the α-chain and β-chain of hemoglobin and 83 cytochrome c sequences. The signature sequence of cytochrome c was found to be PGTKMXF. The single parameter, multiple alignment score, S, has been used to categorize proteins in different subfamilies and groups.

17.
Given the absence of universal marker genes in the viral kingdom, researchers typically use BLAST (with stringent E-values) for taxonomic classification of viral metagenomic sequences. Since the majority of metagenomic sequences originate from hitherto unknown viral groups, using stringent E-values results in most sequences remaining unclassified. Furthermore, using less stringent E-values results in a high number of incorrect taxonomic assignments. The SOrt-ITEMS algorithm provides an approach to address the above issues. Based on alignment parameters, SOrt-ITEMS follows an elaborate work-flow for assigning reads originating from hitherto unknown archaeal/bacterial genomes. In SOrt-ITEMS, alignment parameter thresholds were generated by observing patterns of sequence divergence within and across various taxonomic groups belonging to bacterial and archaeal kingdoms. However, many taxonomic groups within the viral kingdom lack a typical Linnean-like taxonomic hierarchy. In this paper, we present ProViDE (Program for Viral Diversity Estimation), an algorithm that uses a customized set of alignment parameter thresholds, specifically suited for viral metagenomic sequences. These thresholds capture the pattern of sequence divergence and the non-uniform taxonomic hierarchy observed within/across various taxonomic groups of the viral kingdom. Validation results indicate that the percentage of 'correct' assignments by ProViDE is around 1.7 to 3 times higher than that by the widely used similarity based method MEGAN. The misclassification rate of ProViDE is around 3 to 19% (as compared to 5 to 42% by MEGAN) indicating significantly better assignment accuracy. ProViDE software and a supplementary file (containing supplementary figures and tables referred to in this article) are available for download from http://metagenomics.atc.tcs.com/binning/ProViDE/

18.
Four algorithms, A–D, were developed to align two groups of biological sequences. Algorithm A is equivalent to the conventional dynamic programming method widely used for aligning ordinary sequences, whereas algorithms B–D are designed to evaluate the cost for a deletion/insertion more accurately when internal gaps are present in either or both groups of sequences. Rigorous optimization of the 'sum of pairs' (SP) score is achieved by algorithm D, whose average performance is close to O(MNL^2), where M and N are the numbers of sequences included in the two groups and L is the mean length of the sequences. Algorithm B uses some approximations to cope with profile-based operations, whereas algorithm C is a simpler variant of algorithm D. These group-to-group alignment algorithms were applied to multiple sequence alignment with two iterative strategies: a progressive method based on a given binary tree and a randomized grouping-realignment method. The advantages and disadvantages of the four algorithms are discussed on the basis of the results of examinations of several protein families.
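
For readers unfamiliar with the 'sum of pairs' objective, the following sketch scores a finished alignment column by column. It assumes simple match/mismatch values and a fixed per-position gap penalty, all hypothetical; the paper's algorithms treat insertion/deletion costs more carefully than this, so the code only illustrates what is being summed.

```python
from itertools import combinations


def sp_score(alignment, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of an alignment given as equal-length gapped strings.

    The scoring values are illustrative defaults, not taken from the paper.
    """
    total = 0
    for column in zip(*alignment):
        for x, y in combinations(column, 2):
            if x == "-" and y == "-":
                continue  # a gap paired with a gap contributes nothing
            if x == "-" or y == "-":
                total += gap
            else:
                total += match if x == y else mismatch
    return total


aligned = ["AC-T", "ACGT", "A-GT"]
score = sp_score(aligned)
```

Every pair of rows is scored in every column, which is why aligning a group of M sequences against a group of N sequences involves M x N pairwise comparisons per column.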

19.
The aim of the work is to develop a common method for estimating the pairwise alignment quality versus the evolutionary distance (degree of homology) between the sequences being compared and versus the type of alignment procedure. 3D alignments or any data on 3D protein structure are not used in the study. Based on the accepted model of protein sequence evolution, it is possible to estimate the capability of a particular alignment algorithm to recover the genuine alignment. In this study a classical Needleman and Wunsch global alignment algorithm has been tested on a set of sequences from the Prefab database. Accuracy and confidence of a global alignment procedure were calculated as dependent on the shares of insertions/deletions and mutations.
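
As a reminder of what the tested procedure computes, here is a minimal sketch of Needleman-Wunsch global alignment. The match, mismatch and linear gap values are assumptions for illustration; the study's actual scoring scheme is not given in the abstract.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment with a linear gap penalty.

    Returns (score, aligned_a, aligned_b). Scoring values are illustrative.
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


result = needleman_wunsch("ACGT", "AGT")
```

Comparing the recovered alignment against a known "genuine" alignment, as the abstract describes, would then amount to counting how many aligned pairs agree.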

20.
An Eulerian path approach to global multiple alignment for DNA sequences.   Total citations: 3 (self-citations: 0, citations by others: 3)
With the rapid increase in the dataset of genome sequences, the multiple sequence alignment problem is increasingly important and frequently involves the alignment of a large number of sequences. Many heuristic algorithms have been proposed to improve the speed of computation and the quality of alignment. We introduce a novel approach that is fundamentally different from all currently available methods. Our motivation comes from the Eulerian method for fragment assembly in DNA sequencing that transforms all DNA fragments into a de Bruijn graph and then reduces sequence assembly to a Eulerian path problem. The paper focuses on global multiple alignment of DNA sequences, where entire sequences are aligned into one configuration. Our main result is an algorithm with almost linear computational speed with respect to the total size (number of letters) of sequences to be aligned. Five hundred simulated sequences (averaging 500 bases per sequence and as low as 70% pairwise identity) have been aligned within three minutes on a personal computer, and the quality of alignment is satisfactory. As a result, accurate and simultaneous alignment of thousands of long sequences within a reasonable amount of time becomes possible. Data from an Arabidopsis sequencing project is used to demonstrate the performance.
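
The de Bruijn graph construction that motivates this approach can be sketched in a few lines. This covers only the graph-building step under stated assumptions: the k-mer length `k` and the function name are hypothetical, and the paper's subsequent steps for turning the graph into an alignment are not reproduced.

```python
from collections import Counter


def de_bruijn_edges(sequences, k):
    """Count edges of the de Bruijn graph built from all k-mers.

    Nodes are (k-1)-mers; each k-mer adds one edge from its prefix to its
    suffix. Edge counts show how many times a k-mer occurs across the input.
    """
    edges = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            edges[(kmer[:-1], kmer[1:])] += 1
    return edges


edges = de_bruijn_edges(["ACGT", "ACGA"], k=3)
```

Edges with a high count correspond to k-mers shared by many sequences, which gives an intuition for why conserved regions stand out in the graph and why the work grows roughly with the total number of letters.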
