Similar articles
 20 similar articles found (search time: 15 ms)
1.

Background

Phylogenetic and population genetic studies often deal with multiple sequence alignments that require manipulation or processing steps such as sequence concatenation, sequence renaming, sequence translation or consensus sequence generation. In recent years phylogenetic data sets have expanded from single genes to genome wide markers comprising hundreds to thousands of loci. Processing of these large phylogenomic data sets is impracticable without using automated process pipelines. Currently no stand-alone or pipeline compatible program exists that offers a broad range of manipulation and processing steps for multiple sequence alignments in a single process run.

Results

Here we present FASconCAT-G, a system-independent editor that offers various processing options for multiple sequence alignments. The software provides a wide range of possibilities to edit and concatenate multiple nucleotide, amino acid, and structure sequence alignment files for phylogenetic and population genetic purposes. The main options include sequence renaming, file format conversion, sequence translation between nucleotide and amino acid states, consensus generation of specific sequence blocks, sequence concatenation, model selection of amino acid replacement with ProtTest, two types of RY coding, as well as site exclusions and extraction of parsimony-informative sites. Conveniently, most options can be invoked in combination and performed during a single process run. Additionally, FASconCAT-G prints useful information regarding alignment characteristics and editing processes, such as base compositions of single in- and outfiles, sequence areas in a concatenated supermatrix, and paired stem and loop regions in secondary structure sequence strings.

Conclusions

FASconCAT-G is a command-line driven Perl program that delivers computationally fast and user-friendly processing of multiple sequence alignments for phylogenetic and population genetic applications and is well suited for incorporation into analysis pipelines.
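
Two of the processing steps named in this abstract, concatenation into a supermatrix and RY coding, can be illustrated with a minimal Python sketch. It is not FASconCAT-G's Perl code or its command-line interface; the function names `concatenate` and `ry_code`, the dictionary input format, and the choice to pad missing taxa with gaps are assumptions made for illustration only.

```python
from typing import Dict, List, Tuple


def concatenate(alignments: List[Dict[str, str]],
                gap: str = "-") -> Tuple[Dict[str, str], List[Tuple[int, int]]]:
    """Concatenate alignments (taxon -> aligned sequence) into a supermatrix.

    Taxa missing from an alignment are padded with gap characters (an
    assumption of this sketch). Also returns the 1-based, inclusive column
    range that each input alignment occupies in the supermatrix.
    """
    taxa = sorted({name for aln in alignments for name in aln})
    supermatrix = {name: "" for name in taxa}
    ranges = []
    start = 1
    for aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("all sequences in one alignment must have equal length")
        length = lengths.pop()
        for name in taxa:
            supermatrix[name] += aln.get(name, gap * length)
        ranges.append((start, start + length - 1))
        start += length
    return supermatrix, ranges


def ry_code(seq: str) -> str:
    """Recode purines (A, G) as R and pyrimidines (C, T, U) as Y.

    Other characters, such as gaps, are left unchanged.
    """
    return seq.translate(str.maketrans("AGCTUagctu", "RRYYYRRYYY"))


if __name__ == "__main__":
    gene1 = {"taxonA": "ACGT", "taxonB": "ACGA"}
    gene2 = {"taxonA": "TT", "taxonC": "GG"}
    matrix, ranges = concatenate([gene1, gene2])
    print(matrix, ranges)
    print(ry_code("ACGT-"))
```

The returned column ranges correspond to the "sequence areas in a concatenated supermatrix" that the abstract says the program reports.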

2.
Sequence alignments may be the most fundamental computational resource for molecular biology. The best methods that identify sequence relatedness through profile-profile comparisons are much slower and more complex than sequence-sequence and sequence-profile comparisons such as, respectively, BLAST and PSI-BLAST. Families of related genes and gene products (proteins) can be represented by consensus sequences that list the nucleic/amino acid most frequent at each sequence position in that family. Here, we propose a novel approach for consensus-sequence-based comparisons. This approach improved searches and alignments as a standard add-on to PSI-BLAST without any change to its code. Improvements were particularly significant for more difficult tasks such as the identification of distant structural relations between proteins and their corresponding alignments. Although the improvements were higher for more divergent relations, they were consistent even at high accuracy/low error rates for non-trivially related proteins. The improvements were very easy to achieve; no parameter used by PSI-BLAST was altered and no single line of code changed. Furthermore, the consensus sequence add-on required relatively little additional CPU time. We discuss how advanced users of PSI-BLAST can immediately benefit from using consensus sequences on their local computers. We have also made the method available through the Internet (http://www.rostlab.org/services/consensus/).
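
The definition of a consensus sequence given in this abstract (the most frequent residue at each position of a family alignment) can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' method or the PSI-BLAST add-on; the function name `majority_consensus`, the rule of ignoring gap characters, and the tie-breaking behaviour (the residue seen first in the column wins) are assumptions.

```python
from collections import Counter
from typing import Sequence


def majority_consensus(aligned: Sequence[str], gap: str = "-") -> str:
    """Return the most frequent non-gap residue of each alignment column.

    Columns that contain only gaps yield a gap. Ties are resolved in favour
    of the residue encountered first in the column.
    """
    if not aligned:
        return ""
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("aligned sequences must have equal length")
    consensus = []
    for column in zip(*aligned):
        counts = Counter(residue for residue in column if residue != gap)
        consensus.append(counts.most_common(1)[0][0] if counts else gap)
    return "".join(consensus)


if __name__ == "__main__":
    family = ["ACDE", "ACDF", "GCDE"]
    print(majority_consensus(family))
```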

3.
MOTIVATION: Consensus sequence generation is important in many kinds of sequence analysis ranging from sequence assembly to profile-based iterative search methods. However, how can a consensus be constructed when its inherent assumption (that the aligned sequences form a single linear consensus) is not true? RESULTS: Partial Order Alignment (POA) enables construction and analysis of multiple sequence alignments as directed acyclic graphs containing complex branching structure. Here we present a dynamic programming algorithm (heaviest_bundle) for generating multiple consensus sequences from such complex alignments. The number and relationships of these consensus sequences reveal the degree of structural complexity of the source alignment. This is a powerful and general approach for analyzing and visualizing complex alignment structures, and can be applied to any alignment. We illustrate its value for analyzing expressed sequence alignments to detect alternative splicing, reconstruct full-length mRNA isoform sequences from EST fragments, and separate paralog mixtures that can cause incorrect SNP predictions. AVAILABILITY: The heaviest_bundle source code is available at http://www.bioinformatics.ucla.edu/poa
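
The dynamic programming idea behind extracting a consensus from an alignment graph can be sketched as a heaviest-path search in a weighted directed acyclic graph. The Python below is a simplified illustration, not the published heaviest_bundle algorithm, which additionally groups sequences into bundles and iterates to produce multiple consensus sequences. The graph representation (nodes listed in topological order, edge weights standing for the number of sequences that traverse an edge) and the function name `heaviest_path` are assumptions.

```python
from typing import Dict, List, Tuple


def heaviest_path(nodes: List[str], edges: Dict[Tuple[int, int], int]) -> str:
    """Return the residues along the maximum-weight path of a DAG.

    `nodes` holds one residue per node, listed in topological order.
    `edges` maps (source index, target index) to a positive weight, with
    source < target so that the node order is a valid topological order.
    """
    if not nodes:
        return ""
    best = [0] * len(nodes)    # best[v]: weight of the heaviest path ending at v
    prev = [-1] * len(nodes)   # prev[v]: predecessor of v on that path
    # Visiting edges by increasing target index guarantees best[u] is final
    # before any edge leaving u is relaxed.
    for (u, v), weight in sorted(edges.items(), key=lambda item: item[0][1]):
        if u >= v:
            raise ValueError("edges must point forward in the node order")
        if best[u] + weight > best[v]:
            best[v] = best[u] + weight
            prev[v] = u
    end = max(range(len(nodes)), key=best.__getitem__)
    path = []
    while end != -1:
        path.append(nodes[end])
        end = prev[end]
    return "".join(reversed(path))


if __name__ == "__main__":
    # A small bubble: three sequences go A -> C -> T, one goes A -> G -> T.
    graph_nodes = ["A", "C", "G", "T", "A"]
    graph_edges = {(0, 1): 3, (0, 2): 1, (1, 3): 3, (2, 3): 1, (3, 4): 4}
    print(heaviest_path(graph_nodes, graph_edges))
```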

4.
Glutamate decarboxylase: computer studies of enzyme evolution   (total citations: 2; self-citations: 0; citations by others: 2)
The homology of subunit primary sequence of 40 glutamate decarboxylases (GAD) of different origin was analyzed by multiple alignment. A phylogenetic tree was designed on the basis of the resulting data. The following groups are distinguished in the consensus tree: archaea, bacteria, plant eukaryotes, and animal eukaryotes. The latter are clearly divided into two branches according to two enzyme isoforms. Borders of PLP domains in each enzyme were detected. The consensus phylogenetic tree for PLP domains is structurally rather similar to that obtained for subunits. Twenty homologous motifs of 15 to 87 amino acid residues were revealed in all GAD studied. The results revealed the division of all of the enzymes into groups with characteristic sets of motifs in each and a fixed order of their arrangement along the sequence. Thus, we can show the divergent evolution of the enzyme. The results of multiple alignments during structural analysis of the 40 GAD confirmed and extended our previous data on conserved residues that arrange the position of the coenzyme (PLP) in the enzyme active center. The following residues should be noted: lysine forming a Schiff base with the PLP aldehyde group, an adjacent histidine, and aspartic acid that establishes a link with nitrogen of the PLP pyridine ring. The homology of the primary sequence fragments was also found in the residues in contact with the PLP phosphate group. Comparison of the GAD amino acid sequence with that of another PLP enzyme, aspartate aminotransferase, revealed a binding site for the carboxyl group of the substrate, glutamic acid. The structures carrying out a particular catalytic function of all GAD studied were detected, i.e., convergent evolution of the enzyme was revealed.

5.
MOTIVATION: In recent years, advances have been made in the ability of computational methods to discriminate between homologous and non-homologous proteins in the 'twilight zone' of sequence similarity, where the percent sequence identity is a poor indicator of homology. To make these predictions more valuable to the protein modeler, they must be accompanied by accurate alignments. Pairwise sequence alignments are inferences of orthologous relationships between sequence positions. Evolutionary distance is traditionally modeled using global amino acid substitution matrices. But real differences in the likelihood of substitutions may exist for different structural contexts within proteins, since structural context contributes to the selective pressure. RESULTS: HMMSUM (HMMSTR-based substitution matrices) is a new model for structural context-based amino acid substitution probabilities consisting of a set of 281 matrices, each for a different sequence-structure context. HMMSUM does not require the structure of the protein to be known. Instead, predictions of local structure are made using HMMSTR, a hidden Markov model for local structure. Alignments using the HMMSUM matrices compare favorably to alignments carried out using the BLOSUM matrices or structure-based substitution matrices SDM and HSDM when validated against remote homolog alignments from BAliBASE. HMMSUM has been implemented using local Dynamic Programming and with the Bayesian Adaptive alignment method.

6.
Fructose-1,6-bisphosphate aldolase was purified from human skeletal muscle by affinity elution chromatography. Four CNBr-cleavage fragments were purified by gel filtration, and their N-terminal amino acid sequences were determined. Cleavage with o-iodosobenzoic acid at the three tryptophan residues also yielded fragments suitable for N-terminal sequence analysis. Thus, the sequence of 272 of the 363 residues was established. These sequence results allow many of the discrepancies between the two published rabbit skeletal-muscle aldolase sequences to be resolved. The human aldolase sequence reported here is 96% identical to a "consensus" rabbit aldolase sequence. A comparison with a partial sequence of Drosophila aldolase (103 residues) shows 80% identity. The determination of the amino acid sequence of human aldolase is important for the interpretation of the crystal structure of this enzyme.

7.
MOTIVATION: Multiple sequence alignments of homologous proteins are useful for inferring their phylogenetic history and for revealing functionally important regions in the proteins. Functional constraints may lead to co-variation of two or more amino acids in the sequence, such that a substitution at one site is accompanied by compensatory substitutions at another site. It is not sufficient to find the statistical correlations between sites in the alignment because these may be the result of several undetermined causes. In particular, phylogenetic clustering will lead to many strong correlations. RESULTS: A procedure is developed to detect statistical correlations stemming from functional interaction by removing the strong phylogenetic signal that leads to the correlations of each site with many others in the sequence. Our method relies upon the accuracy of the alignment but it does not require any assumptions about the phylogeny or the substitution process. The effectiveness of the method was verified using computer simulations and then applied to predict functional interactions between amino acids in the Pfam database of alignments.

8.
DNA fragments were amplified by PCR from all tested strains of Aeromonas hydrophila, A. caviae, and A. sobria with primers designed based on sequence alignment of all lipase, phospholipase C, and phospholipase A1 genes and the cytotonic enterotoxin gene, all of which have been reported to have the consensus region of the putative lipase substrate-binding domain. All strains showed lipase activity, and all amplified DNA fragments contained a nucleotide sequence corresponding to the substrate-binding domain. Thirty-five distinct nucleotide sequence patterns and 15 distinct deduced amino acid sequence patterns were found in the amplified DNA fragments from 59 A. hydrophila strains. The deduced amino acid sequences of the amplified DNA fragments from A. caviae and A. sobria strains had distinctive amino acids, suggesting a species-specific sequence in each organism. Furthermore, the amino acid sequence patterns appear to differ between clinical and environmental isolates among A. hydrophila strains. Some strains whose nucleotide sequences were identical to one another in the amplified region showed an identical DNA fingerprinting pattern by repetitive extragenic palindromic sequence-PCR genotyping. These results suggest that A. hydrophila, and also A. caviae and A. sobria strains, have a gene encoding a protein with lipase activity. Homologs of the gene appear to be widely distributed in Aeromonas strains, and are probably associated with the evolutionary genetic difference between clinical and environmental isolates of A. hydrophila. Additionally, the distinctive nucleotide sequences of the genes could be attributed to the genotype of each strain, suggesting that their analysis may be helpful in elucidating the genetic heterogeneity of Aeromonas.

9.
Rigorous assessments of protein structure prediction have demonstrated that fold recognition methods can identify remote similarities between proteins when standard sequence search methods fail. It has been shown that the accuracy of predictions is improved when refined multiple sequence alignments are used instead of single sequences and if different methods are combined to generate a consensus model. There are several meta-servers available that integrate protein structure predictions performed by various methods, but they do not allow for submission of user-defined multiple sequence alignments and they seldom offer confidentiality of the results. We developed a novel WWW gateway for protein structure prediction, which combines the useful features of other meta-servers available, but with much greater flexibility of the input. The user may submit an amino acid sequence or a multiple sequence alignment to a set of methods for primary, secondary and tertiary structure prediction. Fold-recognition results (target-template alignments) are converted into full-atom 3D models and the quality of these models is uniformly assessed. A consensus between different FR methods is also inferred. The results are conveniently presented on-line on a single web page over a secure, password-protected connection. The GeneSilico protein structure prediction meta-server is freely available for academic users at http://genesilico.pl/meta.

10.
Complementary DNA sequence of lamprey fibrinogen beta chain   (total citations: 6; self-citations: 0; citations by others: 6)
The cDNA sequence of the beta chain of lamprey fibrinogen has been determined. To that end, an oligonucleotide probe was synthesized that corresponded to an amino acid sequence from the carboxy-terminal region of the lamprey fibrinogen beta chain. The insert actually began with residue 3 of the fibrin beta chain; it ran through to a terminator codon following the carboxy-terminal residue at position 443 and then continued for an additional 606 nucleotides of noncoding sequence to its 3' end. The inferred amino acid sequence was verified by comparison with assorted cyanogen bromide fragments isolated from the beta-chain protein, including two carbohydrate-containing peptides that corresponded to segments containing the carbohydrate-attachment consensus sequence. Overall, the lamprey chain is 49% identical with the beta chain from human fibrinogen. This is the same degree of resemblance as was found for the lamprey and human gamma chains. Moreover, the principal regions of conservation are the same in both the beta and gamma chains. Differences and similarities in the physiological behavior of the two fibrinogens are assessed in terms of the observed amino acid replacements.

12.
An automated algorithm is presented that delineates protein sequence fragments which display similarity. The method incorporates a selection of a number of local nonoverlapping sequence alignments with the highest similarity scores and a graph-theoretical approach to elucidate the consistent start and end points of the fragments comprising one or more ensembles of related subsequences. The procedure allows the simultaneous identification of different types of repeats within one sequence. A multiple alignment of the resulting fragments is performed and a consensus sequence derived from the ensemble(s). Finally, a profile is constructed from the multiple alignment to detect possible and more distant members within the sequence. The method tolerates mutations in the repeats as well as insertions and deletions. The sequence spans between the various repeats or repeat clusters may be of different lengths. The technique has been applied to a number of proteins where the repeating fragments have been derived from information additional to the protein sequences. © 1993 Wiley-Liss, Inc.

13.
Aligning amino acid sequences: comparison of commonly used methods   (total citations: 5; self-citations: 0; citations by others: 5)
We examined two extensive families of protein sequences using four different alignment schemes that employ various degrees of "weighting" in order to determine which approach is most sensitive in establishing relationships. All alignments used a similarity approach based on a general algorithm devised by Needleman and Wunsch. The approaches included a simple program, UM (unitary matrix), whereby only identities are scored; a scheme in which the genetic code is used as a basis for weighting (GC); another that employs a matrix based on structural similarity of amino acids taken together with the genetic basis of mutation (SG); and a fourth that uses the empirical log-odds matrix (LOM) developed by Dayhoff on the basis of observed amino acid replacements. The two sequence families examined were (a) nine different globins and (b) nine different tyrosine kinase-like proteins. It was assumed a priori that all members of a family share common ancestry. In cases where two sequences were more than 30% identical, alignments by all four methods were almost always the same. In cases where the percentage identity was less than 20%, however, there were often significant differences in the alignments. On the average, the Dayhoff LOM approach was the most effective in verifying distant relationships, as judged by an empirical "jumbling test." This was not universally the case, however, and in some instances the simple UM was actually as good or better. Trees constructed on the basis of the various alignments differed with regard to their limb lengths, but had essentially the same branching orders. We suggest some reasons for the differing effectiveness of the four approaches in the two different sequence settings, and offer some rules of thumb for assessing the significance of sequence relationships.
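
The simplest of the four schemes in this abstract, Needleman-Wunsch global alignment scored with a unitary matrix (identities only), can be sketched as follows. The abstract does not state the gap treatment or the exact parameters used, so the linear gap penalty of -1, the scores of 1 for a match and 0 for a mismatch, and the function name `needleman_wunsch` are assumptions of this sketch rather than the study's settings.

```python
from typing import List, Tuple


def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = 0,
                     gap: int = -1) -> Tuple[int, str, str]:
    """Global alignment of a and b with identity scoring and linear gaps.

    Returns the optimal score and one optimal pair of gapped sequences.
    """
    n, m = len(a), len(b)
    score: List[List[int]] = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def pair(i: int, j: int) -> int:
        # Unitary matrix: only identical residues earn the match score.
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + pair(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right cell, preferring the diagonal move.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(needleman_wunsch("ACGT", "AGT"))
```

Swapping the `pair` function for a lookup in a substitution table would give the weighted variants; the recurrence and traceback stay the same.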

14.
The major aim of tertiary structure prediction is to obtain protein models with the highest possible accuracy. Fold recognition, homology modeling, and de novo prediction methods typically use predicted secondary structures as input, and all of these methods may significantly benefit from more accurate secondary structure predictions. Although there are many different secondary structure prediction methods available in the literature, their cross-validated prediction accuracy is generally <80%. In order to increase the prediction accuracy, we developed a novel hybrid algorithm called Consensus Data Mining (CDM) that combines our two previous successful methods: (1) Fragment Database Mining (FDM), which exploits the Protein Data Bank structures, and (2) GOR V, which is based on information theory, Bayesian statistics, and multiple sequence alignments (MSA). In CDM, the target sequence is dissected into smaller fragments that are compared with fragments obtained from related sequences in the PDB. For fragments with a sequence identity above a certain sequence identity threshold, the FDM method is applied for the prediction. The remainder of the fragments are predicted by GOR V. The results of the CDM are provided as a function of the upper sequence identities of aligned fragments and the sequence identity threshold. We observe that the value 50% is the optimum sequence identity threshold, and that the accuracy of the CDM method measured by Q(3) ranges from 67.5% to 93.2%, depending on the availability of known structural fragments with sufficiently high sequence identity. As the Protein Data Bank grows, it is anticipated that this consensus method will improve because it will rely more upon the structural fragments.
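
The Q(3) measure and the threshold rule described in this abstract can be made concrete with a short Python sketch. Q(3) is taken here in its usual sense, the percentage of residues whose three-state label (H, E, or C) is predicted correctly. The function names `q3` and `choose_method` are hypothetical, and the dispatch function only mirrors the rule as stated in the abstract; it does not reproduce the FDM or GOR V predictors.

```python
def q3(predicted: str, observed: str) -> float:
    """Percentage of positions whose three-state label (H/E/C) matches."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("label strings must be non-empty and of equal length")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


def choose_method(fragment_identity: float, threshold: float = 50.0) -> str:
    """Pick the predictor for one fragment from its best sequence identity.

    Fragments above the threshold go to FDM, the rest to GOR V. The default
    of 50% is the optimum threshold reported in the abstract.
    """
    return "FDM" if fragment_identity > threshold else "GOR V"


if __name__ == "__main__":
    print(q3("HHEC", "HHCC"))
    print(choose_method(62.0), choose_method(35.0))
```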

15.
Although multiple sequence alignments (MSAs) are essential for a wide range of applications from structure modeling to prediction of functional sites, construction of accurate MSAs for distantly related proteins remains a largely unsolved problem. The rapidly increasing database of spatial structures is a valuable source to improve alignment quality. We explore the use of 3D structural information to guide sequence alignments constructed by our MSA program PROMALS. The resulting tool, PROMALS3D, automatically identifies homologs with known 3D structures for the input sequences, derives structural constraints through structure-based alignments and combines them with sequence constraints to construct consistency-based multiple sequence alignments. The output is a consensus alignment that brings together sequence and structural information about input proteins and their homologs. PROMALS3D can also align sequences of multiple input structures, with the output representing a multiple structure-based alignment refined in combination with sequence constraints. The advantage of PROMALS3D is that it gives researchers an easy way to produce high-quality alignments consistent with both sequences and structures of proteins. PROMALS3D outperforms a number of existing methods for constructing multiple sequence or structural alignments using both reference-dependent and reference-independent evaluation methods.

16.
Multiple protein structure alignment.   (total citations: 5; self-citations: 2; citations by others: 3)
A method was developed to compare protein structures and to combine them into a multiple structure consensus. Previous methods of multiple structure comparison have only concatenated pairwise alignments or produced a consensus structure by averaging coordinate sets. The current method is a fusion of the fast structure comparison program SSAP and the multiple sequence alignment program MULTAL. As in MULTAL, structures are progressively combined, producing intermediate consensus structures that are compared directly to each other and all remaining single structures. This leads to a hierarchic "condensation," continually evaluated in the light of the emerging conserved core regions. Following the SSAP approach, all interatomic vectors were retained with well-conserved regions distinguished by coherent vector bundles (the structural equivalent of a conserved sequence position). Each bundle of vectors is summarized by a resultant, whereas vector coherence is captured in an error term, which is the only distinction between conserved and variable positions. Resultant vectors are used directly in the comparison, which is weighted by their error values, giving greater importance to the matching of conserved positions. The resultant vectors and their errors can also be used directly in molecular modeling. Applications of the method were assessed by the quality of the resulting sequence alignments, phylogenetic tree construction, and databank scanning with the consensus. Visual assessment of the structural superpositions and consensus structure for various well-characterized families confirmed that the consensus had identified a reasonable core.

17.
The colH gene encoding a collagenase was cloned from Clostridium histolyticum JCM 1403. Nucleotide sequencing showed a major open reading frame encoding a 116-kDa protein of 1,021 amino acid residues. The deduced amino acid sequence contains a putative signal sequence and a zinc metalloprotease consensus sequence, HEXXH. A 116-kDa collagenase and a 98-kDa gelatinase were copurified from culture supernatants of C. histolyticum. While the former degraded both native and denatured collagen, the latter degraded only denatured collagen. Peptide mapping with V8 protease showed that all peptide fragments, except a few minor ones, liberated from the two enzymes coincided with each other. Analysis of the N-terminal amino acid sequence of the two enzymes revealed that their first 24 amino acid residues were identical and coincided with those deduced from the nucleotide sequence. These results indicate that the 98-kDa gelatinase is generated from the 116-kDa collagenase by cleaving off the C-terminal region, which could be responsible for binding or increasing the accessibility of the collagenase to native collagen fibers. The role of the C-terminal region in the functional and evolutionary aspects of the collagenase was further studied by comparing the amino acid sequence of the C. histolyticum collagenase with those of three homologous enzymes: the collagenases from Clostridium perfringens and Vibrio alginolyticus and Achromobacter lyticus protease I.

18.
The complete amino acid sequence of the subunit of branched-chain amino acid aminotransferase (transaminase B, EC 2.6.1.42) of Salmonella typhimurium was determined. An Escherichia coli recombinant containing the ilvGEDAY gene cluster of Salmonella was used as the source of the hexameric enzyme. The peptide fragments used for sequencing were generated by treatment with trypsin, Staphylococcus aureus V8 protease, endoproteinase Lys-C, and cyanogen bromide. The enzyme subunit contains 308 residues and has a molecular weight of 33,920. To determine the coenzyme-binding site, the pyridoxal 5'-phosphate-containing enzyme was treated with tritiated sodium borohydride prior to trypsin digestion. Peptide map comparisons with an apoenzyme tryptic digest and monitoring radioactivity incorporation allowed identification of the pyridoxylated peptide, which was then isolated and sequenced. The coenzyme-binding site is the lysyl residue at position 159. The amino acid sequence of Salmonella transaminase B is 97.4% identical with that of Escherichia coli, differing in only eight amino acid positions. Sequence comparisons of transaminase B to other known aminotransferase sequences revealed limited sequence similarity (24-33%) when conserved amino acid substitutions are allowed and alignments were forced to occur on the coenzyme-binding site.
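
The identity figure in this abstract follows directly from its own numbers: 308 residues with 8 differing positions give 300/308, or about 97.4%. The Python sketch below shows the calculation on placeholder strings. The function name `percent_identity` is hypothetical, the toy sequences are not the real transaminase B sequences, and an ungapped, equal-length comparison is assumed.

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of identical positions between two equal-length sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    identical = sum(x == y for x, y in zip(a, b))
    return 100.0 * identical / len(a)


if __name__ == "__main__":
    # Placeholder sequences: 308 positions, of which 8 differ.
    seq_a = "A" * 308
    seq_b = "G" * 8 + "A" * 300
    print(round(percent_identity(seq_a, seq_b), 1))
```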

19.
The nucleotide sequence of a 1884 bp DNA fragment of E. coli, carrying the gene dacB, was determined. The DNA codes for penicillin-binding protein 4 (PBP4), an enzyme of 477 amino acids, which is involved as a DD-carboxypeptidase-endopeptidase in murein metabolism. The enzyme is translated with a cleavable signal peptide of 20 amino acids, which was verified by sequencing the amino-terminus of the isolated protein. The characteristic active-site fingerprints SXXK, SXN and KTG of class A beta-lactamases and penicillin-binding proteins were located in the sequence. On the basis of amino acid alignments we propose that PBP4 and class A beta-lactamases share a common evolutionary origin but PBP4 has acquired an additional domain of 188 amino acids in the region between the SXXK and SXN elements.
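
Locating fingerprints such as SXXK, SXN and KTG in a protein sequence amounts to a pattern search in which X stands for any residue. A minimal Python sketch using the standard `re` module is given below. The function name `find_motifs` and the demonstration sequence are made up for illustration and are not the PBP4 sequence, and a match alone does not establish that a site is functional.

```python
import re
from typing import Dict, List

# Motifs from the abstract, with X (any residue) written as a regex dot.
MOTIFS: Dict[str, str] = {"SXXK": "S..K", "SXN": "S.N", "KTG": "KTG"}


def find_motifs(protein: str) -> Dict[str, List[int]]:
    """Return the 1-based start positions of every motif occurrence.

    A zero-width lookahead is used so that overlapping occurrences are
    reported as well.
    """
    hits: Dict[str, List[int]] = {}
    for name, pattern in MOTIFS.items():
        hits[name] = [match.start() + 1
                      for match in re.finditer(f"(?={pattern})", protein)]
    return hits


if __name__ == "__main__":
    demo = "MASTVKAAASDNGGKTGA"  # made-up sequence, for illustration only
    print(find_motifs(demo))
```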

20.