Similar articles
 Found 20 similar articles (search time: 31 ms)
1.
A structure-based method for protein sequence alignment   Total citations: 1, self-citations: 0, citations by others: 1
MOTIVATION: With the continuing rapid growth of protein sequence data, protein sequence comparison methods have become the most widely used tools of bioinformatics. Among these methods are those that use position-specific scoring matrices (PSSMs) to describe protein families. PSSMs can capture information about conserved patterns within families, which can be used to increase the sensitivity of searches for related sequences. Certain types of structural information, however, are not generally captured by PSSM search methods. Here we introduce a program, Structure-based ALignment TOol (SALTO), that aligns protein query sequences to PSSMs using rules for placing and scoring gaps that are consistent with the conserved regions of domain alignments from NCBI's Conserved Domain Database. RESULTS: In most cases, the alignment scores obtained using the local alignment version follow an extreme value distribution. SALTO's performance in finding related sequences and producing accurate alignments is similar to or better than that of IMPALA; one advantage of SALTO is that it imposes an explicit gapping model on each protein family. AVAILABILITY: A stand-alone version of the program that can generate global or local alignments is available by ftp distribution (ftp://ftp.ncbi.nih.gov/pub/SALTO/), and has been incorporated into the Cn3D structure/alignment viewer. CONTACT: bryant@ncbi.nlm.nih.gov.
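As a rough illustration of what scoring a query against a PSSM involves, the sketch below slides a small matrix along a sequence without gaps and reports the best-scoring window. It is not SALTO's method: the abstract describes family-specific rules for placing and scoring gaps, which are omitted here. The function name `best_ungapped_pssm_hit`, the list-of-dicts matrix layout, and the default score of 0 for residues absent from a column are all assumptions made for this example.

```python
def best_ungapped_pssm_hit(sequence, pssm):
    """Slide the PSSM along the sequence without gaps.

    pssm is a list with one dict per position, mapping residue -> score
    (hypothetical layout). Residues missing from a column score 0, which
    is an assumption of this sketch.
    Returns (best_score, start_index), or None if the sequence is
    shorter than the matrix.
    """
    width = len(pssm)
    best = None
    for start in range(len(sequence) - width + 1):
        # Sum the per-position scores for the window starting at `start`.
        score = sum(pssm[i].get(sequence[start + i], 0) for i in range(width))
        if best is None or score > best[0]:
            best = (score, start)
    return best


# Toy three-column matrix with made-up scores.
toy_pssm = [{"A": 2, "C": -1}, {"G": 3}, {"T": 1, "A": -2}]
hit = best_ungapped_pssm_hit("CAGTA", toy_pssm)
```

A gapped aligner of the kind the abstract describes would replace the simple window scan with dynamic programming and position-dependent gap penalties.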

2.
Non-coding DNA segments that are conserved between the human and mouse genomic sequence are good indicators of possible regulatory sequences. Here we report on a systematic approach to delineate such conserved elements from upstream regions of orthologous gene pairs from man and mouse. We focus on orthologous genes in order to maximize our chances to find functionally similar regulatory elements. The identification of conserved elements is effected using the Waterman-Eggert local suboptimal alignment algorithm. We have modified an implementation of this algorithm such that it integrates the determination of statistical significance for the local suboptimal alignments. This has the effect of outputting a dynamically determined number of suboptimal alignments that are deemed statistically significant. Comparison with experimentally determined annotation shows a striking enrichment of regulatory sites among the conserved regions. Furthermore, the conserved regions tend to cover the promoter region described in the EPD database.

3.
4.
5.
MOTIVATION: Amino acid sequence alignments are widely used in the analysis of protein structure, function and evolutionary relationships. Proteins within a superfamily usually share the same fold and possess related functions. These structural and functional constraints are reflected in the alignment conservation patterns. Positions of functional and/or structural importance tend to be more conserved. Conserved positions are usually clustered in distinct motifs surrounded by sequence segments of low conservation. Poorly conserved regions might also arise from the imperfections in multiple alignment algorithms and thus indicate possible alignment errors. Quantification of conservation by attributing a conservation index to each aligned position makes motif detection more convenient. Mapping these conservation indices onto a protein spatial structure helps to visualize spatial conservation features of the molecule and to predict functionally and/or structurally important sites. Analysis of conservation indices could be a useful tool in detection of potentially misaligned regions and will aid in improvement of multiple alignments. RESULTS: We developed a program to calculate a conservation index at each position in a multiple sequence alignment using several methods. Namely, amino acid frequencies at each position are estimated and the conservation index is calculated from these frequencies. We utilize both unweighted frequencies and frequencies weighted using two different strategies. Three conceptually different approaches (entropy-based, variance-based and matrix score-based) are implemented in the algorithm to define the conservation index. Calculating conservation indices for 35522 positions in 284 alignments from SMART database we demonstrate that different methods result in highly correlated (correlation coefficient more than 0.85) conservation indices. 
Conservation indices show statistically significant correlation between sequentially adjacent positions i and i + j, where j < 13, and averaging of the indices over the window of three positions is optimal for motif detection. Positions with gaps display substantially lower conservation properties. We compare conservation properties of the SMART alignments or FSSP structural alignments to those of the ClustalW alignments. The results suggest that conservation indices should be a valuable tool of alignment quality assessment and might be used as an objective function for refinement of multiple alignments. AVAILABILITY: The C code of the AL2CO program and its pre-compiled versions for several platforms as well as the details of the analysis are freely available at ftp://iole.swmed.edu/pub/al2co/.
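The entropy-based variant of a per-column conservation index, followed by the three-position window averaging mentioned in the abstract, can be sketched roughly as follows. This is not the AL2CO code: it uses unweighted frequencies only, ignores gap characters when counting (an assumption), and reports the negated Shannon entropy so that higher values mean stronger conservation. The function names are hypothetical.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy in bits of the residues in one column, gaps ignored."""
    residues = [c for c in column if c != "-"]
    if not residues:
        return 0.0
    counts = Counter(residues)
    total = len(residues)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def conservation_indices(alignment):
    """One index per column: negated entropy, so higher means more conserved."""
    return [-column_entropy(col) for col in zip(*alignment)]


def window_average(values, width=3):
    """Average each value with its neighbours; windows are truncated at the ends."""
    half = width // 2
    smoothed = []
    for i in range(len(values)):
        window = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


# Toy alignment: column 0 is invariant, column 2 is fully variable.
toy_alignment = ["ACD", "ACE", "AGF"]
indices = conservation_indices(toy_alignment)
smoothed = window_average(indices)
```

The weighted-frequency schemes and the variance-based and matrix score-based indices described in the abstract would replace `column_entropy` while leaving the smoothing step unchanged.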

6.
Conserved segments in DNA or protein sequences are strong candidates for functional elements and thus appropriate methods for computing them need to be developed and compared. We describe five methods and computer programs for finding highly conserved blocks within previously computed multiple alignments, primarily for DNA sequences. Two of the methods are already in common use; these are based on good column agreement and high information content. Three additional methods find blocks with minimal evolutionary change, blocks that differ in at most k positions per row from a known center sequence and blocks that differ in at most k positions per row from a center sequence that is unknown a priori. The center sequence in the latter two methods is a way to model potential binding sites for known or unknown proteins in DNA sequences. The efficacy of each method was evaluated by analysis of three extensively analyzed regulatory regions in mammalian beta-globin gene clusters and the control region of bacterial arabinose operons. Although all five methods have quite different theoretical underpinnings, they produce rather similar results on these data sets when their parameters are adjusted to best approximate the experimental data. The optimal parameters for the method based on information content varied little for different regulatory regions of the beta-globin gene cluster and hence may be extrapolated to many other regulatory regions. The programs based on maximum allowed mismatches per row have simple parameters whose values can be chosen a priori and thus they may be more useful than the other methods when calibration against known functional sites is not available.
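The simplest of the criteria above, a block in which every row differs from a known center sequence in at most k positions, can be illustrated with a brute-force scan over the alignment columns. This is only a sketch under stated assumptions, not the authors' program: gap characters are treated as ordinary mismatching symbols, all rows are assumed to have equal length, and the names `hamming` and `blocks_near_center` are invented for the example.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def blocks_near_center(alignment, center, k):
    """Start columns where every row is within k mismatches of `center`."""
    width = len(center)
    n_cols = len(alignment[0])
    hits = []
    for start in range(n_cols - width + 1):
        # A column window qualifies only if all rows satisfy the bound.
        if all(hamming(row[start:start + width], center) <= k for row in alignment):
            hits.append(start)
    return hits


# Toy DNA alignment containing one GATAAG-like block at column 2.
toy_rows = ["TTGATAAGG", "CTGATTAGC", "ATGACAAGA"]
found = blocks_near_center(toy_rows, "GATAAG", k=1)
```

The variant with a center sequence that is unknown a priori is harder, since the center itself must be searched for, and is not covered by this sketch.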

7.
The performances of five global multiple-sequence alignment programs (CLUSTAL W, Divide and Conquer, Malign, PileUp, and TreeAlign) were evaluated using part of the animal mitochondrial small subunit (12S) rRNA molecule. Conserved sequence motifs derived from an alignment based on secondary structural information were used to score how well each program aligned a data set of five vertebrate and five invertebrate taxa over a range of parameter values. All of the programs could align the motifs with reasonable accuracy for at least one set of parameter conditions, although if the whole sequence was considered, similarity to the structural alignment was only 25%-34%. Use of small gap costs generally gave more accurate results, although Malign and TreeAlign generated longer alignments when gap costs were low. The programs differed in the consistency of the alignments when gap cost was varied; CLUSTAL W, Divide and Conquer, and TreeAlign were the most accurate and robust, while PileUp performed poorly as gap cost values increased, and the accuracy of Malign fluctuated. Default settings for the programs did not give the best results, and attempting to select similar parameter values in different programs did not always result in more similar alignments. Poor alignment of even well-conserved motifs can occur if these are near sites with insertions or deletions. Since there is no a priori way to determine gap costs and because such costs can vary over the gene, alignment of rRNA sequences, particularly the less well conserved regions, should be treated carefully and aided by secondary structure and conserved motifs. Some motifs are single bases and so are often invisible to alignment programs. Our tests involved the most conserved regions of the 12S rRNA gene, and alignment of less well conserved regions will be more problematical. 
None of the alignments we examined produced a fully resolved phylogeny for the data set, indicating that this portion of 12S rRNA is insufficient for resolution of distant evolutionary relationships.

8.
We have developed a quick web-based application for designing conserved genomic PCR and RT-PCR primers from multigenome alignments targeting specific exons or introns. We used Pygr (The Python Graph Database Framework for Bioinformatics) to query intervals from multigenome alignments, which gives us sub-millisecond access to any interval of any genome within the multigenome alignments. PRIMER3 was used to extract optimal primers from a gene of interest. QPRIMER creates an electronic genomic PCR image from a set of conserved primers as well as summary pages for primer alignments and products. QPRIMER supports human, mouse, rat, chicken, dog, zebrafish and fruit fly. Availability: http://www.bioinformatics.ucla.edu/QPRIMER/.

9.
The HSSP (Homology-Derived Secondary Structure of Proteins) database provides multiple sequence alignments (MSAs) for proteins of known three-dimensional (3D) structure in the Protein Data Bank (PDB). The database also contains an estimate of the degree of evolutionary conservation at each amino acid position. This estimate, which is based on the relative entropy, correlates with the functional importance of the position; evolutionarily conserved positions (i.e., positions with limited variability and low entropy) are occasionally important to maintain the 3D structure and biological function(s) of the protein. We recently developed the Rate4Site algorithm for scoring amino acid conservation based on their calculated evolutionary rate. This algorithm takes into account the phylogenetic relationships between the homologs and the stochastic nature of the evolutionary process. Here we present the ConSurf-HSSP database of Rate4Site estimates of the evolutionary rates of the amino acid positions, calculated using HSSP's MSAs. The database provides precalculated evolutionary rates for nearly all of the PDB. These rates are projected, using a color code, onto the protein structure, and can be viewed online using the ConSurf server interface. To exemplify the database, we analyzed in detail the conservation pattern obtained for pyruvate kinase and compared the results with those observed using the relative entropy scores of the HSSP database. It is reassuring to know that the main functional region of the enzyme is detectable using both conservation scores. Interestingly, the ConSurf-HSSP calculations mapped additional functionally important regions, which are moderately conserved and were overlooked by the original HSSP estimate. The ConSurf-HSSP database is available online (http://consurf-hssp.tau.ac.il).

10.
11.
We have cloned and sequenced bovine apoA-I cDNA. Comparison with the apoA-I sequences of six other vertebrates shows the bovine gene to be most similar to that of the dog. Estimates of substitution rates show that apoA-I evolves approximately 25% faster than an average gene in mammalian lineages. All portions of the coding region evolve at roughly similar rates, suggesting that global conformation is conserved. However, a region of the rat protein has evolved rapidly both relative to other portions of the rat sequence and relative to homologous regions in other mammals. To extend our analysis to other apolipoproteins, we compared four vertebrate apoB-100 sequences. Conserved regions were found to include two putative LDL receptor binding domains, in addition to several regions of unidentified function. Comparison of the apoA-I sequences and the apoB-100 sequences indicates that the latter evolve approximately 40% faster than the former and at twice the average rate for mammalian proteins.

12.
Phylogenetic analysis of the formin homology 2 domain   Total citations: 6, self-citations: 0, citations by others: 6
Formin proteins are key regulators of eukaryotic actin filament assembly and elongation, and many species possess multiple formin isoforms. A nomenclature system based on fundamental features would be desirable, to aid the rapid identification and characterization of novel formins. In this article, we attempt to systematize the formin family by performing phylogenetic analyses of the formin homology 2 (FH2) domain, an independently folding region common to all formins, which alone can influence actin dynamics. Through database searches, we identify 101 FH2 domains from 26 eukaryotic species, including 15 in mice. Sequence alignments reveal a highly conserved yeast-specific insert in the "knob loop" region of the FH2 domain, with unknown functional consequences. Phylogenetic analysis using minimum evolution (ME), maximum parsimony (MP), and maximum likelihood (ML) algorithms strongly supports the existence of seven metazoan groups. Yeast FH2 domains segregate from all other eukaryotes, including metazoans, other fungi, plants, and protists. Sequence comparisons of non-FH2 regions support relationships between three metazoan groups (Dia, DAAM, and FRL) and examine previously identified coiled-coil and Diaphanous auto-regulatory domain sequences. This analysis allows for a formin nomenclature system based on sequence relationships, as well as suggesting strategies for the determination of biochemical and cellular activities of these proteins.

13.
14.
Human immunodeficiency virus type 1 (HIV-1) sequences were generated from blood and from brain tissue obtained by stereotactic biopsy from six patients undergoing a diagnostic neurosurgical procedure. Proviral DNA was directly amplified by nested PCR, and 8 to 36 clones from each sample were sequenced. Phylogenetic analysis of intrapatient envelope V3-V5 region HIV-1 DNA sequence sets revealed that brain viral sequences were clustered relative to the blood viral sequences, suggestive of tissue-specific compartmentalization of the virus in four of the six cases. In the other two cases, the blood and brain virus sequences were intermingled in the phylogenetic analyses, suggesting trafficking of virus between the two tissues. Slide-based PCR-driven in situ hybridization of two of the patients' brain biopsy samples confirmed our interpretation of the intrapatient phylogenetic analyses. Interpatient V3 region brain-derived sequence distances were significantly less than blood-derived sequence distances. Relative to the tip of the loop, the set of brain-derived viral sequences had a tendency towards negative or neutral charge compared with the set of blood-derived viral sequences. Entropy calculations were used as a measure of the variability at each position in alignments of blood and brain viral sequences. A relatively conserved set of positions was found, with a significantly lower entropy in the brain- than in the blood-derived viral sequences. These sites constitute a brain "signature pattern," or a noncontiguous set of amino acids in the V3 region conserved in viral sequences derived from brain tissue. This brain-derived signature pattern was also well preserved among isolates previously characterized in vitro as macrophage tropic. Macrophage-monocyte tropism may be the biological constraint that results in the conservation of the viral brain signature pattern.

15.
Comparison of polymorphism at synonymous and non-synonymous sites in protein-coding DNA can provide evidence for selective constraint. Non-coding DNA that forms part of the regulatory landscape presents more of a challenge since there is not such a clear-cut distinction between sites under stronger and weaker selective constraint. Here, we consider putative regulatory elements termed Conserved Non-coding Elements (CNEs) defined by their high level of sequence identity across all vertebrates. Some mutations in these regions have been implicated in developmental disorders; we analyse CNE polymorphism data to investigate whether such deleterious effects are widespread in humans. Single nucleotide variants from the HapMap and 1000 Genomes Projects were mapped across nearly 2000 CNEs. In the 1000 Genomes data we find a significant excess of rare derived alleles in CNEs relative to coding sequences; this pattern is absent in HapMap data, apparently obscured by ascertainment bias. The distribution of polymorphism within CNEs is not uniform; we could identify two categories of sites by exploiting deep vertebrate alignments: stretches that are non-variant, and those that have at least one substitution. The conserved category has fewer polymorphic sites and a greater excess of rare derived alleles, which can be explained by a large proportion of sites under strong purifying selection within humans – higher than that for non-synonymous sites in most protein coding regions, and comparable to that at the strongly conserved trans-dev genes. Conversely, the more evolutionarily labile CNE sites have an allele frequency distribution not significantly different from non-synonymous sites. Future studies should exploit genome-wide re-sequencing to obtain better coverage in selected non-coding regions, given the likelihood that mutations in evolutionarily conserved enhancer sequences are deleterious. 
Discovery pipelines should validate non-coding variants to aid in identifying causal and risk-enhancing variants in complex disorders, in contrast to the current focus on exome sequencing.

16.
Four different intergenic regions of mitochondrial DNA (mt-IGS), a fragment of the intergenic spacer (IGS) region of the rDNA (rDNA-IGS), and a fragment of the ras-related protein (Ypt1) gene were amplified and sequenced from a panel of 31 Phytophthora species representing the most significant forest pathogens and the breadth of diversity in the genus. Over 80 kbp of novel sequences were generated and alignments showed very variable (introns and non-coding regions) as well as conserved coding regions. The mitochondrial DNA regions had an AT/GC ratio ranging from 67.2 to 89.0% and were appropriate for diagnostic development and phylogeographic analysis. The IGS fragment was less variable but still appropriate to discriminate amongst some important forest pathogens. The introns of the Ypt1 gene were sufficiently polymorphic for the development of molecular markers for almost all Phytophthora species, with more conserved flanking coding regions appropriate for the design of Phytophthora genus-specific primers. In general, phylogenetic analysis of the sequence alignments grouped species in clades that matched those based on the ITS regions of the rDNA. In many cases the resolution was improved over ITS, but in other cases sequences were too variable to align accurately and yielded phylograms inconsistent with other data. Key studies on the intraspecific variation and primer specificity remain. However, the research has already yielded an enormous dataset for the identification, detection and study of the molecular evolution of Phytophthora species.

17.
The nef genes of the human immunodeficiency viruses type 1 and 2 (HIV-1 and HIV-2) and the related simian immunodeficiency viruses (SIVs) encode a protein (Nef) whose role in virus replication and cytopathicity remains uncertain. As an attempt to elucidate the function of nef, we characterized the nucleotide and corresponding protein sequences of naturally occurring nef genes obtained from several HIV-1-infected individuals. A consensus Nef sequence was derived and used to identify several features that were highly conserved among the Nef sequences. These features included a nearly invariant myristylation signal, regions of sequence polymorphism and variable duplication, a region with an acidic charge, a (Pxx)4 repeat sequence, and a potential protein kinase C phosphorylation site. Clustering of premature stop codons at position 124 was noted in 6 of the 54 Nef sequences. Further analysis revealed four stretches of residues that were highly conserved not only among the patient-derived HIV-1 Nef sequences, but also among the Nef sequences of HIV-2 and the SIVs, suggesting that Nef proteins expressed by these retroviruses are functionally equivalent. The "Nef-defining" sequences were used to evaluate the sequence alignments of known proteins reported to share sequence similarity with Nef sequences and to conduct additional computer-based searches for similar protein sequences. A gene encoding the consensus Nef sequence was also generated. This gene encodes a full-length Nef protein that should be a valuable tool in further studies of Nef function.

18.
Identification of protein sequence homology by consensus template alignment   Total citations: 26, self-citations: 0, citations by others: 26
A pattern-matching procedure is described, based on fitting templates to the sequence, which allows general structural constraints to be imposed on the patterns identified. The templates correspond to structurally conserved regions of the sequence and were initially derived from a small number of related sequences whose tertiary structures are known. The templates were then made more representative by aligning other sequences of unknown structure. Two alignments were built up containing 100 immunoglobulin variable domain sequences and 85 constant domain sequences, respectively. From each of these extended alignments, templates were generated to represent features conserved in all the sequences. These consisted mainly of patterns of hydrophobicity associated with beta-structure. For structurally conserved beta-strands with no conserved features, templates based on general secondary structure prediction principles were used to identify their possible locations. The specificity of the templates was demonstrated by their ability to identify the conserved features in known immunoglobulin and immunoglobulin-related sequences but not in other non-immunoglobulin sequences.

19.
20.