Similar literature
 20 similar documents found (search time: 15 ms)
1.
Models of molecular evolution tend to be overly simplistic caricatures of biology that are prone to assigning high probabilities to biologically implausible DNA or protein sequences. Here, we explore how to construct time-reversible evolutionary models that yield stationary distributions of sequences that match given target distributions. By adopting comparatively realistic target distributions, evolutionary models can be improved. Instead of focusing on estimating parameters, we concentrate on the population genetic implications of these models. Specifically, we obtain estimates of the product of effective population size and relative fitness difference of alleles. The approach is illustrated with two applications to protein-coding DNA. In the first, a codon-based evolutionary model yields a stationary distribution of sequences, which, when the sequences are translated, matches a variable-length Markov model trained on human proteins. In the second, we introduce an insertion-deletion model that describes selectively neutral evolutionary changes to DNA. We then show how to modify the neutral model so that its stationary distribution at the amino acid level can match a profile hidden Markov model, such as the one associated with the Pfam database.

2.
Fragment-HMM: a new approach to protein structure prediction   Cited by: 1 (self-citations: 0, citations by others: 1)
We designed a simple position-specific hidden Markov model to predict protein structure. Our new framework naturally repeats itself to converge to a final target, conglomerating fragment assembly, clustering, target selection, refinement, and consensus, all in one process. Our initial implementation of this theory converges to within 6 Å of the native structures for 100% of decoys on all six standard benchmark proteins used in ROSETTA (discussed by Simons and colleagues in a recent paper), which achieved only 14%-94% for the same data. The qualities of the best decoys and the final decoys our theory converges to are also notably better.

3.
InterPro, an integrated documentation resource for protein families, protein domains, and functional sites, was developed to amalgamate the individual efforts of the PROSITE, PRINTS, Pfam, and ProDom databases. InterPro can be used for the computational functional classification of newly determined amino acid sequences that lack biochemical characterization and for comparative genome analysis. InterPro contains over 3500 entries for more than 1 000 000 hits in SWISS-PROT and TrEMBL. The database is accessible for text- and sequence-based searches at http://www.ebi.ac.uk/interpro/. InterPro was used for the complete analysis of the proteome of the pathogenic microorganism Mycobacterium tuberculosis and the comparison with the predicted protein-coding sequences of the complete genomes of Bacillus subtilis and Escherichia coli. It was found that 64.8% of proteins in the proteome of M. tuberculosis matched InterPro entries and can be classified by their functions. The comparison with B. subtilis and E. coli provided information on the most common protein families and domains and on the most highly represented protein families in each organism. Thus, InterPro is a useful tool for general comparison of complete proteomes and their compositions.

4.
The mitochondrial inner and outer membranes are composed of a variety of integral membrane proteins, assembled into the membranes posttranslationally. The small translocases of the inner mitochondrial membrane (TIMs) are a group of approximately 10 kDa proteins that function as chaperones to ferry the imported proteins across the mitochondrial intermembrane space to the outer and inner membranes. In yeast, there are 5 small TIM proteins: Tim8, Tim9, Tim10, Tim12, and Tim13, with equivalent proteins reported in humans. Using hidden Markov models, we find that many eukaryotes have proteins equivalent to the Tim8 and Tim13 and the Tim9 and Tim10 subunits. Some eukaryotes provide "snapshots" of evolution, with a single protein showing the features of both Tim8 and Tim13, suggesting that a single progenitor gene has given rise to each of the small TIMs through duplication and modification. We show that no "Tim12" family of proteins exists, but rather that variant forms of the cognate small TIMs have been recently duplicated and modified to provide new functions: the yeast Tim12 is a modified form of Tim10, whereas in humans and some protists variant forms of Tim9, Tim8, and Tim13 are found instead. Sequence motif analysis reveals acidic residues conserved in the Tim10 substrate-binding tentacles, whereas more hydrophobic residues are found in the equivalent substrate-binding region of Tim13. The substrate-binding regions of Tim10 and Tim13 represent structurally independent domains: when the acidic domain from Tim10 is attached to Tim13, the Tim8-Tim13(10) complex becomes essential and the Tim9-Tim10 complex becomes dispensable. The conserved features in the Tim10 and Tim13 subunits provide distinct binding surfaces to accommodate the broad range of substrate proteins delivered to the mitochondrial inner and outer membranes.

5.
We consider hidden Markov models as a versatile class of models for weakly dependent random phenomena. The topic of the present paper is likelihood-ratio testing for hidden Markov models, and we show that, under appropriate conditions, the standard asymptotic theory of likelihood-ratio tests is valid. Such tests are crucial in the specification of multivariate Gaussian hidden Markov models, which we use to illustrate the applicability of our general results. Finally, the methodology is illustrated by means of a real data set.
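
For orientation, the "standard asymptotic theory" mentioned in the abstract above is usually stated as follows. This is the textbook form for nested models, not a formula taken from the paper. The notation is an assumption introduced here: \ell_n is the log-likelihood of n observations, \hat\theta is the unrestricted maximum-likelihood estimate over \Theta, and \hat\theta_0 is the estimate under the null hypothesis \theta \in \Theta_0 \subset \Theta.

```latex
% Likelihood-ratio statistic and its limiting distribution under H_0
% (textbook form; symbols are assumptions, not the paper's notation)
\Lambda_n = 2\left[\ell_n(\hat\theta) - \ell_n(\hat\theta_0)\right]
\;\xrightarrow{\;d\;}\; \chi^2_{d},
\qquad d = \dim\Theta - \dim\Theta_0 .
```

The paper's contribution, as summarized in the abstract, is to give conditions under which this limit also holds when the likelihood is that of a hidden Markov model.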

6.
Bernsel A, Viklund H, Elofsson A. Proteins, 2008, 71(3): 1387-1399
Compared with globular proteins, transmembrane proteins are surrounded by a more intricate environment and, consequently, amino acid composition varies between the different compartments. Existing algorithms for homology detection are generally developed with globular proteins in mind and may not be optimal to detect distant homology between transmembrane proteins. Here, we introduce a new profile-profile based alignment method for remote homology detection of transmembrane proteins in a hidden Markov model framework that takes advantage of the sequence constraints placed by the hydrophobic interior of the membrane. We expect that, for distant membrane protein homologs, even if the sequences have diverged too far to be recognized, the hydrophobicity pattern and the transmembrane topology are better conserved. By using this information in parallel with sequence information, we show that both sensitivity and specificity can be substantially improved for remote homology detection in two independent test sets. In addition, we show that alignment quality can be improved for the most distant homologs in a public dataset of membrane protein structures. Applying the method to the Pfam domain database, we are able to suggest new putative evolutionary relationships for a few relatively uncharacterized protein domain families, of which several are confirmed by other methods. The method is called Searcher for Homology Relationships of Integral Membrane Proteins (SHRIMP) and is available for download at http://www.sbc.su.se/shrimp/.

7.
Qian B, Goldstein RA. Proteins, 2003, 52(3): 446-453
It is often desired to identify further homologs of a family of biological sequences from the ever-growing sequence databases. Profile hidden Markov models excel at capturing the common statistical features of a group of biological sequences. With these common features, we can search the biological database and find new homologous sequences. Most general profile hidden Markov model methods, however, treat the evolutionary relationships between the sequences in a homologous group in an ad-hoc manner. We hereby introduce a method to incorporate phylogenetic information directly into hidden Markov models, and demonstrate that the resulting model performs better than most of the current multiple sequence-based methods for finding distant homologs.

8.
Chaudhuri I, Söding J, Lupas AN. Proteins, 2008, 71(2): 795-803
beta-Propellers are toroidal folds, in which repeated, four-stranded beta-meanders are arranged in a circular and slightly tilted fashion, like the blades of a propeller. They are found in all domains of life, with a strong preponderance among eukaryotes. Propellers show considerable sequence diversity and are classified into six separate structural groups by the SCOP and CATH databases. Despite this diversity, they often show similarities across groups, not only in structure but also in sequence, raising the possibility of a common origin. In agreement with this hypothesis, most propellers group together in a cluster map of all-beta folds generated by sequence similarity, because of numerous pairwise matches, many of which are individually nonsignificant. In total, 45 of 60 propellers in the SCOP25 database, covering four SCOP folds, are clustered in this group and analysis with sensitive sequence comparison methods shows that they are similar at a level indicative of homology. Two mechanisms appear to contribute to the evolution of beta-propellers: amplification from single blades and subsequent functional differentiation. The observation of propellers with nearly identical blades in genomic sequences shows that these mechanisms are still operating today.

9.
Selection for new favorable variants can lead to selective sweeps. However, such sweeps might be rare in the evolution of different species for which polygenic adaptation or selection on standing variation might be more common. Still, strong selective sweeps have been described in domestic species such as chicken lines or dog breeds. The goal of our study was to use a panel of individuals from 12 different cattle breeds genotyped at high density (800K SNPs) to perform a whole-genome scan for selective sweeps defined as unexpectedly long stretches of reduced heterozygosity. To that end, we developed a hidden Markov model in which one of the hidden states corresponds to regions of reduced heterozygosity. Some unexpectedly long regions were identified. Among those, six contained genes known to affect traits with simple genetic architecture such as coat color or horn development. However, there was little evidence for sweeps associated with genes underlying production traits.

10.
Methylated non-CpGs (mCpHs) in mammalian cells yield weak enrichment signals and colocalize with methylated CpGs (mCpGs), and thus have been considered byproducts of hyperactive methyltransferases. However, mCpHs are cell type-specific and associated with epigenetic regulation, although their dependency on mCpGs remains to be elucidated. In this study, we demonstrated that mCpHs colocalize with mCpGs in pluripotent stem cells, but not in brain cells. In addition, profiling genome-wide methylation patterns using a hidden Markov model revealed abundant genomic regions in which CpGs and CpHs are differentially methylated in brain. These regions were frequently located in putative enhancers, and mCpHs within the enhancers increased in correlation with brain age. The enhancers with hypermethylated CpHs were associated with genes functionally enriched in immune responses, and some of the genes were related to neuroinflammation and degeneration. This study provides insight into the roles of non-CpG methylation as an epigenetic code in the mammalian brain genome.

11.
Rehmsmeier M, Vingron M. Proteins, 2001, 45(4): 360-371
We present a database search method that is based on phylogenetic trees (treesearch). The method is used to search a protein sequence database for homologs to a protein family. In preparation for the search, a phylogenetic tree is constructed from a given multiple alignment of the family. During the search, each database sequence is temporarily inserted into the tree, thus adding a new edge to the tree. Homology between family and sequence is then judged from the length of this edge. In a comparison of our method to profiles (ISREC pfsearch), two implementations of hidden Markov models (HMMER hmmsearch and SAM hmmscore), and to the family pairwise search (FPS) method on 43 families from the SCOP database based on minimum false-positive counts (min-FPCs), we found a considerable gain in sensitivity. In 69% of the test cases, treesearch showed a min-FPC of at most 50, whereas the two second best methods (hmmsearch and FPS) showed this performance in only 53% of cases. A similar impression holds for a large range of min-FPC thresholds. The results demonstrate that phylogenetic information can significantly improve the detection of distant homologies and justify our method as a useful alternative to existing methods.

12.
A combined transmembrane topology and signal peptide prediction method   Cited by: 31 (self-citations: 0, citations by others: 31)
An inherent problem in transmembrane protein topology prediction and signal peptide prediction is the high similarity between the hydrophobic regions of a transmembrane helix and that of a signal peptide, leading to cross-reaction between the two types of predictions. To improve predictions further, it is therefore important to make a predictor that aims to discriminate between the two classes. In addition, topology information can be gained when successfully predicting a signal peptide leading a transmembrane protein since it dictates that the N terminus of the mature protein must be on the non-cytoplasmic side of the membrane. Here, we present Phobius, a combined transmembrane protein topology and signal peptide predictor. The predictor is based on a hidden Markov model (HMM) that models the different sequence regions of a signal peptide and the different regions of a transmembrane protein in a series of interconnected states. Training was done on a newly assembled and curated dataset. Compared to TMHMM and SignalP, errors coming from cross-prediction between transmembrane segments and signal peptides were reduced substantially by Phobius. False classifications of signal peptides were reduced from 26.1% to 3.9% and false classifications of transmembrane helices were reduced from 19.0% to 7.7%. Phobius was applied to the proteomes of Homo sapiens and Escherichia coli. Here we also noted a drastic reduction of false classifications compared to TMHMM/SignalP, suggesting that Phobius is well suited for whole-genome annotation of signal peptides and transmembrane regions. The method is available at as well as at

13.
14.
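Detection of deletions and insertions in coding sequences based on a hidden Markov model (in English)   Cited by: 2 (self-citations: 0, citations by others: 2)
With genome sequencing completed, the use of computational tools for gene identification and gene structure prediction has attracted growing attention. Many software packages have been developed for this purpose, such as GenScan, Genemark, and GRAIL, and they provide important clues for finding new genes. However, the problem of gene identification and prediction is still not fully solved: when the coding sequence of a target gene contains deletions and insertions, the predicted structure differs greatly from the actual structure of the gene. To eliminate the effect of sequencing errors on the predictions, one would like to locate the sequencing errors in coding regions. Based on this idea, we use statistical properties of DNA sequences and a hidden Markov model (HMM) that introduces deletion and insertion states, and then apply the Viterbi algorithm to find exon sequence fragments containing deletions and insertions. Tested on the commonly used Burset/Guigo test set, the method achieves sensitivity (Sn) and specificity (Sp) both above 84% at the exon level.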
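
The Viterbi decoding step mentioned in the abstract above can be sketched as follows. This is a minimal, generic illustration and not the authors' model: it uses only two hypothetical states ("coding" and "noncoding") and made-up probabilities, and it does not include the dedicated deletion and insertion states the abstract describes. All names (`viterbi`, `STATES`, `START_P`, `TRANS_P`, `EMIT_P`) are assumptions introduced here.

```python
import math

# Hypothetical toy parameters (NOT from the paper): two hidden states,
# with GC-rich emissions for "coding" and AT-rich emissions for "noncoding".
STATES = ("noncoding", "coding")
START_P = {"noncoding": 0.5, "coding": 0.5}
TRANS_P = {
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
EMIT_P = {
    "noncoding": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    "coding": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
}


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state path for obs (computed in log space)."""
    # best[t][s]: log-probability of the best path that ends in state s at step t
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda item: item[1],
            )
            best[t][s] = score + math.log(emit_p[s][obs[t]])
            back[t][s] = prev
    # Trace back from the best final state
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


if __name__ == "__main__":
    print(viterbi("AATTGGCC", STATES, START_P, TRANS_P, EMIT_P))
```

In a model of the kind the abstract describes, the state set would be extended with states that absorb or skip a nucleotide so that a frameshift caused by a sequencing error can be explained by the path; the decoding loop itself would stay the same.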

15.
16.
In many countries, high somatic cell scores (SCS) in milk are used as an indicator for mastitis because they are collected on a routine basis. However, individual test-day SCS are not very accurate in identifying infected cows. Mathematical models may improve the accuracy of the biological marker by making better use of the information contained in the available data. Here, a simple hidden Markov model (HMM) is described mathematically and applied to SCS recorded monthly on cows with or without clinical mastitis to evaluate its accuracy in estimating parameters (mean, variance and transition probabilities) under healthy or diseased states. The SCS means were estimated at 1.96 (s.d. = 0.16) and 4.73 (s.d. = 0.71) for the hidden healthy and infected states, and the common variance at 0.83 (s.d. = 0.11). The probability of remaining uninfected, recovering from infection, getting newly infected and remaining infected between consecutive test days was estimated at 78.84%, 60.49%, 11.70% and 15%, respectively. Three different health-related states were compared: clinical stages observed by farmers, subclinical cases defined for somatic cell counts below or above 250 000 cells/ml and infected stages obtained from the HMM. The results showed that HMM identifies infected cows before the appearance of clinical and subclinical signs, which may critically improve the power of the studies on the genetic determinants of SCS and reduce biases in predicting breeding values for SCS.
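
To make the emission part of the model above concrete, the sketch below compares how well a single SCS value fits the two hidden states, assuming Gaussian emissions with the reported means (1.96 and 4.73) and common variance (0.83). The Gaussian form and the function names are assumptions made for illustration. The sketch deliberately ignores the transition probabilities, so it is not the full HMM described in the abstract.

```python
import math

# Point estimates quoted in the abstract; the Gaussian emission form is an assumption.
MEAN_HEALTHY = 1.96
MEAN_INFECTED = 4.73
COMMON_VAR = 0.83


def gaussian_logpdf(x, mean, var):
    """Log-density of a normal distribution with the given mean and variance."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)


def emission_log_ratio(scs):
    """log p(scs | infected) - log p(scs | healthy); positive favours 'infected'."""
    return (gaussian_logpdf(scs, MEAN_INFECTED, COMMON_VAR)
            - gaussian_logpdf(scs, MEAN_HEALTHY, COMMON_VAR))


if __name__ == "__main__":
    for scs in (1.0, 3.0, 5.0):
        print(scs, round(emission_log_ratio(scs), 3))
```

Because the variance is shared, the two emission densities cross at the midpoint of the two means. An HMM shifts this decision for each test day according to the neighbouring observations, which is what allows it to flag infection earlier than a fixed threshold.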

17.
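HumGene is a human gene prediction program based on a generalized hidden Markov model (GHMM). Exploiting the structural characteristics of human genes, it uses probabilistic models to build an independent submodel for each specific region of the gene structure, which yields a globally consistent score and makes the overall framework extensible. A new simplified algorithm effectively reduces the computational complexity. The paper describes the components of the software, reports tests of it, and compares its results with those of similar programs.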

18.
Databases of multiple sequence alignments are a valuable aid to protein sequence classification and analysis. One of the main challenges when constructing such a database is to simultaneously satisfy the conflicting demands of completeness on the one hand and quality of alignment and domain definitions on the other. The latter properties are best dealt with by manual approaches, whereas completeness in practice is only amenable to automatic methods. Herein we present a database based on hidden Markov model profiles (HMMs), which combines high quality and completeness. Our database, Pfam, consists of parts A and B. Pfam-A is curated and contains well-characterized protein domain families with high quality alignments, which are maintained by using manually checked seed alignments and HMMs to find and align all members. Pfam-B contains sequence families that were generated automatically by applying the Domainer algorithm to cluster and align the remaining protein sequences after removal of Pfam-A domains. By using Pfam, a large number of previously unannotated proteins from the Caenorhabditis elegans genome project were classified. We have also identified many novel family memberships in known proteins, including new kazal, Fibronectin type III, and response regulator receiver domains. Pfam-A families have permanent accession numbers and form a library of HMMs available for searching and automatic annotation of new protein sequences. Proteins: 28:405–420, 1997. © 1997 Wiley-Liss, Inc.

19.
Protein functional annotation relies on the identification of accurate relationships, sequence divergence being a key factor. This is especially evident when distant protein relationships are demonstrated only with three-dimensional structures. To address this challenge, we describe a computational approach to purposefully bridge gaps between related protein families through directed design of protein-like “linker” sequences. For this, we represented SCOP domain families, integrated with sequence homologues, as multiple profiles and performed HMM-HMM alignments between related domain families. Where convincing alignments were achieved, we applied a roulette wheel-based method to design 3,611,010 protein-like sequences corresponding to 374 SCOP folds. To analyze their ability to link proteins in homology searches, we used 3024 queries to search two databases, one containing only natural sequences and another one additionally containing designed sequences. Our results showed that augmented database searches showed up to 30% improvement in fold coverage for over 74% of the folds, with 52 folds achieving all theoretically possible connections. Although sequences could not be designed between some families, the availability of designed sequences between other families within the fold established the sequence continuum to demonstrate 373 difficult relationships. Ultimately, as a practical and realistic extension, we demonstrate that such protein-like sequences can be “plugged into” routine and generic sequence database searches to empower not only remote homology detection but also fold recognition. Our richly statistically supported findings show that complementary searches in both databases will increase the effectiveness of sequence-based searches in recognizing all homologues sharing a common fold.
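
The abstract above does not spell out its "roulette wheel-based method", so the following is only a generic sketch of roulette-wheel (fitness-proportionate) selection applied to a per-column residue profile. The profile values, the function names `roulette_pick` and `design_sequence`, and the idea of sampling each column independently are all assumptions made for illustration.

```python
import random


def roulette_pick(weights, rng):
    """Pick one key from a non-empty dict with probability proportional to its weight."""
    total = sum(weights.values())
    r = rng.random() * total  # a point on the wheel in [0, total)
    cumulative = 0.0
    for residue, weight in weights.items():
        cumulative += weight
        if r < cumulative:
            return residue
    return residue  # guard against floating-point rounding at the upper edge


def design_sequence(profile, rng):
    """Sample one residue per profile column and join them into a sequence."""
    return "".join(roulette_pick(column, rng) for column in profile)


# Hypothetical three-column profile (residue -> weight); not data from the paper.
TOY_PROFILE = [
    {"M": 1.0},
    {"A": 0.6, "G": 0.3, "S": 0.1},
    {"L": 0.5, "I": 0.3, "V": 0.2},
]

if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs give the same samples
    print([design_sequence(TOY_PROFILE, rng) for _ in range(5)])
```

Sampling in proportion to column weights keeps frequent residues frequent while still occasionally emitting rare ones, which is the property that makes the sampled sequences "protein-like" intermediates instead of a single consensus.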

20.
In this paper, we review developments in probabilistic methods of gene recognition in prokaryotic genomes with the emphasis on connections to the general theory of hidden Markov models (HMM). We show that the Bayesian method implemented in GeneMark, a frequently used gene-finding tool, can be augmented and reintroduced as a rigorous forward-backward (FB) algorithm for local posterior decoding described in the HMM theory. Another earlier developed method, prokaryotic GeneMark.hmm, uses a modification of the Viterbi algorithm for HMM with duration to identify the most likely global path through hidden functional states given the DNA sequence. GeneMark and GeneMark.hmm programs are worth using in concert for analysing prokaryotic DNA sequences that arguably do not follow any exact mathematical model. The new extension of GeneMark using the FB algorithm was implemented in the software program GeneMark.fba. Given the DNA sequence, this program determines an a posteriori probability for each nucleotide to belong to a coding or non-coding region. Also, for any open reading frame (ORF), it assigns a score defined as a probabilistic measure of all paths through hidden states that traverse the ORF as a coding region. In our tests, the prediction accuracy of GeneMark.fba compared favourably with that of the initial (standard) GeneMark program. Comparison to the prokaryotic GeneMark.hmm has also demonstrated a certain, yet species-specific, degree of improvement in raw gene detection, i.e., detection of the correct reading frame (and stop codon). Exact gene prediction, which concerns the precise prediction of the gene start (which in a prokaryotic genome unambiguously defines the reading frame and stop codon, thus, the whole protein product), remains more accurate in GeneMarkS, which uses a more elaborate HMM to specifically address this task.
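
The per-nucleotide posterior described in the abstract above is what the forward-backward algorithm computes. The sketch below is a minimal textbook version for a plain two-state HMM, assuming hypothetical parameters; it is not GeneMark.fba, and it omits the duration modelling and ORF scoring mentioned in the abstract. The names `posterior_decode`, `STATES`, `START_P`, `TRANS_P` and `EMIT_P` are introduced here for illustration.

```python
# Hypothetical toy parameters (NOT GeneMark's): GC-rich "coding", AT-rich "noncoding".
STATES = ("noncoding", "coding")
START_P = {"noncoding": 0.5, "coding": 0.5}
TRANS_P = {
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
EMIT_P = {
    "noncoding": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    "coding": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
}


def posterior_decode(obs, states, start_p, trans_p, emit_p):
    """Return, for each position, P(state | entire sequence) via forward-backward."""
    n = len(obs)
    # Forward pass; each row is rescaled to sum to 1 to avoid numerical underflow.
    fwd, scale = [], []
    for t in range(n):
        row = {}
        for s in states:
            if t == 0:
                prior = start_p[s]
            else:
                prior = sum(fwd[t - 1][p] * trans_p[p][s] for p in states)
            row[s] = prior * emit_p[s][obs[t]]
        c = sum(row.values())
        fwd.append({s: v / c for s, v in row.items()})
        scale.append(c)
    # Backward pass, divided by the same scaling factors as the forward pass.
    bwd = [None] * n
    bwd[n - 1] = {s: 1.0 for s in states}
    for t in range(n - 2, -1, -1):
        bwd[t] = {
            s: sum(trans_p[s][q] * emit_p[q][obs[t + 1]] * bwd[t + 1][q]
                   for q in states) / scale[t + 1]
            for s in states
        }
    # Combine both passes and normalise per position.
    posteriors = []
    for t in range(n):
        raw = {s: fwd[t][s] * bwd[t][s] for s in states}
        z = sum(raw.values())
        posteriors.append({s: v / z for s, v in raw.items()})
    return posteriors


if __name__ == "__main__":
    for base, post in zip("AATTGGCC", posterior_decode("AATTGGCC", STATES, START_P, TRANS_P, EMIT_P)):
        print(base, round(post["coding"], 3))
```

Unlike Viterbi decoding, which returns one best global path, this output is a local confidence value at every position, which is why the abstract suggests the two kinds of program are worth using together.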
