共查询到20条相似文献,搜索用时 15 毫秒
1.
MOTIVATION:Aligning multiple proteins based on sequence information alone is challenging if sequence identity is low or there is a significant degree of structural divergence. We present a novel algorithm (SATCHMO) that is designed to address this challenge. SATCHMO simultaneously constructs a tree and a set of multiple sequence alignments, one for each internal node of the tree. The alignment at a given node contains all sequences within its sub-tree, and predicts which positions in those sequences are alignable and which are not. Aligned regions therefore typically get shorter on a path from a leaf to the root as sequences diverge in structure. Current methods either regard all positions as alignable (e.g. ClustalW), or align only those positions believed to be homologous across all sequences (e.g. profile HMM methods); by contrast SATCHMO makes different predictions of alignable regions in different subgroups. SATCHMO generates profile hidden Markov models at each node; these are used to determine branching order, to align sequences and to predict structurally alignable regions. RESULTS: In experiments on the BAliBASE benchmark alignment database, SATCHMO is shown to perform comparably to ClustalW and the UCSC SAM HMM software. Results using SATCHMO to identify protein domains are demonstrated on potassium channels, with implications for the mechanism by which tumor necrosis factor alpha affects potassium current. AVAILABILITY: The software is available for download from http://www.drive5.com/lobster/index.htm 相似文献
2.
MUMMALS: multiple sequence alignment improved by using hidden Markov models with local structural information 总被引:3,自引:2,他引:3
We have developed MUMMALS, a program to construct multiple protein sequence alignment using probabilistic consistency. MUMMALS improves alignment quality by using pairwise alignment hidden Markov models (HMMs) with multiple match states that describe local structural information without exploiting explicit structure predictions. Parameters for such models have been estimated from a large library of structure-based alignments. We show that (i) on remote homologs, MUMMALS achieves statistically best accuracy among several leading aligners, such as ProbCons, MAFFT and MUSCLE, albeit the average improvement is small, in the order of several percent; (ii) a large collection (>10000) of automatically computed pairwise structure alignments of divergent protein domains is superior to smaller but carefully curated datasets for estimation of alignment parameters and performance tests; (iii) reference-independent evaluation of alignment quality using sequence alignment-dependent structure superpositions correlates well with reference-dependent evaluation that compares sequence-based alignments to structure-based reference alignments. 相似文献
3.
4.
Arby: automatic protein structure prediction using profile-profile alignment and confidence measures 总被引:1,自引:0,他引:1
MOTIVATION: Arby is a new server for protein structure prediction that combines several homology-based methods for predicting the three-dimensional structure of a protein, given its sequence. The methods used include a threading approach, which makes use of structural information, and a profile-profile alignment approach that incorporates secondary structure predictions. The combination of the different methods with the help of empirically derived confidence measures affords reliable template selection. RESULTS: According to the recent CAFASP3 experiment, the server is one of the most sensitive methods for predicting the structure of single domain proteins. The quality of template selection is assessed using a fold-recognition experiment. AVAILABILITY: The Arby server is available through the portal of the Helmholtz Network for Bioinformatics at http://www.hnbioinfo.de under the protein structure category. 相似文献
5.
Predicting GPCR-G-protein coupling using hidden Markov models 总被引:1,自引:0,他引:1
6.
Segmentation of yeast DNA using hidden Markov models 总被引:2,自引:0,他引:2
7.
Kikuchi N Kwon YD Gotoh M Narimatsu H 《Biochemical and biophysical research communications》2003,310(2):574-579
In order to investigate the relationship between glycosyltransferase families and the motif for them, we classified 47 glycosyltransferase families in the CAZy database into four superfamilies, GTS-A, -B, -C, and -D, using a profile Hidden Markov Model method. On the basis of the classification and the similarity between GTS-A and nucleotidylyltransferase family catalyzing the synthesis of nucleotide-sugar, we proposed that ancient oligosaccharide might have been synthesized by the origin of GTS-B whereas the origin of GTS-A might be the gene encoding for synthesis of nucleotide-sugar as the donor and have evolved to glycosyltransferases to catalyze the synthesis of divergent carbohydrates. We also suggested that the divergent evolution of each superfamily in the corresponding subcellular component has increased the complexities of eukaryotic carbohydrate structure. 相似文献
8.
SUMMARY: The package HMMGEP performs cluster analysis on gene expression data using hidden Markov models. AVAILABILITY: HMMGEP, including the source code, documentation and sample data files, is available at http://www.bioinfo.tsinghua.edu.cn:8080/~rich/hmmgep_download/index.html. 相似文献
9.
MOTIVATION: Hidden Markov models (HMMs) calculate the probability that a sequence was generated by a given model. Log-odds scoring provides a context for evaluating this probability, by considering it in relation to a null hypothesis. We have found that using a reverse-sequence null model effectively removes biases owing to sequence length and composition and reduces the number of false positives in a database search. Any scoring system is an arbitrary measure of the quality of database matches. Significance estimates of scores are essential, because they eliminate model- and method-dependent scaling factors, and because they quantify the importance of each match. Accurate computation of the significance of reverse-sequence null model scores presents a problem, because the scores do not fit the extreme-value (Gumbel) distribution commonly used to estimate HMM scores' significance. RESULTS: To get a better estimate of the significance of reverse-sequence null model scores, we derive a theoretical distribution based on the assumption of a Gumbel distribution for raw HMM scores and compare estimates based on this and other distribution families. We derive estimation methods for the parameters of the distributions based on maximum likelihood and on moment matching (least-squares fit for Student's t-distribution). We evaluate the modeled distributions of scores, based on how well they fit the tail of the observed distribution for data not used in the fitting and on the effects of the improved E-values on our HMM-based fold-recognition methods. The theoretical distribution provides some improvement in fitting the tail and in providing fewer false positives in the fold-recognition test. An ad hoc distribution based on assuming a stretched exponential tail does an even better job. The use of Student's t to model the distribution fits well in the middle of the distribution, but provides too heavy a tail. The moment-matching methods fit the tails better than maximum-likelihood methods. AVAILABILITY: Information on obtaining the SAM program suite (free for academic use), as well as a server interface, is available at http://www.soe.ucsc.edu/research/compbio/sam.html and the open-source random sequence generator with varying compositional biases is available at http://www.soe.ucsc.edu/research/compbio/gen_sequence 相似文献
10.
Statistical significance of probabilistic sequence alignment and related local hidden Markov models.
The score statistics of probabilistic gapped local alignment of random sequences is investigated both analytically and numerically. The full probabilistic algorithm (e.g., the "local" version of maximum-likelihood or hidden Markov model method) is found to have anomalous statistics. A modified "semi-probabilistic" alignment consisting of a hybrid of Smith-Waterman and probabilistic alignment is then proposed and studied in detail. It is predicted that the score statistics of the hybrid algorithm is of the Gumbel universal form, with the key Gumbel parameter lambda taking on a fixed asymptotic value for a wide variety of scoring systems and parameters. A simple recipe for the computation of the "relative entropy," and from it the finite size correction to lambda, is also given. These predictions compare well with direct numerical simulations for sequences of lengths between 100 and 1,000 examined using various PAM substitution scores and affine gap functions. The sensitivity of the hybrid method in the detection of sequence homology is also studied using correlated sequences generated from toy mutation models. It is found to be comparable to that of the Smith-Waterman alignment and significantly better than the Viterbi version of the probabilistic alignment. 相似文献
11.
Hofmann K 《Briefings in bioinformatics》2000,1(2):167-178
Sequence database searches have become an important tool for the life sciences in general and for gene discovery-driven biotechnology in particular. Both the functional assignment of newly found proteins and the mining of genome databases for functional candidates are equally important tasks typically addressed by database searches. Sensitivity and reliability of the search methods are of crucial importance.The overall performance of sequence alignments and database searches can be enhanced considerably, when profiles or hidden Markov models (HMMs) derived from protein families are used as query objects instead of single sequences.This review discusses the concept of profiles, generalised profiles and profile-HMMs, the methods how they are constructed and the scope of possible applications in gene discovery and gene functional assignment. 相似文献
12.
Nicolas P Bize L Muri F Hoebeke M Rodolphe F Ehrlich SD Prum B Bessières P 《Nucleic acids research》2002,30(6):1418-1426
We present here the use of a new statistical segmentation method on the Bacillus subtilis chromosome sequence. Maximum likelihood parameter estimation of a hidden Markov model, based on the expectation-maximization algorithm, enables one to segment the DNA sequence according to its local composition. This approach is not based on sliding windows; it enables different compositional classes to be separated without prior knowledge of their content, size and localization. We compared these compositional classes, obtained from the sequence, with the annotated DNA physical map, sequence homologies and repeat regions. The first heterogeneity revealed discriminates between the two coding strands and the non-coding regions. Other main heterogeneities arise; some are related to horizontal gene transfer, some to t-enriched composition of hydrophobic protein coding strands, and others to the codon usage fitness of highly expressed genes. Concerning potential and established gene transfers, we found 9 of the 10 known prophages, plus 14 new regions of atypical composition. Some of them are surrounded by repeats, most of their genes have unknown function or possess homology to genes involved in secondary catabolism, metal and antibiotic resistance. Surprisingly, we notice that all of these detected regions are a + t-richer than the host genome, raising the question of their remote sources. 相似文献
13.
STRUCTFAST is a novel profile-profile alignment algorithm capable of detecting weak similarities between protein sequences. The increased sensitivity and accuracy of the STRUCTFAST method are achieved through several unique features. First, the algorithm utilizes a novel dynamic programming engine capable of incorporating important information from a structural family directly into the alignment process. Second, the algorithm employs a rigorous analytical formula for profile-profile scoring to overcome the limitations of ad hoc scoring functions that require adjustable parameter training. Third, the algorithm employs Convergent Island Statistics (CIS) to compute the statistical significance of alignment scores independently for each pair of sequences. STRUCTFAST routinely produces alignments that meet or exceed the quality obtained by an expert human homology modeler, as evidenced by its performance in the latest CAFASP4 and CASP6 blind prediction benchmark experiments. 相似文献
14.
Quasi-consensus-based comparison of profile hidden Markov models for protein sequences 总被引:2,自引:0,他引:2
A simple approach for the sensitive detection of distant relationships among protein families and for sequence-structure alignment via comparison of hidden Markov models based on their quasi-consensus sequences is presented. Using a previously published benchmark dataset, the approach is demonstrated to give better homology detection and yield alignments with improved accuracy in comparison to an existing state-of-the-art dynamic programming profile-profile comparison method. This method also runs significantly faster and is therefore suitable for a server covering the rapidly increasing structure database. A server based on this method is available at http://liao.cis.udel.edu/website/servers/modmod 相似文献
15.
Hidden Markov models (HMMs) have been successfully applied to a variety of problems in molecular biology, ranging from alignment problems to gene finding and annotation. Alignment problems can be solved with pair HMMs, while gene finding programs rely on generalized HMMs in order to model exon lengths. In this paper, we introduce the generalized pair HMM (GPHMM), which is an extension of both pair and generalized HMMs. We show how GPHMMs, in conjunction with approximate alignments, can be used for cross-species gene finding and describe applications to DNA-cDNA and DNA-protein alignment. GPHMMs provide a unifying and probabilistically sound theory for modeling these problems. 相似文献
16.
Surveillance data for communicable nosocomial pathogens usually consist of short time series of low-numbered counts of infected patients. These often show overdispersion and autocorrelation. To date, almost all analyses of such data have ignored the communicable nature of the organisms and have used methods appropriate only for independent outcomes. Inferences that depend on such analyses cannot be considered reliable when patient-to-patient transmission is important. We propose a new method for analysing these data based on a mechanistic model of the epidemic process. Since important nosocomial pathogens are often carried asymptomatically with overt infection developing in only a proportion of patients, the epidemic process is usually only partially observed by routine surveillance data. We therefore develop a 'structured' hidden Markov model where the underlying Markov chain is generated by a simple transmission model. We apply both structured and standard (unstructured) hidden Markov models to time series for three important pathogens. We find that both methods can offer marked improvements over currently used approaches when nosocomial spread is important. Compared to the standard hidden Markov model, the new approach is more parsimonious, is more biologically plausible, and allows key epidemiological parameters to be estimated. 相似文献
17.
Hidden Markov models have been used to restore recorded signals of single ion channels buried in background noise. Parameter estimation and signal restoration are usually carried out through likelihood maximization by using variants of the Baum-Welch forward-backward procedures. This paper presents an alternative approach for dealing with this inferential task. The inferences are made by using a combination of the framework provided by Bayesian statistics and numerical methods based on Markov chain Monte Carlo stochastic simulation. The reliability of this approach is tested by using synthetic signals of known characteristics. The expectations of the model parameters estimated here are close to those calculated using the Baum-Welch algorithm, but the present methods also yield estimates of their errors. Comparisons of the results of the Bayesian Markov Chain Monte Carlo approach with those obtained by filtering and thresholding demonstrate clearly the superiority of the new methods. 相似文献
18.
Background
Detection of common evolutionary origin (homology) is a primary means of inferring protein structure and function. At present, comparison of protein families represented as sequence profiles is arguably the most effective homology detection strategy. However, finding the best way to represent evolutionary information of a protein sequence family in the profile, to compare profiles and to estimate the biological significance of such comparisons, remains an active area of research. 相似文献19.
A hidden Markov model for progressive multiple alignment 总被引:4,自引:0,他引:4
MOTIVATION: Progressive algorithms are widely used heuristics for the production of alignments among multiple nucleic-acid or protein sequences. Probabilistic approaches providing measures of global and/or local reliability of individual solutions would constitute valuable developments. RESULTS: We present here a new method for multiple sequence alignment that combines an HMM approach, a progressive alignment algorithm, and a probabilistic evolution model describing the character substitution process. Our method works by iterating pairwise alignments according to a guide tree and defining each ancestral sequence from the pairwise alignment of its child nodes, thus, progressively constructing a multiple alignment. Our method allows for the computation of each column minimum posterior probability and we show that this value correlates with the correctness of the result, hence, providing an efficient mean by which unreliably aligned columns can be filtered out from a multiple alignment. 相似文献
20.
Liang KC Wang X Anastassiou D 《IEEE/ACM transactions on computational biology and bioinformatics / IEEE, ACM》2007,4(3):430-440
It has been shown that electropherograms of DNA sequences can be modeled with hidden Markov models. Basecalling, the procedure that determines the sequence of bases from the given eletropherogram, can then be performed using the Viterbi algorithm. A training step is required prior to basecalling in order to estimate the HMM parameters. In this paper, we propose a Bayesian approach which employs the Markov chain Monte Carlo (MCMC) method to perform basecalling. Such an approach not only allows one to naturally encode the prior biological knowledge into the basecalling algorithm, it also exploits both the training data and the basecalling data in estimating the HMM parameters, leading to more accurate estimates. Using the recently sequenced genome of the organism Legionella pneumophila we show that the MCMC basecaller outperforms the state-of-the-art basecalling algorithm in terms of total errors while requiring much less training than other proposed statistical basecallers. 相似文献