Similar literature
 20 similar documents found
1.
This article presents a statistical method for detecting recombination in DNA sequence alignments, which is based on combining two probabilistic graphical models: (1) a taxon graph (phylogenetic tree) representing the relationship between the taxa, and (2) a site graph (hidden Markov model) representing interactions between different sites in the DNA sequence alignments. We adopt a Bayesian approach and sample the parameters of the model from the posterior distribution with Markov chain Monte Carlo, using a Metropolis-Hastings and Gibbs-within-Gibbs scheme. The proposed method is tested on various synthetic and real-world DNA sequence alignments, and we compare its performance with the established detection methods RECPARS, PLATO, and TOPAL, as well as with two alternative parameter estimation schemes.

2.
1.  Linking the movement and behaviour of animals to their environment is a central problem in ecology. Through the use of electronic tagging and tracking (ETT), collection of in situ data from free-roaming animals is now commonplace, yet statistical approaches enabling direct relation of movement observations to environmental conditions are still in development.
2.  In this study, we examine the hidden Markov model (HMM) for behavioural analysis of tracking data. HMMs allow for prediction of latent behavioural states while directly accounting for the serial dependence prevalent in ETT data. Updating the probability of behavioural switches with tag or remote-sensing data provides a statistical method that links environmental data to behaviour in a direct and integrated manner.
3.  It is important to assess the reliability of state categorization over the range of time-series lengths typically collected from field instruments and when movement behaviours are similar between movement states. Simulation with varying lengths of time series data and varying contrast between average movements within each state was used to test the HMM's ability to estimate movement parameters.
4.  To demonstrate the methods in a realistic setting, the HMMs were used to categorize resident and migratory phases and the relationship between movement behaviour and ocean temperature using electronic tagging data from southern bluefin tuna (Thunnus maccoyii). Diagnostic tools to evaluate the suitability of different models and inferential methods for investigating differences in behaviour between individuals are also demonstrated.
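
The state-categorization idea in the abstract above can be illustrated with a minimal sketch. It assumes a two-state HMM with Gaussian emissions on a single movement metric (for example daily step length) and decodes the most likely state sequence with the Viterbi algorithm. The function name `viterbi`, the emission model, and all parameter values are illustrative assumptions, not the model or estimates from the study.

```python
import math


def viterbi(obs, start_p, trans_p, means, sds):
    """Most likely hidden-state path for a Gaussian-emission HMM.

    Works in log space; assumes all start and transition
    probabilities are strictly positive (hypothetical setup).
    """
    n_states = len(start_p)

    def log_gauss(x, mu, sd):
        return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mu) ** 2 / (2 * sd * sd)

    # Best log-score of any path ending in each state at time 0.
    score = [math.log(start_p[s]) + log_gauss(obs[0], means[s], sds[s])
             for s in range(n_states)]
    back = []  # back[t][s] = best predecessor of state s at step t + 1
    for x in obs[1:]:
        prev = score
        pointers, score = [], []
        for s in range(n_states):
            best = max(range(n_states),
                       key=lambda r: prev[r] + math.log(trans_p[r][s]))
            pointers.append(best)
            score.append(prev[best] + math.log(trans_p[best][s])
                         + log_gauss(x, means[s], sds[s]))
        back.append(pointers)

    # Trace the best path backwards from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Illustrative data: state 0 = "resident" (short steps), state 1 = "migratory".
steps = [1.0, 1.2, 0.9, 10.0, 11.0, 9.5]
states = viterbi(steps, [0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]], [1.0, 10.0], [1.0, 1.0])
```

With emissions this well separated the decoded path simply follows the step lengths; the simulations described in point 3 concern the harder case where the two states produce similar movements.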

3.
Analysis of the structure of indels in algorithmic versus evolutionary alignments based on a set of inequalities confirms the conclusions from numerical modeling. For the more divergent sequences (PAM > 60), the tested aligning algorithm (SW) tends to increase the mean length of indels and decrease their number.

4.
We consider hidden Markov models as a versatile class of models for weakly dependent random phenomena. The topic of the present paper is likelihood-ratio testing for hidden Markov models, and we show that, under appropriate conditions, the standard asymptotic theory of likelihood-ratio tests is valid. Such tests are crucial in the specification of multivariate Gaussian hidden Markov models, which we use to illustrate the applicability of our general results. Finally, the methodology is illustrated by means of a real data set.

5.
A hidden Markov model (HMM) of the electrocardiogram (ECG) signal is presented for the detection of myocardial ischemia. The time-domain signals recorded by the ECG before and during an episode of local ischemia were pre-processed to produce the input sequences needed for model training. The model was also verified on test data, and the results show that it has some ability to detect myocardial ischemia. The HMM-based algorithm provides a possible approach for the timely, rapid and automatic diagnosis of myocardial ischemia, and could also be used in portable medical diagnostic equipment in the future.

6.
Multiple sequence alignments are fundamental to many sequence analysis methods. Most alignments are computed using the progressive alignment heuristic. These methods are starting to become a bottleneck in some analysis pipelines when faced with data sets of the size of many thousands of sequences. Some methods allow computation of larger data sets while sacrificing quality, and others produce high‐quality alignments, but scale badly with the number of sequences. In this paper, we describe a new program called Clustal Omega, which can align virtually any number of protein sequences quickly and that delivers accurate alignments. The accuracy of the package on smaller test cases is similar to that of the high‐quality aligners. On larger data sets, Clustal Omega outperforms other packages in terms of execution time and quality. Clustal Omega also has powerful features for adding sequences to and exploiting information in existing alignments, making use of the vast amount of precomputed information in public databases like Pfam.

7.
Methods that predict the topology of helical membrane proteins are standard tools when analyzing any proteome. Therefore, it is important to improve the performance of such methods. Here we introduce a novel method, PRODIV-TMHMM, which is a profile-based hidden Markov model (HMM) that also incorporates the best features of earlier HMM methods. In our tests, PRODIV-TMHMM outperforms earlier methods both when evaluated on "low-resolution" topology data and on high-resolution 3D structures. The results presented here indicate that the topology could be correctly predicted for approximately two-thirds of all membrane proteins using PRODIV-TMHMM. The importance of evolutionary information for topology prediction is emphasized by the fact that compared with using single sequences, the performance of PRODIV-TMHMM (as well as two other methods) is increased by approximately 10 percentage units by the use of homologous sequences. On a more general level, we also show that HMM-based (or similar) methods perform superiorly to methods that focus mainly on identification of the membrane regions.

8.
Sequence database searches have become an important tool for the life sciences in general and for gene discovery-driven biotechnology in particular. Both the functional assignment of newly found proteins and the mining of genome databases for functional candidates are equally important tasks typically addressed by database searches. Sensitivity and reliability of the search methods are of crucial importance. The overall performance of sequence alignments and database searches can be enhanced considerably when profiles or hidden Markov models (HMMs) derived from protein families are used as query objects instead of single sequences. This review discusses the concept of profiles, generalised profiles and profile-HMMs, the methods by which they are constructed, and the scope of possible applications in gene discovery and gene functional assignment.

9.
Exposure to air pollution is associated with increased morbidity and mortality. Recent technological advancements permit the collection of time-resolved personal exposure data. Such data are often incomplete with missing observations and exposures below the limit of detection, which limit their use in health effects studies. In this paper, we develop an infinite hidden Markov model for multiple asynchronous multivariate time series with missing data. Our model is designed to include covariates that can inform transitions among hidden states. We implement beam sampling, a combination of slice sampling and dynamic programming, to sample the hidden states, and a Bayesian multiple imputation algorithm to impute missing data. In simulation studies, our model excels in estimating hidden states and state-specific means and imputing observations that are missing at random or below the limit of detection. We validate our imputation approach on data from the Fort Collins Commuter Study. We show that the estimated hidden states improve imputations for data that are missing at random compared to existing approaches. In a case study of the Fort Collins Commuter Study, we describe the inferential gains obtained from our model including improved imputation of missing data and the ability to identify shared patterns in activity and exposure among repeated sampling days for individuals and among distinct individuals.

10.
Making sense of score statistics for sequence alignments
The search for similarity between two biological sequences lies at the core of many applications in bioinformatics. This paper aims to highlight a few of the principles that should be kept in mind when evaluating the statistical significance of alignments between sequences. The extreme value distribution is first introduced, which in most cases describes the distribution of alignment scores between a query and a database. The effects of the similarity matrix and gap penalty values on the score distribution are then examined, and it is shown that the alignment statistics can undergo an abrupt phase transition. A few types of random sequence databases used in the estimation of statistical significance are presented, and the statistics employed by the BLAST, FASTA and PRSS programs are compared. Finally the different strategies used to assess the statistical significance of the matches produced by profiles and hidden Markov models are presented.
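
As a rough illustration of how an extreme value distribution turns an alignment score into a significance estimate, the sketch below uses the widely cited Karlin-Altschul form, in which the expected number of chance hits is E = K·m·n·exp(−λS) and the P-value is 1 − exp(−E). The function name `evd_significance` and the values chosen for `lam` and `K` are placeholders for illustration; in practice these parameters depend on the scoring matrix and gap penalties, and the formulas used by the individual programs compared in the paper may differ.

```python
import math


def evd_significance(score, query_len, db_len, lam, K):
    """E-value and P-value of a raw alignment score under an
    extreme value (Gumbel-type) null model.

    lam and K are scoring-system parameters; the values used
    below are hypothetical placeholders, not fitted constants.
    """
    e_value = K * query_len * db_len * math.exp(-lam * score)
    p_value = -math.expm1(-e_value)  # 1 - exp(-E), stable for small E
    return e_value, p_value


# Illustrative call: a higher score gives a smaller E-value.
e_low, p_low = evd_significance(40, 300, 1_000_000, lam=0.3, K=0.1)
e_high, p_high = evd_significance(80, 300, 1_000_000, lam=0.3, K=0.1)
```

Because E falls exponentially with the score, a modest gain in score can move a hit from insignificant to highly significant, which is why the choice of similarity matrix and gap penalties matters so much.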

11.
Surveillance data for communicable nosocomial pathogens usually consist of short time series of low-numbered counts of infected patients. These often show overdispersion and autocorrelation. To date, almost all analyses of such data have ignored the communicable nature of the organisms and have used methods appropriate only for independent outcomes. Inferences that depend on such analyses cannot be considered reliable when patient-to-patient transmission is important. We propose a new method for analysing these data based on a mechanistic model of the epidemic process. Since important nosocomial pathogens are often carried asymptomatically with overt infection developing in only a proportion of patients, the epidemic process is usually only partially observed by routine surveillance data. We therefore develop a 'structured' hidden Markov model where the underlying Markov chain is generated by a simple transmission model. We apply both structured and standard (unstructured) hidden Markov models to time series for three important pathogens. We find that both methods can offer marked improvements over currently used approaches when nosocomial spread is important. Compared to the standard hidden Markov model, the new approach is more parsimonious, is more biologically plausible, and allows key epidemiological parameters to be estimated.

12.
Sequence alignment is fundamental for analyzing protein structure and function. For all but closely-related proteins, alignments based on structures are more accurate than alignments based purely on amino-acid sequences. However, the disparity between the large amount of sequence data and the relative paucity of experimentally-determined structures has precluded the general applicability of structure alignment. Based on the success of AlphaFold (and similar tools) in producing high-quality structure predictions, we suggest that when aligning homologous proteins that lack experimental structures, better results can be obtained by a structural alignment of predicted structures than by an alignment based only on amino-acid sequences. We present a quantitative evaluation, based on pairwise alignments of sequences and structures (both predicted and experimental), to support this hypothesis.

13.
14.
15.
Plant architecture is the result of repetitions that occur through growth and branching processes. During plant ontogeny, changes in the morphological characteristics of plant entities are interpreted as the indirect translation of different physiological states of the meristems. Thus connected entities can exhibit either similar or very contrasted characteristics. We propose a statistical model to reveal and characterize homogeneous zones and transitions between zones within tree-structured data: the hidden Markov tree (HMT) model. This model leads to a clustering of the entities into classes sharing the same 'hidden state'. The application of the HMT model to two plant sets (apple trees and bush willows), measured at annual shoot scale, highlights ordered states defined by different morphological characteristics. The model provides a synthetic overview of state locations, pointing out homogeneous zones or ruptures. It also illustrates where within branching structures, and when during plant ontogeny, morphological changes occur. However, the labelling exhibits some patterns that cannot be described by the model parameters. Some of these limitations are addressed by two alternative HMT families.

16.
The objective of the present research was to investigate whether hidden Markov models can be used to recognise and classify balance signals recorded from two subject groups: healthy subjects and patients suffering from otoneurological vertiginous diseases. Two different testing protocols were applied: rising from a chair and standing on a force platform. Signals recorded according to these protocols were used to train models with different numbers of states in order to find the best model structures. We found that models with 7–15 states were able to distinguish the healthy subjects from the patients with an accuracy between 70 and 90%, although the balance measurements of the two groups were visually very similar and difficult to separate.

17.
Species identification through DNA barcoding or metabarcoding has become a key approach for biodiversity evaluation and ecological studies. However, the rapid accumulation of barcoding data has created some difficulties: for instance, global enquiries to a large reference library can take a very long time. We here devise a two‐step searching strategy to speed up identification for such queries. This first uses a Hidden Markov Model (HMM) algorithm to narrow the searching scope to genus level and then determines the corresponding species using minimum genetic distance. Moreover, using a fuzzy membership function, our approach also estimates the credibility of assignment results for each query. To perform this task, we developed a new software pipeline, FuzzyID2, using Python and C++. Performance of the new method was assessed using eight empirical data sets ranging from 70 to 234,535 barcodes. Five data sets (four animal, one plant) deployed the conventional barcode approach, one used metabarcodes, and two were eDNA‐based. The results showed mean accuracies of generic and species identification of 98.60% (with a minimum of 95.00% and a maximum of 100.00%) and 94.17% (with a range of 84.40%–100.00%), respectively. Tests with simulated NGS sequences based on realistic eDNA and metabarcode data demonstrated that FuzzyID2 achieved a significantly higher identification success rate than the commonly used Blast method, and that the TIPP method tends to find many fewer species than either FuzzyID2 or Blast. Furthermore, data sets with tens of thousands of barcodes need only a few seconds for each query assignment using FuzzyID2. Our approach provides an efficient and accurate species identification protocol for biodiversity‐related projects with large DNA sequence data sets.
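
The second step of the strategy described above (choosing a species by minimum genetic distance once the genus is known) can be sketched as follows. This is not the FuzzyID2 implementation: the genus is simply passed in rather than predicted by an HMM, the distance is the plain proportion of differing sites between pre-aligned sequences, the fuzzy credibility estimate is omitted, and the function names and toy reference sequences are invented for illustration.

```python
def p_distance(a, b):
    """Proportion of differing sites between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def nearest_species(query, references, genus):
    """Return (species, distance) for the closest reference within one genus.

    references is a list of (genus, species, sequence) tuples; restricting
    the search to a single genus stands in for the HMM narrowing step.
    """
    candidates = [(sp, seq) for g, sp, seq in references if g == genus]
    if not candidates:
        raise ValueError("no reference sequences for genus " + genus)
    return min(((sp, p_distance(query, seq)) for sp, seq in candidates),
               key=lambda pair: pair[1])


# Toy reference library with made-up sequences (not real barcodes).
library = [
    ("Apis", "Apis mellifera", "ACGTACGT"),
    ("Apis", "Apis cerana", "TCGTTCGA"),
    ("Bombus", "Bombus terrestris", "ACGTACGA"),
]
match = nearest_species("ACGTACGA", library, genus="Apis")
```

Restricting the comparison to one genus is what keeps the per-query cost low: only the candidates in that genus are scanned, not the whole reference library.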

18.
19.
We derive an expectation maximization algorithm for maximum-likelihood training of substitution rate matrices from multiple sequence alignments. The algorithm can be used to train hidden substitution models, where the structural context of a residue is treated as a hidden variable that can evolve over time. We used the algorithm to train hidden substitution matrices on protein alignments in the Pfam database. Measuring the accuracy of multiple alignment algorithms with reference to BAliBASE (a database of structural reference alignments) our substitution matrices consistently outperform the PAM series, with the improvement steadily increasing as up to four hidden site classes are added. We discuss several applications of this algorithm in bioinformatics.

20.
Over the years, there have been claims that evolution proceeds according to systematically different processes over different timescales and that protein evolution behaves in a non-Markovian manner. On the other hand, Markov models are fundamental to many applications in evolutionary studies. Apparent non-Markovian or time-dependent behavior has been attributed to the influence of the genetic code at short timescales and the dominance of physicochemical properties of the amino acids at long timescales. However, any long time period is simply the accumulation of many short time periods, and it remains unclear why evolution should appear to act systematically differently across the range of timescales studied. We show that the observed time-dependent behavior can be explained qualitatively by modeling protein sequence evolution as an aggregated Markov process (AMP): a time-homogeneous Markovian substitution model observed only at the level of the amino acids encoded by the protein-coding DNA sequence. The study of AMPs sheds new light on the relationship between amino acid-level and codon-level models of sequence evolution, and our results suggest that protein evolution should be modeled at the codon level rather than using amino acid substitution models.
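
The core point about aggregated Markov processes can be seen in a toy example. The sketch below takes a hypothetical three-state Markov chain (standing in for codons), lumps two of its states into one observed class (standing in for an amino acid), and compares the observed two-step transition matrix with the square of the observed one-step matrix. The chain, its stationary distribution, and the grouping are all made up for illustration and are not taken from the paper.

```python
def mat_mul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def mat_pow(p, t):
    """t-step transition matrix P^t (t is a non-negative integer)."""
    result = [[float(i == j) for j in range(len(p))] for i in range(len(p))]
    for _ in range(t):
        result = mat_mul(result, p)
    return result


def aggregate(p_t, pi, classes):
    """Class-level transition probabilities at stationarity.

    Each row weights the fine states of a class by their stationary
    probabilities pi, then sums over the fine states of the target class.
    """
    q = []
    for a in classes:
        weight = sum(pi[i] for i in a)
        q.append([sum(pi[i] * p_t[i][j] for i in a for j in b) / weight
                  for b in classes])
    return q


# Hypothetical fine-level chain and its stationary distribution.
P = [[0.9, 0.1, 0.0],
     [0.0, 0.5, 0.5],
     [0.5, 0.0, 0.5]]
pi = [5 / 7, 1 / 7, 1 / 7]
classes = [[0, 1], [2]]  # states 0 and 1 are observed as one class

q1 = aggregate(P, pi, classes)              # observed one-step matrix
q2 = aggregate(mat_pow(P, 2), pi, classes)  # observed two-step matrix
q1_squared = mat_mul(q1, q1)                # what a Markov model at class level predicts
```

Here `q2` and `q1_squared` disagree even though the underlying chain is time-homogeneous and Markovian, which mirrors the apparent time-dependence the abstract attributes to observing a codon-level process only at the amino acid level.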
