Found 20 similar documents (search time: 31 ms)
Comparative sequence analyses, including such fundamental bioinformatics techniques as similarity searching, sequence alignment and phylogenetic inference, have become a mainstay for researchers studying type 1 Human Immunodeficiency Virus (HIV-1) genome structure and evolution. Implicit in comparative analyses is an underlying model of evolution, and the chosen model can significantly affect the results. In general, evolutionary models describe the probabilities of replacing one amino acid character with another over a period of time. Most widely used evolutionary models for protein sequences have been derived from curated alignments of hundreds of proteins, usually based on mammalian genomes. It is unclear to what extent these empirical models are generalizable to a very different organism, such as HIV-1, the most extensively sequenced organism in existence. We developed a maximum likelihood model fitting procedure, applied it to a collection of HIV-1 alignments sampled from different viral genes, and inferred two empirical substitution models, suitable for describing between- and within-host evolution. Our procedure pools the information from multiple sequence alignments, and the provided software implementation can be run efficiently in parallel on a computer cluster. We describe how the inferred substitution models can be used to generate scoring matrices suitable for alignment and similarity searches. Our models had a consistently superior fit relative to the best existing models and to parameter-rich data-driven models when benchmarked on independent HIV-1 alignments, demonstrating evolutionary biases in amino-acid substitution that are unique to HIV and that are not captured by the existing models. The scoring matrices derived from the models showed a marked difference from common amino-acid scoring matrices. The use of an appropriate evolutionary model recovered a known viral transmission history, whereas a poorly chosen model introduced phylogenetic error. We argue that our model derivation procedure is immediately applicable to other organisms with extensive sequence data available, such as Hepatitis C and Influenza A viruses.
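As a rough illustration of how a substitution model can be turned into a scoring matrix, the sketch below uses the standard log-odds construction, S[i][j] = scale * log2(P[i][j] / pi[j]), on a toy two-character alphabet. The construction, the function name `log_odds_scores`, the `scale` parameter and all numbers are assumptions made for illustration; the abstract does not spell out the authors' exact procedure.

```python
import math

def log_odds_scores(P, pi, scale=2.0):
    """Return scores S[i][j] = scale * log2(P[i][j] / pi[j]).

    P  : transition probabilities over some evolutionary time (each row sums to 1)
    pi : equilibrium frequencies of the characters
    """
    n = len(pi)
    return [[scale * math.log2(P[i][j] / pi[j]) for j in range(n)]
            for i in range(n)]

# Toy two-character model (hypothetical numbers, not HIV-1 estimates).
pi = [0.5, 0.5]
P = [[0.9, 0.1],
     [0.1, 0.9]]
S = log_odds_scores(P, pi)
```

Conserved pairs get positive scores and unlikely replacements get negative scores, which is the property alignment programs rely on.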

Models of protein evolution currently come in two flavors: generalist and specialist. Generalist models (e.g. PAM, JTT, WAG) adopt a one-size-fits-all approach, where a single model is estimated from a number of different protein alignments. Specialist models (e.g. mtREV, rtREV, HIVbetween) can be estimated when a large quantity of data is available for a single organism or gene, and are intended for use on that organism or gene only. Unsurprisingly, specialist models outperform generalist models, but in most instances there simply are not enough data available to estimate them. We propose a method for estimating alignment-specific models of protein evolution in which the complexity of the model is adapted to suit the richness of the data. Our method uses non-negative matrix factorization (NNMF) to learn a set of basis matrices from a general dataset containing a large number of alignments of different proteins, thus capturing the dimensions of important variation. It then learns a set of weights that are specific to the organism or gene of interest and for which only a smaller dataset is available. Thus the alignment-specific model is obtained as a weighted sum of the basis matrices. Having been constrained to vary along only as many dimensions as the data justify, the model has far fewer parameters than would be required to estimate a specialist model. We show that our NNMF procedure produces models that outperform existing methods on all but one of 50 test alignments. The basis matrices we obtain confirm the expectation that amino acid properties tend to be conserved, and allow us to quantify, on specific alignments, how the strength of conservation varies across different properties. We also apply our new models to phylogeny inference and show that the resulting phylogenies are different from, and have improved likelihood over, those inferred under standard models.
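The final step described above, building an alignment-specific model as a weighted sum of basis matrices, can be sketched as follows. The function name `combine_basis` and the toy 2 x 2 matrices are hypothetical; learning the basis matrices by NNMF and fitting the weights to an alignment are not shown.

```python
def combine_basis(weights, bases):
    """Return the weighted sum over k of weights[k] * bases[k].

    weights : non-negative floats, one per basis matrix
    bases   : list of equally sized square matrices (lists of lists)
    """
    if len(weights) != len(bases):
        raise ValueError("need exactly one weight per basis matrix")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    n = len(bases[0])
    return [[sum(w * B[i][j] for w, B in zip(weights, bases))
             for j in range(n)]
            for i in range(n)]

# Two toy basis matrices of exchange rates (hypothetical values).
basis_a = [[0.0, 1.0], [1.0, 0.0]]
basis_b = [[0.0, 2.0], [3.0, 0.0]]
model = combine_basis([0.5, 2.0], [basis_a, basis_b])
```

With only one weight per basis matrix to estimate, the number of free parameters is set by the number of bases rather than by the size of the matrix.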

While immunological distances among taxa have had wide use in systematics, there has been some doubt about their utility because of the observed non-metricity of such distance matrices. A model is presented here relating observed immunological distance to the actual number of antigenic site differences between taxa. This model accounts for the observed departures of these distances from the metric conditions of reciprocity and the triangle inequality. Based upon the model, two procedures are suggested for the transformation of immunological distances to metric distances appropriate for phylogenetic analysis. The model implies that the usual scaling adjustments applied to the immunological distance matrix are inappropriate; however, the same transformation applied instead to an initial similarity matrix will solve a scaling problem. Non-reciprocity of the distances is shown to remain a problem independent of this initial scaling problem. It is suggested that further transformation of these re-scaled distances may be obtained through an extension of the ADCLUS procedure developed in psychology. This approach suggests a general strategy for a transformation to metric distances, given a particular model of non-metricity for the data.

Aggregated Markov processes related by similarity transformation are equivalent in that they cannot be distinguished by steady-state experiments. We derive an explicit formula for the set of all detailed-balance preserving similarity transformations between such continuous time Markov chains with N states. The matrices that define the allowed similarity transformations are found to be a simple non-linear function applied to almost any element of the special orthogonal group in N dimensions. Since a model is identifiable only if there are no similarity transformations to an equivalent model, we expect this result to prove useful in the theory of identification of aggregated Markov chains, an enterprise of growing importance as more and more single molecules yield to observation.

Stochastic matrix models are used to predict population viability and the risk of extinction. Different stochastic methods require different amounts of estimation effort and may lead to divergent estimates. We used 16 transition matrices collected from ten populations of the perennial herb Primula veris to compare population estimates produced by different stochastic methods, such as selection of matrices, selection of vital rates, selection of matrix elements, and Tuljapurkar's approximation. Specifically, we tested the reliability of the methods using different numbers of transition matrices, and examined the importance of correlations among matrix entries. When correlations among matrix entries were included in the models, selection of vital rates produced the lowest and Tuljapurkar's approximation produced the highest estimates of mean population growth rates. Selection of matrices and matrix elements often produced very similar population estimates. Simulations based on incompletely estimated correlations among matrix entries differed considerably from those based on all correlations estimated, particularly when correlations were strong. The magnitude of correlations among matrix entries depended on the number of matrices, which made it difficult to generalize correlations within a species. Given that selection of vital rates or matrix elements is used, correlations among matrix entries should usually be included in the model, and they should preferably be estimated from the present data rather than taken from other information on the species.

A stochastic Markov chain model for metastatic progression is developed for primary lung cancer based on a network construction of metastatic sites with dynamics modeled as an ensemble of random walkers on the network. We calculate a transition matrix, with entries (transition probabilities) interpreted as random variables, and use it to construct a circular bi-directional network of primary and metastatic locations based on postmortem tissue analysis of 3827 autopsies on untreated patients documenting all primary tumor locations and metastatic sites from this population. The resulting 50 potential metastatic sites are connected by directed edges with distributed weightings, where the site connections and weightings are obtained by calculating the entries of an ensemble of transition matrices so that the steady-state distribution obtained from the long-time limit of the Markov chain dynamical system corresponds to the ensemble metastatic distribution obtained from the autopsy data set. We condition our search for a transition matrix on an initial distribution of metastatic tumors obtained from the data set. Through an iterative numerical search procedure, we adjust the entries of a sequence of approximations until a transition matrix with the correct steady state is found (up to a numerical threshold). Since this constrained linear optimization problem is underdetermined, we characterize the statistical variance of the ensemble of transition matrices calculated using the means and variances of their singular value distributions as a diagnostic tool. We interpret the ensemble averaged transition probabilities as (approximately) normally distributed random variables. The model allows us to simulate and quantify disease progression pathways and timescales of progression from the lung position to other sites, and we highlight several key findings based on the model.
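The steady-state distribution mentioned above is the long-time limit of repeatedly applying the transition matrix. A minimal sketch of that forward computation, assuming a row-stochastic matrix and using a hypothetical two-site example, is given below; the inverse search for a matrix that reproduces a given steady state, which is what the abstract describes, is not shown.

```python
def steady_state(T, tol=1e-12, max_iter=10_000):
    """Approximate the stationary distribution of a row-stochastic matrix T.

    Starts from the uniform distribution and applies x <- x T until the
    largest change in any entry falls below tol.
    """
    n = len(T)
    x = [1.0 / n] * n
    for _ in range(max_iter):
        nxt = [sum(x[i] * T[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, x)) < tol:
            return nxt
        x = nxt
    return x

# Hypothetical two-site chain; T[i][j] is the probability of moving from i to j.
T = [[0.9, 0.1],
     [0.5, 0.5]]
pi = steady_state(T)
```

For this toy chain the balance condition 0.1 * pi[0] = 0.5 * pi[1] gives the distribution (5/6, 1/6), which the iteration approaches.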

Under the assumption that the brain is a multi-stable system, different scenarios have been proposed for the transition from the normal to the epileptic state, but the path through which this transition occurs is under debate. In this paper, a stochastic model for seizure genesis is presented that is consistent with all scenarios: a two-level spontaneous seizure generation model is proposed in which, at the first level, the behavior of physiological parameters is modeled with a stochastic process. The focus is on physiological parameters that are essential in simulating different activities of the electroencephalogram (EEG), i.e., excitatory and inhibitory synaptic gains of neuronal populations. There are many depth-EEG models in which excitatory and inhibitory synaptic gains are the adjustable parameters. Using one of these models at the second level completes our proposed seizure generator. The suggested stochastic model of the first level is a hidden Markov process whose transition matrices are obtained by analyzing the real parameter sequences of a seizure onset area. These real parameter sequences are estimated from real depth-EEG signals by applying a parameter identification algorithm. In this paper, both short-term and long-term validations of the proposed model are performed. The long-term synthetic depth-EEG signals simulated by this model can serve as a suitable tool for comparing different seizure prediction algorithms.

Many biological quantities cannot be measured directly but rather need to be estimated from models. Estimates from models are statistical objects with variance and, when derived simultaneously, covariance. It is well known that their variance–covariance (VC) matrix must be considered in subsequent analyses. Although it is always preferable to carry out the proposed analyses on the raw data themselves, a two‐step approach cannot always be avoided. This situation arises when the parameters of a multinomial must be regressed against a covariate. The Delta method is an appropriate and frequently recommended way of deriving variance approximations of transformed and correlated variables. Implementing the Delta method is not trivial, and there is a lack of detailed information on the procedure in the literature for complex situations such as those involved in constraining the parameters of a multinomial distribution. This paper proposes a how‐to guide for calculating the correct VC matrices of dependent estimates involved in multinomial distributions and for using them to test the effects of covariates in post hoc analyses when the integration of these analyses directly into a model is not possible. For illustrative purposes, we focus on variables calculated in capture–recapture models, but the same procedure can be applied to all analyses dealing with correlated estimates with multinomial distribution and their variances and covariances.
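A small worked case may make the Delta method concrete. The sketch below approximates the variance of log(p1/p2) for estimated multinomial proportions using Var(g) ≈ grad' Σ grad, where Σ is the multinomial variance–covariance matrix. The function names and the numbers are assumptions for illustration, not the paper's capture–recapture example.

```python
def multinomial_cov(p, n):
    """VC matrix of estimated proportions from n multinomial trials:
    Var(p_i) = p_i (1 - p_i) / n and Cov(p_i, p_j) = -p_i p_j / n."""
    k = len(p)
    return [[(p[i] * (1.0 - p[i]) if i == j else -p[i] * p[j]) / n
             for j in range(k)]
            for i in range(k)]

def delta_variance(grad, cov):
    """First-order Delta-method variance: grad' * cov * grad."""
    k = len(grad)
    return sum(grad[i] * cov[i][j] * grad[j]
               for i in range(k) for j in range(k))

# Hypothetical proportions and sample size.
p = [0.2, 0.3, 0.5]
n = 100
cov = multinomial_cov(p, n)

# g(p) = log(p[0] / p[1]) has gradient (1/p[0], -1/p[1], 0).
grad = [1.0 / p[0], -1.0 / p[1], 0.0]
var_log_ratio = delta_variance(grad, cov)
```

Ignoring the negative covariance between the two proportions would understate this variance; with it included, the result reduces to 1/(n p1) + 1/(n p2).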

ABSTRACT: BACKGROUND: A number of software packages are available to generate DNA multiple sequence alignments (MSAs) evolved under continuous-time Markov processes on phylogenetic trees. On the other hand, methods of simulating a DNA MSA directly from the transition matrices do not exist. Moreover, existing software is restricted to time-reversible models and is not optimized to generate nonhomogeneous data (i.e. placing distinct substitution rates on different lineages). RESULTS: We present the first package designed to generate MSAs evolving under discrete-time Markov processes on phylogenetic trees, directly from probability substitution matrices. Based on the input model and a phylogenetic tree in the Newick format (with branch lengths measured as the expected number of substitutions per site), the algorithm produces DNA alignments of the desired length. GenNon-h is publicly available for download. CONCLUSION: The software presented here is an efficient tool to generate DNA MSAs on a given phylogenetic tree. GenNon-h provides the user with nonstationary or nonhomogeneous phylogenetic data that are well suited for testing complex biological hypotheses and for exploring the limits of the reconstruction algorithms and their robustness to such models.

The ensemble modeling (EM) approach has shown promise in capturing kinetic and regulatory effects in the modeling of metabolic networks. Efficacy of the EM procedure relies on the identification of model parameterizations that adequately describe all observed metabolic phenotypes upon perturbation. In this study, we propose an optimization-based algorithm for the systematic identification of genetic/enzyme perturbations to maximally reduce the number of models retained in the ensemble after each round of model screening. The key premise here is to design perturbations that will maximally scatter the predicted steady-state fluxes over the ensemble parameterizations. We demonstrate the applicability of this procedure for an Escherichia coli metabolic model of central metabolism by successively identifying single, double, and triple enzyme perturbations that cause the maximum degree of flux separation between models in the ensemble. Results revealed that optimal perturbations are not always located close to reaction(s) whose fluxes are measured, especially when multiple perturbations are considered. In addition, there appears to be a maximum number of simultaneous perturbations beyond which no appreciable increase in the divergence of flux predictions is achieved. Overall, this study provides a systematic way of optimally designing genetic perturbations for populating the ensemble of models with relevant model parameterizations.

I explore the use of multiple regression on distance matrices (MRM), an extension of partial Mantel analysis, in spatial analysis of ecological data. MRM involves a multiple regression of a response matrix on any number of explanatory matrices, where each matrix contains distances or similarities (in terms of ecological, spatial, or other attributes) between all pair-wise combinations of n objects (sample units); tests of statistical significance are performed by permutation. The method is flexible in terms of the types of data that may be analyzed (counts, presence–absence, continuous, categorical) and the shapes of response curves. MRM offers several advantages over traditional partial Mantel analysis: (1) separating environmental distances into distinct distance matrices allows inferences to be made at the level of individual variables; (2) nonparametric or nonlinear multiple regression methods may be employed; and (3) spatial autocorrelation may be quantified and tested at different spatial scales using a series of lag matrices, each representing a geographic distance class. The MRM lag matrices model may be parameterized to yield inferences regarding spatial autocorrelation that are very similar to those of the Mantel correlogram. Unlike the correlogram, however, the lag matrices model may also include environmental distance matrices, so that spatial patterns in species abundance distances (community similarity) may be quantified while controlling for the environmental similarity between sites. Examples of spatial analyses with MRM are presented.

Semi-Markov and modulated renewal processes provide a large class of multi-state models which can be used for analysis of longitudinal failure time data. In biomedical applications, models of this kind are often used to describe evolution of a disease and assume that a patient may move among a finite number of states representing different phases in the disease progression. Several authors proposed extensions of the proportional hazard model for regression analysis of these processes. In this paper, we consider a general class of censored semi-Markov and modulated renewal processes and propose the use of transformation models for their analysis. Special cases include modulated renewal processes with interarrival times specified using transformation models, and semi-Markov processes with one-step transition probabilities defined using copula-transformation models. We discuss estimation of finite and infinite dimensional parameters and develop an extension of the Gaussian multiplier method for setting confidence bands for transition probabilities and related parameters. A transplant outcome data set from the Center for International Blood and Marrow Transplant Research is used for illustrative purposes.

In the analysis of the feeding habits of the 11 most abundant fish species in the Guadalquivir Estuary, collected monthly (February 1998 to January 1999) at two different sampling sites, a total of 46 prey taxa were identified. Classifications (based on Bray–Curtis similarities derived from occurrence, number and mass data) of the different fish categories (postlarvae and juvenile–adults of each species) revealed two main trophic guilds, whose preferential prey (SIMPER analysis) were mysids and copepods, respectively. The similarity matrices derived from occurrence, number and mass data were always significantly correlated (RELATE: r > 0.636; P < 0.01), indicating that a good agreement in feeding patterns emerged from these variables. The seasonal coincidence of maximal fish and key-prey species densities suggests that food availability may be a principal factor influencing the nursery function of the Guadalquivir Estuary.

A class of generalized linear mixed models can be obtained by introducing random effects in the linear predictor of a generalized linear model, e.g. a split plot model for binary data or count data. Maximum likelihood estimation, for normally distributed random effects, involves high-dimensional numerical integration, with severe limitations on the number and structure of the additional random effects. An alternative estimation procedure based on an extension of the iterative re-weighted least squares procedure for generalized linear models will be illustrated on a practical data set involving carcass classification of cattle. The data are analysed as overdispersed binomial proportions with fixed and random effects and associated components of variance on the logit scale. Estimates are obtained with standard software for normal data mixed models. Numerical restrictions pertain to the size of matrices to be inverted. This can be dealt with by absorption techniques familiar from e.g. mixed models in animal breeding. The final model fitted to the classification data includes four components of variance and a multiplicative overdispersion factor. Basically, the estimation procedure is a combination of iterated least squares procedures and no full distributional assumptions are needed. A simulation study based on the classification data is presented. This includes a study of procedures for constructing confidence intervals and significance tests for fixed effects and components of variance. The simulation results increase confidence in the usefulness of the estimation procedure.

Null models exploring species co-occurrence and trait-based limiting similarity are increasingly used to explore the influence of competition on community assembly; however, assessments of common models have not thoroughly explored the influence of variation in matrix size on error rates, in spite of the fact that studies have explored community matrices that vary considerably in size. To determine how smaller matrices, which are of greatest concern, perform statistically, we generated biologically realistic presence-absence matrices ranging in size from 3 to 50 species and sites, as well as associated trait matrices. We examined co-occurrence tests using the C-Score statistic and independent swap algorithm. For trait-based limiting similarity null models, we used the mean nearest neighbour trait distance (NN) and the standard deviation of nearest neighbour distances (SDNN) as test statistics, and considered two common randomization algorithms: abundance independent trait shuffling (AITS), and abundance weighted trait shuffling (AWTS). Matrices as small as three × three resulted in acceptable type I error rates (p < 0.05) for both the co-occurrence and trait-based limiting similarity null models when exclusive p-values were used. The commonly used inclusive p-value (≤ or ≥, as opposed to exclusive p-values; < or >) was associated with increased type I error rates, particularly for matrices with fewer than eight species. Type I error rates increased for limiting similarity tests using the AWTS randomization scheme when community matrices contained more than 35 sites; a similar randomization used in null models of phylogenetic dispersion has previously been viewed as robust. Notwithstanding other potential deficiencies related to the use of small matrices to represent communities, the application of both classes of null model should be restricted to matrices with 10 or more species to avoid the possibility of type II errors. Additionally, researchers should restrict the use of the AWTS randomization to matrices with fewer than 35 sites to avoid type I errors when testing for trait-based limiting similarity. The AITS randomization scheme performed better in terms of type I error rates, and therefore may be more appropriate when considering systems for which traits are not clustered by abundance.
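For readers unfamiliar with the test statistic, the sketch below computes the C-Score as the mean number of checkerboard units over all species pairs, and shows how exclusive and inclusive p-values differ only in the treatment of ties with the observed value. The function names and the null values are hypothetical, and the independent swap randomization that would generate a real null distribution is not shown.

```python
from itertools import combinations

def c_score(matrix):
    """Mean checkerboard units over all species pairs.

    matrix : presence-absence data, rows are species, columns are sites.
    For a pair with r_a and r_b occurrences and s shared sites, the number
    of checkerboard units is (r_a - s) * (r_b - s).
    """
    units = []
    for a, b in combinations(matrix, 2):
        shared = sum(x * y for x, y in zip(a, b))
        units.append((sum(a) - shared) * (sum(b) - shared))
    return sum(units) / len(units)

def upper_tail_p(observed, null_values, inclusive=False):
    """Fraction of null values above the observed statistic.

    inclusive=True also counts ties (>=); inclusive=False counts only
    strictly larger values (>).
    """
    if inclusive:
        hits = sum(v >= observed for v in null_values)
    else:
        hits = sum(v > observed for v in null_values)
    return hits / len(null_values)

checkerboard = [[1, 0], [0, 1]]   # two species that never co-occur
nested = [[1, 1], [1, 1]]         # two species that always co-occur
null_values = [1, 3, 3, 5]        # hypothetical null statistics
```

Small matrices admit few distinct randomizations, so ties with the observed value are common and the two conventions can give noticeably different p-values.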

Cluster analysis has proven to be a valuable statistical method for analyzing whole genome expression data. Although clustering methods have great utility, they do represent a lower level statistical analysis that is not directly tied to a specific model. To extend such methods and to allow for more sophisticated lines of inference, we use cluster analysis in conjunction with a specific model of gene expression dynamics. This model provides phenomenological dynamic parameters on both linear and non-linear responses of the system. This analysis determines the parameters of two different transition matrices (linear and nonlinear) that describe the influence of one gene expression level on another. Using yeast cell cycle microarray data as a test set, we calculated the transition matrices and used these dynamic parameters as a metric for cluster analysis. Hierarchical cluster analysis of this transition matrix reveals how a set of genes influences the expression of other genes activated during different cell cycle phases. Most strikingly, genes in different stages of the cell cycle preferentially activate or inactivate genes in other stages of the cell cycle, and this relationship can be readily visualized in a two-way clustering image. This observation is obtained without any prior knowledge of the chronological characteristics of the cell cycle process. This method shows the utility of using model parameters as a metric in cluster analysis.

Summary: The vast majority of population models work with age or stage rather than length, but there are many cases where animals cannot be aged sensibly or accurately. For these cases, length‐based models form the logical alternative, but there has been little work done to develop and compare different methods of estimating growth transition matrices to be used in such models. This article demonstrates how a consistent Bayesian framework for estimating growth parameters and a novel method for constructing length transition matrices account for variation in growth in a clear and consistent manner and avoid potential subjective choices required when using more established methods. The inclusion of the resultant growth uncertainty in population assessment models and the potential impact on management decisions are also addressed.

A typical task in the application of aggregated Markov models to ion channel data is the estimation of the transition rates between the states. Realistic models for ion channel data often have one or more loops. We show that the transition rates of a model with loops are not identifiable if the model has either equal open or closed dwell times. This non-identifiability of the transition rates also has an effect on the estimation of the transition rates for models which are not subject to the constraint of either equal open or closed dwell times. If a model with loops has nearly equal dwell times, the Hessian matrix of its likelihood function will be ill-conditioned and the standard deviations of the estimated transition rates become extraordinarily large for a number of data points which are typically recorded in experiments.

Modeling vital rates improves estimation of population projection matrices   Total citations: 1, self-citations: 1, citations by others: 0
Population projection matrices are commonly used by ecologists and managers to analyze the dynamics of stage-structured populations. Building projection matrices from data requires estimating transition rates among stages, a task that often entails estimating many parameters with few data. Consequently, large sampling variability in the estimated transition rates increases the uncertainty in the estimated matrix and quantities derived from it, such as the population multiplication rate and sensitivities of matrix elements. Here, we propose a strategy to avoid overparameterized matrix models. This strategy involves fitting models to the vital rates that determine matrix elements, evaluating both these models and ones that estimate matrix elements individually with model selection via information criteria, and averaging competing models with multimodel averaging. We illustrate this idea with data from a population of Silene acaulis (Caryophyllaceae), and conduct a simulation to investigate the statistical properties of the matrices estimated in this way. The simulation shows that compared with estimating matrix elements individually, building population projection matrices by fitting and averaging models of vital-rate estimates can reduce the statistical error in the population projection matrix and quantities derived from it.
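The population multiplication rate mentioned above is the dominant eigenvalue of the projection matrix. A minimal sketch using power iteration on a hypothetical two-stage matrix follows; the function name `growth_rate` and the matrix entries are assumptions for illustration and have nothing to do with the Silene acaulis data.

```python
def growth_rate(A, iters=1000):
    """Estimate the dominant eigenvalue of a projection matrix A.

    A[i][j] is the per-capita contribution of stage j to stage i in the
    next time step. The stage vector is rescaled to sum to 1 after each
    step, so the total of the projected vector converges to the
    multiplication rate.
    """
    n = len(A)
    x = [1.0 / n] * n
    lam = 0.0
    for _ in range(iters):
        y = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        lam = sum(y)
        x = [v / lam for v in y]
    return lam

# Hypothetical two-stage life cycle: row 0 holds juvenile stasis and adult
# fecundity, row 1 holds the juvenile-to-adult transition.
A = [[0.5, 2.0],
     [0.5, 0.0]]
lam = growth_rate(A)
```

Because every entry of the matrix feeds into this one number, sampling error in individual transition rates propagates directly into the estimated multiplication rate.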
