Similar articles
 20 similar articles found (search time: 265 ms)
1.
Comparative sequence analyses, including such fundamental bioinformatics techniques as similarity searching, sequence alignment and phylogenetic inference, have become a mainstay for researchers studying type 1 Human Immunodeficiency Virus (HIV-1) genome structure and evolution. Implicit in comparative analyses is an underlying model of evolution, and the chosen model can significantly affect the results. In general, evolutionary models describe the probabilities of replacing one amino acid character with another over a period of time. Most widely used evolutionary models for protein sequences have been derived from curated alignments of hundreds of proteins, usually based on mammalian genomes. It is unclear to what extent these empirical models are generalizable to a very different organism, such as HIV-1, the most extensively sequenced organism in existence. We developed a maximum likelihood model fitting procedure, applied it to a collection of HIV-1 alignments sampled from different viral genes, and inferred two empirical substitution models, suitable for describing between- and within-host evolution. Our procedure pools the information from multiple sequence alignments, and the provided software implementation can be run efficiently in parallel on a computer cluster. We describe how the inferred substitution models can be used to generate scoring matrices suitable for alignment and similarity searches. Our models had a consistently superior fit relative to the best existing models and to parameter-rich data-driven models when benchmarked on independent HIV-1 alignments, demonstrating evolutionary biases in amino-acid substitution that are unique to HIV and that are not captured by the existing models. The scoring matrices derived from the models showed a marked difference from common amino-acid scoring matrices. The use of an appropriate evolutionary model recovered a known viral transmission history, whereas a poorly chosen model introduced phylogenetic error. We argue that our model derivation procedure is immediately applicable to other organisms with extensive sequence data available, such as Hepatitis C and Influenza A viruses.
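
The step from a substitution model to a scoring matrix can be sketched as follows. This is one standard log-odds construction, not necessarily the procedure used in the article: a reversible rate matrix is built from exchangeabilities and equilibrium frequencies, exponentiated to get substitution probabilities P(t), and each score is a scaled log2 of P_ij(t) / pi_j. The three-letter alphabet, the values in `EXCH` and `FREQS`, the divergence 0.5 and all function names are hypothetical and chosen only to keep the example small.

```python
import math

def mat_mul(a, b):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def mat_exp(q, t, squarings=10, terms=12):
    """exp(Q*t) by scaling and squaring with a truncated Taylor series."""
    n = len(q)
    step = t / 2 ** squarings
    a = [[q[i][j] * step for j in range(n)] for i in range(n)]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in mat_mul(term, a)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

def rate_matrix(exch, freqs):
    """Reversible Q with Q_ij = s_ij * pi_j, scaled to one expected substitution per unit time."""
    n = len(freqs)
    q = [[exch[i][j] * freqs[j] if i != j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(n):
        q[i][i] = -sum(q[i])  # diagonal makes each row sum to zero
    rate = -sum(freqs[i] * q[i][i] for i in range(n))
    return [[x / rate for x in row] for row in q]

def log_odds(p, freqs, scale=2.0):
    """Score S_ij = scale * log2(P_ij / pi_j); positive means 'more likely than chance'."""
    n = len(freqs)
    return [[scale * math.log2(p[i][j] / freqs[j]) for j in range(n)] for i in range(n)]

# Toy 3-letter alphabet with made-up parameters (not estimated from any data).
EXCH = [[0.0, 2.0, 0.5],
        [2.0, 0.0, 1.0],
        [0.5, 1.0, 0.0]]
FREQS = [0.5, 0.3, 0.2]

P = mat_exp(rate_matrix(EXCH, FREQS), 0.5)
S = log_odds(P, FREQS)
for row in S:
    print([round(x) for x in row])
```

Because the toy model is reversible, the resulting score matrix is symmetric, which is what alignment programs usually expect; changing the divergence `t` plays the role that the numeric suffix plays in the PAM series.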

2.
Tan YH, Huang H, Kihara D. Proteins 2006, 64(3): 587-600
Aligning distantly related protein sequences is a long-standing problem in bioinformatics, and a key to successful protein structure prediction. Its importance has grown recently in the context of structural genomics projects because more and more experimentally solved structures are available as templates for protein structure modeling. Toward this end, recent structure prediction methods employ profile-profile alignments, and various ways of aligning two profiles have been developed. More fundamentally, a better amino acid similarity matrix can improve a profile itself, thereby resulting in more accurate profile-profile alignments. Here we have developed novel amino acid similarity matrices from knowledge-based amino acid contact potentials. Contact potentials are used because the contact propensity to the other amino acids would be one of the most conserved features of each position of a protein structure. The derived amino acid similarity matrices are tested on benchmark alignments at three different levels, namely, the family, the superfamily, and the fold level. Compared to BLOSUM45 and the other existing matrices, the contact potential-based matrices perform comparably in the family level alignments, but clearly outperform them in the fold level alignments. The contact potential-based matrices perform even better when suboptimal alignments are considered. Comparing the matrices themselves with each other revealed that the contact potential-based matrices are very different from BLOSUM45 and the other matrices, indicating that they are located in a different basin in the amino acid similarity matrix space.

3.
We derive an expectation maximization algorithm for maximum-likelihood training of substitution rate matrices from multiple sequence alignments. The algorithm can be used to train hidden substitution models, where the structural context of a residue is treated as a hidden variable that can evolve over time. We used the algorithm to train hidden substitution matrices on protein alignments in the Pfam database. When the accuracy of multiple alignment algorithms is measured with reference to BAliBASE (a database of structural reference alignments), our substitution matrices consistently outperform the PAM series, with the improvement steadily increasing as up to four hidden site classes are added. We discuss several applications of this algorithm in bioinformatics.

4.
Most protein substitution models use a single amino acid replacement matrix summarizing the biochemical properties of amino acids. However, site evolution is highly heterogeneous and depends on many factors that influence the substitution patterns. In this paper, we investigate the use of different substitution matrices for different site evolutionary rates. Indeed, the variability of evolutionary rates corresponds to one of the most apparent heterogeneity factors among sites, and there is no reason to assume that the substitution patterns remain identical regardless of the evolutionary rate. We first introduce LG4M, which is composed of four matrices, each corresponding to one discrete gamma rate category (of four). These matrices differ in their amino acid equilibrium distributions and in their exchangeabilities, contrary to the standard gamma model where only the global rate differs from one category to another. Next, we present LG4X, which also uses four different matrices, but leaves aside the gamma distribution and follows a distribution-free scheme for the site rates. All these matrices are estimated from a very large alignment database, and our two models are tested using a large sample of independent alignments. Detailed analysis of resulting matrices and models shows the complexity of amino acid substitutions and the advantage of flexible models such as LG4M and LG4X. Both significantly outperform single-matrix models, providing gains of dozens to hundreds of log-likelihood units for most data sets. LG4X obtains substantial gains compared with LG4M, thanks to its distribution-free scheme for site rates. Since LG4M and LG4X display such advantages but require the same memory space and have comparable running times to standard models, we believe that LG4M and LG4X are relevant alternatives to single replacement matrices. Our models, data, and software are available from http://www.atgc-montpellier.fr/models/lg4x.
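
How a four-category model combines its categories at one alignment site can be sketched as a weighted mixture. This is a minimal illustration, assuming the per-category site log-likelihoods have already been computed by a phylogenetics program; the function name, the weights and the numbers are hypothetical. Equal weights mimic a discrete gamma scheme with four equiprobable categories, while free weights mimic a distribution-free scheme.

```python
import math

def mixture_log_likelihood(site_cat_loglik, weights):
    """Total log-likelihood when each site is a weighted mixture over categories.

    site_cat_loglik[s][k] is log L(site s | category k); weights sum to 1.
    A log-sum-exp keeps the computation stable for very negative values.
    """
    log_w = [math.log(w) for w in weights]
    total = 0.0
    for per_cat in site_cat_loglik:
        terms = [lw + ll for lw, ll in zip(log_w, per_cat)]
        m = max(terms)
        total += m + math.log(sum(math.exp(x - m) for x in terms))
    return total

# Hypothetical per-site, per-category log-likelihoods for three sites.
site_ll = [[-12.0, -10.5, -11.0, -14.0],
           [-30.0, -27.0, -22.0, -21.5],
           [-8.0, -8.2, -9.5, -13.0]]

equal = mixture_log_likelihood(site_ll, [0.25, 0.25, 0.25, 0.25])
free = mixture_log_likelihood(site_ll, [0.4, 0.3, 0.2, 0.1])
print(round(equal, 3), round(free, 3))
```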

5.
Bioinformatic software has used various numerical encoding schemes to describe amino acid sequences. Orthogonal encoding, employing 20 numbers to describe the amino acid type of one protein residue, is often used with artificial neural network (ANN) models. However, this can increase the model complexity, thus leading to difficulty in implementation and poor performance. Here, we use ANNs to derive encoding schemes for the amino acid types from protein three-dimensional structure alignments. Each of the 20 amino acid types is characterized with a few real numbers. Our schemes are tested on the simulation of amino acid substitution matrices. These simplified schemes outperform the orthogonal encoding on small data sets. Using one of these encoding schemes, we generate a colouring scheme for the amino acids in which comparable amino acids are in similar colours. We expect it to be useful for visual inspection and manual editing of protein multiple sequence alignments.

6.
One of the biggest problems in modeling distantly related proteins is the quality of the target-template alignment. This problem often results in low quality models that do not utilize all the information available in the template structure. The divergence of alignments at a low sequence identity level, which is a hindrance in most modeling attempts, is used here as the basis for a new technique, the Multiple Model Approach (MMA). Alternative alignments prepared here using different mutation matrices and gap penalties, combined with automated model building, are used to create a set of models that explore a range of possible conformations for the target protein. Models are evaluated using different techniques to identify the best model. In the set of examples studied here, the correct target structure is known, which allows the evaluation of various alignment and evaluation strategies. For a randomly selected group of distantly homologous protein pairs representing all structural classes and various fold types, it is shown that a threading score based on simplified statistical potentials of mean force can identify the best models and, consequently, the most reliable alignment. In cases where the difference between target and template structures is significant, the threading score shows clearly that all models are wrong, therefore disqualifying the template.

7.
Modelling invasion for a habitat generalist and a specialist plant species   Total citations: 2, self-citations: 0, citations by others: 2
Predicting suitable habitat and the potential distribution of invasive species is a high priority for resource managers and systems ecologists. Most models are designed to identify habitat characteristics that define the ecological niche of a species with little consideration of individual species' traits. We tested five commonly used modelling methods on two invasive plant species, the habitat generalist Bromus tectorum and habitat specialist Tamarix chinensis, to compare model performances, evaluate predictability, and relate results to distribution traits associated with each species. Most of the tested models performed similarly for each species; however, the generalist species proved to be more difficult to predict than the specialist species. The highest area under the receiver-operating characteristic curve values with independent validation data sets of B. tectorum and T. chinensis were 0.503 and 0.885, respectively. Similarly, a confusion matrix for B. tectorum had the highest overall accuracy of 55%, while the overall accuracy for T. chinensis was 85%. Models for the generalist species had varying performances, poor evaluations, and inconsistent results. This may be a result of a generalist's capability to persist in a wide range of environmental conditions that are not easily defined by the data, independent variables or model design. Models for the specialist species had consistently strong performances, high evaluations, and similar results among different model applications. This is likely a consequence of the specialist's requirement for explicit environmental resources and ecological barriers that are easily defined by predictive models. Although defining new invaders as generalist or specialist species can be challenging, model performances and evaluations may provide valuable information on a species' potential invasiveness.
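
The two evaluation measures quoted above can be computed as in the following sketch. It assumes a model that outputs a suitability score per site and validation data labelled present or absent; the scores and counts below are invented for illustration and are not the study's data. The AUC is computed in its rank form, the probability that a randomly chosen presence site scores higher than a randomly chosen absence site, so a value near 0.5 means the model ranks no better than chance.

```python
def auc(scores_present, scores_absent):
    """Area under the ROC curve via pairwise comparison (ties count as half)."""
    wins = 0.0
    for p in scores_present:
        for a in scores_absent:
            if p > a:
                wins += 1.0
            elif p == a:
                wins += 0.5
    return wins / (len(scores_present) * len(scores_absent))

def overall_accuracy(tp, fp, fn, tn):
    """Share of correct predictions in a 2x2 confusion matrix."""
    return (tp + tn) / (tp + fp + fn + tn)

# Hypothetical validation scores for presence and absence sites.
present = [0.9, 0.7, 0.65, 0.4]
absent = [0.6, 0.3, 0.2, 0.1]
print(auc(present, absent))              # 0.9375
print(overall_accuracy(30, 10, 5, 55))   # 0.85
```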

8.
Standard protein substitution models use a single amino acid replacement rate matrix that summarizes the biological, chemical and physical properties of amino acids. However, site evolution is highly heterogeneous and depends on many factors: genetic code; solvent exposure; secondary and tertiary structure; protein function; etc. These impact the substitution pattern and, in most cases, a single replacement matrix is not enough to represent all the complexity of the evolutionary processes. This paper explores, in a maximum-likelihood framework, phylogenetic mixture models that combine several amino acid replacement matrices to better fit protein evolution. We learn these mixture models from a large alignment database extracted from HSSP, and test the performance using independent alignments from TREEBASE. We compare unsupervised learning approaches, where the site categories are unknown, to supervised ones, where in estimations we use the known category of each site, based on its exposure or its secondary structure. All our models are combined with gamma-distributed rates across sites. Results show that highly significant likelihood gains are obtained when using mixture models compared with the best available single replacement matrices. Mixtures of matrices also improve over mixtures of profiles in the manner of the CAT model. The unsupervised approach tends to be better than the supervised one, but it appears difficult to implement and highly sensitive to the starting values of the parameters, meaning that the supervised approach is still of interest for initialization and model comparison. Using an unsupervised model involving three matrices, the average AIC gain per site with TREEBASE test alignments is 0.31, 0.49 and 0.61 compared with LG (named after Le & Gascuel 2008 Mol. Biol. Evol. 25, 1307-1320), WAG and JTT, respectively. This three-matrix model is significantly better than LG for 34 alignments (among 57), and significantly worse for 1 alignment only. Moreover, tree topologies inferred with our mixture models frequently differ from those obtained with single matrices, indicating that using these mixtures impacts not only the likelihood value but also the output tree. All our models and a PhyML implementation are available from http://atgc.lirmm.fr/mixtures.
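
An AIC gain per site can be worked through as below. The sketch uses the usual definition AIC = 2k - 2 ln L and assumes the per-site gain is the AIC difference between the reference model and the new model divided by the alignment length; the abstract does not spell out its exact definition, and the log-likelihoods, parameter counts and alignment length here are invented.

```python
def aic(log_likelihood, n_free_params):
    """Akaike information criterion: lower is better."""
    return 2 * n_free_params - 2 * log_likelihood

def aic_gain_per_site(loglik_ref, k_ref, loglik_new, k_new, n_sites):
    """Positive values mean the new model is preferred over the reference."""
    return (aic(loglik_ref, k_ref) - aic(loglik_new, k_new)) / n_sites

# Hypothetical alignment of 400 sites: a single-matrix fit versus a mixture fit
# that improves the log-likelihood by 100 units at the cost of 4 extra parameters.
gain = aic_gain_per_site(loglik_ref=-5000.0, k_ref=10,
                         loglik_new=-4900.0, k_new=14, n_sites=400)
print(gain)
```

The penalty term is why a mixture model has to improve the likelihood by more than its number of extra free parameters before it shows a positive gain.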

9.
Tiffin P, Hacker R, Gaut BS. Genetics 2004, 168(1): 425-434
Two patterns of plant defense gene evolution are emerging from molecular population genetic surveys. One is that specialist defenses experience stronger selection than generalist defenses. The second is that specialist defenses are more likely to be subject to balancing selection, i.e., evolve in a manner consistent with balanced-polymorphism or trench-warfare models of host-parasite coevolution. Because most of the data on specialist defenses come from Arabidopsis thaliana, we examined the genetic diversity and evolutionary history of three defense genes in two outcrossing species, the autotetraploid Zea perennis and its most closely related extant relative, the diploid Z. diploperennis. Intraspecific diversity at two generalist defenses, the protease inhibitors wip1 and mpi, was consistent with a neutral model. Like previously studied genes in these taxa, wip1 and mpi harbored similar levels of diversity in Z. diploperennis and Z. perennis. In contrast, the specialist defense hm2 showed strong although distinctly different departures from a neutral model in the two species. Z. diploperennis appears to have experienced a strong and recent selective sweep. Using a rejection-sampling coalescent method, we estimate the strength of selection on Z. diploperennis hm2 to be approximately 3.0%, which is approximately equal to the strength of selection on tb1 during maize domestication. Z. perennis hm2 harbors three highly diverged alleles, two of which are found at high frequency. The distinctly different patterns of diversity may be due to differences in the phase of host-parasite coevolutionary cycles, although higher hm2 diversity in Z. perennis may also reflect reduced efficacy of selection in the autotetraploid relative to its diploid relative.

10.
Sequence divergence among orthologous proteins was characterized with 34 amino acid replacement matrices, sequence context analysis, and a phylogenetic tree. The model was trained on very large datasets of aligned protein sequences drawn from 15 organisms including protists, plants, Dictyostelium, fungi, and animals. Comparative tests with models currently used in phylogeny, i.e., with JTT+Γ±F and WAG+Γ±F, made on a test dataset of 380 multiple alignments containing protein sequences from all five of the major taxonomic groups mentioned, indicate that our model should be preferred over the JTT+Γ±F and WAG+Γ±F models on datasets similar to the test dataset. The strong performance of our model of orthologous protein sequence divergence can be attributed to its ability to better approximate amino acid equilibrium frequencies to compositions found in alignment columns. Electronic supplementary material is available for this article and accessible to authorised users. [Reviewing Editor: Dr. Martin Kreitman]

11.
ABSTRACT: BACKGROUND: A number of software packages are available to generate DNA multiple sequence alignments (MSAs) evolved under continuous-time Markov processes on phylogenetic trees. On the other hand, methods of simulating the DNA MSA directly from the transition matrices do not exist. Moreover, existing software is restricted to time-reversible models and is not optimized to generate nonhomogeneous data (i.e. placing distinct substitution rates on different lineages). RESULTS: We present the first package designed to generate MSAs evolving under discrete-time Markov processes on phylogenetic trees, directly from probability substitution matrices. Based on the input model and a phylogenetic tree in the Newick format (with branch lengths measured as the expected number of substitutions per site), the algorithm produces DNA alignments of desired length. GenNon-h is publicly available for download. CONCLUSION: The software presented here is an efficient tool to generate DNA MSAs on a given phylogenetic tree. GenNon-h provides the user with the nonstationary or nonhomogeneous phylogenetic data that is well suited for testing complex biological hypotheses, exploring the limits of the reconstruction algorithms and their robustness to such models.
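
The idea of generating an alignment directly from per-branch transition matrices can be sketched as follows. This is an illustration of the general technique only and does not reproduce GenNon-h's interface or its handling of Newick input: the tree is a hand-written nested tuple, every edge carries its own 4x4 matrix of substitution probabilities, and all names and numbers are hypothetical. Giving different edges different (and not necessarily symmetric) matrices is what makes the simulated data nonhomogeneous.

```python
import random

BASES = "ACGT"

def sample(probs, rng):
    """Draw a base index 0..3 according to a probability vector."""
    return rng.choices(range(4), weights=probs, k=1)[0]

def evolve(parent_seq, matrix, rng):
    """Apply one edge: row b of the matrix is the distribution of the child base."""
    return [sample(matrix[b], rng) for b in parent_seq]

def simulate(tree, root_dist, length, rng):
    """tree = (name, [(child_tree, edge_matrix), ...]); returns {leaf: sequence}."""
    alignment = {}

    def walk(node, seq):
        name, children = node
        if not children:  # leaf: record its sequence
            alignment[name] = "".join(BASES[b] for b in seq)
        for child, matrix in children:
            walk(child, evolve(seq, matrix, rng))

    walk(tree, [sample(root_dist, rng) for _ in range(length)])
    return alignment

def uniform_change(p):
    """Matrix that keeps a base with probability 1 - p, else picks another uniformly."""
    return [[1 - p if i == j else p / 3 for j in range(4)] for i in range(4)]

# A non-symmetric matrix (hypothetical) that pushes sequences towards G and C.
GC_BIASED = [[0.85, 0.07, 0.07, 0.01],
             [0.01, 0.95, 0.03, 0.01],
             [0.01, 0.03, 0.95, 0.01],
             [0.01, 0.07, 0.07, 0.85]]

tree = ("root", [(("a", []), uniform_change(0.05)),
                 (("inner", [(("b", []), GC_BIASED),
                             (("c", []), uniform_change(0.20))]),
                  uniform_change(0.10))])

aln = simulate(tree, [0.25, 0.25, 0.25, 0.25], 50, random.Random(1))
for name in sorted(aln):
    print(name, aln[name])
```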

12.
MALDI mass spectrometry can simultaneously measure hundreds of biomolecules directly from tissue. Using essentially the same technique but different sample preparation strategies, metabolites, lipids, peptides and proteins can be analyzed. Spatially correlated analysis, imaging MS, enables the distributions of these biomolecular ions to be simultaneously measured in tissues. A key advantage of imaging MS is that it can annotate tissues based on their MS profiles and thereby distinguish biomolecularly distinct regions even if they were unexpected or are not distinct using established histological and histochemical methods. One example is the neuropeptide and metabolite changes that follow transient electrophysiological events such as cortical spreading depression (CSD). CSD events are spreading events of massive neuronal and glial depolarisation that occur in one hemisphere of the brain and do not pass to the other hemisphere, which enables the contralateral hemisphere to act as an internal control. A proof-of-principle imaging MS study, including 2D and 3D datasets, revealed substantial metabolite and neuropeptide changes immediately following CSD events which were absent in the protein imaging datasets. The large, high-dimensional 3D datasets make even rudimentary contralateral comparisons difficult to visualize. Instead, non-negative matrix factorization (NNMF), a multivariate factorization tool that is adept at highlighting latent features, such as MS signatures associated with CSD events, was applied to the 3D datasets. NNMF confirmed that the protein dataset did not contain substantial contralateral differences, while these were present in the neuropeptide dataset.
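
Non-negative matrix factorization itself can be sketched in a few lines. The example below uses the classic multiplicative update rules for the squared-error objective, which is one common way to compute the factorization and is not necessarily the variant used in the study. The data matrix is a tiny invented stand-in for "pixels by m/z channels"; real imaging MS datasets are far larger and would normally be handled with a numerical library.

```python
import random

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def nmf(v, rank, iters=200, seed=0, eps=1e-9):
    """Factor V (n x m, non-negative) as W (n x rank) times H (rank x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n)]
    return w, h

def frob_error(v, w, h):
    """Sum of squared differences between V and its reconstruction W H."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))

# Toy "pixels x channels" matrix: two groups of pixels with different signatures.
V = [[5.0, 4.0, 0.5, 0.2],
     [4.5, 4.2, 0.4, 0.3],
     [0.3, 0.5, 6.0, 5.5],
     [0.2, 0.4, 5.8, 6.1]]

W, H = nmf(V, rank=2)
print(round(frob_error(V, W, H), 4))
```

Each row of `H` is a latent spectral signature and each column of `W` says how strongly that signature is present at each pixel, which is what makes the factors readable as region maps.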

13.
Intraguild predation (IGP) is a combination of competition and predation. It is the most basic three-species system in food webs in which two species in a predator/prey relationship also compete for a shared resource or prey. We formulate two IGP models (resource, IG prey and IG predator): one has a generalist predator while the other has a specialist predator. Both models have a Holling Type I functional response between resource and IG prey and between resource and IG predator, and a Holling Type III functional response between IG prey and IG predator. We provide sufficient conditions for persistence and extinction in all possible scenarios for these two models, which gives a complete picture of their global dynamics. In addition, we show that both IGP models can have multiple interior equilibria within certain parameter ranges. These analytical results indicate that the IGP model with a generalist predator has “top down” regulation compared with the IGP model with a specialist predator. Our analysis and numerical simulations suggest that: (1) both IGP models can have multiple attractors with complicated dynamical patterns; (2) only the IGP model with a specialist predator can have both a boundary attractor and an interior attractor, i.e., whether the system ends in the extinction of one species or the coexistence of three species depends on initial conditions; (3) the IGP model with a generalist predator is prone to coexistence of three species.

14.
Protein homology detection using string alignment kernels   Total citations: 2, self-citations: 0, citations by others: 2
MOTIVATION: Remote homology detection between protein sequences is a central problem in computational biology. Discriminative methods involving support vector machines (SVMs) are currently the most effective methods for the problem of superfamily recognition in the Structural Classification Of Proteins (SCOP) database. The performance of SVMs depends critically on the kernel function used to quantify the similarity between sequences. RESULTS: We propose new kernels for strings adapted to biological sequences, which we call local alignment kernels. These kernels measure the similarity between two sequences by summing up scores obtained from local alignments with gaps of the sequences. When tested in combination with SVM on their ability to recognize SCOP superfamilies on a benchmark dataset, the new kernels outperform state-of-the-art methods for remote homology detection. AVAILABILITY: Software and data available upon request.

15.
Substitution matrices have been useful for sequence alignment and protein sequence comparisons. The BLOSUM series of matrices, which had been derived from a database of alignments of protein blocks, improved the accuracy of alignments previously obtained from the PAM-type matrices estimated from only closely related sequences. Although BLOSUM matrices are scoring matrices now widely used for protein sequence alignments, they do not describe an evolutionary model. BLOSUM matrices do not permit the estimation of the actual number of amino acid substitutions between sequences by correcting for multiple hits. The method presented here uses the Blocks database of protein alignments, along with the additivity of evolutionary distances, to approximate the amino acid substitution probabilities as a function of actual evolutionary distance. The PMB (Probability Matrix from Blocks) defines a new evolutionary model for protein evolution that can be used for evolutionary analyses of protein sequences. Our model is directly derived from, and thus compatible with, the BLOSUM matrices. The model has the additional advantage of being easily implemented.

16.
17.
18.
MOTIVATION: In recent years, advances have been made in the ability of computational methods to discriminate between homologous and non-homologous proteins in the 'twilight zone' of sequence similarity, where the percent sequence identity is a poor indicator of homology. To make these predictions more valuable to the protein modeler, they must be accompanied by accurate alignments. Pairwise sequence alignments are inferences of orthologous relationships between sequence positions. Evolutionary distance is traditionally modeled using global amino acid substitution matrices. But real differences in the likelihood of substitutions may exist for different structural contexts within proteins, since structural context contributes to the selective pressure. RESULTS: HMMSUM (HMMSTR-based substitution matrices) is a new model for structural context-based amino acid substitution probabilities consisting of a set of 281 matrices, each for a different sequence-structure context. HMMSUM does not require the structure of the protein to be known. Instead, predictions of local structure are made using HMMSTR, a hidden Markov model for local structure. Alignments using the HMMSUM matrices compare favorably to alignments carried out using the BLOSUM matrices or structure-based substitution matrices SDM and HSDM when validated against remote homolog alignments from BAliBASE. HMMSUM has been implemented using local Dynamic Programming and with the Bayesian Adaptive alignment method.

19.
Rigorous assessments of protein structure prediction have demonstrated that fold recognition methods can identify remote similarities between proteins when standard sequence search methods fail. It has been shown that the accuracy of predictions is improved when refined multiple sequence alignments are used instead of single sequences and if different methods are combined to generate a consensus model. There are several meta-servers available that integrate protein structure predictions performed by various methods, but they do not allow for submission of user-defined multiple sequence alignments and they seldom offer confidentiality of the results. We developed a novel WWW gateway for protein structure prediction, which combines the useful features of other meta-servers available, but with much greater flexibility of the input. The user may submit an amino acid sequence or a multiple sequence alignment to a set of methods for primary, secondary and tertiary structure prediction. Fold-recognition results (target-template alignments) are converted into full-atom 3D models and the quality of these models is uniformly assessed. A consensus between different FR methods is also inferred. The results are conveniently presented on-line on a single web page over a secure, password-protected connection. The GeneSilico protein structure prediction meta-server is freely available for academic users at http://genesilico.pl/meta.

20.
Unhealthy alcohol use is one of the leading causes of morbidity and mortality in the United States. Brief interventions with high-risk drinkers during an emergency department (ED) visit are of great interest due to their possible efficacy and low cost. In a collaborative study with patients recruited at 14 academic EDs across the United States, we examined the self-reported number of drinks per week by each patient following the exposure to a brief intervention. Count data with overdispersion have been mostly analyzed with generalized linear mixed models (GLMMs), for which only a limited number of link functions are available. Different choices of link function provide different fit and predictive power for a particular dataset. We propose a class of link functions from an alternative way to incorporate random effects in a GLMM, which encompasses many existing link functions as special cases. The methodology is naturally implemented in a Bayesian framework, with competing links selected with Bayesian model selection criteria such as the conditional predictive ordinate (CPO). In application to the ED intervention study, all models suggest that the intervention was effective in reducing the number of drinks, but some new models are found to significantly outperform the traditional model as measured by CPO. The validity of CPO in link selection is confirmed in a simulation study that shared the same characteristics as the count data from high-risk drinkers. The dataset and the source code for the best fitting model are available in Supporting Information.
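
How a conditional predictive ordinate is estimated from posterior draws can be sketched as follows. The example uses the standard Monte Carlo estimate, the harmonic mean of an observation's likelihood over the draws, and sums log CPO values into a single score (often called LPML) for comparing models. A plain Poisson likelihood stands in for the article's GLMM with its proposed link functions, and the counts and posterior draws are invented; in practice the draws would come from an MCMC sampler.

```python
import math

def poisson_pmf(y, mu):
    """Poisson probability of count y with mean mu (placeholder likelihood)."""
    return math.exp(-mu + y * math.log(mu) - math.lgamma(y + 1))

def cpo(y, mu_draws):
    """Harmonic mean of the likelihood of y over posterior draws of its mean."""
    inverse = [1.0 / poisson_pmf(y, mu) for mu in mu_draws]
    return len(inverse) / sum(inverse)

def lpml(ys, draws_per_obs):
    """Sum of log CPO over observations; larger values indicate better fit."""
    return sum(math.log(cpo(y, d)) for y, d in zip(ys, draws_per_obs))

# Hypothetical weekly drink counts and posterior draws of each patient's mean
# under two competing models.
ys = [2, 7, 0]
model_a = [[2.1, 1.8, 2.5], [6.0, 7.5, 6.8], [0.4, 0.6, 0.5]]
model_b = [[4.0, 4.2, 3.9], [4.1, 4.0, 4.3], [3.8, 4.1, 4.0]]

print(round(lpml(ys, model_a), 3), round(lpml(ys, model_b), 3))
```

Here `model_a` tracks each observation more closely and therefore obtains the higher score, which is the sense in which one link function can be said to outperform another under this criterion.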
