Similar articles
 20 similar articles found.
1.
2.
We introduce a new approach to learning statistical models from multiple sequence alignments (MSA) of proteins. Our method, called GREMLIN (Generative REgularized ModeLs of proteINs), learns an undirected probabilistic graphical model of the amino acid composition within the MSA. The resulting model encodes both the position-specific conservation statistics and the correlated mutation statistics between sequential and long-range pairs of residues. Existing techniques for learning graphical models from MSA either make strong, and often inappropriate, assumptions about the conditional independencies within the MSA (e.g., Hidden Markov Models), or else use suboptimal algorithms to learn the parameters of the model. In contrast, GREMLIN makes no a priori assumptions about the conditional independencies within the MSA. We formulate and solve a convex optimization problem, thus guaranteeing that we find a globally optimal model at convergence. The resulting model is also generative, allowing for the design of new protein sequences that have the same statistical properties as those in the MSA. We perform a detailed analysis of covariation statistics on the extensively studied WW and PDZ domains and show that our method outperforms an existing algorithm for learning undirected probabilistic graphical models from MSA. We then apply our approach to 71 additional families from the PFAM database and demonstrate that the resulting models significantly outperform Hidden Markov Models in terms of predictive accuracy.

3.
Position weight matrices are an important method for modeling signals or motifs in biological sequences, both in DNA and protein contexts. In this paper, we present fast algorithms for the problem of finding significant matches of such matrices. Our algorithms are of the online type, and they generalize classical multipattern matching, filtering, and superalphabet techniques of combinatorial string matching to the problem of weight matrix matching. Several variants of the algorithms are developed, including multiple matrix extensions that perform the search for several matrices in one scan through the sequence database. Experimental performance evaluation is provided to compare the new techniques against each other as well as against some other online and index-based algorithms proposed in the literature. Compared to the brute-force O(mn) approach, our solutions can be faster by a factor that is proportional to the matrix length m. Our multiple-matrix filtration algorithm had the best performance in the experiments. On a current PC, this algorithm finds significant matches (p = 0.0001) of the 123 JASPAR matrices in the human genome in about 18 minutes.
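
For orientation, the brute-force O(mn) baseline that the abstract compares against can be sketched as follows. This is only a minimal illustration of naive weight matrix matching, not the authors' filtering or superalphabet algorithms; the matrix values, the function name `scan_pwm`, and the score threshold are all hypothetical.

```python
# Hypothetical 3-column log-odds matrix over DNA; the values are illustrative only.
PWM = [
    {"A": 1.2, "C": -0.8, "G": -0.5, "T": -1.0},
    {"A": -1.0, "C": 1.1, "G": -0.7, "T": -0.4},
    {"A": -0.6, "C": -0.9, "G": 1.3, "T": -0.8},
]


def scan_pwm(sequence, pwm, threshold):
    """Return (start, score) for every window whose summed score reaches threshold.

    Naive scan: each of the n - m + 1 windows costs m lookups, hence O(mn).
    """
    m = len(pwm)
    hits = []
    for start in range(len(sequence) - m + 1):
        score = 0.0
        for offset in range(m):
            # Symbols outside the alphabet (e.g. N) make the window unmatchable.
            score += pwm[offset].get(sequence[start + offset], float("-inf"))
        if score >= threshold:
            hits.append((start, score))
    return hits


hits = scan_pwm("TTACGTT", PWM, threshold=3.0)
```

In practice the threshold would be derived from a p-value such as the p = 0.0001 quoted above; that conversion is omitted here.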

4.
5.
6.
7.
A probabilistic generative model for GO enrichment analysis   Cited by: 1 (self-citations: 0, citations by others: 1)
The Gene Ontology (GO) is extensively used to analyze all types of high-throughput experiments. However, researchers still face several challenges when using GO and other functional annotation databases. One problem is the large number of hypotheses that are tested in each study. In addition, categories often overlap with both direct parents/descendants and other distant categories in the hierarchical structure. This makes it hard to determine whether the identified significant categories represent different functional outcomes or rather a redundant view of the same biological processes. To overcome these problems we developed a generative probabilistic model which identifies a (small) subset of categories that, together, explain the selected gene set. Our model accommodates noise and errors in the selected gene set and GO. Using controlled GO data, our method correctly recovered most of the selected categories, leading to dramatic improvements over current methods for GO analysis. When used with microarray expression data and ChIP-chip data from yeast and human, our method was able to correctly identify both general and specific enriched categories which were overlooked by other methods.

8.
We propose and study a new approach for the analysis of families of protein sequences. This method is related to the LogDet distances used in phylogenetic reconstructions; it can be viewed as an attempt to embed these distances into a multidimensional framework. The proposed method starts by associating a Markov matrix with each pairwise alignment deduced from a given multiple alignment. The central objects under consideration here are matrix-valued logarithms L of these Markov matrices, which exist under conditions that are compatible with fairly large divergence between the sequences. These logarithms allow us to compare data from a family of aligned proteins with simple models (in particular, continuous reversible Markov models) and to test the adequacy of such models. If one neglects fluctuations arising from the finite length of sequences, any continuous reversible Markov model with a single rate matrix Q over an arbitrary tree predicts that all the observed matrices L are multiples of Q. Our method exploits this fact, without relying on any tree estimation. We test this prediction on a family of proteins encoded by the mitochondrial genome of 26 multicellular animals, which include vertebrates, arthropods, echinoderms, molluscs, and nematodes. A principal component analysis of the observed matrices L shows that a single rate model can be used as a rough approximation to the data, but that systematic deviations from any such model are unmistakable and related to the evolutionary history of the species under consideration.
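
The prediction that every observed L is a multiple of Q follows in one step from the standard form of a continuous-time Markov model. The sketch below assumes that the Markov matrix for a pair of sequences i and j is the transition matrix over the total path length between them; the symbols P_{ij} and t_{ij} are notation introduced here for illustration and do not appear in the abstract.

```latex
% Assumption: under a reversible model with a single rate matrix Q, the pairwise
% Markov matrix depends only on the path length t_{ij} separating sequences i and j.
\begin{align}
  P_{ij} &= \exp\!\left(t_{ij}\, Q\right), \\
  L_{ij} &= \log P_{ij} = t_{ij}\, Q .
\end{align}
% Hence all L_{ij} lie on the one-dimensional subspace spanned by Q, which is why a
% principal component analysis of the observed matrices L can test the model
% without estimating the tree.
```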

9.
10.
11.
12.
13.
MicroRNAs are endogenous small RNA molecules that regulate gene expression. Although the biogenesis of microRNAs and their regulation have been thoroughly elucidated, the degradation of microRNAs is not fully understood. Here, using a pulse-chase approach, we directly measured microRNA lifespan. Five representative microRNAs shared a relatively long lifespan. However, the decay dynamics vary considerably among these individual microRNAs. Mutation analysis of the miR-29b sequence revealed that uracils at nucleotide positions 9-11 are required for its rapid decay, in that both specific nucleotides and their position are critical. The effect of the uracil-rich element on miR-29b decay dynamics occurs in duplex but not in single-stranded RNA. Moreover, analysis of published data on microRNA expression profiles during development reveals that a substantial subset of microRNAs with the uracil-rich sequence tends to be down-regulated compared to those without the sequence. Among them, Northern blotting shows that miR-29c and fruit fly bantam possess a relatively rapid turnover rate. The effect of the uracil-rich sequence on microRNA turnover depends on the sequence context. The present work indicates that microRNAs contain sequence information in the middle region besides the sequence elements at both ends.

14.
A model is presented for the evolution and control of generative apomixis—a collective term for apomixis in animals and diplosporous apomixis in flowering plants. Its development takes into account data obtained from studies of apomictic-like processes in sexual organisms and in non-apomictic parthenogens, as well as data obtained from studies of generative apomicts. This approach provides insights into the evolution and control of generative apomixis that cannot be obtained from studies of generative apomicts alone. It is argued that the control of the avoidance of meiotic reduction during egg production in generative apomicts resides at a single locus, the identity of which can vary between lineages. This variation accounts for the observed variation between taxa in the pattern of avoidance of meiotic reduction. The affected locus contains a wild-type allele that codes for meiotic reduction and excess copies of a mutant allele that codes for its avoidance. The dominance relationship between these is determined by their ratio and by the environment. Environmental differences between female generative cells and somatic cells are such that the phenotypic expression of the mutant allele is favoured in the former, while that of the wild-type allele is favoured in the latter. This is important, for the locus is also involved in the control of mitosis which would be disrupted by the expression of the mutant allele in somatic cells. The requirement to maintain a viable pattern of growth and development explains why the wild-type allele is retained by generative apomicts, and this in turn explains why the ability to produce meiotically reduced eggs is retained by facultative forms and why it appears to be suppressed in, rather than absent from, obligate forms. 
The requirement for excess copies of the mutant allele in generative cells explains why generative apomicts are typically polyploid, as this condition provides a simple and effective means of generating the correct balance of mutant and wild-type alleles. Environmental effects can also lead to the dominance relationship between wild-type and mutant alleles varying between generative cells. In plants, this can lead to the apomixis gene being expressed, and thus to meiotic reduction being avoided, in only some ovules. Meiotically reduced, as well as meiotically unreduced, eggs are produced when this occurs. If compatible and viable pollen is available the meiotically reduced eggs may be fertilized, resulting in these organisms reproducing as facultative apomicts. It is argued that the control and evolution of parthenogenesis in generative apomicts varies between taxa. In some, the parthenogenetic initiation of embryos may result from the acquisition of a parthenogenesis gene or genes; but there is no reason to believe that this is either a general or a common requirement. Indeed, in some it may be an ancestral trait, these apomicts differing from their sexual ancestors in the ability to mature, rather than in the ability to initiate, embryos from unfertilized eggs; or it may result from physiological or developmental changes induced, for example, by polyploidization, hybridization, or the avoidance of meiotic reduction. In some plants it may be induced by pollination (without fertilization) or by the activity of a developing endosperm. Although it is argued that most generatively apomictic lineages may have acquired this form of reproduction relatively easily, by the acquisition of a mutation at a single locus, it is argued that newly initiated lineages may often be reproductively inefficient. These will begin to accumulate mutations that improve the efficiency of apomictic reproduction. 
Thus several loci may be involved in the control of generative apomixis in established lineages, even though only a single locus was involved in its initiation in these lineages. Care must be taken to distinguish between these initiator and modifier genes when considering the evolution of generative apomixis. Finally, it is argued that although generatively apomictic lineages have easily acquired this form of reproduction, its evolution in some taxa may be so difficult, requiring the acquisition of mutations simultaneously at two or more loci, that these may never acquire it. Thus, evidence from taxa that have successfully made the transition from sexual reproduction to generative apomixis, indicating that its evolution was straightforward, should not be used as evidence that its evolution will always be relatively easily achieved. Its uneven taxonomic distribution indicates that it is much more easily evolved by some taxonomic groups than by others.

15.
16.
MOTIVATION: Microarray designs containing millions to hundreds of millions of probes that tile entire genomes are currently being released. Within the next 2 months, our group will release a microarray data set containing over 12,000,000 microarray measurements taken from 37 mouse tissues. A problem that will become increasingly significant in the upcoming era of genome-wide exon-tiling microarray experiments is the removal of cross-hybridization noise. We present a probabilistic generative model for cross-hybridization in microarray data and a corresponding variational learning method for cross-hybridization compensation, GenXHC, that reduces cross-hybridization noise by taking into account multiple sources for each mRNA expression level measurement, as well as prior knowledge of hybridization similarities between the nucleotide sequences of microarray probes and their target cDNAs. RESULTS: The algorithm is applied to a subset of an exon-resolution genome-wide Agilent microarray data set for chromosome 16 of Mus musculus and is found to produce statistically significant reductions in cross-hybridization noise. The denoised data is found to produce enrichment in multiple gene ontology-biological process (GO-BP) functional groups. The algorithm is found to outperform robust multi-array analysis, another method for cross-hybridization compensation.

17.
MOTIVATION: One of the more challenging problems in biology is to determine the cellular protein interaction network. Progress has been made in predicting protein-protein interactions based on structural information, assuming that structurally similar proteins interact in a similar way. In a previous publication, we determined a genome-wide Ras-effector interaction network based on homology models, with a high accuracy of predicting binding and non-binding domains. However, for a prediction on a genome-wide scale, homology modelling is a time-consuming process. Therefore, we here developed a faster method using position energy matrices, where, based on different Ras-effector X-ray template structures, all amino acids in the effector binding domain are sequentially mutated to all other amino acid residues and the effect on binding energy is calculated. Those pre-calculated matrices can then be used to score any Ras or effector sequence for binding. RESULTS: Based on position energy matrices, the sequences of putative Ras-binding domains can be scanned quickly to calculate an energy sum value. By calibrating energy sum values using quantitative experimental binding data, thresholds can be defined and thus non-binding domains can be excluded quickly. Sequences which have energy sum values above this threshold are considered to be potential binding domains, and could be further analysed using homology modelling. This prediction method could be applied to other protein families sharing conserved interaction types, in order to determine large-scale cellular protein interaction networks quickly. Thus, it could have an important impact on future in silico structural genomics approaches, in particular with regard to increasing structural proteomics efforts, aiming to determine all possible domain folds and interaction types. AVAILABILITY: All matrices are deposited in the ADAN database (http://adan-embl.ibmc.umh.es/).
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

18.
19.
We propose a model for generating "artificial" nucleotide sequences and, by the method of mapping those sequences onto a "DNA-walk," we analyze the presence of correlation between nucleotides. Artificial sequences are constructed considering, basically, interactions between first neighbors and between more distant units. We show that long-range correlations may be favored by the occurrence of intrastrand interactions, which give a nonlinear characteristic to the sequence.
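
The abstract does not say which mapping rule its "DNA-walk" uses. As a minimal sketch, the example below assumes the common purine/pyrimidine rule (step +1 for C or T, step -1 for A or G) and measures correlation through the root-mean-square fluctuation of the walk over a window; the function names `dna_walk` and `fluctuation` are hypothetical.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def dna_walk(sequence):
    """Map a nucleotide sequence to a 1-D walk (assumed rule: pyrimidine +1, purine -1)."""
    position = 0
    walk = []
    for base in sequence.upper():
        if base in PYRIMIDINES:
            position += 1
        elif base in PURINES:
            position -= 1
        else:
            raise ValueError(f"unexpected nucleotide: {base!r}")
        walk.append(position)
    return walk


def fluctuation(walk, window):
    """Root-mean-square fluctuation of the walk displacement over a given window."""
    if window < 1 or window > len(walk):
        raise ValueError("window must be between 1 and len(walk)")
    ys = [0] + walk  # include the starting position of the walk
    deltas = [ys[i + window] - ys[i] for i in range(len(ys) - window)]
    mean = sum(deltas) / len(deltas)
    mean_sq = sum(d * d for d in deltas) / len(deltas)
    # Clamp at zero to guard against tiny negative values from rounding.
    return math.sqrt(max(mean_sq - mean * mean, 0.0))


walk = dna_walk("ACGTTTCA")
f2 = fluctuation(walk, 2)
```

For an uncorrelated sequence the fluctuation is expected to grow roughly as the square root of the window length, so faster growth across a range of windows is the usual sign of long-range correlation.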

20.