Similar literature
20 similar documents found
1.
MOTIVATION: We present a method for modeling protein families by means of probabilistic suffix trees (PSTs). The method is based on identifying significant patterns in a set of related protein sequences. The patterns can be of arbitrary length, and the input sequences do not need to be aligned, nor is delineation of domain boundaries required. The method is automatic, and can be applied, without assuming any preliminary biological information, with surprising success. Basic biological considerations such as amino acid background probabilities and amino acid substitution probabilities can be incorporated to improve performance. RESULTS: The PST can serve as a predictive tool for protein sequence classification, and for detecting conserved patterns (possibly functionally or structurally important) within protein sequences. The method was tested on the Pfam database of protein families with more than satisfactory performance. Exhaustive evaluations show that the PST model detects many more related sequences than pairwise methods such as Gapped-BLAST, and is almost as sensitive as a hidden Markov model that is trained from a multiple alignment of the input sequences, while being much faster.

2.
Probabilistic automata are compared with deterministic ones in simulations of growing networks made of dividing interconnected cells. On examples of chains, wheels and tree-like structures made of large numbers of cells it is shown that the number of necessary states in the initial generating cell automaton is reduced drastically when the automaton is probabilistic rather than deterministic. Since the price being paid is a decrease in the accuracy of the generated network, conditions under which reasonable compromises can be achieved are studied. They depend on the degree of redundancy of the final network (defined from the complexity of a deterministic automaton capable of generating it with maximum accuracy), on the "entropy" of the generating probabilistic automaton, and on the effects of different inputs on its transition probabilities (as measured by its "capacity" in the sense of Shannon's information theory). The results are used to discuss and make more precise the notion of biological specificity. It is suggested that the weak metaphor of a genetic program, classically used to account for the role of DNA in specific genetic determinations, be replaced by that of inputs to biochemical probabilistic automata.

3.
Modelled as finite homogeneous Markov chains, probabilistic cellular automata with local transition probabilities in (0, 1) always possess a stationary distribution. This result alone is not very helpful when it comes to predicting the final configuration; one also needs a formula connecting the probabilities in the stationary distribution to some intrinsic feature of the lattice configuration. Previous results on asynchronous cellular automata have shown that such a feature really exists. It is the number of zero-one borders within the automaton's binary configuration. An exponential formula in the number of zero-one borders has been proved for the 1-D, 2-D and 3-D asynchronous automata with neighborhood three, five and seven, respectively. We perform computer experiments on a synchronous cellular automaton to check whether the empirical distribution also obeys that theoretical formula. The numerical results indicate a perfect fit for neighborhood three and five, which opens the way for a rigorous proof of the formula in this new, synchronous case.

4.
5.
Chen PC. Bio Systems, 2005, 81(2): 155-163
This article presents an approach for synthesizing target strings in a class of computational models of DNA recombination. The computational models are formalized as splicing systems in the context of formal languages. Given a splicing system (of a restricted type) and a target string to be synthesized, we construct (i) a rule-embedded splicing automaton that recognizes languages containing strings embedded with symbols representing splicing rules, and (ii) an automaton that implicitly recognizes the target string. By manipulating these two automata, we extract all rule sequences that lead to the production of the target string (if that string belongs to the splicing language). An algorithm for synthesizing a certain type of target string based on such rule sequences is presented.

6.
7.
The score statistics of probabilistic gapped local alignment of random sequences is investigated both analytically and numerically. The full probabilistic algorithm (e.g., the "local" version of maximum-likelihood or hidden Markov model method) is found to have anomalous statistics. A modified "semi-probabilistic" alignment consisting of a hybrid of Smith-Waterman and probabilistic alignment is then proposed and studied in detail. It is predicted that the score statistics of the hybrid algorithm is of the Gumbel universal form, with the key Gumbel parameter lambda taking on a fixed asymptotic value for a wide variety of scoring systems and parameters. A simple recipe for the computation of the "relative entropy," and from it the finite size correction to lambda, is also given. These predictions compare well with direct numerical simulations for sequences of lengths between 100 and 1,000 examined using various PAM substitution scores and affine gap functions. The sensitivity of the hybrid method in the detection of sequence homology is also studied using correlated sequences generated from toy mutation models. It is found to be comparable to that of the Smith-Waterman alignment and significantly better than the Viterbi version of the probabilistic alignment.

8.
MOTIVATION: Characterization of a protein family by its distinct sequence domains is crucial for functional annotation and correct classification of newly discovered proteins. Conventional Multiple Sequence Alignment (MSA) based methods encounter difficulties when faced with heterogeneous groups of proteins. However, even many families of proteins that do share a common domain contain instances of several other domains, without any common underlying linear ordering. Ignoring this modularity may lead to poor or even false classification results. An automated method that can analyze a group of proteins into the sequence domains it contains is therefore highly desirable. RESULTS: We apply a novel method to the problem of protein domain detection. The method takes as input an unaligned group of protein sequences. It segments them and clusters the segments into groups sharing the same underlying statistics. A Variable Memory Markov (VMM) model is built using a Prediction Suffix Tree (PST) data structure for each group of segments. Refinement is achieved by letting the PSTs compete over the segments, and a deterministic annealing framework infers the number of underlying PST models while avoiding many inferior solutions. We show that regions of similar statistics correlate well with protein sequence domains, by matching a unique signature to each domain. This is done in a fully automated manner, and does not require or attempt an MSA. Several representative cases are analyzed. We identify a protein fusion event, refine an HMM superfamily classification into the underlying families the HMM cannot separate, and detect all 12 instances of a short domain in a group of 396 sequences. CONTACT: jill@cs.huji.ac.il; tishby@cs.huji.ac.il.

9.
The singing behavior of songbirds has been investigated as a model of sequence learning and production. The song of the Bengalese finch, Lonchura striata var. domestica, is well described by a finite state automaton including a stochastic transition of the note sequence, which can be regarded as a higher-order Markov process. Focusing on the neural structure of songbirds, we propose a neural network model that generates higher-order Markov processes. The neurons in the robust nucleus of the archistriatum (RA) encode each note; they are activated by RA-projecting neurons in the HVC (used as a proper name). We hypothesize that the same note included in different chunks is encoded by distinct RA-projecting neuron groups. From this assumption, the output sequence of RA is a higher-order Markov process, even though the RA-projecting neurons in the HVC fire on first-order Markov processes. We developed a neural network model of the local circuits in the HVC that explains the mechanism by which RA-projecting neurons transition stochastically on first-order Markov processes. Numerical simulation showed that this model can generate first-order Markov process song sequences.

10.
Modeling splice sites with Bayes networks (total citations: 6; self-citations: 0; citations by others: 6)

11.
Length and sequence heterogeneity in 5S rDNA of Populus deltoides (total citations: 1; self-citations: 0; citations by others: 1)
The 5S rRNA genes and their associated non-transcribed spacer (NTS) regions are present as repeat units arranged in tandem arrays in plant genomes. Length heterogeneity in 5S rDNA repeats was previously identified in Populus deltoides and was also observed in the present study. Primers were designed to amplify the 5S rDNA NTS variants from the P. deltoides genome. The PCR-amplified products from the two accessions of P. deltoides (G3 and G48) suggested the presence of length heterogeneity of 5S rDNA units within and among accessions, and the size of the spacers ranged from 385 to 434 bp. Sequence analysis of the NTS revealed two distinct classes of 5S rDNA within both accessions: class 1, which contained GAA trinucleotide microsatellite repeats, and class 2, which lacked the repeats. The class 1 spacer shows length variation owing to the microsatellite, with two clones exhibiting 10 GAA repeat units and one clone exhibiting 16 such repeat units. However, distance analysis shows that class 1 spacer sequences are highly similar inter se, yielding nucleotide diversity (pi) estimates that are less than 15% of those obtained for class 2 spacers (pi = 0.0183 vs. 0.1433, respectively). The presence of a microsatellite in the NTS region leading to variation in spacer length is reported and discussed for the first time in P. deltoides.

12.
Comparative ab initio prediction of gene structures using pair HMMs (total citations: 3; self-citations: 0; citations by others: 3)
We present a novel comparative method for the ab initio prediction of protein coding genes in eukaryotic genomes. The method simultaneously predicts the gene structures of two un-annotated input DNA sequences which are homologous to each other and retrieves the subsequences which are conserved between the two DNA sequences. It is capable of predicting partial, complete and multiple genes and can align pairs of genes which differ by events of exon-fusion or exon-splitting. The method employs a probabilistic pair hidden Markov model. We generate annotations using our model with two different algorithms: the Viterbi algorithm in its linear memory implementation and a new heuristic algorithm, called the stepping stone, for which both memory and time requirements scale linearly with the sequence length. We have implemented the model in a computer program called DOUBLESCAN. In this article, we introduce the method and confirm the validity of the approach on a test set of 80 pairs of orthologous DNA sequences from mouse and human. More information can be found at: http://www.sanger.ac.uk/Software/analysis/doublescan/

13.
In this paper we introduce a simple model based on probabilistic finite state automata to describe an emotional interaction between a robot and a human user, or between simulated agents. Based on the agent’s personality, attitude, and nature, and on the emotional inputs it receives, the model will determine the next emotional state displayed by the agent itself. The probabilistic and time-varying nature of the model yields rich and dynamic interactions, and an autonomous adaptation to the interlocutor. In addition, a reinforcement learning technique is applied to have one agent drive its partner’s behavior toward desired states. The model may also be used as a tool for behavior analysis, by extracting high-probability patterns of interaction and by resorting to the ergodic properties of Markov chains. An early-stage part of this work was presented at the 11th International Conference on Knowledge-Based and Intelligent Information and Engineering Systems (KES 2007).

14.
Very often, living beings seem able to change their functioning when external conditions vary. In order to study this property, we have devised abstract machines whose internal organisation changes whenever the external conditions vary. The internal organisations of these machines (or programs) are as simple as possible: functions of discrete variables. We call such machines self-modifying automata. These machines stabilise after some transient steps, when they go indefinitely through a loop called a p-cycle or limit cycle of length p. More often than not, the p in the cycle is equal to one and the cycle reduces to a fixed point. In this case the external value (v) can be considered as the index of a function f such that f_v(v) = v, and the machine has the properties of self-replication and self-reference. Many authors, in computer and natural science, consider that self-referential objects are a main concept in the comprehension of perception, behaviour and associations. In the third part, we have studied chains of automata. Only one automaton changes its internal organisation at each step. Chains of automata perform better than single self-modifying automata: they show a higher frequency of fixed-point occurrence and a shorter transient length. The performance of the chains of automata improves when the value of their internal states increases, whereas the performance of single automata decreases.

15.
We present a method for classifying proteins into families based on short subsequences of amino acids using a new probabilistic model called sparse Markov transducers (SMT). We classify a protein by estimating probability distributions over subsequences of amino acids from the protein. Sparse Markov transducers, similar to probabilistic suffix trees, estimate a probability distribution conditioned on an input sequence. SMTs generalize probabilistic suffix trees by allowing for wild-cards in the conditioning sequences. Since substitutions of amino acids are common in protein families, incorporating wild-cards into the model significantly improves classification performance. We present two models for building protein family classifiers using SMTs. As protein databases become larger, data driven learning algorithms for probabilistic models such as SMTs will require vast amounts of memory. We therefore describe and use efficient data structures to improve the memory usage of SMTs. We evaluate SMTs by building protein family classifiers using the Pfam and SCOP databases and compare our results to previously published results and state-of-the-art protein homology detection methods. SMTs outperform previous probabilistic suffix tree methods and under certain conditions perform comparably to state-of-the-art protein homology methods.

16.
Hidden Markov models (HMMs) are probabilistic models that are well adapted to many tasks in bioinformatics, for example, for predicting the occurrence of specific motifs in biological sequences. MAMOT is a command-line program for Unix-like operating systems, including MacOS X, that we developed to allow scientists to apply HMMs more easily in their research. One can define the architecture and initial parameters of the model in a text file and then use MAMOT for parameter optimization on example data, decoding (like predicting motif occurrence in sequences) and the production of stochastic sequences generated according to the probabilistic model. Two examples for which models are provided are coiled-coil domains in protein sequences and protein binding sites in DNA. A wealth of useful features include the use of pseudocounts, state tying and fixing of selected parameters in learning, and the inclusion of prior probabilities in decoding. AVAILABILITY: MAMOT is implemented in C++, and is distributed under the GNU General Public Licence (GPL). The software, documentation, and example model files can be found at http://bcf.isb-sib.ch/mamot

17.
We introduce a new approach to learning statistical models from multiple sequence alignments (MSA) of proteins. Our method, called GREMLIN (Generative REgularized ModeLs of proteINs), learns an undirected probabilistic graphical model of the amino acid composition within the MSA. The resulting model encodes both the position-specific conservation statistics and the correlated mutation statistics between sequential and long-range pairs of residues. Existing techniques for learning graphical models from MSA either make strong, and often inappropriate assumptions about the conditional independencies within the MSA (e.g., Hidden Markov Models), or else use suboptimal algorithms to learn the parameters of the model. In contrast, GREMLIN makes no a priori assumptions about the conditional independencies within the MSA. We formulate and solve a convex optimization problem, thus guaranteeing that we find a globally optimal model at convergence. The resulting model is also generative, allowing for the design of new protein sequences that have the same statistical properties as those in the MSA. We perform a detailed analysis of covariation statistics on the extensively studied WW and PDZ domains and show that our method out-performs an existing algorithm for learning undirected probabilistic graphical models from MSA. We then apply our approach to 71 additional families from the PFAM database and demonstrate that the resulting models significantly out-perform Hidden Markov Models in terms of predictive accuracy.

18.
A cellular automaton that is related to the "mosaic cycle concept" is considered. We explain why such automata very often, but not always, sustain n-periodic trajectories (n being the number of states of the automaton). Our work is a first step in the direction of a theory of this type of automaton, which might be useful in modeling mosaic successions.

19.
Immunoglobulin class switch involves a unique recombination event that takes place at the switch (S) region which is located 5' to each constant region (C) gene of the heavy (H) chain. For example, differentiation of the B lymphocyte from a mu-chain producer to an epsilon-chain producer is mediated by the switch recombination between the S mu and S epsilon regions. In order to elucidate the molecular mechanism for the switch recombination, we have determined nucleotide sequences surrounding the class switch recombination sites of the C epsilon and C gamma 3 genes and those in the 5' flanking regions of the C gamma 2a and C delta genes. The results indicate that the 5' flanking regions of all the CH genes except for the C delta gene contain the S regions which comprise tandem repetition of short unit sequences in agreement with the previous analyses of the S gamma 1, S gamma 2b, S mu, and S alpha regions. Comparison of the nucleotide sequences of all the S regions revealed that length as well as nucleotide sequences of the S regions vary among different classes of the CH gene, but they share short common sequences, (G)AGCT and TGGG(G). The nucleotide sequence of the S mu region is homologous to those of the other S regions in the decreasing order of the S epsilon, S alpha, S gamma 3, and (S gamma 1, S gamma 2b, S gamma 2a) regions. We have compared the nucleotide sequences immediately adjacent to the recombination sites of seven rearranged genes and have always found tetranucleotides TGAG and/or TGGG, except for one case. Such tetranucleotides may constitute a part of the recognition sequence of a putative recombinase. These results provide further support for our previous proposal that the switch recombination may be facilitated by short common sequences dispersed in all the S regions.

20.
A large number of familial Alzheimer disease (FAD) kindreds were examined to determine whether mutations in the amyloid precursor protein (APP) gene could be responsible for the disease. Previous studies have identified three mutations at APP codon 717 which are pathogenic for Alzheimer disease (AD). Samples from affected subjects were examined for mutations in exons 16 and 17 of the APP gene. A combination of direct sequencing and single-strand conformational polymorphism analysis was used. Sporadic AD and normal controls were also examined by the same methods. Five sequence variants were identified. One variant at APP codon 693 resulted in a Glu-->Gly change. This is the same codon as the hereditary cerebral hemorrhage with amyloidosis-Dutch type Glu-->Gln mutation. Another single-base change at APP codon 708 did not alter the amino acid encoded at this site. Two point mutations and a 6-bp deletion were identified in the intronic sequences surrounding exon 17. None of the variants could be unambiguously determined to be responsible for FAD. The larger families were also analyzed by testing for linkage of FAD to a highly polymorphic short tandem repeat marker (D21S210) that is tightly linked to APP. Highly negative LOD scores were obtained for the family groups tested, and linkage was formally excluded beyond theta = .10 for the Volga German kindreds, theta = .20 for early-onset non-Volga Germans, and theta = .10 for late-onset families. LOD scores for linkage of FAD to markers centromeric to APP (D21S1/S11, D21S13, and D21S215) were also negative in the three family groups. These studies show that APP mutations account for AD in only a small fraction of FAD kindreds.
