Similar Documents
20 similar documents found.
1.

Background  

Secondary structure prediction is a useful first step toward 3D structure prediction. A number of successful secondary structure prediction methods use neural networks, but unfortunately, neural networks are not intuitively interpretable. In contrast, hidden Markov models are interpretable graphical models, and they have been used successfully in many bioinformatics applications. Because they offer a strong statistical foundation and allow model interpretation, we propose a method based on hidden Markov models.
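To make the interpretability point concrete, the following is a minimal sketch of HMM decoding (Viterbi) for three-state secondary structure (helix/strand/coil). The transition and emission parameters are illustrative placeholders, not the paper's trained model, which would be fitted with Baum-Welch.

```python
# Minimal sketch of HMM-based secondary structure decoding (Viterbi),
# using illustrative hand-picked parameters -- not the paper's trained model.
import numpy as np

STATES = ["H", "E", "C"]            # helix, strand, coil
AMINO = "ACDEFGHIKLMNPQRSTVWY"

rng = np.random.default_rng(0)
# Hypothetical parameters; a real model would be trained with Baum-Welch.
start = np.log(np.array([0.3, 0.2, 0.5]))
trans = np.log(np.array([[0.8, 0.05, 0.15],
                         [0.05, 0.8, 0.15],
                         [0.2, 0.2, 0.6]]))
emit = rng.dirichlet(np.ones(20), size=3)   # per-state residue distributions
log_emit = np.log(emit)

def viterbi(seq):
    idx = [AMINO.index(a) for a in seq]
    n, k = len(idx), len(STATES)
    dp = np.full((n, k), -np.inf)
    back = np.zeros((n, k), dtype=int)
    dp[0] = start + log_emit[:, idx[0]]
    for t in range(1, n):
        for j in range(k):
            scores = dp[t - 1] + trans[:, j]
            back[t, j] = np.argmax(scores)
            dp[t, j] = scores[back[t, j]] + log_emit[j, idx[t]]
    path = [int(np.argmax(dp[-1]))]
    for t in range(n - 1, 0, -1):           # trace back the best state path
        path.append(back[t, path[-1]])
    return "".join(STATES[s] for s in reversed(path))

print(viterbi("MKTAYIAKQR"))
```

Because the states and parameters are explicit, each prediction can be inspected directly, which is the interpretability advantage claimed over neural networks.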

2.
《IRBM》2021,42(5):345-352
Available clinical methods for heart failure (HF) diagnosis are expensive and require a high level of expert intervention. Recently, various machine learning models have been developed for HF prediction, but most suffer from over-fitting: the model performs well on the training data yet poorly on the testing data. A machine learning model with good generalization (performing well on both the training and testing sets) would minimize prediction errors, and such prediction models could therefore help cardiologists diagnose HF effectively. This paper proposes a two-stage decision support system that overcomes the over-fitting issue and optimizes generalization. The first stage uses a mutual-information-based statistical model; the second stage uses a neural network. We applied our approach to the HF subset of the publicly available Cleveland heart disease database. Our experimental results show that the proposed decision support system improves generalization and reduces the mean percent error (MPE) to 8.8%, significantly less than in recently published studies. In addition, our model achieves an accuracy of 93.33%, higher than twenty-eight recently developed HF risk prediction models, whose accuracies range from 57.85% to 92.31%. We hope that our decision support system will be helpful to cardiologists if deployed in a clinical setting.
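A hedged sketch of the two-stage idea follows: a mutual-information filter as the first stage and a small neural network as the second. Synthetic data stands in for the Cleveland heart-disease subset, and the feature count, k, and network size are assumptions.

```python
# Sketch of the two-stage design: mutual-information feature selection
# followed by a small neural network. Synthetic data stands in for the
# Cleveland heart-disease subset; sizes and hyperparameters are assumptions.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=13, n_informative=6,
                           random_state=0)  # 13 features, like Cleveland
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

model = make_pipeline(
    SelectKBest(mutual_info_classif, k=8),   # stage 1: statistical filter
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
)                                            # stage 2: neural network
model.fit(X_tr, y_tr)
print("train acc:", model.score(X_tr, y_tr), "test acc:", model.score(X_te, y_te))
```

Comparing train and test accuracy is exactly the over-fitting check the abstract emphasizes: a large gap between the two signals poor generalization.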

3.
Fu LM  Fu-Liu CS 《FEBS letters》2004,561(1-3):186-190
Differential diagnosis among a group of histologically similar cancers poses a challenging problem in clinical medicine. Constructing a classifier based on gene expression signatures comprising multiple discriminatory molecular markers derived from microarray data analysis is an emerging trend for cancer diagnosis. Identifying the best genes for classification from a small number of samples relative to the genome size remains the bottleneck of this approach, despite its promise. We have devised a new method of gene selection with reliability analysis, and demonstrated that it can identify a more compact set of genes than other methods for constructing a classifier with optimal predictive performance for both small round blue cell tumors and leukemia. High consensus between our results and those produced by methods based on artificial neural networks and statistical techniques provides additional evidence for the validity of our method. This study suggests a way to implement a reliable molecular cancer classifier based on gene expression signatures.
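The reliability idea can be illustrated roughly as follows: score genes on bootstrap resamples and keep only those that rank highly consistently. The |t|-statistic scoring rule and the 80% selection threshold here are assumptions, not the authors' exact criterion.

```python
# Illustrative sketch of gene ranking with a simple reliability check:
# a gene is kept only if it ranks in the top k across most bootstrap
# resamples. The |t|-statistic score and 80% cutoff are assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_samples, n_genes = 40, 500
X = rng.normal(size=(n_samples, n_genes))
y = rng.integers(0, 2, size=n_samples)
X[y == 1, :10] += 1.5                     # 10 truly informative "genes"

def top_genes(X, y, k=20):
    t, _ = stats.ttest_ind(X[y == 0], X[y == 1], axis=0)
    return set(np.argsort(-np.abs(t))[:k])

counts = np.zeros(n_genes)
for _ in range(100):                      # bootstrap resamples
    idx = rng.integers(0, n_samples, n_samples)
    for g in top_genes(X[idx], y[idx]):
        counts[g] += 1

reliable = np.where(counts >= 80)[0]      # selected in >=80% of resamples
print("reliably selected genes:", reliable)
```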

4.
Advocates of cladistic parsimony methods have invoked the philosophy of Karl Popper in an attempt to argue for the superiority of those methods over phylogenetic methods based on Ronald Fisher's statistical principle of likelihood. We argue that the concept of likelihood in general, and its application to problems of phylogenetic inference in particular, are highly compatible with Popper's philosophy. Examination of Popper's writings reveals that his concept of corroboration is, in fact, based on likelihood. Moreover, because probabilistic assumptions are necessary for calculating the probabilities that define Popper's corroboration, likelihood methods of phylogenetic inference, with their explicit probabilistic basis, are easily reconciled with his concept. In contrast, cladistic parsimony methods, at least as described by certain of their advocates, are less easily reconciled with Popper's concept of corroboration. If those methods are interpreted as lacking probabilistic assumptions, then they are incompatible with corroboration. Conversely, if parsimony methods are to be considered compatible with corroboration, then they must be interpreted as carrying implicit probabilistic assumptions. Thus, the non-probabilistic interpretation of cladistic parsimony favored by some advocates of those methods is contradicted by the same authors' attempt to justify parsimony in terms of Popper's concept of corroboration. In addition to being compatible with Popperian corroboration, the likelihood approach to phylogenetic inference permits researchers to test the assumptions of their analytical methods (models) in a way that is consistent with Popper's ideas about the provisional nature of background knowledge.

5.
Growing interest in conservation and biodiversity has increased the demand for accurate and consistent identification of biological objects, such as insects, at the level of the individual or species. Among identification problems, butterfly identification at the species level has received particular attention because it is directly connected to crop plants for human food and animal feed. However, no widely used, reliable methods have yet been proposed, owing to the complexity of butterfly shapes. In the present study, we propose a novel approach based on a back-propagation neural network to identify butterfly species. The neural network was designed as a multi-class pattern classifier to identify seven different species. We used branch length similarity (BLS) entropies calculated from the boundary pixels of a butterfly shape as the input features to the neural network. We verified the accuracy and efficiency of our method by comparing its performance to that of another neural network system in which the binary values (0 or 1) of all pixels in an image shape are used as the feature vector. Experimental results showed that our method outperforms the binary-image network in both accuracy and efficiency.
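A sketch of the BLS entropy feature follows, using its general published definition (distances from each boundary pixel to all others, normalized into a probability distribution whose entropy is then normalized); the toy elliptical "shape" and simplified boundary handling are assumptions. The resulting profile would be the input vector for the back-propagation classifier.

```python
# Sketch of a branch length similarity (BLS) entropy profile computed from
# boundary pixels; boundary extraction is simplified to a sampled ellipse.
import numpy as np

def bls_entropy_profile(boundary):
    """boundary: (n, 2) array of ordered boundary pixel coordinates."""
    n = len(boundary)
    profile = np.empty(n)
    for i in range(n):
        d = np.linalg.norm(boundary - boundary[i], axis=1)
        d = d[d > 0]                      # drop the zero self-distance
        p = d / d.sum()                   # branch lengths -> probabilities
        profile[i] = -(p * np.log(p)).sum() / np.log(len(p))  # normalized
    return profile

# Toy "shape": an ellipse sampled at 100 boundary points.
t = np.linspace(0, 2 * np.pi, 100, endpoint=False)
boundary = np.stack([3 * np.cos(t), np.sin(t)], axis=1)
features = bls_entropy_profile(boundary)   # input vector for the classifier
print(features[:5])
```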

6.
MOTIVATION: Comparative genomics in general, and orthology analysis in particular, are becoming increasingly important parts of gene function prediction. Previously, orthology analysis and reconciliation have been performed only with respect to the parsimony model. This discards many plausible solutions and sometimes precludes finding the correct one. In many other areas of bioinformatics, probabilistic models have proven to be both more realistic and more powerful than parsimony models. For instance, they allow solution reliability to be assessed and alternative solutions to be considered in a uniform way. There is also the added benefit of making model assumptions explicit, and therefore making model comparisons possible. For orthology analysis, uncertainty has recently been addressed using parsimonious reconciliation combined with bootstrap techniques. However, until now no probabilistic methods have been available. RESULTS: We introduce a probabilistic gene evolution model based on a birth-death process in which a gene tree evolves 'inside' a species tree. Based on this model, we develop a tool capable of practical orthology analysis, based on Fitch's original definition, and more generally of reconciling pairs of gene and species trees. Our gene evolution model is biologically sound (Nei et al., 1997) and intuitively attractive. We develop a Bayesian analysis based on MCMC which facilitates approximation of the a posteriori distribution over reconciliations. That is, we can find the most probable reconciliations and estimate the probability of any reconciliation, given the observed gene tree. This also gives a way to estimate the probability that a pair of genes are orthologs. The main algorithmic contribution presented here is an algorithm for computing the likelihood of a given reconciliation. To the best of our knowledge, this is the first successful introduction into reconciliation and orthology analysis of this type of probabilistic method, which flourishes in phylogenetic analysis. The MCMC algorithm has been implemented and, although not yet in its final form, tests show that it performs very well on synthetic as well as biological data. Using standard correspondences, our results carry over to allele trees as well as to biogeography.

7.
A finite-context (Markov) model of order k yields the probability distribution of the next symbol in a sequence of symbols, given the recent past up to depth k. Markov modeling has long been applied to DNA sequences, for example to find gene-coding regions. The first studies brought the discovery that DNA sequences are non-stationary: distinct regions require distinct model orders. Since then, Markov and hidden Markov models have been used extensively to describe the gene structure of prokaryotes and eukaryotes. However, to our knowledge, a comprehensive study of the potential of Markov models to describe complete genomes is still lacking. We address this gap in this paper. Our approach relies on (i) multiple competing Markov models of different orders; (ii) careful programming techniques that allow orders as large as sixteen; (iii) adequate handling of inverted repeats; and (iv) probability estimates suited to the wide range of context depths used. To measure how well a model fits the data at a particular position in the sequence, we use the negative logarithm of the probability estimate at that position. This measure yields information profiles of the sequence, which are of independent interest. Its average over the entire sequence, which amounts to the average number of bits per base needed to describe the sequence, is used as a global performance measure. Our main conclusion is that, from a probabilistic or information-theoretic point of view and according to this performance measure, multiple competing Markov models explain entire genomes almost as well as, or even better than, state-of-the-art DNA compression methods such as XM, which rely on very different statistical models. This is surprising because Markov models are local (short-range), in contrast with the statistical models underlying the other methods, which exploit the extensive repetitions in DNA sequences and therefore have a non-local character.
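A minimal sketch of an order-k finite-context model and its information profile in bits per base is given below; the paper's multi-model competition and estimator details are simplified here to a single adaptively updated, Laplace-smoothed model.

```python
# Minimal sketch of a finite-context (order-k) model with an information
# profile in bits per base; the multi-model mixture of the paper is
# simplified to one adaptive model with Laplace-smoothed counts.
import math
from collections import defaultdict

def information_profile(seq, k=3, alphabet="ACGT"):
    counts = defaultdict(lambda: defaultdict(int))
    profile = []
    for i in range(len(seq)):
        ctx, sym = seq[max(0, i - k):i], seq[i]
        total = sum(counts[ctx].values())
        p = (counts[ctx][sym] + 1) / (total + len(alphabet))  # Laplace estimate
        profile.append(-math.log2(p))      # bits needed at this position
        counts[ctx][sym] += 1              # adaptive: update after coding
    return profile

seq = "ACGT" * 200 + "AAAA" * 50           # toy sequence with a regime change
prof = information_profile(seq, k=3)
print("average bits/base:", sum(prof) / len(prof))
```

The per-position values form the information profile described above; their average is the global bits-per-base performance measure.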

8.
An evaluation of methods for modelling species distributions
Aim: Various statistical techniques have been used to model species' probabilities of occurrence in response to environmental conditions. This paper provides a comprehensive assessment of such methods and investigates whether errors in model predictions are associated with specific kinds of geographical and environmental distributions of species. Location: Portugal, Western Europe. Methods: Probabilities of occurrence for 44 species of amphibians and reptiles in Portugal were modelled using seven techniques: Gower metric, Ecological Niche Factor Analysis, classification trees, neural networks, generalized linear models, generalized additive models and spatial interpolators. Generalized linear and additive models were constructed with and without a term accounting for spatial autocorrelation. Model performance was measured in two ways: sensitivity and the Kappa index. Species were grouped according to their spatial (area of occupancy and extent of occurrence) and environmental (marginality and tolerance) distributions. Two-way comparison tests were performed to detect significant interactions between models and species groups. Results: The interaction between model and species group was significant for both sensitivity and the Kappa index, indicating that model performance varied for species with different geographical and environmental distributions. Artificial neural networks generally performed best, followed closely by generalized additive models including a covariate term for spatial autocorrelation. Non-parametric methods were preferable to parametric approaches, especially when modelling distributions of species with a greater area of occupancy, a larger extent of occurrence, lower marginality and higher tolerance. Main conclusions: This is a first attempt to relate the performance of modelling techniques to species' spatial and environmental distributions. The results indicate a strong relationship between model performance and the kinds of species distributions being modelled. Some methods performed better overall, but no method was superior in all circumstances. We suggest that the choice of method should be contingent on the goals and the kinds of distributions being modelled.
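For reference, the two performance measures named above can be computed as follows for binary presence/absence predictions (a small self-contained sketch):

```python
# Sketch of the two performance measures used in the study:
# sensitivity and Cohen's Kappa index, for binary presence/absence data.
import numpy as np

def sensitivity(y_true, y_pred):
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return tp / (tp + fn)

def cohens_kappa(y_true, y_pred):
    po = np.mean(y_true == y_pred)                 # observed agreement
    p1 = np.mean(y_true) * np.mean(y_pred)         # chance agreement on 1s
    p0 = (1 - np.mean(y_true)) * (1 - np.mean(y_pred))
    pe = p1 + p0                                   # total chance agreement
    return (po - pe) / (1 - pe)

y_true = np.array([1, 1, 0, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 0, 0, 1, 1, 1, 0])
print("sensitivity:", sensitivity(y_true, y_pred),
      "kappa:", cohens_kappa(y_true, y_pred))
```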

9.
Insect pests pose a significant and increasing threat to agricultural production worldwide. However, most existing recognition methods are built upon well-known convolutional neural networks, which limits the possibility of improving pest recognition accuracy. This research attempts to overcome this challenge from a novel perspective, constructing a simplified but very effective network for insect pest recognition by combining a transformer architecture with convolution blocks. First, representative features are extracted from the input image using a backbone convolutional neural network. Second, a new transformer attention-based classification head is proposed to fully exploit the spatial information in these features. We then explore different combinations for each module in our model, abstract the model into a simple and scalable architecture, and introduce more effective training strategies, pretrained models and data augmentation methods. Our model's performance was evaluated on the IP102 benchmark dataset, achieving classification accuracies of 74.897% and 75.583% with minimal implementation costs at image resolutions of 224 × 224 and 480 × 480 pixels, respectively. Our model also attains accuracies of 99.472% and 97.935% on the D0 dataset and Li's dataset, respectively, at an image resolution of 224 × 224 pixels. The experimental results demonstrate that our method outperforms state-of-the-art methods on these datasets. Accordingly, the proposed model can be deployed in practice and provides additional insights into the related research.
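A rough sketch of the architectural idea follows, treating CNN feature-map cells as tokens and applying one attention layer with a class token before classification; the shapes, single head and random weights are illustrative assumptions, not the published model.

```python
# Sketch of an attention-based classification head over CNN features:
# feature-map cells become tokens, a class token attends to them, and the
# class token's output is classified. All weights are random placeholders.
import numpy as np

rng = np.random.default_rng(0)
H = W = 7; C = 256                         # assumed backbone output: 7x7x256
feat = rng.normal(size=(H * W, C))         # tokens from the CNN backbone
cls = rng.normal(size=(1, C))              # learnable class token (random here)
tokens = np.vstack([cls, feat])            # (50, 256)

Wq, Wk, Wv = (rng.normal(scale=C ** -0.5, size=(C, C)) for _ in range(3))
Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
scores = Q @ K.T / np.sqrt(C)              # scaled dot-product attention
attn = np.exp(scores - scores.max(axis=1, keepdims=True))
attn /= attn.sum(axis=1, keepdims=True)    # softmax over tokens
out = attn @ V

logits = out[0] @ rng.normal(scale=C ** -0.5, size=(C, 102))  # IP102 classes
print("predicted class:", int(np.argmax(logits)))
```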

10.
Predicting the secondary structures of RNA molecules is one of the fundamental problems in computational structural biology, and thus a challenging task. Over the past decades, two main approaches have been used to predict RNA secondary structure from a single sequence: one relies on physics-based RNA models, the other on probabilistic models. The minimum free energy (MFE) approach is usually considered the most popular and successful method. Moreover, building on the paradigm-shifting work of McCaskill, which proposed the computation of partition functions (PFs) and base pair probabilities from thermodynamics, several extended partition function algorithms, statistical sampling methods and clustering techniques have been developed in recent years. However, the accuracy of the corresponding algorithms is limited by the quality of the underlying physics-based models, which include a vast number of thermodynamic parameters and are still incomplete. The competing probabilistic approach is based on stochastic context-free grammars (SCFGs) or generalizations such as conditional log-linear models (CLLMs). These methods abstract away from free energies and instead learn the structural behavior of the molecules by estimating a manageable number of probabilistic parameters from trusted RNA structure databases. In this work, we introduce and evaluate a sophisticated SCFG design that mirrors state-of-the-art physics-based RNA structure prediction procedures by distinguishing between all features of RNA that imply different energy rules. This SCFG serves as the foundation for a statistical sampling algorithm for RNA secondary structures of a single sequence, a probabilistic counterpart to the sampling extension of the PF approach. Furthermore, we present new ways to derive meaningful structure predictions from generated sample sets, and use them to compare the predictive accuracy of our model to that of other probabilistic and energy-based prediction methods. In particular, comparisons to lightweight SCFGs and corresponding CLLMs for RNA structure prediction indicate that more complex SCFG designs may yield higher accuracy but may eventually require more comprehensive and purer training sets. Investigations of both the accuracy of predicted foldings and the overall quality of generated sample sets (especially at an abstraction level, called abstract shapes of generated structures, that is relevant for biologists) lead to the conclusion that the Boltzmann distribution of the PF sampling approach is more centered than the ensemble distribution induced by the sophisticated SCFG model, which implies greater structural diversity within generated samples. In general, neither of the two ensemble distributions is more adequate than the other, and the corresponding results obtained by statistical sampling can be expected to bear fundamental differences, such that the method to be preferred for a particular input sequence depends strongly on the RNA type considered.

11.
MOTIVATION: For several decades, free energy minimization methods have been the dominant strategy for single-sequence RNA secondary structure prediction. More recently, stochastic context-free grammars (SCFGs) have emerged as an alternative probabilistic methodology for modeling RNA structure. Unlike physics-based methods, which rely on thousands of experimentally measured thermodynamic parameters, SCFGs use fully automated statistical learning algorithms to derive model parameters. Despite this advantage, however, probabilistic methods have not replaced free energy minimization methods as the tool of choice for secondary structure prediction, as the accuracies of the best current SCFGs have yet to match those of the best physics-based models. RESULTS: In this paper, we present CONTRAfold, a novel secondary structure prediction method based on conditional log-linear models (CLLMs), a flexible class of probabilistic models which generalize SCFGs by using discriminative training and feature-rich scoring. In a series of cross-validation experiments, we show that grammar-based secondary structure prediction methods formulated as CLLMs consistently outperform their SCFG analogs. Furthermore, CONTRAfold, a CLLM incorporating most of the features found in typical thermodynamic models, achieves the highest single-sequence prediction accuracies to date, outperforming currently available probabilistic and physics-based techniques. Our result thus closes the gap between probabilistic and thermodynamic models, demonstrating that statistical learning procedures provide an effective alternative to empirical measurement of thermodynamic parameters for RNA secondary structure prediction. AVAILABILITY: Source code for CONTRAfold is available at http://contra.stanford.edu/contrafold/.

12.
Tao T  Zhai CX  Lu X  Fang H 《Applied bioinformatics》2004,3(2-3):115-124
Automatic discovery of new protein motifs (i.e. amino acid patterns) is one of the major challenges in bioinformatics. Several algorithms have been proposed that can extract statistically significant motif patterns from any set of protein sequences. With these methods, one can generate a large set of candidate motifs that may be biologically meaningful. This article examines methods to predict the functions of these candidate motifs. We use several statistical methods: a popularity method, a mutual information method and probabilistic translation models. These methods capture, from different perspectives, the correlations between the matched motifs of a protein and its assigned Gene Ontology terms that characterise the function of the protein. We evaluate these different methods using the known motifs in the InterPro database. Each method is used to rank candidate terms for each motif. We then use the expected mean reciprocal rank to evaluate the performance. The results show that, in general, all these methods perform well, suggesting that they can all be useful for predicting the function of an unknown motif. Among the methods tested, a probabilistic translation model with a popularity prior performs the best.
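The mutual-information flavour of the approach can be sketched as follows: score a candidate motif against a GO term by the mutual information between "motif matches the protein" and "protein carries the annotation". The toy corpus below is invented for illustration.

```python
# Sketch of mutual-information scoring between a candidate motif and a
# GO term, over a toy corpus of (motif_hit, term_hit) pairs per protein.
import math

def mutual_info(pairs):
    """pairs: list of (motif_hit, term_hit) 0/1 values, one per protein."""
    n = len(pairs)
    mi = 0.0
    for m in (0, 1):
        for t in (0, 1):
            p_mt = sum(1 for a, b in pairs if a == m and b == t) / n
            p_m = sum(1 for a, _ in pairs if a == m) / n
            p_t = sum(1 for _, b in pairs if b == t) / n
            if p_mt > 0:
                mi += p_mt * math.log2(p_mt / (p_m * p_t))
    return mi

# Toy corpus: 8 proteins, motif hits vs. one GO term's annotations.
pairs = [(1, 1), (1, 1), (1, 0), (0, 0), (0, 0), (0, 1), (0, 0), (1, 1)]
print("MI(motif; GO term) =", round(mutual_info(pairs), 3))
```

Ranking all candidate GO terms by this score for a given motif is the step the expected mean reciprocal rank then evaluates.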

13.
Probabilistic models over strings have played a key role in developing methods that take indels into consideration as phylogenetically informative events. There is an extensive literature on using automata and transducers on phylogenies to do inference on these probabilistic models, in which an important theoretical question is the complexity of computing the normalization of a class of string-valued graphical models. This question has been investigated using tools from combinatorics, dynamic programming, and graph theory, and has practical applications in Bayesian phylogenetics. In this work, we revisit this theoretical question from a different point of view, based on linear algebra. The main contribution is a set of results based on this linear algebra view that facilitate the analysis and design of inference algorithms on string-valued graphical models. As an illustration, we use this method to give a new elementary proof of a known result on the complexity of inference on the “TKF91” model, a well-known probabilistic model over strings. Compared to previous work, our proof technique is easier to extend to other models, since it relies on a novel weak condition, triangular transducers, which is easy to establish in practice. The linear algebra view provides a concise way of describing transducer algorithms and their compositions, and opens the possibility of transferring fast linear algebra libraries (for example, GPU-based ones), as well as low-rank matrix approximation methods, to string-valued inference problems.

14.
The standard approach for single-sequence RNA secondary structure prediction uses a nearest-neighbor thermodynamic model with several thousand experimentally determined energy parameters. An attractive alternative is to use statistical approaches with parameters estimated from growing databases of structural RNAs. Good results have been reported for discriminative statistical methods using complex nearest-neighbor models, including CONTRAfold, Simfold, and ContextFold. Little work has been reported on generative probabilistic models (stochastic context-free grammars [SCFGs]) of comparable complexity, although probabilistic models are generally easier to train and to use. To explore a range of probabilistic models of increasing complexity, and to directly compare probabilistic, thermodynamic, and discriminative approaches, we created TORNADO, a computational tool that can parse a wide spectrum of RNA grammar architectures (including the standard nearest-neighbor model and more) using a generalized super-grammar that can be parameterized with probabilities, energies, or arbitrary scores. By using TORNADO, we find that probabilistic nearest-neighbor models perform comparably to (but not significantly better than) discriminative methods. We find that complex statistical models are prone to overfitting RNA structure and that evaluations should use structurally nonhomologous training and test data sets. Overfitting has affected at least one published method (ContextFold). The most important barrier to improving statistical approaches for RNA secondary structure prediction is the lack of diversity of well-curated single-sequence RNA secondary structures in current RNA databases.

15.
This paper compares kernel-based probabilistic neural networks for speaker verification based on 138 speakers of the YOHO corpus. Experimental evaluations using probabilistic decision-based neural networks (PDBNNs), Gaussian mixture models (GMMs) and elliptical basis function networks (EBFNs) as speaker models were conducted. The original training algorithm of PDBNNs was also modified to make PDBNNs appropriate for speaker verification. Results show that the equal error rate obtained by PDBNNs and GMMs is less than that of EBFNs (0.33% vs. 0.48%), suggesting that GMM- and PDBNN-based speaker models outperform the EBFN ones. This work also finds that the globally supervised learning of PDBNNs is able to find decision thresholds that not only keep the false acceptance rates at a low level but also reduce their variation, whereas the ad hoc threshold-determination approach used by the EBFNs and GMMs causes a large variation in the error rates. This property makes the performance of PDBNN-based systems more predictable.
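A hedged sketch of GMM-based verification follows: accept a claimed identity when the log-likelihood ratio against a background model exceeds a threshold. Feature extraction from YOHO audio (e.g. cepstral features) is replaced by toy Gaussian data, and the component count and threshold are assumptions.

```python
# Sketch of GMM-based speaker verification: accept a claim when the
# log-likelihood ratio against a background model exceeds a threshold.
# Toy Gaussian data stands in for real speech features.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
speaker_feats = rng.normal(loc=1.0, size=(500, 12))   # enrolment features
background = rng.normal(loc=0.0, size=(2000, 12))     # impostor pool

spk = GaussianMixture(n_components=8, random_state=0).fit(speaker_feats)
ubm = GaussianMixture(n_components=8, random_state=0).fit(background)

def verify(utterance, threshold=0.5):
    llr = spk.score(utterance) - ubm.score(utterance)  # mean log-lik ratio
    return llr > threshold, llr

print(verify(rng.normal(loc=1.0, size=(100, 12))))    # genuine -> accept
print(verify(rng.normal(loc=0.0, size=(100, 12))))    # impostor -> reject
```

The fixed threshold here corresponds to the ad hoc threshold determination criticized above; the paper's point is that PDBNNs learn such thresholds during supervised training.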

16.
We demonstrate the applicability of our previously developed Bayesian probabilistic approach for predicting residue solvent accessibility to the problem of predicting secondary structure. Using only single-sequence data, this method achieves a three-state accuracy of 67% over a database of 473 non-homologous proteins. This approach is more amenable to inspection and less likely to overlearn specifics of a dataset than "black box" methods such as neural networks. It is also conceptually simpler and less computationally costly. We also introduce a novel method for representing and incorporating multiple-sequence alignment information within the prediction algorithm, achieving 72% accuracy over a dataset of 304 non-homologous proteins. This is accomplished by creating a statistical model of the evolutionarily derived correlations between patterns of amino acid substitution and local protein structure. This model consists of parameter vectors, termed "substitution schemata," which probabilistically encode the structure-based heterogeneity in the distributions of amino acid substitutions found in alignments of homologous proteins. The model is optimized for structure prediction by maximizing the mutual information between the set of schemata and the database of secondary structures. Unlike "expert heuristic" methods, this approach has been demonstrated to work well over large datasets. Unlike the opaque neural network algorithms, this approach is physicochemically intelligible. Moreover, the model optimization procedure, the formalism for predicting one-dimensional structural features and our previously developed method for tertiary structure recognition all share a common Bayesian probabilistic basis. This consistency starkly contrasts with the hybrid and ad hoc nature of methods that have dominated this field in recent years.

17.
SUMMARY: The genomic abundance and pharmacological importance of membrane proteins have fueled efforts to identify them based solely on sequence information. Previous methods based on the physicochemical principle of a sliding window of hydrophobicity (hydropathy analysis) have been replaced by approaches based on hidden Markov models or neural networks, which prevail because of their probabilistic orientation. In the current study, we use a genetic algorithm to optimize the hydrophobicity tables used in hydropathy analysis. As such, the approach can be viewed as a synthesis of the physicochemically and statistically based methods. The resulting hydrophobicity tables lead to a significant improvement in the prediction accuracy of hydropathy analysis. Furthermore, since hydropathy analysis is less dependent on the basis set of membrane proteins that is used to hone the statistically based methods, and is also faster, it may be valuable in the analysis of new genomes. Finally, the values obtained for each of the amino acids in the new hydrophobicity tables are discussed.
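For context, plain hydropathy analysis is a sliding-window average of per-residue hydrophobicity, with high-scoring segments called as transmembrane. The sketch below uses the standard Kyte-Doolittle scale; the paper's contribution is to evolve such tables with a genetic algorithm, which is not shown here, and the window size and cutoff are conventional choices, not the paper's.

```python
# Sketch of sliding-window hydropathy analysis with the standard
# Kyte-Doolittle scale; the paper evolves such tables with a GA instead.
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
      "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
      "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
      "Y": -1.3, "V": 4.2}

def hydropathy_profile(seq, window=19):
    half = window // 2
    return [sum(KD[a] for a in seq[i - half:i + half + 1]) / window
            for i in range(half, len(seq) - half)]

def tm_positions(seq, cutoff=1.6):
    """Window centres whose average hydrophobicity exceeds the cutoff."""
    return [i for i, v in enumerate(hydropathy_profile(seq)) if v > cutoff]

seq = "MKT" + "LIVFA" * 5 + "DERKN" * 6     # toy: hydrophobic then polar run
print(tm_positions(seq))
```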

18.
Ecological diffusion is a theory that can be used to understand and forecast spatio‐temporal processes such as dispersal, invasion, and the spread of disease. Hierarchical Bayesian modelling provides a framework to make statistical inference and probabilistic forecasts, using mechanistic ecological models. To illustrate, we show how hierarchical Bayesian models of ecological diffusion can be implemented for large data sets that are distributed densely across space and time. The hierarchical Bayesian approach is used to understand and forecast the growth and geographic spread in the prevalence of chronic wasting disease in white‐tailed deer (Odocoileus virginianus). We compare statistical inference and forecasts from our hierarchical Bayesian model to phenomenological regression‐based methods that are commonly used to analyse spatial occurrence data. The mechanistic statistical model based on ecological diffusion led to important ecological insights, obviated a commonly ignored type of collinearity, and was the most accurate method for forecasting.
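A minimal sketch of the process layer of such a model is given below: one-dimensional diffusion of prevalence with logistic growth, discretized by finite differences. The hierarchical Bayesian layers (priors on the diffusion and growth rates, an observation model for survey data) are omitted, and all parameter values are illustrative.

```python
# Sketch of an ecological-diffusion process layer: 1-D diffusion of
# prevalence u(x, t) with logistic growth, stepped by finite differences.
# The Bayesian hierarchy (priors, observation model) is omitted.
import numpy as np

n, dt, dx = 50, 0.1, 1.0
D, r = 0.8, 0.15                      # diffusion and growth rates (assumed)
u = np.zeros(n); u[n // 2] = 0.5      # initial prevalence at an index site

for _ in range(500):
    lap = np.roll(u, 1) + np.roll(u, -1) - 2 * u   # 1-D Laplacian (periodic)
    u = u + dt * (D * lap / dx**2 + r * u * (1 - u))
    u = np.clip(u, 0, 1)              # prevalence stays in [0, 1]

print("prevalence front:", np.round(u[::5], 2))
```

In the full hierarchical model, D and r would carry priors and the observed surveillance data would enter through a separate data layer, with posterior forecasts obtained by simulating this process forward.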

19.
MOTIVATION: Many computational methods for identifying regulatory elements use a likelihood ratio between motif and background models. Often, the methods use a background model of independent bases. At least two different Markov background models have been proposed with the aim of increasing the accuracy of predicting regulatory elements. Both Markov background models suffer theoretical drawbacks, so this article develops a third, context-dependent Markov background model from fundamental statistical principles. RESULTS: Datasets containing known regulatory elements in eukaryotes provided a basis for comparing the predictive accuracies of the different background models. Non-parametric statistical tests indicated that Markov models of order 3 constituted a statistically significant improvement over the background model of independent bases. Our model performed slightly better than the previous Markov background models. We also found that for discriminating between the predictive accuracies of competing background models, the correlation coefficient is a more sensitive measure than the performance coefficient. AVAILABILITY: Our C++ program is available at ftp://ftp.ncbi.nih.gov/pub/spouge/papers/archive/AGLAM/2006-07-19
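The likelihood-ratio scoring discussed above can be sketched as follows, with a position weight matrix as the motif model and an order-1 Markov background; all probabilities are toy values, not the article's fitted models.

```python
# Sketch of log-likelihood-ratio scoring: a position weight matrix (motif)
# versus an order-1 Markov background model. All probabilities are toys.
import math

pwm = [{"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
       {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
       {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1}]
bg_init = {b: 0.25 for b in "ACGT"}
bg_trans = {a: {b: (0.4 if a == b else 0.2) for b in "ACGT"} for a in "ACGT"}

def llr(window):
    log_motif = sum(math.log(pwm[i][b]) for i, b in enumerate(window))
    log_bg = math.log(bg_init[window[0]]) + sum(
        math.log(bg_trans[a][b]) for a, b in zip(window, window[1:]))
    return log_motif - log_bg

seq = "TTAGCATTT"
scores = [(i, round(llr(seq[i:i + 3]), 2)) for i in range(len(seq) - 2)]
print(scores)   # the AGC window should score highest under this toy motif
```

Raising the Markov order of the background (here order 1) toward order 3 is exactly the change the article evaluates.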

20.
Various statistical classification methods, including discriminant analysis, logistic regression, and cluster analysis, have been used with antibiotic resistance analysis (ARA) data to construct models for bacterial source tracking (BST). We applied the statistical method known as classification trees to build a BST model for the Anacostia Watershed in Maryland. Classification trees offer more flexibility than approaches based on standard statistical methods in accommodating complex interactions among ARA variables. This article describes the use of classification trees for BST, including a discussion of their principal parameters and features. Anacostia Watershed ARA data are used to illustrate the application of classification trees, and we report the BST results for the watershed.
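A brief sketch of the approach with synthetic stand-in data follows (scikit-learn's classification trees; the real model would be fit to the Anacostia known-source ARA library, and the depth and leaf-size settings are assumptions):

```python
# Sketch of BST with a classification tree: antibiotic-resistance profiles
# (one column per antibiotic/concentration) predict the source category.
# Synthetic data stands in for the Anacostia known-source ARA library.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
sources = ["human", "pet", "livestock", "wildlife"]
X = rng.random((400, 10))                 # 10 resistance measurements
y = rng.integers(0, len(sources), 400)    # known-source training labels
X[y == 0, 0] += 0.5                       # give "human" isolates a signal

tree = DecisionTreeClassifier(max_depth=4, min_samples_leaf=10, random_state=0)
print("cross-validated rate of correct classification:",
      cross_val_score(tree, X, y, cv=5).mean())
```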
