Evaluation of a sophisticated SCFG design for RNA secondary structure prediction |
| |
Authors: | Markus E Nebel, Anika Scheid |
| |
Institution: | (1) Department of Computer Science, University of Kaiserslautern, Kaiserslautern, Germany |
| |
Abstract: | Predicting secondary structures of RNA molecules is one of the fundamental problems of, and thus a challenging task in, computational
structural biology. Over the past decades, mainly two different approaches have been considered to compute predictions of
RNA secondary structures from a single sequence: the first one relies on physics-based and the other on probabilistic RNA
models. In particular, the free energy minimization approach (MFE, for minimum free energy) is usually considered the most popular and successful method.
Moreover, based on the paradigm-shifting work by McCaskill which proposes the computation of partition functions (PFs) and
base pair probabilities based on thermodynamics, several extended partition function algorithms, statistical sampling methods
and clustering techniques have been developed in recent years. However, the accuracy of the corresponding algorithms is
limited by the quality of underlying physics-based models, which include a vast number of thermodynamic parameters and are
still incomplete. The competing probabilistic approach is based on stochastic context-free grammars (SCFGs) or corresponding
generalizations, like conditional log-linear models (CLLMs). These methods abstract from free energies and instead try to
learn about the structural behavior of the molecules by learning (a manageable number of) probabilistic parameters from trusted
RNA structure databases. In this work, we introduce and evaluate a sophisticated SCFG design that mirrors state-of-the-art
physics-based RNA structure prediction procedures by distinguishing between all features of RNA that imply different energy
rules. This SCFG actually serves as the foundation for a statistical sampling algorithm for RNA secondary structures of a
single sequence that represents a probabilistic counterpart to the sampling extension of the PF approach. Furthermore, some
new ways to derive meaningful structure predictions from generated sample sets are presented. They are used to compare the
predictive accuracy of our model to that of other probabilistic and energy-based prediction methods. In particular, comparisons
to lightweight SCFGs and corresponding CLLMs for RNA structure prediction indicate that more complex SCFG designs might yield
higher accuracy but eventually require more comprehensive and pure training sets. Investigations of both the accuracy of
predicted foldings and the overall quality of generated sample sets (especially at the level of abstract shapes of the generated structures, an abstraction that is relevant for biologists) lead to the conclusion that the Boltzmann distribution of the PF
sampling approach is more centered than the ensemble distribution induced by the sophisticated SCFG model, so that samples generated from the SCFG model exhibit
greater structural diversity. In general, neither of the two distinct ensemble distributions is more
adequate than the other and the corresponding results obtained by statistical sampling can be expected to bear fundamental
differences, such that the method to be preferred for a particular input sequence strongly depends on the considered RNA type. |
| |
Keywords: | |
This article is indexed in PubMed, SpringerLink, and other databases. |
|