Found 20 similar documents (search time: 11 ms)
1.
2.
We describe a methodology, as well as some related data mining tools, for analyzing sequence data. The methodology comprises three steps: (a) generating candidate features from the sequences, (b) selecting relevant features from the candidates, and (c) integrating the selected features to build a system to recognize specific properties in sequence data. We also give relevant techniques for each of these three steps. For generating candidate features, we present various types of features based on the idea of k-grams. For selecting relevant features, we discuss signal-to-noise, t-statistics, and entropy measures, as well as a correlation-based feature selection method. For integrating selected features, we use machine learning methods, including C4.5, SVM, and Naive Bayes. We illustrate this methodology on the problem of recognizing translation initiation sites. We discuss how to generate and select features that are useful for understanding the distinction between ATG sites that are translation initiation sites and those that are not. We also discuss how to use such features to build reliable systems for recognizing translation initiation sites in DNA sequences.
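Steps (a) and (b) of this abstract can be illustrated with a minimal sketch. It assumes plain overlapping k-gram counts as the candidate features and one common definition of the signal-to-noise measure, |mean difference| divided by the sum of the two class standard deviations; the abstract does not give the exact feature types or formulas. The function names `kgram_counts`, `signal_to_noise`, and `rank_kgrams` are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def kgram_counts(seq, k):
    """Count every overlapping k-gram (substring of length k) in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def signal_to_noise(pos_values, neg_values):
    """Assumed definition: |mean(pos) - mean(neg)| / (std(pos) + std(neg))."""
    gap = abs(mean(pos_values) - mean(neg_values))
    spread = pstdev(pos_values) + pstdev(neg_values)
    if spread == 0:
        # No within-class variation: perfectly separating if the means differ.
        return float("inf") if gap > 0 else 0.0
    return gap / spread


def rank_kgrams(positives, negatives, k):
    """Score each k-gram seen in either class; most discriminative first."""
    pos_counts = [kgram_counts(s, k) for s in positives]
    neg_counts = [kgram_counts(s, k) for s in negatives]
    vocabulary = set().union(*pos_counts, *neg_counts)
    scores = {
        gram: signal_to_noise(
            [c[gram] for c in pos_counts],
            [c[gram] for c in neg_counts],
        )
        for gram in vocabulary
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

The top-ranked k-grams would then be passed to a classifier in step (c); that step is not sketched here.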
3.
Osamu Gotoh 《Journal of molecular biology》1982,162(3):705-708
The algorithm of Waterman et al. (1976) for matching biological sequences was modified under some limitations to be accomplished in essentially MN steps, instead of the M²N steps necessary in the original algorithm. The limitations do not seriously reduce the generality of the original method, and the present method is available for most practical uses. The algorithm can be executed on a small computer with a limited capacity of core memory.
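The idea behind the MN-step method can be sketched as follows, assuming the limitation is an affine gap weight (a gap of length L costs a fixed opening term plus a per-position term), which is how Gotoh's method is usually described. This sketch maximizes a similarity score rather than minimizing a distance as the original paper does. The scoring parameters and the name `affine_gap_score` are illustrative assumptions, not values from the paper.

```python
NEG_INF = float("-inf")


def affine_gap_score(a, b, match=1, mismatch=-1, gap_open=-2, gap_extend=-1):
    """Best global alignment score; a gap of length L costs gap_open + L * gap_extend."""
    m, n = len(a), len(b)
    # M: a[i-1] aligned to b[j-1]
    # X: a[i-1] aligned to a gap (gap in b)
    # Y: b[j-1] aligned to a gap (gap in a)
    M = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    X = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    Y = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    M[0][0] = 0
    for i in range(1, m + 1):
        X[i][0] = gap_open + i * gap_extend
    for j in range(1, n + 1):
        Y[0][j] = gap_open + j * gap_extend

    start = gap_open + gap_extend  # cost of the first position of a new gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = s + max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            # Either open a new gap or extend the one already running.
            X[i][j] = max(M[i - 1][j] + start, Y[i - 1][j] + start,
                          X[i - 1][j] + gap_extend)
            Y[i][j] = max(M[i][j - 1] + start, X[i][j - 1] + start,
                          Y[i][j - 1] + gap_extend)
    return max(M[m][n], X[m][n], Y[m][n])
```

Each cell is filled in constant time from its neighbours, so the total work is proportional to MN. Keeping only the previous row of each table would cut memory use further; the full tables are kept here for readability.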
4.
5.
Jian Zhang 《Journal of computational biology》2002,9(3):487-503
Decomposing a biological sequence into modular domains is a basic prerequisite to identify functional units in biological molecules. The commonly used segmentation procedures usually have two steps. First, collect and align a set of sequences that are homologous to the target sequence. Then, parse this multiple alignment into several blocks and identify the functionally important ones by using a semi-automatic method, which combines manual analysis and expert knowledge. In this paper, we present a novel exploratory approach to parsing and analyzing such multiple alignments. It is based on a type of analysis-of-variance (ANOVA) decomposition of the sequence information content. Unlike the traditional change-point method, this approach takes into account not only the composition biases but also the overdispersion effects among the blocks. The new approach is tested on the families of ribosomal proteins and shows promising performance. It is shown that the new approach provides a better way of judging some important residues in these proteins. This allows one to find subsets of residues that are critical to these proteins.
6.
Equivalence of two Fourier methods for biological sequences (cited by: 1 total, 0 self-citations, 1 by others)
Eivind Coward 《Journal of mathematical biology》1997,36(1):64-70
Two methods for defining Fourier power spectra for DNA sequences or other biological sequences are compared. The first method uses indicator sequences for each letter. The second method by Silverman and Linsker assigns to each letter a vertex of a regular tetrahedron in space, and this can be generalized to any dimension. While giving different Fourier transforms, it is shown that the power spectra of the two methods are essentially the same. This is also true if one replaces the Fourier transform in both methods with another linear transform, such as the Walsh transform. Received 4 December 1995
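The claimed equivalence can be checked numerically with a small sketch. It assumes a DNA alphabet, unit-length tetrahedron vertices, and an arbitrary assignment of letters to vertices (the `TETRAHEDRON` table below is hypothetical). Under these assumptions, the tetrahedron power spectrum equals 4/3 times the summed indicator power spectrum at every nonzero frequency. The factor 4/3 follows from the vertices having pairwise dot product -1/3 and is derived here, not quoted from the abstract.

```python
import cmath
import math


def dft(values):
    """Plain O(n^2) discrete Fourier transform of a real sequence."""
    n = len(values)
    return [
        sum(v * cmath.exp(-2j * math.pi * k * t / n) for t, v in enumerate(values))
        for k in range(n)
    ]


def indicator_power(seq, alphabet="ACGT"):
    """Method 1: sum of the power spectra of the per-letter 0/1 indicator sequences."""
    total = [0.0] * len(seq)
    for letter in alphabet:
        spectrum = dft([1.0 if s == letter else 0.0 for s in seq])
        for k, value in enumerate(spectrum):
            total[k] += abs(value) ** 2
    return total


R = 1 / math.sqrt(3)
# Hypothetical letter-to-vertex assignment; any assignment gives the same spectrum.
TETRAHEDRON = {
    "A": (R, R, R),
    "C": (R, -R, -R),
    "G": (-R, R, -R),
    "T": (-R, -R, R),
}


def tetrahedron_power(seq):
    """Method 2: power spectrum of the 3-D vector sequence, summed over coordinates."""
    total = [0.0] * len(seq)
    for axis in range(3):
        spectrum = dft([TETRAHEDRON[s][axis] for s in seq])
        for k, value in enumerate(spectrum):
            total[k] += abs(value) ** 2
    return total
```

The two spectra differ at frequency 0, where the indicator sequences carry the letter counts, which is one reason the abstract says "essentially" the same.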
7.
8.
MOTIVATION: Algorithm development for finding typical patterns in sequences, especially multiple pseudo-repeats (pseudo-periodic regions), is at the core of many problems arising in biological sequence and structure analysis. In fact, one of the most significant features of biological sequences is their high quasi-repetitiveness. Variation in the quasi-repetitiveness of genomic and proteomic texts demonstrates the presence and density of different biologically important information. It is very important to develop sensitive automatic computational methods for the identification of pseudo-periodic regions of sequences through which we can infer, describe and understand biological properties, and seek precise molecular details of biological structures, dynamics, interactions and evolution. RESULTS: We develop a novel, powerful computational tool for partitioning a sequence to pseudo-periodic regions. The pseudo-periodic partition is defined as a partition, which intuitively has the minimal bias to some perfect-periodic partition of the sequence based on the evolutionary distance. We devise a quadratic time and space algorithm for detecting a pseudo-periodic partition for a given sequence, which actually corresponds to the shortest path in the main diagonal of the directed (acyclic) weighted graph constructed by the Smith-Waterman self-alignment of the sequence. We use several typical examples to demonstrate the utilization of our algorithm and software system in detecting functional or structural domains and regions of proteins. A big advantage of our software program is that there is a parameter, the granularity factor, associated with it and we can freely choose a biological sequence family as a training set to determine the best parameter. In general, we choose all repeats (including many pseudo-repeats) in the SWISS-PROT amino acid sequence database as a typical training set. 
We show that the granularity factor is 0.52 and the average agreement accuracy of pseudo-periodic partitions, detected by our software for all pseudo-repeats in the SWISS-PROT database, is as high as 97.6%.
9.
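Similarity (or dissimilarity) analysis of biological sequences is an important method in bioinformatics research. Among alignment-based methods for biological sequence similarity analysis, this paper focuses on comparison methods based on hidden Markov models, and compares the advantages and disadvantages of the various alignment-based sequence analysis methods.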
10.
Iyer VR 《Genome biology》2003,4(6):324
A report on the Genomics, Proteomics and Bioinformatics Thematic Meeting during the 2003 American Society for Biochemistry and Molecular Biology (ASBMB) Annual Meeting, San Diego, USA, 11-15 April 2003.
11.
The problem of alignment of two symbol sequences is considered. The validity of the available algorithms for constructing optimal alignment depends on the weighting coefficients, which are frequently difficult to choose. A new approach to the problem is proposed, which is based on the use of vector weighting functions (instead of traditionally used scalar ones) and Pareto-optimal alignment (an alignment that is optimal at any choice of weighting coefficient will always be Pareto-optimal). An efficient algorithm for constructing all Pareto-optimal alignments of two sequences is proposed. An approach to choosing a "biologically correct" alignment among all Pareto-optimal alignments is suggested.
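The notion of Pareto optimality used here can be illustrated with a minimal sketch. It assumes each candidate alignment is summarized by a cost vector such as (mismatches, gap positions), with lower values better in every component. This is a brute-force filter over a given list of candidates, not the efficient algorithm the abstract proposes; `pareto_front` and `weighted_best` are hypothetical names.

```python
def dominates(p, q):
    """True if p is no worse than q in every component and better in at least one."""
    return all(x <= y for x, y in zip(p, q)) and any(x < y for x, y in zip(p, q))


def pareto_front(costs):
    """Keep the cost vectors that no other vector dominates."""
    return [p for p in costs if not any(dominates(q, p) for q in costs)]


def weighted_best(costs, weights):
    """Scalar approach: pick the vector with the smallest weighted sum."""
    return min(costs, key=lambda c: sum(w * x for w, x in zip(weights, c)))


# Hypothetical cost vectors (mismatches, gap positions) for five candidate alignments.
candidates = [(0, 3), (1, 1), (3, 0), (2, 2), (1, 2)]
front = pareto_front(candidates)
```

For any strictly positive weights, the result of `weighted_best(candidates, weights)` is a member of `front`, which is the property stated in the abstract's parenthetical remark.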
12.
Loni Philip Tabb, Eric J. Tchetgen Tchetgen, Greg A. Wellenius, Brent A. Coull 《Statistics in biosciences》2016,8(2):181-203
Count data often exhibit more zeros than predicted by common count distributions like the Poisson or negative binomial. In recent years, there has been considerable interest in methods for analyzing zero-inflated count data in longitudinal or other correlated data settings. A common approach has been to extend zero-inflated Poisson models to include random effects that account for correlation among observations. However, these models have been shown to have a few drawbacks, including interpretability of regression coefficients and numerical instability of fitting algorithms even when the data arise from the assumed model. To address these issues, we propose a model that parameterizes the marginal associations between the count outcome and the covariates as easily interpretable log relative rates, while including random effects to account for correlation among observations. One of the main advantages of this marginal model is that it allows a basis upon which we can directly compare the performance of standard methods that ignore zero inflation with that of a method that explicitly takes zero inflation into account. We present simulations of these various model formulations in terms of bias and variance estimation. Finally, we apply the proposed approach to analyze toxicological data of the effect of emissions on cardiac arrhythmias.
13.
Summary. A novel approach to visualize biological sequences is developed based on cellular automata (Wolfram, S. Nature 1984, 311, 419–424), a set of discrete dynamical systems in which space and time are discrete. By transforming the symbolic sequence codes into the digital codes, and using some optimal space-time evolvement rules of cellular automata, a biological sequence can be represented by a unique image, the so-called cellular automata image. Many important features, which are originally hidden in a long and complicated biological sequence, can be clearly revealed through its cellular automata image. With biological sequences entering into databanks rapidly increasing in the post-genomic era, it is anticipated that the cellular automata image will become a very useful vehicle for investigation into their key features, identification of their function, as well as revelation of their fingerprint. It is anticipated that by using the concept of the pseudo amino acid composition (Chou, K.C. Proteins: Structure, Function, and Genetics, 2001, 43, 246–255), the cellular automata image approach can also be used to improve the quality of predicting protein attributes, such as structural class and subcellular location.
14.
Background
Pattern mining for biological sequences is an important problem in bioinformatics and computational biology. Biological data mining has an impact in diverse biological fields, such as the discovery of co-occurring biosequences, which is important for biological data analyses. Sequential pattern mining approaches can discover motifs of all lengths in biological sequences. Nevertheless, traditional approaches mine DNA and protein data inefficiently, since the data have small alphabets and long sequences. Furthermore, gap constraints are important in computational biology since they cope with irrelevant regions, which are not conserved in the evolution of biological sequences.
Results
We devise an approach to efficiently mine sequential patterns (motifs) with gap constraints in biological sequences. The approach is the Depth-First Spelling algorithm for mining sequential patterns of biological sequences with Gap constraints (termed DFSG).
Conclusions
PrefixSpan is one of the most efficient of the traditional approaches to mining sequential patterns, and it is the basis of GenPrefixSpan. GenPrefixSpan is an approach built on PrefixSpan with gap constraints, and therefore we compare DFSG with GenPrefixSpan. In the experimental results, DFSG mines biological sequences much faster than GenPrefixSpan.
15.
MOTIVATION: This paper studies the problem of discovering subsequences, known as motifs, that are common to a given collection of related biosequences, by proposing a greedy algorithm for learning a mixture of motifs model through likelihood maximization. The approach sequentially adds a new motif to a mixture model by performing a combined scheme of global and local search for appropriately initializing its parameters. In addition, a hierarchical partitioning scheme based on kd-trees is presented for partitioning the input dataset in order to speed up the global searching procedure. The proposed method compares favorably with the well-known MEME approach and successfully addresses several drawbacks of MEME. RESULTS: Experimental results indicate that the algorithm is advantageous in identifying larger groups of motifs characteristic of biological families with significant conservation. In addition, it offers better diagnostic capabilities by building more powerful statistical motif-models with improved classification accuracy.
16.
On the validity of Shannon-information calculations for molecular biological sequences (cited by: 1 total, 0 self-citations, 1 by others)
The usefulness of information-theoretic measures of the Shannon-Weaver type, when applied to molecular biological systems such as DNA or protein sequences, has been critically evaluated. It is shown that entropy can be re-expressed in dimensionless terms, thereby making it commensurate with information. Further, we have identified processes in which entropy S and information H change in opposite directions. These processes of opposing signs for ΔS and ΔH demonstrate that while the Second Law of Thermodynamics mandates that entropy always increases, it places no such restrictions on changes in information. Additionally, we have developed equations permitting information calculations, incorporating conditional occurrence probabilities, on DNA and protein sequences. When the results of such calculations are compared for sequences of various general types, there are no informational content patterns. We conclude that information-theoretic calculations of the present level of sophistication do not provide any useful insights into molecular biological sequences.
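For orientation, the kind of quantity under discussion can be sketched with the standard textbook definitions of Shannon entropy and first-order conditional entropy, both estimated from observed frequencies. These are not the authors' own equations, which the abstract does not reproduce; the function names are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(seq):
    """H = -sum p(x) log2 p(x) over the letters of seq, in bits per symbol."""
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in Counter(seq).values())


def conditional_entropy(seq):
    """H(next | current) in bits, estimated from adjacent-pair frequencies."""
    pairs = Counter(zip(seq, seq[1:]))   # counts of (current, next) pairs
    firsts = Counter(seq[:-1])           # counts of each letter as 'current'
    total = len(seq) - 1
    return -sum(
        (c / total) * math.log2(c / firsts[a]) for (a, _), c in pairs.items()
    )
```

A sequence using four letters equally often has an entropy of 2 bits per symbol, while a strictly alternating sequence such as "ACACACAC" has a conditional entropy of 0 because each letter fully determines the next.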
17.
18.
19.
Sohrab P Shah, David YM He, Jessica N Sawkins, Jeffrey C Druce, Gerald Quon, Drew Lett, Grace XY Zheng, Tao Xu, BF Francis Ouellette
Background
We present Pegasys, a flexible, modular and customizable software system that facilitates the execution of, and data integration from, heterogeneous biological sequence analysis tools.
20.
Eleazar Eskin, William Stafford Noble, Yoram Singer 《Journal of computational biology》2002,9(6):775-791
Accurately estimating probabilities from observations is important for probabilistic-based approaches to problems in computational biology. In this paper we present a biologically-motivated method for estimating probability distributions over discrete alphabets from observations using a mixture model of common ancestors. The method is an extension of substitution matrix-based probability estimation methods. In contrast to previous such methods, our method has a simple Bayesian interpretation and has the advantage over Dirichlet mixtures that it is both effective and simple to compute for large alphabets. The method is applied to estimate amino acid probabilities based on observed counts in an alignment and is shown to perform comparably to previous methods. The method is also applied to estimate probability distributions over protein families and improves protein classification accuracy.