1.

Oligo kernels for datamining on biological sequences: a case study on prokaryotic translation initiation sites

Peter Meinicke, Maike Tech, Burkhard Morgenstern, Rainer Merkl 《BMC bioinformatics》2004,5(1):169

2.

Data mining tools for biological sequences

Liu H Wong L 《Journal of bioinformatics and computational biology》2003,1(1):139-167

We describe a methodology, as well as some related data mining tools, for analyzing sequence data. The methodology comprises three steps: (a) generating candidate features from the sequences, (b) selecting relevant features from the candidates, and (c) integrating the selected features to build a system to recognize specific properties in sequence data. We also give relevant techniques for each of these three steps. For generating candidate features, we present various types of features based on the idea of k-grams. For selecting relevant features, we discuss signal-to-noise, t-statistics, and entropy measures, as well as a correlation-based feature selection method. For integrating selected features, we use machine learning methods, including C4.5, SVM, and Naive Bayes. We illustrate this methodology on the problem of recognizing translation initiation sites. We discuss how to generate and select features that are useful for understanding the distinction between ATG sites that are translation initiation sites and those that are not. We also discuss how to use such features to build reliable systems for recognizing translation initiation sites in DNA sequences.
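
As a rough illustration of step (a) in the abstract above, the sketch below counts k-grams in windows upstream and downstream of each candidate ATG. The window size, the function names, the `up_`/`down_` feature prefixes, and the toy sequence are assumptions made for illustration; they are not taken from the paper.

```python
from collections import Counter

def kgram_counts(seq, k):
    """Count every overlapping k-gram in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def atg_features(seq, k=3, window=9):
    """Yield (position, features) for each ATG in seq.

    Features are k-gram counts in the `window` bases upstream and
    downstream of the ATG, prefixed 'up_' / 'down_' (hypothetical naming).
    """
    pos = seq.find("ATG")
    while pos != -1:
        upstream = seq[max(0, pos - window):pos]
        downstream = seq[pos + 3:pos + 3 + window]
        feats = {}
        for gram, n in kgram_counts(upstream, k).items():
            feats["up_" + gram] = n
        for gram, n in kgram_counts(downstream, k).items():
            feats["down_" + gram] = n
        yield pos, feats
        pos = seq.find("ATG", pos + 1)

# Toy sequence with two candidate ATG sites (illustrative only).
sites = list(atg_features("CCGCCATGGCGATGTAA", k=3, window=6))
```

Each feature dictionary could then be passed to a feature-selection step and a classifier, as in steps (b) and (c); those steps are not sketched here.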

3.

Learning kernels from biological networks by maximizing entropy

Tsuda K Noble WS 《Bioinformatics (Oxford, England)》2004,20(Z1):i326-i333

4.

An improved algorithm for matching biological sequences

Osamu Gotoh 《Journal of molecular biology》1982,162(3):705-708

The algorithm of Waterman et al. (1976) for matching biological sequences was modified under some limitations to be accomplished in essentially MN steps, instead of the M²N steps necessary in the original algorithm. The limitations do not seriously reduce the generality of the original method, and the present method is available for most practical uses. The algorithm can be executed on a small computer with a limited capacity of core memory.
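
The abstract does not spell out the limitation. The sketch below assumes it is the restriction to affine gap weights, which is how this algorithm is commonly described. It computes a global alignment score with three tables in a number of steps proportional to MN. The scoring values and the function name are illustrative assumptions, and the sketch maximizes a score rather than minimizing a distance.

```python
NEG = float("-inf")

def affine_gap_score(a, b, match=1, mismatch=-1, gap_open=-2, gap_extend=-1):
    """Global alignment score with affine gaps in O(len(a) * len(b)) steps.

    A gap of length k scores gap_open + k * gap_extend (scores are maximized).
    """
    m, n = len(a), len(b)
    # M: a[i-1] aligned to b[j-1]; X: a[i-1] aligned to a gap; Y: b[j-1] aligned to a gap.
    M = [[NEG] * (n + 1) for _ in range(m + 1)]
    X = [[NEG] * (n + 1) for _ in range(m + 1)]
    Y = [[NEG] * (n + 1) for _ in range(m + 1)]
    M[0][0] = 0
    for i in range(1, m + 1):
        X[i][0] = gap_open + i * gap_extend
    for j in range(1, n + 1):
        Y[0][j] = gap_open + j * gap_extend
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = s + max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            # Extending an open gap costs gap_extend; starting one also costs gap_open.
            X[i][j] = gap_extend + max(M[i - 1][j] + gap_open, X[i - 1][j],
                                       Y[i - 1][j] + gap_open)
            Y[i][j] = gap_extend + max(M[i][j - 1] + gap_open, Y[i][j - 1],
                                       X[i][j - 1] + gap_open)
    return max(M[m][n], X[m][n], Y[m][n])
```

Because each cell depends only on the previous row and the current row, the tables could be reduced to a few rows when only the score is needed, which fits the abstract's remark about limited memory.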

5.

Equivalence of two Fourier methods for biological sequences (total citations: 1; self-citations: 0; citations by others: 1)

Eivind Coward 《Journal of mathematical biology》1997,36(1):64-70

Two methods for defining Fourier power spectra for DNA sequences or other biological sequences are compared. The first method uses indicator sequences for each letter. The second method by Silverman and Linsker assigns to each letter a vertex of a regular tetrahedron in space, and this can be generalized to any dimension. While giving different Fourier transforms, it is shown that the power spectra of the two methods are essentially the same. This is also true if one replaces the Fourier transform in both methods with another linear transform, such as the Walsh transform. Received 4 December 1995
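
The claim can be checked numerically for a four-letter alphabet. The sketch below assumes unit-length tetrahedron vertices and an unnormalized discrete Fourier transform; under those assumptions the two power spectra differ only by a constant factor of 4/3 at every nonzero frequency. The vertex assignment and function names are illustrative choices, not the paper's notation.

```python
import cmath
import math

ALPHABET = "ACGT"
# Vertices of a regular tetrahedron on the unit sphere, one per letter
# (which letter gets which vertex is an arbitrary choice here).
S = 1 / math.sqrt(3)
TETRA = {"A": (S, S, S), "C": (S, -S, -S), "G": (-S, S, -S), "T": (-S, -S, S)}

def dft(values, k):
    """k-th coefficient of the unnormalized DFT of a real sequence."""
    n = len(values)
    return sum(v * cmath.exp(-2j * math.pi * k * t / n) for t, v in enumerate(values))

def indicator_power(seq, k):
    """Sum over letters of |DFT of that letter's 0/1 indicator sequence|^2."""
    return sum(abs(dft([1.0 if c == a else 0.0 for c in seq], k)) ** 2
               for a in ALPHABET)

def tetrahedron_power(seq, k):
    """Sum over the three coordinates of |DFT of the vertex coordinate|^2."""
    return sum(abs(dft([TETRA[c][d] for c in seq], k)) ** 2 for d in range(3))
```

The factor follows from two facts: distinct vertices have inner product -1/3, and the four indicator sequences sum to the all-ones sequence, whose transform vanishes at nonzero frequencies.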

6.

Pseudo-periodic partitions of biological sequences

Li L Jin R Kok PL Wan H 《Bioinformatics (Oxford, England)》2004,20(3):295-306

MOTIVATION: Algorithm development for finding typical patterns in sequences, especially multiple pseudo-repeats (pseudo-periodic regions), is at the core of many problems arising in biological sequence and structure analysis. In fact, one of the most significant features of biological sequences is their high quasi-repetitiveness. Variation in the quasi-repetitiveness of genomic and proteomic texts demonstrates the presence and density of different biologically important information. It is very important to develop sensitive automatic computational methods for the identification of pseudo-periodic regions of sequences through which we can infer, describe and understand biological properties, and seek precise molecular details of biological structures, dynamics, interactions and evolution. RESULTS: We develop a novel, powerful computational tool for partitioning a sequence into pseudo-periodic regions. The pseudo-periodic partition is defined as a partition which intuitively has the minimal bias to some perfect-periodic partition of the sequence based on the evolutionary distance. We devise a quadratic time and space algorithm for detecting a pseudo-periodic partition for a given sequence, which actually corresponds to the shortest path in the main diagonal of the directed (acyclic) weighted graph constructed by the Smith-Waterman self-alignment of the sequence. We use several typical examples to demonstrate the utilization of our algorithm and software system in detecting functional or structural domains and regions of proteins. A big advantage of our software program is that it has a single parameter, the granularity factor, and we can freely choose a biological sequence family as a training set to determine the best value of that parameter. In general, we choose all repeats (including many pseudo-repeats) in the SWISS-PROT amino acid sequence database as a typical training set. We show that the granularity factor is 0.52 and the average agreement accuracy of pseudo-periodic partitions, detected by our software for all pseudo-repeats in the SWISS-PROT database, is as high as 97.6%.

7.

The ontology of biological sequences

Robert Hoehndorf Janet Kelso Heinrich Herre 《BMC bioinformatics》2009,10(1):377

8.

Marginalized Zero-Altered Models for Longitudinal Count Data

Loni Philip Tabb Eric J. Tchetgen Tchetgen Greg A. Wellenius Brent A. Coull 《Statistics in biosciences》2016,8(2):181-203

Count data often exhibit more zeros than predicted by common count distributions like the Poisson or negative binomial. In recent years, there has been considerable interest in methods for analyzing zero-inflated count data in longitudinal or other correlated data settings. A common approach has been to extend zero-inflated Poisson models to include random effects that account for correlation among observations. However, these models have been shown to have a few drawbacks, including interpretability of regression coefficients and numerical instability of fitting algorithms even when the data arise from the assumed model. To address these issues, we propose a model that parameterizes the marginal associations between the count outcome and the covariates as easily interpretable log relative rates, while including random effects to account for correlation among observations. One of the main advantages of this marginal model is that it allows a basis upon which we can directly compare the performance of standard methods that ignore zero inflation with that of a method that explicitly takes zero inflation into account. We present simulations of these various model formulations in terms of bias and variance estimation. Finally, we apply the proposed approach to analyze toxicological data of the effect of emissions on cardiac arrhythmias.

9.

[Pareto-optimal alignment of biological sequences]

M A Roĭtberg M N Semionenkov O Iu Tabolina 《Biofizika》1999,44(4):581-594

The problem of alignment of two symbol sequences is considered. The validity of the available algorithms for constructing optimal alignment depends on the weighting coefficients, which are frequently difficult to choose. A new approach to the problem is proposed, which is based on the use of vector weighting functions (instead of the traditionally used scalar ones) and Pareto-optimal alignment (an alignment that is optimal at any choice of weighting coefficient will always be Pareto-optimal). An efficient algorithm for constructing all Pareto-optimal alignments of two sequences is proposed. An approach to choosing a "biologically correct" alignment among all Pareto-optimal alignments is suggested.
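
To make the notion of Pareto optimality concrete, the sketch below filters a small list of candidate alignments by their score vectors. It is not the paper's algorithm, which constructs the Pareto-optimal alignments directly. The candidates, their scores, and the two-component score (matches, negated gap count) are hypothetical.

```python
def dominates(u, v):
    """True if score vector u is at least as good as v in every component
    and strictly better in at least one (all components are maximized)."""
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))

def pareto_front(candidates):
    """Keep the candidates whose score vectors no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(o["score"], c["score"]) for o in candidates)]

def best_scalar(candidates, weights):
    """Best candidate under one particular choice of scalar weights."""
    return max(candidates,
               key=lambda c: sum(w * s for w, s in zip(weights, c["score"])))

# Hypothetical candidates: score = (matches, -number of gaps).
candidates = [
    {"name": "a1", "score": (10, -4)},
    {"name": "a2", "score": (8, -1)},
    {"name": "a3", "score": (7, -3)},  # dominated by a2
    {"name": "a4", "score": (5, 0)},
]
front = pareto_front(candidates)
```

In this toy set, the winner under each positive weighting tried is a member of `front`, which is the property the abstract states in parentheses.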

10.

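Alignment-based similarity analysis of biological sequences

张少宏 (Zhang Shaohong) 戴宪华 (Dai Xianhua) 《生物信息学》2005,3(2):81-84

Similarity (or dissimilarity) analysis of biological sequences is an important method in bioinformatics research. This paper covers alignment-based methods for biological sequence similarity analysis, with an emphasis on comparison methods based on hidden Markov models, and compares the strengths and weaknesses of the various alignment-based sequence analysis methods.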

11.

Translating genome sequences into biological understanding

Iyer VR 《Genome biology》2003,4(6):324

A report on the Genomics, Proteomics and Bioinformatics Thematic Meeting during the 2003 American Society for Biochemistry and Molecular Biology (ASBMB) Annual Meeting, San Diego, USA, 11-15 April 2003.

12.

Greedy mixture learning for multiple motif discovery in biological sequences (total citations: 4; self-citations: 0; citations by others: 4)

Blekas K Fotiadis DI Likas A 《Bioinformatics (Oxford, England)》2003,19(5):607-617

MOTIVATION: This paper studies the problem of discovering subsequences, known as motifs, that are common to a given collection of related biosequences, by proposing a greedy algorithm for learning a mixture of motifs model through likelihood maximization. The approach sequentially adds a new motif to a mixture model by performing a combined scheme of global and local search for appropriately initializing its parameters. In addition, a hierarchical partitioning scheme based on kd-trees is presented for partitioning the input dataset in order to speed up the global searching procedure. The proposed method compares favorably with the well-known MEME approach and successfully addresses several drawbacks of MEME. RESULTS: Experimental results indicate that the algorithm is advantageous in identifying larger groups of motifs characteristic of biological families with significant conservation. In addition, it offers better diagnostic capabilities by building more powerful statistical motif-models with improved classification accuracy.

13.

On the validity of Shannon-information calculations for molecular biological sequences (total citations: 1; self-citations: 0; citations by others: 1)

A Hariri B Weber J Olmsted 《Journal of theoretical biology》1990,147(2):235-254

The usefulness of information-theoretic measures of the Shannon-Weaver type, when applied to molecular biological systems such as DNA or protein sequences, has been critically evaluated. It is shown that entropy can be re-expressed in dimensionless terms, thereby making it commensurate with information. Further, we have identified processes in which entropy S and information H change in opposite directions. These processes of opposing signs for delta S and delta H demonstrate that while the Second Law of Thermodynamics mandates that entropy always increases, it places no such restrictions on changes in information. Additionally, we have developed equations permitting information calculations, incorporating conditional occurrence probabilities, on DNA and protein sequences. When the results of such calculations are compared for sequences of various general types, there are no informational content patterns. We conclude that information-theoretic calculations of the present level of sophistication do not provide any useful insights into molecular biological sequences.
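
For readers unfamiliar with the quantities under discussion, the sketch below computes the standard Shannon entropy of a sequence from its symbol frequencies, plus a first-order conditional entropy. The conditional form is an assumption about what "incorporating conditional occurrence probabilities" might look like; the paper's own equations are not reproduced here.

```python
import math
from collections import Counter

def shannon_entropy(seq):
    """Shannon entropy in bits per symbol, from observed symbol frequencies."""
    n = len(seq)
    counts = Counter(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def conditional_entropy(seq):
    """Entropy of a symbol given its predecessor (first-order), in bits.

    Uses the empirical pair frequency p(a, b) and the conditional
    frequency p(b | a) = count(a, b) / count(a as a predecessor).
    """
    pairs = Counter(zip(seq, seq[1:]))
    firsts = Counter(seq[:-1])
    total = len(seq) - 1
    return -sum((c / total) * math.log2(c / firsts[a])
                for (a, _), c in pairs.items())
```

A uniform four-letter sequence such as "ACGT" gives 2 bits per symbol, while a strictly alternating sequence has zero conditional entropy because each symbol is fully determined by its predecessor.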

14.

Efficient mining gapped sequential patterns for motifs in biological sequences

Vance Chiang-Chi Liao Ming-Syan Chen 《BMC systems biology》2013,7(Z4):S7

Background

Pattern mining for biological sequences is an important problem in bioinformatics and computational biology. Biological data mining has an impact in diverse biological fields, such as the discovery of co-occurring biosequences, which is important for biological data analyses. Approaches for mining sequential patterns can discover motifs of all lengths in biological sequences. Nevertheless, traditional approaches mine DNA and protein data inefficiently, since the data have small alphabets and long sequences. Furthermore, gap constraints are important in computational biology since they cope with irrelevant regions, which are not conserved in the evolution of biological sequences.

Results

We devise an approach to efficiently mine sequential patterns (motifs) with gap constraints in biological sequences. The approach is the Depth-First Spelling algorithm for mining sequential patterns of biological sequences with Gap constraints (termed DFSG).

Conclusions

PrefixSpan is one of the most efficient methods in traditional approaches of mining sequential patterns, and it is the basis of GenPrefixSpan. GenPrefixSpan is an approach built on PrefixSpan with gap constraints, and therefore we compare DFSG with GenPrefixSpan. In the experimental results, DFSG mines biological sequences much faster than GenPrefixSpan.
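
The role of gap constraints can be illustrated with a small matcher. The sketch below is neither DFSG nor GenPrefixSpan; it only checks whether a given pattern occurs in a sequence when the number of skipped symbols between consecutive pattern symbols must lie within a range, and counts how many sequences contain it. The function names and default gap bounds are assumptions.

```python
def occurs_with_gaps(seq, pattern, min_gap=0, max_gap=2):
    """True if pattern's symbols appear in order in seq, with the number of
    skipped symbols between consecutive pattern symbols in
    [min_gap, max_gap]."""
    if not pattern:
        return True

    def extend(pos, idx):
        # pos is the index in seq where pattern[idx - 1] was matched.
        if idx == len(pattern):
            return True
        lo = pos + 1 + min_gap
        hi = min(pos + 1 + max_gap, len(seq) - 1)
        return any(seq[j] == pattern[idx] and extend(j, idx + 1)
                   for j in range(lo, hi + 1))

    return any(seq[i] == pattern[0] and extend(i, 1) for i in range(len(seq)))

def support(sequences, pattern, min_gap=0, max_gap=2):
    """Number of sequences that contain the pattern under the gap constraint."""
    return sum(occurs_with_gaps(s, pattern, min_gap, max_gap) for s in sequences)
```

A mining algorithm would grow patterns symbol by symbol and keep those whose support stays above a threshold; this sketch covers only the support check.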

15.

Using cellular automata to generate image representation for biological sequences (total citations: 8; self-citations: 0; citations by others: 8)

Xiao X Shao S Ding Y Huang Z Chen X Chou KC 《Amino acids》2005,28(1):29-35

Summary. A novel approach to visualize biological sequences is developed based on cellular automata (Wolfram, S. Nature 1984, 311, 419–424), a set of discrete dynamical systems in which space and time are discrete. By transforming the symbolic sequence codes into the digital codes, and using some optimal space-time evolvement rules of cellular automata, a biological sequence can be represented by a unique image, the so-called cellular automata image. Many important features, which are originally hidden in a long and complicated biological sequence, can be clearly revealed through its cellular automata image. With biological sequences entering into databanks rapidly increasing in the post-genomic era, it is anticipated that the cellular automata image will become a very useful vehicle for investigation into their key features, identification of their function, as well as revelation of their fingerprint. It is anticipated that by using the concept of the pseudo amino acid composition (Chou, K.C. Proteins: Structure, Function, and Genetics, 2001, 43, 246–255), the cellular automata image approach can also be used to improve the quality of predicting protein attributes, such as structural class and subcellular location.
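
The general idea can be sketched with an elementary (two-state, nearest-neighbor) cellular automaton. The 2-bit nucleotide code, the choice of rule 90, the periodic boundary, and the number of steps below are all assumptions made for illustration; the abstract does not state the paper's digital coding or evolution rules, and they are not reproduced here.

```python
# Hypothetical 2-bit code per nucleotide (not the paper's coding).
CODE = {"A": (0, 0), "C": (0, 1), "G": (1, 0), "T": (1, 1)}

def encode(seq):
    """Flatten a nucleotide sequence into a list of 0/1 cells."""
    return [bit for base in seq for bit in CODE[base]]

def step(cells, rule):
    """One update of an elementary cellular automaton with periodic
    boundaries; `rule` is the Wolfram rule number (0-255)."""
    n = len(cells)
    return [(rule >> (4 * cells[(i - 1) % n] + 2 * cells[i] + cells[(i + 1) % n])) & 1
            for i in range(n)]

def ca_image(seq, rule=90, steps=8):
    """Rows of 0/1 cells: the encoded sequence followed by `steps` updates."""
    rows = [encode(seq)]
    for _ in range(steps):
        rows.append(step(rows[-1], rule))
    return rows
```

Rendering each row as a line of black and white pixels gives the space-time image; which rule produces the most informative picture is the question the paper addresses and this sketch does not.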

16.

Marginalized in the Middle

Michael F. Brown 《American anthropologist》1998,100(1):202-203

17.

Identifying discriminative classification-based motifs in biological sequences

Vens C Rosso MN Danchin EG 《Bioinformatics (Oxford, England)》2011,27(9):1231-1238

18.

BLogo: a tool for visualization of bias in biological sequences

Li W Yang B Liang S Wang Y Whiteley C Cao Y Wang X 《Bioinformatics (Oxford, England)》2008,24(19):2254-2255

BLogo is a web-based tool that detects and displays statistically significant position-specific sequence bias with reduced background noise. The over-represented and under-represented symbols in a particular position are shown above and below the zero line. When the sequences are in open reading frames, the background frequency of nucleotides can be calculated separately for the three positions of a codon, thus greatly reducing the background noise. The χ² test or Fisher's exact test is used to evaluate the statistical significance of every symbol in every position, and only those that are significant are highlighted in the resulting logo. The Perl source code of the program is freely available and can be run locally. AVAILABILITY: http://acephpx.cropdb.org/blogo/, http://www.bioinformatics.org/blogo/.

19.

Using substitution matrices to estimate probability distributions for biological sequences.

Eleazar Eskin William Stafford Noble Yoram Singer 《Journal of computational biology》2002,9(6):775-791

Accurately estimating probabilities from observations is important for probabilistic-based approaches to problems in computational biology. In this paper we present a biologically-motivated method for estimating probability distributions over discrete alphabets from observations using a mixture model of common ancestors. The method is an extension of substitution matrix-based probability estimation methods. In contrast to previous such methods, our method has a simple Bayesian interpretation and has the advantage over Dirichlet mixtures that it is both effective and simple to compute for large alphabets. The method is applied to estimate amino acid probabilities based on observed counts in an alignment and is shown to perform comparably to previous methods. The method is also applied to estimate probability distributions over protein families and improves protein classification accuracy.

20.

Pegasys: software for executing and integrating analyses of biological sequences

Sohrab P Shah, David YM He, Jessica N Sawkins, Jeffrey C Druce, Gerald Quon, Drew Lett, Grace XY Zheng, Tao Xu, BF Francis Ouellette 《BMC bioinformatics》2004,5(1):40

Background

We present Pegasys – a flexible, modular and customizable software system that facilitates the execution and data integration from heterogeneous biological sequence analysis tools.