首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 46 毫秒
1.
The Gibbs sampling method has been widely used for sequence analysis after it was successfully applied to the problem of identifying regulatory motif sequences upstream of genes. Since then, numerous variants of the original idea have emerged: however, in all cases the application has been to finding short motifs in collections of short sequences (typically less than 100 nucleotides long). In this paper, we introduce a Gibbs sampling approach for identifying genes in multiple large genomic sequences up to hundreds of kilobases long. This approach leverages the evolutionary relationships between the sequences to improve the gene predictions, without explicitly aligning the sequences. We have applied our method to the analysis of genomic sequence from 14 genomic regions, totaling roughly 1.8 Mb of sequence in each organism. We show that our approach compares favorably with existing ab initio approaches to gene finding, including pairwise comparison based gene prediction methods which make explicit use of alignments. Furthermore, excellent performance can be obtained with as little as four organisms, and the method overcomes a number of difficulties of previous comparison based gene finding approaches: it is robust with respect to genomic rearrangements, can work with draft sequence, and is fast (linear in the number and length of the sequences). It can also be seamlessly integrated with Gibbs sampling motif detection methods.  相似文献   

2.
The design of slice selective pulses for magnetic resonance imaging can be cast as an optimal control problem. The Fourier synthesis method is an existing approach to solve these optimal control problems. In this method the gradient field as well as the excitation field are switched rapidly and their amplitudes are calculated based on a Fourier series expansion. Here, we provide a novel insight into the Fourier synthesis method via representing the Bloch equation in spherical coordinates. Based on the spherical Bloch equation, we propose an alternative sequence of pulses that can be used for slice selection which is more time efficient compared to the original method. Simulation results demonstrate that while the performance of both methods is approximately the same, the required time for the proposed sequence of pulses is half of the original sequence of pulses. Furthermore, the slice selectivity of both sequences of pulses changes with radio frequency field inhomogeneities in a similar way. We also introduce a measure, referred to as gradient complexity, to compare the performance of both sequences of pulses. This measure indicates that for a desired level of uniformity in the excited slice, the gradient complexity for the proposed sequence of pulses is less than the original sequence.  相似文献   

3.
The problem of finding an optimal structural alignment for a pair of superimposed proteins is often amenable to the Smith-Waterman dynamic programming algorithm, which runs in time proportional to the product of lengths of the sequences being aligned. While the quadratic running time is acceptable for computing a single alignment of two fixed protein structures, the time complexity becomes a bottleneck when running the Smith-Waterman routine multiple times in order to find a globally optimal superposition and alignment of the input proteins. We present a subquadratic running time algorithm capable of computing an alignment that optimizes one of the most widely used measures of protein structure similarity, defined as the number of pairs of residues in two proteins that can be superimposed under a predefined distance cutoff. The algorithm presented in this article can be used to significantly improve the speed-accuracy tradeoff in a number of popular protein structure alignment methods.  相似文献   

4.
Given a set D of input sequences, a genealogy for D can be constructed backward in time using such evolutionary events as mutation, coalescent, and recombination. An ancestral configuration (AC) can be regarded as the multiset of all sequences present at a particular point in time in a possible genealogy for D. The complexity of computing the likelihood of observing D depends heavily on the total number of distinct ACs of D and, therefore, it is of interest to estimate that number. For D consisting of binary sequences of finite length, we consider the problem of enumerating exactly all distinct ACs. We assume that the root sequence type is known and that the mutation process is governed by the infinite-sites model. When there is no recombination, we construct a general method of obtaining closed-form formulas for the total number of ACs. The enumeration problem becomes much more complicated when recombination is involved. In that case, we devise a method of enumeration based on counting contingency tables and construct a dynamic programming algorithm for the approach. Last, we describe a method of counting the number of ACs that can appear in genealogies with less than or equal to a given number R of recombinations. Of particular interest is the case in which R is close to the minimum number of recombinations for D.  相似文献   

5.
Constraint-based approaches recently brought new insight into our understanding of metabolism. By making very simple assumptions such as that the system is at steady-state and some reactions are irreversible, and without requiring kinetic parameters, general properties of the system can be derived. A central concept in this methodology is the notion of an elementary mode (EM for short) which represents a minimal functional subsystem. The computation of EMs still forms a limiting step in metabolic studies and several algorithms have been proposed to address this problem leading to increasingly faster methods. However, although a theoretical upper bound on the number of elementary modes that a network may possess has been established, surprisingly, the complexity of this problem has never been systematically studied. In this paper, we give a systematic overview of the complexity of optimisation problems related to modes. We first establish results regarding network consistency. Most consistency problems are easy, i.e., they can be solved in polynomial time. We then establish the complexity of finding and counting elementary modes. We show in particular that finding one elementary mode is easy but that this task becomes hard when a specific EM (i.e. an EM containing some specified reactions) is sought. We then show that counting the number of elementary modes is musical sharpP-complete. We emphasize that the easy problems can be solved using currently existing software packages. We then analyse the complexity of a closely related task which is the computation of so-called minimum reaction cut sets and we show that this problem is hard. We then present two positive results which both allow to avoid computing EMs as a prior to the computation of reaction cuts. The first one is a polynomial approximation algorithm for finding a minimum reaction cut set. The second one is a test for verifying whether a set of reactions constitutes a reaction cut; this test can be readily included in existing algorithms to improve their performance. Finally, we discuss the complexity of other cut-related problems.  相似文献   

6.
An algorithm to enumerate sorting reversals for signed permutations.   总被引:1,自引:0,他引:1  
The rearrangement distance between single-chromosome genomes can be estimated as the minimum number of inversions required to transform the gene ordering observed in one into that observed in the other. This measure, known as "inversion distance," can be computed as the reversal distance between signed permutations. During the past decade, much progress has been made both on the problem of computing reversal distance and on the related problem of finding a minimum-length sequence of reversals, which is known as "sorting by reversals." For most problem instances, however, many minimum-length sequences of reversals exist, and in the absence of auxiliary information, no one is of greater value than the others. The problem of finding all minimum-length sequences of reversals is thus a natural generalization of sorting by reversals, yet it has received little attention. This problem reduces easily to the problem of finding all "sorting reversals" of one permutation with respect to another - that is, all reversals rho such that, if rho is applied to one permutation, then the reversal distance of that permutation from the other is decreased. In this paper, an efficient algorithm is derived to solve the problem of finding all sorting reversals, and experimental results are presented indicating that, while the new algorithm does not represent a significant improvement in asymptotic terms (it takes O(n(3)) time, for permutations of size n; the problem can now be solved by brute force in Theta(n(3)) time), it performs dramatically better in practice than the best known alternative. An implementation of the algorithm is available at www.cse.ucsc.edu/~acs.  相似文献   

7.
MOTIVATION: Orthologous proteins in different species are likely to have similar biochemical function and biological role. When annotating a newly sequenced genome by sequence homology, the most precise and reliable functional information can thus be derived from orthologs in other species. A standard method of finding orthologs is to compare the sequence tree with the species tree. However, since the topology of phylogenetic tree is not always reliable one might get incorrect assignments. RESULTS: Here we present a novel method that resolves this problem by analyzing a set of bootstrap trees instead of the optimal tree. The frequency of orthology assignments in the bootstrap trees can be interpreted as a support value for the possible orthology of the sequences. Our method is efficient enough to analyze data in the scale of whole genomes. It is implemented in Java and calculates orthology support levels for all pairwise combinations of homologous sequences of two species. The method was tested on simulated datasets and on real data of homologous proteins.  相似文献   

8.
We describe a set of computational problems motivated by certain analysis tasks in genome resequencing. These are assembly problems for which multiple distinct sequences must be assembled, but where the relative positions of reads to be assembled are already known. This information is obtained from a common reference genome and is characteristic of resequencing experiments. The simplest variant of the problem aims at determining a minimum set of superstrings such that each sequenced read matches at least one superstring. We give an algorithm with time complexity O(N), where N is the sum of the lengths of reads, substantially improving on previous algorithms for solving the same problem. We also examine the problem of finding the smallest number of reads to remove such that the remaining reads are consistent with k superstrings. By exploiting a surprising relationship with the minimum cost flow problem, we show that this problem can be solved in polynomial time when nested reads are excluded. If nested reads are permitted, this problem of removing the minimum number of reads becomes NP-hard. We show that permitting mismatches between reads and their nearest superstrings generally renders these problems NP-hard.  相似文献   

9.
Advances in the field of bioinformatics have led to reconstruction of genome-scale networks for a number of key organisms. The application of physicochemical constraints to these stoichiometric networks allows researchers, through methods such as flux balance analysis, to highlight key sets of reactions necessary to achieve particular objectives. The key benefits of constraint-based analysis lie in the minimal knowledge required to infer systemic properties. However, network degeneracy leads to a large number of flux distributions that satisfy any objective; moreover, these distributions may be dominated by biologically irrelevant internal cycles. By examining the geometry underlying the problem, we define two methods for finding a unique solution within the space of all possible flux distributions; such a solution contains no internal cycles, and is representative of the space as a whole. The first method draws on typical geometric knowledge, but cannot be applied to large networks because of the high computational complexity of the problem. Thus a second method, an iteration of linear programs which scales easily to the genome scale, is defined. The algorithm is run on four recent genome-scale models, and unique flux solutions are found. The algorithm set out here will allow researchers in flux balance analysis to exchange typical solutions to their models in a reproducible format. Moreover, having found a single solution, statistical analyses such as correlations may be performed.  相似文献   

10.
Sequence alignment is a common method for finding protein structurally conserved/similar regions. However, sequence alignment is often not accurate if sequence identities between to-be-aligned sequences are less than 30%. This is because that for these sequences, different residues may play similar structural roles and they are incorrectly aligned during the sequence alignment using substitution matrix consisting of 20 types of residues. Based on the similarity of physicochemical features, residues can be clustered into a few groups. Using such simplified alphabets, the complexity of protein sequences is reduced and at the same time the key information encoded in the sequences remains. As a result, the accuracy of sequence alignment might be improved if the residues are properly clustered. Here, by using a database of aligned protein structures (DAPS), a new clustering method based on the substitution scores is proposed for the grouping of residues, and substitution matrices of residues at different levels of simplification are constructed. The validity of the reduced alphabets is confirmed by relative entropy analysis. The reduced alphabets are applied to recognition of protein structurally conserved/similar regions by sequence alignment. The results indicate that the accuracy or efficiency of sequence alignment can be improved with the optimal reduced alphabet with N around 9.  相似文献   

11.
MOTIVATION: Computationally identifying non-coding RNA regions on the genome has much scope for investigation and is essentially harder than gene-finding problems for protein-coding regions. Since comparative sequence analysis is effective for non-coding RNA detection, efficient computational methods are expected for structural alignments of RNA sequences. On the other hand, Hidden Markov Models (HMMs) have played important roles for modeling and analysing biological sequences. Especially, the concept of Pair HMMs (PHMMs) have been examined extensively as mathematical models for alignments and gene finding. RESULTS: We propose the pair HMMs on tree structures (PHMMTSs), which is an extension of PHMMs defined on alignments of trees and provides a unifying framework and an automata-theoretic model for alignments of trees, structural alignments and pair stochastic context-free grammars. By structural alignment, we mean a pairwise alignment to align an unfolded RNA sequence into an RNA sequence of known secondary structure. First, we extend the notion of PHMMs defined on alignments of 'linear' sequences to pair stochastic tree automata, called PHMMTSs, defined on alignments of 'trees'. The PHMMTSs provide various types of alignments of trees such as affine-gap alignments of trees and an automata-theoretic model for alignment of trees. Second, based on the observation that a secondary structure of RNA can be represented by a tree, we apply PHMMTSs to the problem of structural alignments of RNAs. We modify PHMMTSs so that it takes as input a pair of a 'linear' sequence and a 'tree' representing a secondary structure of RNA to produce a structural alignment. Further, the PHMMTSs with input of a pair of two linear sequences is mathematically equal to the pair stochastic context-free grammars. We demonstrate some computational experiments to show the effectiveness of our method for structural alignments, and discuss a complexity issue of PHMMTSs.  相似文献   

12.
Discovering simple DNA sequences by the algorithmic significance method   总被引:6,自引:1,他引:5  
A new method, ‘algorithmic significance’, is proposedas a tool for discovery of patterns in DNA sequences. The mainidea is that patterns can be discovered by finding ways to encodethe observed data concisely. In this sense, the method can beviewed as a formal version of the Occam's Razor principle. Inthis paper the method is applied to discover significantly simpleDNA sequences. We define DNA sequences to be simple if theycontain repeated occurrences of certain ‘words’and thus can be encoded in a small number of bits. Such definitionincludes minisatellites and microsatellites. A standard dynamicprogramming algorithm for data compression is applied to computethe minimal encoding lengths of sequences in linear time. Anelectronic mail server for identification of simple sequencesbased on the proposed method has been installed at the Internetaddress pythia@anl.gov.  相似文献   

13.
Inference of haplotypes from PCR-amplified samples of diploid populations   总被引:51,自引:0,他引:51  
Direct sequencing of genomic DNA from diploid individuals leads to ambiguities on sequencing gels whenever there is more than one mismatching site in the sequences of the two orthologous copies of a gene. While these ambiguities cannot be resolved from a single sample without resorting to other experimental methods (such as cloning in the traditional way), population samples may be useful for inferring haplotypes. For each individual in the sample that is homozygous for the amplified sequence, there are no ambiguities in the identification of the allele's sequence. The sequences of other alleles can be inferred by taking the remaining sequence after "subtracting off" the sequencing ladder of each known site. Details of the algorithm for extracting allelic sequences from such data are presented here, along with some population-genetic considerations that influence the likelihood for success of the method. The algorithm also applies to the problem of inferring haplotype frequencies of closely linked restriction-site polymorphisms.  相似文献   

14.
Comparing DNA or protein sequences plays an important role in the functional analysis of genomes. Despite many methods available for sequences comparison, few methods retain the information content of sequences. We propose a new approach, the Yau-Hausdorff method, which considers all translations and rotations when seeking the best match of graphical curves of DNA or protein sequences. The complexity of this method is lower than that of any other two dimensional minimum Hausdorff algorithm. The Yau-Hausdorff method can be used for measuring the similarity of DNA sequences based on two important tools: the Yau-Hausdorff distance and graphical representation of DNA sequences. The graphical representations of DNA sequences conserve all sequence information and the Yau-Hausdorff distance is mathematically proved as a true metric. Therefore, the proposed distance can preciously measure the similarity of DNA sequences. The phylogenetic analyses of DNA sequences by the Yau-Hausdorff distance show the accuracy and stability of our approach in similarity comparison of DNA or protein sequences. This study demonstrates that Yau-Hausdorff distance is a natural metric for DNA and protein sequences with high level of stability. The approach can be also applied to similarity analysis of protein sequences by graphic representations, as well as general two dimensional shape matching.  相似文献   

15.
Recent studies of 16S rRNA sequences through next-generation sequencing have revolutionized our understanding of the microbial community composition and structure. One common approach in using these data to explore the genetic diversity in a microbial community is to cluster the 16S rRNA sequences into Operational Taxonomic Units (OTUs) based on sequence similarities. The inferred OTUs can then be used to estimate species, diversity, composition, and richness. Although a number of methods have been developed and commonly used to cluster the sequences into OTUs, relatively little guidance is available on their relative performance and the choice of key parameters for each method. In this study, we conducted a comprehensive evaluation of ten existing OTU inference methods. We found that the appropriate dissimilarity value for defining distinct OTUs is not only related with a specific method but also related with the sample complexity. For data sets with low complexity, all the algorithms need a higher dissimilarity threshold to define OTUs. Some methods, such as, CROP and SLP, are more robust to the specific choice of the threshold than other methods, especially for shorter reads. For high-complexity data sets, hierarchical cluster methods need a more strict dissimilarity threshold to define OTUs because the commonly used dissimilarity threshold of 3% often leads to an under-estimation of the number of OTUs. In general, hierarchical clustering methods perform better at lower dissimilarity thresholds. Our results show that sequence abundance plays an important role in OTU inference. We conclude that care is needed to choose both a threshold for dissimilarity and abundance for OTU inference.  相似文献   

16.
Motivation: Differential detection on symptom-related pathogens(SRP) is critical for fast identification and accurate controlagainst epidemic diseases. Conventional polymerase chain reaction(PCR) requires a large number of unique primers to amplify selectedSRP target sequences. With multiple-use primers (mu-primers),multiple targets can be amplified and detected in one PCR experimentunder standard reaction condition and reduced detection complexity.However, the time complexity of designing mu-primers with thebest heuristic method available is too vast. We have formulatedminimum-set mu-primer design problem as a set covering problem(SCP), and used modified compact genetic algorithm (MCGA) tosolve this problem optimally and efficiently. We have also proposednew strategies of primer/probe design algorithm (PDA) on combiningboth minimum-set (MS) mu-primers and unique (UniQ) probes. Designedprimer/probe set by PDA-MS/UniQ can amplify multiple genes simultaneouslyupon physical presence with minimum-set mu-primer amplification(MMA) before intended differential detection with probes-arrayhybridization (PAH) on the selected target set of SRP. Results: The proposed PDA-MS/UniQ method pursues a much smallernumber of primers set compared with conventional PCR. In thesimulation experiment for amplifying 12 669 target sequences,the performance of our method with 68% reduction on requiredmu-primers number seems to be superior to the compared heuristicapproaches in both computation efficiency and reduction percentage.Our integrated PDA-MS/UniQ method is applied to the differentialdetection on 9 plant viruses from 4 genera with MMA and PAHof 11 mu-primers instead of 18 unique ones in conventional PCRwhile amplifying overall 9 target sequences. The results ofwet lab experiments with integrated MMA-PAH system have successfullyvalidated the specificity and sensitivity of the primers/probesdesigned with our integrated PDA-MS/UniQ method. Contact: cykao{at}csie.ntu.edu.tw Supplementary information: http://www.csie.ntu.edu.tw/~cykao/pda/  相似文献   

17.
The application of Needleman-Wunsch alignment techniques to biological sequences is complicated by two serious problems when the sequences are long: the running time, which scales as the product of the lengths of sequences, and the difficulty in obtaining suitable parameters that produce meaningful alignments. The running time problem is often corrected by reducing the search space, using techniques such as banding, or chaining of high-scoring pairs. The parameter problem is more difficult to fix, partly because the probabilistic model, which Needleman-Wunsch is equivalent to, does not capture a key feature of biological sequence alignments, namely the alternation of conserved blocks and seemingly unrelated nonconserved segments. We present a solution to the problem of designing efficient search spaces for pair hidden Markov models that align biological sequences by taking advantage of their associated features. Our approach leads to an optimization problem, for which we obtain a 2-approximation algorithm, and that is based on the construction of Manhattan networks, which are close relatives of Steiner trees. We describe the underlying theory and show how our methods can be applied to alignment of DNA sequences in practice, successfully reducing the Viterbi algorithm search space of alignment PHMMs by three orders of magnitude.  相似文献   

18.
Taxonomy-independent analysis plays an essential role in microbial community analysis. Hierarchical clustering is one of the most widely employed approaches to finding operational taxonomic units, the basis for many downstream analyses. Most existing algorithms have quadratic space and computational complexities, and thus can be used only for small or medium-scale problems. We propose a new online learning-based algorithm that simultaneously addresses the space and computational issues of prior work. The basic idea is to partition a sequence space into a set of subspaces using a partition tree constructed using a pseudometric, then recursively refine a clustering structure in these subspaces. The technique relies on new methods for fast closest-pair searching and efficient dynamic insertion and deletion of tree nodes. To avoid exhaustive computation of pairwise distances between clusters, we represent each cluster of sequences as a probabilistic sequence, and define a set of operations to align these probabilistic sequences and compute genetic distances between them. We present analyses of space and computational complexity, and demonstrate the effectiveness of our new algorithm using a human gut microbiota data set with over one million sequences. The new algorithm exhibits a quasilinear time and space complexity comparable to greedy heuristic clustering algorithms, while achieving a similar accuracy to the standard hierarchical clustering algorithm.  相似文献   

19.
Sequence alignment is a common method for finding protein structurally conserved/similar regions. However, sequence alignment is often not accurate if sequence identities between to-be-aligned sequences are less than 30%. This is because that for these sequences, different residues may play similar structural roles and they are incorrectly aligned during the sequence alignment using substitution matrix consisting of 20 types of residues. Based on the similarity of physicochemical features, residues can be clustered into a few groups. Using such simplified alphabets, the complexity of protein sequences is reduced and at the same time the key information encoded in the sequences remains. As a result, the accuracy of sequence alignment might be improved if the residues are properly clustered. Here, by using a database of aligned protein structures (DAPS), a new clustering method based on the substitution scores is proposed for the grouping of residues, and substitution matrices of residues at different levels of simplification are constructed. The validity of the reduced alphabets is confirmed by relative entropy analysis. The reduced alphabets are applied to recognition of protein structurally conserved/similar regions by sequence alignment. The results indicate that the accuracy or efficiency of sequence alignment can be improved with the optimal reduced alphabet with N around 9. Supported by the National Natural Science Foundation of China (Grant Nos. 90403120, 10474041 and 10021001) and the Nonlinear Project (973) of the NSM  相似文献   

20.
As genome-scale metabolic reconstructions emerge, tools to manage their size and complexity will be increasingly important. Flux balance analysis (FBA) is a constraint-based approach widely used to study the metabolic capabilities of cellular or subcellular systems. FBA problems are highly underdetermined and many different phenotypes can satisfy any set of constraints through which the metabolic system is represented.Two of the main concerns in FBA are exploring the space of solutions for a given metabolic network and finding a specific phenotype which is representative for a given task such as maximal growth rate. Here, we introduce a recursive algorithm suitable for overcoming both of these concerns. The method proposed is able to find the alternate optimal patterns of active reactions of an FBA problem and identify the minimal subnetwork able to perform a specific task as optimally as the whole.Our method represents an alternative to and an extension of other approaches conceived for exploring the space of solutions of an FBA problem. It may also be particularly helpful in defining a scaffold of reactions upon which to build up a dynamic model, when the important pathways of the system have not yet been well-defined.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号