首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
Genomic rearrangement operations can be very useful to infer the phylogenetic relationship of gene orders representing species. We study the problem of finding potential ancestral gene orders for the gene orders of given taxa, such that the corresponding rearrangement scenario has a minimal number of reversals, and where each of the reversals has to preserve the common intervals of the given input gene orders. Common intervals identify sets of genes that occur consecutively in all input gene orders. The problem of finding such an ancestral gene order is called the preserving reversal median problem (pRMP). A tree-based data structure for the representation of the common intervals of all input gene orders is used in our exact algorithm TCIP for solving the pRMP. It is known that the minimum number of reversals to transform one gene order into another can be computed in polynomial time, whereas the corresponding problem with the restriction that common intervals should not be destroyed is already NP-hard. It is shown theoretically that TCIP can solve a large class of pRMP instances in polynomial time. Empirically we show the good performance of TCIP on biological and artificial data.  相似文献   

2.
I present an algorithm that determines the longest path between every gene pair in an arbitrarily large genetic network from large scale gene perturbation data. The algorithm's computational complexity is O(nk(2)), where n is the number of genes in the network and k is the average number of genes affected by a genetic perturbation. The algorithm is able to distinguish a large fraction of direct regulatory interactions from indirect interactions, even if the accuracy of its input data is substantially compromised.  相似文献   

3.
We present two parameterized algorithms for the closest string problem. The first runs in O(nL + nd · 17.97d) time for DNA strings and in O(nL + nd · 61.86d) time for protein strings, where n is the number of input strings, L is the length of each input string, and d is the given upper bound on the number of mismatches between the center string and each input string. The second runs in O(nL + nd · 13.92d) time for DNA strings and in O(nL + nd · 47.21d) time for protein strings. We then extend the first algorithm to a new parameterized algorithm for the closest substring problem that runs in O((n - 1)m2(L + d · 17.97d · m[log2(d+1)])) time for DNA strings and in O((n - 1)m2(L + d · 61.86d · m[log2(d+1)])) time for protein strings, where n is the number of input strings, L is the length of the center substring, L - 1 + m is the maximum length of a single input string, and d is the given upper bound on the number of mismatches between the center substring and at least one substring of each input string. All the algorithms significantly improve the previous bests. To verify experimentally the theoretical improvements in the time complexity, we implement our algorithm in C and apply the resulting program to the planted (L, d)-motif problem proposed by Pevzner and Sze in 2000. We compare our program with the previously best exact program for the problem, namely PMSPrune (designed by Davila et al. in 2007). Our experimental data show that our program runs faster for practical cases and also for several challenging cases. Our algorithm uses less memory too.  相似文献   

4.
We study the Simplified Partial Digest Problem (SPDP), which is a mathematical model for a new simplified partial digest method of genome mapping. This method is easy for laboratory implementation and robust with respect to the experimental errors. SPDP is NP-hard in the strong sense. We present an $O(n2;n)$ time enumerative algorithm and an O(n(2q)) time dynamic programming algorithm for the error-free SPDP, where $n$ is the number of restriction sites and n is the number of distinct intersite distances. We also give examples of the problem, in which there are 2(n+2)/(3)-1 non-congruent solutions. These examples partially answer a question recently posed in the literature about the number of solutions of SPDP. We adapt our enumerative algorithm for handling SPDP with imprecise input data. Finally, we describe and discuss the results of the computer experiments with our algorithms.  相似文献   

5.
A large class of neural network models have their units organized in a lattice with fixed topology or generate their topology during the learning process. These network models can be used as neighborhood preserving map of the input manifold, but such a structure is difficult to manage since these maps are graphs with a number of nodes that is just one or two orders of magnitude less than the number of input points (i.e., the complexity of the map is comparable with the complexity of the manifold) and some hierarchical algorithms were proposed in order to obtain a high-level abstraction of these structures. In this paper a general structure capable to extract high order information from the graph generated by a large class of self-organizing networks is presented. This algorithm will allow to build a two layers hierarchical structure starting from the results obtained by using the suitable neural network for the distribution of the input data. Moreover the proposed algorithm is also capable to build a topology preserving map if it is trained using a graph that is also a topology preserving map.  相似文献   

6.
Important desired properties of an algorithm to construct a supertree (species tree) by reconciling input trees are its low complexity and applicability to large biological data. In its common statement the problem is proved to be NP-hard, i.e. to have an exponential complexity in practice. We propose a reformulation of the supertree building problem that allows a computationally effective solution. We introduce a biologically natural requirement that the supertree is sought for such that it does not contain clades incompatible with those existing in the input trees. The algorithm was tested with simulated and biological trees and was shown to possess an almost square complexity even if horizontal transfers are allowed. If HGTs are not assumed, the algorithm is mathematically correct and possesses the longest running time of n3 x[V0]3, where n is the number of input trees and [V0] is the total number of species. The authors are unaware of analogous solutions in published evidence. The corresponding inferring program, its usage examples and manual are freely available at http://lab6.iitp.ru/en/super3gl. The available program does not implement HGTs. The generalized case is described in the publication "A tree nearest in average to a set of trees" (Information Transmission Problems, 2011).  相似文献   

7.
Phylogeny reconstruction is the process of inferring evolutionary relationships from molecular sequences, and methods that are expected to accurately reconstruct trees from sequences of reasonable length are highly desirable. To formalize this concept, the property of fast-convergence has been introduced to describe phylogeny reconstruction methods that, with high probability, recover the true tree from sequences that grow polynomially in the number of taxa n. While provably fast-converging methods have been developed, the neighbor-joining (NJ) algorithm of Saitou and Nei remains one of the most popular methods used in practice. This algorithm is known to converge for sequences that are exponential in n, but no lower bound for its convergence rate has been established. To address this theoretical question, we analyze the performance of the NJ algorithm on a type of phylogeny known as a 'caterpillar tree'. We find that, for sequences of polynomial length in the number of taxa n, the variability of the NJ criterion is sufficiently high that the algorithm is likely to fail even in the first step of the phylogeny reconstruction process, regardless of the degree of polynomial considered. This result demonstrates that, for general n-taxa trees, the exponential bound cannot be improved.  相似文献   

8.
When a data set is repeatedly clustered using unsupervised techniques, the resulting clusterings, even if highly similar, may list their clusters in different orders. This so-called ‘label-switching’ phenomenon obscures meaningful differences between clusterings, complicating their comparison and summary. The problem often arises in the context of population structure analysis based on multilocus genotype data. In this field, a variety of popular tools apply model-based clustering, assigning individuals to a prespecified number of ancestral populations. Since such methods often involve stochastic components, it is a common practice to perform multiple replicate analyses based on the same input data and parameter settings. Available postprocessing tools allow to mitigate label switching, but leave room for improvements, in particular, regarding large input data sets. In this work, I present Crimp , a lightweight command-line tool, which offers a relatively fast and scalable heuristic to align clusters across replicate clusterings consisting of the same number of clusters. For small problem sizes, an exact algorithm can be used as an alternative. Additional features include row-specific weights, input and output files similar to those of CLUMPP (Jakobsson & Rosenberg, 2007) and the evaluation of a given solution in terms of CLUMPP as well as its own objective functions. Benchmark analyses show that Crimp , especially when applied to larger data sets, tends to outperform alternative tools considering runtime requirements and various quality measures. While primarily targeting population structure analysis, Crimp can be used as a generic tool to correct multiple clusterings for label switching. This facilitates their comparison and allows to generate an averaged clustering. Crimp 's computational efficiency makes it even applicable to relatively large data sets while offering competitive solution quality.  相似文献   

9.
Lorenz WA  Clote P 《PloS one》2011,6(1):e16178
An RNA secondary structure is locally optimal if there is no lower energy structure that can be obtained by the addition or removal of a single base pair, where energy is defined according to the widely accepted Turner nearest neighbor model. Locally optimal structures form kinetic traps, since any evolution away from a locally optimal structure must involve energetically unfavorable folding steps. Here, we present a novel, efficient algorithm to compute the partition function over all locally optimal secondary structures of a given RNA sequence. Our software, RNAlocopt runs in O(n3) time and O(n2) space. Additionally, RNAlocopt samples a user-specified number of structures from the Boltzmann subensemble of all locally optimal structures. We apply RNAlocopt to show that (1) the number of locally optimal structures is far fewer than the total number of structures--indeed, the number of locally optimal structures approximately equal to the square root of the number of all structures, (2) the structural diversity of this subensemble may be either similar to or quite different from the structural diversity of the entire Boltzmann ensemble, a situation that depends on the type of input RNA, (3) the (modified) maximum expected accuracy structure, computed by taking into account base pairing frequencies of locally optimal structures, is a more accurate prediction of the native structure than other current thermodynamics-based methods. The software RNAlocopt constitutes a technical breakthrough in our study of the folding landscape for RNA secondary structures. For the first time, locally optimal structures (kinetic traps in the Turner energy model) can be rapidly generated for long RNA sequences, previously impossible with methods that involved exhaustive enumeration. Use of locally optimal structure leads to state-of-the-art secondary structure prediction, as benchmarked against methods involving the computation of minimum free energy and of maximum expected accuracy. Web server and source code available at http://bioinformatics.bc.edu/clotelab/RNAlocopt/.  相似文献   

10.
K-ary clustering with optimal leaf ordering for gene expression data   总被引:2,自引:0,他引:2  
MOTIVATION: A major challenge in gene expression analysis is effective data organization and visualization. One of the most popular tools for this task is hierarchical clustering. Hierarchical clustering allows a user to view relationships in scales ranging from single genes to large sets of genes, while at the same time providing a global view of the expression data. However, hierarchical clustering is very sensitive to noise, it usually lacks of a method to actually identify distinct clusters, and produces a large number of possible leaf orderings of the hierarchical clustering tree. In this paper we propose a new hierarchical clustering algorithm which reduces susceptibility to noise, permits up to k siblings to be directly related, and provides a single optimal order for the resulting tree. RESULTS: We present an algorithm that efficiently constructs a k-ary tree, where each node can have up to k children, and then optimally orders the leaves of that tree. By combining k clusters at each step our algorithm becomes more robust against noise and missing values. By optimally ordering the leaves of the resulting tree we maintain the pairwise relationships that appear in the original method, without sacrificing the robustness. Our k-ary construction algorithm runs in O(n(3)) regardless of k and our ordering algorithm runs in O(4(k)n(3)). We present several examples that show that our k-ary clustering algorithm achieves results that are superior to the binary tree results in both global presentation and cluster identification. AVAILABILITY: We have implemented the above algorithms in C++ on the Linux operating system.  相似文献   

11.
The majority of proteins function when associated in multimolecular assemblies. Yet, prediction of the structures of multimolecular complexes has largely not been addressed, probably due to the magnitude of the combinatorial complexity of the problem. Docking applications have traditionally been used to predict pairwise interactions between molecules. We have developed an algorithm that extends the application of docking to multimolecular assemblies. We apply it to predict quaternary structures of both oligomers and multi-protein complexes. The algorithm predicted well a near-native arrangement of the input subunits for all cases in our data set, where the number of the subunits of the different target complexes varied from three to ten. In order to simulate a more realistic scenario, unbound cases were tested. In these cases the input conformations of the subunits are either unbound conformations of the subunits or a model obtained by a homology modeling technique. The successful predictions of the unbound cases, where the input conformations of the subunits are different from their conformations within the target complex, suggest that the algorithm is robust. We expect that this type of algorithm should be particularly useful to predict the structures of large macromolecular assemblies, which are difficult to solve by experimental structure determination.  相似文献   

12.
DNA amplifications and deletions characterize cancer genome and are often related to disease evolution. Microarray-based techniques for measuring these DNA copy-number changes use fluorescence ratios at arrayed DNA elements (BACs, cDNA, or oligonucleotides) to provide signals at high resolution, in terms of genomic locations. These data are then further analyzed to map aberrations and boundaries and identify biologically significant structures. We develop a statistical framework that enables the casting of several DNA copy number data analysis questions as optimization problems over real-valued vectors of signals. The simplest form of the optimization problem seeks to maximize phi(I) = Sigmanu(i)/radical|I| over all subintervals I in the input vector. We present and prove a linear time approximation scheme for this problem, namely, a process with time complexity O (nepsilon(-2)) that outputs an interval for which phi(I) is at least Opt/alpha(epsilon), where Opt is the actual optimum and alpha(epsilon) --> 1 as epsilon --> 0. We further develop practical implementations that improve the performance of the naive quadratic approach by orders of magnitude. We discuss properties of optimal intervals and how they apply to the algorithm performance. We benchmark our algorithms on synthetic as well as publicly available DNA copy number data. We demonstrate the use of these methods for identifying aberrations in single samples as well as common alterations in fixed sets and subsets of breast cancer samples.  相似文献   

13.
Quantitative time-series observation of gene expression is becoming possible, for example by cell array technology. However, there are no practical methods with which to infer network structures using only observed time-series data. As most computational models of biological networks for continuous time-series data have a high degree of freedom, it is almost impossible to infer the correct structures. On the other hand, it has been reported that some kinds of biological networks, such as gene networks and metabolic pathways, may have scale-free properties. We hypothesize that the architecture of inferred biological network models can be restricted to scale-free networks. We developed an inference algorithm for biological networks using only time-series data by introducing such a restriction. We adopt the S-system as the network model, and a distributed genetic algorithm to optimize models to fit its simulated results to observed time series data. We have tested our algorithm on a case study (simulated data). We compared optimization under no restriction, which allows for a fully connected network, and under the restriction that the total number of links must equal that expected from a scale free network. The restriction reduced both false positive and false negative estimation of the links and also the differences between model simulation and the given time-series data.  相似文献   

14.
MEME and many other popular motif finders use the expectation-maximization (EM) algorithm to optimize their parameters. Unfortunately, the running time of EM is linear in the length of the input sequences. This can prohibit its application to data sets of the size commonly generated by high-throughput biological techniques. A suffix tree is a data structure that can efficiently index a set of sequences. We describe an algorithm, Suffix Tree EM for Motif Elicitation (STEME), that approximates EM using suffix trees. To the best of our knowledge, this is the first application of suffix trees to EM. We provide an analysis of the expected running time of the algorithm and demonstrate that STEME runs an order of magnitude more quickly than the implementation of EM used by MEME. We give theoretical bounds for the quality of the approximation and show that, in practice, the approximation has a negligible effect on the outcome. We provide an open source implementation of the algorithm that we hope will be used to speed up existing and future motif search algorithms.  相似文献   

15.
Multiple sequence alignment (MSA) is one of the most fundamental problems in computational molecular biology. The running time of the best known scheme for finding an optimal alignment, based on dynamic programming, increases exponentially with the number of input sequences. Hence, many heuristics were suggested for the problem. We consider a version of the MSA problem where the goal is to find an optimal alignment in which matches are restricted to positions in predefined matching segments. We present several techniques for making the dynamic programming algorithm more efficient, while still finding an optimal solution under these restrictions. We prove that it suffices to find an optimal alignment of the predefined sequence segments, rather than single letters, thereby reducing the input size and thus improving the running time. We also identify "shortcuts" that expedite the dynamic programming scheme. Empirical study shows that, taken together, these observations lead to an improved running time over the basic dynamic programming algorithm by 4 to 12 orders of magnitude, while still obtaining an optimal solution. Under the additional assumption that matches between segments are transitive, we further improve the running time for finding the optimal solution by restricting the search space of the dynamic programming algorithm  相似文献   

16.
The recent interest sparked due to the discovery of a variety of functions for non-coding RNA molecules has highlighted the need for suitable tools for the analysis and the comparison of RNA sequences. Many trans-acting non-coding RNA genes and cis-acting RNA regulatory elements present motifs, conserved both in structure and sequence, that can be hardly detected by primary sequence analysis alone. We present an algorithm that takes as input a set of unaligned RNA sequences expected to share a common motif, and outputs the regions that are most conserved throughout the sequences, according to a similarity measure that takes into account both the sequence of the regions and the secondary structure they can form according to base-pairing and thermodynamic rules. Only a single parameter is needed as input, which denotes the number of distinct hairpins the motif has to contain. No further constraints on the size, number and position of the single elements comprising the motif are required. The algorithm can be split into two parts: first, it extracts from each input sequence a set of candidate regions whose predicted optimal secondary structure contains the number of hairpins given as input. Then, the regions selected are compared with each other to find the groups of most similar ones, formed by a region taken from each sequence. To avoid exhaustive enumeration of the search space and to reduce the execution time, a greedy heuristic is introduced for this task. We present different experiments, which show that the algorithm is capable of characterizing and discovering known regulatory motifs in mRNA like the iron responsive element (IRE) and selenocysteine insertion sequence (SECIS) stem–loop structures. We also show how it can be applied to corrupted datasets in which a motif does not appear in all the input sequences, as well as to the discovery of more complex motifs in the non-coding RNA.  相似文献   

17.
MOTIVATION: Algorithms for phylogenetic tree reconstruction based on gene order data typically repeatedly solve instances of the reversal median problem (RMP) which is to find for three given gene orders a fourth gene order (called median) with a minimal sum of reversal distances. All existing algorithms of this type consider only one median for each RMP instance even when a large number of medians exist. A careful selection of one of the medians might lead to better phylogenetic trees. RESULTS: We propose a heuristic algorithm amGRP for solving the multiple genome rearrangement problem (MGRP) by repeatedly solving instances of the RMP taking all medians into account. Algorithm amGRP uses a branch-and-bound method that branches over medians from a selected subset of all medians for each RMP instance. Different heuristics for selecting the subsets have been investigated. To show that the medians for RMP vary strongly with respect to different properties that are likely to be relevant for phylogenetic tree reconstruction, the set of all medians has been investigated for artificial datasets and mitochondrial DNA (mtDNA) gene orders. Phylogenetic trees have been computed for a large set of randomly generated gene orders and two sets of mtDNA gene order data for different animal taxa with amGRP and with two standard approaches for solving the MGRP (GRAPPA-DCM and MGR). The results show that amGRP outperforms both other methods with respect to solution quality and computation time on the test data. AVAILABILITY: The source code of amGRP, additional results and the test instances used in this paper are freely available from the authors.  相似文献   

18.
A genetic map is an ordering of genetic markers calculated from a population of known lineage.While traditionally a map has been generated from a single population for each species, recently researchers have created maps from multiple populations. In the face of these new data, we address the need to find a consensus map--a map that combines the information from multiple partial and possibly inconsistent input maps. We model each input map as a partial order and formulate the consensus problem as finding a median partial order. Finding the median of multiple total orders (preferences or rankings)is a well studied problem in social choice. We choose to find the median using the weighted symmetric difference distance, a more general version of both the symmetric difference distance and the Kemeny distance. Finding a median order using this distance is NP-hard. We show that for our chosen weight assignment, a median order satisfies the positive responsiveness, extended Condorcet,and unanimity criteria. Our solution involves finding the maximum acyclic subgraph of a weighted directed graph.We present a method that dynamically switches between an exact branch and bound algorithm and a heuristic algorithm, and show that for real data from closely related organisms, an exact median can often be found.We present experimental results using seven populations of the crop plant Zea mays.  相似文献   

19.
Electron cryomicroscopy (cryo-EM) has emerged as a powerful structural biology instrument to solve near-atomic three-dimensional structures. Despite the fast growth in the number of density maps generated from cryo-EM data, comparison tools among these reconstructions are still lacking. Current proposals to compare cryo-EM data derived volumes perform map subtraction based on adjustment of each volume grey level to the same scale. We present here a more sophisticated way of adjusting the volumes before comparing, which implies adjustment of grey level scale and spectrum energy, but keeping phases intact inside a mask and imposing the results to be strictly positive. The adjustment that we propose leaves the volumes in the same numeric frame, allowing to perform operations among the adjusted volumes in a more reliable way. This adjustment can be a preliminary step for several applications such as comparison through subtraction, map sharpening, or combination of volumes through a consensus that selects the best resolved parts of each input map. Our development might also be used as a sharpening method using an atomic model as a reference. We illustrate the applicability of this algorithm with the reconstructions derived of several experimental examples. This algorithm is implemented in Xmipp software package and its applications are user-friendly accessible through the cryo-EM image processing framework Scipion.  相似文献   

20.
We examined the factors controlling fish species richness and taxa-habitat relationships in the Malmanoury and Karouabo coastal streams in French Guiana between the short and long rainy seasons. The aims were to evaluate the environmental factors that describe species richness on different scales and to define the ecological requirements of fish taxa in the two streams at that period of the year. We sampled ten regularly spaced freshwater sites in each stream with rotenone. We caught a total of 7725 individuals representing 52 taxa from 21 families and 6 orders. More taxa were caught in the Malmanoury (n=46) than in the Karouabo stream (n=37). These values augmented by the number of fish taxa caught only by gill nets in a parallel survey fitted very well to a log-log model of fish richness versus catchment area in Guianese rivers. Most of the fish taxa encountered in the Malmanoury and Karouabo streams were of freshwater origin and nearly all the fish species caught in these two small coastal streams were also found in the nearby Sinnamary River with the exceptions of the cichlid Heros severus and the characid Crenuchus spirulus. Moreover, no significant relationship was found between a size-independent estimate of fish richness and distance from the Ocean. Thus, despite their coastal position, the Malmanoury and Karouabo streams contained fish assemblages with strong continental affinities. At a local scale, independently of site size, those with relatively more habitat types harbored a relatively greater number of fish taxa. Canopy cover, water conductivity and bank length were the most important environmental variables for fish assemblage composition at that period of the year. Oxygen and vegetation participated also in defining fish habitat requirements but to a lesser extent. This revised version was published online in July 2006 with corrections to the Cover Date.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号