Similar articles
20 similar articles found (search time: 234 ms)
1.
Given a multiple alignment of orthologous DNA sequences and a phylogenetic tree for these sequences, we investigate the problem of reconstructing a most parsimonious scenario of insertions and deletions capable of explaining the gaps observed in the alignment. This problem, called the Indel Parsimony Problem, is a crucial component of the problem of ancestral genome reconstruction, and its solution provides valuable information to many genome functional annotation approaches. We first show that the problem is NP-complete. Second, we provide an algorithm based on the fractional relaxation of an integer linear programming formulation. The algorithm is fast in practice, and the solutions it produces are, in most cases, provably optimal. We describe a divide-and-conquer approach that makes it possible to solve very large instances on a simple desktop machine, while retaining guaranteed optimality. Our algorithms are tested and shown to be efficient and accurate on a set of 1.8 Mb mammalian orthologous sequences in the CFTR region.

2.
Inferring haplotype data from genotype data is a crucial step in linking SNPs to human diseases. Given n genotypes over m SNP sites, the haplotype inference (HI) problem deals with finding a set of haplotypes so that each given genotype can be formed by combining a pair of haplotypes from the set. The perfect phylogeny haplotyping (PPH) problem is one of the many computational approaches to the HI problem. Though it was conjectured that the complexity of the PPH problem was O(nm), the complexity of all the solutions presented until recently was O(nm^2). In this paper, we make complete use of the column ordering that was presented earlier and show that there must be some interdependencies among the pairwise relationships between SNP sites in order for the given genotypes to allow a perfect phylogeny. Based on these interdependencies, we introduce the FlexTree (flexible tree) data structure that represents all the pairwise relationships in O(m) space. The FlexTree data structure provides a compact representation of all the perfect phylogenies for the given set of genotypes. We also introduce an ordering of the genotypes that allows the genotypes to be added to the FlexTree sequentially. The column ordering, the FlexTree data structure, and the row ordering we introduce make the O(nm) OPPH algorithm possible. We present some results on simulated data which demonstrate that the OPPH algorithm performs quite impressively when compared to the previous algorithms. The OPPH algorithm is one of the first O(nm) algorithms presented for the PPH problem.
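To make the haplotype inference setting above concrete, the following is a minimal sketch of what it means for a pair of haplotypes to form a genotype. It assumes the common 0/1/2 encoding (0 and 1 for homozygous sites, 2 for heterozygous sites), which the abstract itself does not state; the function names `explains` and `resolving_pairs` are hypothetical and are not taken from the paper. It does not implement the FlexTree or OPPH algorithm.

```python
from itertools import product

def explains(h1, h2, genotype):
    """Return True if haplotypes h1 and h2 combine to form the genotype.

    Assumed encoding: genotype sites are 0 or 1 (homozygous) or 2 (heterozygous);
    haplotype sites are 0 or 1.
    """
    if not (len(h1) == len(h2) == len(genotype)):
        return False
    for a, b, g in zip(h1, h2, genotype):
        if g == 2:
            # Heterozygous site: the two haplotypes must differ here.
            if a == b:
                return False
        elif not (a == b == g):
            # Homozygous site: both haplotypes must carry the genotype's allele.
            return False
    return True

def resolving_pairs(genotype):
    """Enumerate all unordered haplotype pairs that explain one genotype."""
    het = [i for i, g in enumerate(genotype) if g == 2]
    pairs = set()
    for bits in product((0, 1), repeat=len(het)):
        h1, h2 = list(genotype), list(genotype)
        for i, b in zip(het, bits):
            h1[i], h2[i] = b, 1 - b
        pairs.add(frozenset((tuple(h1), tuple(h2))))
    return pairs

g = (0, 2, 1, 2)
print(explains((0, 0, 1, 0), (0, 1, 1, 1), g))  # True
print(len(resolving_pairs(g)))                  # 2
```

A genotype with k heterozygous sites (k ≥ 1) has 2^(k-1) resolving pairs, which is why additional structure such as a perfect phylogeny is needed to choose among them.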

3.
4.
Signal transduction is an important process that transmits signals from the outside of a cell to the inside to mediate sophisticated biological responses. Effective computational models to unravel such a process by taking advantage of high-throughput genomic and proteomic data are needed to understand the essential mechanisms underlying the signaling pathways. In this article, we propose a novel method for uncovering signal transduction networks (STNs) by integrating protein interaction with gene expression data. Specifically, we formulate the STN identification problem as an integer linear programming (ILP) model, which can actually be solved by a relaxed linear programming algorithm and is flexible in handling various prior information without any restriction on the network structures. The numerical results on yeast MAPK signaling pathways demonstrate that the proposed ILP model is able to uncover STNs or pathways in an efficient and accurate manner. In particular, the prediction results are found to be in high agreement with current biological knowledge and available information in the literature. In addition, the proposed model is simple to interpret and easy to implement even for a large-scale system.

5.
Haplotyping as perfect phylogeny: a direct approach.   Cited by: 4 (self-citations: 0, citations by others: 4)
A full haplotype map of the human genome will prove extremely valuable as it will be used in large-scale screens of populations to associate specific haplotypes with specific complex genetic-influenced diseases. A haplotype map project has been announced by NIH. The biological key to that project is the surprising fact that some human genomic DNA can be partitioned into long blocks where genetic recombination has been rare, leading to strikingly fewer distinct haplotypes in the population than previously expected (Helmuth, 2001; Daly et al., 2001; Stephens et al., 2001; Friss et al., 2001). In this paper we explore the algorithmic implications of the no-recombination in long blocks observation, for the problem of inferring haplotypes in populations. This assumption, together with the standard population-genetic assumption of infinite sites, motivates a model of haplotype evolution where the haplotypes in a population are assumed to evolve along a coalescent, which as a rooted tree is a perfect phylogeny. We consider the following algorithmic problem, called the perfect phylogeny haplotyping problem (PPH), which was introduced by Gusfield (2002) - given n genotypes of length m each, does there exist a set of at most 2n haplotypes such that each genotype is generated by a pair of haplotypes from this set, and such that this set can be derived on a perfect phylogeny? The approach taken by Gusfield (2002) to solve this problem reduces it to established, deep results and algorithms from matroid and graph theory. Although that reduction is quite simple and the resulting algorithm nearly optimal in speed, taken as a whole that approach is quite involved, and in particular, challenging to program. Moreover, anyone wishing to fully establish, by reading existing literature, the correctness of the entire algorithm would need to read several deep and difficult papers in graph and matroid theory. 
However, as stated by Gusfield (2002), many simplifications are possible and the list of "future work" in Gusfield (2002) began with the task of developing a simpler, more direct, yet still efficient algorithm. This paper accomplishes that goal, for both the rooted and unrooted PPH problems. It establishes a simple, easy-to-program, O(nm^2)-time algorithm that determines whether there is a PPH solution for input genotypes and produces a linear-space data structure to represent all of the solutions. The approach allows complete, self-contained proofs. In addition to algorithmic simplicity, the approach here makes the representation of all solutions more intuitive than in Gusfield (2002), and achieves another goal from that paper, namely, to prove a nontrivial upper bound on the number of PPH solutions, showing that that number is vastly smaller than the number of haplotype solutions (each solution being a set of n pairs of haplotypes that can generate the genotypes) when the perfect phylogeny requirement is not imposed.

6.
Accurate phylogenetic reconstruction methods are currently limited to a maximum of a few dozen taxa. Supertree methods construct a large tree over a large set of taxa, from a set of small trees over overlapping subsets of the complete taxa set. Hence, in order to construct the tree of life over a million and a half different species, the use of a supertree method over the product of accurate methods is inevitable. Perhaps the simplest version of this task that is still widely applicable, yet quite challenging, is quartet-based reconstruction. This problem lies at the root of many tree reconstruction methods, and theoretical as well as experimental results have been reported. Nevertheless, dealing with false, conflicting quartet trees remains problematic. In this paper, we describe an algorithm for constructing a tree from a set of input quartet trees even with a significant fraction of errors. We show empirically that conflicts in the inputs are handled satisfactorily and that it significantly outperforms and outraces the Matrix Representation with Parsimony (MRP) methods that have previously been most successful in dealing with supertrees. Our algorithm is based on a divide-and-conquer algorithm where our divide step uses a semidefinite programming (SDP) formulation of MaxCut. We remark that this builds on previous work of ours for piecing together trees from rooted triplet trees. The recursion for unrooted quartets, however, is more complicated in that even with a completely consistent set of quartet trees the problem is NP-hard, as opposed to the problem for triples where there is a linear time algorithm. This complexity leads to several issues and some solutions of possible independent interest.

7.
This paper studies haplotype inference by maximum parsimony using population data. We define the optimal haplotype inference (OHI) problem as follows: given a set of genotypes and a set of related haplotypes, find a minimum subset of haplotypes that can resolve all the genotypes. We prove that OHI is NP-hard and can be formulated as an integer quadratic programming (IQP) problem. To solve the IQP problem, we propose an iterative semidefinite programming-based approximation algorithm (called SDPHapInfer). We show that this algorithm finds a solution within a factor of O(log n) of the optimal solution, where n is the number of genotypes. This algorithm has been implemented and tested on a variety of simulated and biological data. In comparison with three other methods, (1) HAPAR, which was implemented based on the branch-and-bound algorithm, (2) HAPLOTYPER, which was implemented based on the expectation-maximization algorithm, and (3) PHASE, which combined the Gibbs sampling algorithm with an approximate coalescent prior, the experimental results indicate that SDPHapInfer and HAPLOTYPER have similar error rates. In addition, the results generated by PHASE have lower error rates on some data but higher error rates on others. The error rates of HAPAR are higher than the others on biological data. In terms of efficiency, SDPHapInfer, HAPLOTYPER, and PHASE output a solution in a stable and consistent way, and they run much faster than HAPAR when the number of genotypes becomes large.

8.
The problem Parsimony Haplotyping (PH) asks for the smallest set of haplotypes which can explain a given set of genotypes, and the problem Minimum Perfect Phylogeny Haplotyping (MPPH) asks for the smallest such set which also allows the haplotypes to be embedded in a perfect phylogeny, an evolutionary tree with biologically-motivated restrictions. For PH, we extend recent work by further mapping the interface between 'easy' and 'hard' instances, within the framework of (k,l)-bounded instances where the number of 2's per column and row of the input matrix is restricted. By exploring, in the same way, the tractability frontier of MPPH we provide the first concrete, positive results for this problem. In addition, we construct for both PH and MPPH polynomial time approximation algorithms, based on properties of the columns of the input matrix.
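As an illustration of the PH problem statement above, here is a minimal exhaustive-search sketch for very small instances. It assumes genotypes are rows of a 0/1/2 matrix in which 2 marks a heterozygous site (consistent with the "number of 2's" wording above); the function names are hypothetical. This is plain enumeration with exponential running time, not the approximation algorithms described in the abstract.

```python
from itertools import combinations, product

def explains(h1, h2, g):
    """True if haplotypes h1, h2 form genotype g (2 = heterozygous site)."""
    return all((a != b) if x == 2 else (a == b == x)
               for a, b, x in zip(h1, h2, g))

def smallest_explaining_set(genotypes):
    """Return one smallest set of haplotypes explaining every genotype.

    Brute force over subsets of candidate haplotypes, smallest subsets first;
    only practical for a handful of short genotypes.
    """
    m = len(genotypes[0])
    # A haplotype can only be useful if it agrees with some genotype
    # on all of that genotype's homozygous sites.
    candidates = [h for h in product((0, 1), repeat=m)
                  if any(all(x == 2 or x == a for a, x in zip(h, g))
                         for g in genotypes)]
    for k in range(1, len(candidates) + 1):
        for subset in combinations(candidates, k):
            if all(any(explains(h1, h2, g) for h1 in subset for h2 in subset)
                   for g in genotypes):
                return subset
    return None

print(smallest_explaining_set([(0, 2), (2, 0), (2, 2)]))
# ((0, 0), (0, 1), (1, 0))
```

In this toy instance the third genotype (2, 2) could be resolved by (0, 0) and (1, 1), but reusing (0, 1) and (1, 0) keeps the set at three haplotypes, which is the parsimony criterion at work.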

9.
A new optimization model is discussed for the design of cellular manufacturing systems. It is based on an integer programming formulation that updates some other models by eliminating redundant machine assignment and cost coefficients dependent on cell configuration. To reduce computational burdens, a simplified integer programming model and a decomposition algorithm are proposed. Several computer solutions were performed to evaluate the performance of the new model. The computational results are discussed.

10.
MOTIVATION: Extracting useful information from expression levels of thousands of genes generated with microarray technology needs a variety of analytical techniques. Mathematical programming approaches for classification analysis outperform parametric methods when the data depart from assumptions underlying these methods. Therefore, a mathematical programming approach is developed for gene selection and tissue classification using gene expression profiles. RESULTS: A new mixed integer programming model is formulated for this purpose. The mixed integer programming model simultaneously selects genes and constructs a classification model to classify two groups of tissue samples as accurately as possible. Very encouraging results were obtained with two data sets from the literature as examples. These results show that the mathematical programming approach can rival or outperform traditional classification methods.

11.
Identification of communities in complex networks is an important topic in many fields such as sociology, biology, and computer science. Communities are often defined as groups of related nodes or links that correspond to functional subunits in the corresponding complex systems. While most conventional approaches have focused on discovering communities of nodes, some recent studies have started partitioning links to find overlapping communities directly. In this paper, we propose a new quantity function for link community identification in complex networks. Based on this quantity function we formulate the link community partition problem as an integer programming model which allows us to partition a complex network into overlapping communities. We further propose a genetic algorithm for link community detection which can partition a network into overlapping communities without knowing the number of communities. We test our model and algorithm on both artificial networks and real-world networks. The results demonstrate that the model and algorithm are efficient in detecting overlapping community structure in complex networks.

12.
Yeh CW, Chu CP, Wu KR. Bio Systems, 2006, 83(1): 56-66
Binary optimization is a widely investigated topic in integer linear programming. This study proposes a DNA-based computing algorithm for solving the significantly large binary integer programming (BIP) problem. The proposed approach is based upon Adleman and Lipton's DNA operations to solve the BIP problem. The potential of DNA computation for the BIP problem is promising given the operational time complexity of O(n × k).

13.
The problem of inferring haplotype phase from a population of genotypes has received a lot of attention recently. This is partly due to the observation that there are many regions on human genomic DNA where genetic recombination is rare (Helmuth, 2001; Daly et al., 2001; Stephens et al., 2001; Friss et al., 2001). A Haplotype Map project has been announced by NIH to identify and characterize populations in terms of these haplotypes. Recently, Gusfield introduced the perfect phylogeny haplotyping problem, as an algorithmic implication of the no-recombination in long blocks observation, together with the standard population-genetic assumption of infinite sites. Gusfield's solution based on matroid theory was followed by direct Θ(nm^2) solutions that use simpler techniques (Bafna et al., 2003; Eskin et al., 2003), and also bound the number of solutions to the PPH problem. In this short note, we address two questions that were left open. First, can the algorithms of Bafna et al. (2003) and Eskin et al. (2003) be sped up to O(nm + m^2) time, which would imply an O(nm) time bound for the PPH problem? Second, if there are multiple solutions, can we find one that is most parsimonious in terms of the number of distinct haplotypes? We give reductions that suggest that the answer to both questions is "no." For the first problem, we show that computing the output of the first step (in either method) is equivalent to Boolean matrix multiplication. Therefore, the best bound we can presently achieve is O(nm^(ω-1)), where ω ≤ 2.52 is the exponent of matrix multiplication. Thus, any linear time solution to the PPH problem likely requires a different approach. For the second problem of computing a PPH solution that minimizes the number of distinct haplotypes, we show that the problem is NP-hard using a reduction from Vertex Cover (Garey and Johnson, 1979).

14.

Background

Optimal selection of multiple regulatory genes, known as targets, for deletion to enhance or suppress the activities of downstream genes or metabolites is an important problem in genetic engineering. Such problems become more feasible to address in silico due to the availability of more realistic dynamical system models of gene regulatory and metabolic networks. The goal of the computational problem is to search for a subset of genes to knock out so that the activity of a downstream gene or a metabolite is optimized.

Methodology/Principal Findings

Based on discrete dynamical system modeling of gene regulatory networks, an integer programming problem is formulated for the optimal in silico target gene deletion problem. In the first result, the integer programming problem is proved to be NP-hard and equivalent to a nonlinear programming problem. In the second result, a heuristic algorithm, called GKONP, is designed to approximate the optimal solution, involving an approach to prune insignificant terms in the objective function, and the parallel differential evolution algorithm. In the third result, the effectiveness of the GKONP algorithm is demonstrated by applying it to a discrete dynamical system model of the yeast pheromone pathways. The empirical accuracy and time efficiency are assessed in comparison to an optimal, but exhaustive search strategy.

Significance

Although the in silico target gene deletion problem has enormous potential applications in genetic engineering, one must overcome the computational challenge due to its NP-hardness. The presented solution, which has been demonstrated to approximate the optimal solution in a practical amount of time, is among the few that address the computational challenge. In the experiment on the yeast pheromone pathways, the identified best subset of genes for deletion showed an advantage over genes that were selected empirically. Once validated in vivo, the optimal target genes are expected to achieve higher genetic engineering effectiveness than a trial-and-error procedure.

15.
The sugarcane transport system is very complex and uses a daily schedule, consisting of a set of locomotive runs, to satisfy the requirements of the mill and harvesters. The total cost of sugarcane transport operations is very high; over 35% of the total cost of sugarcane production in Australia is incurred in cane transport. Producing efficient schedules for sugarcane transport can reduce the cost and limit the negative effects that this system can have on the raw sugar production system. In this paper, the sugarcane rail operations are formulated as a blocking job shop scheduling problem. A mixed integer programming approach is used to formulate the job shop scheduling problem. Mixed integer programming and constraint programming search techniques are integrated for solving the problem. A case study is solved to test the approach.

16.
In automated production systems like flexible manufacturing systems (FMSs), an important issue is to find an adequate workload for each machine for each time period. Many integer linear programming (ILP) models have been proposed to solve the FMS loading problems, but not all of them take tools into account. Those that do not consider tooling are quite unrealistic, especially when setup times are important with respect to processing times. When tool loading has to be handled by the model, the load assignment may have to be changed completely. In this article we consider FMSs with a tool management of the following type: the system works in time periods whose durations are fixed or not; and tools are loaded on the machines at the beginning of each time period and stay there for the whole time period. Tool changes may occur only at the end of each time period when the system is stopped. We present some integer programming models for handling these situations with several types of objectives. Emphasis is laid on the ILP formulations. Computational complexities are discussed.

17.
A potentially large number of molecular markers are available for identifying genotypes in various species. For wheat, cultivar identity is an important determinant for end-use segregation and for payment of end-point royalties and grower premiums. A number of dominant DNA markers that give either a positive or negative response have been developed previously for wheat cultivar identification. This paper gives a method for identifying minimal marker sets for a given cultivar group, for example those grown in a specific geographical zone. It is based on an integer linear programming formulation of the problem, and can find all minimal marker sets for the group if required. The paper then describes the production of two software packages, GGDS and GGIP, that incorporate this methodology. Various practical issues are also discussed. These packages enable the rapid selection of minimal marker sets for the efficient discrimination of any sample set where the marker responses of the samples are known. They are already being used by the Australian wheat industry.
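The minimal marker set idea above can be sketched on a toy example. The snippet below assumes each cultivar has a known positive (1) or negative (0) response to each marker and searches for all smallest marker subsets under which every cultivar has a distinct response profile. It uses plain exhaustive enumeration rather than the integer linear programming formulation the abstract describes, and the cultivar labels, the data, and the function name `minimal_marker_sets` are hypothetical; it is unrelated to the GGDS and GGIP packages.

```python
from itertools import combinations

def minimal_marker_sets(responses):
    """Return all smallest marker subsets that discriminate every cultivar.

    responses: dict mapping cultivar name -> tuple of 0/1 marker responses.
    A subset (tuple of marker indices) discriminates the group when no two
    cultivars share the same responses on those markers.
    """
    profiles = list(responses.values())
    n_markers = len(profiles[0])
    for k in range(1, n_markers + 1):
        found = []
        for subset in combinations(range(n_markers), k):
            reduced = {tuple(p[j] for j in subset) for p in profiles}
            if len(reduced) == len(profiles):
                found.append(subset)
        if found:
            # Stop at the first size that works: these are the smallest sets.
            return found
    return []

# Hypothetical responses of four cultivars to four dominant markers.
responses = {
    "A": (1, 0, 1, 0),
    "B": (1, 1, 0, 0),
    "C": (0, 1, 1, 0),
    "D": (0, 0, 0, 0),
}
print(minimal_marker_sets(responses))  # [(0, 1), (0, 2), (1, 2)]
```

Enumeration grows exponentially with the number of markers, which is one reason an ILP formulation is attractive for realistic marker panels.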

18.
We developed and tested a quantitative geographic information system (GIS)-based approach for selecting wetland restoration sites. Our approach uses a combination of an existing wetland function evaluation program, a GIS, and integer programming methodology, with the objective of minimizing the cost of restoration subject to meeting environmental requirements. Investigations were conducted on the formulation to examine the effects of problem size, site ordering for input, and restoration targets. The formulation could be solved for the largest problem size tested of 996 integer variables. The larger the problem, the more time it took to solve. Larger restoration targets usually took more sites and more time to solve. Sorting sites by size was found to lead to inefficient and often infeasible solutions. Random sorting of sites was found to be the more efficient method of inputting restoration sites into analysis.

19.
MOTIVATION: Side-chain positioning is a central component of homology modeling and protein design. In a common formulation of the problem, the backbone is fixed, side-chain conformations come from a rotamer library, and a pairwise energy function is optimized. It is NP-complete to find even a reasonable approximate solution to this problem. We seek to put this hardness result into practical context. RESULTS: We present an integer linear programming (ILP) formulation of side-chain positioning that allows us to tackle large problem sizes. We relax the integrality constraint to give a polynomial-time linear programming (LP) heuristic. We apply LP to position side chains on native and homologous backbones and to choose side chains for protein design. Surprisingly, when positioning side chains on native and homologous backbones, optimal solutions using a simple, biologically relevant energy function can usually be found using LP. On the other hand, the design problem often cannot be solved using LP directly; however, optimal solutions for large instances can still be found using the computationally more expensive ILP procedure. While different energy functions also affect the difficulty of the problem, the LP/ILP approach is able to find optimal solutions. Our analysis is the first large-scale demonstration that LP-based approaches are highly effective in finding optimal (and successive near-optimal) solutions for the side-chain positioning problem.

20.
Mitochondrial cytochrome b gene sequences were obtained from representatives of five populations of the Cape galaxias Galaxias zebratus. Four highly distinct genotypes were revealed, with sequence divergence values ranging from 5.8 to 13.8%. Some of the pairwise divergences are the highest yet reported for cytochrome b within a fish species, and are more typical of interspecific and intergeneric comparisons. In contrast, the low genetic divergence (0.3%) between two recently recorded populations from the Krom and Gamtoos (Kouga) Rivers indicates recent gene flow. Parsimony analysis consistently produced a tree with the population from the Olifants River on the west coast as ancestral and eastern populations as progressively more derived. Estimated divergence times for the Olifants population range from 4.4 to 6.6 million years ago. The findings suggest that another detailed revision is required to determine whether these divergent populations represent a single species or a species complex.
