Similar Articles (20 results)
1.
Given two genomes with duplicate genes, Zero Exemplar Distance is the problem of deciding whether the two genomes can be reduced to the same genome without duplicate genes by deleting all but one copy of each gene in each genome. Blin, Fertin, Sikora, and Vialette recently proved that Zero Exemplar Distance for monochromosomal genomes is NP-hard even if each gene appears at most two times in each genome, thereby settling an important open question on genome rearrangement in the exemplar model. In this article, we give a very simple alternative proof of this result. We also study the problem Zero Exemplar Distance for multichromosomal genomes without gene order, and prove the analogous result that it is also NP-hard even if each gene appears at most two times in each genome. For the positive direction, we show that both variants of Zero Exemplar Distance admit polynomial-time algorithms if each gene appears exactly once in one genome and at least once in the other genome. In addition, we present a polynomial-time algorithm for the related problem Exemplar Longest Common Subsequence in the special case that each mandatory symbol appears exactly once in one input sequence and at least once in the other input sequence. This answers an open question of Bonizzoni et al. We also show that Zero Exemplar Distance for multichromosomal genomes without gene order is fixed-parameter tractable in the general case if the parameter is the maximum number of chromosomes in each genome.
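For the tractable special case above (each gene occurring exactly once in one genome), the ordered, monochromosomal variant comes down to a subsequence test: keeping one copy per gene in the duplicated genome must leave exactly the other genome. A minimal sketch of that check, assuming unsigned genes on a single chromosome (function name is mine, not from the paper):

```python
def zero_exemplar_distance(a, b):
    """Decide zero exemplar distance when every gene occurs exactly once
    in genome `a` and at least once in genome `b` (both lists of gene ids,
    single chromosome, gene signs ignored for simplicity)."""
    if set(a) != set(b):          # the two genomes must use the same gene set
        return False
    # Deleting all but one copy per gene leaves each gene of `b` exactly once;
    # the result can equal `a` iff `a` occurs in `b` as a subsequence.
    it = iter(b)
    return all(g in it for g in a)   # greedy subsequence check

# Example: b reduces to a by deleting the first copy of gene 3.
print(zero_exemplar_distance([1, 2, 3], [1, 3, 2, 3]))  # True
```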

2.
Parameterized complexity analysis in computational biology
Many computational problems in biology involve parameters for which a small range of values cover important applications. We argue that for many problems in this setting, parameterized computational complexity rather than NP-completeness is the appropriate tool for studying apparent intractability. At issue in the theory of parameterized complexity is whether a problem can be solved in time O(n^α) for each fixed parameter value, where α is a constant independent of the parameter. In addition to surveying this complexity framework, we describe a new result for the Longest Common Subsequence problem. In particular, we show that the problem is hard for W[t] for all t when parameterized by the number of strings and the size of the alphabet. Lower bounds on the complexity of this basic combinatorial problem imply lower bounds on more general sequence alignment and consensus discovery problems. We also describe a number of open problems pertaining to the parameterized complexity of problems in computational biology where small parameter values are important.
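As an illustration of fixed-parameter tractability (a standard textbook example, not a result of this paper): Vertex Cover parameterized by solution size k admits a bounded search tree running in O(2^k · |E|) time, i.e. polynomial for each fixed parameter value:

```python
def vertex_cover(edges, k):
    """Classic FPT bounded search tree: is there a vertex cover of size <= k?
    Runs in O(2^k * |E|) time -- polynomial for every fixed parameter k."""
    if not edges:
        return True
    if k == 0:
        return False
    u, v = edges[0]
    # Any cover must contain u or v: branch on both choices.
    return (vertex_cover([e for e in edges if u not in e], k - 1) or
            vertex_cover([e for e in edges if v not in e], k - 1))

print(vertex_cover([(1, 2), (2, 3), (3, 4)], 2))  # True: {2, 3} is a cover
```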

3.
When two strings of symbols are aligned, it is important to know whether the observed number of matches is better than that expected between two independent sequences with the same frequency of symbols. When strings are of different lengths, nulls need to be inserted in order to align the sequences. One approach is to use simple approximations based on sampling with replacement. We describe an algorithm for exactly determining the frequencies of given numbers of matches when sampling without replacement. This does not lead to a simple closed-form expression. However, we show examples where sampling with or without replacement gives very similar results, and the simple approach may be adequate for all but the smallest cases.
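To see how close the two sampling models can be, here is a hedged illustration (not the paper's exact algorithm, which computes the without-replacement frequencies exactly): the with-replacement model gives a binomial mean of n·Σᵢ pᵢqᵢ matches, which we compare against shuffling one sequence, i.e. sampling without replacement:

```python
import random
from collections import Counter

def match_counts(seq_a, seq_b, trials=20000):
    """Compare the with-replacement mean number of matches, n * sum(p_i q_i),
    against a shuffle simulation (sampling without replacement)."""
    n = min(len(seq_a), len(seq_b))
    pa, pb = Counter(seq_a), Counter(seq_b)
    p_match = sum(pa[s] / len(seq_a) * pb[s] / len(seq_b) for s in pa)
    mean_with = n * p_match                  # binomial (with-replacement) mean
    b = list(seq_b)
    total = 0
    for _ in range(trials):
        random.shuffle(b)                    # permuting = sampling without replacement
        total += sum(x == y for x, y in zip(seq_a, b))
    return mean_with, total / trials

print(match_counts("ACGTACGTAC", "AACCGGTTAA"))   # the two means are close
```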

4.
MOTIVATION: The popular BLAST algorithm is based on a local similarity search strategy, so its high-scoring segment pairs (HSPs) do not have global alignment information. When scientists use BLAST to search for a target protein or DNA sequence in a huge database like the human genome map, the existence of repeated fragments, homologues or pseudogenes in the genome often makes the BLAST result filled with redundant HSPs. Therefore, we need a computational strategy to alleviate this problem. RESULTS: In the gene discovery group of Celera Genomics, I developed a two-step method, i.e. a BLAST step plus an LIS step, to align thousands of cDNA and protein sequences to the human genome map. The LIS step is based on a mature computational algorithm, the Longest Increasing Subsequence (LIS) algorithm. The idea is to use the LIS algorithm to find the longest series of consecutive HSPs in the BLAST output. Such a BLAST+LIS strategy can be used as an independent alignment tool or as a complementary tool for other alignment programs like Sim4 and GeneWise. It can also work as a general-purpose BLAST result processor in all sorts of BLAST searches. Two examples from Celera are shown in this paper.
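The LIS step can be sketched as follows: sort HSPs by query coordinate and extract the longest chain whose genomic coordinates also increase, a standard O(k log k) patience-sorting LIS. This unweighted version is illustrative only; a real chainer would weight HSPs by alignment score and handle strands:

```python
from bisect import bisect_left

def lis_chain(hsps):
    """hsps: list of (query_start, genome_start) pairs from BLAST output.
    Returns the longest sub-list increasing in both coordinates."""
    hsps = sorted(hsps)                       # increasing query coordinate
    tails, idx_of = [], []                    # patience-sorting state
    parents = [None] * len(hsps)
    for i, (_, g) in enumerate(hsps):
        j = bisect_left(tails, g)             # position by genome coordinate
        if j == len(tails):
            tails.append(g); idx_of.append(i)
        else:
            tails[j] = g; idx_of[j] = i
        parents[i] = idx_of[j - 1] if j > 0 else None
    chain, i = [], idx_of[-1] if idx_of else None
    while i is not None:                      # trace back the chain
        chain.append(hsps[i]); i = parents[i]
    return chain[::-1]

print(lis_chain([(0, 100), (10, 500), (20, 200), (30, 300)]))
# -> [(0, 100), (20, 200), (30, 300)]
```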

5.
MOTIVATION: In the genomics setting, an increasingly common data configuration consists of a small set of sequences possessing a targeted property (positive instances) amongst a large set of sequences for which class membership is unknown (unlabeled instances). Traditional two-class classification methods do not effectively handle such data. RESULTS: Here, we develop a novel method, likely positive-iterative classification (LP-IC), for this problem and contrast its performance with the few existing methods, most of which were devised and utilized in the text classification context. LP-IC employs an iterative classification scheme and introduces a class dispersion measure, adopted from unsupervised clustering approaches, to monitor the model selection process. Using two case studies, prediction of HLA binding and alternative splicing conservation between human and mouse, we show that LP-IC provides superior performance to existing methodologies in terms of: (i) combined accuracy and precision in positive identification from the unlabeled set; and (ii) predictive performance of the resultant classifiers on independent test data.
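A generic positive/unlabeled iterative loop conveys the flavor of such methods. This sketch is not LP-IC itself: the class dispersion measure is only mimicked by a crude score-spread stopping rule, and scikit-learn plus numeric feature vectors are assumed:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def pu_iterate(X_pos, X_unl, rounds=5, top_frac=0.05):
    """Illustrative positive/unlabeled scheme: treat unlabeled points as
    tentative negatives, then repeatedly promote the highest-scoring
    unlabeled points ("likely positives") into the positive set."""
    pos, unl = X_pos.copy(), X_unl.copy()
    clf = LogisticRegression(max_iter=1000)
    for _ in range(rounds):
        X = np.vstack([pos, unl])
        y = np.r_[np.ones(len(pos)), np.zeros(len(unl))]
        clf.fit(X, y)
        scores = clf.predict_proba(unl)[:, 1]
        k = max(1, int(top_frac * len(unl)))
        top = np.argsort(scores)[-k:]            # most likely-positive points
        pos = np.vstack([pos, unl[top]])
        unl = np.delete(unl, top, axis=0)
        if scores.std() < 0.01:                  # crude dispersion-style monitor
            break
    return clf

rng = np.random.default_rng(0)
clf = pu_iterate(rng.normal(2, 1, (30, 5)), rng.normal(0, 1, (300, 5)))
```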

6.
W. Saurin, P. Marlière. Biochimie 1985, 67(5):517-521
A set of sequences can be defined by their common subsequences, and the length of these is a measure of the overall resemblance of the set. Each subsequence corresponds to a succession of symbols embedded in every sequence, following the same order but not necessarily contiguous. Determining the longest common subsequence (LCS) requires the exhaustive testing of all possible common subsequences, which sum up to about 2^L, if L is the length of the shortest sequence. We present a polynomial algorithm (O(n × L^4), where n is the number of sequences) for generating strings related to the LCS and constructed with the sequence alphabet and an indetermination symbol. Such strings are iteratively improved by deleting indetermination symbols and concomitantly introducing the greatest number of alphabet symbols. Processed accordingly, nucleic acid and protein sequences lead to key-words encompassing the salient positions of homologous chains, which can be used for aligning or classifying them, as well as for finding related sequences in data banks.
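For context, the two-sequence case (the textbook baseline, not the paper's multi-sequence heuristic) is solvable by the standard O(L²) dynamic program; it is the search over roughly 2^L subsequences for many sequences that becomes prohibitive:

```python
def lcs(a, b):
    """Standard dynamic program for the longest common subsequence of two strings."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = d[i-1][j-1] + 1 if a[i-1] == b[j-1] else max(d[i-1][j], d[i][j-1])
    out, i, j = [], m, n                     # trace back one optimal subsequence
    while i and j:
        if a[i-1] == b[j-1]:
            out.append(a[i-1]); i -= 1; j -= 1
        elif d[i-1][j] >= d[i][j-1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))

print(lcs("GATTACA", "GCATGCA"))  # a length-5 LCS such as "GATCA"
```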

7.
Given a multiple alignment of orthologous DNA sequences and a phylogenetic tree for these sequences, we investigate the problem of reconstructing a most parsimonious scenario of insertions and deletions capable of explaining the gaps observed in the alignment. This problem, called the Indel Parsimony Problem, is a crucial component of the problem of ancestral genome reconstruction, and its solution provides valuable information to many genome functional annotation approaches. We first show that the problem is NP-complete. Second, we provide an algorithm, based on the fractional relaxation of an integer linear programming formulation. The algorithm is fast in practice, and the solutions it produces are, in most cases, provably optimal. We describe a divide-and-conquer approach that makes it possible to solve very large instances on a simple desktop machine, while retaining guaranteed optimality. Our algorithms are tested and shown efficient and accurate on a set of 1.8 Mb mammalian orthologous sequences in the CFTR region.
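For intuition, a heavily simplified relative of the problem treats each alignment column as an independent presence/absence (base/gap) character and minimizes state changes on the tree with Sankoff's dynamic program; the real Indel Parsimony Problem couples neighboring columns into multi-column indels, which is exactly what makes it NP-complete. A sketch under that simplification:

```python
INF = float("inf")

def sankoff(tree, leaf_state, root):
    """Minimum number of gain/loss events explaining 0/1 (gap/base) leaf
    states for ONE alignment column, cost 1 per state change on an edge.
    tree: dict node -> list of children; leaf_state: leaf -> 0 or 1."""
    def cost(v):
        if not tree.get(v):                              # leaf node
            s = leaf_state[v]
            return [0 if s == 0 else INF, 0 if s == 1 else INF]
        c = [0.0, 0.0]                                   # c[s]: best cost if v has state s
        for ch in tree[v]:
            cc = cost(ch)
            for s in (0, 1):
                c[s] += min(cc[t] + (s != t) for t in (0, 1))
        return c
    return min(cost(root))

tree = {"r": ["x", "d"], "x": ["a", "b"]}
print(sankoff(tree, {"a": 1, "b": 1, "d": 0}, "r"))      # 1: a single insertion above x
```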

8.
In this paper, we propose new solution methods for designing tag sets for use in universal DNA arrays. First, we give integer linear programming formulations for two previous formalizations of the tag set design problem. We show that these formulations can be solved to optimality for problem instances of moderate size by using general purpose optimization packages and also give more scalable algorithms based on an approximation scheme for packing linear programs. Second, we note the benefits of periodic tags and establish an interesting connection between the tag design problem and the problem of packing the maximum number of vertex-disjoint directed cycles in a given graph. We show that combining a simple greedy cycle packing algorithm with a previously proposed alphabetic tree search strategy yields an increase of over 40% in the number of tags compared to previous methods.
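The greedy cycle-packing step can be sketched as follows (illustrative only; the paper pairs it with an alphabetic tree search over tag strings, omitted here): repeatedly extract a shortest directed cycle and delete its vertices, which yields vertex-disjoint cycles.

```python
from collections import deque

def shortest_cycle_through(graph, s):
    """BFS for a shortest directed cycle through s; returns vertex list or None."""
    parent, dist = {s: None}, {s: 0}
    q = deque([s])
    while q:
        u = q.popleft()
        for v in graph.get(u, ()):
            if v == s:                         # edge back to s closes a cycle
                cyc, w = [], u
                while w is not None:
                    cyc.append(w); w = parent[w]
                return cyc[::-1]
            if v not in dist:
                dist[v] = dist[u] + 1; parent[v] = u; q.append(v)
    return None

def greedy_cycle_packing(graph):
    """Greedily pack vertex-disjoint directed cycles, shortest first."""
    graph = {u: set(vs) for u, vs in graph.items()}
    packing = []
    while True:
        cycles = filter(None, (shortest_cycle_through(graph, s) for s in list(graph)))
        best = min(cycles, key=len, default=None)
        if best is None:
            return packing
        packing.append(best)
        for v in best:                         # delete the cycle's vertices
            graph.pop(v, None)
        for vs in graph.values():
            vs -= set(best)

g = {1: {2}, 2: {1, 3}, 3: {4}, 4: {3}}
print(greedy_cycle_packing(g))                 # [[1, 2], [3, 4]]
```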

9.
Phylogenetic diversity (PD) is a useful metric for selecting taxa in a range of biological applications, for example, bioconservation and genomics, where the selection is usually constrained by the limited availability of resources. We formalize taxon selection as a conceptually simple optimization problem, aiming to maximize PD subject to resource constraints. This allows us to take into account the different amounts of resources required by the different taxa. Although this is a computationally difficult problem, we present a dynamic programming algorithm that solves it in pseudo-polynomial time. Our algorithm can also solve many instances of the Noah's Ark Problem, a more realistic formulation of taxon selection for biodiversity conservation that allows for taxon-specific extinction risks. These instances extend the set of problems for which solutions are available beyond previously known greedy-tractable cases. Finally, we discuss the relevance of our results to real-life scenarios.
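A hedged sketch of a pseudo-polynomial tree DP in the spirit described (integer taxon costs, DP arrays indexed by budget; not necessarily the authors' exact recurrence): PD is accumulated edge by edge, counting a branch whenever at least one selected taxon lies below it.

```python
NEG = float("-inf")

def max_pd(children, cost, budget, root):
    """children: node -> list of (child, branch_length); cost: leaf -> int.
    Returns the maximum phylogenetic diversity achievable within the budget."""
    def solve(v, elen):
        # dp[c] = best PD in subtree(v) with total cost <= c, counting v's
        # parent edge `elen`, over NONEMPTY taxon selections only.
        if not children.get(v):                              # leaf taxon
            return [elen if cost[v] <= c else NEG for c in range(budget + 1)]
        some = [NEG] * (budget + 1)                          # >=1 taxon picked so far
        none = [NEG] * (budget + 1); none[0] = 0.0           # nothing picked yet
        for ch, blen in children[v]:
            chdp = solve(ch, blen)
            new_some = some[:]                               # option: skip this child
            for c in range(budget + 1):                      # knapsack-merge the child
                if chdp[c] == NEG:
                    continue
                for b in range(budget - c + 1):
                    for base in (none[b], some[b]):
                        if base != NEG and base + chdp[c] > new_some[b + c]:
                            new_some[b + c] = base + chdp[c]
            some = new_some
        return [x + elen if x != NEG else NEG for x in some]

    return max(0.0, max(solve(root, 0.0)))

children = {"r": [("a", 2.0), ("x", 1.0)], "x": [("b", 3.0), ("c", 3.0)]}
print(max_pd(children, {"a": 1, "b": 1, "c": 1}, 2, "r"))    # picks b and c: PD = 7.0
```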

10.
A variety of analytical methods is available for branch testing in distance-based phylogenies. However, these methods are rarely used, possibly because the estimation of some of their statistics, especially the covariances, is not always feasible. We show that these difficulties can be overcome if some simplifying assumptions are made, namely distance independence. The weighted least-squares likelihood ratio test (WLS-LRT) we propose is easy to perform, using only the distances and some of their associated variances. If no variances are known, the use of the Felsenstein F-test, also based on weighted least squares, is discussed. Using simulated data and a data set of 43 mammalian mitochondrial sequences, we demonstrate that the WLS-LRT performs as well as the generalized least-squares test, and indeed better for data sets with a large number of taxa. We thus show that the assumption of independence does not negatively affect the reliability or the accuracy of the least-squares approach. The results of the WLS-LRT are no worse than the results of bootstrap methods such as the Felsenstein bootstrap selection probability test and the Dopazo test. We also show that the WLS-LRT can be applied in instances where other analytical methods are inappropriate. This point is illustrated by analyzing the relationships between human immunodeficiency virus type 1 (HIV-1) sequences isolated from various organs of different individuals.
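The core computation can be illustrated generically (schematic, not the authors' exact test): weighted least squares fits branch lengths to observed pairwise distances with weights 1/variance, and the drop in weighted residual sum of squares between nested models serves as the likelihood ratio statistic.

```python
import numpy as np

def wls_fit(A, d, var):
    """Weighted least squares: minimize sum_i (d_i - (A @ beta)_i)^2 / var_i.
    A is a (pairs x branches) path-indicator matrix, d the observed distances."""
    W = np.diag(1.0 / var)
    beta = np.linalg.solve(A.T @ W @ A, A.T @ W @ d)
    resid = d - A @ beta
    return beta, float(resid @ W @ resid)

# Toy 3-taxon star tree: d_ab = b_a + b_b, d_ac = b_a + b_c, d_bc = b_b + b_c.
A = np.array([[1, 1, 0], [1, 0, 1], [0, 1, 1]], float)
d = np.array([0.30, 0.45, 0.55])
var = np.full(3, 0.01)

beta1, rss1 = wls_fit(A, d, var)             # full model
beta0, rss0 = wls_fit(A[:, 1:], d, var)      # constrained: first branch forced to 0
# With known variances and independent distances, RSS0 - RSS1 is asymptotically
# chi-square distributed (df = number of constraints): a likelihood-ratio branch test.
print(beta1, rss0 - rss1)
```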

11.
T. K. Rasmussen, T. Krink. BioSystems 2003, 72(1-2):5-17
Multiple sequence alignment (MSA) is one of the basic problems in computational biology. Realistic problem instances of MSA are computationally intractable for exact algorithms. One way to tackle MSA is to use Hidden Markov Models (HMMs), which are known to be very powerful in the related problem domain of speech recognition. However, the training of HMMs is computationally hard, and there is no known exact method that can guarantee optimal training within reasonable computing time. Perhaps the most powerful training method is the Baum-Welch algorithm, which is fast but prone to stagnation at local optima. In the study reported in this paper, we used a hybrid algorithm combining particle swarm optimization with evolutionary algorithms to train HMMs for the alignment of protein sequences. Our experiments show that our approach yields better alignments for a set of benchmark protein sequences than the most commonly applied HMM training methods, such as Baum-Welch and simulated annealing.
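For reference, a plain global-best particle swarm optimizer looks as follows. This is the generic minimizer only: the paper's hybrid adds evolutionary operators, and the encoding of HMM transition/emission parameters into the position vector is left out here.

```python
import numpy as np

def pso(f, dim, n=30, iters=200, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Plain global-best particle swarm optimization (minimization)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-5, 5, (n, dim))          # particle positions
    v = np.zeros((n, dim))                    # velocities
    pbest = x.copy()                          # per-particle best positions
    pval = np.apply_along_axis(f, 1, x)
    g = pbest[pval.argmin()].copy()           # global best position
    for _ in range(iters):
        r1, r2 = rng.random((n, dim)), rng.random((n, dim))
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        x = x + v
        val = np.apply_along_axis(f, 1, x)
        better = val < pval
        pbest[better], pval[better] = x[better], val[better]
        g = pbest[pval.argmin()].copy()
    return g, pval.min()

# For HMM training, x would encode (suitably normalized) transition and
# emission probabilities, and f the negative log-likelihood of the sequences.
print(pso(lambda p: ((p - 1.0) ** 2).sum(), dim=4))
```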

12.
Computational protein design can be used to select sequences that are compatible with a fixed-backbone template. This strategy has been used in numerous instances to engineer novel proteins. However, the fixed-backbone assumption severely restricts the sequence space that is accessible via design. For challenging problems, such as the design of functional proteins, this may not be acceptable. Here, we present a method for introducing backbone flexibility into protein design calculations and apply it to the design of diverse helical BH3 ligands that bind to the anti-apoptotic protein Bcl-xL, a member of the Bcl-2 protein family. We demonstrate how normal mode analysis can be used to sample different BH3 backbones, and show that this leads to a larger and more diverse set of low-energy solutions than can be achieved using a native high-resolution Bcl-xL complex crystal structure as a template. We tested several of the designed solutions experimentally and found that this approach worked well when normal mode calculations were used to deform a native BH3 helix structure, but less well when they were used to deform an idealized helix. A subsequent round of design and testing identified a likely source of the problem as inadequate sampling of the helix pitch. In all, we tested 17 designed BH3 peptide sequences, including several point mutants. Of these, eight bound well to Bcl-xL and four others showed weak but detectable binding. The successful designs showed a diversity of sequences that would have been difficult or impossible to achieve using only a fixed backbone. Thus, introducing backbone flexibility via normal mode analysis effectively broadened the set of sequences identified by computational design, and provided insight into positions important for binding Bcl-xL.
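One common way to realize "normal mode analysis on a backbone" is an elastic-network (anisotropic network model) calculation on C-alpha coordinates. The sketch below is a generic ANM, not the paper's protocol, and the cutoff and spring constant are illustrative values:

```python
import numpy as np

def anm_modes(coords, cutoff=10.0, k=1.0):
    """Anisotropic-network-model Hessian from C-alpha coordinates (n x 3).
    Returns (eigenvalues, eigenvectors); for a connected network the six
    near-zero modes are rigid-body translations/rotations."""
    n = len(coords)
    H = np.zeros((3 * n, 3 * n))
    for i in range(n):
        for j in range(i + 1, n):
            d = coords[j] - coords[i]
            r2 = d @ d
            if r2 > cutoff ** 2:
                continue
            block = -k * np.outer(d, d) / r2       # off-diagonal 3x3 block
            H[3*i:3*i+3, 3*j:3*j+3] = block
            H[3*j:3*j+3, 3*i:3*i+3] = block
            H[3*i:3*i+3, 3*i:3*i+3] -= block       # diagonal blocks balance rows
            H[3*j:3*j+3, 3*j:3*j+3] -= block
    return np.linalg.eigh(H)

def deform(coords, mode, amplitude):
    """Displace the backbone along one normal mode."""
    return coords + amplitude * mode.reshape(-1, 3)

coords = np.random.default_rng(1).normal(scale=4.0, size=(20, 3))
vals, vecs = anm_modes(coords)
new_backbone = deform(coords, vecs[:, 6], 1.5)     # lowest non-rigid mode
```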

13.
In the literature, various discrete-time and continuous-time mixed-integer linear programming (MIP) formulations for project scheduling problems have been proposed. The performance of these formulations has been analyzed based on generic test instances. The objective of this study is to analyze the performance of discrete-time and continuous-time MIP formulations for a real-life application of project scheduling in human resource management. We consider the problem of scheduling assessment centers. In an assessment center, candidates for job positions perform different tasks while being observed and evaluated by assessors. Because these assessors are highly qualified and expensive personnel, the duration of the assessment center should be minimized. Complex rules for assigning assessors to candidates distinguish this problem from other scheduling problems discussed in the literature. We develop two discrete-time and three continuous-time MIP formulations, and we present problem-specific lower bounds. In a comparative study, we analyze the performance of the five MIP formulations on four real-life instances and a set of 240 instances derived from real-life data. The results indicate that good or optimal solutions are obtained for all instances within short computational time. In particular, one of the real-life instances is solved to optimality. Surprisingly, the continuous-time formulations outperform the discrete-time formulations in terms of solution quality.
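A schematic discrete-time formulation shows the general shape of such models. This is a toy instance with hypothetical tasks and a single-resource constraint (the paper's assessor-assignment rules are omitted), written with the PuLP modeling library:

```python
from pulp import LpProblem, LpMinimize, LpVariable, lpSum, LpBinary

durations = {"task1": 2, "task2": 3, "task3": 2}       # hypothetical tasks
horizon = range(8)                                     # discrete time grid

prob = LpProblem("assessment_center_toy", LpMinimize)
x = {(j, t): LpVariable(f"x_{j}_{t}", cat=LpBinary)    # x[j,t] = 1: task j starts at t
     for j in durations for t in horizon}
cmax = LpVariable("makespan", lowBound=0)

prob += cmax                                           # objective: minimize makespan
for j, d in durations.items():
    prob += lpSum(x[j, t] for t in horizon) == 1       # every task starts exactly once
    for t in horizon:
        prob += cmax >= (t + d) * x[j, t]              # makespan covers each finish time

# Toy resource constraint: a single assessor, so at most one task runs at a time.
for t in horizon:
    prob += lpSum(x[j, s] for j, d in durations.items()
                  for s in horizon if s <= t < s + d) <= 1

prob.solve()
print(cmax.value(), {j: next(t for t in horizon if x[j, t].value() > 0.5)
                     for j in durations})              # optimal makespan is 7 here
```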

14.
Nuclear envelope-limited chromatin sheets are part of mitotic death
Nuclear envelope-limited chromatin sheets (ELCS) are enigmatic membranous structures of uncertain function. This study describes the induction of ELCS in p53-mutated Burkitt's lymphoma cell lines after treatment with irradiation or the microtubule inhibitor SK&F 96365. Both treatments evoked similar mitotic death, involving metaphase arrest followed by extensive endopolyploidisation and delayed apoptosis, although the kinetics were different. We found that induction of ELCS and nuclear segmentation correlated with the amount and kinetics of M-phase arrest, mitosis restitution and delayed apoptosis of endopolyploid cells. In metaphases undergoing restitution, ELCS are seen participating in the restoration of the nuclear envelope, mediating the attachment of peripheral chromatin to it. In interphase cells, ELCS join nuclear segments, ectopically linking and fusing with heterochromatin regions. In cells with segmented nuclei, continued DNA replication was observed, along with activation and redistribution of Ku70, suggestive of non-homologous DNA end-joining. Induction of ELCS also parallels the induction of cytoplasmic stacked membrane structures, such as confronting cisternae and annulate lamellae, which participate in the turnover and degeneration of ELCS. The results suggest that arrest at a spindle checkpoint and the uncoupling of mitosis from DNA replication lead to the emergence of ELCS in the resulting endopolyploid cells.

15.
Sequence comparison with concave weighting functions
We consider efficient methods for computing a difference metric between two sequences of symbols, where the cost of an operation to insert or delete a block of symbols is a concave function of the block's length. Alternatively, sequences can be optimally aligned when gap penalties are a concave function of the gap length. Two algorithms based on the 'candidate list paradigm' first used by Waterman (1984) are presented. The first computes significantly more parsimonious candidate lists than Waterman's method. The second method refines the first to the point of guaranteeing O(N² lg N) worst-case time complexity, and under certain conditions O(N²). Experimental data show how various properties of the comparison problem affect the methods' relative performance. A number of extensions are discussed, among them a technique for constructing optimal alignments in O(N) space in expectation. This variation gives a practical method for comparing long amino acid sequences on a small computer. This work was supported in part by NSF Grant DCR-8511455.
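The baseline these methods improve on is Waterman's general gap-penalty recurrence, cubic in sequence length, in which a gap of length k costs w(k); the candidate-list algorithms bring concave w down to O(N² lg N). A direct sketch of the cubic DP (the scoring scheme is illustrative):

```python
def align_general_gaps(a, b, match=1.0, mismatch=-1.0, w=lambda k: 2.0 + k ** 0.5):
    """Cubic-time alignment DP with an arbitrary gap penalty w(k) for a gap of
    length k (here a concave example).  Returns the optimal alignment score."""
    m, n = len(a), len(b)
    D = [[float("-inf")] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 0.0
    for i in range(m + 1):
        for j in range(n + 1):
            if i and j:                               # substitution step
                s = match if a[i-1] == b[j-1] else mismatch
                D[i][j] = max(D[i][j], D[i-1][j-1] + s)
            for k in range(1, i + 1):                 # gap in b of length k
                D[i][j] = max(D[i][j], D[i-k][j] - w(k))
            for k in range(1, j + 1):                 # gap in a of length k
                D[i][j] = max(D[i][j], D[i][j-k] - w(k))
    return D[m][n]

print(align_general_gaps("ACGTT", "ACGT"))            # 4 matches - w(1) = 1.0
```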

16.
MOTIVATION: Algorithms for phylogenetic tree reconstruction based on gene order data typically solve repeated instances of the reversal median problem (RMP), which is to find, for three given gene orders, a fourth gene order (called a median) with a minimal sum of reversal distances. All existing algorithms of this type consider only one median for each RMP instance, even when a large number of medians exist. A careful selection of one of the medians might lead to better phylogenetic trees. RESULTS: We propose a heuristic algorithm, amGRP, for solving the multiple genome rearrangement problem (MGRP) by repeatedly solving instances of the RMP taking all medians into account. Algorithm amGRP uses a branch-and-bound method that branches over medians from a selected subset of all medians for each RMP instance. Different heuristics for selecting the subsets have been investigated. To show that the medians for an RMP instance vary strongly with respect to different properties that are likely to be relevant for phylogenetic tree reconstruction, the set of all medians has been investigated for artificial datasets and mitochondrial DNA (mtDNA) gene orders. Phylogenetic trees have been computed for a large set of randomly generated gene orders and two sets of mtDNA gene order data for different animal taxa with amGRP and with two standard approaches for solving the MGRP (GRAPPA-DCM and MGR). The results show that amGRP outperforms both other methods with respect to solution quality and computation time on the test data. AVAILABILITY: The source code of amGRP, additional results and the test instances used in this paper are freely available from the authors.

17.
Bioinspired algorithms, such as evolutionary algorithms and ant colony optimization, are widely used for different combinatorial optimization problems. These algorithms rely heavily on the use of randomness and are hard to understand from a theoretical point of view. This paper contributes to the theoretical analysis of ant colony optimization and studies this type of algorithm on one of the most prominent combinatorial optimization problems, namely the traveling salesperson problem (TSP). We present a new construction graph and show that it has a stronger local property than one commonly used for constructing solutions of the TSP. The rigorous runtime analysis for two ant colony optimization algorithms, based on these two construction procedures, shows that they lead to good approximation in expected polynomial time on random instances. Furthermore, we point out in which situations our algorithms get trapped in local optima and show where the use of the right amount of heuristic information is provably beneficial.
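The standard ACO construction step for the TSP picks each next city with probability proportional to τ^α · η^β, where τ is the pheromone value and η = 1/distance is the heuristic information whose weighting the paper analyzes. A minimal sketch of that step (generic; not the paper's alternative construction graph):

```python
import random

def construct_tour(dist, tau, alpha=1.0, beta=2.0, rng=random):
    """One ant's tour: choose each next city with prob ~ tau^alpha * (1/d)^beta."""
    n = len(dist)
    tour = [0]
    unvisited = set(range(1, n))
    while unvisited:
        i = tour[-1]
        cities = list(unvisited)
        weights = [tau[i][j] ** alpha * (1.0 / dist[i][j]) ** beta for j in cities]
        tour.append(rng.choices(cities, weights=weights)[0])   # roulette-wheel step
        unvisited.remove(tour[-1])
    return tour

# Tiny symmetric instance with uniform initial pheromone.
dist = [[0, 2, 9, 10], [2, 0, 6, 4], [9, 6, 0, 3], [10, 4, 3, 0]]
tau = [[1.0] * 4 for _ in range(4)]
print(construct_tour(dist, tau, rng=random.Random(0)))
```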

18.
We describe a set of computational problems motivated by certain analysis tasks in genome resequencing. These are assembly problems for which multiple distinct sequences must be assembled, but where the relative positions of reads to be assembled are already known. This information is obtained from a common reference genome and is characteristic of resequencing experiments. The simplest variant of the problem aims at determining a minimum set of superstrings such that each sequenced read matches at least one superstring. We give an algorithm with time complexity O(N), where N is the sum of the lengths of reads, substantially improving on previous algorithms for solving the same problem. We also examine the problem of finding the smallest number of reads to remove such that the remaining reads are consistent with k superstrings. By exploiting a surprising relationship with the minimum cost flow problem, we show that this problem can be solved in polynomial time when nested reads are excluded. If nested reads are permitted, this problem of removing the minimum number of reads becomes NP-hard. We show that permitting mismatches between reads and their nearest superstrings generally renders these problems NP-hard.
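For the simplest variant, when all reads are mutually consistent, a left-to-right sweep over the known reference positions merges overlapping reads and yields the minimum set of superstrings. A sketch under that consistency assumption (mismatch handling, which makes the problem NP-hard, is out of scope):

```python
def min_superstrings(reads):
    """reads: list of (start, sequence) pairs in reference coordinates.
    Merges overlapping, consistent reads into the fewest superstrings."""
    out = []
    for start, seq in sorted(reads):
        if out and start <= out[-1][0] + len(out[-1][1]):    # touches current contig
            cur_start, cur = out[-1]
            offset = start - cur_start
            overlap = min(len(cur) - offset, len(seq))
            assert cur[offset:offset + overlap] == seq[:overlap], "inconsistent reads"
            out[-1] = (cur_start, cur + seq[overlap:])       # extend the contig
        else:
            out.append((start, seq))                         # start a new superstring
    return out

print(min_superstrings([(0, "ACGTA"), (3, "TACGG"), (12, "GGTT")]))
# -> [(0, 'ACGTACGG'), (12, 'GGTT')]
```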

19.
Pattern discovery in unaligned DNA sequences is a challenging problem in both computer science and molecular biology. Several different methods and techniques have been proposed so far, but in most cases signals in DNA sequences are very complicated and avoid detection. Exact exhaustive methods can solve the problem only for short signals with a limited number of mutations. In this work, we extend exhaustive enumeration also to longer patterns. In more detail, the basic version of the algorithm presented in this paper, given as input a set of sequences and an error ratio ε < 1, finds all patterns that occur in at least q sequences of the set with at most εm mutations, where m is the length of the pattern. The only restriction is imposed on the location of mutations along the signal. That is, a valid occurrence of a pattern can present at most ⌊εi⌋ mismatches in the first i nucleotides, and so on. However, we show how the algorithm can be used also when no assumption can be made on the position of mutations. In this case, it is also possible to have an estimate of the probability of finding a signal according to the signal length, the error ratio, and the input parameters. Finally, we discuss some significance measures that can be used to sort the patterns output by the algorithm.
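A hedged sketch of such an enumeration (my own illustration of the prefix condition, not the paper's implementation): patterns are grown one symbol at a time, each sequence keeps its surviving (start, mismatch-count) occurrences, and branches are pruned when fewer than q sequences retain an occurrence.

```python
from math import floor

def enumerate_patterns(seqs, m, eps, q, alphabet="ACGT"):
    """All length-m patterns occurring in at least q sequences, where a valid
    occurrence has at most floor(eps * i) mismatches in its first i symbols."""
    results = []

    def extend(pattern, occ):
        i = len(pattern)
        if i == m:
            results.append(pattern)
            return
        for ch in alphabet:
            new_occ = []
            for s, pairs in enumerate(occ):
                kept = []
                for start, mm in pairs:
                    if start + i >= len(seqs[s]):        # occurrence runs off the end
                        continue
                    mm2 = mm + (seqs[s][start + i] != ch)
                    if mm2 <= floor(eps * (i + 1)):      # prefix mismatch condition
                        kept.append((start, mm2))
                new_occ.append(kept)
            if sum(1 for k in new_occ if k) >= q:        # quorum pruning
                extend(pattern + ch, new_occ)

    extend("", [[(p, 0) for p in range(len(s))] for s in seqs])
    return results

print(enumerate_patterns(["ACGTACGT", "ACCTACGA", "TTACGTTT"], 5, 0.2, 2))
```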

20.
A phylogenetic network is a generalization of a phylogenetic tree, allowing structural properties that are not tree-like. In a seminal paper, Wang et al. [1] studied the problem of constructing a phylogenetic network, allowing recombination between sequences, with the constraint that the resulting cycles must be disjoint. We call such a phylogenetic network a "galled-tree". They gave a polynomial-time algorithm that was intended to determine whether or not a set of sequences could be generated on a galled-tree. Unfortunately, the algorithm by Wang et al. [1] is incomplete and does not constitute a necessary test for the existence of a galled-tree for the data. In this paper, we completely solve the problem. Moreover, we prove that if there is a galled-tree, then the one produced by our algorithm minimizes the number of recombinations over all phylogenetic networks for the data, even allowing multiple-crossover recombinations. We also prove that when there is a galled-tree for the data, the galled-tree minimizing the number of recombinations is "essentially unique". We also note two additional results: first, any set of sequences that can be derived on a galled-tree can be derived on a true tree (without recombination cycles) where at most one back mutation per site is allowed; second, the site compatibility problem (which is NP-hard in general) can be solved in polynomial time for any set of sequences that can be derived on a galled-tree. Perhaps more important than the specific results about galled-trees, we introduce an approach that can be used to study recombination in general phylogenetic networks. This paper greatly extends the conference version that appears in earlier work [8]; PowerPoint slides of the conference talk can be found on our website [7].
