1.

Silverbush D Elberfeld M Sharan R 《Journal of computational biology》2011,18(11):1437-1448

2.

ABC-X: a generalized, automatically configurable artificial bee colony framework

Doğan Aydın Gürcan Yavuz Thomas Stützle 《Swarm Intelligence》2017,11(1):1-38

The artificial bee colony (ABC) algorithm is a popular metaheuristic that was originally conceived for tackling continuous function optimization tasks. Over the last decade, a large number of variants of ABC have been proposed, making it by now a well-studied swarm intelligence algorithm. Typically, in a paper on algorithmic variants of ABC algorithms, one or at most two of its algorithmic components are modified. Possible changes include variations on the search equations, the selection of candidate solutions to be explored, or the adoption of features from other algorithmic techniques. In this article, we propose to follow a different direction and to build a generalized ABC algorithm, which we call ABC-X. ABC-X collects algorithmic components available from known ABC algorithms into a common algorithm framework that allows instantiating not only known ABC variants but, more importantly, also many ABC algorithm variants that have never been explored before in the literature. Automatic algorithm configuration techniques can generate from this template new ABC variants that perform better than known ABC algorithms, even when their numerical parameters are fine-tuned using the same automatic configuration process.

3.

Probits of mixtures (Total citations: 2; self-citations: 0; citations by others: 2)

T Lwin P J Martin 《Biometrics》1989,45(3):721-732

The tolerances of individuals (insects, parasites) in a population have a frequency or probability distribution called a tolerance distribution. Many tolerance distributions in bioassay studies can be the result of a rather heterogeneous population of individuals and can often be modelled as a mixture of a number of standard unimodal distributions. A probit analysis can be generalized to the case where the tolerance distribution is a mixture of location and scale parameter distributions. In this article, the existence and determination of the maximum likelihood estimates are investigated. An expectation-maximization (EM) algorithm for probits of mixtures is developed and it is shown that by application of the EM algorithm, the problem of probits of mixtures can be separated into a series of probits of individual component tolerance distributions.

4.

The use of the expectation-maximization (EM) algorithm for maximum likelihood estimation of gametic frequencies of multilocus polymorphic codominant systems based on sampled population data

Sergeev AS Arapova RK 《Genetika》2002,38(3):407-418

Estimation of gametic frequencies in multilocus polymorphic systems based on the numerical distribution of multilocus genotypes in a population sample ("analysis without pedigrees") is difficult because some gametes are not recognized in the data obtained. Even in the case of codominant systems, where all alleles can be recognized by genotypes, so that direct estimation of the frequencies of genes (alleles) is possible ("complete data"), estimation of the frequencies of multilocus gametes based on the data on multilocus genotypes is sometimes impossible, whether population data or even family data are used for studying genotypic segregation or analysis of linkage ("incomplete data"). Such "incomplete data" are analyzed based on the corresponding genetic models using the expectation-maximization (EM) algorithm. In this study, the EM algorithm based on the random-marriage model for a nonsubdivided population was used to estimate gametic frequencies. The EM algorithm used in the study does not set any limitations on the number of loci and the number of alleles of each locus. Loci and alleles are identified by numbering, which makes it possible to arrange loops. In each combination of alleles for a given combination of m out of L loci (L is the total number of loci studied), all alleles are assigned value 1, and the remaining alleles are assigned value 0. The sum of zeros and ones for each gamete is its gametic value (h), and the sum of the gametic values of the gametes that form a given genotype is the genotypic value (g) of this genotype. Then, gametes with the same h are united into a single class, which reduces the number of the estimated parameters. In a general case of m loci, this procedure yields m + 1 classes of gametes and 2m + 1 classes of genotypes with genotypic values g = 0, 1, 2, ..., 2m.
The unknown frequencies of the m + 1 classes of gametes can be represented as functions of the gametic frequencies whose maximum likelihood estimations (MLEs) have been obtained in all previous EM procedures and the only unknown frequency (Pm(m)) that is to be estimated in the given EM procedure. At the expectation step, the expected frequencies (Fm(g)) of the genotypes with genotypic values g are expressed in terms of the products of the frequencies of m + 1 classes of gametes. The data on genotypes are the numbers (ng) of individuals with genotypic values g = 0, 1, 2, 3, ..., 2m. The maximization step is the maximization of the logarithm of the likelihood function (LLF) for ng values. Thus, the EM algorithm is reduced, in each case, to solution of only one equation with one unknown parameter with the use of the ng values, i.e., the numbers of individuals after the corresponding regrouping of the data on the individuals' genotypes. Treatment of the data obtained by Kurbatova on the MNSs and Rhesus systems with alleles C, Cw, c, D, d, E, e with the use of Weir's EM algorithm and the EM algorithm suggested in this study yielded similar results. However, the MLEs of the parameters obtained with the use of either algorithm often converged to a wrong solution: the sum of the frequencies of all gametes (4 and 12 gametes for MNSs and Rhesus, respectively) was not equal to 1.0 even if the global maximum of LLF was reached for each of them (as it was for MNSs with the use of Weir's EM algorithm), with each parameter falling within admissible limits (e.g., [0, min(PN,Ps)] for PNs). The chi-square function is suggested to be used as a goodness-of-fit function for the distribution of genotypes in a sample in order to select acceptable solutions.
However, the minimum of this function only guarantees the acceptability of solutions if all limitations on the parameters are met: the sum of estimations of gametic frequencies is 1.0, each frequency falls within the admissible limits, and the "gametic algebra" is complied with (none of the frequencies is negative).
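
The gametic and genotypic values described in this abstract can be illustrated with a minimal sketch. It assumes a reading in which each gamete scores 1 for every selected locus at which it carries the chosen allele; the allele labels, the `target` mapping, and the function names are hypothetical and are not taken from the paper.

```python
from collections import Counter
from itertools import product


def gametic_value(gamete, target):
    """h: number of selected loci at which the gamete carries the chosen allele."""
    return sum(1 for locus, allele in target.items() if gamete[locus] == allele)


def genotypic_value(gamete_a, gamete_b, target):
    """g: sum of the gametic values of the two gametes forming a genotype."""
    return gametic_value(gamete_a, target) + gametic_value(gamete_b, target)


# Hypothetical example: L = 3 loci, m = 2 of them selected (loci 0 and 2).
target = {0: "C", 2: "E"}
gametes = list(product("Cc", "Dd", "Ee"))  # all 8 three-locus gametes

# Grouping by h gives m + 1 = 3 gamete classes (h = 0, 1, 2).
gamete_classes = Counter(gametic_value(g, target) for g in gametes)

# Grouping by g gives 2m + 1 = 5 genotype classes (g = 0, ..., 4).
genotype_classes = Counter(
    genotypic_value(a, b, target) for a in gametes for b in gametes
)

print(sorted(gamete_classes), sorted(genotype_classes))
```

Because g only counts chosen alleles across the selected loci, it can be read off unphased genotype data, which is what makes the regrouping of individuals by g possible.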

5.

Algorithms for phylogenetic footprinting. (Total citations: 9; self-citations: 0; citations by others: 9)

Mathieu Blanchette Benno Schwikowski Martin Tompa 《Journal of computational biology》2002,9(2):211-223

Phylogenetic footprinting is a technique that identifies regulatory elements by finding unusually well conserved regions in a set of orthologous noncoding DNA sequences from multiple species. We introduce a new motif-finding problem, the Substring Parsimony Problem, which is a formalization of the ideas behind phylogenetic footprinting, and we present an exact dynamic programming algorithm to solve it. We then present a number of algorithmic optimizations that allow our program to run quickly on most biologically interesting datasets. We show how to handle data sets in which only an unknown subset of the sequences contains the regulatory element. Finally, we describe how to empirically assess the statistical significance of the motifs found. Each technique is implemented and successfully identifies a number of known binding sites, as well as several highly conserved but uncharacterized regions. The program is available at http://bio.cs.washington.edu/software.html.

6.

dConsensus: a tool for displaying domain assignments by multiple structure-based algorithms and for construction of a consensus assignment

Kieran Alden Stella Veretnik Philip E Bourne 《BMC bioinformatics》2010,11(1):310

Background

Partitioning of a protein into structural components, known as domains, is an important initial step in protein classification and for functional and evolutionary studies. While the systematic assignments of domains by human experts exist (CATH and SCOP), the introduction of high throughput technologies for structure determination threatens to overwhelm expert approaches. A variety of algorithmic methods have been developed to expedite this process, allowing almost instant structural decomposition into domains. The performance of algorithmic methods can approach 85% agreement on the number of domains with the consensus reached by experts. However, each algorithm takes a somewhat different conceptual approach, each with unique strengths and weaknesses. Currently there is no simple way to automatically compare assignments from different structure-based domain assignment methods, thereby providing a comprehensive understanding of possible structure partitioning as well as providing some insight into the tendencies of particular algorithms. Most importantly, a consensus assignment drawn from multiple assignment methods can provide a singular and presumably more accurate view.

7.

Haplotyping as perfect phylogeny: a direct approach. (Total citations: 4; self-citations: 0; citations by others: 4)

Vineet Bafna Dan Gusfield Giuseppe Lancia Shibu Yooseph 《Journal of computational biology》2003,10(3-4):323-340

A full haplotype map of the human genome will prove extremely valuable as it will be used in large-scale screens of populations to associate specific haplotypes with specific complex genetic-influenced diseases. A haplotype map project has been announced by NIH. The biological key to that project is the surprising fact that some human genomic DNA can be partitioned into long blocks where genetic recombination has been rare, leading to strikingly fewer distinct haplotypes in the population than previously expected (Helmuth, 2001; Daly et al., 2001; Stephens et al., 2001; Friss et al., 2001). In this paper we explore the algorithmic implications of the no-recombination in long blocks observation, for the problem of inferring haplotypes in populations. This assumption, together with the standard population-genetic assumption of infinite sites, motivates a model of haplotype evolution where the haplotypes in a population are assumed to evolve along a coalescent, which as a rooted tree is a perfect phylogeny. We consider the following algorithmic problem, called the perfect phylogeny haplotyping problem (PPH), which was introduced by Gusfield (2002) - given n genotypes of length m each, does there exist a set of at most 2n haplotypes such that each genotype is generated by a pair of haplotypes from this set, and such that this set can be derived on a perfect phylogeny? The approach taken by Gusfield (2002) to solve this problem reduces it to established, deep results and algorithms from matroid and graph theory. Although that reduction is quite simple and the resulting algorithm nearly optimal in speed, taken as a whole that approach is quite involved, and in particular, challenging to program. Moreover, anyone wishing to fully establish, by reading existing literature, the correctness of the entire algorithm would need to read several deep and difficult papers in graph and matroid theory. 
However, as stated by Gusfield (2002), many simplifications are possible and the list of "future work" in Gusfield (2002) began with the task of developing a simpler, more direct, yet still efficient algorithm. This paper accomplishes that goal, for both the rooted and unrooted PPH problems. It establishes a simple, easy-to-program, O(nm^2)-time algorithm that determines whether there is a PPH solution for input genotypes and produces a linear-space data structure to represent all of the solutions. The approach allows complete, self-contained proofs. In addition to algorithmic simplicity, the approach here makes the representation of all solutions more intuitive than in Gusfield (2002), and solves another goal from that paper, namely, to prove a nontrivial upper bound on the number of PPH solutions, showing that that number is vastly smaller than the number of haplotype solutions (each solution being a set of n pairs of haplotypes that can generate the genotypes) when the perfect phylogeny requirement is not imposed.
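
The two conditions in the PPH problem statement can be sketched as simple checks. This is not the paper's O(nm^2) algorithm: it only verifies a proposed solution. It assumes the common 0/1/2 genotype encoding (0 and 1 for the two homozygous states, 2 for a heterozygous site) and an all-zero root, and it uses the standard pairwise-column test for a rooted perfect phylogeny on a binary matrix. All function names are hypothetical.

```python
from itertools import combinations


def explains(h1, h2, genotype):
    """True if the haplotype pair (h1, h2) generates the genotype."""
    for a, b, g in zip(h1, h2, genotype):
        if g == 2:
            if a == b:  # heterozygous site needs two different alleles
                return False
        elif not (a == b == g):  # homozygous site needs both alleles equal to g
            return False
    return True


def has_rooted_perfect_phylogeny(haplotypes):
    """Binary haplotypes fit a perfect phylogeny rooted at all zeros iff
    no pair of columns contains all three of the rows 01, 10, and 11."""
    n_sites = len(haplotypes[0])
    for i, j in combinations(range(n_sites), 2):
        seen = {(h[i], h[j]) for h in haplotypes}
        if {(0, 1), (1, 0), (1, 1)} <= seen:
            return False
    return True


# Hypothetical usage: one genotype resolved by a pair of haplotypes.
pair = [(0, 1, 0), (1, 1, 0)]
print(explains(pair[0], pair[1], (2, 1, 0)), has_rooted_perfect_phylogeny(pair))
```

A full solver has to search over the possible resolutions of every heterozygous site, which is where the efficiency of the published algorithm matters.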

8.

High diversity of alpha-globin haplotypes in a Senegalese population, including many previously unreported variants. (Total citations: 2; self-citations: 2; citations by others: 0)

J J Martinson L Excoffier C Swinburn A J Boyce R M Harding A Langaney J B Clegg 《American journal of human genetics》1995,57(5):1186-1198

RFLP haplotypes at the alpha-globin gene complex have been examined in 190 individuals from the Niokolo Mandenka population of Senegal: haplotypes were assigned unambiguously for 210 chromosomes. The Mandenka share with other African populations a sample size-independent haplotype diversity that is much greater than that in any non-African population: the number of haplotypes observed in the Mandenka is typically twice that seen in the non-African populations sampled to date. Of these haplotypes, 17.3% had not been observed in any previous surveys, and a further 19.1% have previously been reported only in African populations. The haplotype distribution shows clear differences between African and non-African peoples, but this is on the basis of population-specific haplotypes combined with haplotypes common to all. The relationship of the newly reported haplotypes to those previously recorded suggests that several mutation processes, particularly recombination as homologous exchange or gene conversion, have been involved in their production. A computer program based on the expectation-maximization (EM) algorithm was used to obtain maximum-likelihood estimates of haplotype frequencies for the entire data set: good concordance between the unambiguous and EM-derived sets was seen for the overall haplotype frequencies. Some of the low-frequency haplotypes reported by the estimation algorithm differ greatly, in structure, from those haplotypes known to be present in human populations, and they may not represent haplotypes actually present in the sample.
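
To illustrate how EM combines unambiguous and ambiguous haplotype assignments, here is a minimal sketch for the textbook case of two biallelic loci under random mating, where only double heterozygotes are ambiguous (AB/ab or Ab/aB). It is a simplification and not the program used in the study, which handled multi-site RFLP haplotypes; the function name, arguments, and counts are hypothetical.

```python
def em_two_locus(n_AB, n_Ab, n_aB, n_ab, n_dh, iters=200):
    """Estimate haplotype frequencies for two biallelic loci.

    n_AB, n_Ab, n_aB, n_ab: haplotype counts that were assigned unambiguously.
    n_dh: number of double heterozygotes, each carrying either AB + ab or Ab + aB.
    """
    total = n_AB + n_Ab + n_aB + n_ab + 2 * n_dh  # total number of chromosomes
    p = {"AB": 0.25, "Ab": 0.25, "aB": 0.25, "ab": 0.25}
    for _ in range(iters):
        # E-step: posterior probability that a double heterozygote is AB/ab.
        cis = p["AB"] * p["ab"]
        trans = p["Ab"] * p["aB"]
        w = cis / (cis + trans) if cis + trans > 0 else 0.5
        # M-step: re-estimate frequencies from the expected haplotype counts.
        p = {
            "AB": (n_AB + w * n_dh) / total,
            "ab": (n_ab + w * n_dh) / total,
            "Ab": (n_Ab + (1 - w) * n_dh) / total,
            "aB": (n_aB + (1 - w) * n_dh) / total,
        }
    return p


# Hypothetical counts: 90 unambiguous chromosomes plus 10 double heterozygotes.
print(em_two_locus(40, 5, 5, 40, 10))
```

With many sites, rare haplotypes can receive small nonzero estimates purely from ambiguous genotypes, which is consistent with the caution in the abstract about low-frequency haplotypes reported by the algorithm.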

9.

Ontogenetic changes in the pattern of androgen accumulation in song-control nuclei of male zebra finches (Total citations: 3; self-citations: 0; citations by others: 3)

S W Bottjer 《Journal of neurobiology》1987,18(2):125-139

The present study examines the development of androgen accumulation in cells of two brain nuclei that are involved in controlling vocal behavior in zebra finches (Poephila guttata). HVc (caudal nucleus of the ventral hyperstriatum) is involved with vocal production in adult birds, and MAN (magnocellular nucleus of the anterior neostriatum) is involved with the initial ability to learn song. In both of these nuclei there is an increase in the proportion of cells that are labeled by systemic injections of tritiated dihydrotestosterone in juvenile male zebra finches during the time when production of song is becoming stereotyped (25-60 days). Within MAN there is an overall loss of cells during this time, such that the absolute number of androgen target cells in MAN remains at a constant level. However, it does not appear to be the case that unlabeled cells are selectively lost from MAN. Rather it appears that both labeled and unlabeled cells are lost, and the absolute number of labeled cells is maintained at a constant level via recruitment of additional labeled cells from the unlabeled population (i.e., some MAN cells that are unlabeled in young birds become labeled in older birds). In line with this hypothesis, there is a large increase in the density of labeling in individual MAN cells, indicating that these cells have an enhanced ability to concentrate androgen. In contrast to the situation in MAN, there is an increase in the overall number of cells within HVc during this time; this increase in total cell number combines with the increased proportion of labeled cells such that the absolute number of androgen target cells in HVc increases threefold. The ability of individual HVc cells to accumulate androgen remains constant. The relationship of these changes in the pattern of androgen accumulation to other aspects of neural and behavioral development related to song in zebra finches is discussed.

10.

Maternity length of stay modelling by gamma mixture regression with random effects

Lee AH Wang K Yau KK McLachlan GJ Ng SK 《Biometrical journal. Biometrische Zeitschrift》2007,49(5):750-764

Maternity length of stay (LOS) is an important measure of hospital activity, but its empirical distribution is often positively skewed. A two-component gamma mixture regression model has been proposed to analyze the heterogeneous maternity LOS. The problem is that observations collected from the same hospital are often correlated, which can lead to spurious associations and misleading inferences. To account for the inherent correlation, random effects are incorporated within the linear predictors of the two-component gamma mixture regression model. An EM algorithm is developed for the residual maximum quasi-likelihood estimation of the regression coefficients and variance component parameters. The approach enables the correct identification and assessment of risk factors affecting the short-stay and long-stay patient subgroups. In addition, the predicted random effects can provide information on the inter-hospital variations after adjustment for patient characteristics and health provision factors. A simulation study shows that the estimators obtained via the EM algorithm perform well in all the settings considered. Application to a set of maternity LOS data for women having obstetrical delivery with multiple complicating diagnoses is illustrated.

11.

Two Loci Controlling Genetic Cellular Resistance to Avian Leukosis-Sarcoma Viruses (Total citations: 9; self-citations: 7; citations by others: 2)

Lyman B. Crittenden Howard A. Stone Richard H. Reamer William Okazaki 《Journal of virology》1967,1(5):898-904

Female chickens known to be heterozygous for resistance to subgroups A and B of the avian leukosis-sarcoma viruses were mated to males known to be homozygously resistant to both. The progeny were assayed both on the chorioallantoic membrane (CAM) and in tissue culture for resistance to representative viruses of the A, B, and tentatively defined C subgroups. Segregation ratios of resistance to A and B subgroup viruses agreed with the previously suggested hypothesis of single-autosomal-recessive genes controlling resistance to each subgroup. Mixed infection on the CAM and replicate plate infection in tissue culture with subgroup A and B viruses showed that resistance to the A and B subgroups was inherited independently. Assays with viruses tentatively classified as subgroup C indicated that they were largely composed of a mixture of subgroup A and B viruses or of particles possessing the host range specificity of both. However, virus stocks of the subgroup C category, as well as some stocks classified as subgroup B, produced small numbers of pocks or foci on individuals known to be resistant to subgroup A and B viruses. It is suggested that these Rous sarcoma virus stocks carry between 1 and 10% of a true subgroup C virus.

12.

Reconstruction of genuine pair-wise sequence alignment.

Valery Polyanovsky Mikhail A Roytberg Vladimir G Tumanyan 《Journal of computational biology》2008,15(4):379-391

In many applications, the algorithmically obtained alignment ideally should restore the "golden standard" (GS) alignment, which superimposes positions originating from the same position of the common ancestor of the compared sequences. The average similarity between the algorithmically obtained and GS alignments ("the quality") is an important characteristic of an alignment algorithm. We proposed to determine the quality of an algorithm, using sequences that were artificially generated in accordance with an appropriate evolution model; the approach was applied to the global version of the Smith-Waterman algorithm (SWA). The quality of SWA is between 97% (for a PAM distance of 60) and 70% (for a PAM distance of 300). The percentage of identical aligned residues is the same for algorithmic and GS alignments. The total length of indels in algorithmic alignments is less than in the GS, mainly due to a substantial decrease in the number of indels in algorithmic alignments.

13.

Genea, Genabea and Gilkeya gen. nov.: ascomata and ectomycorrhiza formation in a Quercus woodland

Smith ME Trappe JM Rizzo DM 《Mycologia》2006,98(5):699-716

14.

HLA antigens and clinical subgroups of schizophrenia

C Rudduck G Franzén B Löw B Rorsman 《Human heredity》1984,34(1):18-26

Frequencies of HLA A, B, C, and DR antigens were studied in 100 schizophrenic patients and 919 controls from South Sweden. The patients were diagnosed according to the DSM III criteria and divided into four clinical subgroups (hebephrenic, paranoid, residual, and undifferentiated). In the schizophrenic patients as a whole significant increases were found for A2, A3, B17, B27, and Cw2 and decreases for A1, A11, and B8. A previous positive association with A9 from the same population was not confirmed. A significant heterogeneity between the four clinical subgroups was found for A3 and Bw35. Most of the associations between HLA antigens and schizophrenia reported in the literature appear to be fortuitous and dependent on the large number of trials made. However, confirmed increases have been found for A9 and B17, and confirmed decreases have been observed for A1 and B7. Some evidence for a heterogeneity between clinical subgroups was found in the present as well as in previous investigations.

15.

Clustering protein sequences--structure prediction by transitive homology. (Total citations: 2; self-citations: 0; citations by others: 2)

E Bolten A Schliep S Schneckener D Schomburg R Schrader 《Bioinformatics (Oxford, England)》2001,17(10):935-941

MOTIVATION: It is widely believed that for two proteins A and B a sequence identity above some threshold implies structural similarity due to a common evolutionary ancestor. Since this is only a sufficient, but not a necessary condition for structural similarity, the question remains what other criteria can be used to identify remote homologues. Transitivity refers to the concept of deducing a structural similarity between proteins A and C from the existence of a third protein B, such that A and B as well as B and C are homologues, as ascertained if the sequence identity between A and B as well as that between B and C is above the aforementioned threshold. It is not fully understood if transitivity always holds and whether transitivity can be extended ad infinitum. RESULTS: We developed a graph-based clustering approach, where transitivity plays a crucial role. We determined all pair-wise similarities for the sequences in the SwissProt database using the Smith-Waterman local alignment algorithm. This data was transformed into a directed graph, where protein sequences constitute vertices. A directed edge was drawn from vertex A to vertex B if the sequences A and B showed similarity, scaled with respect to the self-similarity of A, above a fixed threshold. Transitivity was important in the clustering process, as intermediate sequences were used, limited though by the requirement of having directed paths in both directions between proteins linked over such sequences. The length dependency (implied by the self-similarity) of the scaling of the alignment scores appears to be an effective criterion to avoid clustering errors due to multi-domain proteins. To deal with the resulting large graphs we have developed an efficient library. Methods include the novel graph-based clustering algorithm capable of handling multi-domain proteins and cluster comparison algorithms.
Structural Classification of Proteins (SCOP) was used as an evaluation data set for our method, yielding a 24% improvement over pair-wise comparisons in terms of detecting remote homologues. AVAILABILITY: The software is available to academic users on request from the authors. CONTACT: e.bolten@science-factory.com; schliep@zpr.uni-koeln.de; s.schneckener@science-factory.com; d.schomburg@uni-koeln.de; schrader@zpr.uni-koeln.de. SUPPLEMENTARY INFORMATION: http://www.zaik.uni-koeln.de/~schliep/ProtClust.html.
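
The directed-graph construction in this abstract can be sketched as follows. The sketch assumes that "directed paths in both directions" can be read as mutual reachability (that is, strongly connected components); the abstract does not spell out the authors' exact clustering rule, so this is one plausible reading rather than their algorithm. The score table, threshold, and function names are hypothetical.

```python
def build_graph(scores, threshold):
    """Draw an edge a -> b when score(a, b) / score(a, a) >= threshold.

    scores[a][b] is an alignment score; scores[a][a] is the self-similarity of a.
    """
    graph = {a: set() for a in scores}
    for a, row in scores.items():
        for b, s in row.items():
            if a != b and s / row[a] >= threshold:
                graph[a].add(b)
    return graph


def reachable(graph, start):
    """All vertices reachable from start along directed edges (including start)."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def clusters(graph):
    """Group vertices that are connected by directed paths in both directions."""
    reach = {v: reachable(graph, v) for v in graph}
    result, assigned = [], set()
    for v in graph:
        if v not in assigned:
            component = {u for u in reach[v] if v in reach[u]}
            result.append(component)
            assigned |= component
    return result


# Hypothetical scores: A and B are short single-domain proteins, M is a long
# multi-domain protein that contains a region similar to A.
scores = {
    "A": {"A": 100, "B": 80, "M": 90},
    "B": {"A": 80, "B": 110, "M": 20},
    "M": {"A": 90, "B": 20, "M": 500},
}
print(clusters(build_graph(scores, threshold=0.5)))
```

In this toy data the edge A -> M exists but M -> A does not, because the shared score is small relative to the self-similarity of M. That asymmetry is how scaling by self-similarity keeps a multi-domain protein from merging unrelated clusters.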

16.

A Monte Carlo EM Algorithm for De Novo Motif Discovery in Biomolecular Sequences (Total citations: 1; self-citations: 0; citations by others: 1)

Bi Chengpeng 《IEEE/ACM transactions on computational biology and bioinformatics / IEEE, ACM》2009,6(3):370-386

Motif discovery methods play pivotal roles in deciphering the genetic regulatory codes (i.e., motifs) in genomes as well as in locating conserved domains in protein sequences. The Expectation Maximization (EM) algorithm is one of the most popular methods used in de novo motif discovery. Based on the position weight matrix (PWM) updating technique, this paper presents a Monte Carlo version of the EM motif-finding algorithm that carries out stochastic sampling in local alignment space to overcome the conventional EM's main drawback of being trapped in a local optimum. The newly implemented algorithm is named the Monte Carlo EM Motif Discovery Algorithm (MCEMDA). MCEMDA starts from an initial model, and then it iteratively performs Monte Carlo simulation and parameter update until convergence. A log-likelihood profiling technique together with the top-k strategy is introduced to cope with the phase shifts and multiple modal issues in the motif discovery problem. A novel grouping motif alignment (GMA) algorithm is designed to select motifs by clustering a population of candidate local alignments and successfully applied to subtle motif discovery. MCEMDA compares favorably to other popular PWM-based and word enumerative motif algorithms tested using simulated (l, d)-motif cases, documented prokaryotic, and eukaryotic DNA motif sequences. Finally, MCEMDA is applied to detect large blocks of conserved domains using protein benchmarks and exhibits its excellent capacity when compared with other multiple sequence alignment methods.

17.

A multivariate model for ordinal trait analysis

Xu S Xu C 《Heredity》2006,97(6):409-417

Many economically important characteristics of agricultural crops are measured as ordinal traits. Statistical analysis of the genetic basis of ordinal traits appears to be quite different from regular quantitative traits. The generalized linear model methodology implemented via the Newton-Raphson algorithm offers improved efficiency in the analysis of such data, but does not take full advantage of the extensive theory developed in the linear model arena. Instead, we develop a multivariate model for ordinal trait analysis and implement an EM algorithm for parameter estimation. We also propose a method for calculating the variance-covariance matrix of the estimated parameters. The EM equations turn out to be extremely similar to formulae seen in standard linear model analysis. Computer simulations are performed to validate the EM algorithm. A real data set is analyzed to demonstrate the application of the method. The advantages of the EM algorithm over other methods are addressed. Application of the method to QTL mapping for ordinal traits is demonstrated using a simulated backcross (BC) population.

18.

A structural EM algorithm for phylogenetic inference.

Nir Friedman Matan Ninio Itsik Pe'er Tal Pupko 《Journal of computational biology》2002,9(2):331-353

A central task in the study of molecular evolution is the reconstruction of a phylogenetic tree from sequences of current-day taxa. The most established approach to tree reconstruction is maximum likelihood (ML) analysis. Unfortunately, searching for the maximum likelihood phylogenetic tree is computationally prohibitive for large data sets. In this paper, we describe a new algorithm that uses Structural Expectation Maximization (EM) for learning maximum likelihood phylogenetic trees. This algorithm is similar to the standard EM method for edge-length estimation, except that during iterations of the Structural EM algorithm the topology is improved as well as the edge length. Our algorithm performs iterations of two steps. In the E-step, we use the current tree topology and edge lengths to compute expected sufficient statistics, which summarize the data. In the M-Step, we search for a topology that maximizes the likelihood with respect to these expected sufficient statistics. We show that searching for better topologies inside the M-step can be done efficiently, as opposed to standard methods for topology search. We prove that each iteration of this procedure increases the likelihood of the topology, and thus the procedure must converge. This convergence point, however, can be a suboptimal one. To escape from such "local optima," we further enhance our basic EM procedure by incorporating moves in the flavor of simulated annealing. We evaluate these new algorithms on both synthetic and real sequence data and show that for protein sequences even our basic algorithm finds more plausible trees than existing methods for searching maximum likelihood phylogenies. Furthermore, our algorithms are dramatically faster than such methods, enabling, for the first time, phylogenetic analysis of large protein data sets in the maximum likelihood framework.

19.

Standard errors for EM estimates in generalized linear models with random effects

Friedl H Kauermann G 《Biometrics》2000,56(3):761-767

A procedure is derived for computing standard errors of EM estimates in generalized linear models with random effects. Quadrature formulas are used to approximate the integrals in the EM algorithm, where two different approaches are pursued, i.e., Gauss-Hermite quadrature in the case of Gaussian random effects and nonparametric maximum likelihood estimation for an unspecified random effect distribution. An approximation of the expected Fisher information matrix is derived from an expansion of the EM estimating equations. This allows for inferential arguments based on EM estimates, as demonstrated by an example and simulations.

20.

Deconvolving sequence variation in mixed DNA populations.

Andy Wildenberg Steven Skiena Pavel Sumazin 《Journal of computational biology》2003,10(3-4):635-652

We present an original approach to identifying sequence variants in a mixed DNA population from sequence trace data. The heart of the method is based on parsimony: given a wildtype DNA sequence, a set of observed variations at each position collected from sequencing data, and a complete catalog of all possible mutations, determine the smallest set of mutations from the catalog that could fully explain the observed variations. The algorithmic complexity of the problem is analyzed for several classes of mutations, including block substitutions, single-range deletions, and single-range insertions. The reconstruction problem is shown to be NP-complete for single-range insertions and deletions, while for block substitutions, single character insertion, and single character deletion mutations, polynomial time algorithms are provided. Once a minimum set of mutations compatible with the observed sequence is found, the relative frequency of those mutations is recovered by solving a system of linear equations. Simulation results show the algorithm successfully deconvolving mutations in p53 known to cause cancer. An extension of the algorithm is proposed as a new method of high throughput screening for single nucleotide polymorphisms by multiplexing DNA.
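
The parsimony objective stated in this abstract can be made concrete with a brute-force sketch. It is exponential in the catalog size and is not one of the polynomial-time algorithms from the paper. It assumes that "fully explain" means the chosen mutations together produce exactly the observed set of (position, base) variations; the catalog entries and names are hypothetical.

```python
from itertools import combinations


def smallest_explaining_set(observed, catalog):
    """Return a smallest set of catalog mutations whose combined variations
    equal the observed variations, or None if no subset does.

    observed: set of (position, base) variations seen in the trace data.
    catalog: dict mapping a mutation name to the set of variations it produces.
    """
    names = sorted(catalog)
    for size in range(len(names) + 1):  # try the smallest subsets first
        for subset in combinations(names, size):
            produced = set().union(*(catalog[name] for name in subset))
            if produced == observed:
                return set(subset)
    return None


# Hypothetical catalog: one block substitution and three single substitutions.
catalog = {
    "block_3_4": {(3, "T"), (4, "G")},
    "sub_3": {(3, "T")},
    "sub_4": {(4, "G")},
    "sub_7": {(7, "A")},
}
observed = {(3, "T"), (4, "G"), (7, "A")}
print(smallest_explaining_set(observed, catalog))
```

Here the block substitution plus one single substitution explains the data with two mutations, whereas three single substitutions would also fit but is less parsimonious.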