Similar articles
20 similar articles found.
1.
Massively parallel sequencing is an efficient, well-established method for obtaining sequence data from many individual samples at once. To ensure that sequencing, replication, and oligonucleotide synthesis errors do not result in tags (or barcodes) that are unrecoverable or confused with one another, the tag sequences should be abundant and sufficiently distinct. Recently, many design methods based on error-correcting codes have been proposed for correcting such errors. Existing tag sets are small, so in this study we used a modified genetic algorithm to improve the lower bounds on tag set sizes. Compared with previous research, our algorithm is effective for designing sets of DNA tags. Moreover, existing methods constrain the GC content only within an imprecise range, so we improved the GC content constraint to obtain tag sets whose GC content is controlled within a more precise range. Finally, previous studies considered only perfect self-complementarity; we also considered complementarity between different tags and introduced an improved constraint into the design of tag sets.
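Purely as illustration (not the authors' genetic algorithm), the following Python sketch shows the two constraints the abstract emphasizes: keeping each tag's GC content inside a precise window and keeping every pair of tags sufficiently different. The tag length, GC bounds, distance threshold, and the greedy search are all assumptions.

```python
from itertools import product

TAG_LEN = 8                 # assumed tag length
GC_MIN, GC_MAX = 0.4, 0.6   # assumed "precise" GC-content window
MIN_DIST = 3                # assumed minimum pairwise Hamming distance

def gc_content(tag: str) -> float:
    """Fraction of G/C bases in a tag."""
    return sum(b in "GC" for b in tag) / len(tag)

def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length tags."""
    return sum(x != y for x, y in zip(a, b))

def greedy_tag_set(max_tags: int) -> list[str]:
    """Greedily collect tags that satisfy the GC and distance constraints."""
    tags: list[str] = []
    for bases in product("ACGT", repeat=TAG_LEN):
        tag = "".join(bases)
        if not (GC_MIN <= gc_content(tag) <= GC_MAX):
            continue
        if all(hamming(tag, t) >= MIN_DIST for t in tags):
            tags.append(tag)
            if len(tags) == max_tags:
                break
    return tags

if __name__ == "__main__":
    print(len(greedy_tag_set(100)))
```

A genetic algorithm like the one described would replace the greedy enumeration with mutation and crossover over candidate tag sets while keeping the same filtering predicates.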

2.
Recent technological improvements in genetic data extraction raise the possibility of reconstructing the historical pedigrees of entire populations from the genotypes of individuals living today. Current methods are still not practical for real data scenarios, as they have limited accuracy and make unrealistic assumptions of monogamy and synchronized generations. To address these issues, we develop a new method for pedigree reconstruction based on formulations of the pedigree reconstruction problem as variants of graph coloring. The new formulation allows us to consider features that were overlooked by previous methods, resulting in reconstructions of up to 5 generations back in time, with an order-of-magnitude improvement in false-negative rates over the state of the art while keeping false-positive rates lower. We demonstrate the accuracy of the method compared with previous approaches using simulation studies over a range of population sizes, including inbred and outbred populations, monogamous and polygamous mating patterns, and synchronous and asynchronous mating.
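As a hedged illustration of the kind of graph-coloring step such a formulation can rely on (not the authors' algorithm), the sketch below greedily colors a hypothetical "conflict" graph in which vertices are individuals and an edge joins two individuals that cannot share a parent; each color then corresponds to a putative sibling group.

```python
def greedy_coloring(n: int, edges: set[tuple[int, int]]) -> list[int]:
    """Assign each vertex the smallest color not used by its neighbors."""
    adj = [set() for _ in range(n)]
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    color = [-1] * n
    for v in range(n):
        used = {color[u] for u in adj[v] if color[u] >= 0}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    return color

# Hypothetical example: an edge means the pair cannot descend from the same
# (unobserved) parent; vertices sharing a color form a putative sibling group.
print(greedy_coloring(5, {(0, 1), (1, 2), (3, 4)}))
```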

3.
Estimating forces in muscles and joints during locomotion requires formulations consistent with available methods of solving the indeterminate problem. Direct comparisons of results between differing optimization methods proposed in the literature have been difficult owing to widely varying model formulations, algorithms, input data, and other factors. We present an application of a new optimization program that includes linear and nonlinear techniques, allowing a variety of cost functions and greater flexibility in problem formulation. Unified solution methods, such as the one demonstrated here, offer direct evaluation of factors such as optimization criteria and constraints. This unified method demonstrates that nonlinear formulations (of the sort reported) allow more synergistic activity and, in contrast to linear formulations, allow antagonistic activity. Agreement between EMG activity and predicted forces is better for nonlinear predictions than for linear ones. The prediction of synergistic and antagonistic activity expectedly leads to higher joint force predictions. Relaxing the requirement that muscles resolve the entire intersegmental moment maintains muscle synergism in the nonlinear formulation while relieving muscle antagonism and reducing the predicted joint contact force. Such unified methods open more possibilities for exploring new optimization formulations and for comparing their solutions with previously reported formulations.
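A minimal sketch of a nonlinear static-optimization step of the kind described above, assuming a toy two-muscle, one-joint model: minimize a cubic muscle-stress cost subject to a moment-balance equality constraint using scipy. The moment arms, PCSA values, required moment, and cost exponent are illustrative assumptions, not the authors' model.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical two-muscle, one-joint example.
moment_arms = np.array([0.04, 0.05])   # m (assumed)
pcsa = np.array([10.0, 15.0])          # cm^2 (assumed)
required_moment = 2.0                  # N*m (assumed)

def cost(f):
    """Nonlinear cost: sum of cubed muscle stresses."""
    return np.sum((f / pcsa) ** 3)

constraints = [{"type": "eq",
                "fun": lambda f: moment_arms @ f - required_moment}]
bounds = [(0.0, None)] * 2             # muscles can only pull

res = minimize(cost, x0=np.array([1.0, 1.0]),
               bounds=bounds, constraints=constraints)
print(res.x)                           # predicted muscle forces
```

Replacing the cubic cost with a linear one (and using a linear programming solver) reproduces the contrast between linear and nonlinear formulations discussed in the abstract.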

4.
Recent studies have shown that the human genome has a haplotype block structure: it can be decomposed into large blocks with high linkage disequilibrium (LD) and relatively limited haplotype diversity, separated by short regions of low LD. One practical implication of this observation is that only a small fraction of all single-nucleotide polymorphisms (SNPs) (referred to as "tag SNPs") need be chosen for mapping genes responsible for human complex diseases, which can significantly reduce genotyping effort without much loss of power. Algorithms have been developed to partition haplotypes into blocks with the minimum number of tag SNPs for an entire chromosome. In practice, investigators may have limited resources, and only a certain number of SNPs can be genotyped. In the present article, we first formulate this problem as finding a block partition with a fixed number of tag SNPs that covers the maximal percentage of the whole genome, and we then develop two dynamic programming algorithms to solve it. The algorithms are sufficiently flexible to permit knowledge of functional polymorphisms to be considered. We apply the algorithms to a data set of SNPs on human chromosome 21, combining the information of coding and noncoding regions. We study the density of SNPs in intergenic regions, introns, and exons, and we find that the SNP density in intergenic regions is similar to that in introns and higher than that in exons, results that are consistent with previous studies. We also calculate the distribution of block break points in intergenic regions, genes, exons, and coding regions and do not find any significant differences.
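The following is a simplified dynamic-programming sketch in the spirit of the formulation above: maximize the number of SNPs covered by blocks when at most T tag SNPs may be spent. The helper tags_needed(j, i) is a hypothetical placeholder for the number of tag SNPs a candidate block spanning SNPs j..i would require; the actual algorithms in the article are more elaborate.

```python
def max_coverage(n: int, T: int, tags_needed) -> int:
    """f[i][t] = max number of SNPs among the first i that blocks can cover
    using at most t tag SNPs. SNP i either stays uncovered or ends a block."""
    f = [[0] * (T + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for t in range(T + 1):
            best = f[i - 1][t]                 # leave SNP i uncovered
            for j in range(1, i + 1):          # candidate block spans SNPs j..i
                need = tags_needed(j, i)
                if need <= t:
                    best = max(best, f[j - 1][t - need] + (i - j + 1))
            f[i][t] = best
    return f[n][T]

# Toy placeholder: every block of length L needs ceil(L / 5) tag SNPs.
print(max_coverage(20, 3, lambda j, i: -(-(i - j + 1) // 5)))
```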

5.
The study of genome rearrangements is an important tool in comparative genomics. This paper revisits the problem of sorting a multichromosomal genome by translocations, i.e., exchanges of chromosome ends. We give an elementary proof of the formula for computing the translocation distance in linear time, and we give a new algorithm for sorting by translocations, correcting an error in a previous algorithm by Hannenhalli.

6.
Efficient enumeration of phylogenetically informative substrings.
We study the problem of enumerating substrings that are common among genomes sharing evolutionary descent. For example, one might want to enumerate all identical (therefore conserved) substrings that are shared by all mammals and not found in non-mammals. Such collections of substrings may be used to identify conserved subsequences or to construct sets of identifying substrings for branches of a phylogenetic tree. For two disjoint sets of genomes on a phylogenetic tree, a substring is called a tag if it is found in all of the genomes of one set and in none of the genomes of the other set. We present a near-linear-time algorithm that finds all tags in a given phylogeny, and a sublinear-space algorithm (at the expense of running time) that is better suited to very large data sets. Under a stochastic model of evolution, we show that a simple process of tag generation essentially captures all possible ways of generating tags. We use this insight to develop a faster tag discovery algorithm with a small chance of error. However, since tags are not guaranteed to exist in a given data set, we generalize the notion of a tag from a single substring to a set of substrings. We present a linear-programming-based approach for finding approximate generalized tag sets. Finally, we use our tag enumeration algorithm to analyze a phylogeny containing 57 whole microbial genomes. We find tags for all nodes in the phylogeny except the root, for which we find generalized tag sets.
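To make the tag definition concrete, here is a brute-force Python sketch (not the near-linear-time algorithm of the paper): enumerate the k-mers present in every genome of one set and absent from every genome of the other. The sequences and k are made up.

```python
def kmers(seq: str, k: int) -> set[str]:
    """All length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def tags(in_set: list[str], out_set: list[str], k: int) -> set[str]:
    """k-mers present in every genome of in_set and in no genome of out_set."""
    shared = set.intersection(*(kmers(g, k) for g in in_set))
    excluded = set.union(*(kmers(g, k) for g in out_set)) if out_set else set()
    return shared - excluded

# Toy example with made-up sequences.
print(tags(["ACGTACGT", "TTACGTAA"], ["GGGGCCCC"], k=4))
```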

7.
We describe a set of computational problems motivated by certain analysis tasks in genome resequencing. These are assembly problems for which multiple distinct sequences must be assembled, but where the relative positions of reads to be assembled are already known. This information is obtained from a common reference genome and is characteristic of resequencing experiments. The simplest variant of the problem aims at determining a minimum set of superstrings such that each sequenced read matches at least one superstring. We give an algorithm with time complexity O(N), where N is the sum of the lengths of reads, substantially improving on previous algorithms for solving the same problem. We also examine the problem of finding the smallest number of reads to remove such that the remaining reads are consistent with k superstrings. By exploiting a surprising relationship with the minimum cost flow problem, we show that this problem can be solved in polynomial time when nested reads are excluded. If nested reads are permitted, this problem of removing the minimum number of reads becomes NP-hard. We show that permitting mismatches between reads and their nearest superstrings generally renders these problems NP-hard.

8.
The classification of cancer subtypes, which is critical for successful treatment, has been studied extensively using gene expression profiles from oligonucleotide chips or cDNA microarrays. Various pattern recognition methods have been successfully applied to gene expression data. However, these methods are not optimal; rather, they are high-performance classifiers that emphasize only classification accuracy. In this paper, we propose an approach for constructing an optimal linear classifier from gene expression data. Two linear classification methods, linear discriminant analysis (LDA) and discriminant partial least-squares (DPLS), are applied to distinguish acute leukemia subtypes. These methods are shown to give satisfactory accuracy. Moreover, we optimally determined the number of genes participating in the classification (a remarkably small number compared with previous results) on the basis of a statistical significance test. Thus, the proposed method constructs an optimal classifier that is composed of a small predictor set and provides high accuracy.
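A minimal sketch of the general idea, not the authors' pipeline: select a small number of genes with a univariate significance test and feed them to linear discriminant analysis, here with scikit-learn on synthetic stand-in data. The gene count k and the data are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for an expression matrix: 72 samples x 1000 genes.
X, y = make_classification(n_samples=72, n_features=1000,
                           n_informative=20, random_state=0)

# Keep the k genes with the most significant class differences, then apply LDA.
clf = make_pipeline(SelectKBest(f_classif, k=20), LinearDiscriminantAnalysis())
print(cross_val_score(clf, X, y, cv=5).mean())
```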

9.
A theoretical study, using numerical methods, of the problem of packing spherical particles has been carried out in order to design an optimal chromatographic packing material for cell separation. It is concluded that a monodisperse packing of hard beads does not yield a packing in which narrow crevices, impenetrable to cells, are excluded. Allowing a certain degree of compressibility alleviates this problem somewhat.

10.
In comparative genomics, algorithms that sort permutations by reversals are often used to propose evolutionary scenarios of rearrangements between species. One of the main problems of such methods is that they give one solution while the number of optimal solutions is huge, with no criteria to discriminate among them. Bergeron et al. started to give some structure to the set of optimal solutions, in order to be able to deliver more presentable results than only one solution or a complete list of all solutions. However, no algorithm exists so far to compute this structure except through the enumeration of all solutions, which takes too much time even for small permutations. Bergeron et al. state as an open problem the design of such an algorithm. We propose in this paper an answer to this problem, that is, an algorithm which gives all the classes of solutions and counts the number of solutions in each class, with a better theoretical and practical complexity than the complete enumeration method. We give an example of how to reduce the number of classes obtained, using further constraints. Finally, we apply our algorithm to analyse the possible scenarios of rearrangement between mammalian sex chromosomes.

11.
Datasets from different agencies often contain records for the same individuals. Linking these datasets to identify all the records belonging to the same individuals is a crucial and challenging problem, especially given the large volumes of data. Many of the available record linkage algorithms suffer from either time inefficiency or low accuracy in finding matches and non-matches among the records. In this paper we propose efficient and reliable sequential and parallel algorithms for the record linkage problem based on complete-linkage hierarchical clustering. In addition to hierarchical clustering, we use two other techniques: elimination of duplicate records and blocking. Our algorithms use sorting as a subroutine to identify identical copies of records. We have tested our algorithms on datasets with millions of synthetic records. Experimental results show that our algorithms achieve nearly 100% accuracy, and parallel implementations achieve almost linear speedups. The time complexities of these algorithms do not exceed those of the previous best-known algorithms, which our proposed algorithms outperform in terms of accuracy while consuming reasonable run times.
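A small sketch combining the two ideas above, blocking followed by complete-linkage hierarchical clustering within each block, using scipy. The blocking key, string-similarity measure, and distance cutoff are illustrative choices, not those of the paper.

```python
from collections import defaultdict
from difflib import SequenceMatcher

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def link_records(records: list[str], cutoff: float = 0.3) -> dict[str, int]:
    """Group records by a blocking key, then complete-linkage cluster each block."""
    blocks = defaultdict(list)
    for r in records:
        blocks[r[0].lower()].append(r)          # blocking key: first character
    labels, next_id = {}, 0
    for block in blocks.values():
        if len(block) == 1:
            labels[block[0]] = next_id
            next_id += 1
            continue
        n = len(block)
        dist = np.zeros((n, n))
        for i in range(n):
            for j in range(i + 1, n):
                d = 1.0 - SequenceMatcher(None, block[i], block[j]).ratio()
                dist[i, j] = dist[j, i] = d
        Z = linkage(squareform(dist), method="complete")
        for rec, c in zip(block, fcluster(Z, t=cutoff, criterion="distance")):
            labels[rec] = next_id + int(c)
        next_id += n
    return labels

print(link_records(["john smith", "jon smith", "jane doe"]))
```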

12.
Peptide 'Velcro': design of a heterodimeric coiled coil

13.
Nanda V, Schmiedekamp A. Proteins. 2008;70(2):489-497.
Proteins fold and maintain structure through the collective contributions of a large number of weak, noncovalent interactions. The hydrogen bond is one important category of forces acting over very short distances. As our knowledge of protein structure continues to expand, we are beginning to appreciate the role that weak carbon-donor hydrogen bonds play in structure and function. One property that differentiates hydrogen bonds from other packing forces is the propensity for forming a linear donor-hydrogen-acceptor orientation. To ascertain whether carbon-donor hydrogen bonds are able to direct acceptor linearity, we surveyed the geometry of interactions specifically involving aromatic sidechain ring carbons in a data set of high-resolution protein structures. We found that while donor-acceptor distances for most carbon-donor hydrogen bonds were tighter than expected for van der Waals packing, only the carbons of histidine showed a significant bias toward linear geometry. By categorizing histidines in the data set into charged and neutral sidechains, we found that only the charged subset of histidines participated in linear interactions. B3LYP/6-31G**++ level optimizations of imidazole-water and indole-water interactions at various fixed angles demonstrate a clear orientation dependence of hydrogen bonding capacity for both charged and neutral sidechains. We suggest that while all aromatic carbons can participate in hydrogen bonding, only charged histidines are able to overcome protein packing forces and enforce linear interactions. The implications for protein modeling and design are discussed.
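The geometric measurement at the heart of this kind of survey is the donor-hydrogen-acceptor angle; values near 180 degrees indicate a linear hydrogen bond. A small numpy sketch with made-up coordinates:

```python
import numpy as np

def dha_angle(donor, hydrogen, acceptor) -> float:
    """Donor-hydrogen-acceptor angle in degrees; ~180 means a linear H-bond."""
    d, h, a = map(np.asarray, (donor, hydrogen, acceptor))
    v1, v2 = d - h, a - h
    cosang = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return float(np.degrees(np.arccos(np.clip(cosang, -1.0, 1.0))))

# Made-up coordinates (angstroms) for an aromatic C-H...O interaction.
print(dha_angle((0.0, 0.0, 0.0), (1.09, 0.0, 0.0), (3.3, 0.2, 0.0)))
```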

14.
MOTIVATION: Inferring genetic networks from time-series expression data has attracted a great deal of interest. In most cases, however, the number of genes exceeds the number of data points, which, in principle, makes it impossible to recover the underlying networks. To address this dimensionality problem, we apply a subset selection method to a linear system of difference equations. Previous approaches assign the single most likely combination of regulators to each target gene, which often causes over-fitting to the small amount of data. RESULTS: Here, we propose a new algorithm, named LEARNe, which merges the predictions from all combinations of regulators that reach a certain level of likelihood. LEARNe provides more accurate and robust predictions of genetic network structure than previous methods under the linear system model. We tested LEARNe on reconstructing the SOS regulatory network of Escherichia coli and the cell cycle regulatory network of yeast from real experimental data, where LEARNe also exhibited better performance than previous methods. AVAILABILITY: The MATLAB codes are available upon request from the authors.
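For orientation, here is a minimal sketch of subset selection for a single target gene under a linear difference-equation model x_target(t+1) = sum_j a_j x_j(t): enumerate small regulator subsets and fit each by least squares. The data, subset size, and scoring are simplifications; LEARNe itself merges predictions across all sufficiently likely combinations rather than keeping only the best one.

```python
from itertools import combinations

import numpy as np

def best_regulators(X: np.ndarray, target: int, k: int = 2):
    """X: genes x timepoints. Return the k-regulator subset that best predicts
    the target gene's next-timepoint expression under a linear model."""
    y = X[target, 1:]                          # x_target(t+1)
    best = (np.inf, None, None)
    for subset in combinations(range(X.shape[0]), k):
        A = X[list(subset), :-1].T             # regressors x_j(t)
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        rss = np.sum((A @ coef - y) ** 2)
        if rss < best[0]:
            best = (rss, subset, coef)
    return best

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 30))               # toy data: 6 genes, 30 timepoints
X[0, 1:] = 0.8 * X[1, :-1] - 0.5 * X[2, :-1]   # gene 0 driven by genes 1 and 2
print(best_regulators(X, target=0)[1])         # expect (1, 2)
```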

15.
The relative solvent accessibility (RSA) of an amino acid residue in a protein structure is a real number that represents the solvent-exposed surface area of the residue in relative terms. The problem of predicting RSA from the primary amino acid sequence can therefore be cast as a regression problem. Nevertheless, RSA prediction has so far typically been cast as a classification problem, and various machine learning techniques have been used within that framework to predict whether a given amino acid exceeds some (arbitrary) RSA threshold and would thus be predicted to be "exposed" as opposed to "buried." We have recently developed novel methods for RSA prediction using nonlinear regression techniques, which provide accurate estimates of the real-valued RSA and outperform classification-based approaches with respect to commonly used two-class projections. However, while their performance seems to provide a significant improvement over previously published approaches, these neural network (NN) based methods are computationally expensive to train and involve several thousand parameters. In this work, we develop alternative regression models for RSA prediction that are computationally much less expensive, involve orders of magnitude fewer parameters, and are still competitive in terms of prediction quality. In particular, we investigate several regression models for RSA prediction using linear L1-support vector regression (SVR) approaches as well as standard linear least squares (LS) regression. Using rigorously derived validation sets of protein structures and extensive cross-validation analysis, we compare the performance of the SVR with that of LS regression and NN-based methods. In particular, we show that the flexibility of the SVR (as encoded by metaparameters such as the error insensitivity and the error penalization terms) can be very beneficial for optimizing the prediction accuracy of buried residues. We conclude that the simple and computationally much more efficient linear SVR performs comparably to nonlinear models and can thus be used to facilitate further attempts to design more accurate RSA prediction methods, with applications to fold recognition and de novo protein structure prediction.
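A minimal sketch of a linear SVR approach on synthetic stand-in data, using scikit-learn's LinearSVR (epsilon-insensitive loss). The feature encoding, data, and metaparameter values are placeholder assumptions, not those of the paper.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVR

# Synthetic stand-in for window features -> relative solvent accessibility.
rng = np.random.default_rng(0)
X = rng.random((2000, 180))             # e.g. 9-residue window x 20-dim encoding
w = rng.random(180)
y = np.clip(X @ w / w.sum(), 0.0, 1.0)  # RSA is a fraction in [0, 1]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# epsilon controls error insensitivity; C penalizes residuals beyond epsilon.
model = LinearSVR(epsilon=0.05, C=1.0, max_iter=10000)
model.fit(X_tr, y_tr)
print(model.score(X_te, y_te))          # R^2 on held-out residues
```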

16.
Summary: In the previous article we tried to explain the success of models for enzymatic reactions in which coenzymes play a role. However, one vital cofactor that has thus far not yielded appreciably to this approach is biotin. We attempt to show that biotin presents a fundamentally new sort of problem to bioorganic chemistry; understanding the anomalies offered by this cofactor requires consideration of both the origin of life and subsequent evolution.

17.
MOTIVATION: In molecular biology, sequence alignment is a crucial tool in studying the structure and function of molecules, as well as the evolution of species. In the segment-to-segment variation of the multiple alignment problem, the input can be seen as a set of non-gapped segment pairs (diagonals). Given a weight function that assigns a weight score to every possible diagonal, the goal is to choose a consistent set of diagonals of maximum weight. We show that the segment-to-segment multiple alignment problem is equivalent to a novel formulation of the Maximum Trace problem: the Generalized Maximum Trace (GMT) problem. Solving this problem to optimality, therefore, may improve upon the previous greedy strategies that are used for solving the segment-to-segment multiple sequence alignment problem. We show that the GMT can be stated in terms of an integer linear program and then solve the integer linear program using methods from polyhedral combinatorics. This leads to a branch-and-cut algorithm for segment-to-segment multiple sequence alignment. RESULTS: We report on our first computational experiences with this novel method and show that the program is able to find optimal solutions for real-world test examples.

18.
De novo design of the hydrophobic core of ubiquitin.
We have previously reported the development and evaluation of a computational program to assist in the design of hydrophobic cores of proteins. In an effort to investigate the role of core packing in protein structure, we have used this program, referred to as Repacking of Cores (ROC), to design several variants of the protein ubiquitin. Nine ubiquitin variants containing from three to eight hydrophobic core mutations were constructed, purified, and characterized in terms of their stability and their ability to adopt a uniquely folded native-like conformation. In general, designed ubiquitin variants are more stable than control variants in which the hydrophobic core was chosen randomly. However, in contrast to previous results with 434 cro, all designs are destabilized relative to the wild-type (WT) protein. This raises the possibility that beta-sheet structures have more stringent packing requirements than alpha-helical proteins. A more striking observation is that all variants, including random controls, adopt fairly well-defined conformations, regardless of their stability. This result supports conclusions from the cro studies that non-core residues contribute significantly to the conformational uniqueness of these proteins while core packing largely affects protein stability and has less impact on the nature or uniqueness of the fold. Concurrent with the above work, we used stability data on the nine ubiquitin variants to evaluate and improve the predictive ability of our core packing algorithm. Additional versions of the program were generated that differ in potential function parameters and sampling of side chain conformers. Reasonable correlations between experimental and predicted stabilities suggest the program will be useful in future studies to design variants with stabilities closer to that of the native protein. Taken together, the present study provides further clarification of the role of specific packing interactions in protein structure and stability, and demonstrates the benefit of using systematic computational methods to predict core packing arrangements for the design of proteins.

19.
In the literature, various discrete-time and continuous-time mixed-integer linear programming (MIP) formulations for project scheduling problems have been proposed. The performance of these formulations has been analyzed based on generic test instances. The objective of this study is to analyze the performance of discrete-time and continuous-time MIP formulations for a real-life application of project scheduling in human resource management. We consider the problem of scheduling assessment centers. In an assessment center, candidates for job positions perform different tasks while being observed and evaluated by assessors. Because these assessors are highly qualified and expensive personnel, the duration of the assessment center should be minimized. Complex rules for assigning assessors to candidates distinguish this problem from other scheduling problems discussed in the literature. We develop two discrete-time and three continuous-time MIP formulations, and we present problem-specific lower bounds. In a comparative study, we analyze the performance of the five MIP formulations on four real-life instances and a set of 240 instances derived from real-life data. The results indicate that good or optimal solutions are obtained for all instances within short computational time. In particular, one of the real-life instances is solved to optimality. Surprisingly, the continuous-time formulations outperform the discrete-time formulations in terms of solution quality.

20.