Similar articles
 Found 20 similar articles; search took 703 ms
1.
Pevzner and Sze (19) introduced the Planted (l, d)-Motif Problem: find similar patterns (motifs) in sequences that represent the promoter regions of co-regulated genes, where l is the length of the motif and d is the maximum Hamming distance between the motif and each of its occurrences. Many algorithms have been developed to solve this motif problem. However, these algorithms either have long running times or do not guarantee that the motif will be found. In this paper, we introduce new algorithms to solve this motif problem. Our algorithms can find motifs in reasonable time not only for the challenging (9, 2), (11, 3) and (15, 5)-motif problems but also for even longer motifs, such as (20, 7), (30, 11) and (40, 15), which have never been seriously attempted by other researchers because of the large time and space required. In addition, our algorithms can be extended to find more complicated motif structures called cis-regulatory modules (CRM).
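To make the problem statement concrete, here is a minimal brute-force sketch in Python. It is not the algorithm from the abstract: it simply enumerates every candidate of length l and keeps those that occur in every sequence within Hamming distance d, which is exponential in l. The function names and the default DNA alphabet are assumptions made for illustration.

```python
from itertools import product


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def occurs_within(motif: str, seq: str, d: int) -> bool:
    """True if some window of seq of length len(motif) is within distance d."""
    l = len(motif)
    return any(hamming(motif, seq[i:i + l]) <= d
               for i in range(len(seq) - l + 1))


def planted_motifs(seqs: list[str], l: int, d: int,
                   alphabet: str = "ACGT") -> list[str]:
    """Exhaustive search over all len(alphabet)**l candidates (illustration only)."""
    candidates = ("".join(p) for p in product(alphabet, repeat=l))
    return [m for m in candidates
            if all(occurs_within(m, s, d) for s in seqs)]
```

For example, `planted_motifs(["ACGTAC", "TTACGA", "GACGTT"], 3, 0)` keeps only the 3-mers that appear exactly in all three sequences. The cost of this enumeration is what makes instances such as (15, 5) challenging.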

2.
Finding motifs using random projections.   Cited by: 19 (self-citations: 0; citations by others: 19)

3.
4.
The vertex coloring problem is a classical problem in combinatorial optimization that consists of assigning a color to each vertex of a graph such that no adjacent vertices share the same color, while minimizing the number of colors used. Despite the various practical applications of this problem, its NP-hardness still represents a computational challenge. Some of the best computational results obtained for this problem come from hybridizing the various known heuristics. Automatically exploring the space formed by combinations of these techniques to find the most adequate combination has received less attention. In this paper, we propose exploring the heuristics space for the vertex coloring problem using evolutionary algorithms. We automatically generate three new algorithms by combining elementary heuristics. To evaluate the new algorithms, a computational experiment was performed that allowed comparing them numerically with existing heuristics. The obtained algorithms present an average 29.97% relative error, while four other heuristics selected from the literature present a 59.73% error, considering 29 of the more difficult instances in the DIMACS benchmark.
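As an illustration of what an elementary coloring heuristic looks like, the following Python sketch colors vertices in order of decreasing degree, giving each the smallest color not used by an already colored neighbor. This largest-first greedy rule is a standard textbook heuristic chosen here as an assumption; the abstract does not say which elementary heuristics the authors combine.

```python
def greedy_coloring(adj: dict[int, set[int]]) -> dict[int, int]:
    """Largest-degree-first greedy coloring; colors are 0, 1, 2, ..."""
    order = sorted(adj, key=lambda v: len(adj[v]), reverse=True)
    color: dict[int, int] = {}
    for v in order:
        used = {color[u] for u in adj[v] if u in color}
        c = 0
        while c in used:  # smallest color not taken by a colored neighbor
            c += 1
        color[v] = c
    return color


def is_proper(adj: dict[int, set[int]], color: dict[int, int]) -> bool:
    """True if no edge joins two vertices of the same color."""
    return all(color[u] != color[v] for u in adj for v in adj[u])
```

The result is always a proper coloring, but the number of colors can exceed the optimum, which is why such heuristics are evaluated by their relative error on benchmark instances.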

5.
We present two parameterized algorithms for the closest string problem. The first runs in O(nL + nd · 17.97^d) time for DNA strings and in O(nL + nd · 61.86^d) time for protein strings, where n is the number of input strings, L is the length of each input string, and d is the given upper bound on the number of mismatches between the center string and each input string. The second runs in O(nL + nd · 13.92^d) time for DNA strings and in O(nL + nd · 47.21^d) time for protein strings. We then extend the first algorithm to a new parameterized algorithm for the closest substring problem that runs in O((n - 1) m^2 (L + d · 17.97^d · m^[log2(d+1)])) time for DNA strings and in O((n - 1) m^2 (L + d · 61.86^d · m^[log2(d+1)])) time for protein strings, where n is the number of input strings, L is the length of the center substring, L - 1 + m is the maximum length of a single input string, and d is the given upper bound on the number of mismatches between the center substring and at least one substring of each input string. All the algorithms significantly improve the previous bests. To verify experimentally the theoretical improvements in the time complexity, we implement our algorithm in C and apply the resulting program to the planted (L, d)-motif problem proposed by Pevzner and Sze in 2000. We compare our program with the previously best exact program for the problem, namely PMSPrune (designed by Davila et al. in 2007). Our experimental data show that our program runs faster for practical cases and also for several challenging cases. Our algorithm uses less memory too.
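The closest string problem itself can be stated in a few lines of Python. The sketch below is a naive baseline, not the parameterized algorithm of the abstract: it tries every string of length L over the alphabet and returns the first one whose largest Hamming distance to the inputs is at most d. The names `radius` and `closest_string` are hypothetical.

```python
from itertools import product
from typing import Optional


def radius(center: str, strings: list[str]) -> int:
    """Largest Hamming distance from center to any input string."""
    return max(sum(a != b for a, b in zip(center, s)) for s in strings)


def closest_string(strings: list[str], d: int,
                   alphabet: str = "ACGT") -> Optional[str]:
    """Brute force over all len(alphabet)**L centers; returns None if none fits."""
    L = len(strings[0])  # all inputs are assumed to have the same length
    for cand in product(alphabet, repeat=L):
        center = "".join(cand)
        if radius(center, strings) <= d:
            return center
    return None
```

This baseline is exponential in L, whereas the running times quoted above are exponential only in the mismatch bound d, which is typically much smaller.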

6.
7.
MOTIVATION: Motif discovery in sequential data is a problem of great interest and with many applications. However, previous methods have been unable to combine exhaustive search with complex motif representations and are each typically only applicable to a certain class of problems. RESULTS: Here we present a generic motif discovery algorithm (Gemoda) for sequential data. Gemoda can be applied to any dataset with a sequential character, including both categorical and real-valued data. As we show, Gemoda deterministically discovers motifs that are maximal in composition and length. The algorithm also allows any choice of similarity metric for finding motifs. Finally, Gemoda's output motifs are representation-agnostic: they can be represented using regular expressions, position weight matrices or any number of other models for any type of sequential data. We demonstrate a number of applications of the algorithm, including the discovery of motifs in amino acid sequences, a new solution to the (l,d)-motif problem in DNA sequences and the discovery of conserved protein substructures. AVAILABILITY: Gemoda is freely available at http://web.mit.edu/bamel/gemoda

8.
The paper gives an overview on the status of the theoretical analysis of Ant Colony Optimization (ACO) algorithms, with a special focus on the analytical investigation of the runtime required to find an optimal solution to a given combinatorial optimization problem. First, a general framework for studying questions of this type is presented, and three important ACO variants are recalled within this framework. Secondly, two classes of formal techniques for runtime investigations of the considered type are outlined. Finally, some available runtime complexity results for ACO variants, referring to elementary test problems that have been introduced in the theoretical literature on evolutionary algorithms, are cited and discussed.

9.
IRBM, 2020, 41(5): 267-275
BACKGROUND AND OBJECTIVE: Clustering has been a widely used method for data analysis for years, and many clustering algorithms exist. Today it is used in many prediction, collaborative filtering and automatic segmentation systems in different domains. To be broadly used in practice, clustering algorithms need to offer both better performance and better robustness than those currently in use. In recent years, evolutionary algorithms have been used in many domains because they are robust and easy to implement, and many clustering problems can be solved easily with such algorithms if the problem is modeled as an optimization problem. In this paper, we present an optimization approach for clustering using four well-known evolutionary algorithms: Biogeography-Based Optimization (BBO), Grey Wolf Optimization (GWO), Genetic Algorithm (GA) and Particle Swarm Optimization (PSO). METHOD: The objective function has been specified to minimize the total distance from cluster centers to the data points. Euclidean distance is used for distance calculation. We have applied this objective function to the given algorithms both to find the most efficient clustering algorithm and to compare the clustering performances of the algorithms across different data sizes. To benchmark the clustering performances of the algorithms in the experiments, we have used a number of datasets of different sizes: small-scale, medium-scale and big data. The clustering performances have been compared to K-means, as it has been a widely used clustering algorithm in the literature for years. Rand Index, Adjusted Rand Index, Mirkin's Index and Hubert's Index have been considered as parameters for evaluating the clustering performances. RESULTS: In the clustering experiments over datasets of varying sizes, evaluated according to the specified performance criteria, the GA and GWO algorithms show better clustering performance than the others. CONCLUSIONS: The results of the study showed that although the algorithms achieve satisfactory clustering results on small and medium-scale datasets, their clustering performance on big data needs to be improved.
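The objective described in the method can be sketched as a small Python function. This is one plausible reading, assuming each data point is charged the Euclidean distance to its nearest cluster center; the abstract does not spell out the assignment rule, and the function name is hypothetical.

```python
import math
from typing import Sequence

Point = Sequence[float]


def total_distance(points: Sequence[Point], centers: Sequence[Point]) -> float:
    """Sum, over all points, of the Euclidean distance to the nearest center.

    Assumption: points are assigned to their closest center, as in K-means.
    """
    return sum(min(math.dist(p, c) for c in centers) for p in points)
```

An evolutionary algorithm such as GA or PSO would then treat a candidate solution as a list of k centers and search for the list that minimizes this value.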

10.
Motif discovery methods play pivotal roles in deciphering the genetic regulatory codes (i.e., motifs) in genomes as well as in locating conserved domains in protein sequences. The Expectation Maximization (EM) algorithm is one of the most popular methods used in de novo motif discovery. Based on the position weight matrix (PWM) updating technique, this paper presents a Monte Carlo version of the EM motif-finding algorithm that carries out stochastic sampling in local alignment space to overcome the conventional EM's main drawback of being trapped in a local optimum. The newly implemented algorithm is named the Monte Carlo EM Motif Discovery Algorithm (MCEMDA). MCEMDA starts from an initial model and then iteratively performs Monte Carlo simulation and parameter updates until convergence. A log-likelihood profiling technique together with the top-k strategy is introduced to cope with the phase shifts and multiple modal issues in the motif discovery problem. A novel grouping motif alignment (GMA) algorithm is designed to select motifs by clustering a population of candidate local alignments and is successfully applied to subtle motif discovery. MCEMDA compares favorably to other popular PWM-based and word enumerative motif algorithms tested using simulated (l, d)-motif cases and documented prokaryotic and eukaryotic DNA motif sequences. Finally, MCEMDA is applied to detect large blocks of conserved domains using protein benchmarks and exhibits excellent capacity when compared with other multiple sequence alignment methods.

11.
MOTIVATION: Gene expression data clustering provides a powerful tool for studying functional relationships of genes in a biological process. Identifying correlated expression patterns of genes represents the basic challenge in this clustering problem. RESULTS: This paper describes a new framework for representing a set of multi-dimensional gene expression data as a Minimum Spanning Tree (MST), a concept from graph theory. A key property of this representation is that each cluster of the expression data corresponds to one subtree of the MST, which rigorously converts a multi-dimensional clustering problem to a tree partitioning problem. We have demonstrated that though the inter-data relationship is greatly simplified in the MST representation, no essential information is lost for the purpose of clustering. Two key advantages in representing a set of multi-dimensional data as an MST are: (1) the simple structure of a tree facilitates efficient implementations of rigorous clustering algorithms, which otherwise are highly computationally challenging; and (2) as an MST-based clustering does not depend on the detailed geometric shape of a cluster, it can overcome many of the problems faced by classical clustering algorithms. Based on the MST representation, we have developed a number of rigorous and efficient clustering algorithms, including two with guaranteed global optimality. We have implemented these algorithms in a software package, EXpression data Clustering Analysis and VisualizATiOn Resource (EXCAVATOR). To demonstrate its effectiveness, we have tested it on three data sets, i.e. expression data from yeast Saccharomyces cerevisiae, expression data on the response of human fibroblasts to serum, and Arabidopsis expression data in response to chitin elicitation. The test results are highly encouraging. AVAILABILITY: EXCAVATOR is available on request from the authors.
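The idea that clusters correspond to subtrees of the MST can be illustrated with a short Python sketch. It assumes the simplest partitioning rule, cutting the k - 1 longest MST edges, which is equivalent to running Kruskal's algorithm and stopping once k components remain. This rule and the function name are assumptions for illustration; the abstract does not state which objective functions EXCAVATOR uses.

```python
import math
from itertools import combinations
from typing import Sequence


def mst_clusters(points: Sequence[Sequence[float]], k: int) -> list[int]:
    """Label each point with a cluster id in 0..k-1 (assumes 1 <= k <= len(points))."""
    n = len(points)
    parent = list(range(n))

    def find(i: int) -> int:
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Kruskal: scan all pairs by increasing Euclidean distance.
    edges = sorted(combinations(range(n), 2),
                   key=lambda e: math.dist(points[e[0]], points[e[1]]))
    components = n
    for i, j in edges:
        if components == k:  # the remaining MST edges are the k - 1 longest
            break
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            components -= 1

    ids: dict[int, int] = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(n)]
```

Because only tree edges are examined, the result does not depend on clusters being convex or spherical, which matches advantage (2) in the abstract.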

12.
In the present study, a novel structural motif that can be represented as a combination of the known βαβ-unit and ψ-motif is described and analyzed. In theory, there are four possible combinations of the motifs since each of them can exist in two forms, left-handed and right-handed. For this study, we have selected 140 nonhomologous proteins in which 158 combinations of such types have been found. The combination of the right-handed ψ-motif and the right-handed βαβ-unit has been shown to occur most often (87 cases out of 158) and the combination of the left-handed βαβ-unit and the left-handed ψ-motif does not occur at all. Three novel structural trees in which the commonly occurring combinations are taken as the root structures have been constructed.

13.
Similarity problems intensively investigated in computational molecular biology have the following two stringology models: find the longest string included in every string of a given finite language, and find the shortest string including every string of a given finite language. These two problems are exemplified by two well-known pairs of problems, the longest common subsequence (or substring) problem and the shortest common supersequence (or superstring) problem.

In this paper we consider opposite problems connected with string non-inclusion relations: find the shortest string included in no string of a given finite language and find the longest string including no string of a given finite language. The predicate “string α is not included in string β” is interpreted either as “α is not a subsequence of β” or as “α is not a substring of β”. The main purpose is to determine the complexity status of the non-similarity problems. Using graph approaches, we present NP-hardness proofs for the first interpretation and polynomial-time algorithms for the second one. Special cases of the problems and related issues are discussed.
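For the substring interpretation, one simple way to see why a polynomial-time algorithm is plausible is a counting argument, sketched below in Python. It is not the graph-based method of the paper. A language with total length N contains at most N distinct substrings of any fixed length m, so as soon as |alphabet|^m exceeds that count, some string of length m must be absent. The function name is hypothetical, and every string in the language is assumed to use only symbols from the given alphabet.

```python
from itertools import product


def shortest_absent_substring(language: list[str], alphabet: str) -> str:
    """Shortest string over alphabet that is a substring of no string in language."""
    m = 1
    while True:
        # All distinct substrings of length m that actually occur.
        present = {s[i:i + m] for s in language
                   for i in range(len(s) - m + 1)}
        if len(present) < len(alphabet) ** m:
            # Fewer substrings than candidates, so one candidate is missing.
            for cand in product(alphabet, repeat=m):
                word = "".join(cand)
                if word not in present:
                    return word
        m += 1
```

The loop stops at a length m of roughly log N to the base |alphabet|, so the number of candidates enumerated stays polynomial in N.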


14.
Stunning advances have been achieved in addressing the protein folding problem, providing deeper understanding of the mechanisms by which proteins navigate energy landscapes to reach their native states and enabling powerful algorithms to connect sequence to structure. However, the realities of the in vivo protein folding problem remain a challenge to reckon with. Here, we discuss the concept of the “proteome folding problem” (the problem of how organisms build and maintain a functional proteome) by admitting that folding energy landscapes are characterized by many misfolded states and that cells must deploy a network of chaperones and degradation enzymes to minimize deleterious impacts of these off-pathway species. The resulting proteostasis network is an inextricable part of in vivo protein folding and must be understood in detail if we are to solve the proteome folding problem. We discuss how the development of computational models for the proteostasis network’s actions and the relationship to the biophysical properties of the proteome has begun to offer new insights and capabilities.

15.
Single nucleotide polymorphism (SNP) is the most frequent form of DNA variation. The set of SNPs present in a chromosome (called the haplotype) is of interest in a wide area of applications in molecular biology and biomedicine, including diagnostics and medical therapy. In this paper we propose a new heuristic method for the problem of haplotype reconstruction for (portions of) a pair of homologous human chromosomes from a single individual (SIH). The problem is well known in the literature, and exact algorithms have been proposed for the case when no (or few) gaps are allowed in the input fragments. These algorithms, though exact and of polynomial complexity, are slow in practice. When gaps are considered, no exact method of polynomial complexity is known. The problem is also hard to approximate with guarantees. Therefore fast heuristics have been proposed. In this paper we describe SpeedHap, a new heuristic method that is able to tackle the case of many gapped fragments and retains its effectiveness even when the input fragments have a high rate of reading errors (up to 20%) and low coverage (as low as 3). We test SpeedHap on real data from the HapMap Project.

16.
The construction of a Spiking Neural Network (SNN), i.e. the choice of an appropriate topology and the configuration of its internal parameters, represents a great challenge for SNN based applications. Evolutionary Algorithms (EAs) offer an elegant solution for these challenges and methods capable of exploring both types of search spaces simultaneously appear to be the most promising ones. A variety of such heterogeneous optimization algorithms have emerged recently, in particular in the field of probabilistic optimization. In this paper, a literature review on heterogeneous optimization algorithms is presented and an example of probabilistic optimization of SNN is discussed in detail. The paper provides an experimental analysis of a novel Heterogeneous Multi-Model Estimation of Distribution Algorithm (hMM-EDA). First, practical guidelines for configuring the method are derived and then the performance of hMM-EDA is compared to state-of-the-art optimization algorithms. Results show hMM-EDA as a light-weight, fast and reliable optimization method that requires the configuration of only very few parameters. Its performance on a synthetic heterogeneous benchmark problem is highly competitive and suggests its suitability for the optimization of SNN.

17.
18.
Flexible transfer lines or mixed-model assembly lines are capable of diversified small-lot production due to negligible switch-over costs. With these lines, it is possible to implement just-in-time (JIT) production, which involves producing only the necessary parts in the necessary quantities at the necessary times. The problem of sequencing flexible transfer lines according to the JIT philosophy can be formulated as a nonlinear integer programming problem. Heuristic algorithms to solve the problem have appeared in the literature. In this paper, we show that the problem can be explicitly reduced to an assignment problem. Thus, we provide an efficient algorithm for an optimal solution to the JIT sequencing problem.

19.
A group test gives a positive (negative) outcome if it contains at least u (at most l) positive items, and an arbitrary outcome if the number of positive items is between the thresholds l and u. This problem, introduced by Damaschke, is called threshold group testing. It is a generalization of classical group testing. Chen and Fu extended this problem to the error-tolerant version and first proposed efficient nonadaptive algorithms. In this article, we extend threshold group testing to the k-inhibitors model, in which a test has a positive outcome if it contains at least u positives and at most k-1 inhibitors. Using a (d + k - l, u; 2e + 1]-disjunct matrix, we provide nonadaptive algorithms for the threshold group testing model with k-inhibitors and at most e erroneous outcomes. The decoding complexity is O(n^(u+k) log n) for fixed parameters (d, u, l, k, e).
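The outcome rule of the k-inhibitors model can be written directly as a Python predicate. The sketch below encodes only the one sentence of the abstract that defines a positive outcome (at least u positives and at most k - 1 inhibitors in the pool); it does not model erroneous outcomes or the decoding algorithm, and the function and parameter names are hypothetical.

```python
def inhibitor_threshold_test(pool: set[int], positives: set[int],
                             inhibitors: set[int], u: int, k: int) -> bool:
    """Outcome of one test: positive iff >= u positives and <= k - 1 inhibitors."""
    n_pos = len(pool & positives)
    n_inh = len(pool & inhibitors)
    return n_pos >= u and n_inh <= k - 1
```

With `positives = {1, 2, 3}`, `inhibitors = {7, 8}`, `u = 2` and `k = 2`, the pool `{1, 2, 7}` tests positive, while adding item 8 to the pool brings the inhibitor count to k and cancels the positive outcome.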

20.
Several segmentation methods of lesion uptake in 18F-FDG PET imaging have been proposed in the literature. Their principles are presented along with their clinical results. The main approach proposed in the literature is the thresholding method. The most commonly used is a constant threshold around 40% of the maximum uptake within the lesion. This simple approach is not valid for small lesions (< 4 or 5 mL), poorly contrasted positive tissue (SUV < 2) or moving lesions. To limit these problems, more complex thresholding algorithms have been proposed to define the optimal threshold value to be applied to segment the lesion. The principle is to adapt the threshold following a fitting model according to one or two characteristic image parameters. Those algorithms based on iterative approaches to find the optimal threshold value are preferred as they take into account patient data. The main drawback is the need for a calibration step depending on the PET device, the acquisition conditions and the algorithm used for image reconstruction. To avoid this problem, some more sophisticated segmentation methods have been proposed in the literature: derivative methods, watershed and pattern recognition algorithms. The delineation of positive tissue on FDG-PET images is a complex problem that is still under investigation.
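The constant-threshold approach mentioned above is simple enough to sketch in a few lines of Python. The example assumes the lesion region is given as a 2D grid of uptake values and marks every voxel at or above a fixed fraction of the maximum; the function name and the grid representation are illustrative assumptions, and real PET data would be 3D and calibrated.

```python
def segment_fixed_threshold(uptake: list[list[float]],
                            fraction: float = 0.40) -> list[list[bool]]:
    """Mask of voxels whose uptake is at least fraction * maximum uptake."""
    peak = max(max(row) for row in uptake)
    cutoff = fraction * peak
    return [[value >= cutoff for value in row] for row in uptake]
```

Because the cutoff depends only on the single peak value, a small or low-contrast lesion can be badly over- or under-segmented, which is the limitation that the adaptive thresholding methods described in the abstract try to address.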

