首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 23 毫秒
1.

Backgrounds

Recent explosion of biological data brings a great challenge for the traditional clustering algorithms. With increasing scale of data sets, much larger memory and longer runtime are required for the cluster identification problems. The affinity propagation algorithm outperforms many other classical clustering algorithms and is widely applied into the biological researches. However, the time and space complexity become a great bottleneck when handling the large-scale data sets. Moreover, the similarity matrix, whose constructing procedure takes long runtime, is required before running the affinity propagation algorithm, since the algorithm clusters data sets based on the similarities between data pairs.

Methods

Two types of parallel architectures are proposed in this paper to accelerate the similarity matrix constructing procedure and the affinity propagation algorithm. The memory-shared architecture is used to construct the similarity matrix, and the distributed system is taken for the affinity propagation algorithm, because of its large memory size and great computing capacity. An appropriate way of data partition and reduction is designed in our method, in order to minimize the global communication cost among processes.

Result

A speedup of 100 is gained with 128 cores. The runtime is reduced from serval hours to a few seconds, which indicates that parallel algorithm is capable of handling large-scale data sets effectively. The parallel affinity propagation also achieves a good performance when clustering large-scale gene data (microarray) and detecting families in large protein superfamilies.  相似文献   

2.
This paper presents an attribute clustering method which is able to group genes based on their interdependence so as to mine meaningful patterns from the gene expression data. It can be used for gene grouping, selection, and classification. The partitioning of a relational table into attribute subgroups allows a small number of attributes within or across the groups to be selected for analysis. By clustering attributes, the search dimension of a data mining algorithm is reduced. The reduction of search dimension is especially important to data mining in gene expression data because such data typically consist of a huge number of genes (attributes) and a small number of gene expression profiles (tuples). Most data mining algorithms are typically developed and optimized to scale to the number of tuples instead of the number of attributes. The situation becomes even worse when the number of attributes overwhelms the number of tuples, in which case, the likelihood of reporting patterns that are actually irrelevant due to chances becomes rather high. It is for the aforementioned reasons that gene grouping and selection are important preprocessing steps for many data mining algorithms to be effective when applied to gene expression data. This paper defines the problem of attribute clustering and introduces a methodology to solving it. Our proposed method groups interdependent attributes into clusters by optimizing a criterion function derived from an information measure that reflects the interdependence between attributes. By applying our algorithm to gene expression data, meaningful clusters of genes are discovered. The grouping of genes based on attribute interdependence within group helps to capture different aspects of gene association patterns in each group. Significant genes selected from each group then contain useful information for gene expression classification and identification. To evaluate the performance of the proposed approach, we applied it to two well-known gene expression data sets and compared our results with those obtained by other methods. Our experiments show that the proposed method is able to find the meaningful clusters of genes. By selecting a subset of genes which have high multiple-interdependence with others within clusters, significant classification information can be obtained. Thus, a small pool of selected genes can be used to build classifiers with very high classification rate. From the pool, gene expressions of different categories can be identified.  相似文献   

3.
Missing value estimation methods for DNA microarrays   总被引:39,自引:0,他引:39  
MOTIVATION: Gene expression microarray experiments can generate data sets with multiple missing expression values. Unfortunately, many algorithms for gene expression analysis require a complete matrix of gene array values as input. For example, methods such as hierarchical clustering and K-means clustering are not robust to missing data, and may lose effectiveness even with a few missing values. Methods for imputing missing data are needed, therefore, to minimize the effect of incomplete data sets on analyses, and to increase the range of data sets to which these algorithms can be applied. In this report, we investigate automated methods for estimating missing data. RESULTS: We present a comparative study of several methods for the estimation of missing values in gene microarray data. We implemented and evaluated three methods: a Singular Value Decomposition (SVD) based method (SVDimpute), weighted K-nearest neighbors (KNNimpute), and row average. We evaluated the methods using a variety of parameter settings and over different real data sets, and assessed the robustness of the imputation methods to the amount of missing data over the range of 1--20% missing values. We show that KNNimpute appears to provide a more robust and sensitive method for missing value estimation than SVDimpute, and both SVDimpute and KNNimpute surpass the commonly used row average method (as well as filling missing values with zeros). We report results of the comparative experiments and provide recommendations and tools for accurate estimation of missing microarray data under a variety of conditions.  相似文献   

4.
This paper surveys the computational strategies followed to parallelise the most used software in the bioinformatics arena. The studied algorithms are computationally expensive and their computational patterns range from regular, such as database-searching applications, to very irregularly structured patterns (phylogenetic trees). Fine- and coarse-grained parallel strategies are discussed for these very diverse sets of applications. This overview outlines computational issues related to parallelism, physical machine models, parallel programming approaches and scheduling strategies for a broad range of computer architectures. In particular, it deals with shared, distributed and shared/distributed memory architectures.  相似文献   

5.
Due to the recent progress of the DNA microarray technology, a large number of gene expression profile data are being produced. How to analyze gene expression data is an important topic in computational molecular biology. Several studies have been done using the Boolean network as a model of a genetic network. This paper proposes efficient algorithms for identifying Boolean networks of bounded indegree and related biological networks, where identification of a Boolean network can be formalized as a problem of identifying many Boolean functions simultaneously. For the identification of a Boolean network, an O(mnD+1) time naive algorithm and a simple O (mnD) time algorithm are known, where n denotes the number of nodes, m denotes the number of examples, and D denotes the maximum in degree. This paper presents an improved O(momega-2nD + mnD+omega-3) time Monte-Carlo type randomized algorithm, where omega is the exponent of matrix multiplication (currently, omega < 2.376). The algorithm is obtained by combining fast matrix multiplication with the randomized fingerprint function for string matching. Although the algorithm and its analysis are simple, the result is nontrivial and the technique can be applied to several related problems.  相似文献   

6.
Cheng and Church algorithm is an important approach in biclustering algorithms. In this paper, the process of the extended space in the second stage of Cheng and Church algorithm is improved and the selections of two important parameters are discussed. The results of the improved algorithm used in the gene expression spectrum analysis show that, compared with Cheng and Church algorithm, the quality of clustering results is enhanced obviously, the mining expression models are better, and the data possess a strong consistency with fluctuation on the condition while the computational time does not increase significantly.  相似文献   

7.
8.
mRNA molecules are folded in the cells and therefore many of their substrings may actually be inaccessible to protein and microRNA binding. The need to apply an accessibility criterion to the task of genome-wide mRNA motif discovery raises the challenge of overcoming the core O(n(3)) factor imposed by the time complexity of the currently best known algorithms for RNA secondary structure prediction. We speed up the dynamic programming algorithms that are standard for RNA folding prediction. Our new approach significantly reduces the computations without sacrificing the optimality of the results, yielding an expected time complexity of O(n(2) psi(n)), where psi(n) is shown to be constant on average under standard polymer folding models. A benchmark analysis confirms that in practice the runtime ratio between the previous approach and the new algorithm indeed grows linearly with increasing sequence size. The fast new RNA folding algorithm is utilized for genome-wide discovery of accessible cis-regulatory motifs in data sets of ribosomal densities and decay rates of S. cerevisiae genes and to the mining of exposed binding sites of tissue-specific microRNAs in A. thaliana.  相似文献   

9.
MOTIVATION: Bayesian estimation of phylogeny is based on the posterior probability distribution of trees. Currently, the only numerical method that can effectively approximate posterior probabilities of trees is Markov chain Monte Carlo (MCMC). Standard implementations of MCMC can be prone to entrapment in local optima. Metropolis coupled MCMC [(MC)(3)], a variant of MCMC, allows multiple peaks in the landscape of trees to be more readily explored, but at the cost of increased execution time. RESULTS: This paper presents a parallel algorithm for (MC)(3). The proposed parallel algorithm retains the ability to explore multiple peaks in the posterior distribution of trees while maintaining a fast execution time. The algorithm has been implemented using two popular parallel programming models: message passing and shared memory. Performance results indicate nearly linear speed improvement in both programming models for small and large data sets.  相似文献   

10.
MOTIVATION: The protein side-chain conformation problem is a central problem in proteomics with wide applications in protein structure prediction and design. Computational complexity results show that the problem is hard to solve. Yet, instances from realistic applications are large and demand fast and reliable algorithms. RESULTS: We propose a new global optimization algorithm, which for the first time integrates residue reduction and rotamer reduction techniques previously developed for the protein side-chain conformation problem. We show that the proposed approach simplifies dramatically the topology of the underlining residue graph. Computations show that our algorithm solves problems using only 1-10% of the time required by the mixed-integer linear programming approach available in the literature. In addition, on a set of hard side-chain conformation problems, our algorithm runs 2-78 times faster than SCWRL 3.0, which is widely used for solving these problems. AVAILABILITY: The implementation is available as an online server at http://eudoxus.scs.uiuc.edu/r3.html  相似文献   

11.

Background

Lung cancer is an important and common cancer that constitutes a major public health problem, but early detection of small cell lung cancer can significantly improve the survival rate of cancer patients. A number of serum biomarkers have been used in the diagnosis of lung cancers; however, they exhibit low sensitivity and specificity.

Methods

We used biochemical methods to measure blood levels of lactate dehydrogenase (LDH), C-reactive protein (CRP), Na+, Cl-, carcino-embryonic antigen (CEA), and neuron specific enolase (NSE) in 145 small cell lung cancer (SCLC) patients and 155 non-small cell lung cancer and 155 normal controls. A gene expression programming (GEP) model and Receiver Operating Characteristic (ROC) curves incorporating these biomarkers was developed for the auxiliary diagnosis of SCLC.

Results

After appropriate modification of the parameters, the GEP model was initially set up based on a training set of 115 SCLC patients and 125 normal controls for GEP model generation. Then the GEP was applied to the remaining 60 subjects (the test set) for model validation. GEP successfully discriminated 281 out of 300 cases, showing a correct classification rate for lung cancer patients of 93.75% (225/240) and 93.33% (56/60) for the training and test sets, respectively. Another GEP model incorporating four biomarkers, including CEA, NSE, LDH, and CRP, exhibited slightly lower detection sensitivity than the GEP model, including six biomarkers. We repeat the models on artificial neural network (ANN), and our results showed that the accuracy of GEP models were higher than that in ANN. GEP model incorporating six serum biomarkers performed by NSCLC patients and normal controls showed low accuracy than SCLC patients and was enough to prove that the GEP model is suitable for the SCLC patients.

Conclusion

We have developed a GEP model with high sensitivity and specificity for the auxiliary diagnosis of SCLC. This GEP model has the potential for the wide use for detection of SCLC in less developed regions.  相似文献   

12.
The reverse engineering of gene regulatory networks using gene expression profile data has become crucial to gain novel biological knowledge. Large amounts of data that need to be analyzed are currently being produced due to advances in microarray technologies. Using current reverse engineering algorithms to analyze large data sets can be very computational-intensive. These emerging computational requirements can be met using parallel computing techniques. It has been shown that the Network Identification by multiple Regression (NIR) algorithm performs better than the other ready-to-use reverse engineering software. However it cannot be used with large networks with thousands of nodes - as is the case in biological networks - due to the high time and space complexity. In this work we overcome this limitation by designing and developing a parallel version of the NIR algorithm. The new implementation of the algorithm reaches a very good accuracy even for large gene networks, improving our understanding of the gene regulatory networks that is crucial for a wide range of biomedical applications.  相似文献   

13.
Biclustering extends the traditional clustering techniques by attempting to find (all) subgroups of genes with similar expression patterns under to-be-identified subsets of experimental conditions when applied to gene expression data. Still the real power of this clustering strategy is yet to be fully realized due to the lack of effective and efficient algorithms for reliably solving the general biclustering problem. We report a QUalitative BIClustering algorithm (QUBIC) that can solve the biclustering problem in a more general form, compared to existing algorithms, through employing a combination of qualitative (or semi-quantitative) measures of gene expression data and a combinatorial optimization technique. One key unique feature of the QUBIC algorithm is that it can identify all statistically significant biclusters including biclusters with the so-called ‘scaling patterns’, a problem considered to be rather challenging; another key unique feature is that the algorithm solves such general biclustering problems very efficiently, capable of solving biclustering problems with tens of thousands of genes under up to thousands of conditions in a few minutes of the CPU time on a desktop computer. We have demonstrated a considerably improved biclustering performance by our algorithm compared to the existing algorithms on various benchmark sets and data sets of our own. QUBIC was written in ANSI C and tested using GCC (version 4.1.2) on Linux. Its source code is available at: http://csbl.bmb.uga.edu/∼maqin/bicluster. A server version of QUBIC is also available upon request.  相似文献   

14.
The rapid increase in collateral omics and phenotypic data has enabled data-driven studies for the fast discovery of cancer targets and biomarkers. Thus, it is necessary to develop convenient tools for general oncologists and cancer scientists to carry out customized data mining without computational expertise. For this purpose, we developed innovative software that enables user-driven analyses assisted by knowledge-based smart systems. Publicly available data on mutations, gene expression, patient survival, immune score, drug screening and RNAi screening were integrated from the TCGA, GDSC, CCLE, NCI, and DepMap databases. The optimal selection of samples and other filtering options were guided by the smart function of the software for data mining and visualization on Kaplan-Meier plots, box plots and scatter plots of publication quality. We implemented unique algorithms for both data mining and visualization, thus simplifying and accelerating user-driven discovery activities on large multiomics datasets. The present Q-omics software program (v0.95) is available at http://qomics.sookmyung.ac.kr.  相似文献   

15.
16.
Microarray technology facilitates the monitoring of the expression levels of thousands of genes over different experimental conditions simultaneously. Clustering is a popular data mining tool which can be applied to microarray gene expression data to identify co-expressed genes. Most of the traditional clustering methods optimize a single clustering goodness criterion and thus may not be capable of performing well on all kinds of datasets. Motivated by this, in this article, a multiobjective clustering technique that optimizes cluster compactness and separation simultaneously, has been improved through a novel support vector machine classification based cluster ensemble method. The superiority of MOCSVMEN (MultiObjective Clustering with Support Vector Machine based ENsemble) has been established by comparing its performance with that of several well known existing microarray data clustering algorithms. Two real-life benchmark gene expression datasets have been used for testing the comparative performances of different algorithms. A recently developed metric, called Biological Homogeneity Index (BHI), which computes the clustering goodness with respect to functional annotation, has been used for the comparison purpose.  相似文献   

17.
贝叶斯聚类在基因表达谱知识挖掘中的应用   总被引:1,自引:0,他引:1  
在大规模基因表达谱的数据分析中引入了一种全新的基于贝叶斯模型的聚类算法,从生物学背景出发,研究了该算法应用在大规模基因表达谱中的理论基础和算法优越性,并应用该算法对两个公共的基因表达数据集进行了知识再挖掘。结果表明,与其他聚类算法相比,该算法在知识发现方面具有显著的优越性。挖掘出的生物学知识对该领域研究人员的实验设计也有一定的启发性。  相似文献   

18.
MOTIVATION: Arrays allow measurements of the expression levels of thousands of mRNAs to be made simultaneously. The resulting data sets are information rich but require extensive mining to enhance their usefulness. Information theoretic methods are capable of assessing similarities and dissimilarities between data distributions and may be suited to the analysis of gene expression experiments. The purpose of this study was to investigate information theoretic data mining approaches to discover temporal patterns of gene expression from array-derived gene expression data. RESULTS: The Kullback-Leibler divergence, an information-theoretic distance that measures the relative dissimilarity between two data distribution profiles, was used in conjunction with an unsupervised self-organizing map algorithm. Two published, array-derived gene expression data sets were analyzed. The patterns obtained with the KL clustering method were found to be superior to those obtained with the hierarchical clustering algorithm using the Pearson correlation distance measure. The biological significance of the results was also examined. AVAILABILITY: Software code is available by request from the authors. All programs were written in ANSI C and Matlab (Mathworks Inc., Natick, MA).  相似文献   

19.
Summary This paper proposes a modified radial basis function classification algorithm for non-linear cancer classification. In the algorithm, a modified simulated annealing method is developed and combined with the linear least square and gradient paradigms to optimize the structure of the radial basis function (RBF) classifier. The proposed algorithm can be adopted to perform non-linear cancer classification based on gene expression profiles and applied to two microarray data sets involving various human tumor classes: (1) Normal versus colon tumor; (2) acute myeloid leukemia (AML) versus acute lymphoblastic leukemia (ALL). Finally, accuracy and stability for the proposed algorithm are further demonstrated by comparing with the other cancer classification algorithms.  相似文献   

20.
Ensemble clustering methods have become increasingly important to ease the task of choosing the most appropriate cluster algorithm for a particular data analysis problem. The consensus clustering (CC) algorithm is a recognized ensemble clustering method that uses an artificial intelligence technique to optimize a fitness function. We formally prove the existence of a subspace of the search space for CC, which contains all solutions of maximal fitness and suggests two greedy algorithms to search this subspace. We evaluate the algorithms on two gene expression data sets and one synthetic data set, and compare the result with the results of other ensemble clustering approaches.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号