Similar articles
20 similar articles found (search time: 15 ms)
1.
High-throughput genomic measurements, interpreted as co-occurring data samples from multiple sources, open up a fresh problem for machine learning: what is common to the different data sets, that is, what kind of statistical dependencies are there between the paired samples from the different sets? We introduce a clustering algorithm for exploring these dependencies. Samples within each data set are grouped such that the dependencies between groups of different sets capture as much of the pairwise dependencies between the samples as possible. We formalize this problem in a novel probabilistic way, as optimization of a Bayes factor. The method is applied to reveal commonalities and exceptions in gene expression between organisms and to suggest regulatory interactions in the form of dependencies between gene expression profiles and regulator binding patterns.

2.
Optical mapping is a novel technique for determining the restriction sites on a DNA molecule by directly observing a number of partially digested copies of the molecule under a light microscope. The problem is complicated by uncertainty as to the orientation of the molecules and by erroneous detection of cuts. In this paper we study the problem of constructing a restriction map based on optical mapping data. We give several variants of a polynomial reconstruction algorithm, as well as an algorithm that is exponential in the number of cut sites and hence is appropriate only for a small number of cut sites. We give a simple probabilistic model for data generation and for the errors, and prove probabilistic upper and lower bounds on the number of molecules needed by each algorithm in order to obtain a correct map, expressed as a function of the number of cut sites and the error parameters. To the best of our knowledge, this is the first probabilistic analysis of algorithms for the problem. We also provide experimental results confirming that our algorithms are highly effective on simulated data.

3.
High-throughput data from various omics and sequencing techniques have made automated metabolic network reconstruction a highly relevant problem. Our approach reflects the inherent probabilistic nature of the steps involved in metabolic network reconstruction. Here, the goal is to arrive at networks which combine probabilistic information with the possibility of obtaining a small number of disconnected network constituents by reduction of a given preliminary probabilistic metabolic network. We define automated metabolic network reconstruction as an optimization problem on a four-partite graph (nodes representing genes, enzymes, reactions, and metabolites) which integrates: (1) probabilistic information obtained from the existing process for metabolic reconstruction from a given genome, (2) connectedness of the raw metabolic network, and (3) clustering of components in the reconstructed metabolic network. The practical implications of our theoretical analysis concern the quality of reconstructed metabolic networks and shed light on the problem of finding more efficient and effective methods for automated reconstruction. Our main contributions include a completeness result for the defined problem, a polynomial-time approximation algorithm, and an optimal polynomial-time algorithm for trees. Moreover, we exemplify our approach by the reconstruction of the sucrose biosynthesis pathway in Chlamydomonas reinhardtii.

4.
Besides the problem of finding effective methods for data analysis, there are additional problems in handling data of high uncertainty. Uncertainty problems often arise in the analysis of ecological data, e.g. in cluster analysis. Conventional clustering methods based on Boolean logic ignore the continuous nature of ecological variables and the uncertainty of ecological data, which can result in misclassification or misinterpretation of the data structure. Clusters with fuzzy boundaries better reflect the continuous character of ecological features. The problem is that the common clustering methods (such as the fuzzy c-means method) are designed only for crisp data; that is, they provide a fuzzy partition only for crisp data (e.g. exact measurement data). This paper presents the extension and implementation of the method of fuzzy clustering of fuzzy data proposed by Yang and Liu [Yang, M.-S. and Liu, H.-H., 1999. Fuzzy clustering procedures for conical fuzzy vector data. Fuzzy Sets and Systems, 106, 189-200]. Imprecise data can be defined as multidimensional fuzzy sets without sharply defined boundaries (in the form of so-called conical fuzzy vectors). They can then be used for fuzzy clustering together with crisp data. This can be particularly useful when no information is available about the variances that describe the accuracy of the data and probabilistic approaches are impossible. The method proposed by Yang and Liu has been extended and implemented for the Fuzzy Clustering System EcoFucs developed at the University of Kiel. As an example, the paper presents the fuzzy cluster analysis of chemicals according to their ecotoxicological properties. The uncertainty and imprecision of ecotoxicological data are very high because of the use of various data sources and various investigation tests, and the difficulty of comparing these data. The implemented method can be very helpful in searching for an adequate partition of ecological data into clusters with similar properties.
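For orientation, the standard fuzzy c-means membership update for crisp data, which the abstract above contrasts with fuzzy input data, can be sketched as follows. This is a minimal illustration and not the Yang and Liu procedure or the EcoFucs implementation; the function name `fcm_memberships`, the one-dimensional data and the fuzzifier default `m=2.0` are assumptions made for the example.

```python
def fcm_memberships(points, centers, m=2.0):
    """Return fuzzy c-means memberships u[i][k] of point i in cluster k.

    Illustrative sketch for crisp 1-D data with fixed cluster centers;
    m > 1 is the fuzzifier (assumed default 2.0).
    """
    exponent = 2.0 / (m - 1.0)
    memberships = []
    for x in points:
        dists = [abs(x - c) for c in centers]
        if 0.0 in dists:
            # The point coincides with one or more centers:
            # share the full membership equally among those centers.
            hits = [d == 0.0 for d in dists]
            n_hits = sum(hits)
            memberships.append([1.0 / n_hits if h else 0.0 for h in hits])
            continue
        # u_ik = 1 / sum_j (d_ik / d_ij) ** (2 / (m - 1))
        row = [1.0 / sum((d_k / d_j) ** exponent for d_j in dists)
               for d_k in dists]
        memberships.append(row)
    return memberships


if __name__ == "__main__":
    # A point midway between two centers gets membership 0.5 in each.
    print(fcm_memberships([0.5, 1.0, 2.0], [0.0, 2.0]))
```

Each row of the result sums to 1, so a point near a cluster boundary is assigned partial membership in several clusters rather than being forced into one.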

5.
Biclustering microarray data by Gibbs sampling   Citations: 1 (self-citations: 0; citations by others: 1)
MOTIVATION: Gibbs sampling has become a method of choice for the discovery of noisy patterns, known as motifs, in DNA and protein sequences. Because handling noise in microarray data presents similar challenges, we have adapted this strategy to the biclustering of discretized microarray data. RESULTS: In contrast with standard clustering that reveals genes that behave similarly over all the conditions, biclustering groups genes over only a subset of conditions for which those genes have a sharp probability distribution. We have opted for a simple probabilistic model of the biclusters because it has the key advantage of providing a transparent probabilistic interpretation of the biclusters in the form of an easily interpretable fingerprint. Furthermore, Gibbs sampling does not suffer from the problem of local minima that often characterizes Expectation-Maximization. We demonstrate the effectiveness of our approach on two synthetic data sets as well as a data set from leukemia patients.

6.
Models of gene regulatory networks (GRNs) attempt to explain the complex processes that determine cells' behavior, such as differentiation, metabolism, and the cell cycle. The advent of high-throughput data generation technologies has allowed researchers to fit theoretical models to experimental data on gene-expression profiles. GRNs are often represented using logical models. These models require that real-valued measurements be converted to discrete levels, such as on/off, but the discretization often introduces inconsistencies into the data. Dimitrova et al. posed the problem of efficiently finding a parsimonious resolution of the introduced inconsistencies. We show that reconstruction of a logical GRN that minimizes the errors is NP-complete, so that an efficient exact algorithm for the problem is not likely to exist. We present a probabilistic formulation of the problem that circumvents discretization of expression data. We phrase the problem of error reduction as a minimum entropy problem, develop a heuristic algorithm for it, and evaluate its performance on mouse embryonic stem cell data. The constructed model displays high consistency with prior biological knowledge. Despite the oversimplification of a discrete model, we show that it is superior to raw experimental measurements and demonstrates a highly significant level of identical regulatory logic among co-regulated genes. Software implementing the method is freely available at: http://acgt.cs.tau.ac.il/modent.

7.
Clustering of multivariate data is a commonly used technique in ecology, and many approaches to clustering are available. The results from a clustering algorithm are uncertain, but few clustering approaches explicitly acknowledge this uncertainty. One exception is Bayesian mixture modelling, which treats all results probabilistically, and allows comparison of multiple plausible classifications of the same data set. We used this method, implemented in the AutoClass program, to classify catchments (watersheds) in the Murray Darling Basin (MDB), Australia, based on their physiographic characteristics (e.g. slope, rainfall, lithology). The most likely classification found nine classes of catchments. Members of each class were aggregated geographically within the MDB. Rainfall and slope were the two most important variables that defined classes. The second-most likely classification was very similar to the first, but had one fewer class. Increasing the nominal uncertainty of continuous data resulted in a most likely classification with five classes, which were again aggregated geographically. Membership probabilities suggested that a small number of cases could be members of either of two classes. Such cases were located on the edges of groups of catchments that belonged to one class, with a group belonging to the second-most likely class adjacent. A comparison of the Bayesian approach to a distance-based deterministic method showed that the Bayesian mixture model produced solutions that were more spatially cohesive and intuitively appealing. The probabilistic presentation of results from the Bayesian classification allows richer interpretation, including decisions on how to treat cases that are intermediate between two or more classes, and whether to consider more than one classification. The explicit consideration and presentation of uncertainty makes this approach useful for ecological investigations, where both data and expectations are often highly uncertain.  
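The membership probabilities mentioned above can be illustrated with a small sketch. Under an assumed one-dimensional Gaussian mixture with known parameters, the posterior probability that a case belongs to each class follows from Bayes' rule. This is not the AutoClass program or its model; the function names and the toy parameters are hypothetical.

```python
import math


def gaussian_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and sd."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def membership_probabilities(x, weights, means, sds):
    """Posterior class probabilities for observation x.

    Assumes a 1-D Gaussian mixture with known mixing weights, means and
    standard deviations (illustrative only; extreme x values that make
    every density underflow to zero are not handled).
    """
    joint = [w * gaussian_pdf(x, mu, sd)
             for w, mu, sd in zip(weights, means, sds)]
    total = sum(joint)
    return [j / total for j in joint]


if __name__ == "__main__":
    # A case midway between two equally weighted classes is ambiguous.
    print(membership_probabilities(1.0, [0.5, 0.5], [0.0, 2.0], [1.0, 1.0]))
```

A case whose largest membership probability is well below 1 is the kind of intermediate case that the abstract describes as lying on the edge between two classes.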

8.
An improved algorithm for clustering gene expression data   Citations: 1 (self-citations: 0; citations by others: 1)
MOTIVATION: Recent advancements in microarray technology allow simultaneous monitoring of the expression levels of a large number of genes over different time points. Clustering is an important tool for analyzing such microarray data, typical properties of which are inherent uncertainty, noise and imprecision. In this article, a two-stage clustering algorithm, which employs a recently proposed variable string length genetic scheme and a multiobjective genetic clustering algorithm, is proposed. It is based on the novel concept of points having significant membership to multiple classes. An iterated version of the well-known Fuzzy C-Means is also utilized for clustering. RESULTS: The significant superiority of the proposed two-stage clustering algorithm over the average linkage method, Self Organizing Map (SOM) and a recently developed weighted Chinese restaurant-based clustering method (CRC), all widely used methods for clustering gene expression data, is established on a variety of artificial and publicly available real-life data sets. The biological relevance of the clustering solutions is also analyzed.

9.
Mixture modelling of gene expression data from microarray experiments   Citations: 5 (self-citations: 0; citations by others: 5)
MOTIVATION: Hierarchical clustering is one of the major analytical tools for gene expression data from microarray experiments. A major problem in the interpretation of the output from these procedures is assessing the reliability of the clustering results. We address this issue by developing a mixture model-based approach for the analysis of microarray data. Within this framework, we present novel algorithms for clustering genes and samples. One of the byproducts of our method is a probabilistic measure for the number of true clusters in the data. RESULTS: The proposed methods are illustrated by application to microarray datasets from two cancer studies; one in which malignant melanoma is profiled (Bittner et al., Nature, 406, 536-540, 2000), and the other in which prostate cancer is profiled (Dhanasekaran et al., 2001, submitted).

10.
MOTIVATION: A promising and reliable approach to annotating gene function is to cluster genes using not only gene expression data but also literature information, especially gene networks. RESULTS: We present a systematic method for gene clustering that combines these two very different types of data, focusing in particular on network modularity, a global feature of gene networks. Our method is based on learning a probabilistic model, which we call a hidden modular random field, in which the relation between hidden variables directly represents a given gene network. Our learning algorithm, which minimizes an energy function that takes the network modularity into account, is time-efficient in practice despite using a global network property. We evaluated our method using a metabolic network and microarray expression data, varying the microarray datasets, the parameters of our model and the gold-standard clusters. Experimental results showed that our method outperformed four other competing methods, including k-means and existing graph partitioning methods, with statistically significant differences in all cases. Further detailed analysis showed that our method could group a set of genes into a cluster that corresponds to the folate metabolic pathway, while other methods could not. From these results, we can say that our method is highly effective for gene clustering and for annotating gene function.

11.
MOTIVATION: This paper introduces the application of a novel clustering method to microarray expression data. Its first stage involves compression of dimensions that can be achieved by applying SVD to the gene-sample matrix in microarray problems. Thus the data (samples or genes) can be represented by vectors in a truncated space of low dimensionality, 4 and 5 in the examples studied here. We find it preferable to project all vectors onto the unit sphere before applying a clustering algorithm. The clustering algorithm used here is the quantum clustering method that has one free scale parameter. Although the method is not hierarchical, it can be modified to allow hierarchy in terms of this scale parameter. RESULTS: We apply our method to three data sets. The results are very promising. On cancer cell data we obtain a dendrogram that reflects correct groupings of cells. In an AML/ALL data set we obtain very good clustering of samples into four classes of the data. Finally, in clustering of genes in yeast cell cycle data we obtain four groups in a problem that is estimated to contain five families. AVAILABILITY: Software is available as Matlab programs at http://neuron.tau.ac.il/~horn/QC.htm.
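The projection step described above, applied after the dimensionality reduction, amounts to rescaling each vector to unit Euclidean length. A minimal sketch follows; it assumes the vectors have already been reduced to a low-dimensional space (the SVD truncation itself is not shown), and the function name `project_to_unit_sphere` is hypothetical rather than taken from the authors' Matlab software.

```python
import math


def project_to_unit_sphere(vectors):
    """Rescale each vector to Euclidean length 1.

    Illustrative sketch: `vectors` is a list of equal-length lists of
    floats, e.g. samples already expressed in a truncated SVD space.
    """
    projected = []
    for v in vectors:
        norm = math.sqrt(sum(c * c for c in v))
        if norm == 0.0:
            # A zero vector has no direction, so it cannot be projected.
            raise ValueError("cannot normalize a zero vector")
        projected.append([c / norm for c in v])
    return projected


if __name__ == "__main__":
    print(project_to_unit_sphere([[3.0, 4.0], [0.0, 2.0]]))
```

After this step only the direction of each vector matters, so samples that differ mainly in overall magnitude end up close together before clustering.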

12.
13.
Ensemble clustering methods have become increasingly important for easing the task of choosing the most appropriate clustering algorithm for a particular data analysis problem. The consensus clustering (CC) algorithm is a recognized ensemble clustering method that uses an artificial intelligence technique to optimize a fitness function. We formally prove the existence of a subspace of the search space for CC that contains all solutions of maximal fitness, and suggest two greedy algorithms to search this subspace. We evaluate the algorithms on two gene expression data sets and one synthetic data set, and compare the results with those of other ensemble clustering approaches.

14.
In this paper, we present a novel approach of implementing a combination methodology to find an appropriate neural network architecture and weights using an evolutionary least square based algorithm (GALS). This paper focuses on aspects such as the heuristics of updating weights using an evolutionary least square based algorithm, finding the number of hidden neurons for a two-layer feedforward neural network, the stopping criterion for the algorithm, and finally some comparisons of the results with other existing methods for searching for an optimal or near-optimal solution in the multidimensional complex search space comprising the architecture and the weight variables. We explain how the weight updating algorithm using the evolutionary least square based approach can be combined with the growing architecture model to find the optimum number of hidden neurons. We also discuss the issues of finding a probabilistic solution space as a starting point for the least square method and address the problems involving fitness breaking. We apply the proposed approach to the XOR problem, the 10-bit odd parity problem and many real-world benchmark data sets, such as the handwriting data set from CEDAR and the breast cancer and heart disease data sets from the UCI ML repository. The comparative results based on classification accuracy and time complexity are discussed.

15.
MOTIVATION: Reliable identification of protein families is key to phylogenetic analysis, functional annotation and the exploration of protein function diversity in a given phylogenetic branch. As more and more complete genomes are sequenced, there is a need for powerful and reliable algorithms that facilitate the construction of protein families. RESULTS: We have formulated the problem of protein family construction as an instance of consensus clustering, for which we designed a novel algorithm that is computationally efficient in practice and produces high quality results. Our algorithm uses an election method to construct consensus families from competing clustering computations. Our consensus clustering algorithm is tailored to serve the specific needs of comparative genomics projects. First, it provides a robust means to incorporate results from different and complementary clustering methods, thus avoiding the need for an a priori choice that may introduce computational bias in the results. Second, it is suited to large-scale projects due to its practical efficiency. Third, it produces high quality results where families tend to represent groupings by biological function. AVAILABILITY: This method has been used for the Génolevures project to compute protein families of Hemiascomycetous yeasts. The data are available online at http://cbi.labri.fr/Genolevures/fam/

16.
17.
18.
MOTIVATION: Grouping genes that have similar expression patterns is called gene clustering, which has proved to be a useful tool for extracting underlying biological information from gene expression data. Many clustering procedures have shown success in microarray gene clustering; most of them belong to the family of heuristic clustering algorithms. Model-based algorithms are alternative clustering algorithms, which are based on the assumption that the whole set of microarray data is a finite mixture of a certain type of distribution with different parameters. Application of model-based algorithms to unsupervised clustering has been reported. Here, for the first time, we demonstrate the use of the model-based algorithm in supervised clustering of microarray data. RESULTS: We applied the proposed methods to real gene expression data and simulated data. We showed that the supervised model-based algorithm is superior to the unsupervised method and the support vector machine (SVM) method. AVAILABILITY: The program written in the SAS language implementing methods I-III in this report is available upon request. The SVM software is available at the website http://svm.sdsc.edu/cgi-bin/nph-SVMsubmit.cgi

19.
Clustering, particularly hierarchical clustering, is an important method for understanding and analysing data across a wide variety of knowledge domains, with notable utility in systems where the data can be classified in an evolutionary context. This paper introduces a new hierarchical clustering problem defined by a novel objective function we call the arithmetic-harmonic cut. We show that the problem of finding such a cut is NP-hard and APX-hard but is fixed-parameter tractable, which indicates that although the problem is unlikely to have a polynomial time algorithm (even for approximation), exact parameterized and local search based techniques may produce workable algorithms. To this end, we implement a memetic algorithm for the problem and demonstrate the effectiveness of the arithmetic-harmonic cut on a number of datasets, including a cancer type dataset and a coronavirus dataset. We show favorable performance compared to currently used hierarchical clustering techniques such as k-Means, Graclus and Normalized-Cut. The arithmetic-harmonic cut metric overcomes difficulties that other hierarchical methods have in representing both intercluster differences and intracluster similarities.

20.
Guo Y. Biometrics, 2011, 67(4): 1532-1542
Independent component analysis (ICA) has become an important tool for analyzing data from functional magnetic resonance imaging (fMRI) studies. ICA has been successfully applied to single-subject fMRI data. The extension of ICA to group inferences in neuroimaging studies, however, is challenging due to the unavailability of a prespecified group design matrix and the uncertainty in between-subjects variability in fMRI data. We present a general probabilistic ICA (PICA) model that can accommodate varying group structures of multisubject spatiotemporal processes. An advantage of the proposed model is that it can flexibly model various types of group structures in different underlying neural source signals and under different experimental conditions in fMRI studies. A maximum likelihood (ML) method is used for estimating this general group ICA model. We propose two expectation-maximization (EM) algorithms to obtain the ML estimates. The first method is an exact EM algorithm, which provides an exact E-step and an explicit noniterative M-step. The second method is a variational approximation EM algorithm, which is computationally more efficient than the exact EM. In simulation studies, we first compare the performance of the proposed general group PICA model and the existing probabilistic group ICA approach. We then compare the two proposed EM algorithms and show the variational approximation EM achieves comparable accuracy to the exact EM with significantly less computation time. An fMRI data example is used to illustrate application of the proposed methods.
