Similar articles
1.
MOTIVATION: Microarrays have become a central tool in biological research. Their applications range from functional annotation to tissue classification and genetic network inference. A key step in the analysis of gene expression data is the identification of groups of genes that manifest similar expression patterns. This translates to the algorithmic problem of clustering genes based on their expression patterns. RESULTS: We present a novel clustering algorithm, called CLICK, and its applications to gene expression analysis. The algorithm utilizes graph-theoretic and statistical techniques to identify tight groups (kernels) of highly similar elements, which are likely to belong to the same true cluster. Several heuristic procedures are then used to expand the kernels into the full clusters. We report on the application of CLICK to a variety of gene expression data sets. In all those applications it outperformed extant algorithms according to several common figures of merit. We also point out that CLICK can be successfully used for the identification of common regulatory motifs in the upstream regions of co-regulated genes. Furthermore, we demonstrate how CLICK can be used to accurately classify tissue samples into disease types, based on their expression profiles. Finally, we present a new Java-based graphical tool, called EXPANDER, for gene expression analysis and visualization, which incorporates CLICK and several other popular clustering algorithms. AVAILABILITY: http://www.cs.tau.ac.il/~rshamir/expander/expander.html

2.
Tissue classification with gene expression profiles.   Total citations: 29; self-citations: 0; citations by others: 29
Constantly improving gene expression profiling technologies are expected to provide understanding and insight into cancer-related cellular processes. Gene expression data is also expected to significantly aid in the development of efficient cancer diagnosis and classification platforms. In this work we examine three sets of gene expression data measured across sets of tumor and normal clinical samples. The first set consists of 2,000 genes, measured in 62 epithelial colon samples (Alon et al., 1999). The second consists of approximately 100,000 clones, measured in 32 ovarian samples (unpublished extension of the data set described in Schummer et al. (1999)). The third set consists of approximately 7,100 genes, measured in 72 bone marrow and peripheral blood samples (Golub et al., 1999). We examine the use of scoring methods, measuring separation of tissue type (e.g., tumors from normals) using individual gene expression levels. These are then coupled with high-dimensional classification methods to assess the classification power of complete expression profiles. We present results of performing leave-one-out cross validation (LOOCV) experiments on the three data sets, employing a nearest neighbor classifier, SVM (Cortes and Vapnik, 1995), AdaBoost (Freund and Schapire, 1997) and a novel clustering-based classification technique. As tumor samples can differ from normal samples in their cell-type composition, we also perform LOOCV experiments using appropriately modified sets of genes, attempting to eliminate the resulting bias. We demonstrate a success rate of at least 90% in tumor versus normal classification, using sets of selected genes, with, as well as without, cellular-contamination-related members. These results are insensitive to the exact selection mechanism, over a certain range.
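The LOOCV procedure with a nearest neighbor classifier mentioned in this abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the two-gene toy profiles, the Euclidean distance, and the function names are hypothetical and are not taken from the paper.

```python
import math

def nearest_neighbor_label(train, query):
    """Return the label of the training profile closest to `query`.

    Euclidean distance is an assumption; the abstract does not name a metric.
    """
    closest = min(train, key=lambda item: math.dist(item[0], query))
    return closest[1]

def loocv_accuracy(samples):
    """Leave-one-out cross validation: hold out each sample once,
    classify it from the remaining samples, and report the hit rate."""
    correct = 0
    for i, (profile, label) in enumerate(samples):
        train = samples[:i] + samples[i + 1:]
        if nearest_neighbor_label(train, profile) == label:
            correct += 1
    return correct / len(samples)

# Hypothetical toy data: (expression profile over two genes, tissue label).
samples = [
    ((1.0, 0.2), "tumor"),
    ((0.9, 0.1), "tumor"),
    ((1.1, 0.3), "tumor"),
    ((0.1, 1.0), "normal"),
    ((0.2, 0.9), "normal"),
    ((0.0, 1.1), "normal"),
]

accuracy = loocv_accuracy(samples)
print(f"LOOCV accuracy: {accuracy:.2f}")
```

With real data, gene selection would have to be repeated inside each held-out fold so that the held-out sample never influences which genes are chosen.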

3.
MOTIVATION: Methods for analyzing cancer microarray data often face two distinct challenges: the models they infer need to perform well when classifying new tissue samples while at the same time providing an insight into the patterns and gene interactions hidden in the data. State-of-the-art supervised data mining methods often cover well only one of these aspects, motivating the development of methods where predictive models with a solid classification performance would be easily communicated to the domain expert. RESULTS: Data visualization may provide an excellent approach to knowledge discovery and analysis of class-labeled data. We have previously developed an approach called VizRank that can score and rank point-based visualizations according to the degree of separation of data instances of different classes. We here extend VizRank with techniques to uncover outliers, score features (genes) and perform classification, and demonstrate that the proposed approach is well suited for cancer microarray analysis. Using VizRank and radviz visualization on a set of previously published cancer microarray data sets, we were able to find simple, interpretable data projections that include only a small subset of genes yet do clearly differentiate among different cancer types. We also report that our approach to classification through visualization achieves performance that is comparable to state-of-the-art supervised data mining techniques. AVAILABILITY: VizRank and radviz are implemented as part of the Orange data mining suite (http://www.ailab.si/orange). SUPPLEMENTARY INFORMATION: Supplementary data are available from http://www.ailab.si/supp/bi-cancer.

4.
MOTIVATION: Experimental limitations have resulted in the popularity of parametric statistical tests as a method for identifying differentially regulated genes in microarray data sets. However, these tests assume that the data follow a normal distribution. To date, the assumption that replicate expression values for any gene are normally distributed has not been critically addressed for Affymetrix GeneChip data. RESULTS: The normality of the expression values calculated using four different commercial and academic software packages was investigated using a data set consisting of the same target RNA applied to 59 human Affymetrix U95A GeneChips using a combination of statistical tests and visualization techniques. For the majority of probe sets obtained from each analysis suite, the expression data showed a good correlation with normality. The exception was a large number of low-expressed genes in the data set produced using Affymetrix Microarray Suite 5.0, which showed a striking non-normal distribution. In summary, our data provide strong support for the application of parametric tests to GeneChip data sets without the need for data transformation.

5.
6.
7.
8.
9.
10.
MOTIVATION: Modern machine learning methods based on matrix decomposition techniques, like independent component analysis (ICA) or non-negative matrix factorization (NMF), provide new and efficient analysis tools which are currently being explored to analyze gene expression profiles. These exploratory feature extraction techniques yield expression modes (ICA) or metagenes (NMF). These extracted features are considered indicative of underlying regulatory processes. They can also be applied to the classification of gene expression datasets by grouping samples into different categories for diagnostic purposes or by grouping genes into functional categories for further investigation of related metabolic pathways and regulatory networks. RESULTS: In this study we focus on unsupervised matrix factorization techniques and apply ICA and sparse NMF to microarray datasets. The latter monitor the gene expression levels of human peripheral blood cells during differentiation from monocytes to macrophages. We show that these tools are able to identify relevant signatures in the deduced component matrices and extract informative sets of marker genes from these gene expression profiles. The methods rely on the joint discriminative power of a set of marker genes rather than on single marker genes. With these sets of marker genes, corroborated by leave-one-out or random forest cross-validation, the datasets could easily be classified into related diagnostic categories. The latter correspond to either monocytes versus macrophages or healthy versus Niemann Pick C disease patients.

11.
We present a new computational technique (a software implementation, data sets, and supplementary information are available at http://www.enm.bris.ac.uk/lpd/) which enables the probabilistic analysis of cDNA microarray data and we demonstrate its effectiveness in identifying features of biomedical importance. A hierarchical Bayesian model, called Latent Process Decomposition (LPD), is introduced in which each sample in the data set is represented as a combinatorial mixture over a finite set of latent processes, which are expected to correspond to biological processes. Parameters in the model are estimated using efficient variational methods. This type of probabilistic model is most appropriate for the interpretation of measurement data generated by cDNA microarray technology. For determining informative substructure in such data sets, the proposed model has several important advantages over the standard use of dendrograms. First, the ability to objectively assess the optimal number of sample clusters. Second, the ability to represent samples and gene expression levels using a common set of latent variables (dendrograms cluster samples and gene expression values separately, which amounts to two distinct reduced space representations). Third, in contrast to standard cluster models, observations are not assigned to a single cluster and, thus, for example, gene expression levels are modeled via combinations of the latent processes identified by the algorithm. We show this new method compares favorably with alternative cluster analysis methods. To illustrate its potential, we apply the proposed technique to several microarray data sets for cancer. For these data sets it successfully decomposes the data into known subtypes and indicates possible further taxonomic subdivision in addition to highlighting, in a wholly unsupervised manner, the importance of certain genes which are known to be medically significant. To illustrate its wider applicability, we also evaluate its performance on a microarray data set for yeast.

12.
Techniques for analyzing genome-wide expression profiles, such as the microarray technique and next-generation sequencers, have been developed. While these techniques can provide a lot of information about gene expression, selection of genes of interest is complicated because of excessive gene expression data. Thus, many researchers use statistical methods or fold change as screening tools for finding gene sets whose expression is altered between groups, which may result in the loss of important information. In the present study, we aimed to establish a combined method for selecting genes of interest with a small magnitude of alteration in gene expression by coupling with proteome analysis. We used hypercholesterolemic rats to examine the effects of a crude herbal drug on gene expression and proteome profiles. We could not select genes of interest by using standard methods. However, by coupling with proteome analysis, we found several effects of the crude herbal drug on gene expression. Our results suggest that this method would be useful in selecting gene sets with expressions that do not show a large magnitude of alteration.

13.
MOTIVATION: Cells continuously reprogram their gene expression network as they move through the cell cycle or sense changes in their environment. In order to understand the regulation of cells, time series expression profiles provide a more complete picture than single time point expression profiles. Few analysis techniques, however, are well suited to modelling such time series data. RESULTS: We describe an approach that naturally handles time series data with the capabilities of modelling causality, feedback loops, and environmental or hidden variables using a Dynamic Bayesian network. We also present a novel way of combining prior biological knowledge and current observations to improve the quality of analysis and to model interactions between sets of genes rather than individual genes. Our approach is evaluated on time series expression data measured in response to physiological changes that affect tryptophan metabolism in E. coli. Results indicate that this approach is capable of finding correlations between sets of related genes.

14.
T. Conway, B. Kraus, D. L. Tucker, D. J. Smalley, A. F. Dorman, L. McKibben. BioTechniques, 2002, 32(1): 110, 112-4, 116, 118-9
Microsoft Windows-based computers have evolved to the point that they provide sufficient computational and visualization power for robust analysis of DNA array data. In fact, smaller laboratories might prefer to carry out some or all of their analyses and visualization in a Windows environment, rather than on alternative platforms such as UNIX. We have developed a series of manually executed macros, written in Visual Basic for Microsoft Excel spreadsheets, that allow rapid and comprehensive gene expression data analysis. The first macro assigns gene names to spots on the DNA array and normalizes individual hybridizations by expressing the signal intensity for each gene as a percentage of the sum of all gene intensities. The second macro streamlines statistical consideration of the confidence in individual gene measurements for sets of experimental replicates by calculating probability values with the Student's t test. The third macro introduces a threshold value, calculates expression ratios between experimental conditions, and calculates the standard deviation of the mean of the log ratio values. Selected columns of data are copied by a fourth macro to create a processed data set suitable for entry into a Microsoft Access database. An Access database structure is described that allows simple queries across multiple experiments and export of data into third-party data visualization software packages. These analysis tools can be used in their present form by others working with commercial E. coli membrane arrays, or they may be adapted for use with other systems. The Excel spreadsheets with embedded Visual Basic macros and detailed instructions for their use are available at http://www.ou.edu/microarray.
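The normalization and ratio steps described for the first and third macros can be sketched in a few lines. This is a Python illustration rather than the authors' Visual Basic code. The gene names, intensities, threshold value, base-10 logarithm, and the choice to floor values at the threshold are all assumptions made here, since the abstract does not specify them.

```python
import math
import statistics

def normalize_percent(intensities):
    """Express each gene's signal as a percentage of the summed signal."""
    total = sum(intensities.values())
    return {gene: 100.0 * value / total for gene, value in intensities.items()}

def log_ratios(control, treated, threshold):
    """Per-gene log10(treated / control).

    Values below `threshold` are raised to it first (an assumed way of
    applying the threshold, so weak signals do not produce extreme ratios).
    """
    ratios = {}
    for gene in control:
        c = max(control[gene], threshold)
        t = max(treated[gene], threshold)
        ratios[gene] = math.log10(t / c)
    return ratios

# Hypothetical raw spot intensities for two hybridizations.
control_raw = {"geneA": 200.0, "geneB": 300.0, "geneC": 500.0}
treated_raw = {"geneA": 845.0, "geneB": 150.0, "geneC": 5.0}

control = normalize_percent(control_raw)
treated = normalize_percent(treated_raw)
ratios = log_ratios(control, treated, threshold=1.0)

# Summary statistics of the log ratios across genes.
mean_log_ratio = statistics.mean(ratios.values())
sd_log_ratio = statistics.stdev(ratios.values())

for gene, value in ratios.items():
    print(f"{gene}: log10 ratio = {value:+.3f}")
print(f"mean = {mean_log_ratio:+.3f}, sd = {sd_log_ratio:.3f}")
```

Normalizing to a percentage of the total signal makes hybridizations with different overall brightness comparable before any ratio is taken.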

15.
MOTIVATION: The nematode C. elegans is an ideal model organism in which to investigate the biomolecular mechanisms underlying the connectivity of neurons, because synaptic connections are described in a comprehensive wiring diagram and methods for defining gene expression profiles of individual neurons are now available. RESULTS: Here we present computational techniques linking these two types of information. A systems-based approach (EMBP: Entropy Minimization and Boolean Parsimony) identifies sets of synergistically interacting genes whose joint expression predicts neural connectivity. We introduce an information theoretic measure of the multivariate synergy, a fundamental concept in systems biology, connecting the members of these gene sets. We present and validate our preliminary results based on publicly available information, and demonstrate that their synergy is exceptionally high indicating joint involvement in pathways. Our strategy provides a robust methodology that will yield increasingly more accurate results as more neuron-specific gene expression data emerge. Ultimately, we expect our approach to provide important clues for universal mechanisms of neural interconnectivity.

16.
MOTIVATION: Experimental gene expression data sets, such as those generated by microarray or gene chip experiments, typically have significant noise and complicated interconnectivities that make understanding even simple regulatory patterns difficult. Given these complications, characterizing the effectiveness of different analysis techniques to uncover network groups and structures remains a challenge. Generating simulated expression patterns with known biological features of expression complexity, diversity and interconnectivities provides a more controlled means of investigating the appropriateness of different analysis methods. A simulation-based approach can systematically evaluate different gene expression analysis techniques and provide a basis for improved methods in dynamic metabolic network reconstruction. RESULTS: We have developed an on-line simulator, called eXPatGen, to generate dynamic gene expression patterns typical of microarray experiments. eXPatGen provides a quantitative network structure to represent key biological features, including the induction, repression, and cascade regulation of messenger RNA (mRNA). The simulation is modular such that the expression model can be replaced with other representations, depending on the level of biological detail required by the user. Two example gene networks, of 25 and 100 genes respectively, were simulated. Two standard analysis techniques, clustering and PCA, were performed on the resulting expression patterns in order to demonstrate how the simulator might be used to evaluate different analysis methods and provide experimental guidance for biological studies of gene expression. AVAILABILITY: http://www.che.udel.edu/eXPatGen/

17.
18.
Large-scale microarray gene expression studies can provide insight into complex genetic networks and biological pathways. A comprehensive gene expression database was constructed using Affymetrix GeneChip microarrays and RNA isolated from more than 6,400 distinct normal and diseased human tissues. These individual patient samples were grouped into over 700 sample sets based on common tissue and disease morphologies, and each set contained averaged expression data for over 45,000 gene probe sets representing more than 33,000 known human genes. Sample sets were compared to each other in more than 750 normal vs. disease pairwise comparisons. Relative up- or down-regulation patterns of genes across these pairwise comparisons provided unique expression fingerprints that could be compared and matched to a gene of interest using the Match/X algorithm. This algorithm uses the kappa statistic to compute correlations between genes and calculate a distance score between a gene of interest and all other genes in the database. Using cdc2 as a query gene, we identified several hundred genes that had similar expression patterns and highly correlated distance scores. Most of these genes were known components of the cell cycle involved in G2/M progression, spindle function or chromosome arrangement. Some of the identified genes had unknown biological functions but may be related to cdc2-mediated mechanisms based on their closely correlated distance scores. This algorithm may provide novel insights into unknown gene function based on correlation to expression profiles of known genes and can identify elements of cellular pathways and gene interactions in a high throughput fashion.
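The kappa statistic mentioned in this abstract can be illustrated with a generic Cohen's kappa computed over per-comparison regulation calls. This is not the Match/X implementation, whose details the abstract does not give. The three call categories ("up", "down", "none"), the example fingerprints, and the function name are assumptions for illustration only.

```python
from collections import Counter

def cohen_kappa(calls_a, calls_b):
    """Cohen's kappa: agreement between two call sequences, corrected
    for the agreement expected by chance from their marginal frequencies."""
    if len(calls_a) != len(calls_b) or not calls_a:
        raise ValueError("call sequences must be non-empty and equal in length")
    n = len(calls_a)
    observed = sum(a == b for a, b in zip(calls_a, calls_b)) / n
    freq_a = Counter(calls_a)
    freq_b = Counter(calls_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        # Both sequences use a single identical category throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)

# Hypothetical fingerprints: one call per normal vs. disease comparison.
query_gene = ["up", "up", "down", "none", "up", "down"]
candidate_gene = ["up", "up", "down", "none", "down", "down"]

kappa = cohen_kappa(query_gene, candidate_gene)
print(f"kappa = {kappa:.3f}")
```

A kappa near 1 indicates that two genes are called up or down in the same comparisons far more often than chance would predict, and a value near 0 indicates chance-level agreement.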

19.
MOTIVATION: Extracting useful information from expression levels of thousands of genes generated with microarray technology needs a variety of analytical techniques. Mathematical programming approaches for classification analysis outperform parametric methods when the data depart from assumptions underlying these methods. Therefore, a mathematical programming approach is developed for gene selection and tissue classification using gene expression profiles. RESULTS: A new mixed integer programming model is formulated for this purpose. The mixed integer programming model simultaneously selects genes and constructs a classification model to classify two groups of tissue samples as accurately as possible. Very encouraging results were obtained with two data sets from the literature as examples. These results show that the mathematical programming approach can rival or outperform traditional classification methods.

20.