20 similar articles found; search time: 0 ms
1.
We present a novel method for finding low-dimensional views of high-dimensional data: Targeted Projection Pursuit. The method proceeds by finding projections of the data that best approximate a target view. Two versions of the method are introduced; one version based on Procrustes analysis and one based on an artificial neural network. These versions are capable of finding orthogonal or non-orthogonal projections, respectively. The method is quantitatively and qualitatively compared with other dimension reduction techniques. It is shown to find 2D views that display the classification of cancers from gene expression data with a visual separation equal to, or better than, existing dimension reduction techniques. AVAILABILITY: source code, additional diagrams, and original data are available from http://computing.unn.ac.uk/staff/CGJF1/tpp/bioinf.html
2.
MOTIVATION: Microarrays have become a central tool in biological research. Their applications range from functional annotation to tissue classification and genetic network inference. A key step in the analysis of gene expression data is the identification of groups of genes that manifest similar expression patterns. This translates to the algorithmic problem of clustering genes based on their expression patterns. RESULTS: We present a novel clustering algorithm, called CLICK, and its applications to gene expression analysis. The algorithm utilizes graph-theoretic and statistical techniques to identify tight groups (kernels) of highly similar elements, which are likely to belong to the same true cluster. Several heuristic procedures are then used to expand the kernels into the full clusters. We report on the application of CLICK to a variety of gene expression data sets. In all those applications it outperformed extant algorithms according to several common figures of merit. We also point out that CLICK can be successfully used for the identification of common regulatory motifs in the upstream regions of co-regulated genes. Furthermore, we demonstrate how CLICK can be used to accurately classify tissue samples into disease types, based on their expression profiles. Finally, we present a new Java-based graphical tool, called EXPANDER, for gene expression analysis and visualization, which incorporates CLICK and several other popular clustering algorithms. AVAILABILITY: http://www.cs.tau.ac.il/~rshamir/expander/expander.html
3.
Khan HA 《Comparative and Functional Genomics》2004,5(1):39-47
The massive surge in the production of microarray data poses a great challenge for proper analysis and interpretation. In recent years numerous computational tools have been developed to extract meaningful interpretation of microarray gene expression data. However, a convenient tool for two-group comparison of microarray data is still lacking, and users have to rely on commercial statistical packages that might be costly and require special skills, in addition to extra time and effort for transferring data from one platform to another. Various statistical methods, including the t-test, analysis of variance, Pearson test and Mann-Whitney U test, have been reported for comparing microarray data, whereas the Wilcoxon signed-rank test, which is an appropriate test for two-group comparison of gene expression data, has largely been neglected in microarray studies. The aim of this investigation was to build an integrated tool, ArraySolver, for colour-coded graphical display and comparison of gene expression data using the Wilcoxon signed-rank test. The results of software validation showed similar outputs with ArraySolver and SPSS for large datasets, whereas the former program appeared to be more accurate for 25 or fewer pairs (n ≤ 25), suggesting its potential application in analysing molecular signatures that usually contain small numbers of genes. The main advantages of ArraySolver are easy data selection, convenient report format, accurate statistics and the familiar Excel platform.
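As a rough illustration of the test this abstract refers to, the following sketch computes the Wilcoxon signed-rank sums for paired measurements. It is not ArraySolver's implementation: the function name `wilcoxon_signed_rank` is hypothetical, zero differences are simply dropped, tied absolute differences share an average rank, and no p-value is computed.

```python
from itertools import groupby


def wilcoxon_signed_rank(x, y):
    """Return (W_plus, W_minus, n) for paired samples x and y.

    Zero differences are dropped; tied absolute differences receive
    the average of the ranks they span. n is the number of non-zero
    differences that were ranked.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")

    # Paired differences, ignoring pairs with no change.
    diffs = [a - b for a, b in zip(x, y) if a != b]

    # Rank the differences by absolute value, averaging ranks over ties.
    ordered = sorted(diffs, key=abs)
    ranked = []
    pos = 0  # number of differences ranked so far
    for _, group in groupby(ordered, key=abs):
        tied = list(group)
        avg_rank = pos + (len(tied) + 1) / 2  # mean of ranks pos+1 .. pos+len(tied)
        ranked.extend((d, avg_rank) for d in tied)
        pos += len(tied)

    w_plus = sum(rank for d, rank in ranked if d > 0)
    w_minus = sum(rank for d, rank in ranked if d < 0)
    return w_plus, w_minus, len(diffs)
```

The smaller of the two rank sums is the usual test statistic; for small n it is compared against exact critical values, which this sketch leaves out.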
4.
Background
Clustering is a key step in the analysis of gene expression data, and in fact, many classical clustering algorithms are used, or more innovative ones have been designed and validated for the task. Despite the widespread use of artificial intelligence techniques in bioinformatics and, more generally, data analysis, there are very few clustering algorithms based on the genetic paradigm, yet that paradigm has great potential in finding good heuristic solutions to a difficult optimization problem such as clustering.
5.
Sharma A, Imoto S, Miyano S 《IEEE/ACM transactions on computational biology and bioinformatics / IEEE, ACM》2012,9(3):754-764
Most conventional feature selection algorithms have a drawback: a weakly ranked gene that could perform well in terms of classification accuracy with an appropriate subset of genes will be left out of the selection. Considering this shortcoming, we propose a feature selection algorithm for sample classification in gene expression data analysis. The proposed algorithm first divides genes into subsets, the sizes of which are relatively small (roughly of size h), then selects informative smaller subsets of genes (of size r < h) from a subset and merges the chosen genes with another gene subset (of size r) to update the gene subset. We repeat this process until all subsets are merged into one informative subset. We illustrate the effectiveness of the proposed algorithm by analyzing three distinct gene expression data sets. Our method shows promising classification accuracy for all the test data sets. We also show the relevance of the selected genes in terms of their biological functions.
6.
MOTIVATION: Analysis of genome-wide microarray data requires the estimation of a large number of genetic parameters for individual genes and their interaction expression patterns under multiple biological conditions. The sources of microarray error variability comprise various biological and experimental factors, such as biological and individual replication, sample preparation, hybridization and image processing. Moreover, the same gene often shows quite heterogeneous error variability under different biological and experimental conditions, which must be estimated separately for evaluating the statistical significance of differential expression patterns. Widely used linear modeling approaches are limited because they do not allow simultaneous modeling and inference on the large number of these genetic parameters and heterogeneous error components on different genes, different biological and experimental conditions, and varying intensity ranges in microarray data. RESULTS: We propose a Bayesian hierarchical error model (HEM) to overcome the above restrictions. HEM accounts for heterogeneous error variability in an oligonucleotide microarray experiment. The error variability is decomposed into two components (experimental and biological errors) when both biological and experimental replicates are available. Our HEM inference is based on Markov chain Monte Carlo to estimate a large number of parameters from a single-likelihood function for all genes. An F-like summary statistic is proposed to identify differentially expressed genes under multiple conditions based on the HEM estimation. The performance of HEM and its F-like statistic was examined with simulated data and two published microarray datasets: primate brain data and mouse B-cell development data. HEM was also compared with ANOVA using simulated data. AVAILABILITY: The software for the HEM is available from the authors upon request.
7.
MOTIVATION: Unsupervised analysis of microarray gene expression data attempts to find biologically significant patterns within a given collection of expression measurements. For example, hierarchical clustering can be applied to expression profiles of genes across multiple experiments, identifying groups of genes that share similar expression profiles. Previous work using the support vector machine supervised learning algorithm with microarray data suggests that higher-order features, such as pairwise and tertiary correlations across multiple experiments, may provide significant benefit in learning to recognize classes of co-expressed genes. RESULTS: We describe a generalization of the hierarchical clustering algorithm that efficiently incorporates these higher-order features by using a kernel function to map the data into a high-dimensional feature space. We then evaluate the utility of the kernel hierarchical clustering algorithm using both internal and external validation. The experiments demonstrate that the kernel representation itself is insufficient to provide improved clustering performance. We conclude that mapping gene expression data into a high-dimensional feature space is only a good idea when combined with a learning algorithm, such as the support vector machine, that does not suffer from the curse of dimensionality. AVAILABILITY: Supplementary data at www.cs.columbia.edu/compbio/hiclust. Software source code available by request.
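A kernel lets a distance-based method such as hierarchical clustering work in the feature space without computing the mapping explicitly, because the squared feature-space distance can be written as K(x,x) - 2K(x,y) + K(y,y). The sketch below illustrates that identity only. The polynomial kernel and the names `poly_kernel` and `kernel_distance` are assumptions made for illustration; the abstract does not say which kernel the authors used.

```python
import math


def poly_kernel(x, y, degree=2):
    """Polynomial kernel (x . y + 1) ** degree (an illustrative choice)."""
    return (sum(a * b for a, b in zip(x, y)) + 1.0) ** degree


def kernel_distance(x, y, kernel=poly_kernel):
    """Distance between x and y in the feature space induced by `kernel`.

    Uses ||phi(x) - phi(y)||^2 = K(x, x) - 2 K(x, y) + K(y, y), so the
    feature map phi never has to be evaluated.
    """
    squared = kernel(x, x) - 2.0 * kernel(x, y) + kernel(y, y)
    return math.sqrt(max(squared, 0.0))  # clamp tiny negative rounding errors
```

A matrix of such pairwise distances could then be passed to any standard agglomerative clustering routine in place of Euclidean distances.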
8.
9.
An improved algorithm for clustering gene expression data (total citations: 1; self-citations: 0; citations by others: 1)
MOTIVATION: Recent advancements in microarray technology allow simultaneous monitoring of the expression levels of a large number of genes over different time points. Clustering is an important tool for analyzing such microarray data, typical properties of which are its inherent uncertainty, noise and imprecision. In this article, a two-stage clustering algorithm, which employs a recently proposed variable string length genetic scheme and a multiobjective genetic clustering algorithm, is proposed. It is based on the novel concept of points having significant membership to multiple classes. An iterated version of the well-known Fuzzy C-Means is also utilized for clustering. RESULTS: The significant superiority of the proposed two-stage clustering algorithm as compared to the average linkage method, Self Organizing Map (SOM) and a recently developed weighted Chinese restaurant-based clustering method (CRC), widely used methods for clustering gene expression data, is established on a variety of artificial and publicly available real life data sets. The biological relevance of the clustering solutions is also analyzed.
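The idea of a point having significant membership in more than one class can be made concrete with the standard Fuzzy C-Means update rules. The sketch below shows one membership step and one center step of plain Fuzzy C-Means, not the authors' two-stage genetic algorithm. The function names and the fuzzifier default `m=2.0` are assumptions chosen for illustration.

```python
import math


def _euclid(p, q):
    """Euclidean distance between two equal-length coordinate sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def fcm_memberships(points, centers, m=2.0):
    """Return u[j][i], the membership of point j in cluster i (rows sum to 1)."""
    power = 2.0 / (m - 1.0)
    memberships = []
    for p in points:
        dists = [_euclid(p, c) for c in centers]
        if 0.0 in dists:
            # The point sits exactly on a center: split membership among
            # the coinciding centers and give the others zero.
            hits = [1.0 if d == 0.0 else 0.0 for d in dists]
            memberships.append([h / sum(hits) for h in hits])
            continue
        memberships.append(
            [1.0 / sum((d_i / d_k) ** power for d_k in dists) for d_i in dists]
        )
    return memberships


def fcm_update_centers(points, memberships, m=2.0):
    """Recompute each center as the mean of the points weighted by u ** m."""
    dim = len(points[0])
    centers = []
    for i in range(len(memberships[0])):
        weights = [u[i] ** m for u in memberships]
        total = sum(weights)
        centers.append(
            [sum(w * p[d] for w, p in zip(weights, points)) / total for d in range(dim)]
        )
    return centers
```

With centers at 0 and 4 on a line, the point 2 gets memberships of 0.5 in each cluster, which is the kind of ambiguous point the abstract singles out.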
10.
A lack of high-throughput techniques for making titrated, gene-specific changes in expression limits our understanding of the relationship between gene expression and cell phenotype. Here, we present a generalizable approach for quantifying growth rate as a function of titrated changes in gene expression level. The approach works by performing CRISPRi with a series of mutated single guide RNAs (sgRNAs) that modulate gene expression. To evaluate sgRNA mutation strategies, we constructed a library of 5927 sgRNAs targeting 88 genes in Escherichia coli MG1655 and measured the effects on growth rate. We found that a compounding mutational strategy, through which mutations are incrementally added to the sgRNA, presented a straightforward way to generate a monotonic and gradated relationship between mutation number and growth rate effect. We also implemented molecular barcoding to detect and correct for mutations that ‘escape’ the CRISPRi targeting machinery; this strategy unmasked deleterious growth rate effects obscured by the standard approach of ignoring escapers. Finally, we performed controlled environmental variations and observed that many gene-by-environment interactions go completely undetected at the limit of maximum knockdown, but instead manifest at intermediate expression perturbation strengths. Overall, our work provides an experimental platform for quantifying the phenotypic response to gene expression variation.
11.
Kyungpil Kim, Shibo Zhang, Keni Jiang, Li Cai, In-Beum Lee, Lewis J Feldman, Haiyan Huang 《BMC bioinformatics》2007,8(1):29
Background
Clustering methods are widely used on gene expression data to categorize genes with similar expression profiles. Finding an appropriate (dis)similarity measure is critical to the analysis. In our study, we developed a new measure for clustering the genes when the key factor is the shape of the profile, and when the expression magnitude should also be accounted for in determining the gene relationship. This is achieved by modeling the shape and magnitude parameters separately in a gene expression profile, and then using the estimated shape and magnitude parameters to define a measure in a new feature space.
12.
CRCView is a user-friendly point-and-click web server for analyzing and visualizing microarray gene expression data using a Dirichlet process mixture model-based clustering algorithm. CRCView is designed to cluster genes based on their expression profiles. It supports flexible input data formats, rich graphical illustration, and integrated GO term-based annotation/interpretation of clustering results. Availability: http://helab.bioinformatics.med.umich.edu/crcview/.
13.
Recent progress in bioinformatics research has led to the accumulation of huge quantities of biological data at various data sources. DNA microarray technology makes it possible to simultaneously analyze a large number of genes across different samples. Clustering of microarray data can reveal hidden gene expression patterns in large quantities of expression data, which in turn offers tremendous possibilities in functional genomics, comparative genomics, disease diagnosis and drug development. The k-means clustering algorithm is widely used for many practical applications, but the original k-means algorithm has several drawbacks: it is computationally expensive and generates locally optimal solutions based on the random choice of the initial centroids. Several methods have been proposed in the literature for improving the performance of the k-means algorithm. A meta-heuristic optimization algorithm named harmony search helps find near-global optimal solutions by searching the entire solution space. Low clustering accuracy of the existing algorithms limits their use in many crucial applications of life sciences. In this paper we propose a novel Harmony Search-K means Hybrid (HSKH) algorithm for clustering gene expression data. Experimental results show that the proposed algorithm produces clusters with better accuracy in comparison with the existing algorithms.
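The dependence of k-means on its initial centroids can be seen in a few lines. The sketch below is plain Lloyd's k-means, not the HSKH hybrid proposed in the paper. The function names `kmeans` and `sse`, and the rule that an empty cluster keeps its previous centroid, are assumptions made for illustration.

```python
def _sq_dist(p, q):
    """Squared Euclidean distance between two coordinate sequences."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, initial_centroids, max_iter=100):
    """Run Lloyd's k-means from the given centroids; return (centroids, labels)."""
    centroids = [list(c) for c in initial_centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda i: _sq_dist(p, centroids[i]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for i, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(old)  # empty cluster keeps its centroid
        if new_centroids == centroids:
            break  # converged: centroids no longer move
        centroids = new_centroids
    return centroids, labels


def sse(points, centroids, labels):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(_sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))


# Four corners of a 4 x 1 rectangle, clustered from two different starts.
corners = [(0, 0), (0, 1), (4, 0), (4, 1)]
good = kmeans(corners, [[0, 0.5], [4, 0.5]])  # splits left / right, SSE 1.0
poor = kmeans(corners, [[2, 0], [2, 1]])      # splits bottom / top, SSE 16.0
```

Both runs converge immediately, but the second is stuck in a much worse local optimum, which is the weakness that global search heuristics such as harmony search are meant to address.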
14.
Data analysis, not data production, is becoming the bottleneck in gene expression research. Data integration is necessary to cope with an ever increasing amount of data, to cross-validate noisy data sets, and to gain broad interdisciplinary views of large biological data sets. New Internet resources may help researchers to combine data sets across different gene expression platforms. However, noise and disparities in experimental protocols strongly limit data integration. A detailed review of four selected studies reveals how some of these limitations may be circumvented and illustrates what can be achieved through data integration.
15.
16.
Classification of gene expression data is a pivotal research area that plays a substantial role in the diagnosis and prediction of diseases. Feature selection is one of the most widely used techniques in data mining, especially in classification. Gene expression data are usually composed of dozens of samples characterized by thousands of genes. This high dimensionality is coupled with the existence of irrelevant and redundant features. Accordingly, the selection of informative genes (features) becomes difficult, which degrades classification accuracy. In this paper, we consider feature selection for classifying gene expression microarray datasets. The goal is to detect the genes most likely to be cancer-related in a distributed manner, which helps in effectively classifying the samples. Initially, the large set of candidate features is subdivided and distributed among several processors. Then, a new filter selection method based on a fuzzy inference system is applied to each subset of the dataset. Finally, all the resulting features are ranked, and a wrapper-based selection method is applied. Experimental results showed that our proposed feature selection technique performs better than other techniques since it produces lower latency and improves classification performance.
17.
18.
19.
20.