首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
Until recently, numerous feature selection techniques have been proposed and found wide applications in genomics and proteomics. For instance, feature/gene selection has proven to be useful for biomarker discovery from microarray and mass spectrometry data. While supervised feature selection has been explored extensively, there are only a few unsupervised methods that can be applied to exploratory data analysis. In this paper, we address the problem of unsupervised feature selection. First, we extend Laplacian linear discriminant analysis (LLDA) to unsupervised cases. Second, we propose a novel algorithm for computing LLDA, which is efficient in the case of high dimensionality and small sample size as in microarray data. Finally, an unsupervised feature selection method, called LLDA-based Recursive Feature Elimination (LLDA-RFE), is proposed. We apply LLDA-RFE to several public data sets of cancer microarrays and compare its performance with those of Laplacian score and SVD-entropy, two state-of-the-art unsupervised methods, and with that of Fisher score, a supervised filter method. Our results demonstrate that LLDA-RFE outperforms Laplacian score and shows favorable performance against SVD-entropy. It performs even better than Fisher score for some of the data sets, despite the fact that LLDA-RFE is fully unsupervised.  相似文献   

2.
MOTIVATION: Grouping genes having similar expression patterns is called gene clustering, which has been proved to be a useful tool for extracting underlying biological information of gene expression data. Many clustering procedures have shown success in microarray gene clustering; most of them belong to the family of heuristic clustering algorithms. Model-based algorithms are alternative clustering algorithms, which are based on the assumption that the whole set of microarray data is a finite mixture of a certain type of distributions with different parameters. Application of the model-based algorithms to unsupervised clustering has been reported. Here, for the first time, we demonstrated the use of the model-based algorithm in supervised clustering of microarray data. RESULTS: We applied the proposed methods to real gene expression data and simulated data. We showed that the supervised model-based algorithm is superior over the unsupervised method and the support vector machines (SVM) method. AVAILABILITY: The program written in the SAS language implementing methods I-III in this report is available upon request. The software of SVMs is available in the website http://svm.sdsc.edu/cgi-bin/nph-SVMsubmit.cgi  相似文献   

3.
This paper introduces a novel approach to gene selection based on a substantial modification of analytic hierarchy process (AHP). The modified AHP systematically integrates outcomes of individual filter methods to select the most informative genes for microarray classification. Five individual ranking methods including t-test, entropy, receiver operating characteristic (ROC) curve, Wilcoxon and signal to noise ratio are employed to rank genes. These ranked genes are then considered as inputs for the modified AHP. Additionally, a method that uses fuzzy standard additive model (FSAM) for cancer classification based on genes selected by AHP is also proposed in this paper. Traditional FSAM learning is a hybrid process comprising unsupervised structure learning and supervised parameter tuning. Genetic algorithm (GA) is incorporated in-between unsupervised and supervised training to optimize the number of fuzzy rules. The integration of GA enables FSAM to deal with the high-dimensional-low-sample nature of microarray data and thus enhance the efficiency of the classification. Experiments are carried out on numerous microarray datasets. Results demonstrate the performance dominance of the AHP-based gene selection against the single ranking methods. Furthermore, the combination of AHP-FSAM shows a great accuracy in microarray data classification compared to various competing classifiers. The proposed approach therefore is useful for medical practitioners and clinicians as a decision support system that can be implemented in the real medical practice.  相似文献   

4.
MOTIVATION: Gene expression profiling is a powerful approach to identify genes that may be involved in a specific biological process on a global scale. For example, gene expression profiling of mutant animals that lack or contain an excess of certain cell types is a common way to identify genes that are important for the development and maintenance of given cell types. However, it is difficult for traditional computational methods, including unsupervised and supervised learning methods, to detect relevant genes from a large collection of expression profiles with high sensitivity and specificity. Unsupervised methods group similar gene expressions together while ignoring important prior biological knowledge. Supervised methods utilize training data from prior biological knowledge to classify gene expression. However, for many biological problems, little prior knowledge is available, which limits the prediction performance of most supervised methods. RESULTS: We present a Bayesian semi-supervised learning method, called BGEN, that improves upon supervised and unsupervised methods by both capturing relevant expression profiles and using prior biological knowledge from literature and experimental validation. Unlike currently available semi-supervised learning methods, this new method trains a kernel classifier based on labeled and unlabeled gene expression examples. The semi-supervised trained classifier can then be used to efficiently classify the remaining genes in the dataset. Moreover, we model the confidence of microarray probes and probabilistically combine multiple probe predictions into gene predictions. We apply BGEN to identify genes involved in the development of a specific cell lineage in the C. elegans embryo, and to further identify the tissues in which these genes are enriched. Compared to K-means clustering and SVM classification, BGEN achieves higher sensitivity and specificity. We confirm certain predictions by biological experiments. AVAILABILITY: The results are available at http://www.csail.mit.edu/~alanqi/projects/BGEN.html.  相似文献   

5.
MOTIVATION: Currently the most popular approach to analyze genome-wide expression data is clustering. One of the major drawbacks of most of the existing clustering methods is that the number of clusters has to be specified a priori. Furthermore, by using pure unsupervised algorithms prior biological knowledge is totally ignored Moreover, most current tools lack an effective framework for tight integration of unsupervised and supervised learning for the analysis of high-dimensional expression data and only very few multi-class supervised approaches are designed with the provision for effectively utilizing multiple functional class labeling. RESULTS: The paper adapts a novel Self-Organizing map called supervised Network Self-Organized Map (sNet-SOM) to the peculiarities of multi-labeled gene expression data. The sNet-SOM determines adaptively the number of clusters with a dynamic extension process. This process is driven by an inhomogeneous measure that tries to balance unsupervised, supervised and model complexity criteria. Nodes within a rectangular grid are grown at the boundary nodes, weights rippled from the internal nodes towards the outer nodes of the grid, and whole columns inserted within the map The appropriate level of expansion is determined automatically. Multiple sNet-SOM models are constructed dynamically each for a different unsupervised/supervised balance and model selection criteria are used to select the one optimum one. The results indicate that sNet-SOM yields competitive performance to other recently proposed approaches for supervised classification at a significantly reduced computational cost and it provides extensive exploratory analysis potentiality within the analysis framework. Furthermore, it explores simple design decisions that are easier to comprehend and computationally efficient.  相似文献   

6.

Background  

Microarray studies provide a way of linking variations of phenotypes with their genetic causations. Constructing predictive models using high dimensional microarray measurements usually consists of three steps: (1) unsupervised gene screening; (2) supervised gene screening; and (3) statistical model building. Supervised gene screening based on marginal gene ranking is commonly used to reduce the number of genes in the model building. Various simple statistics, such as t-statistic or signal to noise ratio, have been used to rank genes in the supervised screening. Despite of its extensive usage, statistical study of supervised gene screening remains scarce. Our study is partly motivated by the differences in gene discovery results caused by using different supervised gene screening methods.  相似文献   

7.
Particle tracking in living systems requires low light exposure and short exposure times to avoid phototoxicity and photobleaching and to fully capture particle motion with high-speed imaging. Low-excitation light comes at the expense of tracking accuracy. Image restoration methods based on deep learning dramatically improve the signal-to-noise ratio in low-exposure data sets, qualitatively improving the images. However, it is not clear whether images generated by these methods yield accurate quantitative measurements such as diffusion parameters in (single) particle tracking experiments. Here, we evaluate the performance of two popular deep learning denoising software packages for particle tracking, using synthetic data sets and movies of diffusing chromatin as biological examples. With synthetic data, both supervised and unsupervised deep learning restored particle motions with high accuracy in two-dimensional data sets, whereas artifacts were introduced by the denoisers in three-dimensional data sets. Experimentally, we found that, while both supervised and unsupervised approaches improved tracking results compared with the original noisy images, supervised learning generally outperformed the unsupervised approach. We find that nicer-looking image sequences are not synonymous with more precise tracking results and highlight that deep learning algorithms can produce deceiving artifacts with extremely noisy images. Finally, we address the challenge of selecting parameters to train convolutional neural networks by implementing a frugal Bayesian optimizer that rapidly explores multidimensional parameter spaces, identifying networks yielding optimal particle tracking accuracy. Our study provides quantitative outcome measures of image restoration using deep learning. We anticipate broad application of this approach to critically evaluate artificial intelligence solutions for quantitative microscopy.  相似文献   

8.
We present a web-based pipeline for microarray gene expression profile analysis, GEPAS, which stands for Gene Expression Profile Analysis Suite (http://gepas.bioinfo.cnio.es). GEPAS is composed of different interconnected modules which include tools for data pre-processing, two-conditions comparison, unsupervised and supervised clustering (which include some of the most popular methods as well as home made algorithms) and several tests for differential gene expression among different classes, continuous variables or survival analysis. A multiple purpose tool for data mining, based on Gene Ontology, is also linked to the tools, which constitutes a very convenient way of analysing clustering results. On-line tutorials are available from our main web server (http://bioinfo.cnio.es).  相似文献   

9.
MOTIVATION: The diverse microarray datasets that have become available over the past several years represent a rich opportunity and challenge for biological data mining. Many supervised and unsupervised methods have been developed for the analysis of individual microarray datasets. However, integrated analysis of multiple datasets can provide a broader insight into genetic regulation of specific biological pathways under a variety of conditions. RESULTS: To aid in the analysis of such large compendia of microarray experiments, we present Microarray Experiment Functional Integration Technology (MEFIT), a scalable Bayesian framework for predicting functional relationships from integrated microarray datasets. Furthermore, MEFIT predicts these functional relationships within the context of specific biological processes. All results are provided in the context of one or more specific biological functions, which can be provided by a biologist or drawn automatically from catalogs such as the Gene Ontology (GO). Using MEFIT, we integrated 40 Saccharomyces cerevisiae microarray datasets spanning 712 unique conditions. In tests based on 110 biological functions drawn from the GO biological process ontology, MEFIT provided a 5% or greater performance increase for 54 functions, with a 5% or more decrease in performance in only two functions.  相似文献   

10.
Abstract. This paper describes the use of supervised methods for the classification of vegetation. The difference between supervised classification and clustering is outlined, with reference to their current use in vegetation science. In the paper we describe the classification of Danish grasslands according to the Habitats Directive of the European Union, and demonstrate how a supervised classification can be used to achieve a standardized and statistical interpretation within a local flora. We thereby offer a statistical solution to the legal problem of protection of certain selected habitat types. The Habitats Directive protects three types of Danish grassland habitats, whereas two remaining types fall outside protection. A classification model is developed, using available Danish grassland data, for the discrimination of these five types based on their species composition. This new Habitats Directive classification is compared to a previously published unsupervised classification of Danish grassland vegetation. An indicator species analysis is used to find significant indicator species for the three protected habitat types in Denmark, and these are compared to the characteristic species mentioned in the interpretation manual of the Habitats Directive. Eventually, we discuss the pros and cons of supervised and unsupervised classification and conclude that supervised methods deserve more attention in vegetation science.  相似文献   

11.
The wealth of interaction information provided in biomedical articles motivated the implementation of text mining approaches to automatically extract biomedical relations. This paper presents an unsupervised method based on pattern clustering and sentence parsing to deal with biomedical relation extraction. Pattern clustering algorithm is based on Polynomial Kernel method, which identifies interaction words from unlabeled data; these interaction words are then used in relation extraction between entity pairs. Dependency parsing and phrase structure parsing are combined for relation extraction. Based on the semi-supervised KNN algorithm, we extend the proposed unsupervised approach to a semi-supervised approach by combining pattern clustering, dependency parsing and phrase structure parsing rules. We evaluated the approaches on two different tasks: (1) Protein–protein interactions extraction, and (2) Gene–suicide association extraction. The evaluation of task (1) on the benchmark dataset (AImed corpus) showed that our proposed unsupervised approach outperformed three supervised methods. The three supervised methods are rule based, SVM based, and Kernel based separately. The proposed semi-supervised approach is superior to the existing semi-supervised methods. The evaluation on gene–suicide association extraction on a smaller dataset from Genetic Association Database and a larger dataset from publicly available PubMed showed that the proposed unsupervised and semi-supervised methods achieved much higher F-scores than co-occurrence based method.  相似文献   

12.
Recent neuroscience studies demonstrate that a deeper understanding of brain function requires a deeper understanding of behavior. Detailed behavioral measurements are now often collected using video cameras, resulting in an increased need for computer vision algorithms that extract useful information from video data. Here we introduce a new video analysis tool that combines the output of supervised pose estimation algorithms (e.g. DeepLabCut) with unsupervised dimensionality reduction methods to produce interpretable, low-dimensional representations of behavioral videos that extract more information than pose estimates alone. We demonstrate this tool by extracting interpretable behavioral features from videos of three different head-fixed mouse preparations, as well as a freely moving mouse in an open field arena, and show how these interpretable features can facilitate downstream behavioral and neural analyses. We also show how the behavioral features produced by our model improve the precision and interpretation of these downstream analyses compared to using the outputs of either fully supervised or fully unsupervised methods alone.  相似文献   

13.
Paul TK  Iba H 《Bio Systems》2005,82(3):208-225
Recently, DNA microarray-based gene expression profiles have been used to correlate the clinical behavior of cancers with the differential gene expression levels in cancerous and normal tissues. To this end, after selection of some predictive genes based on signal-to-noise (S2N) ratio, unsupervised learning like clustering and supervised learning like k-nearest neighbor (k NN) classifier are widely used. Instead of S2N ratio, adaptive searches like Probabilistic Model Building Genetic Algorithm (PMBGA) can be applied for selection of a smaller size gene subset that would classify patient samples more accurately. In this paper, we propose a new PMBGA-based method for identification of informative genes from microarray data. By applying our proposed method to classification of three microarray data sets of binary and multi-type tumors, we demonstrate that the gene subsets selected with our technique yield better classification accuracy.  相似文献   

14.
The application of ACO-based algorithms in data mining has been growing over the last few years, and several supervised and unsupervised learning algorithms have been developed using this bio-inspired approach. Most recent works about unsupervised learning have focused on clustering, showing the potential of ACO-based techniques. However, there are still clustering areas that are almost unexplored using these techniques, such as medoid-based clustering. Medoid-based clustering methods are helpful—compared to classical centroid-based techniques—when centroids cannot be easily defined. This paper proposes two medoid-based ACO clustering algorithms, where the only information needed is the distance between data: one algorithm that uses an ACO procedure to determine an optimal medoid set (METACOC algorithm) and another algorithm that uses an automatic selection of the number of clusters (METACOC-K algorithm). The proposed algorithms are compared against classical clustering approaches using synthetic and real-world datasets.  相似文献   

15.
Class prediction and feature selection are two learning tasks that are strictly paired in the search of molecular profiles from microarray data. Researchers have become aware how easy it is to incur a selection bias effect, and complex validation setups are required to avoid overly optimistic estimates of the predictive accuracy of the models and incorrect gene selections. This paper describes a semisupervised pattern discovery approach that uses the by-products of complete validation studies on experimental setups for gene profiling. In particular, we introduce the study of the patterns of single sample responses (sample-tracking profiles) to the gene selection process induced by typical supervised learning tasks in microarray studies. We originate sample-tracking profiles as the aggregated off-training evaluation of SVM models of increasing gene panel sizes. Genes are ranked by E-RFE, an entropy-based variant of the recursive feature elimination for support vector machines (RFE-SVM). A dynamic time warping (DTW) algorithm is then applied to define a metric between sample-tracking profiles. An unsupervised clustering based on the DTW metric allows automating the discovery of outliers and of subtypes of different molecular profiles. Applications are described on synthetic data and in two gene expression studies.  相似文献   

16.
Gene expression data analysis   总被引:2,自引:0,他引:2  
Microarrays are one of the latest breakthroughs in experimental molecular biology, which allow monitoring of gene expression for tens of thousands of genes in parallel and are already producing huge amounts of valuable data. Analysis and handling of such data is becoming one of the major bottlenecks in the utilization of the technology. The raw microarray data are images, which have to be transformed into gene expression matrices, tables where rows represent genes, columns represent various samples such as tissues or experimental conditions, and numbers in each cell characterize the expression level of the particular gene in the particular sample. These matrices have to be analyzed further if any knowledge about the underlying biological processes is to be extracted. In this paper we concentrate on discussing bioinformatics methods used for such analysis. We briefly discuss supervised and unsupervised data analysis and its applications, such as predicting gene function classes and cancer classification as well as some possible future directions.  相似文献   

17.
18.
Chae M  Chen JJ 《PloS one》2011,6(8):e22546

Background

In microarray data analysis, hierarchical clustering (HC) is often used to group samples or genes according to their gene expression profiles to study their associations. In a typical HC, nested clustering structures can be quickly identified in a tree. The relationship between objects is lost, however, because clusters rather than individual objects are compared. This results in a tree that is hard to interpret.

Methodology/Principal Findings

This study proposes an ordering method, HC-SYM, which minimizes bilateral symmetric distance of two adjacent clusters in a tree so that similar objects in the clusters are located in the cluster boundaries. The performance of HC-SYM was evaluated by both supervised and unsupervised approaches and compared favourably with other ordering methods.

Conclusions/Significance

The intuitive relationship between objects and flexibility of the HC-SYM method can be very helpful in the exploratory analysis of not only microarray data but also similar high-dimensional data.  相似文献   

19.
20.
Analysis of cellular phenotypes in large imaging data sets conventionally involves supervised statistical methods, which require user-annotated training data. This paper introduces an unsupervised learning method, based on temporally constrained combinatorial clustering, for automatic prediction of cell morphology classes in time-resolved images. We applied the unsupervised method to diverse fluorescent markers and screening data and validated accurate classification of human cell phenotypes, demonstrating fully objective data labeling in image-based systems biology.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号