首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.

Background  

The identification of relevant biological features in large and complex datasets is an important step towards gaining insight in the processes underlying the data. Other advantages of feature selection include the ability of the classification system to attain good or even better solutions using a restricted subset of features, and a faster classification. Thus, robust methods for fast feature selection are of key importance in extracting knowledge from complex biological data.  相似文献   

2.
A reliable and accurate identification of the type of tumors is crucial to the proper treatment of cancers. In recent years, it has been shown that sparse representation (SR) by l1-norm minimization is robust to noise, outliers and even incomplete measurements, and SR has been successfully used for classification. This paper presents a new SR-based method for tumor classification using gene expression data. A set of metasamples are extracted from the training samples, and then an input testing sample is represented as the linear combination of these metasamples by l1-regularized least square method. Classification is achieved by using a discriminating function defined on the representation coefficients. Since l1-norm minimization leads to a sparse solution, the proposed method is called metasample-based SR classification (MSRC). Extensive experiments on publicly available gene expression data sets show that MSRC is efficient for tumor classification, achieving higher accuracy than many existing representative schemes.  相似文献   

3.
《IRBM》2020,41(4):229-239
Feature selection algorithms are the cornerstone of machine learning. By increasing the properties of the samples and samples, the feature selection algorithm selects the significant features. The general name of the methods that perform this function is the feature selection algorithm. The general purpose of feature selection algorithms is to select the most relevant properties of data classes and to increase the classification performance. Thus, we can select features based on their classification performance. In this study, we have developed a feature selection algorithm based on decision support vectors classification performance. The method can work according to two different selection criteria. We tested the classification performances of the features selected with P-Score with three different classifiers. Besides, we assessed P-Score performance with 13 feature selection algorithms in the literature. According to the results of the study, the P-Score feature selection algorithm has been determined as a method which can be used in the field of machine learning.  相似文献   

4.
Machine learning-based classification approaches are widely used to predict host phenotypes from microbiome data. Classifiers are typically employed by considering operational taxonomic units or relative abundance profiles as input features. Such types of data are intrinsically sparse, which opens the opportunity to make predictions from the presence/absence rather than the relative abundance of microbial taxa. This also poses the question whether it is the presence rather than the abundance of particular taxa to be relevant for discrimination purposes, an aspect that has been so far overlooked in the literature. In this paper, we aim at filling this gap by performing a meta-analysis on 4,128 publicly available metagenomes associated with multiple case-control studies. At species-level taxonomic resolution, we show that it is the presence rather than the relative abundance of specific microbial taxa to be important when building classification models. Such findings are robust to the choice of the classifier and confirmed by statistical tests applied to identifying differentially abundant/present taxa. Results are further confirmed at coarser taxonomic resolutions and validated on 4,026 additional 16S rRNA samples coming from 30 public case-control studies.  相似文献   

5.
Selecting relevant features is a common task in most OMICs data analysis, where the aim is to identify a small set of key features to be used as biomarkers. To this end, two alternative but equally valid methods are mainly available, namely the univariate (filter) or the multivariate (wrapper) approach. The stability of the selected lists of features is an often neglected but very important requirement. If the same features are selected in multiple independent iterations, they more likely are reliable biomarkers. In this study, we developed and evaluated the performance of a novel method for feature selection and prioritization, aiming at generating robust and stable sets of features with high predictive power. The proposed method uses the fuzzy logic for a first unbiased feature selection and a Random Forest built from conditional inference trees to prioritize the candidate discriminant features. Analyzing several multi-class gene expression microarray data sets, we demonstrate that our technique provides equal or better classification performance and a greater stability as compared to other Random Forest-based feature selection methods.  相似文献   

6.
Sparse representation classification (SRC) is one of the most promising classification methods for supervised learning. This method can effectively exploit discriminating information by introducing a regularization terms to the data. With the desirable property of sparisty, SRC is robust to both noise and outliers. In this study, we propose a weighted meta-sample based non-parametric sparse representation classification method for the accurate identification of tumor subtype. The proposed method includes three steps. First, we extract the weighted meta-samples for each sub class from raw data, and the rationality of the weighting strategy is proven mathematically. Second, sparse representation coefficients can be obtained by regularization of underdetermined linear equations. Thus, data dependent sparsity can be adaptively tuned. A simple characteristic function is eventually utilized to achieve classification. Asymptotic time complexity analysis is applied to our method. Compared with some state-of-the-art classifiers, the proposed method has lower time complexity and more flexibility. Experiments on eight samples of publicly available gene expression profile data show the effectiveness of the proposed method.  相似文献   

7.
MOTIVATION: DNA arrays permit rapid, large-scale screening for patterns of gene expression and simultaneously yield the expression levels of thousands of genes for samples. The number of samples is usually limited, and such datasets are very sparse in high-dimensional gene space. Furthermore, most of the genes collected may not necessarily be of interest and uncertainty about which genes are relevant makes it difficult to construct an informative gene space. Unsupervised empirical sample pattern discovery and informative genes identification of such sparse high-dimensional datasets present interesting but challenging problems. RESULTS: A new model called empirical sample pattern detection (ESPD) is proposed to delineate pattern quality with informative genes. By integrating statistical metrics, data mining and machine learning techniques, this model dynamically measures and manipulates the relationship between samples and genes while conducting an iterative detection of informative space and the empirical pattern. The performance of the proposed method with various array datasets is illustrated.  相似文献   

8.
The availability of a great range of prior biological knowledge about the roles and functions of genes and gene-gene interactions allows us to simplify the analysis of gene expression data to make it more robust, compact, and interpretable. Here, we objectively analyze the applicability of functional clustering for the identification of groups of functionally related genes. The analysis is performed in terms of gene expression classification and uses predictive accuracy as an unbiased performance measure. Features of biological samples that originally corresponded to genes are replaced by features that correspond to the centroids of the gene clusters and are then used for classifier learning. Using 10 benchmark data sets, we demonstrate that functional clustering significantly outperforms random clustering without biological relevance. We also show that functional clustering performs comparably to gene expression clustering, which groups genes according to the similarity of their expression profiles. Finally, the suitability of functional clustering as a feature extraction technique is evaluated and discussed.  相似文献   

9.
MOTIVATION: Two practical realities constrain the analysis of microarray data, mass spectra from proteomics, and biomedical infrared or magnetic resonance spectra. One is the 'curse of dimensionality': the number of features characterizing these data is in the thousands or tens of thousands. The other is the 'curse of dataset sparsity': the number of samples is limited. The consequences of these two curses are far-reaching when such data are used to classify the presence or absence of disease. RESULTS: Using very simple classifiers, we show for several publicly available microarray and proteomics datasets how these curses influence classification outcomes. In particular, even if the sample per feature ratio is increased to the recommended 5-10 by feature extraction/reduction methods, dataset sparsity can render any classification result statistically suspect. In addition, several 'optimal' feature sets are typically identifiable for sparse datasets, all producing perfect classification results, both for the training and independent validation sets. This non-uniqueness leads to interpretational difficulties and casts doubt on the biological relevance of any of these 'optimal' feature sets. We suggest an approach to assess the relative quality of apparently equally good classifiers.  相似文献   

10.
ABSTRACT: BACKGROUND: Patients with Mild Cognitive Impairment (MCI) are at high risk of progression to Alzheimer's dementia. Identifying MCI individuals with high likelihood of conversion to dementia and the associated biosignatures has recently received increasing attention in AD research. Different biosignatures for AD (neuroimaging, demographic, genetic and cognitive measures) may contain complementary information for diagnosis and prognosis of AD. METHODS: We have conducted a comprehensive study using a large number of samples from the Alzheimer's Disease Neuroimaging Initiative (ADNI) to test the power of integrating various baseline data for predicting the conversion from MCI to probable AD and identifying a small subset of biosignatures for the prediction and assess the relative importance of different modalities in predicting MCI to AD conversion. We have employed sparse logistic regression with stability selection for the integration and selection of potential predictors. Our study differs from many of the other ones in three important respects: (1) we use a large cohort of MCI samples that are unbiased with respect to age or education status between case and controls (2) we integrate and test various types of baseline data available in ADNI including MRI, demographic, genetic and cognitive measures and (3) we apply sparse logistic regression with stability selection to ADNI data for robust feature selection. RESULTS: We have used 319 MCI subjects from ADNI that had MRI measurements at the baseline and passed quality control, including 177 MCI Non-converters and 142 MCI Converters. Conversion was considered over the course of a 4-year follow-up period. A combination of 15 features (predictors) including those from MRI scans, APOE genotyping, and cognitive measures achieves the best prediction with an AUC score of 0.8587. These results also demonstrate the effectiveness of stability selection for feature selection in the context of sparse logistic regression.  相似文献   

11.
12.
Considering the two-class classification problem in brain imaging data analysis, we propose a sparse representation-based multi-variate pattern analysis (MVPA) algorithm to localize brain activation patterns corresponding to different stimulus classes/brain states respectively. Feature selection can be modeled as a sparse representation (or sparse regression) problem. Such technique has been successfully applied to voxel selection in fMRI data analysis. However, single selection based on sparse representation or other methods is prone to obtain a subset of the most informative features rather than all. Herein, our proposed algorithm recursively eliminates informative features selected by a sparse regression method until the decoding accuracy based on the remaining features drops to a threshold close to chance level. In this way, the resultant feature set including all the identified features is expected to involve all the informative features for discrimination. According to the signs of the sparse regression weights, these selected features are separated into two sets corresponding to two stimulus classes/brain states. Next, in order to remove irrelevant/noisy features in the two selected feature sets, we perform a nonparametric permutation test at the individual subject level or the group level. In data analysis, we verified our algorithm with a toy data set and an intrinsic signal optical imaging data set. The results show that our algorithm has accurately localized two class-related patterns. As an application example, we used our algorithm on a functional magnetic resonance imaging (fMRI) data set. Two sets of informative voxels, corresponding to two semantic categories (i.e., “old people” and “young people”), respectively, are obtained in the human brain.  相似文献   

13.
Automatic species identification has many advantages over traditional species identification. Currently, most plant automatic identification methods focus on the features of leaf shape, venation and texture, which are promising for the identification of some plant species. However, leaf tooth, a feature commonly used in traditional species identification, is ignored. In this paper, a novel automatic species identification method using sparse representation of leaf tooth features is proposed. In this method, image corners are detected first, and the abnormal image corner is removed by the PauTa criteria. Next, the top and bottom leaf tooth edges are discriminated to effectively correspond to the extracted image corners; then, four leaf tooth features (Leaf-num, Leaf-rate, Leaf-sharpness and Leaf-obliqueness) are extracted and concatenated into a feature vector. Finally, a sparse representation-based classifier is used to identify a plant species sample. Tests on a real-world leaf image dataset show that our proposed method is feasible for species identification.  相似文献   

14.
This paper presents a method for direct identification of fungal species solely by means of digital image analysis of colonies as seen after growth on a standard medium. The method described is completely automated and hence objective once digital images of the reference fungi have been established. Using a digital image it is possible to extract precise information from the surface of the fungal colony. This includes color distribution, colony dimensions and texture measurements. For fungal identification, this is normally done by visual observation that often results in a very subjective data recording. Isolates of nine different species of the genus Penicillium have been selected for the purpose. After incubation for 7 days, the fungal colonies are digitized using a very accurate digital camera. Prior to the image analysis each image is corrected for self-illumination, thereby gaining a set of directly corresponding images with respect to illumination. A Windows application has been developed to locate the position and size of up to three colonies in the digitized image. Using the estimated positions and sizes of the colonies, a number of relevant features can be extracted for further analysis. The method used to determine the position of the colonies will be covered as well as the feature selection. The texture measurements of colonies of the nine species were analyzed and a clustering of the data into the correct species was confirmed. This indicates that it is indeed possible to identify a given colony merely by macromorphological features. A classifier (in the normal distribution) based on measurements of 151 colonies incubated on yeast extract sucrose agar (YES) was used to discriminate between the species. This resulted in a correct classification rate of 100% when used on the training set and 96% using cross-validation. The same methods applied to 194 colonies incubated on Czapek yeast extract agar (CYA) resulted in a correct classification rate of 98% on the training set and 71% using cross-validation.  相似文献   

15.
To investigate the feasibility of identification of qualified and adulterated oil product using hyperspectral imaging(HIS) technique, a novel feature set based on quantized histogram matrix (QHM) and feature selection method using improved kernel independent component analysis (iKICA) is proposed for HSI. We use UV and Halogen excitations in this study. Region of interest(ROI) of hyperspectral images of 256 oil samples from four varieties are obtained within the spectral region of 400–720nm. Radiation indexes extracted from each ROI are used as feature vectors. These indexes are individual band radiation index (RI), difference of consecutive spectral band radiation index (DRI), ratio of consecutive spectral band radiation index (RRI) and normalized DRI (NDRI). Another set of features called quantized histogram matrix (QHM) are extracted by applying quantization on the image histogram from these features. Based on these feature sets, improved kernel independent component analysis (iKICA) is used to select significant features. For comparison, algorithms such as plus L reduce R (plusLrR), Fisher, multidimensional scaling (MDS), independent component analysis (ICA), and principle component analysis (PCA) are also used to select the most significant wavelengths or features. Support vector machine (SVM) is used as the classifier. Experimental results show that the proposed methods are able to obtain robust and better classification performance with fewer number of spectral bands and simplify the design of computer vision systems.  相似文献   

16.
Large volumes of data are routinely collected during bioprocess operations and, more recently, in basic biological research using genomics-based technologies. While these data often lack sufficient detail to be used for mechanism identification, it is possible that the underlying mechanisms affecting cell phenotype or process outcome are reflected as specific patterns in the overall or temporal sensor logs. This raises the possibility of identifying outcome-specific fingerprints that can be used for process or phenotype classification and the identification of discriminating characteristics, such as specific genes or process variables. The aim of this work is to provide a systematic approach to identifying and modeling patterns in historical records and using this information for process classification. This approach differs from others in that emphasis is placed on analyzing the data structure first and thereby extracting potentially relevant features prior to model creation. The initial step in this overall approach is to first identify the discriminating features of the relevant measurements and time windows, which can then be subsequently used to discriminate among different classes of process behavior. This is achieved via a mean hypothesis testing algorithm. Next, the homogeneity of the multivariate data in each class is explored via a novel cluster analysis technique called PC1 Time Series Clustering to ensure that the data subsets used accurately reflect the variability displayed in the historical records. This will be the topic of the second paper in this series. We present here the method for identifying discriminating features in data via mean hypothesis testing along with results from the analysis of case studies from industrial fermentations Copyright 2000 Academic Press.  相似文献   

17.
Falin LJ  Tyler BM 《PloS one》2011,6(7):e22071
The widespread use of high-throughput experimental assays designed to measure the entire complement of a cell's genes or gene products has led to vast stores of data that are extremely plentiful in terms of the number of items they can measure in a single sample, yet often sparse in the number of samples per experiment due to their high cost. This often leads to datasets where the number of treatment levels or time points sampled is limited, or where there are very small numbers of technical and/or biological replicates. Here we introduce a novel algorithm to quantify the uncertainty in the unmeasured intervals between biological measurements taken across a set of quantitative treatments. The algorithm provides a probabilistic distribution of possible gene expression values within unmeasured intervals, based on a plausible biological constraint. We show how quantification of this uncertainty can be used to guide researchers in further data collection by identifying which samples would likely add the most information to the system under study. Although the context for developing the algorithm was gene expression measurements taken over a time series, the approach can be readily applied to any set of quantitative systems biology measurements taken following quantitative (i.e. non-categorical) treatments. In principle, the method could also be applied to combinations of treatments, in which case it could greatly simplify the task of exploring the large combinatorial space of future possible measurements.  相似文献   

18.
Nowadays, brain signals are employed in various scientific and practical fields such as Medical Science, Cognitive Science, Neuroscience, and Brain Computer Interfaces. Hence, the need for robust signal analysis methods with adequate accuracy and generalizability is inevitable. The brain signal analysis is faced with complex challenges including small sample size, high dimensionality and noisy signals. Moreover, because of the non-stationarity of brain signals and the impacts of mental states on brain function, the brain signals are associated with an inherent uncertainty. In this paper, an evidence-based combining classifiers method is proposed for brain signal analysis. This method exploits the power of combining classifiers for solving complex problems and the ability of evidence theory to model as well as to reduce the existing uncertainty. The proposed method models the uncertainty in the labels of training samples in each feature space by assigning soft and crisp labels to them. Then, some classifiers are employed to approximate the belief function corresponding to each feature space. By combining the evidence raised from each classifier through the evidence theory, more confident decisions about testing samples can be made. The obtained results by the proposed method compared to some other evidence-based and fixed rule combining methods on artificial and real datasets exhibit the ability of the proposed method in dealing with complex and uncertain classification problems.  相似文献   

19.
In terms of making genes expression data more interpretable and comprehensible, there exists a significant superiority on sparse methods. Many sparse methods, such as penalized matrix decomposition (PMD) and sparse principal component analysis (SPCA), have been applied to extract plants core genes. Supervised algorithms, especially the support vector machine-recursive feature elimination (SVM-RFE) method, always have good performance in gene selection. In this paper, we draw into class information via the total scatter matrix and put forward a class-information-based penalized matrix decomposition (CIPMD) method to improve the gene identification performance of PMD-based method. Firstly, the total scatter matrix is obtained based on different samples of the gene expression data. Secondly, a new data matrix is constructed by decomposing the total scatter matrix. Thirdly, the new data matrix is decomposed by PMD to obtain the sparse eigensamples. Finally, the core genes are identified according to the nonzero entries in eigensamples. The results on simulation data show that CIPMD method can reach higher identification accuracies than the conventional gene identification methods. Moreover, the results on real gene expression data demonstrate that CIPMD method can identify more core genes closely related to the abiotic stresses than the other methods.  相似文献   

20.
设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号