
1.

Classification using partial least squares with penalized logistic regression

Fort G Lambert-Lacroix S 《Bioinformatics (Oxford, England)》2005,21(7):1104-1111

MOTIVATION: One important aspect of data mining of microarray data is to discover the molecular variation among cancers. In microarray studies, the number n of samples is relatively small compared to the number p of genes per sample (usually in the thousands). It is known that standard statistical methods in classification are efficient (i.e. in the present case, yield successful classifiers) particularly when n is (far) larger than p. This naturally calls for the use of a dimension reduction procedure together with the classification procedure. RESULTS: In this paper, the question of classification in such a high-dimensional setting is addressed. We view the classification problem as a regression problem with few observations and many predictor variables. We propose a new method combining partial least squares (PLS) and Ridge penalized logistic regression. We review the existing methods based on PLS and/or penalized likelihood techniques, outline their interest in some cases and theoretically explain their sometimes poor behavior. Our procedure is compared with these other classifiers. The predictive performance of the resulting classification rule is illustrated on three data sets: Leukemia, Colon and Prostate.
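
The two-step idea described in this abstract (reduce dimension with PLS, then classify with a penalized logistic model) can be illustrated with a small sketch. This is not the authors' algorithm: it extracts only the first PLS component, fits a ridge-penalized logistic regression on that single score by plain gradient descent, and runs on made-up toy data with p > n. All function names, the penalty `lam`, the learning rate and the data are assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def center_columns(X):
    """Subtract each column mean; return the centered rows and the means."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    return [[row[j] - means[j] for j in range(p)] for row in X], means

def first_pls_direction(Xc, y):
    """First PLS weight vector: proportional to Xc^T (y - mean(y)), unit length."""
    n, p = len(Xc), len(Xc[0])
    y_mean = sum(y) / n
    w = [sum(Xc[i][j] * (y[i] - y_mean) for i in range(n)) for j in range(p)]
    norm = math.sqrt(sum(v * v for v in w))
    return [v / norm for v in w]

def fit_ridge_logistic_1d(t, y, lam=0.1, lr=0.1, steps=2000):
    """Gradient descent on mean log-loss + lam * b**2 / 2.
    The ridge penalty applies to the slope b only, not the intercept a."""
    a = b = 0.0
    n = len(t)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for ti, yi in zip(t, y):
            err = sigmoid(a + b * ti) - yi
            grad_a += err / n
            grad_b += err * ti / n
        grad_b += lam * b
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b

def predict_proba(x, means, w, a, b):
    """Project a new sample on the PLS direction, then apply the logistic model."""
    t = sum((x[j] - means[j]) * w[j] for j in range(len(w)))
    return sigmoid(a + b * t)

# Hypothetical toy data: 6 samples, 8 "genes"; only genes 0 and 1 carry signal.
X = [
    [2.1, 1.9, 0.3, -0.2, 0.1, 0.4, -0.3, 0.0],
    [1.8, 2.2, -0.1, 0.3, -0.4, 0.1, 0.2, -0.2],
    [2.3, 2.0, 0.2, 0.1, 0.3, -0.3, 0.0, 0.3],
    [-2.0, -1.8, 0.1, -0.3, 0.2, 0.2, -0.1, 0.1],
    [-1.9, -2.1, -0.2, 0.2, -0.1, -0.2, 0.3, -0.3],
    [-2.2, -2.0, 0.0, 0.0, 0.0, 0.3, 0.1, 0.2],
]
y = [1, 1, 1, 0, 0, 0]

Xc, means = center_columns(X)
w = first_pls_direction(Xc, y)
scores = [sum(row[j] * w[j] for j in range(len(w))) for row in Xc]
a, b = fit_ridge_logistic_1d(scores, y)
train_probs = [predict_proba(x, means, w, a, b) for x in X]
new_prob = predict_proba([2.0, 2.0, 0, 0, 0, 0, 0, 0], means, w, a, b)
```

Because the classifier sees only one score per sample, the ridge penalty acts on a single coefficient here; with several PLS components the same penalty would apply to each slope.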

2.

Partial least squares dimension reduction for microarray gene expression data with a censored response

Nguyen DV 《Mathematical biosciences》2005,193(1):119-137

3.

Independent component analysis-based penalized discriminant method for tumor classification using gene expression data (total citations: 2; self-citations: 0; citations by others: 2)

Huang DS Zheng CH 《Bioinformatics (Oxford, England)》2006,22(15):1855-1862

4.

Application of a microarray data integration method to predicting subtypes of pediatric acute myeloid leukemia

郭景康 朱煜 王健 《生物信息学》2009,7(2):99-103

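Accurate subtype classification of patients with acute myeloid leukemia (AML) helps in choosing an appropriate treatment plan and predicting its outcome. Previous studies have shown that microarray technology performs well in leukemia subtype classification. However, because the incidence of pediatric AML is low, few microarray studies address it, so the data available for building pediatric AML subtype classification models are relatively scarce, and whether existing data from adult classification models can be used to predict pediatric AML remains to be studied. Using a microarray integration analysis method, microarray data from different experiments that studied adult or pediatric AML subtype classification were integrated, and support vector machines were used to assess the subtype prediction accuracy on the integrated data set. The results show that the integrated microarray data reach an accuracy of 97.24% in predicting pediatric AML subtypes. Feature gene analysis also indicates that, within the same AML subtype, the feature genes show highly similar expression across samples from different age groups.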

5.

A comparison of methods for classifying clinical samples based on proteomics data: a case study for statistical and machine learning approaches

Sampson DL Parker TJ Upton Z Hurst CP 《PloS one》2011,6(9):e24973

The discovery of protein variation is an important strategy in disease diagnosis within the biological sciences. The current benchmark for elucidating information from multiple biological variables is the so-called “omics” disciplines of the biological sciences. Such variability is uncovered by implementation of multivariable data mining techniques which come under two primary categories, machine learning strategies and statistically based approaches. Typically proteomic studies can produce hundreds or thousands of variables, p, per observation, n, depending on the analytical platform or method employed to generate the data. Many classification methods are limited by an n≪p constraint, and as such, require pre-treatment to reduce the dimensionality prior to classification. Recently machine learning techniques have gained popularity in the field for their ability to successfully classify unknown samples. One limitation of such methods is the lack of a functional model allowing meaningful interpretation of results in terms of the features used for classification. This is a problem that might be solved using a statistical model-based approach where not only is the importance of each individual protein explicit, but the proteins are also combined into a readily interpretable classification rule without relying on a black box approach. Here we incorporate the statistical dimension reduction techniques Partial Least Squares (PLS) and Principal Components Analysis (PCA), followed by both statistical and machine learning classification methods, and compare them to a popular machine learning technique, Support Vector Machines (SVM). Both PLS and SVM demonstrate strong utility for proteomic classification problems.

6.

Partial least squares proportional hazard regression for application to DNA microarray survival data (total citations: 3; self-citations: 0; citations by others: 3)

Nguyen DV Rocke DM 《Bioinformatics (Oxford, England)》2002,18(12):1625-1632

7.

Multi-class cancer classification via partial least squares with gene expression profiles (total citations: 8; self-citations: 0; citations by others: 8)

Nguyen DV Rocke DM 《Bioinformatics (Oxford, England)》2002,18(9):1216-1226

MOTIVATION: Discrimination between two classes such as normal and cancer samples and between two types of cancers based on gene expression profiles is an important problem which has practical implications as well as the potential to further our understanding of gene expression of various cancer cells. Classification or discrimination of more than two groups or classes (multi-class) is also needed. The need for multi-class discrimination methodologies is apparent in many microarray experiments where various cancer types are considered simultaneously. RESULTS: In this paper we present the extension of the classification methodology proposed earlier by Nguyen and Rocke (2002b; Bioinformatics, 18, 39-50) to classify cancer samples from multiple classes. The methodologies proposed in this paper are applied to four gene expression data sets with multiple classes: (a) a hereditary breast cancer data set with (1) BRCA1-mutation, (2) BRCA2-mutation and (3) sporadic breast cancer samples, (b) an acute leukemia data set with (1) acute myeloid leukemia (AML), (2) T-cell acute lymphoblastic leukemia (T-ALL) and (3) B-cell acute lymphoblastic leukemia (B-ALL) samples, (c) a lymphoma data set with (1) diffuse large B-cell lymphoma (DLBCL), (2) B-cell chronic lymphocytic leukemia (BCLL) and (3) follicular lymphoma (FL) samples, and (d) the NCI60 data set with cell lines derived from cancers of various sites of origin. In addition, we evaluated the classification algorithms and examined the variability of the error rates using simulations based on randomization of the real data sets. We note that other methods for addressing multi-class prediction have been proposed recently and that our approach is along the line of Nguyen and Rocke (2002b; Bioinformatics, 18, 39-50). CONTACT: dnguyen@stat.tamu.edu; dmrocke@ucdavis.edu

8.

Predicting patient survival from microarray data by accelerated failure time modeling using partial least squares and LASSO

Datta S Le-Rademacher J Datta S 《Biometrics》2007,63(1):259-271

We consider the problem of predicting survival times of cancer patients from the gene expression profiles of their tumor samples via linear regression modeling of log-transformed failure times. The partial least squares (PLS) and least absolute shrinkage and selection operator (LASSO) methodologies are used for this purpose, where we first modify the data to account for censoring. Three approaches to handling right-censored data (reweighting, mean imputation, and multiple imputation) are considered. Their performances are examined in a detailed simulation study and compared with that of full data PLS and LASSO had there been no censoring. A major objective of this article is to investigate the performances of PLS and LASSO in the context of microarray data where the number of covariates is very large and there are extremely few samples. We demonstrate that LASSO outperforms PLS in terms of prediction error when the list of covariates includes a moderate to large percentage of useless or noise variables; otherwise, PLS may outperform LASSO. For a moderate sample size (100 with 10,000 covariates), LASSO performed better than a no covariate model (or noise-based prediction). The mean imputation method appears to best track the performance of the full data PLS or LASSO. The mean imputation scheme is used on an existing data set on lung cancer. This reanalysis using the mean imputed PLS and LASSO identifies a number of genes that were known to be related to cancer or tumor activities from previous studies.

9.

Application of PCA and PLS to gastric cancer subtype classification

李建更 李萍 李君 阮晓钢 《生物物理学报》2009,25(2):141-147

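This article studies gastric cancer subtype classification based on microarray gene expression data. Such data have few samples, high dimensionality and high noise, which makes dimension reduction the key to successful classification. The authors apply two dimension reduction methods, principal component analysis (PCA) and partial least squares (PLS), to gastric cancer subtype classification, using support vector machines (SVM) and K-nearest neighbors (KNN) as classifiers on two gastric cancer data sets. Classification performance was slightly better than that of traditional medical diagnosis, with a maximum accuracy of 100%. The results show that PCA and PLS can effectively extract the feature information needed for classification and can greatly reduce the dimensionality of gene expression data while maintaining high classification accuracy.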

10.

Multidimensional support vector machines for visualization of gene expression data

Komura D Nakamura H Tsutsumi S Aburatani H Ihara S 《Bioinformatics (Oxford, England)》2005,21(4):439-444

MOTIVATION: Since DNA microarray experiments provide us with a huge amount of gene expression data, they should be analyzed with statistical methods to extract the meaning of the experimental results. Some dimensionality reduction methods such as Principal Component Analysis (PCA) are used to roughly visualize the distribution of high dimensional gene expression data. However, in the case of binary classification of gene expression data, PCA does not utilize class information when choosing axes. Thus data that are clearly separable in the original space may not be so in the reduced space used in PCA. RESULTS: For visualization and class prediction of gene expression data, we have developed a new SVM-based method called multidimensional SVMs, which generates multiple orthogonal axes. This method projects high dimensional data into a lower dimensional space to exhibit properties of the data clearly and to visualize the distribution of the data roughly. Furthermore, the multiple axes can be used for class prediction. The basic properties of conventional SVMs are retained in our method: solutions of mathematical programming are sparse, and nonlinear classification is implemented implicitly through the use of kernel functions. The application of our method to experimentally obtained gene expression datasets for patients' samples indicates that our algorithm is efficient and useful for visualization and class prediction. CONTACT: komura@hal.rcast.u-tokyo.ac.jp.
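
The point that PCA ignores class labels when choosing axes can be seen on a tiny two-feature example. The sketch below is only an illustration on made-up data; it does not implement the multidimensional SVM method of the paper. As a supervised contrast it uses the first partial least squares (PLS) weight vector, which is my choice for the comparison. Function names and data are assumptions.

```python
import math

def center(X):
    """Subtract column means from a list of 2-feature rows."""
    n = len(X)
    m = [sum(r[j] for r in X) / n for j in range(2)]
    return [[r[0] - m[0], r[1] - m[1]] for r in X]

def first_pca_direction_2d(Xc):
    """Leading eigenvector of the 2x2 sample covariance matrix (closed form)."""
    n = len(Xc)
    sxx = sum(r[0] * r[0] for r in Xc) / (n - 1)
    syy = sum(r[1] * r[1] for r in Xc) / (n - 1)
    sxy = sum(r[0] * r[1] for r in Xc) / (n - 1)
    lam = 0.5 * (sxx + syy) + math.sqrt(0.25 * (sxx - syy) ** 2 + sxy ** 2)
    if abs(sxy) > 1e-12:
        v = [sxy, lam - sxx]          # solves (S - lam*I) v = 0
    else:
        v = [1.0, 0.0] if sxx >= syy else [0.0, 1.0]
    norm = math.hypot(v[0], v[1])
    return [v[0] / norm, v[1] / norm]

def first_pls_direction_2d(Xc, y):
    """First PLS weight vector: proportional to Xc^T (y - mean(y))."""
    y_mean = sum(y) / len(y)
    w = [sum(r[j] * (yi - y_mean) for r, yi in zip(Xc, y)) for j in range(2)]
    norm = math.hypot(w[0], w[1])
    return [w[0] / norm, w[1] / norm]

def project(Xc, d):
    return [r[0] * d[0] + r[1] * d[1] for r in Xc]

def separated(s_pos, s_neg):
    """True if the two groups of 1-D scores do not overlap."""
    return min(s_pos) > max(s_neg) or max(s_pos) < min(s_neg)

# Hypothetical data: feature 0 has large variance unrelated to the class,
# feature 1 has small variance but separates the two classes.
X = [[-4.0, 0.5], [-2.0, 0.4], [2.0, 0.6], [4.0, 0.5],
     [-4.0, -0.5], [-2.0, -0.6], [2.0, -0.4], [4.0, -0.5]]
y = [1, 1, 1, 1, 0, 0, 0, 0]

Xc = center(X)
pca_dir = first_pca_direction_2d(Xc)
pls_dir = first_pls_direction_2d(Xc, y)

pca_scores = project(Xc, pca_dir)
pls_scores = project(Xc, pls_dir)
pca_pos = [s for s, yi in zip(pca_scores, y) if yi == 1]
pca_neg = [s for s, yi in zip(pca_scores, y) if yi == 0]
pls_pos = [s for s, yi in zip(pls_scores, y) if yi == 1]
pls_neg = [s for s, yi in zip(pls_scores, y) if yi == 0]

print("PCA axis separates classes:", separated(pca_pos, pca_neg))
print("PLS axis separates classes:", separated(pls_pos, pls_neg))
```

On this data the first principal axis follows the high-variance feature and mixes the classes, while the label-aware direction follows the low-variance feature that actually distinguishes them.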

11.

Application of wavelet-based neural network on DNA microarray data

Jack Lee Benny Zee 《Bioinformation》2008,3(5):223-229

The advantage of using DNA microarray data when investigating human cancer gene expression is its ability to generate an enormous amount of information from a single assay in order to speed up the scientific evaluation process. The number of variables in gene expression data, coupled with a comparatively much smaller number of samples, creates new challenges for scientists and statisticians. In particular, the problems include an enormous degree of collinearity among gene expressions, likely violation of model assumptions, and a high level of noise with potential outliers. To deal with these problems, we propose a block wavelet shrinkage principal component analysis (BWSPCA) method to optimize the information during the noise reduction process. This paper first uses the National Cancer Institute database (NCI60) as an illustration and shows a significant improvement in dimension reduction. Second, we combine BWSPCA with an artificial neural network-based gene minimization strategy to establish a Block Wavelet-based Neural Network (BWNN) model for a robust and accurate cancer classification process. Our extensive experiments on six public cancer datasets have shown that BWNN performed well for tumor classification, especially on some difficult instances with large-class (more than two) expression data. The proposed method is extremely useful for data denoising and is competitive with other methods such as BagBoost, RandomForest (RanFor), Support Vector Machines (SVM), K-Nearest Neighbor (KNN) and Artificial Neural Network (ANN).

12.

3′-End Sequencing for Expression Quantification (3SEQ) from Archival Tumor Samples

Andrew H. Beck Ziming Weng Daniela M. Witten Shirley Zhu Joseph W. Foley Phil Lacroute Cheryl L. Smith Robert Tibshirani Matt van de Rijn Arend Sidow Robert B. West 《PloS one》2010,5(1)

Gene expression microarrays are the most widely used technique for genome-wide expression profiling. However, microarrays do not perform well on formalin-fixed paraffin-embedded tissue (FFPET). Consequently, microarrays cannot be effectively utilized to perform gene expression profiling on the vast majority of archival tumor samples. To address this limitation of gene expression microarrays, we designed a novel procedure (3′-end sequencing for expression quantification (3SEQ)) for gene expression profiling from FFPET using next-generation sequencing. We performed gene expression profiling by 3SEQ and microarray on both frozen tissue and FFPET from two soft tissue tumors (desmoid type fibromatosis (DTF) and solitary fibrous tumor (SFT)) (total n = 23 samples, which were each profiled by at least one of the four platform-tissue preparation combinations). Analysis of 3SEQ data revealed many genes differentially expressed between the tumor types (FDR<0.01) on both the frozen tissue (∼9.6K genes) and FFPET (∼8.1K genes). Analysis of microarray data from frozen tissue revealed fewer differentially expressed genes (∼4.64K), and analysis of microarray data on FFPET revealed very few (69) differentially expressed genes. Functional gene set analysis of 3SEQ data from both frozen tissue and FFPET identified biological pathways known to be important in DTF and SFT pathogenesis and suggested several additional candidate oncogenic pathways in these tumors. These findings demonstrate that 3SEQ is an effective technique for gene expression profiling from archival tumor samples and may facilitate significant advances in translational cancer research.

13.

Partial least squares: a versatile tool for the analysis of high-dimensional genomic data (total citations: 3; self-citations: 0; citations by others: 3)

Boulesteix AL Strimmer K 《Briefings in bioinformatics》2007,8(1):32-44

14.

Phylogenetic modeling of heterogeneous gene-expression microarray data from cancerous specimens

Abu-Asab MS Chaouchi M Amri H 《Omics : a journal of integrative biology》2008,12(3):183-199

The qualitative dimension of gene expression data and its heterogeneous nature in cancerous specimens can be accounted for by phylogenetic modeling that incorporates the directionality of altered gene expressions, complex patterns of expressions among a group of specimens, and data-based rather than specimen-based gene linkage. Our phylogenetic modeling approach is a double algorithmic technique that includes polarity assessment, which brings out the qualitative value of the data, followed by maximum parsimony analysis, which is most suitable for the data heterogeneity of cancer gene expression. We demonstrate that polarity assessment of expression values into derived and ancestral states, via outgroup comparison, reduces experimental noise; reveals dichotomously expressed asynchronous genes; and allows data pooling as well as intra- and interplatform comparability. Parsimony phylogenetic analysis of the polarized values produces a multidimensional classification of specimens into clades that reveal shared derived gene expressions (the synapomorphies); provides better assessment of ontogenic pathways and phyletic relatedness of specimens; efficiently utilizes dichotomously expressed genes; produces highly predictive class recognition; illustrates gene linkage and multiple developmental pathways; provides higher concordance between gene lists; and projects the direction of change among specimens. A further implication of this phylogenetic approach is that it may transform microarrays into a diagnostic, prognostic, and predictive tool.

15.

Classification of microarray data with factor mixture models

Martella F 《Bioinformatics (Oxford, England)》2006,22(2):202-208

16.

Graphical methods for class prediction using dimension reduction techniques on DNA microarray data

Bura E Pfeiffer RM 《Bioinformatics (Oxford, England)》2003,19(10):1252-1258

MOTIVATION: We introduce simple graphical classification and prediction tools for tumor status using gene-expression profiles. They are based on two dimension estimation techniques, sliced average variance estimation (SAVE) and sliced inverse regression (SIR). Both SAVE and SIR are used to make inferences about the dimension of the classification problem and to obtain linear combinations of genes that contain sufficient information to predict class membership, such as tumor type. Plots of the estimated directions as well as numerical thresholds estimated from the plots are used to predict tumor classes in cDNA microarrays, and the performance of the class predictors is assessed by cross-validation. A microarray simulation study is carried out to compare the power and predictive accuracy of the two methods. RESULTS: The methods are applied to cDNA microarray data on BRCA1 and BRCA2 mutation carriers as well as sporadic tumors from Hedenfalk et al. (2001). All samples are correctly classified.

17.

A Microarray Platform-Independent Classification Tool for Cell of Origin Class Allows Comparative Analysis of Gene Expression in Diffuse Large B-cell Lymphoma

Matthew A. Care Sharon Barrans Lisa Worrillow Andrew Jack David R. Westhead Reuben M. Tooze 《PloS one》2013,8(2)

18.

Effective dimension reduction methods for tumor classification using gene expression data (total citations: 2; self-citations: 0; citations by others: 2)

Antoniadis A Lambert-Lacroix S Leblanc F 《Bioinformatics (Oxford, England)》2003,19(5):563-570

MOTIVATION: One particular application of microarray data is to uncover the molecular variation among cancers. One feature of microarray studies is the fact that the number n of samples collected is relatively small compared to the number p of genes per sample, which is usually in the thousands. In statistical terms, this very large number of predictors compared to a small number of samples or observations makes the classification problem difficult. An efficient way to solve this problem is by using dimension reduction statistical techniques in conjunction with nonparametric discriminant procedures. RESULTS: We view the classification problem as a regression problem with few observations and many predictor variables. We use an adaptive dimension reduction method for generalized semi-parametric regression models that allows us to solve the 'curse of dimensionality' problem arising in the context of expression data. The predictive performance of the resulting classification rule is illustrated on two well-known data sets in the microarray literature: the leukemia data, which are known to contain classes that are easily 'separable', and the colon data set.

19.

Microarray learning with ABC

Amaratunga D Cabrera J Kovtun V 《Biostatistics (Oxford, England)》2008,9(1):128-136

Standard clustering algorithms when applied to DNA microarray data often tend to produce erroneous clusters. A major contributor to this divergence is a characteristic feature of microarray data sets: the number of predictors (genes) in such data exceeds the number of samples by many orders of magnitude, with only a small percentage of predictors being truly informative with regard to the clustering while the rest merely add noise. An additional complication is that the predictors exhibit an unknown complex correlational configuration embedded in a small subspace of the entire predictor space. Under these conditions, standard clustering algorithms fail to find the true clusters even when applied in tandem with some sort of gene filtering or dimension reduction to reduce the number of predictors. We propose, as an alternative, a novel method for unsupervised classification of DNA microarray data. The method, which is based on the idea of aggregating results obtained from an ensemble of randomly resampled data (where both samples and genes are resampled), introduces a way of tilting the procedure so that the ensemble includes minimal representation from less important areas of the gene predictor space. The method produces a measure of dissimilarity between each pair of samples that can be used in conjunction with (a) a method like Ward's procedure to generate a cluster analysis and (b) multidimensional scaling to generate useful visualizations of the data. We call the dissimilarity measures ABC dissimilarities since they are obtained by aggregating bundles of clusters. An extensive comparison of several clustering methods using actual DNA microarray data convincingly demonstrates that classification using ABC dissimilarities offers significantly superior performance.
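
The general idea of an ensemble-based dissimilarity (resample samples and genes, cluster each resample, and count how often two samples end up apart) can be sketched as follows. This is a loose illustration under my own assumptions, not the published ABC procedure: genes are drawn with probability proportional to their variance as a stand-in for "tilting", each resample is split into two groups by a crude farthest-pair rule, and the data are invented. All names and parameter values are hypothetical.

```python
import random

def gene_variances(X):
    """Sample variance of each column (gene)."""
    n, p = len(X), len(X[0])
    out = []
    for j in range(p):
        m = sum(r[j] for r in X) / n
        out.append(sum((r[j] - m) ** 2 for r in X) / (n - 1))
    return out

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def two_group_split(rows):
    """Crude 2-cluster rule: seed with the two most distant rows,
    then assign every row to the nearer seed."""
    if len(rows) < 2:
        return [0] * len(rows)
    best = (-1.0, 0, 1)
    for a in range(len(rows)):
        for b in range(a + 1, len(rows)):
            d = sq_dist(rows[a], rows[b])
            if d > best[0]:
                best = (d, a, b)
    _, a, b = best
    return [0 if sq_dist(r, rows[a]) <= sq_dist(r, rows[b]) else 1 for r in rows]

def ensemble_dissimilarity(X, n_rounds=200, n_genes=3, seed=0):
    """Fraction of resamples (among those containing both samples)
    in which samples i and j fall into different clusters."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    weights = gene_variances(X)            # assumed tilting: favor variable genes
    together = [[0] * n for _ in range(n)]
    both = [[0] * n for _ in range(n)]
    for _ in range(n_rounds):
        samples = sorted(set(rng.choices(range(n), k=n)))   # bootstrap, duplicates dropped
        genes = rng.choices(range(p), weights=weights, k=n_genes)
        labels = two_group_split([[X[i][g] for g in genes] for i in samples])
        for a, i in enumerate(samples):
            for b, j in enumerate(samples):
                both[i][j] += 1
                if labels[a] == labels[b]:
                    together[i][j] += 1
    return [[1.0 - together[i][j] / both[i][j] if both[i][j] else 1.0
             for j in range(n)] for i in range(n)]

# Hypothetical data: 8 samples in two groups; genes 0-1 informative, 2-5 noise.
X = [
    [3.0, 2.8, 0.1, -0.1, 0.2, 0.0],
    [2.9, 3.1, -0.2, 0.1, 0.0, 0.1],
    [3.2, 3.0, 0.0, 0.2, -0.1, -0.2],
    [2.8, 2.9, 0.2, 0.0, 0.1, 0.2],
    [-3.0, -2.9, 0.1, 0.1, -0.2, 0.1],
    [-3.1, -3.0, -0.1, -0.2, 0.1, 0.0],
    [-2.9, -3.2, 0.0, 0.1, 0.2, -0.1],
    [-2.8, -3.0, 0.2, -0.1, 0.0, 0.2],
]
D = ensemble_dissimilarity(X)
```

The resulting matrix `D` could then be passed to a hierarchical clustering routine or to multidimensional scaling, which is how the abstract describes the dissimilarities being used.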

20.

Inference in morphological taxonomy using collinear data and small sample sizes: Monogenean sclerites (Platyhelminthes) as a case study

Matthias Vignon 《Zoologica scripta》2011,40(3):306-316

Taxonomists and evolutionary biologists frequently use a combination of morphological measurements to distinguish between species and investigate local adaptation. However, the entire set of characters often displays various degrees of collinearity. This paper discusses the effect of using collinear data in morphological taxonomy and ways to handle multicollinearity in a classification context, with special consideration for small sample sizes. In addition, I propose a robust and easy-to-use combination of dimension reduction using partial least squares (PLS) with traditional discriminant methods for morphological data. To do this, I investigated morphological variation patterns among four monogenean populations from the Pacific Ocean using the correlated morphological features of the sclerotized attachment organ. The new approach yielded better prediction results (lower classification error rates) than the traditional dimension reduction method based on principal component analysis (PCA) and is also much more robust for small sample sizes. This emphasizes that PLS may be more efficient than PCA in dealing with correlated data and extracting the most relevant morphological differences among groups.