1.

Tumor classification by partial least squares using microarray gene expression data (total citations: 30; self-citations: 0; citations by others: 30)

Nguyen DV Rocke DM 《Bioinformatics (Oxford, England)》2002,18(1):39-50

MOTIVATION: One important application of gene expression microarray data is classification of samples into categories, such as the type of tumor. The use of microarrays allows simultaneous monitoring of thousands of gene expressions per sample. This ability to measure gene expression en masse has resulted in data with the number of variables p (genes) far exceeding the number of samples N. Standard statistical methodologies in classification and prediction do not work well or even at all when N < p. Modification of existing statistical methodologies or development of new methodologies is needed for the analysis of microarray data. RESULTS: We propose a novel analysis procedure for classifying (predicting) human tumor samples based on microarray gene expressions. This procedure involves dimension reduction using Partial Least Squares (PLS) and classification using Logistic Discrimination (LD) and Quadratic Discriminant Analysis (QDA). We compare PLS to the well-known dimension reduction method of Principal Components Analysis (PCA). Under many circumstances PLS proves superior; we illustrate a condition when PCA particularly fails to predict well relative to PLS. The proposed methods were applied to five different microarray data sets involving various human tumor samples: (1) normal versus ovarian tumor; (2) Acute Myeloid Leukemia (AML) versus Acute Lymphoblastic Leukemia (ALL); (3) Diffuse Large B-cell Lymphoma (DLBCLL) versus B-cell Chronic Lymphocytic Leukemia (BCLL); (4) normal versus colon tumor; and (5) Non-Small-Cell-Lung-Carcinoma (NSCLC) versus renal samples. Stability of classification results and methods was further assessed by re-randomization studies.
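The dimension-reduction step described in this abstract can be illustrated with a minimal sketch, assuming a single binary response and the standard NIPALS form of PLS1. The function name `pls1_scores`, the synthetic data, and the choice of three components are illustrative assumptions, not the authors' code. The sketch shows only how a wide matrix (p far larger than N) is compressed into a few score vectors; the classification step (LD or QDA in the abstract) is not included.

```python
import numpy as np

def pls1_scores(X, y, n_components):
    """Sketch of PLS1 via NIPALS: returns weights W (p x a) and scores T (n x a)."""
    Xk = X - X.mean(axis=0)          # center each gene (column)
    yk = y - y.mean()                # center the response
    n, p = Xk.shape
    W = np.zeros((p, n_components))
    T = np.zeros((n, n_components))
    for a in range(n_components):
        w = Xk.T @ yk                # direction of maximal covariance with y
        w /= np.linalg.norm(w)
        t = Xk @ w                   # score vector for this component
        tt = t @ t
        p_load = Xk.T @ t / tt       # X loading
        q = yk @ t / tt              # y loading
        Xk = Xk - np.outer(t, p_load)  # deflate X
        yk = yk - q * t                # deflate y
        W[:, a] = w
        T[:, a] = t
    return W, T

# Hypothetical data: 20 samples, 200 "genes", two balanced classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 200))
y = np.repeat([0.0, 1.0], 10)
W, T = pls1_scores(X, y, n_components=3)
```

The columns of `T` are mutually orthogonal by construction, so they can be passed to a low-dimensional classifier without the collinearity problems of the original gene matrix. Unlike PCA, each direction is chosen using the response `y`.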

2.

Predicting patient survival from microarray data by accelerated failure time modeling using partial least squares and LASSO

Datta S Le-Rademacher J Datta S 《Biometrics》2007,63(1):259-271

We consider the problem of predicting survival times of cancer patients from the gene expression profiles of their tumor samples via linear regression modeling of log-transformed failure times. The partial least squares (PLS) and least absolute shrinkage and selection operator (LASSO) methodologies are used for this purpose, where we first modify the data to account for censoring. Three approaches to handling right-censored data (reweighting, mean imputation, and multiple imputation) are considered. Their performances are examined in a detailed simulation study and compared with that of full-data PLS and LASSO had there been no censoring. A major objective of this article is to investigate the performances of PLS and LASSO in the context of microarray data where the number of covariates is very large and there are extremely few samples. We demonstrate that LASSO outperforms PLS in terms of prediction error when the list of covariates includes a moderate to large percentage of useless or noise variables; otherwise, PLS may outperform LASSO. For a moderate sample size (100 with 10,000 covariates), LASSO performed better than a no-covariate model (or noise-based prediction). The mean imputation method appears to best track the performance of the full-data PLS or LASSO. The mean imputation scheme is used on an existing data set on lung cancer. This reanalysis using the mean-imputed PLS and LASSO identifies a number of genes that were known to be related to cancer or tumor activities from previous studies.

3.

Multi-class cancer classification via partial least squares with gene expression profiles (total citations: 8; self-citations: 0; citations by others: 8)

Nguyen DV Rocke DM 《Bioinformatics (Oxford, England)》2002,18(9):1216-1226

MOTIVATION: Discrimination between two classes such as normal and cancer samples and between two types of cancers based on gene expression profiles is an important problem which has practical implications as well as the potential to further our understanding of gene expression of various cancer cells. Classification or discrimination of more than two groups or classes (multi-class) is also needed. The need for multi-class discrimination methodologies is apparent in many microarray experiments where various cancer types are considered simultaneously. RESULTS: In this paper we present an extension of the classification methodology proposed earlier by Nguyen and Rocke (2002b; Bioinformatics, 18, 39-50) to classify cancer samples from multiple classes. The methodologies proposed in this paper are applied to four gene expression data sets with multiple classes: (a) a hereditary breast cancer data set with (1) BRCA1-mutation, (2) BRCA2-mutation and (3) sporadic breast cancer samples, (b) an acute leukemia data set with (1) acute myeloid leukemia (AML), (2) T-cell acute lymphoblastic leukemia (T-ALL) and (3) B-cell acute lymphoblastic leukemia (B-ALL) samples, (c) a lymphoma data set with (1) diffuse large B-cell lymphoma (DLBCL), (2) B-cell chronic lymphocytic leukemia (BCLL) and (3) follicular lymphoma (FL) samples, and (d) the NCI60 data set with cell lines derived from cancers of various sites of origin. In addition, we evaluated the classification algorithms and examined the variability of the error rates using simulations based on randomization of the real data sets. We note that other methods for multi-class prediction have been proposed recently; our approach follows the line of Nguyen and Rocke (2002b; Bioinformatics, 18, 39-50). CONTACT: dnguyen@stat.tamu.edu; dmrocke@ucdavis.edu

4.

Surrogate variable analysis using partial least squares (SVA-PLS) in gene expression studies

Chakraborty S Datta S Datta S 《Bioinformatics (Oxford, England)》2012,28(6):799-806

MOTIVATION: In a typical gene expression profiling study, our prime objective is to identify the genes that are differentially expressed between the samples from two different tissue types. Commonly, standard analysis of variance (ANOVA)/regression is implemented to identify the relative effects of these genes over the two types of samples from their respective arrays of expression levels. However, this technique becomes fundamentally flawed when there are unaccounted sources of variability in these arrays (latent variables attributable to different biological, environmental or other factors relevant in the context). These factors distort the true picture of differential gene expression between the two tissue types and introduce spurious signals of expression heterogeneity. As a result, many genes which are actually differentially expressed are not detected, whereas many others are falsely identified as positives. Moreover, these distortions can be different for different genes. Thus, it is also not possible to get rid of these variations by simple array normalizations. This two-way error can lead to a serious loss in sensitivity and specificity, thereby causing a severe inefficiency in the underlying multiple testing problem. In this work, we attempt to identify the hidden effects of the underlying latent factors in a gene expression profiling study by partial least squares (PLS) and apply the ANCOVA technique with the PLS-identified signatures of these hidden effects as covariates, in order to identify the genes that are truly differentially expressed between the two tissue types concerned. RESULTS: We compare the performance of our method, SVA-PLS, with standard ANOVA and a relatively recent technique, surrogate variable analysis (SVA), on a wide variety of simulation settings (incorporating different effects of the hidden variable, under situations with varying signal intensities and gene groupings). In all settings, our method yields the highest sensitivity while maintaining relatively reasonable values for the specificity, false discovery rate and false non-discovery rate. Application of our method to gene expression profiling for acute megakaryoblastic leukemia shows that our method detects six additional genes that are missed by both the standard ANOVA method and SVA, but may be relevant to this disease, as can be seen from mining the existing literature.

5.

Classification using partial least squares with penalized logistic regression

Fort G Lambert-Lacroix S 《Bioinformatics (Oxford, England)》2005,21(7):1104-1111

MOTIVATION: One important aspect of data-mining of microarray data is to discover the molecular variation among cancers. In microarray studies, the number n of samples is relatively small compared to the number p of genes per sample (usually in the thousands). It is known that standard statistical methods in classification are efficient (i.e. in the present case, yield successful classifiers) particularly when n is (far) larger than p. This naturally calls for the use of a dimension reduction procedure together with the classification one. RESULTS: In this paper, the question of classification in such a high-dimensional setting is addressed. We view the classification problem as a regression one with few observations and many predictor variables. We propose a new method combining partial least squares (PLS) and Ridge penalized logistic regression. We review the existing methods based on PLS and/or penalized likelihood techniques, outline their interest in some cases and theoretically explain their sometimes poor behavior. Our procedure is compared with these other classifiers. The predictive performance of the resulting classification rule is illustrated on three data sets: Leukemia, Colon and Prostate.
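As a rough illustration of one ingredient named in this abstract, the sketch below fits a ridge-penalized logistic regression by damped Newton iterations when p exceeds n. It is not the authors' combined PLS and Ridge procedure; the function name `ridge_logistic`, the omission of an intercept, the penalty values, and the synthetic data are all assumptions made for the example.

```python
import numpy as np

def ridge_logistic(X, y, lam, n_iter=100, tol=1e-8):
    """Minimize sum(log(1 + exp(X b)) - y * (X b)) + 0.5 * lam * ||b||^2 (no intercept)."""
    n, p = X.shape
    beta = np.zeros(p)

    def objective(b):
        z = X @ b
        return np.sum(np.logaddexp(0.0, z) - y * z) + 0.5 * lam * (b @ b)

    for _ in range(n_iter):
        z = X @ beta
        prob = np.exp(-np.logaddexp(0.0, -z))      # numerically stable sigmoid
        grad = X.T @ (prob - y) + lam * beta
        if np.linalg.norm(grad) < tol:
            break
        w = prob * (1.0 - prob)                    # IRLS weights
        hess = X.T @ (X * w[:, None]) + lam * np.eye(p)
        step = np.linalg.solve(hess, grad)
        # Step halving: shrink the Newton step until the objective does not increase.
        t, f0 = 1.0, objective(beta)
        while objective(beta - t * step) > f0 + 1e-12 and t > 1e-8:
            t *= 0.5
        beta = beta - t * step
    return beta

# Hypothetical data with more predictors (50) than samples (30).
rng = np.random.default_rng(2)
X = rng.normal(size=(30, 50))
y = (X[:, 0] > 0).astype(float)
beta_weak = ridge_logistic(X, y, lam=1.0)
beta_strong = ridge_logistic(X, y, lam=10.0)
```

Without the penalty, the Hessian is singular when p > n and the maximum-likelihood estimate need not exist for separable data; the `lam * np.eye(p)` term keeps every Newton system solvable. A larger `lam` shrinks the coefficient vector further toward zero.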

6.

Higher-order partial least squares for predicting gene expression levels from chromatin states

Shiquan Sun Xifang Sun Yan Zheng 《BMC bioinformatics》2018,19(5):113

7.

Kernelized partial least squares for feature reduction and classification of gene microarray data

WH Land X Qiao DE Margolis WS Ford CT Paquette JF Perez-Rogers JA Borgia JY Yang Y Deng 《BMC systems biology》2011,5(Z3):S13

8.

Partial least squares dimension reduction for microarray gene expression data with a censored response

Nguyen DV 《Mathematical biosciences》2005,193(1):119-137

9.

Protein family classification with partial least squares

Opiyo SO Moriyama EN 《Journal of proteome research》2007,6(2):846-853

The quality of protein function predictions relies on appropriate training of protein classification methods. Performance of these methods can be affected when only a limited number of protein samples are available, which is often the case in divergent protein families. Whereas profile hidden Markov models and PSI-BLAST showed a significant decrease in performance in such cases, alignment-free partial least-squares classifiers performed consistently better even when used to identify short fragmented sequences.

10.

Missing value estimation for DNA microarray gene expression data: local least squares imputation (total citations: 9; self-citations: 0; citations by others: 9)

Kim H Golub GH Park H 《Bioinformatics (Oxford, England)》2005,21(2):187-198

MOTIVATION: Gene expression data often contain missing expression values. Effective missing value estimation methods are needed since many algorithms for gene expression data analysis require a complete matrix of gene array values. In this paper, imputation methods based on the least squares formulation are proposed to estimate missing values in the gene expression data; they exploit local similarity structures in the data as well as a least squares optimization process. RESULTS: The proposed local least squares imputation method (LLSimpute) represents a target gene that has missing values as a linear combination of similar genes. The similar genes are chosen by k-nearest neighbors or as the k coherent genes that have large absolute values of Pearson correlation coefficients. A non-parametric missing value estimation method for LLSimpute is designed by introducing an automatic k-value estimator. In our experiments, the proposed LLSimpute method shows competitive results when compared with other imputation methods for missing value estimation on various datasets and percentages of missing values in the data. AVAILABILITY: The software is available at http://www.cs.umn.edu/~hskim/tools.html CONTACT: hpark@cs.umn.edu
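The idea of representing a target gene as a linear combination of similar genes can be sketched as follows. This is a minimal illustration, not the LLSimpute software: the function name `lls_impute_row` is hypothetical, neighbors are restricted to genes with no missing values, similarity uses the absolute Pearson correlation variant mentioned in the abstract, and `k` is fixed by hand rather than chosen by the automatic estimator.

```python
import numpy as np

def lls_impute_row(data, row, k):
    """Fill NaNs in data[row] from the k complete rows most correlated with it."""
    target = data[row]
    missing = np.isnan(target)
    observed = ~missing
    # Candidate neighbors: rows without missing values, excluding the target itself.
    complete = ~np.isnan(data).any(axis=1)
    complete[row] = False
    cand = np.flatnonzero(complete)
    # Absolute Pearson correlation, computed on the target's observed columns only.
    corr = np.array([
        abs(np.corrcoef(target[observed], data[i, observed])[0, 1]) for i in cand
    ])
    nbrs = cand[np.argsort(corr)[::-1][:k]]
    A = data[nbrs][:, observed]     # neighbors at observed positions (k x n_obs)
    B = data[nbrs][:, missing]      # neighbors at missing positions (k x n_missing)
    # Least squares: find coefficients so that A.T @ coef approximates the target.
    coef, *_ = np.linalg.lstsq(A.T, target[observed], rcond=None)
    filled = target.copy()
    filled[missing] = B.T @ coef    # apply the same combination at the missing positions
    return filled

# Hypothetical 5-gene x 8-sample matrix; gene 0 is an exact combination of genes 1 and 2.
rng = np.random.default_rng(1)
data = rng.normal(size=(5, 8))
data[0] = 2.0 * data[1] - data[2]
truth = data[0, 3]
data[0, 3] = np.nan
filled = lls_impute_row(data, row=0, k=4)
```

Because gene 0 lies exactly in the span of its neighbors here, the imputed entry matches the held-out value up to rounding. On real data the fit is only approximate, and the choice of `k` matters.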

11.

Missing value estimation for DNA microarray gene expression data: local least squares imputation

Kim H Golub GH Park H 《Bioinformatics (Oxford, England)》2006,22(11):1410-1411

In our article, only a set of random positions of missing values was used for each dataset. However, imputation methods may

12.

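Application of spline-transformation partial least squares to the classification of liver cancer data

Li Jiangeng Li Hui 《Journal of Biology (生物学杂志)》2011,28(6):58-61

Liver cancer is one of the most common malignant tumors in China. The analysis of tumor gene expression profile data is a current research focus and is of great importance for the early diagnosis and treatment of cancer. To address the severe collinearity among variables and the nonlinear relationship between the class variable and the predictor variables that arise in high-dimensional, small-sample gene expression profile data, a new partial least squares regression technique based on spline transformation is adopted. First, redundant information is removed from the gene expression profile data by a screening method. Next, a cubic B-spline basis transformation is used to reconstruct the nonlinear gene expression profile data in linearized form. The reconstructed matrix is then passed to partial least squares to build a model relating the class variable to the predictor variables. Finally, analysis of liver cancer gene expression profile data shows that the classification model is robust to the data reconstruction, effectively addresses overfitting and collinearity among variables in high-dimensional, small-sample gene expression profile data, and achieves high fitting and classification accuracy.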

13.

Monotone spline‐based least squares estimation for panel count data with informative observation times

Shirong Deng Li Liu Xingqiu Zhao 《Biometrical journal. Biometrische Zeitschrift》2015,57(5):743-765

This article discusses the statistical analysis of panel count data when the underlying recurrent event process and observation process may be correlated. For the recurrent event process, we propose a new class of semiparametric mean models that allows for the interaction between the observation history and covariates. For inference on the model parameters, a monotone spline‐based least squares estimation approach is developed, and the resulting estimators are consistent and asymptotically normal. In particular, our new approach does not rely on the model specification of the observation process. The proposed inference procedure performs well in simulation studies, and it is illustrated by the analysis of bladder tumor data.

14.

Semi-supervised methods to predict patient survival from gene expression data (total citations: 1; self-citations: 0; citations by others: 1)

Bair E Tibshirani R 《PLoS biology》2004,2(4):e108

An important goal of DNA microarray research is to develop tools to diagnose cancer more accurately based on the genetic profile of a tumor. There are several existing techniques in the literature for performing this type of diagnosis. Unfortunately, most of these techniques assume that different subtypes of cancer are already known to exist. Their utility is limited when such subtypes have not been previously identified. Although methods for identifying such subtypes exist, these methods do not work well for all datasets. It would be desirable to develop a procedure to find such subtypes that is applicable in a wide variety of circumstances. Even if no information is known about possible subtypes of a certain form of cancer, clinical information about the patients, such as their survival time, is often available. In this study, we develop some procedures that utilize both the gene expression data and the clinical data to identify subtypes of cancer and use this knowledge to diagnose future patients. These procedures were successfully applied to several publicly available datasets. We present diagnostic procedures that accurately predict the survival of future patients based on the gene expression profile and survival times of previous patients. This has the potential to be a powerful tool for diagnosing and treating cancer.

15.

Testing association of a pathway with survival using gene expression data (total citations: 2; self-citations: 0; citations by others: 2)

Goeman JJ Oosting J Cleton-Jansen AM Anninga JK van Houwelingen HC 《Bioinformatics (Oxford, England)》2005,21(9):1950-1957

MOTIVATION: A recent surge of interest in survival as the primary clinical endpoint of microarray studies has called for an extension of the Global Test methodology to survival. RESULTS: We present a score test for association of the expression profile of one or more groups of genes with a (possibly censored) survival time. Groups of genes may be pathways, areas of the genome, clusters from a cluster analysis or all genes on a chip. The test allows one to test hypotheses about the influence of these groups of genes on survival directly, without the intermediary of single gene testing. The test is based on the Cox proportional hazards model and is calculated using martingale residuals. It is possible to adjust the test for the presence of covariates. We also present a diagnostic graph to assist in the interpretation of the test result, visualizing the influence of genes. The test is applied to a tumor dataset, revealing pathways from the gene ontology database that are associated with survival of patients. AVAILABILITY: The Global Test for survival has been incorporated into the R-package globaltest (version 3.0), available at http://www.bioconductor.org

16.

Partial least squares proportional hazard regression for application to DNA microarray survival data (total citations: 3; self-citations: 0; citations by others: 3)

Nguyen DV Rocke DM 《Bioinformatics (Oxford, England)》2002,18(12):1625-1632

17.

Iterative partial least squares with right-censored data analysis: a comparison to other dimension reduction techniques

Huang J Harrington D 《Biometrics》2005,61(1):17-24

In the linear model with right-censored responses and many potential explanatory variables, regression parameter estimates may be unstable or, when the covariates outnumber the uncensored observations, not estimable. We propose an iterative algorithm for partial least squares, based on the Buckley-James estimating equation, to estimate the covariate effect and predict the response for a future subject with a given set of covariates. We use a leave-two-out cross-validation method for empirically selecting the number of components in the partial least-squares fit that approximately minimizes the error in estimating the covariate effect of a future observation. Simulation studies compare the methods discussed here with other dimension reduction techniques. Data from the AIDS Clinical Trials Group protocol 333 are used to motivate the methodology.

18.

Reconstruction of genetic association networks from microarray data: a partial least squares approach

Pihur V Datta S Datta S 《Bioinformatics (Oxford, England)》2008,24(4):561-568

MOTIVATION: Gene association/interaction networks provide vast amounts of information about essential processes inside the cell. A complete picture of gene-gene associations/interactions would open new horizons for biologists, ranging from pure appreciation to successful manipulation of biological pathways for therapeutic purposes. Therefore, identification of important biological complexes whose members (genes and their protein products) interact with each other is of prime importance. Numerous experimental methods exist but, for the most part, they are costly and labor intensive. Computational techniques, such as the one proposed in this work, provide a quick 'budget' solution that can be used as a screening tool before more expensive techniques are attempted. Here, we introduce a novel computational method based on the partial least squares (PLS) regression technique for reconstruction of genetic networks from microarray data. RESULTS: The proposed PLS method is shown to be an effective screening procedure for the detection of gene-gene interactions from microarray data. Both simulated and real microarray experiments show that the PLS-based approach is superior to its competitors both in terms of performance and applicability. AVAILABILITY: R code is available from the supplementary web-site whose URL is given below.

19.

Predicting student performance via NAEP secondary art analysis using partial least squares SEM

Lihua Xu Read Diket 《Arts Education Policy Review》2018,119(4):231-242

This article describes a secondary analysis of the National Assessment of Educational Progress 2008 eighth-grade visual arts data (N = 3,912). These assessments occur under government mandate on a periodic schedule, and data on the arts were collected in 1997 and again in 2008. The purpose of this study was to predict students' visual art response performance using students' home environment, personal characteristics, in-school curriculum, and art-related not-for-school (extracurricular) activities. Formative measurement models and structural paths were modeled in structural equation modeling (SEM) using Smart PLS. The initial SEM model included four latent constructs and one endogenous variable measuring students' performance. Both direct and indirect effects between latent constructs were modeled and assessed. Altogether, the four latent constructs explained 21.3% of the variance in students' responding performance, of which the home environment construct had the strongest impact. School-related artistic activities do predict students' performance significantly, but with lesser strength. Students' personal attributes and their art-related not-for-school activities predict students' performance to a substantially lesser degree. Implications of these findings will be discussed in terms of the data findings and larger issues of what these data represent as a means of following curriculum articulation with standards and the impact of art specialists in schools.

20.

Improve survival prediction using principal components of gene expression data

Shen YJ Huang SG 《Genomics, Proteomics & Bioinformatics》2006,4(2):110-119

The purpose of many microarray studies is to find the association between gene expression and sample characteristics such as treatment type or sample phenotype. There has been a surge of efforts to develop different methods for delineating the association. Aside from the high dimensionality of microarray data, one well-recognized challenge is that genes can be inter-related in complicated ways, making many statistical methods inappropriate to use directly on the expression data. Multivariate methods such as principal component analysis (PCA) and clustering are often used as part of the effort to capture the gene correlation, and the derived components or clusters are used to describe the association between gene expression and sample phenotype. We propose a method for patient population dichotomization using maximally selected test statistics in combination with the PCA method, which shows favorable results. The proposed method is compared with a currently well-recognized method.