首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
Variable selection is critical in competing risks regression with high-dimensional data. Although penalized variable selection methods and other machine learning-based approaches have been developed, many of these methods often suffer from instability in practice. This paper proposes a novel method named Random Approximate Elastic Net (RAEN). Under the proportional subdistribution hazards model, RAEN provides a stable and generalizable solution to the large-p-small-n variable selection problem for competing risks data. Our general framework allows the proposed algorithm to be applicable to other time-to-event regression models, including competing risks quantile regression and accelerated failure time models. We show that variable selection and parameter estimation improved markedly using the new computationally intensive algorithm through extensive simulations. A user-friendly R package RAEN is developed for public use. We also apply our method to a cancer study to identify influential genes associated with the death or progression from bladder cancer.  相似文献   

2.

Background

A tremendous amount of efforts have been devoted to identifying genes for diagnosis and prognosis of diseases using microarray gene expression data. It has been demonstrated that gene expression data have cluster structure, where the clusters consist of co-regulated genes which tend to have coordinated functions. However, most available statistical methods for gene selection do not take into consideration the cluster structure.

Results

We propose a supervised group Lasso approach that takes into account the cluster structure in gene expression data for gene selection and predictive model building. For gene expression data without biological cluster information, we first divide genes into clusters using the K-means approach and determine the optimal number of clusters using the Gap method. The supervised group Lasso consists of two steps. In the first step, we identify important genes within each cluster using the Lasso method. In the second step, we select important clusters using the group Lasso. Tuning parameters are determined using V-fold cross validation at both steps to allow for further flexibility. Prediction performance is evaluated using leave-one-out cross validation. We apply the proposed method to disease classification and survival analysis with microarray data.

Conclusion

We analyze four microarray data sets using the proposed approach: two cancer data sets with binary cancer occurrence as outcomes and two lymphoma data sets with survival outcomes. The results show that the proposed approach is capable of identifying a small number of influential gene clusters and important genes within those clusters, and has better prediction performance than existing methods.  相似文献   

3.

Background  

Inferring gene networks from time-course microarray experiments with vector autoregressive (VAR) model is the process of identifying functional associations between genes through multivariate time series. This problem can be cast as a variable selection problem in Statistics. One of the promising methods for variable selection is the elastic net proposed by Zou and Hastie (2005). However, VAR modeling with the elastic net succeeds in increasing the number of true positives while it also results in increasing the number of false positives.  相似文献   

4.
Summary We consider penalized linear regression, especially for “large p, small n” problems, for which the relationships among predictors are described a priori by a network. A class of motivating examples includes modeling a phenotype through gene expression profiles while accounting for coordinated functioning of genes in the form of biological pathways or networks. To incorporate the prior knowledge of the similar effect sizes of neighboring predictors in a network, we propose a grouped penalty based on the Lγ ‐norm that smoothes the regression coefficients of the predictors over the network. The main feature of the proposed method is its ability to automatically realize grouped variable selection and exploit grouping effects. We also discuss effects of the choices of the γ and some weights inside the Lγ ‐norm. Simulation studies demonstrate the superior finite‐sample performance of the proposed method as compared to Lasso, elastic net, and a recently proposed network‐based method. The new method performs best in variable selection across all simulation set‐ups considered. For illustration, the method is applied to a microarray dataset to predict survival times for some glioblastoma patients using a gene expression dataset and a gene network compiled from some Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways.  相似文献   

5.
Preprocessing for high‐dimensional censored datasets, such as the microarray data, is generally considered as an important technique to gain further stability by reducing potential noise from the data. When variable selection including inference is carried out with high‐dimensional censored data the objective is to obtain a smaller subset of variables and then perform the inferential analysis using model estimates based on the selected subset of variables. This two stage inferential analysis is prone to circularity bias because of the noise that might still remain in the dataset. In this work, I propose an adaptive preprocessing technique that uses sure independence screening (SIS) idea to accomplish variable selection and reduces the circularity bias by some popularly known refined high‐dimensional methods such as the elastic net, adaptive elastic net, weighted elastic net, elastic net‐AFT, and two greedy variable selection methods known as TCS, PC‐simple all implemented with the accelerated lifetime models. The proposed technique addresses several features including the issue of collinearity between important and some unimportant covariates, which is often the case in high‐dimensional setting under variable selection framework, and different level of censoring. Simulation studies along with an empirical analysis with a real microarray data, mantle cell lymphoma, is carried out to demonstrate the performance of the adaptive pre‐processing technique.  相似文献   

6.
The development of clinical prediction models requires the selection of suitable predictor variables. Techniques to perform objective Bayesian variable selection in the linear model are well developed and have been extended to the generalized linear model setting as well as to the Cox proportional hazards model. Here, we consider discrete time‐to‐event data with competing risks and propose methodology to develop a clinical prediction model for the daily risk of acquiring a ventilator‐associated pneumonia (VAP) attributed to P. aeruginosa (PA) in intensive care units. The competing events for a PA VAP are extubation, death, and VAP due to other bacteria. Baseline variables are potentially important to predict the outcome at the start of ventilation, but may lose some of their predictive power after a certain time. Therefore, we use a landmark approach for dynamic Bayesian variable selection where the set of relevant predictors depends on the time already spent at risk. We finally determine the direct impact of a variable on each competing event through cause‐specific variable selection.  相似文献   

7.
One important problem in genomic research is to identify genomic features such as gene expression data or DNA single nucleotide polymorphisms (SNPs) that are related to clinical phenotypes. Often these genomic data can be naturally divided into biologically meaningful groups such as genes belonging to the same pathways or SNPs within genes. In this paper, we propose group additive regression models and a group gradient descent boosting procedure for identifying groups of genomic features that are related to clinical phenotypes. Our simulation results show that by dividing the variables into appropriate groups, we can obtain better identification of the group features that are related to the phenotypes. In addition, the prediction mean square errors are also smaller than the component-wise boosting procedure. We demonstrate the application of the methods to pathway-based analysis of microarray gene expression data of breast cancer. Results from analysis of a breast cancer microarray gene expression data set indicate that the pathways of metalloendopeptidases (MMPs) and MMP inhibitors, as well as cell proliferation, cell growth, and maintenance are important to breast cancer-specific survival.  相似文献   

8.
MOTIVATION: A common task in microarray data analysis consists of identifying genes associated with a phenotype. When the outcomes of interest are censored time-to-event data, standard approaches assess the effect of genes by fitting univariate survival models. In this paper, we propose a Bayesian variable selection approach, which allows the identification of relevant markers by jointly assessing sets of genes. We consider accelerated failure time (AFT) models with log-normal and log-t distributional assumptions. A data augmentation approach is used to impute the failure times of censored observations and mixture priors are used for the regression coefficients to identify promising subsets of variables. The proposed method provides a unified procedure for the selection of relevant genes and the prediction of survivor functions. RESULTS: We demonstrate the performance of the method on simulated examples and on several microarray datasets. For the simulation study, we consider scenarios with large number of noisy variables and different degrees of correlation between the relevant and non-relevant (noisy) variables. We are able to identify the correct covariates and obtain good prediction of the survivor functions. For the microarray applications, some of our selected genes are known to be related to the diseases under study and a few are in agreement with findings from other researchers. AVAILABILITY: The Matlab code for implementing the Bayesian variable selection method may be obtained from the corresponding author. CONTACT: mvannucci@stat.tamu.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.  相似文献   

9.
Microarray data analysis has been shown to provide an effective tool for studying cancer and genetic diseases. Although classical machine learning techniques have successfully been applied to find informative genes and to predict class labels for new samples, common restrictions of microarray analysis such as small sample sizes, a large attribute space and high noise levels still limit its scientific and clinical applications. Increasing the interpretability of prediction models while retaining a high accuracy would help to exploit the information content in microarray data more effectively. For this purpose, we evaluate our rule-based evolutionary machine learning systems, BioHEL and GAssist, on three public microarray cancer datasets, obtaining simple rule-based models for sample classification. A comparison with other benchmark microarray sample classifiers based on three diverse feature selection algorithms suggests that these evolutionary learning techniques can compete with state-of-the-art methods like support vector machines. The obtained models reach accuracies above 90% in two-level external cross-validation, with the added value of facilitating interpretation by using only combinations of simple if-then-else rules. As a further benefit, a literature mining analysis reveals that prioritizations of informative genes extracted from BioHEL's classification rule sets can outperform gene rankings obtained from a conventional ensemble feature selection in terms of the pointwise mutual information between relevant disease terms and the standardized names of top-ranked genes.  相似文献   

10.
MOTIVATION: An important application of microarray technology is to relate gene expression profiles to various clinical phenotypes of patients. Success has been demonstrated in molecular classification of cancer in which the gene expression data serve as predictors and different types of cancer serve as a categorical outcome variable. However, there has been less research in linking gene expression profiles to the censored survival data such as patients' overall survival time or time to cancer relapse. It would be desirable to have models with good prediction accuracy and parsimony property. RESULTS: We propose to use the L(1) penalized estimation for the Cox model to select genes that are relevant to patients' survival and to build a predictive model for future prediction. The computational difficulty associated with the estimation in the high-dimensional and low-sample size settings can be efficiently solved by using the recently developed least-angle regression (LARS) method. Our simulation studies and application to real datasets on predicting survival after chemotherapy for patients with diffuse large B-cell lymphoma demonstrate that the proposed procedure, which we call the LARS-Cox procedure, can be used for identifying important genes that are related to time to death due to cancer and for building a parsimonious model for predicting the survival of future patients. The LARS-Cox regression gives better predictive performance than the L(2) penalized regression and a few other dimension-reduction based methods. CONCLUSIONS: We conclude that the proposed LARS-Cox procedure can be very useful in identifying genes relevant to survival phenotypes and in building a parsimonious predictive model that can be used for classifying future patients into clinically relevant high- and low-risk groups based on the gene expression profile and survival times of previous patients.  相似文献   

11.
Classification of gene microarrays by penalized logistic regression   总被引:2,自引:0,他引:2  
Classification of patient samples is an important aspect of cancer diagnosis and treatment. The support vector machine (SVM) has been successfully applied to microarray cancer diagnosis problems. However, one weakness of the SVM is that given a tumor sample, it only predicts a cancer class label but does not provide any estimate of the underlying probability. We propose penalized logistic regression (PLR) as an alternative to the SVM for the microarray cancer diagnosis problem. We show that when using the same set of genes, PLR and the SVM perform similarly in cancer classification, but PLR has the advantage of additionally providing an estimate of the underlying probability. Often a primary goal in microarray cancer diagnosis is to identify the genes responsible for the classification, rather than class prediction. We consider two gene selection methods in this paper, univariate ranking (UR) and recursive feature elimination (RFE). Empirical results indicate that PLR combined with RFE tends to select fewer genes than other methods and also performs well in both cross-validation and test samples. A fast algorithm for solving PLR is also described.  相似文献   

12.
ABSTRACT: BACKGROUND: A latent behavior of a biological cell is complex. Deriving the underlying simplicity, or the fundamental rules governing this behavior has been the Holy Grail of systems biology. Data-driven prediction of the system components and their component interplays that are responsible for the target system's phenotype is a key and challenging step in this endeavor. RESULTS: The proposed approach, which we call System Phenotype-related Interplaying Components Enumerator (SPICE), iteratively enumerates statistically significant that are hypothesized (1) to play an important role in defining the specificity of the target system's phenotype(s); (2) to exhibit a functionally coherent behavior, namely, act in a coordinated manner to perform the phenotype-specific function; and (3) to improve the predictive skill of the system's phenotype(s) when used collectively in the ensemble of predictive models. When validated, SPICE effectively identified system components related to three target phenotypes: biohydrogen production, motility, and cancer. Manual results curation agreed with the known phenotype-related system components reported in literature. Additionally, using the identified system components as discriminatory features improved the prediction accuracy by 10% on the phenotype-classification task when compared to a number of state-of-the-art methods applied to eight benchmark microarray data sets. CONCLUSION: Conclusion: We formulate a problem--- enumeration of phenotype-determining system component interplays---and propose an effective methodology (SPICE) to address this problem. SPICE improved identification of cancer-related groups of genes from various microarray data sets and detected groups of genes associated with microbial biohydrogen production and motility, many of which were reported in literature. SPICE also improved the predictive skill of the system's phenotype determination compared to individual classifiers and/or other ensemble methods, such as bagging, boosting, random forest, nearest shrunken centroid, and random forest variable selection method. All the supporting data and software can be found in the supplemental files.  相似文献   

13.
Gene expression measurements have successfully been used for building prognostic signatures, i.e for identifying a short list of important genes that can predict patient outcome. Mostly microarray measurements have been considered, and there is little advice available for building multivariable risk prediction models from RNA-Seq data. We specifically consider penalized regression techniques, such as the lasso and componentwise boosting, which can simultaneously consider all measurements and provide both, multivariable regression models for prediction and automated variable selection. However, they might be affected by the typical skewness, mean-variance-dependency or extreme values of RNA-Seq covariates and therefore could benefit from transformations of the latter. In an analytical part, we highlight preferential selection of covariates with large variances, which is problematic due to the mean-variance dependency of RNA-Seq data. In a simulation study, we compare different transformations of RNA-Seq data for potentially improving detection of important genes. Specifically, we consider standardization, the log transformation, a variance-stabilizing transformation, the Box-Cox transformation, and rank-based transformations. In addition, the prediction performance for real data from patients with kidney cancer and acute myeloid leukemia is considered. We show that signature size, identification performance, and prediction performance critically depend on the choice of a suitable transformation. Rank-based transformations perform well in all scenarios and can even outperform complex variance-stabilizing approaches. Generally, the results illustrate that the distribution and potential transformations of RNA-Seq data need to be considered as a critical step when building risk prediction models by penalized regression techniques.  相似文献   

14.
Time‐dependent covariates are frequently encountered in regression analysis for event history data and competing risks. They are often essential predictors, which cannot be substituted by time‐fixed covariates. This study briefly recalls the different types of time‐dependent covariates, as classified by Kalbfleisch and Prentice [The Statistical Analysis of Failure Time Data, Wiley, New York, 2002] with the intent of clarifying their role and emphasizing the limitations in standard survival models and in the competing risks setting. If random (internal) time‐dependent covariates are to be included in the modeling process, then it is still possible to estimate cause‐specific hazards but prediction of the cumulative incidences and survival probabilities based on these is no longer feasible. This article aims at providing some possible strategies for dealing with these prediction problems. In a multi‐state framework, a first approach uses internal covariates to define additional (intermediate) transient states in the competing risks model. Another approach is to apply the landmark analysis as described by van Houwelingen [Scandinavian Journal of Statistics 2007, 34 , 70–85] in order to study cumulative incidences at different subintervals of the entire study period. The final strategy is to extend the competing risks model by considering all the possible combinations between internal covariate levels and cause‐specific events as final states. In all of those proposals, it is possible to estimate the changes/differences of the cumulative risks associated with simple internal covariates. An illustrative example based on bone marrow transplant data is presented in order to compare the different methods.  相似文献   

15.
We wished to construct a prognostic model based on ferroptosis-related genes and to simultaneously evaluate the performance of the prognostic model and analyze differences between high-risk and low-risk groups at all levels. The gene-expression profiles and relevant clinical data of patients with non-small-cell lung cancer (NSCLC) were downloaded from public databases. Differentially expressed genes (DEGs) were obtained by analyzing differences between cancer tissues and paracancerous tissues, and common genes between DEGs and ferroptosis-related genes were identified as candidate ferroptosis-related genes. Next, a risk-score model was constructed using univariate Cox analysis and least absolute shrinkage and selection operator (Lasso) analysis. According to the median risk score, samples were divided into high-risk and low-risk groups, and a series of bioinformatics analyses were conducted to verify the predictive ability of the model. Single-sample gene set enrichment analysis (ssGSEA) was used to investigate differences in immune status between high-risk and low-risk groups, and differences in gene mutations between the two groups were investigated. A risk-score model was constructed based on 21 ferroptosis-related genes. A Kaplan–Meier curve and receiver operating characteristic curve showed that the model had good prediction ability. Univariate and multivariate Cox analyses revealed that ferroptosis-related genes associated with the prognosis may be used as independent prognostic factors for the overall survival time of NSCLC patients. The pathways enriched with DEGs in low-risk and high-risk groups were analyzed, and the enriched pathways were correlated significantly with immunosuppressive status.  相似文献   

16.
Yi Li  Lu Tian  Lee‐Jen Wei 《Biometrics》2011,67(2):427-435
Summary In a longitudinal study, suppose that the primary endpoint is the time to a specific event. This response variable, however, may be censored by an independent censoring variable or by the occurrence of one of several dependent competing events. For each study subject, a set of baseline covariates is collected. The question is how to construct a reliable prediction rule for the future subject's profile of all competing risks of interest at a specific time point for risk‐benefit decision making. In this article, we propose a two‐stage procedure to make inferences about such subject‐specific profiles. For the first step, we use a parametric model to obtain a univariate risk index score system. We then estimate consistently the average competing risks for subjects who have the same parametric index score via a nonparametric function estimation procedure. We illustrate this new proposal with the data from a randomized clinical trial for evaluating the efficacy of a treatment for prostate cancer. The primary endpoint for this study was the time to prostate cancer death, but had two types of dependent competing events, one from cardiovascular death and the other from death of other causes.  相似文献   

17.
High-throughout genomic data provide an opportunity for identifying pathways and genes that are related to various clinical phenotypes. Besides these genomic data, another valuable source of data is the biological knowledge about genes and pathways that might be related to the phenotypes of many complex diseases. Databases of such knowledge are often called the metadata. In microarray data analysis, such metadata are currently explored in post hoc ways by gene set enrichment analysis but have hardly been utilized in the modeling step. We propose to develop and evaluate a pathway-based gradient descent boosting procedure for nonparametric pathways-based regression (NPR) analysis to efficiently integrate genomic data and metadata. Such NPR models consider multiple pathways simultaneously and allow complex interactions among genes within the pathways and can be applied to identify pathways and genes that are related to variations of the phenotypes. These methods also provide an alternative to mediating the problem of a large number of potential interactions by limiting analysis to biologically plausible interactions between genes in related pathways. Our simulation studies indicate that the proposed boosting procedure can indeed identify relevant pathways. Application to a gene expression data set on breast cancer distant metastasis identified that Wnt, apoptosis, and cell cycle-regulated pathways are more likely related to the risk of distant metastasis among lymph-node-negative breast cancer patients. Results from analysis of other two breast cancer gene expression data sets indicate that the pathways of Metalloendopeptidases (MMPs) and MMP inhibitors, as well as cell proliferation, cell growth, and maintenance are important to breast cancer relapse and survival. We also observed that by incorporating the pathway information, we achieved better prediction for cancer recurrence.  相似文献   

18.
Prediction of drug response based on genomic alterations is an important task in the research of personalized medicine. Current elastic net model utilized a sure independence screening to select relevant genomic features with drug response, but it may neglect the combination effect of some marginally weak features. In this work, we applied an iterative sure independence screening scheme to select drug response relevant features from the Cancer Cell Line Encyclopedia (CCLE) dataset. For each drug in CCLE, we selected up to 40 features including gene expressions, mutation and copy number alterations of cancer-related genes, and some of them are significantly strong features but showing weak marginal correlation with drug response vector. Lasso regression based on the selected features showed that our prediction accuracies are higher than those by elastic net regression for most drugs.  相似文献   

19.
This paper explores how mortality is related to such socio-economic factors as education, occupation, skill level and income for the years 1992-1997 using an extensive sample of the Danish population. We employ a competing risks proportional hazard model to allow for different causes of death. This method is important as some factors have unequal (and sometimes opposite) influence on the cause-specific mortality rates. We find that the often-found inverse correlation between socio-economic status and mortality is to a large degree absent among Danish women who die of cancer. In addition, for men the negative correlation between socio-economic status and mortality prevails for some diseases, but for women we find that factors such as being married, income, wealth and education are not significantly associated with higher life expectancy. Marriage increases the likelihood of dying from cancer for women, early retirement prolongs survival for men, and homeownership increases life expectancy in general.  相似文献   

20.

Background:  

In class prediction problems using microarray data, gene selection is essential to improve the prediction accuracy and to identify potential marker genes for a disease. Among numerous existing methods for gene selection, support vector machine-based recursive feature elimination (SVM-RFE) has become one of the leading methods and is being widely used. The SVM-based approach performs gene selection using the weight vector of the hyperplane constructed by the samples on the margin. However, the performance can be easily affected by noise and outliers, when it is applied to noisy, small sample size microarray data.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号