Similar literature
1.
MOTIVATION: An important application of microarray technology is to relate gene expression profiles to various clinical phenotypes of patients. Success has been demonstrated in molecular classification of cancer in which the gene expression data serve as predictors and different types of cancer serve as a categorical outcome variable. However, there has been less research in linking gene expression profiles to the censored survival data such as patients' overall survival time or time to cancer relapse. It would be desirable to have models with good prediction accuracy and parsimony property. RESULTS: We propose to use the L(1) penalized estimation for the Cox model to select genes that are relevant to patients' survival and to build a predictive model for future prediction. The computational difficulty associated with the estimation in the high-dimensional and low-sample size settings can be efficiently solved by using the recently developed least-angle regression (LARS) method. Our simulation studies and application to real datasets on predicting survival after chemotherapy for patients with diffuse large B-cell lymphoma demonstrate that the proposed procedure, which we call the LARS-Cox procedure, can be used for identifying important genes that are related to time to death due to cancer and for building a parsimonious model for predicting the survival of future patients. The LARS-Cox regression gives better predictive performance than the L(2) penalized regression and a few other dimension-reduction based methods. CONCLUSIONS: We conclude that the proposed LARS-Cox procedure can be very useful in identifying genes relevant to survival phenotypes and in building a parsimonious predictive model that can be used for classifying future patients into clinically relevant high- and low-risk groups based on the gene expression profile and survival times of previous patients.
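To make the estimation target in this abstract concrete, here is a minimal Python sketch of an L1-penalized negative log partial likelihood for the Cox model. It assumes no tied event times, and the function name, argument layout, and penalty weight `lam` are illustrative choices rather than anything taken from the paper. The LARS path algorithm that the authors use to minimise an objective of this kind is not shown.

```python
import math

def penalized_cox_objective(beta, X, times, events, lam):
    """Negative log partial likelihood of a Cox model plus an L1 penalty.

    beta   : list of regression coefficients (one per gene)
    X      : list of covariate rows, one row per patient
    times  : observed follow-up times
    events : 1 if the event was observed, 0 if right-censored
    lam    : L1 penalty weight (hypothetical tuning parameter)
    Assumes no tied event times.
    """
    # Linear predictor for each patient
    eta = [sum(b * x for b, x in zip(beta, row)) for row in X]
    log_pl = 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if not d_i:
            continue  # censored patients contribute only through risk sets
        # Risk set: patients still under observation at time t_i
        risk = [math.exp(eta[j]) for j, t_j in enumerate(times) if t_j >= t_i]
        log_pl += eta[i] - math.log(sum(risk))
    return -log_pl + lam * sum(abs(b) for b in beta)
```

Larger values of `lam` push more coefficients to exactly zero at the minimiser, which is what yields the short gene list the abstract describes as parsimonious.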

2.
One important problem in genomic research is to identify genomic features such as gene expression data or DNA single nucleotide polymorphisms (SNPs) that are related to clinical phenotypes. Often these genomic data can be naturally divided into biologically meaningful groups such as genes belonging to the same pathways or SNPs within genes. In this paper, we propose group additive regression models and a group gradient descent boosting procedure for identifying groups of genomic features that are related to clinical phenotypes. Our simulation results show that by dividing the variables into appropriate groups, we can obtain better identification of the group features that are related to the phenotypes. In addition, the prediction mean square errors are also smaller than those of the component-wise boosting procedure. We demonstrate the application of the methods to pathway-based analysis of microarray gene expression data of breast cancer. Results from analysis of a breast cancer microarray gene expression data set indicate that the pathways of metalloendopeptidases (MMPs) and MMP inhibitors, as well as cell proliferation, cell growth, and maintenance are important to breast cancer-specific survival.

3.
High-throughput genomic data provide an opportunity for identifying pathways and genes that are related to various clinical phenotypes. Besides these genomic data, another valuable source of data is the biological knowledge about genes and pathways that might be related to the phenotypes of many complex diseases. Databases of such knowledge are often called the metadata. In microarray data analysis, such metadata are currently explored in post hoc ways by gene set enrichment analysis but have hardly been utilized in the modeling step. We propose to develop and evaluate a pathway-based gradient descent boosting procedure for nonparametric pathways-based regression (NPR) analysis to efficiently integrate genomic data and metadata. Such NPR models consider multiple pathways simultaneously and allow complex interactions among genes within the pathways and can be applied to identify pathways and genes that are related to variations of the phenotypes. These methods also provide an alternative way of mitigating the problem of a large number of potential interactions by limiting analysis to biologically plausible interactions between genes in related pathways. Our simulation studies indicate that the proposed boosting procedure can indeed identify relevant pathways. Application to a gene expression data set on breast cancer distant metastasis identified that Wnt, apoptosis, and cell cycle-regulated pathways are more likely related to the risk of distant metastasis among lymph-node-negative breast cancer patients. Results from analysis of two other breast cancer gene expression data sets indicate that the pathways of metalloendopeptidases (MMPs) and MMP inhibitors, as well as cell proliferation, cell growth, and maintenance are important to breast cancer relapse and survival. We also observed that by incorporating the pathway information, we achieved better prediction for cancer recurrence.

4.
We present the use of innovative machine learning techniques in the understanding of Coronary Heart Disease (CHD) through intermediate traits, as an example of the use of this class of methods as a first step towards a systems epidemiology approach to complex disease genetics. Using a sample of 252 middle-aged men, of which 102 had a CHD event in 10 years follow-up, we applied machine learning algorithms for the selection of CHD intermediate phenotypes, established markers, risk factors, and their previously associated genetic polymorphisms, and constructed a map of relationships between the selected variables. Of the 52 variables considered, 42 were retained after selection of the most informative variables for CHD. The constructed map suggests that most selected variables were related to CHD in a context-dependent manner while only a small number of variables were related to a specific outcome. We also observed that loss of complexity in the network was linked to a future CHD event. We propose that novel, non-linear, and integrative epidemiological approaches are required to combine all available information, in order to truly translate the new advances in medical sciences to gains in preventive measures and patient care.

5.
6.
Marko NF, Toms SA, Barnett GH, Weil R. Genomics 2008, 91(5): 395-406
We used microarray analysis to investigate associations between genotypic expression profiles and survival phenotypes in patients with primary glioblastoma (GBM). Tumor samples from 7 long-term glioblastoma survivors (>24 months) and 13 short-term survivors (<9 months) were analyzed to detect differential patterns of gene expression between these groups and to identify genotypic subclasses of glioblastomas that correlate with survival phenotypes. Five unsupervised and three supervised clustering algorithms consistently and accurately grouped the tumors into genotypic subgroups corresponding to the two clinical survival phenotypes. Three unique prospective mathematical classification algorithms were subsequently trained to use expression data to stratify unknown glioblastomas between survival groups and performed this task with 100% accuracy in validation studies. A set of 1478 genes with significant differential expression (p<0.01) between long-term and short-term survivors was identified, and additional mathematical filtering was used to isolate a 43-gene "fingerprint" that distinguished survival phenotypes. Differential regulation of a subset of these genes was confirmed using RT-PCR. Gene ontology analysis of the fingerprint demonstrated pathophysiologic functions for the gene products that are consistent with current models of tumor biology, suggesting that differential expression of these genes may contribute etiologically to the observed differences in survival. These results demonstrate that unique expression profiles characterize genotypic subsets of primary GBMs associated with differential survival phenotypes, and these profiles can be used in a prospective fashion to assign unknown tumors to survival groups. Future efforts will focus on building more robust classifiers and identifying additional subclasses of gliomas with phenotypic significance.

7.
MOTIVATION: Recent research has shown that gene expression profiles can potentially be used for predicting various clinical phenotypes, such as tumor class, drug response and survival time. While there have been extensive studies on tumor classification, there has been less emphasis on other phenotypic features, in particular, patient survival time or time to cancer recurrence, which are subject to right censoring. We consider in this paper an analysis of censored survival time based on microarray gene expression profiles. RESULTS: We propose a dimension reduction strategy, which combines principal components analysis and sliced inverse regression, to identify linear combinations of genes that both account for the variability in the gene expression levels and preserve the phenotypic information. The extracted gene combinations are then employed as covariates in a predictive survival model formulation. We apply the proposed method to a large diffuse large-B-cell lymphoma dataset, which consists of 240 patients and 7399 genes, and build a Cox proportional hazards model based on the derived gene expression components. The proposed method is shown to provide a good predictive performance for patient survival, as demonstrated by both the significant survival difference between the predicted risk groups and the receiver operating characteristic analysis. AVAILABILITY: R programs are available upon request from the authors. SUPPLEMENTARY INFORMATION: http://dna.ucdavis.edu/~hli/bioinfo-surv-supp.pdf.

8.
Lung cancer is one of the most malignant cancers worldwide, and lung adenocarcinoma (LUAD) is the most common histologic subtype. Thousands of biomarkers related to the survival and prognosis of patients with this cancer type have been investigated through database mining; however, the prediction effect of a single gene biomarker is not satisfactorily specific or sensitive. Thus, the present study aimed to develop a novel gene signature of prognostic value for patients with LUAD. Using a data-mining method, we performed expression profiling of 1145 mRNAs in large cohorts with LUAD (n = 511) from The Cancer Genome Atlas database. Using the Gene Set Enrichment Analysis, we selected 198 genes related to GLYCOLYSIS, which is the most important enriched gene set. Moreover, these genes were identified using Cox proportional regression modeling. We established a risk score staging system to predict the outcome of patients with LUAD and subsequently identified four genes (AGRN, AKR1A1, DDIT4, and HMMR) that were closely related to the prognosis of patients with LUAD. The identified genes allowed us to classify patients into the high-risk group (with poor outcome) and low-risk group (with better outcome). Compared with other clinical factors, the risk score has a better performance in predicting the outcome of patients with LUAD, particularly in the early stage of LUAD. In conclusion, we developed a four-gene signature related to glycolysis by utilizing the Cox regression model and a risk staging model for LUAD, which might prove valuable for the clinical management of patients with LUAD.
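A gene-signature risk score of the kind described above is commonly computed as a coefficient-weighted sum of expression values, with patients then split into groups at a cutoff. The Python sketch below illustrates that pattern under stated assumptions: the coefficient values are hypothetical placeholders (the abstract does not report them), and the median cutoff is an assumed convention rather than the authors' documented choice.

```python
from statistics import median

# Hypothetical Cox coefficients for the four genes named in the abstract;
# the real values are not given there.
COEFFICIENTS = {"AGRN": 0.2, "AKR1A1": 0.1, "DDIT4": 0.3, "HMMR": 0.4}

def risk_score(expression):
    """Weighted sum of expression values using the Cox coefficients."""
    return sum(COEFFICIENTS[g] * expression[g] for g in COEFFICIENTS)

def stratify(patients):
    """Label each patient 'high' or 'low' risk relative to the cohort median score."""
    scores = {pid: risk_score(expr) for pid, expr in patients.items()}
    cutoff = median(scores.values())
    return {pid: ("high" if s > cutoff else "low") for pid, s in scores.items()}
```

Calling `stratify` on a dictionary that maps patient identifiers to per-gene expression values returns a high/low label for each patient.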

9.
Cai T, Tonini G, Lin X. Biometrics 2011, 67(3): 975-986
There is growing evidence that genomic and proteomic research holds great potential for changing irrevocably the practice of medicine. The ability to identify important genomic and biological markers for risk assessment can have a great impact in public health from disease prevention, to detection, to treatment selection. However, the potentially large number of markers and the complexity in the relationship between the markers and the outcome of interest pose a grand challenge in developing accurate risk prediction models. The standard approach to identifying important markers often assesses the marginal effects of individual markers on a phenotype of interest. When multiple markers relate to the phenotype simultaneously via a complex structure, this type of marginal analysis may not be effective. To overcome such difficulties, we employ a kernel machine Cox regression framework and propose an efficient score test to assess the overall effect of a set of markers, such as genes within a pathway or a network, on survival outcomes. The proposed test has the advantage of capturing the potentially nonlinear effects without explicitly specifying a particular nonlinear functional form. To approximate the null distribution of the score statistic, we propose a simple resampling procedure that can be easily implemented in practice. Numerical studies suggest that the test performs well with respect to both empirical size and power even when the number of variables in a gene set is not small compared to the sample size.

10.
In this article, we address a missing data problem that occurs in transplant survival studies. Recipients of organ transplants are followed up from transplantation and their survival times recorded, together with various explanatory variables. Due to differences in data collection procedures in different centers or over time, a particular explanatory variable (or set of variables) may only be recorded for certain recipients, which results in this variable being missing for a substantial number of records in the data. The variable may also turn out to be an important predictor of survival and so it is important to handle this missing-by-design problem appropriately. Consensus in the literature is to handle this problem with complete case analysis, as the missing data are assumed to arise under an appropriate missing at random mechanism that gives consistent estimates here. Specifically, the missing values can reasonably be assumed not to be related to the survival time. In this article, we investigate the potential for multiple imputation to handle this problem in a relevant study on survival after kidney transplantation, and show that it comprehensively outperforms complete case analysis on a range of measures. This is a particularly important finding in the medical context as imputing large amounts of missing data is often viewed with scepticism.

11.
Local climatic conditions likely constitute an important selective pressure on genes underlying important fitness‐related traits such as flowering time, and in many species, flowering phenology and climatic gradients strongly covary. To test whether climate shapes the genetic variation on flowering time genes and to identify candidate flowering genes involved in the adaptation to environmental heterogeneity, we used a large Medicago truncatula core collection to examine the association between nucleotide polymorphisms at 224 candidate genes and both climate variables and flowering phenotypes. Unlike genome‐wide studies, candidate gene approaches are expected to enrich for the number of meaningful trait associations because they specifically target genes that are known to affect the trait of interest. We found that flowering time mediates adaptation to climatic conditions mainly by variation at genes located upstream in the flowering pathways, close to the environmental stimuli. Variables related to the annual precipitation regime reflected selective constraints on flowering time genes better than the other variables tested (temperature, altitude, latitude or longitude). By comparing phenotype and climate associations, we identified 12 flowering genes as the most promising candidates responsible for phenological adaptation to climate. Four of these genes were located in the known flowering time QTL region on chromosome 7. However, climate and flowering associations also highlighted largely distinct gene sets, suggesting different genetic architectures for adaptation to climate and flowering onset.

12.
The rising global epidemic of diabetic nephropathy (DN) will likely lead to an increase in the prevalence of cardiovascular morbidity and mortality, posing a serious burden for public health care. Despite greater understanding of the etiology of diabetes and the development of novel treatment strategies to control blood glucose levels, the prevalence and incidence rate of DN are increasing, especially in minority populations including Mexican–Americans. Mexican–Americans with type 2 diabetes (T2DM) are three times more likely to develop microalbuminuria, and four times more likely to develop clinical proteinuria compared to non-Hispanic whites. Furthermore, Mexican–Americans have a sixfold increased risk of developing renal failure secondary to T2DM compared to Caucasians. Prevention and better treatment of DN should be a high priority for both health-care organizations and society at large. Pathogenesis of DN is multi-factorial. Familial clustering of DN-related traits in Mexican–Americans shows that DN and related traits are heritable and that genes play a susceptibility role. While there has been some progress in identifying genes which when mutated influence an individual’s risk, major gene(s) responsible for DN are yet to be identified. Knowledge of the genetic causes of DN is essential for elucidation of its mechanisms, and for adequate classification, prognosis, and treatment. Self-identification and collaboration among researchers with suitable genomic and clinical data for meta-analyses in Mexican–Americans is critical for progress in replicating/identifying DN risk genes in this population. This paper reviews the approaches and recent efforts made to identify genetic variants contributing to risk for DN and related phenotypes in the Mexican–American population.

13.
MOTIVATION: A common task in microarray data analysis consists of identifying genes associated with a phenotype. When the outcomes of interest are censored time-to-event data, standard approaches assess the effect of genes by fitting univariate survival models. In this paper, we propose a Bayesian variable selection approach, which allows the identification of relevant markers by jointly assessing sets of genes. We consider accelerated failure time (AFT) models with log-normal and log-t distributional assumptions. A data augmentation approach is used to impute the failure times of censored observations and mixture priors are used for the regression coefficients to identify promising subsets of variables. The proposed method provides a unified procedure for the selection of relevant genes and the prediction of survivor functions. RESULTS: We demonstrate the performance of the method on simulated examples and on several microarray datasets. For the simulation study, we consider scenarios with a large number of noisy variables and different degrees of correlation between the relevant and non-relevant (noisy) variables. We are able to identify the correct covariates and obtain good prediction of the survivor functions. For the microarray applications, some of our selected genes are known to be related to the diseases under study and a few are in agreement with findings from other researchers. AVAILABILITY: The Matlab code for implementing the Bayesian variable selection method may be obtained from the corresponding author. CONTACT: mvannucci@stat.tamu.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
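The data augmentation step mentioned above can be illustrated for the log-normal AFT case, where log(T) = x'beta + sigma * eps with eps standard normal. For a right-censored patient, one plausible imputation draws log(T) from that normal distribution truncated to values above the log censoring time. The Python sketch below shows only this single draw. The function name and arguments are hypothetical, and the authors' Matlab implementation (including the mixture priors and the log-t variant) is not reproduced here.

```python
import math
import random
from statistics import NormalDist

def impute_log_failure_time(mean, sigma, censor_time, rng=random):
    """Draw log(T) from N(mean, sigma^2) truncated to values above log(censor_time).

    mean        : linear predictor x'beta for this patient
    sigma       : scale of the log-normal AFT error
    censor_time : observed right-censoring time (T is known to exceed it)
    """
    dist = NormalDist(mean, sigma)
    lower = math.log(censor_time)
    p_low = dist.cdf(lower)
    # Inverse-CDF sampling restricted to the upper tail (p_low, 1)
    u = p_low + rng.random() * (1.0 - p_low)
    u = min(max(u, 1e-12), 1.0 - 1e-12)  # keep inv_cdf inside its open domain
    return max(dist.inv_cdf(u), lower)
```

In a full sampler, a draw like this would typically be repeated for every censored patient at each iteration, conditional on the current values of the coefficients and scale.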

14.
Gene expression measurements have successfully been used for building prognostic signatures, i.e., for identifying a short list of important genes that can predict patient outcome. Mostly microarray measurements have been considered, and there is little advice available for building multivariable risk prediction models from RNA-Seq data. We specifically consider penalized regression techniques, such as the lasso and componentwise boosting, which can simultaneously consider all measurements and provide both multivariable regression models for prediction and automated variable selection. However, they might be affected by the typical skewness, mean-variance dependency or extreme values of RNA-Seq covariates and therefore could benefit from transformations of the latter. In an analytical part, we highlight preferential selection of covariates with large variances, which is problematic due to the mean-variance dependency of RNA-Seq data. In a simulation study, we compare different transformations of RNA-Seq data for potentially improving detection of important genes. Specifically, we consider standardization, the log transformation, a variance-stabilizing transformation, the Box-Cox transformation, and rank-based transformations. In addition, the prediction performance for real data from patients with kidney cancer and acute myeloid leukemia is considered. We show that signature size, identification performance, and prediction performance critically depend on the choice of a suitable transformation. Rank-based transformations perform well in all scenarios and can even outperform complex variance-stabilizing approaches. Generally, the results illustrate that the distribution and potential transformations of RNA-Seq data need to be considered as a critical step when building risk prediction models by penalized regression techniques.
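Three of the transformations listed in this abstract are simple enough to sketch directly. The Python functions below show one possible form of the log transformation, standardization, and a rank-based transformation for a single gene's counts across samples. The `+1` offset in the log transform and the average-rank handling of ties are assumptions made for this illustration, and the exact variants studied in the paper may differ.

```python
import math
from statistics import mean, pstdev

def log_transform(counts):
    """log2(count + 1); the +1 offset (an assumption here) avoids log(0)."""
    return [math.log2(c + 1) for c in counts]

def standardize(values):
    """Centre to mean 0 and scale to unit (population) standard deviation."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

def rank_transform(values):
    """Replace each value by its rank (1 = smallest); ties get the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next value is tied with the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1  # average of 1-based positions pos+1..end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks
```

Because ranks ignore the magnitude of extreme counts, a transformation like `rank_transform` removes the large-variance advantage that highly expressed genes would otherwise have in penalized selection.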

15.
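Screening of molecular signatures associated with colorectal cancer metastasis   Total citations: 1, self-citations: 0, citations by others: 1
OBJECTIVE: To screen for molecular signatures of colorectal cancer that are associated with metastasis. METHODS: We analyzed colorectal cancer expression profile data, performing significance analysis according to metastasis status, degree of histological differentiation, and patient survival time. Significant gene groups associated with colorectal cancer metastasis were clustered, and principal component analysis and self-organizing maps were used to identify, among the differentially expressed genes, the gene groups that play the principal role in classification. RESULTS: Differentially expressed genes associated with tumor differentiation grade and patient survival time were identified and subjected to functional enrichment analysis, and a set of important genes closely associated with colorectal cancer metastasis was identified. These genes are of considerable significance for early diagnosis, timely treatment, and prognostic evaluation of colorectal cancer. CONCLUSIONS: Molecular events such as cellular metabolism, chemokine signaling pathways, and cytokine receptors are closely associated with the degree of differentiation of colorectal cancer, and whether metastasis occurs is closely associated with prognostic survival in colorectal cancer.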

16.

Background

Amyotrophic lateral sclerosis (ALS) is a degenerative disease predominantly affecting motor neurons and manifesting as several different phenotypes. Whether these phenotypes correspond to different underlying disease processes is unknown. We used latent cluster analysis to identify groupings of clinical variables in an objective and unbiased way to improve phenotyping for clinical and research purposes.

Methods

Latent class cluster analysis was applied to a large database consisting of 1467 records of people with ALS, using discrete variables which can be readily determined at the first clinic appointment. The model was tested for clinical relevance by survival analysis of the phenotypic groupings using the Kaplan-Meier method.
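For reference, the Kaplan-Meier estimate used to compare survival between phenotypic classes can be written in a few lines. The Python sketch below is a generic textbook version, not the authors' code: at each distinct event time it multiplies the running survival probability by one minus the fraction of at-risk patients who died at that time.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time.

    times  : follow-up time per patient
    events : 1 if death was observed at that time, 0 if censored
    """
    survival = 1.0
    curve = []
    for t in sorted(set(t for t, d in zip(times, events) if d)):
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, d in zip(times, events) if u == t and d)
        survival *= 1.0 - deaths / at_risk
        curve.append((t, survival))
    return curve
```

Running this once per phenotypic class gives the per-class curves that a test such as the log-rank test would then compare.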

Results

The best model generated five distinct phenotypic classes that strongly predicted survival (p<0.0001). Eight variables were used for the latent class analysis, but a good estimate of the classification could be obtained using just two variables: site of first symptoms (bulbar or limb) and time from symptom onset to diagnosis (p<0.00001).

Conclusion

The five phenotypic classes identified using latent cluster analysis can predict prognosis. They could be used to stratify patients recruited into clinical trials and to generate more homogeneous disease groups for genetic, proteomic and risk factor research.

17.
The detailed geometry of atherosclerosis-prone vascular segments may influence their susceptibility by mediating local hemodynamics. An appreciation of the role of specific geometric variables is complicated by the considerable correlation among the many parameters that can be used to describe arterial shape and size. Factor analysis is a useful tool for identifying the essential features of such an inter-related data set, as well as for predicting hemodynamic risk in terms of these features and for interpreting the role of specific geometric variables. Here, factor analysis is applied to a set of 14 geometric variables obtained from magnetic resonance images of 50 human carotid bifurcations. Two factors alone were capable of predicting 12 hemodynamic metrics related to shear and near-wall residence time with adjusted squared Pearson's correlation coefficient as high as 0.54 and P-values less than 0.0001. One factor measures cross-sectional expansion at the bifurcation; the other measures the colinearity of the common and internal carotid artery axes at the bifurcation. The factors explain the apparent lack of an effect of branch angle on hemodynamic risk. The relative risk among the 50 bifurcations, based on time-average wall shear stress, could be predicted with a sensitivity and specificity as high as 0.84. The predictability of the hemodynamic metrics and relative risk is only modestly sensitive to assumptions about flow rates and flow partitions in the bifurcation.

18.
Statistical methods designed specifically for the analysis of chronic disease incidence and progression in longitudinal studies are presented. These methods model the risk of acute phases of chronic disease separately from the temporal change in risk variables. This could be accomplished because, under a specific biological model of the disease mechanism, the problems of estimating the risk of an acute event and of predicting the change in risk variables are independent. Specifically, a quadratic equation relating risk variable values to chronic disease risk and a system of linear equations predicting future risk variable values from present values may be estimated separately. Taken together, they utilize the full information available in a longitudinal study on the temporal dimension of chronic disease progression. In addition, the model is found to possess a number of attractive statistical and theoretical properties. These methods are applied to longitudinal data from the Framingham Study on coronary heart disease (CHD) in males. A quadratic function relating the risk of a CHD event to selected risk variables (age, and the natural logarithms of serum cholesterol, uric acid, diastolic blood pressure and pulse pressure) was estimated from measurements made at four points equally spaced in time (two years) with a further morbidity follow-up at a fifth point. The risk function was found to predict CHD risk accurately. It showed that, apart from the linear effects of the risk variables, cohort effects, quadratic effects and interaction effects were important predictors of CHD risk. The linear regression equations used to predict future risk variable values showed that there was an intricate network of cross-temporal associations. Study of the two types of equations jointly shows that putative risk variables could affect the risk of CHD incidence both directly, by being associated with higher levels of risk, and indirectly, by causing other risk variable values to change with time. The results led us to identify several different roles that risk variables might play in CHD incidence.

19.

Background

Currently, prognostication for pancreatic ductal adenocarcinoma (PDAC) is based upon a coarse clinical staging system. Thus, more accurate prognostic tests are needed for PDAC patients to aid treatment decisions.

Methods and Findings

Affymetrix gene expression profiling was carried out on 15 human PDAC tumors and from the data we identified a 13-gene expression signature (risk score) that correlated with patient survival. The gene expression risk score was then independently validated using published gene expression data and survival data for an additional 101 patients with pancreatic cancer. Patients with high-risk scores had significantly higher risk of death compared to patients with low-risk scores (HR 2.27, p = 0.002). When the 13-gene score was combined with lymph node status the risk-score further discriminated the length of patient survival time (p<0.001). Patients with a high-risk score had poor survival independent of nodal status; however, nodal status increased predictability for survival in patients with a low-risk gene signature score (low-risk N1 vs. low-risk N0: HR = 2.0, p = 0.002). While AJCC stage correlated with patient survival (p = 0.03), the 13-gene score was superior at predicting survival. Of the 13 genes comprising the predictive model, four have been shown to be important in PDAC, six are unreported in PDAC but important in other cancers, and three are unreported in any cancer.

Conclusions

We identified a 13-gene expression signature that predicts survival of PDAC patients and could prove useful for making treatment decisions. This risk score should be evaluated prospectively in clinical trials for prognostication and for predicting response to chemotherapy. Investigation of new genes identified in our model may lead to novel therapeutic targets.

20.
Characterizing the functional phenotypes of neurons is essential for understanding how genotypes can be related to the neural basis of behaviour. Traditional classifications of neurons by single features (such as morphology or firing behaviour) are increasingly inadequate for reflecting functional phenotypes, as they do not integrate functions across different neuronal types. Here, we describe a set of rules for identifying and predicting functional phenotypes that combine morphology, intrinsic ion channel species and their distributions in dendrites, and functional properties. This more comprehensive neuronal classification should be an improvement on traditional classifications for relating genotype to functional phenotype.
