Similar literature
20 similar articles found
1.
Quantitative trait loci (QTL)/association mapping aims at finding genomic loci associated with the phenotypes, whereas genomic selection focuses on breeding value prediction based on genomic data. Variable selection is key to both of these tasks, as it allows one to (1) detect clear mapping signals of QTL activity, and (2) predict the genome-enhanced breeding values accurately. In this paper, we provide an overview of a statistical method called the least absolute shrinkage and selection operator (LASSO) and two of its generalizations, the elastic net and the adaptive LASSO, in the contexts of QTL mapping and genomic breeding value prediction in plants (or animals). We also briefly summarize the Bayesian interpretation of LASSO and the hierarchical Bayesian models it inspired. We illustrate the implementation and examine the performance of the methods using three public data sets: (1) North American barley data with 127 individuals and 145 markers, (2) simulated QTLMAS XII data with 5,865 individuals and 6,000 markers for both QTL mapping and genomic selection, and (3) wheat data with 599 individuals and 1,279 markers for genomic selection only.
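To make the LASSO idea in the abstract above concrete, here is a minimal sketch of cyclic coordinate descent with soft-thresholding. It is only an illustration and not the authors' implementation: the function names, the objective scaling (1/(2n)), the fixed iteration count, and the toy marker data are all assumptions made for this example.

```python
def soft_threshold(z, lam):
    """Soft-thresholding operator: shrink z toward zero by lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimize (1/(2n)) * ||y - X b||^2 + lam * ||b||_1 (no intercept).

    X is a list of n rows, each a list of p marker values (hypothetical
    layout); y is a list of n phenotype values.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0   # covariance of column j with the partial residual
            norm = 0.0  # squared norm of column j
            for i in range(n):
                pred_without_j = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                norm += X[i][j] ** 2
            # Closed-form coordinate update; small effects are set exactly to zero.
            beta[j] = soft_threshold(rho / n, lam) / (norm / n) if norm > 0 else 0.0
    return beta


# Toy data: marker 0 has a large effect, marker 1 a negligible one.
X_toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
y_toy = [2.0, 0.1, 2.0, 0.1]
beta_toy = lasso_coordinate_descent(X_toy, y_toy, lam=0.1)
```

In this toy design the penalty shrinks the large effect and sets the small one exactly to zero, which is the variable-selection behavior the abstract refers to. The elastic net and adaptive LASSO change the penalty term and are not covered by this sketch.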

2.
Houseman EA  Coull BA  Betensky RA 《Biometrics》2006,62(4):1062-1070
Genomic data are often characterized by a moderate to large number of categorical variables observed for relatively few subjects. Some of the variables may be missing or noninformative. An example of such data is loss of heterozygosity (LOH), a dichotomous variable, observed on a moderate number of genetic markers. We first consider a latent class model where, conditional on unobserved membership in one of k classes, the variables are independent with probabilities determined by a regression model of low dimension q. Using a family of penalties including the ridge and LASSO, we extend this model to address higher-dimensional problems. Finally, we present an orthogonal map that transforms marker space to a space of "features" for which the constrained model has better predictive power. We demonstrate these methods on LOH data collected at 19 markers from 93 brain tumor patients. For this data set, the existing unpenalized latent class methodology does not produce estimates. Additionally, we show that posterior classes obtained from this method are associated with survival for these patients.

3.
Hazard regression for interval-censored data with penalized spline   Cited by: 1 (self-citations: 0, citations by others: 1)
Cai T  Betensky RA 《Biometrics》2003,59(3):570-579
This article introduces a new approach for estimating the hazard function for possibly interval- and right-censored survival data. We weakly parameterize the log-hazard function with a piecewise-linear spline and provide a smoothed estimate of the hazard function by maximizing the penalized likelihood through a mixed model-based approach. We also provide a method to estimate the amount of smoothing from the data. We illustrate our approach with two well-known interval-censored data sets. Extensive numerical studies are conducted to evaluate the efficacy of the new procedure.

4.
We introduce and evaluate data analysis methods to interpret simultaneous measurement of multiple genomic features made on the same biological samples. Our tools use gene sets to provide an interpretable common scale for diverse genomic information. We show we can detect genetic effects, although they may act through different mechanisms in different samples, and show we can discover and validate important disease-related gene sets that would not be discovered by analyzing each data type individually.

5.
MOTIVATION: Genome sequencing projects and high-throughput technologies like DNA and protein arrays have resulted in a very large amount of information-rich data. Microarray experimental data are a valuable but limited source for inferring gene regulation mechanisms on a genomic scale. Additional information such as promoter sequences of genes/DNA binding motifs, gene ontologies, and location data, when combined with gene expression analysis, can increase the statistical significance of the findings. This paper introduces a machine learning approach to information fusion for combining heterogeneous genomic data. The algorithm uses an unsupervised joint learning mechanism that identifies clusters of genes using the combined data. RESULTS: The correlation between gene expression time-series patterns obtained from different experimental conditions and the presence of several distinct and repeated motifs in their upstream sequences is examined here using publicly available yeast cell-cycle data. The results show that the combined learning approach taken here identifies correlated genes effectively. The algorithm provides an automated clustering method, but allows the user to specify a priori the influence of each data type on the final clustering using probabilities. AVAILABILITY: Software code is available by request from the first author. CONTACT: jkasturi@cse.psu.edu.

6.
The use of penalized logistic regression for cancer classification using microarray expression data is presented. Two dimension reduction methods are each combined with the penalized logistic regression so that both the classification accuracy and computational speed are enhanced. Two other machine-learning methods, support vector machines and least-squares regression, have been chosen for comparison. It is shown that our methods achieve equal or better results. They also have the advantage that the output probability can be explicitly given and the regression coefficients are easier to interpret. Several other aspects pertinent to the application of our methods for cancer classification, such as the selection of penalty parameters and components, are also discussed.

7.
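Comparison of three regression analysis methods for LAI retrieval from Hyperion imagery   Cited by: 2 (self-citations: 0, citations by others: 2)
Sun Hua  Ju Hongbo  Zhang Huaiqing  Lin Hui  Ling Chengxing 《Acta Ecologica Sinica (生态学报)》2012,32(24):7781-7790
Using GPS for precise ground positioning, leaf area index (LAI) was measured with an LAI-2000 canopy analyzer on 130 sample plots (60 m × 60 m) at Huangfengqiao Forest Farm, You County. Hyperion data were atmospherically corrected with the FLAASH module and fitted to synchronous ground canopy observations. By examining the correlations between ground-measured LAI and the Hyperion image bands and a series of derived vegetation indices (NDVI, RVI, etc.), vegetation-index factors for estimating LAI were selected. Three regression techniques (curve estimation, stepwise regression, and partial least squares) were then used to build optimal LAI estimation models. The results show that, among the factors used for modeling, the ratio vegetation index (RVI) had the strongest correlation with LAI and the highest sensitivity, followed by SARVI0.1, NDVI705, NDVI, SARVI0.1, and SARVI0.25. Among the six regression models built with the three methods, partial least squares regression gave the best fit, with a coefficient of determination R2 of 0.84 between predicted and measured values, whereas curve estimation gave the poorest fit, with an R2 of 0.64. Analysis of modeling accuracy shows that using 5-6 independent variables for LAI modeling is reliable, and the partial least squares regression model built on six vegetation factors had the highest prediction accuracy.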
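The vegetation indices and the coefficient of determination named in the abstract above can be written out as a short sketch. The formulas are the standard definitions of NDVI, RVI, and R2; the function names and the reflectance values are hypothetical, and nothing here reproduces the paper's band selection or its fitted models.

```python
def ndvi(nir, red):
    """Normalized difference vegetation index: (NIR - Red) / (NIR + Red)."""
    return (nir - red) / (nir + red)


def rvi(nir, red):
    """Ratio vegetation index: NIR / Red."""
    return nir / red


def r_squared(measured, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_measured = sum(measured) / len(measured)
    ss_res = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    ss_tot = sum((m - mean_measured) ** 2 for m in measured)
    return 1.0 - ss_res / ss_tot


# Hypothetical reflectances for one plot (not taken from the study).
plot_ndvi = ndvi(nir=0.5, red=0.1)
plot_rvi = rvi(nir=0.5, red=0.1)
```

An R2 of 0.84 under this definition means the model's residual sum of squares is 16% of the total variation in measured LAI.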

8.
Discovery of biological networks from diverse functional genomic data   Cited by: 1 (self-citations: 0, citations by others: 1)
We have developed a general probabilistic system for query-based discovery of pathway-specific networks through integration of diverse genome-wide data. This framework was validated by accurately recovering known networks for 31 biological processes in Saccharomyces cerevisiae and experimentally verifying predictions for the process of chromosomal segregation. Our system, bioPIXIE, a public, comprehensive system for integration, analysis, and visualization of biological network predictions for S. cerevisiae, is freely accessible over the World Wide Web.

9.
One important problem in genomic research is to identify genomic features such as gene expression data or DNA single nucleotide polymorphisms (SNPs) that are related to clinical phenotypes. Often these genomic data can be naturally divided into biologically meaningful groups such as genes belonging to the same pathways or SNPs within genes. In this paper, we propose group additive regression models and a group gradient descent boosting procedure for identifying groups of genomic features that are related to clinical phenotypes. Our simulation results show that by dividing the variables into appropriate groups, we can obtain better identification of the group features that are related to the phenotypes. In addition, the prediction mean square errors are also smaller than the component-wise boosting procedure. We demonstrate the application of the methods to pathway-based analysis of microarray gene expression data of breast cancer. Results from analysis of a breast cancer microarray gene expression data set indicate that the pathways of metalloendopeptidases (MMPs) and MMP inhibitors, as well as cell proliferation, cell growth, and maintenance are important to breast cancer-specific survival.
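For orientation, the component-wise boosting procedure that the abstract above uses as its baseline can be sketched as follows. This is a generic component-wise L2 boosting loop under simplifying assumptions (squared-error loss, no intercept, a fixed step size `nu`, a fixed number of steps); it is not the group gradient descent boosting procedure the authors propose, and all names and data are hypothetical.

```python
def componentwise_l2_boost(X, y, n_steps=100, nu=0.1):
    """Component-wise L2 boosting for a linear model without intercept.

    At each step, every single feature is fitted to the current residuals
    by least squares; only the best-fitting feature's coefficient is
    updated, by a fraction nu of its least-squares estimate.
    """
    n, p = len(X), len(X[0])
    coef = [0.0] * p
    resid = list(y)
    for _ in range(n_steps):
        best_j, best_b, best_sse = None, 0.0, float("inf")
        for j in range(p):
            sxx = sum(X[i][j] ** 2 for i in range(n))
            if sxx == 0:
                continue  # a constant-zero feature cannot explain anything
            b = sum(X[i][j] * resid[i] for i in range(n)) / sxx
            sse = sum((resid[i] - b * X[i][j]) ** 2 for i in range(n))
            if sse < best_sse:
                best_j, best_b, best_sse = j, b, sse
        if best_j is None:
            break
        coef[best_j] += nu * best_b
        resid = [resid[i] - nu * best_b * X[i][best_j] for i in range(n)]
    return coef


# Toy data: the response depends only on feature 0 (slope 2).
X_toy = [[1.0, 1.0], [2.0, -1.0], [3.0, 1.0], [4.0, -1.0]]
y_toy = [2.0, 4.0, 6.0, 8.0]
coef_toy = componentwise_l2_boost(X_toy, y_toy)
```

A group version would, roughly speaking, select and update a whole block of features (for example, all genes in a pathway) at each step instead of a single column; the details of that update are specific to the paper and are not reproduced here.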

10.
In recent years accelerometers have become widely used to objectively assess physical activity. Usually intensity ranges are assigned to the measured accelerometer counts by simple cut points, disregarding the underlying activity pattern. Under the assumption that physical activity can be seen as a distinct sequence of distinguishable activities, the use of hidden Markov models (HMM) has been proposed to improve the modeling of accelerometer data. As a further improvement we propose to use expectile regression utilizing a Whittaker smoother with an L0‐penalty to better capture the intensity levels underlying the observed counts. Different expectile asymmetries beyond the mean allow the distinction of monotonous and more variable activities, as expectiles effectively model the complete distribution of the counts. This new approach is investigated in a simulation study, where we simulated 1,000 days of accelerometer data with 1 and 5 s epochs, based on collected labeled data to resemble real‐life data as closely as possible. The expectile regression is compared to HMMs and the commonly used cut point method with regard to misclassification rate, number of identified bouts and identified levels, as well as the proportion of the estimate being in the range of the true activity level. In summary, expectile regression utilizing a Whittaker smoother with an L0‐penalty outperforms HMMs and the cut point method and is hence a promising approach to model accelerometer data.
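The notion of an expectile used in the abstract above can be illustrated with a small sketch. It computes a single sample expectile by iteratively reweighted averaging (asymmetric least squares). It does not implement the Whittaker smoother or the L0 penalty from the paper; the function name, iteration limits, and data are assumptions for this example.

```python
def expectile(values, tau, n_iter=100, tol=1e-12):
    """Sample expectile for asymmetry tau in (0, 1).

    Minimizes sum_i w_i * (v_i - m)^2 with w_i = tau if v_i > m, else 1 - tau,
    by repeatedly recomputing the weighted mean. tau = 0.5 gives the mean.
    """
    m = sum(values) / len(values)
    for _ in range(n_iter):
        weights = [tau if v > m else 1.0 - tau for v in values]
        new_m = sum(w * v for w, v in zip(weights, values)) / sum(weights)
        if abs(new_m - m) < tol:
            return new_m
        m = new_m
    return m


# Hypothetical accelerometer counts for a short window.
counts = [1.0, 2.0, 3.0, 4.0]
central = expectile(counts, 0.5)
upper = expectile(counts, 0.9)
```

Expectiles with tau above 0.5 sit above the mean and those below 0.5 sit below it, so a set of them describes the spread of the counts, which is how they can separate monotonous from more variable activity.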

11.
High-throughput genomic data provide an opportunity for identifying pathways and genes that are related to various clinical phenotypes. Besides these genomic data, another valuable source of data is the biological knowledge about genes and pathways that might be related to the phenotypes of many complex diseases. Databases of such knowledge are often called the metadata. In microarray data analysis, such metadata are currently explored in post hoc ways by gene set enrichment analysis but have hardly been utilized in the modeling step. We propose to develop and evaluate a pathway-based gradient descent boosting procedure for nonparametric pathways-based regression (NPR) analysis to efficiently integrate genomic data and metadata. Such NPR models consider multiple pathways simultaneously and allow complex interactions among genes within the pathways and can be applied to identify pathways and genes that are related to variations of the phenotypes. These methods also provide an alternative to mediating the problem of a large number of potential interactions by limiting analysis to biologically plausible interactions between genes in related pathways. Our simulation studies indicate that the proposed boosting procedure can indeed identify relevant pathways. Application to a gene expression data set on breast cancer distant metastasis identified that Wnt, apoptosis, and cell cycle-regulated pathways are more likely related to the risk of distant metastasis among lymph-node-negative breast cancer patients. Results from analysis of two other breast cancer gene expression data sets indicate that the pathways of Metalloendopeptidases (MMPs) and MMP inhibitors, as well as cell proliferation, cell growth, and maintenance are important to breast cancer relapse and survival. We also observed that by incorporating the pathway information, we achieved better prediction for cancer recurrence.

12.
Classification of gene microarrays by penalized logistic regression   Cited by: 2 (self-citations: 0, citations by others: 2)
Classification of patient samples is an important aspect of cancer diagnosis and treatment. The support vector machine (SVM) has been successfully applied to microarray cancer diagnosis problems. However, one weakness of the SVM is that given a tumor sample, it only predicts a cancer class label but does not provide any estimate of the underlying probability. We propose penalized logistic regression (PLR) as an alternative to the SVM for the microarray cancer diagnosis problem. We show that when using the same set of genes, PLR and the SVM perform similarly in cancer classification, but PLR has the advantage of additionally providing an estimate of the underlying probability. Often a primary goal in microarray cancer diagnosis is to identify the genes responsible for the classification, rather than class prediction. We consider two gene selection methods in this paper, univariate ranking (UR) and recursive feature elimination (RFE). Empirical results indicate that PLR combined with RFE tends to select fewer genes than other methods and also performs well in both cross-validation and test samples. A fast algorithm for solving PLR is also described.
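The point that penalized logistic regression yields class probabilities, not just labels, can be shown with a minimal sketch. It fits a ridge-penalized (L2) logistic model by plain gradient descent; this is not the fast algorithm described in the paper, and the function names, penalty weight, learning rate, and one-gene toy data are all assumptions for illustration.

```python
import math


def sigmoid(z):
    """Logistic function mapping a linear score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_ridge_logistic(X, y, lam=0.1, lr=0.5, n_iter=2000):
    """Minimize mean logistic loss + (lam / 2) * ||w||^2 by gradient descent.

    The intercept b is not penalized. X is a list of rows (samples) of
    expression values; y holds class labels 0 or 1.
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    b = 0.0
    for _ in range(n_iter):
        grad_w = [lam * wj for wj in w]  # gradient of the ridge penalty
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err / n
            for j in range(p):
                grad_w[j] += err * xi[j] / n
        w = [wj - lr * g for wj, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict_proba(w, b, x):
    """Estimated probability that sample x belongs to class 1."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))


# Toy data: one hypothetical gene, higher expression in class 1.
X_toy = [[-2.0], [-1.0], [1.0], [2.0]]
y_toy = [0, 0, 1, 1]
w_toy, b_toy = fit_ridge_logistic(X_toy, y_toy)
p_new = predict_proba(w_toy, b_toy, [0.5])
```

The value returned by `predict_proba` is a number strictly between 0 and 1, which is the extra information the abstract contrasts with an SVM's bare class label. The penalty keeps the weights finite even when the classes are perfectly separable, as they are in this toy data.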

13.
Motivated by investigating the relationship between progesterone and the days in a menstrual cycle in a longitudinal study, we propose a multikink quantile regression model for longitudinal data analysis. It relaxes the linearity condition and assumes different regression forms in different regions of the domain of the threshold covariate. In this paper, we first propose a multikink quantile regression for longitudinal data. Two estimation procedures are proposed to estimate the regression coefficients and the kink point locations: one is a computationally efficient profile estimator under the working independence framework, while the other accounts for the within-subject correlations by using the unbiased generalized estimation equation approach. The selection consistency of the number of kink points and the asymptotic normality of the two proposed estimators are established. Second, we construct a rank score test based on partial subgradients for the existence of the kink effect in longitudinal studies. Both the null distribution and the local alternative distribution of the test statistic have been derived. Simulation studies show that the proposed methods have excellent finite sample performance. In the application to the longitudinal progesterone data, we identify two kink points in the progesterone curves over different quantiles and observe that the progesterone level remains stable before the day of ovulation, then increases quickly in the 5 to 6 days after ovulation, and then stabilizes again or drops slightly.

14.
15.
MOTIVATION: Multilayer perceptrons (MLP) represent one of the widely used and effective machine learning methods currently applied to diagnostic classification based on high-dimensional genomic data. Since the dimensionalities of the existing genomic data often exceed the available sample sizes by orders of magnitude, the MLP performance may degrade owing to the curse of dimensionality and over-fitting, and may not provide acceptable prediction accuracy. RESULTS: Based on Fisher linear discriminant analysis, we designed and implemented an MLP optimization scheme for a two-layer MLP that effectively optimizes the initialization of MLP parameters and MLP architecture. The optimized MLP consistently demonstrated its ability in easing the curse of dimensionality in large microarray datasets. In comparison with a conventional MLP using random initialization, we obtained significant improvements in major performance measures including Bayes classification accuracy, convergence properties and area under the receiver operating characteristic curve (A(z)). SUPPLEMENTARY INFORMATION: The Supplementary information is available on http://www.cbil.ece.vt.edu/publications.htm

16.
AZZALINI  A. 《Biometrika》1994,81(4):767-775

17.
Robust regression for clustered data with application to binary responses   Cited by: 3 (self-citations: 0, citations by others: 3)
Preisser JS  Qaqish BF 《Biometrics》1999,55(2):574-579
Generalized estimating equations (GEE) can be highly influenced by the presence of unusual data points. A generalization of the GEE procedure, which yields parameter estimates and fitted values that are resistant to influential data, is introduced. Resistant generalized estimating equations (REGEE) include weights in the estimating equations to downweight influential observations or clusters. Influential observations are downweighted according to their leverage or residual in an example of correlated binary regression applied to 137 urinary incontinent elderly patients from 38 medical practices.

18.
A mixture Markov regression model is proposed to analyze heterogeneous time series data. Mixture quasi‐likelihood is formulated to model time series with mixture components and exogenous variables. The parameters are estimated by quasi‐likelihood estimating equations. A modified EM algorithm is developed for the mixture time series model. The model and proposed algorithm are tested on simulated data and applied to mosquito surveillance data in Peel Region, Canada.

19.
We consider the problem of estimating the intensity functions for a continuous time 'illness-death' model with intermittently observed data. In such a case, it may happen that a subject becomes diseased between two visits and dies without being observed. Consequently, there is an uncertainty about the precise number of transitions. Estimating the intensity of transition from health to illness by survival analysis (treating death as censoring) is biased downwards. Furthermore, the dates of transitions between states are not known exactly. We propose to estimate the intensity functions by maximizing a penalized likelihood. The method yields smooth estimates without parametric assumptions. This is illustrated using data from a large cohort study on cerebral ageing. The age-specific incidence of dementia is estimated using an illness-death approach and a survival approach.

20.
In the study of molecular and phenotypic evolution, understanding the relative importance of random genetic drift and positive selection as the mechanisms for driving divergences between populations and maintaining polymorphisms within populations has been a central issue. A variety of statistical methods has been developed for detecting natural selection operating at the amino acid and nucleotide sequence levels. These methods may be largely classified into those aimed at detecting recurrent and/or recent/ongoing natural selection by utilizing the divergence and/or polymorphism data. Using these methods, pervasive positive selection has been identified for protein-coding and non-coding sequences in the genomic analysis of some organisms. However, computer simulations and real data analyses have shown that many of these methods produce excessive false positives and are sensitive to various disturbing factors. Importantly, some of these methods have been invalidated experimentally. These facts indicate that many of the statistical methods for detecting natural selection are unreliable. In addition, the signals that have been regarded as evidence for fixations of advantageous mutations due to positive selection may also be interpreted as evidence for fixations of deleterious mutations due to random genetic drift. The genomic diversity data are rapidly accumulating in various organisms, and detection of natural selection may play a critical role in clarifying the relative roles of random genetic drift and positive selection in molecular and phenotypic evolution. It is therefore important to develop reliable statistical methods for inferring natural selection that are unbiased as well as robust against various disturbing factors.
