Similar Literature
 20 similar articles were retrieved.
1.
Summary We consider penalized linear regression, especially for “large p, small n” problems, for which the relationships among predictors are described a priori by a network. A class of motivating examples includes modeling a phenotype through gene expression profiles while accounting for coordinated functioning of genes in the form of biological pathways or networks. To incorporate the prior knowledge that neighboring predictors in a network have similar effect sizes, we propose a grouped penalty based on the Lγ-norm that smoothes the regression coefficients of the predictors over the network. The main feature of the proposed method is its ability to automatically realize grouped variable selection and exploit grouping effects. We also discuss the effects of the choices of γ and of the weights inside the Lγ-norm. Simulation studies demonstrate the superior finite-sample performance of the proposed method as compared to the Lasso, the elastic net, and a recently proposed network-based method. The new method performs best in variable selection across all simulation set-ups considered. For illustration, the method is applied to a microarray dataset to predict survival times of glioblastoma patients, using gene expression data and a gene network compiled from Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways.
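To make the penalty concrete, here is a minimal sketch in Python, assuming (purely for illustration) an edge-wise grouped penalty of the form lambda * sum over network edges (i, j) of (|beta_i/sqrt(d_i)|^gamma + |beta_j/sqrt(d_j)|^gamma)^(1/gamma), where d_i is the degree of node i; the paper's exact weighting scheme and fitting algorithm may differ, and the toy data below are hypothetical.

import numpy as np
from scipy.optimize import minimize

def network_penalty(beta, edges, degree, gam=2.0):
    # Grouped L_gamma penalty over the edges of the predictor network
    # (assumed form; degree-based weights encourage similar scaled coefficients).
    pen = 0.0
    for i, j in edges:
        bi = abs(beta[i]) / np.sqrt(degree[i])
        bj = abs(beta[j]) / np.sqrt(degree[j])
        pen += (bi ** gam + bj ** gam) ** (1.0 / gam)
    return pen

def objective(beta, X, y, edges, degree, lam, gam):
    # Penalized least squares: residual sum of squares plus the network penalty.
    resid = y - X @ beta
    return 0.5 * np.sum(resid ** 2) + lam * network_penalty(beta, edges, degree, gam)

# Hypothetical toy example: five predictors linked in a chain network,
# with only the first two predictors truly associated with the response.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
y = X[:, 0] + X[:, 1] + rng.normal(size=30)
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
degree = np.array([1.0, 2.0, 2.0, 2.0, 1.0])

fit = minimize(objective, np.zeros(5), args=(X, y, edges, degree, 1.0, 2.0),
               method="Nelder-Mead")
print(np.round(fit.x, 3))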

2.
Jeffrey T. Leek. Biometrics, 2011, 67(2): 344–352.
Summary High-dimensional data, such as those obtained from a gene expression microarray or a second-generation sequencing experiment, consist of a large number of dependent features measured on a small number of samples. One of the key problems in genomics is the identification and estimation of factors that associate with many features simultaneously. Identifying the number of factors is also important for unsupervised statistical analyses such as hierarchical clustering. A conditional factor model is the most common model for many types of genomic data, ranging from gene expression, to single nucleotide polymorphisms, to methylation. Here we show that under a conditional factor model for genomic data with a fixed sample size, the right singular vectors are asymptotically consistent for the unobserved latent factors as the number of features diverges. We also propose a consistent estimator of the dimension of the underlying conditional factor model for a finite, fixed sample size and an infinite number of features, based on a scaled eigen-decomposition. We propose a practical approach for selection of the number of factors in real data sets, and we illustrate the utility of these results for capturing batch and other unmodeled effects in a microarray experiment using the dependence kernel approach of Leek and Storey (2008, Proceedings of the National Academy of Sciences of the United States of America 105, 18718–18723).
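A minimal numerical sketch of the two points in the abstract – that the right singular vectors recover the latent factors when the number of features is large, and that the number of factors can be read off a scaled eigen-decomposition. The thresholding rule below is a crude stand-in, not the consistent estimator proposed in the paper, and the simulated data are hypothetical.

import numpy as np

rng = np.random.default_rng(1)
m, n, k = 5000, 20, 2                  # many features, few samples, two latent factors
F = rng.normal(size=(n, k))            # unobserved factors (one row per sample)
L = rng.normal(size=(m, k))            # feature-specific loadings
X = L @ F.T + rng.normal(size=(m, n))  # features-by-samples data matrix

# The right singular vectors of X span an estimate of the latent factor space.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
estimated_factors = Vt.T[:, :k]        # n x k

# Crude illustration of a scaled eigen-decomposition: divide the squared singular
# values by the number of features and count those clearly above the noise level
# (the noise variance is 1 by construction in this simulation).
scaled_eigs = s ** 2 / m
n_factors = int(np.sum(scaled_eigs > 1.5))
print(np.round(scaled_eigs[:5], 2), "estimated number of factors:", n_factors)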

3.
cDNA-AFLP is one of the techniques developed to study differentially expressed genes. This technique is advantageous because it does not require prior sequence knowledge and is reliable due to highly stringent PCR conditions. The traditional cDNA-AFLP method uses radioactively labelled products and is characterised by high sensitivity and resolution. Here, the use of Cy5-labelled primers to detect products on polyacrylamide gels is reported. This non-radioactive method, based on fluorescence, is shown to be faster, and the recovery of interesting bands is easier. The study of the differential gene expression of the interaction between potato and Phytophthora infestans was used for the evaluation of this method. Different gene expression profiles – such as up-regulation, down-regulation or point expression – were obtained. Moreover, this technique was shown to be highly reproducible.

4.
5.
The paper presents effective and mathematically exact procedures for variable selection that are applicable in very high-dimensional settings such as gene expression analysis. Choosing sets of variables is an important way to increase the power of the statistical conclusions and to facilitate the biological interpretation. For the construction of sets, each single variable is considered as the centre of potential sets of variables. Testing for significance is carried out by means of the Westfall-Young principle based on resampling, or by the parametric method of spherical tests. The particular requirements for statistical stability are taken into account, and every kind of overfitting is avoided. Thus, high power is attained and the familywise type I error can be controlled in spite of the large dimension. To obtain graphical representations by heat maps and curves, a specific data compression technique is applied. Gene expression data from B-cell lymphoma patients serve to demonstrate the procedures.
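A minimal sketch of the resampling branch mentioned above – the Westfall-Young principle in its single-step maxT form for a two-group comparison. The set construction around centre variables, the spherical-test alternative, and the data-compression graphics of the paper are not reproduced; the data and the number of permutations are hypothetical.

import numpy as np
from scipy import stats

def westfall_young_maxT(X, group, n_perm=2000, seed=0):
    # Single-step Westfall-Young maxT adjusted p-values for two-group t-tests,
    # controlling the familywise type I error by permutation.
    rng = np.random.default_rng(seed)
    t_obs = np.abs(stats.ttest_ind(X[group == 0], X[group == 1], axis=0).statistic)
    max_null = np.empty(n_perm)
    for b in range(n_perm):
        perm = rng.permutation(group)
        t_perm = np.abs(stats.ttest_ind(X[perm == 0], X[perm == 1], axis=0).statistic)
        max_null[b] = t_perm.max()
    # Adjusted p-value: how often the maximal permutation statistic exceeds each
    # observed statistic.
    return np.array([(max_null >= t).mean() for t in t_obs])

# Hypothetical toy data: 40 samples, 200 variables, the first 5 truly differential.
rng = np.random.default_rng(2)
group = np.repeat([0, 1], 20)
X = rng.normal(size=(40, 200))
X[group == 1, :5] += 1.5
print(np.round(westfall_young_maxT(X, group)[:10], 3))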

6.
Liya Fu, You-Gan Wang. Biometrics, 2012, 68(4): 1074–1082.
Summary Rank-based inference is widely used because of its robustness. This article provides optimal rank-based estimating functions for the analysis of clustered data with random cluster effects. Extensive simulation studies carried out to evaluate the performance of the proposed method demonstrate that it is robust to outliers and highly efficient when strong within-cluster correlations are present. The performance of the proposed method remains satisfactory even when the correlation structure is misspecified or when heteroscedasticity is present. Finally, a real dataset is analyzed for illustration.
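For orientation, a minimal sketch of a classical rank-based regression objective (Jaeckel's dispersion with Wilcoxon scores), fitted here under an independence working assumption; the article's optimal estimating functions, which weight observations to exploit the within-cluster correlation, are not implemented. The clustered toy data are hypothetical.

import numpy as np
from scipy.optimize import minimize
from scipy.stats import rankdata

def wilcoxon_dispersion(beta, X, y):
    # Jaeckel's rank-based dispersion: residuals weighted by centred Wilcoxon scores.
    resid = y - X @ beta
    n = len(resid)
    scores = np.sqrt(12.0) * (rankdata(resid) / (n + 1) - 0.5)
    return np.sum(scores * resid)

# Hypothetical clustered data: 30 clusters of size 4 with random cluster effects
# and heavy-tailed (t with 3 df) errors, so robustness matters.
rng = np.random.default_rng(3)
n_clusters, m = 30, 4
X = rng.normal(size=(n_clusters * m, 2))
cluster_effect = np.repeat(rng.normal(size=n_clusters), m)
y = X @ np.array([1.0, -0.5]) + cluster_effect + rng.standard_t(df=3, size=n_clusters * m)

fit = minimize(wilcoxon_dispersion, np.zeros(2), args=(X, y), method="Nelder-Mead")
print(np.round(fit.x, 2))   # should land near the true coefficients (1, -0.5)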

7.
8.
Fei Liu, David Dunson, Fei Zou. Biometrics, 2011, 67(2): 504–512.
Summary This article considers the problem of selecting predictors of time to an event from a high-dimensional set of candidate predictors using data from multiple studies. As an alternative to the current multistage testing approaches, we propose to model the study-to-study heterogeneity explicitly using a hierarchical model to borrow strength. Our method incorporates censored data through an accelerated failure time model. Using a carefully formulated prior specification, we develop a fast approach to predictor selection and shrinkage estimation for high-dimensional predictors. For model fitting, we develop a Monte Carlo expectation-maximization (MC-EM) algorithm to accommodate censored data. The proposed approach, which is related to the relevance vector machine (RVM), relies on maximum a posteriori estimation to rapidly obtain a sparse estimate. As with the typical RVM, there is an intrinsic thresholding property in which unimportant predictors tend to have their coefficients shrunk to zero. We compare our method with some commonly used procedures through simulation studies. We also illustrate the method using the gene expression barcode data from three breast cancer studies.
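The hierarchical prior and the MC-EM algorithm of the article are beyond a short sketch, but its basic building block – a censored accelerated failure time likelihood maximized to a maximum a posteriori estimate under a shrinkage prior – can be illustrated. The sketch below assumes a log-normal AFT model with a simple Gaussian (ridge) prior in place of the RVM-type prior; all data are hypothetical.

import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

def neg_log_posterior(params, X, log_t, event, prior_prec=1.0):
    # Log-normal AFT model with right censoring; Gaussian prior on the coefficients.
    beta, log_sigma = params[:-1], params[-1]
    sigma = np.exp(log_sigma)
    z = (log_t - X @ beta) / sigma
    loglik = np.where(event == 1,
                      norm.logpdf(z) - np.log(sigma),  # observed event times
                      norm.logsf(z))                   # right-censored observations
    log_prior = -0.5 * prior_prec * np.sum(beta ** 2)
    return -(loglik.sum() + log_prior)

# Hypothetical data: 100 subjects, 5 candidate predictors, 2 of them truly active.
rng = np.random.default_rng(6)
n, p = 100, 5
X = rng.normal(size=(n, p))
true_beta = np.array([1.0, -1.0, 0.0, 0.0, 0.0])
t = np.exp(X @ true_beta + 0.5 * rng.normal(size=n))
c = np.exp(rng.normal(loc=1.0, size=n))          # independent censoring times
log_t, event = np.log(np.minimum(t, c)), (t <= c).astype(int)

fit = minimize(neg_log_posterior, np.zeros(p + 1), args=(X, log_t, event), method="BFGS")
print(np.round(fit.x[:p], 2))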

9.
A major task in the statistical analysis of genetic data such as gene expressions and single nucleotide polymorphisms (SNPs) is to predict whether a patient has a certain disease, or from which of several known subtypes of a disease a patient suffers. A large number of discrimination methods have been proposed in the literature and applied to genetic data to tackle this task. In this paper, we give an overview of the most popular of these procedures in the analysis of genetic data. Moreover, we describe how these methods for supervised classification can be combined with variable selection approaches to reduce the number of genetic features from several thousand to as few as possible, forming a concise classification rule. Finally, we show how the resulting statistical models can be validated.
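A minimal sketch of the workflow the overview describes – a supervised classifier combined with variable selection and validated by cross-validation. The one methodological point it illustrates is that the gene filtering must be repeated inside every fold; the filter, the classifier, and k = 20 are arbitrary choices for illustration, not recommendations from the paper.

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

# Hypothetical data: 80 patients, 2000 genetic features, the first 10 informative.
rng = np.random.default_rng(4)
y = rng.integers(0, 2, size=80)
X = rng.normal(size=(80, 2000))
X[y == 1, :10] += 1.0

clf = Pipeline([
    ("select", SelectKBest(f_classif, k=20)),   # variable selection re-fitted per fold
    ("svm", LinearSVC(C=0.1, max_iter=5000)),   # linear classifier on the selected genes
])
accuracy = cross_val_score(clf, X, y, cv=5)     # validation by 5-fold cross-validation
print("cross-validated accuracy:", round(accuracy.mean(), 2))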

10.
A multiple parametric test procedure is proposed which considers tests of means of several variables. The single variables, or subsets of variables, are ordered according to a data-dependent criterion and tested in that order without alpha adjustment until the first non-significant test. The procedure requires the assumption of a multivariate normal distribution and utilizes the theory of spherical distributions. The basic version is particularly suited for variables with approximately equal variances. As a typical example, the procedure is applied to gene expression data from a commercial array.
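A minimal sketch of the succession itself – order the variables by a data-dependent criterion and test them one after another at the unadjusted level until the first non-significant result. The naive ordering used below (by standardized mean of the same data) does not in general control the error rate; the paper's criterion is constructed, via the theory of spherical distributions, so that the succession remains valid. Data and threshold are hypothetical.

import numpy as np
from scipy import stats

def sequential_tests(X, mu0=0.0, alpha=0.05):
    # Order variables by an illustrative data-dependent criterion, then test each
    # mean against mu0 in that order, stopping at the first non-significant test.
    order = np.argsort(-np.abs(X.mean(axis=0)) / X.std(axis=0, ddof=1))
    selected = []
    for j in order:
        p = stats.ttest_1samp(X[:, j], mu0).pvalue
        if p >= alpha:
            break
        selected.append((j, round(p, 4)))
    return selected

# Hypothetical data: 25 samples, 50 variables, the first three with shifted means.
rng = np.random.default_rng(5)
X = rng.normal(size=(25, 50))
X[:, :3] += 0.8
print(sequential_tests(X))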

11.
12.
13.
14.
15.
16.
Lu Chen, Li Hsu, Kathleen Malone. Biometrics, 2009, 65(4): 1105–1114.
Summary The population-based case–control study design is perhaps the most commonly used design for investigating the genetic and environmental contributions to disease risk in epidemiological studies. Ages at onset and disease status of family members are routinely and systematically collected from the participants in this design. Considering age at onset in relatives as an outcome, this article focuses on using the family history information to obtain the hazard function, i.e., the age-dependent penetrance function, of candidate genes from case–control studies. A frailty-model-based approach is proposed to accommodate the shared risk among family members that is not accounted for by observed risk factors. This approach is further extended to accommodate missing genotypes in family members and a two-phase case–control sampling design. Simulation results show that the proposed method performs well in realistic settings. Finally, a population-based two-phase case–control breast cancer study of the BRCA1 gene is used to illustrate the method.
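To make the age-dependent penetrance idea concrete, a minimal sketch assuming a shared gamma frailty with mean 1 and variance theta acting multiplicatively on a proportional-hazards model, for which the marginal survival is (1 + theta * Lambda0(t) * exp(beta * g))^(-1/theta). The baseline cumulative hazard, hazard ratio, and frailty variance below are hypothetical, and the paper's estimation from two-phase case–control family data is not shown.

import numpy as np

def penetrance(t, carrier, baseline_cumhaz, log_hr=np.log(10.0), frailty_var=1.0):
    # Marginal age-dependent penetrance under a shared gamma frailty:
    # S(t | g) = (1 + theta * Lambda0(t) * exp(beta * g))^(-1 / theta).
    cumhaz = baseline_cumhaz(t) * np.exp(log_hr * carrier)
    survival = (1.0 + frailty_var * cumhaz) ** (-1.0 / frailty_var)
    return 1.0 - survival

# Hypothetical Weibull-type baseline cumulative hazard, purely for illustration.
baseline = lambda t: 0.0001 * (np.asarray(t) / 10.0) ** 3

ages = np.array([40, 50, 60, 70, 80])
print("carriers:    ", np.round(penetrance(ages, 1, baseline), 3))
print("non-carriers:", np.round(penetrance(ages, 0, baseline), 3))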

17.
18.
19.
Summary Second-generation sequencing (sec-gen) technology can sequence millions of short fragments of DNA in parallel, making it capable of assembling complex genomes for a small fraction of the price and time of previous technologies. In fact, a recently formed international consortium, the 1000 Genomes Project, plans to fully sequence the genomes of approximately 1200 people. The prospect of comparative analysis at the sequence level of a large number of samples across multiple populations may be achieved within the next five years. These data present unprecedented challenges in statistical analysis. For instance, analysis operates on millions of short nucleotide sequences, or reads – strings of A, C, G, or T, between 30 and 100 characters long – which are the result of complex processing of noisy continuous fluorescence intensity measurements known as base-calling. The complexity of the base-calling discretization process results in reads of widely varying quality within and across sequence samples. This variation in processing quality results in infrequent but systematic errors that we have found to mislead downstream analysis of the discretized sequence read data. For instance, a central goal of the 1000 Genomes Project is to quantify across-sample variation at the single nucleotide level. At this resolution, small error rates in sequencing prove significant, especially for rare variants. Sec-gen sequencing is a relatively new technology for which potential biases and sources of obscuring variation are not yet fully understood. Therefore, modeling and quantifying the uncertainty inherent in the generation of sequence reads is of utmost importance. In this article, we present a simple model to capture uncertainty arising in the base-calling procedure of the Illumina/Solexa GA platform. Model parameters have a straightforward interpretation in terms of the chemistry of base-calling, allowing for informative and easily interpretable metrics that capture the variability in sequencing quality. Our model provides these informative estimates readily usable in quality assessment tools while significantly improving base-calling performance.
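The paper's base-calling model itself is specific to the chemistry of the Illumina/Solexa GA platform and is not reproduced here. As a purely generic illustration of the discretization step it describes – turning four-channel fluorescence intensities into base calls plus a quality metric – a minimal sketch follows; the error-probability formula and the toy intensities are hypothetical.

import numpy as np

def call_bases(intensities):
    # intensities: array of shape (n_cycles, 4), channel order A, C, G, T.
    # Call each cycle's base as the brightest channel and attach a crude
    # Phred-like quality from the share of signal not explained by that channel.
    bases = np.array(list("ACGT"))
    called = bases[intensities.argmax(axis=1)]
    total = intensities.sum(axis=1)
    p_error = np.clip(1.0 - intensities.max(axis=1) / total, 1e-4, 0.75)
    quality = np.round(-10.0 * np.log10(p_error)).astype(int)
    return "".join(called), quality

# Hypothetical intensities for a five-cycle read; later cycles are noisier,
# so their calls receive lower quality scores.
obs = np.array([
    [9.0, 0.2, 0.3, 0.1],
    [0.2, 8.5, 0.4, 0.3],
    [0.5, 0.4, 7.0, 1.0],
    [1.5, 1.2, 1.0, 4.0],
    [2.0, 1.9, 1.8, 2.2],
])
read, qual = call_bases(obs)
print(read, qual)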

20.