Found 20 similar articles (search time: 0 ms)
1.
Li Z Sillanpää MJ 《TAG. Theoretical and applied genetics. Theoretische und angewandte Genetik》2012,125(3):419-435
Quantitative trait loci (QTL)/association mapping aims at finding genomic loci associated with phenotypes, whereas genomic selection focuses on breeding value prediction based on genomic data. Variable selection is key to both of these tasks, as it makes it possible to (1) detect clear mapping signals of QTL activity and (2) predict genome-enhanced breeding values accurately. In this paper, we provide an overview of a statistical method called the least absolute shrinkage and selection operator (LASSO) and two of its generalizations, the elastic net and the adaptive LASSO, in the contexts of QTL mapping and genomic breeding value prediction in plants (or animals). We also briefly summarize the Bayesian interpretation of LASSO and the hierarchical Bayesian models it inspired. We illustrate the implementation and examine the performance of the methods using three public data sets: (1) North American barley data with 127 individuals and 145 markers, (2) simulated QTLMAS XII data with 5,865 individuals and 6,000 markers, for both QTL mapping and genomic selection, and (3) wheat data with 599 individuals and 1,279 markers, for genomic selection only.
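As a rough illustration of the LASSO summarized in the abstract above, the following sketch fits an L1-penalized linear model by cyclic coordinate descent with soft-thresholding. It is not the authors' implementation: the function names, the toy marker matrix, and the penalty value `lam` are hypothetical, and the marker columns are assumed to be centred so that no intercept is needed.

```python
def soft_threshold(z, gamma):
    """Shrink z toward zero by gamma; values in [-gamma, gamma] become 0."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimise (1/2n) * ||y - X b||^2 + lam * ||b||_1.

    X: list of n rows, each with p centred marker codes (assumption).
    y: list of n centred phenotypes.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    # Mean squared value of each column, used to scale the update.
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # constant column carries no information
            # Covariance of marker j with the residual that ignores marker j.
            rho = 0.0
            for i in range(n):
                fit_without_j = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - fit_without_j)
            rho /= n
            beta[j] = soft_threshold(rho, lam) / col_sq[j]
    return beta


# Hypothetical toy data: three orthogonal markers, phenotype driven by marker 0.
X = [[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]]
y = [2.1, 1.9, -1.9, -2.1]
beta = lasso_coordinate_descent(X, y, lam=0.5)
print(beta)
```

With these toy values, marker 0 keeps a shrunken non-zero effect while the effects of markers 1 and 2 are set exactly to zero, which is the variable-selection behaviour the abstract refers to. The elastic net and adaptive LASSO change only the penalty term and are not shown here.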
2.
We introduce and evaluate data analysis methods to interpret simultaneous measurement of multiple genomic features made on the same biological samples. Our tools use gene sets to provide an interpretable common scale for diverse genomic information. We show we can detect genetic effects, although they may act through different mechanisms in different samples, and show we can discover and validate important disease-related gene sets that would not be discovered by analyzing each data type individually.
3.
MOTIVATION: Genome sequencing projects and high-throughput technologies such as DNA and protein arrays have resulted in a very large amount of information-rich data. Microarray experimental data are a valuable but limited source for inferring gene regulation mechanisms on a genomic scale. Additional information such as promoter sequences of genes/DNA binding motifs, gene ontologies, and location data, when combined with gene expression analysis, can increase the statistical significance of the findings. This paper introduces a machine learning approach to information fusion for combining heterogeneous genomic data. The algorithm uses an unsupervised joint learning mechanism that identifies clusters of genes using the combined data. RESULTS: The correlation between gene expression time-series patterns obtained from different experimental conditions and the presence of several distinct and repeated motifs in their upstream sequences is examined here using publicly available yeast cell-cycle data. The results show that the combined learning approach taken here identifies correlated genes effectively. The algorithm provides an automated clustering method, but allows the user to specify a priori the influence of each data type on the final clustering using probabilities. AVAILABILITY: Software code is available by request from the first author. CONTACT: jkasturi@cse.psu.edu.
4.
Shen L Tan EC 《IEEE/ACM transactions on computational biology and bioinformatics / IEEE, ACM》2005,2(2):166-175
The use of penalized logistic regression for cancer classification using microarray expression data is presented. Two dimension reduction methods are each combined with penalized logistic regression so that both classification accuracy and computational speed are enhanced. Two other machine-learning methods, support vector machines and least-squares regression, have been chosen for comparison. It is shown that our methods achieve equal or better results. They also have the advantage that the output probability can be given explicitly and the regression coefficients are easier to interpret. Several other aspects pertinent to the application of our methods to cancer classification, such as the selection of penalty parameters and components, are also discussed.
5.
Myers CL Robson D Wible A Hibbs MA Chiriac C Theesfeld CL Dolinski K Troyanskaya OG 《Genome biology》2005,6(13):R114
We have developed a general probabilistic system for query-based discovery of pathway-specific networks through integration of diverse genome-wide data. This framework was validated by accurately recovering known networks for 31 biological processes in Saccharomyces cerevisiae and experimentally verifying predictions for the process of chromosomal segregation. Our system, bioPIXIE, a public, comprehensive system for integration, analysis, and visualization of biological network predictions for S. cerevisiae, is freely accessible over the worldwide web.
6.
One important problem in genomic research is to identify genomic features such as gene expression data or DNA single nucleotide polymorphisms (SNPs) that are related to clinical phenotypes. Often these genomic data can be naturally divided into biologically meaningful groups such as genes belonging to the same pathways or SNPs within genes. In this paper, we propose group additive regression models and a group gradient descent boosting procedure for identifying groups of genomic features that are related to clinical phenotypes. Our simulation results show that by dividing the variables into appropriate groups, we can obtain better identification of the group features that are related to the phenotypes. In addition, the prediction mean square errors are smaller than those of the component-wise boosting procedure. We demonstrate the application of the methods to pathway-based analysis of microarray gene expression data of breast cancer. Results from analysis of a breast cancer microarray gene expression data set indicate that the pathways of metalloendopeptidases (MMPs) and MMP inhibitors, as well as cell proliferation, cell growth, and maintenance, are important to breast cancer-specific survival.
7.
High-throughput genomic data provide an opportunity for identifying pathways and genes that are related to various clinical phenotypes. Besides these genomic data, another valuable source of data is the biological knowledge about genes and pathways that might be related to the phenotypes of many complex diseases. Databases of such knowledge are often called the metadata. In microarray data analysis, such metadata are currently explored in post hoc ways by gene set enrichment analysis but have hardly been utilized in the modeling step. We propose to develop and evaluate a pathway-based gradient descent boosting procedure for nonparametric pathways-based regression (NPR) analysis to efficiently integrate genomic data and metadata. Such NPR models consider multiple pathways simultaneously, allow complex interactions among genes within the pathways, and can be applied to identify pathways and genes that are related to variations of the phenotypes. These methods also provide an alternative way of mitigating the problem of a large number of potential interactions by limiting analysis to biologically plausible interactions between genes in related pathways. Our simulation studies indicate that the proposed boosting procedure can indeed identify relevant pathways. Application to a gene expression data set on breast cancer distant metastasis identified that Wnt, apoptosis, and cell cycle-regulated pathways are more likely related to the risk of distant metastasis among lymph-node-negative breast cancer patients. Results from analysis of two other breast cancer gene expression data sets indicate that the pathways of metalloendopeptidases (MMPs) and MMP inhibitors, as well as cell proliferation, cell growth, and maintenance, are important to breast cancer relapse and survival. We also observed that by incorporating the pathway information, we achieved better prediction for cancer recurrence.
8.
Classification of patient samples is an important aspect of cancer diagnosis and treatment. The support vector machine (SVM) has been successfully applied to microarray cancer diagnosis problems. However, one weakness of the SVM is that given a tumor sample, it only predicts a cancer class label but does not provide any estimate of the underlying probability. We propose penalized logistic regression (PLR) as an alternative to the SVM for the microarray cancer diagnosis problem. We show that when using the same set of genes, PLR and the SVM perform similarly in cancer classification, but PLR has the advantage of additionally providing an estimate of the underlying probability. Often a primary goal in microarray cancer diagnosis is to identify the genes responsible for the classification, rather than class prediction. We consider two gene selection methods in this paper, univariate ranking (UR) and recursive feature elimination (RFE). Empirical results indicate that PLR combined with RFE tends to select fewer genes than other methods and also performs well in both cross-validation and test samples. A fast algorithm for solving PLR is also described.
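To make the point about probability estimates concrete, here is a minimal sketch of a penalized logistic regression that returns a class probability rather than only a label. It assumes an L2 (ridge) penalty, which the abstract does not specify, and uses plain gradient descent rather than the fast algorithm the abstract mentions. The function names, the toy expression values, and the settings `lam` and `lr` are all hypothetical.

```python
import math


def sigmoid(t):
    """Logistic function mapping a linear score to a probability."""
    return 1.0 / (1.0 + math.exp(-t))


def fit_plr(X, y, lam=0.1, lr=0.5, n_iter=2000):
    """Fit logistic regression with an L2 penalty (assumed) on the weights.

    Minimises mean log-loss + (lam / 2) * ||w||^2 by gradient descent.
    The intercept b is left unpenalized.
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    b = 0.0
    for _ in range(n_iter):
        grad_w = [lam * wj for wj in w]  # gradient of the ridge penalty
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err / n
            for j in range(p):
                grad_w[j] += err * xi[j] / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict_proba(w, b, x):
    """Estimated probability that sample x belongs to class 1."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))


# Hypothetical toy data: two "genes", four samples, class 1 = tumor.
X = [[2.0, 0.1], [1.5, -0.2], [-1.8, 0.0], [-2.2, 0.3]]
y = [1, 1, 0, 0]
w, b = fit_plr(X, y)
p_tumor_like = predict_proba(w, b, [2.0, 0.0])
p_normal_like = predict_proba(w, b, [-2.0, 0.0])
print(p_tumor_like, p_normal_like)
```

Unlike a bare class label, the two printed values are probabilities in (0, 1), so a borderline sample can be distinguished from a confidently classified one. Gene selection by univariate ranking or recursive feature elimination would wrap around a fit like this and is not shown.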
9.
10.
11.
Optimized multilayer perceptrons for molecular classification and diagnosis using genomic data (cited by: 1; self-citations: 0; citations by others: 1)
Wang Z Wang Y Xuan J Dong Y Bakay M Feng Y Clarke R Hoffman EP 《Bioinformatics (Oxford, England)》2006,22(6):755-761
MOTIVATION: Multilayer perceptrons (MLP) represent one of the widely used and effective machine learning methods currently applied to diagnostic classification based on high-dimensional genomic data. Since the dimensionalities of the existing genomic data often exceed the available sample sizes by orders of magnitude, the MLP performance may degrade owing to the curse of dimensionality and over-fitting, and may not provide acceptable prediction accuracy. RESULTS: Based on Fisher linear discriminant analysis, we designed and implemented an MLP optimization scheme for a two-layer MLP that effectively optimizes the initialization of MLP parameters and MLP architecture. The optimized MLP consistently demonstrated its ability to ease the curse of dimensionality in large microarray datasets. In comparison with a conventional MLP using random initialization, we obtained significant improvements in major performance measures including Bayes classification accuracy, convergence properties and area under the receiver operating characteristic curve (A(z)). SUPPLEMENTARY INFORMATION: The Supplementary information is available at http://www.cbil.ece.vt.edu/publications.htm
12.
We consider the problem of estimating the intensity functions for a continuous time 'illness-death' model with intermittently observed data. In such a case, it may happen that a subject becomes diseased between two visits and dies without being observed. Consequently, there is an uncertainty about the precise number of transitions. Estimating the intensity of transition from health to illness by survival analysis (treating death as censoring) is biased downwards. Furthermore, the dates of transitions between states are not known exactly. We propose to estimate the intensity functions by maximizing a penalized likelihood. The method yields smooth estimates without parametric assumptions. This is illustrated using data from a large cohort study on cerebral ageing. The age-specific incidence of dementia is estimated using an illness-death approach and a survival approach.
13.
Suzuki Y 《Genes & genetic systems》2010,85(6):359-376
In the study of molecular and phenotypic evolution, understanding the relative importance of random genetic drift and positive selection as the mechanisms for driving divergences between populations and maintaining polymorphisms within populations has been a central issue. A variety of statistical methods have been developed for detecting natural selection operating at the amino acid and nucleotide sequence levels. These methods may be largely classified into those aimed at detecting recurrent and/or recent/ongoing natural selection by utilizing the divergence and/or polymorphism data. Using these methods, pervasive positive selection has been identified for protein-coding and non-coding sequences in the genomic analysis of some organisms. However, computer simulations and real-data analyses have shown that many of these methods produce excessive false positives and are sensitive to various disturbing factors. Importantly, some of these methods have been invalidated experimentally. These facts indicate that many of the statistical methods for detecting natural selection are unreliable. In addition, the signals that have been regarded as evidence for fixations of advantageous mutations due to positive selection may also be interpreted as evidence for fixations of deleterious mutations due to random genetic drift. Genomic diversity data are rapidly accumulating in various organisms, and detection of natural selection may play a critical role in clarifying the relative roles of random genetic drift and positive selection in molecular and phenotypic evolution. It is therefore important to develop reliable statistical methods for inferring natural selection that are unbiased as well as robust against various disturbing factors.
14.
15.
Mathematical models are an essential tool in systems biology, linking the behaviour of a system to the interactions between its components. Parameters in empirical mathematical models must be determined using experimental data, a process called regression. Because experimental data are noisy and incomplete, diagnostics that test the structural identifiability and validity of models and the significance and determinability of their parameters are needed to ensure that the proposed models are supported by the available data.
16.
Reproducing kernel Hilbert spaces regression methods for genomic-assisted prediction of quantitative traits
Reproducing kernel Hilbert spaces regression procedures for prediction of total genetic value for quantitative traits, which make use of phenotypic and genomic data simultaneously, are discussed from a theoretical perspective. It is argued that a nonparametric treatment may be needed for capturing the multiple and complex interactions potentially arising in whole-genome models, i.e., those based on thousands of single-nucleotide polymorphism (SNP) markers. After a review of reproducing kernel Hilbert spaces regression, it is shown that the statistical specification admits a standard mixed-effects linear model representation, with smoothing parameters treated as variance components. Models for capturing different forms of interaction, e.g., chromosome-specific, are presented. Implementations can be carried out using software for likelihood-based or Bayesian inference.
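One common way to picture the reproducing kernel Hilbert spaces regression described above is as kernel ridge regression on SNP genotypes: solve (K + lam * I) alpha = y and predict with a weighted sum of kernel evaluations. The sketch below assumes a Gaussian kernel and a fixed smoothing parameter `lam`; in the mixed-model view the abstract mentions, `lam` would instead be estimated as a ratio of variance components. The function names, the genotype coding, the bandwidth, and the toy data are hypothetical and are not taken from the paper.

```python
import math


def gaussian_kernel(a, b, bandwidth):
    """Similarity between two genotype vectors (Gaussian kernel, assumed)."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= factor * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def fit_rkhs(X, y, lam, bandwidth):
    """Return alpha solving (K + lam * I) alpha = y."""
    n = len(X)
    A = [[gaussian_kernel(X[i], X[j], bandwidth) + (lam if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    return solve_linear(A, y)


def predict(X_train, alpha, x_new, bandwidth):
    """Predicted genetic value: weighted sum of kernel similarities."""
    return sum(a * gaussian_kernel(xi, x_new, bandwidth)
               for a, xi in zip(alpha, X_train))


# Hypothetical toy data: 4 individuals, 3 SNPs coded 0/1/2, centred phenotypes.
X = [[0, 1, 2], [2, 1, 0], [1, 1, 1], [0, 0, 2]]
y = [1.0, -1.0, 0.0, 1.5]
alpha = fit_rkhs(X, y, lam=1e-8, bandwidth=1.0)
fitted = [predict(X, alpha, xi, 1.0) for xi in X]
print(fitted)
```

With `lam` this close to zero the fit nearly interpolates the training phenotypes; larger values smooth the predictions, which is the role the smoothing parameter plays in the abstract.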
17.
MOTIVATION: Genomic DNA copy number alterations are characteristic of many human diseases including cancer. Various techniques and platforms have been proposed to allow researchers to partition the whole genome into segments where copy numbers change between contiguous segments, and subsequently to quantify DNA copy number alterations. In this paper, we incorporate the spatial dependence of DNA copy number data into a regression model and formalize the detection of DNA copy number alterations as a penalized least squares regression problem. In addition, we use a stationary bootstrap approach to estimate the statistical significance and false discovery rate. RESULTS: The proposed method is studied by simulations and illustrated by an application to an extensively analyzed dataset in the literature. The results show that the proposed method can correctly detect the numbers and locations of the true breakpoints while appropriately controlling the false positives. AVAILABILITY: http://bioinformatics.med.yale.edu/DNACopyNumber CONTACT: hongyu.zhao@yale.edu SUPPLEMENTARY INFORMATION: http://bioinformatics.med.yale.edu/DNACopyNumber.
18.
SUMMARY: AnnBuilder is an R package for assembling genomic annotation data. The system currently provides parsers to process annotation data from LocusLink, Gene Ontology Consortium, and Human Gene Project and can be extended to new data sources via user defined parsers. AnnBuilder differs from other existing systems in that it provides users with unlimited ability to assemble data from user selected sources. The products of AnnBuilder are files in XML format that can be easily used by different systems. AVAILABILITY: (http://www.bioconductor.org). Open source.
19.
Hojin Moon Hongshik Ahn Ralph L Kodell Chien-Ju Lin Songjoon Baek James J Chen 《Genome biology》2007,7(12):R121
Personalized medicine is defined by the use of genomic signatures of patients to assign effective therapies. We present Classification by Ensembles from Random Partitions (CERP) for class prediction and apply CERP to genomic data on leukemia patients and to genomic data with several clinical variables on breast cancer patients. CERP performs consistently well compared to the other classification algorithms. The predictive accuracy can be improved by adding some relevant clinical/histopathological measurements to the genomic data.