Similar Articles

20 similar articles found (search time: 0 ms)
1.

Background

GAW20 working group 5 brought together researchers who contributed 7 papers with the aim of evaluating methods to detect genetic by epigenetic interactions. GAW20 distributed real data from the Genetics of Lipid Lowering Drugs and Diet Network (GOLDN) study, including single-nucleotide polymorphism (SNP) markers, methylation (cytosine-phosphate-guanine [CpG]) markers, and phenotype information on up to 995 individuals. In addition, a simulated data set based on the real data was provided.

Results

The 7 contributed papers analyzed these data sets with a number of different statistical methods, including generalized linear mixed models, mediation analysis, machine learning, the W-test, and sparsity-inducing regularized regression. These methods generally appeared to perform well. Several papers confirmed a number of causative SNPs on chromosome 11, either in the large number of simulation sets or in the real data. Findings were also reported for different SNPs, CpG sites, and SNP–CpG site interaction pairs.

Conclusions

In the simulation (200 replications), power appeared generally good for large interaction effects, but smaller effects will require larger studies or consortium collaboration to achieve sufficient power.

2.

Background

This paper summarizes the contributions from the Genome-wide Association Study group (GWAS group) of GAW20. The contributions focused on topics such as association tests, phenotype imputation, and application of empirical kinships, and their goals were varied. Each contribution applied its methods to the real data set or to a simulated data set based on the Genetics of Lipid Lowering Drugs and Diet Network (GOLDN) study. Different outcomes and covariates were considered, and quality control procedures varied across contributions.

Results

The consideration of heritability and family structure played a major role in some contributions. The inclusion of family information and data-adaptive weights was found to improve power in genome-wide association studies, and gene-level approaches were shown to be more powerful than single-marker analysis. Other contributions compared pedigree-based and empirical kinship matrices and found similar results in heritability estimation, association mapping, and genomic prediction. A new approach for linkage mapping of triglyceride levels was able to identify a novel linkage signal.
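The pedigree-based kinship matrices compared in these contributions can be built with the classic recursion on a pedigree. A minimal stdlib sketch with a hypothetical two-generation family; ids are assumed to be assigned so that parents precede their children:

```python
def kinship(a, b, ped, memo=None):
    """Recursive kinship coefficient phi(a, b).

    ped maps each id to (father, mother); founders have (None, None).
    Ids are assumed ordered so that parents have smaller ids than children.
    """
    if memo is None:
        memo = {}
    key = (min(a, b), max(a, b))
    if key in memo:
        return memo[key]
    if a == b:
        f, m = ped[a]
        # Self-kinship: 1/2 plus half the kinship of the two parents.
        phi = 0.5 if f is None else 0.5 + 0.5 * kinship(f, m, ped, memo)
    else:
        young, other = max(a, b), min(a, b)   # larger id assumed younger
        f, m = ped[young]
        # Two distinct founders are treated as unrelated.
        phi = 0.0 if f is None else 0.5 * (kinship(f, other, ped, memo)
                                           + kinship(m, other, ped, memo))
    memo[key] = phi
    return phi

# Hypothetical family: two founders (1, 2) with two full-sib children (3, 4).
ped = {1: (None, None), 2: (None, None), 3: (1, 2), 4: (1, 2)}
```

Parent-offspring and full-sib pairs both give phi = 1/4 here; twice this matrix is the numerator relationship matrix commonly used in mixed models.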

Conclusions

This summary paper reports on promising statistical approaches and findings of the members of the GWAS group applied to real and simulated data, encompassing the current topics of epigenetics and pharmacogenomics.

3.
Wang  Xuexia  Boekstegers  Felix  Brinster  Regina 《BMC genetics》2018,19(1):109-117

Background

X chromosome inactivation (XCI) is an important gene regulation mechanism in females that equalizes the expression levels of the X chromosome between the two sexes. Generally, one of the two X chromosomes in a female is randomly chosen to be inactivated. Nonrandom XCI (XCI skewing) is also observed in females and has been reported to play an important role in many X-linked diseases. However, no statistical measure of the degree of XCI skewing based on family data is available in population genetics.

Results

In this article, we propose a statistical approach to measure the degree of XCI skewing based on family trios, represented by a ratio of two genotypic relative risks in females. The point estimate of the ratio is obtained from the maximum likelihood estimates of the two genotypic relative risks. When parental genotypes are missing in some family trios, the expectation-conditional-maximization algorithm is adopted to obtain the corresponding maximum likelihood estimates. Further, the confidence interval of the ratio is derived based on the likelihood ratio test. Simulation results show that the likelihood-based confidence interval has an accurate coverage probability under the situations considered. We also apply the proposed method to rheumatoid arthritis data from the USA and find that the locus rs2238907 may undergo XCI skewing against the at-risk allele, although this remains to be confirmed by molecular genetics.

Conclusions

The proposed statistical measure for the skewness of XCI is applicable to complete family trio data or to family trio data with some paternal genotypes missing. The likelihood-based confidence interval has an accurate coverage probability under the situations considered. Therefore, our proposed statistical measure is generally recommended in practice for discovering potential loci that undergo XCI skewing.

4.
Wei LY  Huang CL  Chen CH 《BMC genetics》2005,6(Z1):S133
Rough set theory and decision trees are data mining methods used for dealing with vagueness and uncertainty. They have been utilized to unearth hidden patterns in complicated datasets collected for industrial processes. The Genetic Analysis Workshop 14 simulated data were generated using a system that implemented multiple correlations among four consequential layers of genetic data (disease-related loci, endophenotypes, phenotypes, and one disease trait). When information from one layer was blocked and uncertainty was created in the correlations among these layers, the correlation between the first and last layers (susceptibility genes and the disease trait in this case) was not easily detected directly. In this study, we proposed a two-stage process that applied rough set theory and decision trees to identify susceptibility genes for the disease trait. During the first stage, decision trees were built from the phenotypes of subjects and their parents to predict trait values. Phenotypes retained in the decision trees were then advanced to the second stage, where rough set theory was applied to discover the minimal subsets of genes associated with the disease trait. For comparison, decision trees were also constructed to map susceptibility genes during the second stage. Our results showed that the decision trees of the first stage had accuracy rates of about 99% in predicting the disease trait. However, both the decision trees and rough set theory failed to identify the true disease-related loci.
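The rough-set step of the second stage rests on finding minimal attribute subsets that still discriminate the decision classes. A minimal sketch of that idea with a hypothetical genotype decision table and a greedy reduct search (the study's actual reduct algorithm is not specified here):

```python
def consistent(rows, attrs):
    """True if no two rows agree on attrs but differ in decision class."""
    seen = {}
    for values, cls in rows:
        key = tuple(values[a] for a in attrs)
        if seen.setdefault(key, cls) != cls:
            return False
    return True

def greedy_reduct(rows, attrs):
    """Greedily drop attributes while the remainder stays consistent."""
    reduct = list(attrs)
    for a in attrs:
        trial = [x for x in reduct if x != a]
        if trial and consistent(rows, trial):
            reduct = trial
    return reduct

# Hypothetical table: genotypes at three loci -> disease class.
rows = [
    ((0, 1, 0), 'affected'),
    ((0, 0, 1), 'unaffected'),
    ((1, 1, 0), 'affected'),
    ((1, 0, 0), 'unaffected'),
]
```

Here locus 1 alone separates the classes, so the greedy search discards loci 0 and 2.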

5.
A pseudo-random generator is an algorithm that generates a sequence of objects from a seed; because the output is fully determined by the seed, it is not truly random. Pseudo-random generators are widely used in many applications, such as cryptography and simulation. In this article, we examine current popular machine learning algorithms combined with various on-line algorithms on pseudo-random generated data, in order to find out which machine learning approach is most suitable for predicting this kind of data with on-line algorithms. To further improve prediction performance, we propose a novel sample-weighted algorithm that takes the generalization error in each iteration into account. We perform an intensive evaluation on real Baccarat data generated by casino machines and on random numbers generated by a popular Java program, two typical examples of pseudo-random generated data. The experimental results show that support vector machines and k-nearest neighbors perform better than the other methods, both with and without the sample-weighted algorithm, on the evaluation data set.
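A rough illustration of a sample-weighted nearest-neighbor predictor of the kind evaluated above; the weighting rule below is an assumed multiplicative down-weighting after errors, not the paper's exact scheme:

```python
def _neighbors(train, x, k):
    """Indices of the k training samples nearest to x (squared Euclidean)."""
    order = sorted(range(len(train)),
                   key=lambda i: sum((a - b) ** 2
                                     for a, b in zip(train[i][0], x)))
    return order[:k]

def weighted_knn_predict(train, weights, x, k=3):
    """Predict a label by weighted vote among the k nearest samples."""
    vote = {}
    for i in _neighbors(train, x, k):
        label = train[i][1]
        vote[label] = vote.get(label, 0.0) + weights[i]
    return max(vote, key=vote.get)

def update_weights(train, weights, x, y, k=3, decay=0.5):
    """On-line step: after a wrong prediction, down-weight the neighbours
    carrying a label other than the true one."""
    if weighted_knn_predict(train, weights, x, k) != y:
        for i in _neighbors(train, x, k):
            if train[i][1] != y:
                weights[i] *= decay
    return weights

# Hypothetical training samples (feature tuple, label) and unit weights.
train = [((0.0,), 0), ((1.0,), 0), ((10.0,), 1)]
weights = [1.0, 1.0, 1.0]
```

Each observed outcome triggers one `update_weights` call, so misleading samples gradually lose influence on later votes.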

6.
Tian  Qi  Zou  Jianxiao  Tang  Jianxiong  Fang  Yuan  Yu  Zhongli  Fan  Shicai 《BMC genomics》2019,20(2):1-10
Background

Determination of genome-wide DNA methylation is significant for both basic research and drug development. As a key epigenetic modification, this biochemical process can modulate gene expression to influence cell differentiation, which can possibly lead to cancer. Owing to the intricate biochemical mechanism of DNA methylation, obtaining a precise prediction is a considerable challenge. Existing approaches have yielded good predictions, but these methods either need to combine many features and prerequisites or deal only with hypermethylation and hypomethylation.

Results

In this paper, we propose a deep learning method for prediction of genome-wide DNA methylation, in which Methylation Regression is implemented by Convolutional Neural Networks (MRCNN). By minimizing a continuous loss function, experiments show that our model is convergent and more precise than the state-of-the-art method (DeepCpG). MRCNN also achieves the discovery of de novo motifs through analysis of features from the training process.

Conclusions

Genome-wide DNA methylation could be evaluated based on the corresponding local DNA sequences of target CpG loci. With the autonomous learning pattern of deep learning, MRCNN enables accurate predictions of genome-wide DNA methylation status without predefined features and discovers some de novo methylation-related motifs that match known motifs by extracting sequence patterns.
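The core operation such a CNN applies to local DNA sequence can be illustrated with a plain 1-D convolution over one-hot-encoded bases; the 'CG'-detecting filter below is illustrative, not a learned MRCNN weight:

```python
ONE_HOT = {'A': (1, 0, 0, 0), 'C': (0, 1, 0, 0),
           'G': (0, 0, 1, 0), 'T': (0, 0, 0, 1)}

def conv1d(seq, filt):
    """Valid 1-D convolution of a one-hot DNA sequence with a filter.

    filt is a list of 4-tuples, one per filter position; the output at
    each offset is the sum of elementwise products over the window.
    """
    x = [ONE_HOT[base] for base in seq]
    width = len(filt)
    out = []
    for pos in range(len(x) - width + 1):
        score = sum(x[pos + j][c] * filt[j][c]
                    for j in range(width) for c in range(4))
        out.append(score)
    return out

# Illustrative filter that responds maximally to the dinucleotide 'CG'.
cg_filter = [(0, 1, 0, 0), (0, 0, 1, 0)]
```

A learned model stacks many such filters, so high activations mark sequence motifs; inspecting the filters is how motif discovery from training features proceeds.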

  相似文献   

7.

Background

Human genome sequencing has enabled the association of phenotypes with genetic loci, but our ability to effectively translate these data to the clinic has not kept pace. Over the past 60 years, pharmaceutical companies have successfully demonstrated the safety and efficacy of over 1,200 novel therapeutic drugs via costly clinical studies. While this process must continue, better use can be made of the existing valuable data. In silico tools such as candidate gene prediction systems allow rapid identification of disease genes by identifying the most probable candidate genes linked to genetic markers of the disease or phenotype under investigation. Integration of drug-target data with candidate gene prediction systems can identify novel phenotypes that may benefit from current therapeutics. Such a drug repositioning tool can save valuable time and money spent on preclinical studies and phase I clinical trials.

Methods

We previously used Gentrepid (http://www.gentrepid.org) as a platform to predict 1,497 candidate genes for the seven complex diseases considered in the Wellcome Trust Case-Control Consortium genome-wide association study; namely Type 2 Diabetes, Bipolar Disorder, Crohn's Disease, Hypertension, Type 1 Diabetes, Coronary Artery Disease and Rheumatoid Arthritis. Here, we adopted a simple approach to integrate drug data from three publicly available drug databases: the Therapeutic Target Database, the Pharmacogenomics Knowledgebase and DrugBank; with candidate gene predictions from Gentrepid at the systems level.

Results

Using the publicly available drug databases as sources of drug-target association data, we identified a total of 428 candidate genes as novel therapeutic targets for the seven phenotypes of interest, and 2,130 drugs feasible for repositioning against the predicted novel targets.
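The integration step amounts to intersecting predicted candidate genes with drug–target records. A minimal sketch with hypothetical inputs (the actual pipeline draws targets from the Therapeutic Target Database, PharmGKB, and DrugBank):

```python
def repositioning_candidates(candidate_genes, drug_targets):
    """Map each phenotype's predicted candidate genes to existing drugs.

    candidate_genes: phenotype -> set of predicted genes
    drug_targets:    drug -> set of target genes
    Returns phenotype -> {gene: [drugs targeting that gene]}.
    """
    hits = {}
    for phenotype, genes in candidate_genes.items():
        gene_hits = {}
        for drug, targets in drug_targets.items():
            for gene in genes & targets:   # set intersection
                gene_hits.setdefault(gene, []).append(drug)
        if gene_hits:
            hits[phenotype] = gene_hits
    return hits

# Hypothetical candidate-gene predictions and drug-target records.
candidate_genes = {'T2D': {'PPARG', 'KCNJ11'}, 'RA': {'TNF'}}
drug_targets = {'rosiglitazone': {'PPARG'}, 'etanercept': {'TNF'}}
```

Every (phenotype, gene, drug) triple in the output is a repositioning hypothesis to be prioritized for follow-up, not an approved indication.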

Conclusions

By integrating genetic, bioinformatic and drug data, we have demonstrated that currently available drugs may be repositioned as novel therapeutics for the seven diseases studied here, quickly taking advantage of prior work in pharmaceutics to translate ground-breaking results in genetics to clinical treatments.

11.

Background

The rise in popularity and accessibility of DNA methylation data to evaluate epigenetic associations with disease has led to numerous methodological questions. As part of GAW20, our working group of 8 research groups focused on gene searching methods.

Results

Although the methods were varied, we identified 3 main themes within our group. First, many groups tackled the question of how best to use pedigree information in downstream analyses, finding that (a) the use of kinship matrices is common practice, (b) ascertainment corrections may be necessary, and (c) pedigree information may be useful for identifying parent-of-origin effects. Second, many groups also considered multimarker versus single-marker tests. Multimarker tests had modestly improved power versus single-marker methods on simulated data, and on real data identified additional associations that were not identified with single-marker methods, including identification of a gene with a strong biological interpretation. Finally, some of the groups explored methods to combine single-nucleotide polymorphism (SNP) and DNA methylation into a single association analysis.
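One common form of the multimarker tests discussed above is a burden-style collapse: minor-allele counts are summed across the markers in a gene and the sum is tested against the phenotype. A stdlib sketch with hypothetical data:

```python
def burden_scores(genotypes):
    """Collapse per-marker genotypes (0/1/2 minor-allele counts) per subject."""
    return [sum(g) for g in genotypes]

def pearson_r(x, y):
    """Pearson correlation, computed directly for the illustration."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical gene with 3 markers, 6 subjects, and a quantitative trait.
genotypes = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0), (2, 1, 1), (2, 2, 1)]
trait = [0.1, 0.3, 0.2, 0.6, 0.9, 1.1]
```

A single test on the collapsed score replaces many correlated single-marker tests, which is where the power gain over single-marker analysis comes from when effects point in the same direction.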

Conclusions

A causal inference method showed promise at discovering new mechanisms of SNP activity; gene-based methods of summarizing SNP and DNA methylation data also showed promise. Even though numerous questions still remain in the analysis of DNA methylation data, our discussions at GAW20 suggest some emerging best practices.

12.
Siegmund KD 《Human genetics》2011,129(6):585-595
Following the rapid development and adoption of DNA methylation microarray assays, we are now experiencing a growth in the number of statistical tools to analyze the resulting large-scale data sets. As is the case for other microarray applications, biases caused by technical issues are of concern. Some of these issues are old (e.g., two-color dye bias and probe- and array-specific effects), while others are new (e.g., fragment length bias and bisulfite conversion efficiency). Here, I highlight characteristics of DNA methylation that suggest standard statistical tools developed for other data types may not be directly suitable. I then describe the microarray technologies most commonly in use, along with the methods used for preprocessing and obtaining a summary measure. I finish with a section describing downstream analyses of the data, focusing on methods that model percentage DNA methylation as the outcome, and on methods for integrating DNA methylation with gene expression or genotype data.
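The summary measures referred to here are usually the beta-value and the M-value, computed from methylated (M) and unmethylated (U) probe intensities. A sketch in which the offsets follow common convention (100 for beta-values on Illumina arrays, 1 for M-values) and are stated as assumptions:

```python
import math

def beta_value(meth, unmeth, alpha=100):
    """Percentage-scale methylation summary, bounded in [0, 1).

    The alpha offset (conventionally 100) stabilizes low-intensity probes.
    """
    return meth / (meth + unmeth + alpha)

def m_value(meth, unmeth, alpha=1):
    """Log-ratio methylation summary: unbounded and closer to homoscedastic."""
    return math.log2((meth + alpha) / (unmeth + alpha))
```

Beta-values are the natural outcome scale for "percentage DNA methylation" models; M-values are often preferred for the statistical tests themselves because their variance depends less on the methylation level.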

13.
SUMMARY: We describe a tool called ACE-it (Array CGH Expression integration tool). ACE-it links the chromosomal position of gene dosage measured by array CGH to the genes measured by the expression array, and uses this link to statistically test whether gene dosage affects RNA expression. AVAILABILITY: ACE-it is freely available at http://ibivu.cs.vu.nl/programs/acewww/.
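The statistical link ACE-it tests can be illustrated by comparing expression between dosage groups; a Welch t-statistic sketch with hypothetical log-expression values (ACE-it's actual test may differ):

```python
from statistics import mean, variance

def welch_t(x, y):
    """Welch t statistic comparing expression in two gene-dosage groups."""
    vx, vy = variance(x) / len(x), variance(y) / len(y)
    return (mean(x) - mean(y)) / (vx + vy) ** 0.5

# Hypothetical log-expression values for a gene in samples with
# normal dosage versus a copy-number gain at its locus.
normal = [5.0, 5.2, 4.8, 5.1]
gained = [6.1, 6.3, 5.9, 6.0]
```

A large positive statistic for the gained group is evidence that the dosage change propagates to RNA expression; a p-value would be read from a t distribution with Welch-Satterthwaite degrees of freedom.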

14.
Recent advances in sequencing and genotyping technologies are contributing to a data revolution in genome-wide association studies that is characterized by the challenging large p, small n problem in statistics. That is, given these advances, many such studies now consider evaluating an extremely large number of genetic markers (p) genotyped on a small number of subjects (n). Given the dimension of the data, a joint analysis of the markers is often fraught with many challenges, while a marginal analysis is not sufficient. To overcome these obstacles, herein, we propose a Bayesian two-phase methodology that can be used to jointly relate genetic markers to binary traits while controlling for confounding. The first phase of our approach makes use of a marginal scan to identify a reduced set of candidate markers that are then evaluated jointly via a hierarchical model in the second phase. Final marker selection is accomplished through identifying a sparse estimator via a novel and computationally efficient maximum a posteriori estimation technique. We evaluate the performance of the proposed approach through extensive numerical studies, and consider a genome-wide application involving colorectal cancer.
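The first phase, a marginal scan, can be sketched as ranking markers by a per-marker test and retaining the top candidates for the joint hierarchical model; a stdlib illustration with hypothetical allele-count tables:

```python
def chi2_2x2(a, b, c, d):
    """Chi-square statistic for a 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    num = n * (a * d - b * c) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return num / den if den else 0.0

def marginal_scan(tables, top_k):
    """Phase 1: rank markers by marginal chi-square, keep the top_k."""
    ranked = sorted(tables, key=lambda m: -chi2_2x2(*tables[m]))
    return ranked[:top_k]

# Hypothetical allele counts per marker:
# (case minor, case major, control minor, control major).
tables = {'snp1': (30, 70, 10, 90), 'snp2': (20, 80, 18, 82),
          'snp3': (50, 50, 15, 85)}
```

Only the retained markers enter the second-phase joint model, which is what makes the joint analysis computationally tractable despite p far exceeding n.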

15.
The purposes of this study were 1) to examine the performance of a new multimarker regression approach for model-free linkage analysis in comparison to a conventional multipoint approach, and 2) to determine whether a conditioning strategy would improve the performance of the conventional multipoint method when applied to data from two interacting loci. Linkage analysis of the Kofendrerd Personality Disorder phenotype to chromosomes 1 and 3 was performed in three populations for all 100 replicates of the Genetic Analysis Workshop 14 simulated data. Three approaches were used: a conventional multipoint analysis using the Zlr statistic as calculated in the program ALLEGRO; a conditioning approach in which the per-family contribution on one chromosome was weighted according to evidence for linkage on the other chromosome; and a novel multimarker regression approach. The multipoint and multimarker approaches were generally successful in localizing known susceptibility loci on chromosomes 1 and 3, and were found to give broadly similar results. No advantage was found with the per-family conditioning approach. The effect on power and type I error of different choices of weighting scheme (to account for different numbers of affected siblings) in the multimarker approach was also examined.

17.
GenMiner is an implementation of association rule discovery dedicated to the analysis of genomic data. It allows the analysis of datasets integrating multiple sources of biological data represented as both discrete values, such as gene annotations, and continuous values, such as gene expression measures. GenMiner implements the new NorDi (normal discretization) algorithm for normalizing and discretizing continuous values, and takes advantage of the Close algorithm to efficiently generate minimal non-redundant association rules. Experiments show that the execution time and memory usage of GenMiner are significantly smaller than those of the standard Apriori-based approach, and that fewer association rules are extracted. AVAILABILITY: The GenMiner software and supplementary materials are available at http://bioinfo.unice.fr/publications/genminer_article/ and http://keia.i3s.unice.fr/?Implementations:GenMiner SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
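The rule-discovery task GenMiner performs can be sketched with a brute-force support/confidence filter (a stand-in for the Close algorithm, which additionally prunes to minimal non-redundant rules); the discretized transactions below are hypothetical:

```python
from itertools import combinations

def association_rules(transactions, min_support, min_confidence):
    """Extract single-antecedent rules A -> B passing both thresholds."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    support = {frozenset([i]): sum(i in t for t in transactions) / n
               for i in items}
    for a, b in combinations(items, 2):
        support[frozenset([a, b])] = sum(a in t and b in t
                                         for t in transactions) / n
    rules = []
    for a, b in combinations(items, 2):
        supp = support[frozenset([a, b])]
        for ant, cons in ((a, b), (b, a)):
            base = support[frozenset([ant])]
            conf = supp / base if base else 0.0
            if supp >= min_support and conf >= min_confidence:
                rules.append((ant, cons, supp, conf))
    return rules

# Hypothetical discretized data: one annotation item plus one
# NorDi-style expression-level item per sample.
transactions = [{'TF_target', 'expr_high'}, {'TF_target', 'expr_high'},
                {'TF_target', 'expr_high'}, {'expr_low'}]
```

On these four transactions the only rules surviving support 0.5 and confidence 0.9 link the annotation to high expression, in both directions.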

18.

This study aimed to construct a classification and regression tree (CART) model of glycosaminoglycans (GAGs) for the differential diagnosis of mucopolysaccharidoses (MPS). Two-dimensional electrophoresis and liquid chromatography–tandem mass spectrometry (LC–MS/MS) were used for the qualitative and quantitative analysis of GAGs. Specific enzyme assays and targeted gene sequencing were performed to confirm the diagnosis. Machine learning tools were used to develop the CART model based on the GAG profile. The qualitative and quantitative CART models showed 96.3% and 98.3% accuracy, respectively, in the differential diagnosis of MPS. Thresholds of the different GAGs diagnostic of specific MPS types were established. In 60 MPS-positive cases, 46 different mutations were identified in six genes. Among the 31 different mutations identified in IDUA, nine were nonsense mutations and two were gross deletions, while the remaining were missense mutations. In the IDS gene, four missense mutations, two frameshifts, and one deletion were identified. In the NAGLU gene, c.1693C>T and c.1914_1914insT were the most common mutations. Two ARSB mutations, and one case each of SGSH and GALNS mutations, were observed. The LC–MS/MS-based GAG pattern showed higher accuracy in the differential diagnosis of MPS. The mutation spectrum of MPS, specifically in the IDUA and IDS genes, is highly heterogeneous among the cases studied.
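A CART model chooses diagnostic thresholds such as those reported here by minimizing weighted Gini impurity. A minimal sketch of one split on a single GAG feature, with hypothetical concentrations:

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(values, labels):
    """Best threshold on one feature by weighted Gini impurity of the split."""
    best = (None, float('inf'))
    for thr in sorted(set(values))[:-1]:
        left = [l for v, l in zip(values, labels) if v <= thr]
        right = [l for v, l in zip(values, labels) if v > thr]
        score = (len(left) * gini(left)
                 + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (thr, score)
    return best

# Hypothetical GAG concentrations with MPS-type labels.
values = [1.0, 1.5, 2.0, 8.0, 9.0, 9.5]
labels = ['MPS_I', 'MPS_I', 'MPS_I', 'MPS_II', 'MPS_II', 'MPS_II']
```

The full CART algorithm applies this search recursively over all GAG features, which is how type-specific thresholds emerge from the profile.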


19.
Bouaziz M  Ambroise C  Guedj M 《PloS one》2011,6(12):e28845
Genome-wide association studies are powerful tools to detect genetic variants associated with diseases. Their results have, however, been questioned, in part because of the bias induced by population stratification, a consequence of systematic differences in allele frequencies due to differences in sample ancestries that can lead to both false-positive and false-negative findings. Many strategies are available to account for stratification, but their performances differ, for instance according to the type of population structure, the disease susceptibility locus minor allele frequency, the degree of sampling imbalance, or the sample size. We focus on the type of population structure and propose a comparison of the most commonly used methods to deal with stratification: Genomic Control, principal component-based methods such as implemented in Eigenstrat, adjusted regressions, and meta-analysis strategies. Our assessment of the methods is based on a large simulation study involving several scenarios corresponding to many types of population structure. We focused on both false-positive rate and power to determine which methods perform best. Our analysis showed that if there is no population structure, none of the tests was biased or lost power, except for the meta-analyses. When the population is stratified, adjusted logistic regressions and Eigenstrat are the best solutions to account for stratification, even though only the logistic regressions are able to consistently maintain correct false-positive rates. This study provides more details about these methods; their advantages and limitations in different stratification scenarios are highlighted in order to propose practical guidelines for accounting for population stratification in genome-wide association studies.
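Of the compared corrections, Genomic Control is the simplest to sketch: the inflation factor lambda is the median observed 1-df chi-square statistic divided by its theoretical median (about 0.4549), and all statistics are rescaled by lambda (clamped at 1, as is conventional). The statistics below are hypothetical:

```python
from statistics import median

CHI2_1DF_MEDIAN = 0.4549  # theoretical median of a 1-df chi-square

def genomic_control(chi2_stats):
    """Estimate the inflation factor lambda and rescale the test statistics."""
    lam = max(median(chi2_stats) / CHI2_1DF_MEDIAN, 1.0)
    return lam, [s / lam for s in chi2_stats]

# Hypothetical genome-wide 1-df statistics inflated by stratification.
stats = [0.2, 0.5, 0.9, 1.4, 25.0]
```

Because a single lambda shrinks every statistic by the same factor, Genomic Control is conservative when stratification affects loci unevenly, which is one reason the adjusted regressions compare favorably in the stratified scenarios.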

20.
The discovery of protein variation is an important strategy in disease diagnosis within the biological sciences. The current benchmark for elucidating information from multiple biological variables is the so-called "omics" disciplines of the biological sciences. Such variability is uncovered by multivariable data mining techniques, which fall into two primary categories: machine learning strategies and statistics-based approaches. Typically, proteomic studies can produce hundreds or thousands of variables, p, per observation, n, depending on the analytical platform or method employed to generate the data. Many classification methods are limited by a requirement that n exceed p and, as such, require pre-treatment of the data to reduce its dimensionality prior to classification. Recently, machine learning techniques have gained popularity in the field for their ability to successfully classify unknown samples. One limitation of such methods is the lack of a functional model allowing meaningful interpretation of results in terms of the features used for classification. This problem might be solved using a statistical model-based approach in which not only is the importance of the individual proteins explicit, but they are combined into a readily interpretable classification rule without relying on a black box. Here we apply the statistical dimension-reduction techniques Partial Least Squares (PLS) and Principal Components Analysis (PCA), followed by both statistical and machine learning classification methods, and compare them to a popular machine learning technique, Support Vector Machines (SVM). Both PLS and SVM demonstrate strong utility for proteomic classification problems.
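The PCA step used for dimension reduction can be sketched with a power iteration for the first principal component; the two-dimensional "spectra" below are hypothetical:

```python
def first_pc(data, iters=200):
    """First principal component via power iteration on the covariance matrix."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in data]
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

# Hypothetical samples with variance concentrated along the (1, 1) direction.
data = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8)]
```

Projecting each sample onto the leading components turns a p-dimensional profile into a handful of scores, which is exactly the pre-treatment that lets n > p classifiers be applied downstream.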
