Similar articles
 Found 20 similar articles; search took 742 ms
1.
With the advance of genome-wide association studies and newly identified SNP (single-nucleotide polymorphism) associations with complex disease, important discoveries have emerged focusing not only on individual genes but on disease-associated pathways and gene sets. The authors used prospective myocardial infarction case-control studies nested in the Nurses’ Health and Health Professionals Follow-Up Studies to investigate genetic variants associated with myocardial infarction or LDL, HDL, triglycerides, adiponectin and apolipoprotein B (apoB). Using these case-control studies to illustrate an integrative systems biology approach, the authors applied SNP set enrichment analysis to identify gene sets where expression SNPs representing genes from these sets show enrichment in their association with endpoints of interest. The authors also explored an aggregate score approach. While limited power constrained the ability to detect significance for association of individual loci with myocardial infarction, the authors found significance for loci associated with LDL, HDL, apoB and triglycerides, replicating previous observations. Applying SNP set enrichment analysis and risk score methods, the authors also found significance for three gene sets and for aggregate scores associated with myocardial infarction as well as for loci related to cardiovascular risk factors, supporting the use of these methods in practice.

2.
Comprehensive characterization of a gene's impact on phenotypes requires knowledge of the context of the gene. To address this issue we introduce a systematic data integration method, Candidate Genes and SNPs (CANGES), that links SNP and linkage disequilibrium data to pathway and protein-protein interaction information. It can be used as a knowledge discovery tool to search for disease-associated causative variants from genome-wide studies as well as to generate new hypotheses on synergistically functioning genes. We demonstrate the utility of CANGES by integrating pathway and protein-protein interaction data to identify putative functional variants for (i) the p53 gene and (ii) three glioblastoma multiforme (GBM) associated risk genes. For the GBM case, we further integrate the CANGES results with clinical and genome-wide data for 209 GBM patients and identify genes having effects on GBM patient survival. Our results show that selecting a focused set of genes can yield information beyond the traditional genome-wide association approaches. Taken together, a holistic approach to identifying possible interacting genes and SNPs with CANGES provides a means to rapidly identify networks for any set of genes and generate novel hypotheses. CANGES is available at http://csbi.ltdk.helsinki.fi/CANGES/

3.

Background

Recent development of high-resolution single nucleotide polymorphism (SNP) arrays allows detailed assessment of genome-wide variation in the human genome. There is increasing recognition of the importance of SNPs for medicine and developmental biology. However, a SNP data set typically has a large number of SNPs (e.g., 400 thousand SNPs in a genome-wide Parkinson disease data set) and only a few hundred samples. Conventional classification methods may not be effective when applied to such genome-wide SNP data.

Results

In this paper, we use a shrunken dissimilarity measure to analyze and select relevant SNPs for classification problems. Examples using HapMap data and Parkinson disease (PD) data are given to demonstrate the effectiveness of the proposed method and to illustrate that it has the potential to become a useful analysis tool for SNP data sets. We use the Parkinson disease data as an example and perform a whole-genome analysis. From the 367440 SNPs with less than 1% missing data across all 22 chromosomes, we select 357 SNPs. For the unique genes in which those SNPs are located, a gene-gene similarity value is computed using GOSemSim, and gene pairs with a similarity value greater than a threshold are selected to construct several groups of genes. For the SNPs involved in these groups of genes, the statistical software PLINK is employed to compute the pairwise SNP-SNP interactions, and SNPs with significance of P < 0.01 are chosen to identify SNP networks based on their P values. Here SNP networks are constructed based on Gene Ontology knowledge, and therefore each SNP network plays a role in a biological process. An analysis shows that such networks are directly or indirectly related to Parkinson disease.

Conclusions

Experimental results show that our approach is suitable for handling genetic variations and provides useful knowledge in a genome-wide SNP study.

4.
In many case-control genetic association studies, a set of correlated secondary phenotypes that may share common genetic factors with disease status are collected. Examination of these secondary phenotypes can yield valuable insights about the disease etiology and supplement the main studies. However, due to unequal sampling probabilities between cases and controls, standard regression analysis that assesses the effect of SNPs (single nucleotide polymorphisms) on secondary phenotypes using cases only, controls only, or combined samples of cases and controls can yield inflated type I error rates when the test SNP is associated with the disease. To solve this issue, we propose a Gaussian copula-based approach that efficiently models the dependence between disease status and secondary phenotypes. Through simulations, we show that our method yields correct type I error rates for the analysis of secondary phenotypes under a wide range of situations. To illustrate the effectiveness of our method in the analysis of real data, we applied our method to a genome-wide association study on high-density lipoprotein cholesterol (HDL-C), where "cases" are defined as individuals with extremely high HDL-C level and "controls" are defined as those with low HDL-C level. We treated 4 quantitative traits with varying degrees of correlation with HDL-C as secondary phenotypes and tested for association with SNPs in LIPG, a gene that is well known to be associated with HDL-C. We show that when the correlation between the primary and secondary phenotypes is >0.2, the P values from the case-control combined unadjusted analysis are much more significant than those from methods that aim to correct for ascertainment bias. Our results suggest that to avoid false-positive associations, it is important to appropriately model secondary phenotypes in case-control genetic association studies.

5.
The advent of high-throughput sequencing technology has resulted in the ability to measure millions of single-nucleotide polymorphisms (SNPs) from thousands of individuals. Although these high-dimensional data have paved the way for better understanding of the genetic architecture of common diseases, they have also given rise to challenges in developing computational methods for learning epistatic relationships among genetic markers. We propose a new method, named cuckoo search epistasis (CSE), for identifying significant epistatic interactions in population-based association studies with a case–control design. This method combines a computationally efficient Bayesian scoring function with an evolutionary-based heuristic search algorithm, and can be efficiently applied to high-dimensional genome-wide SNP data. The experimental results from synthetic data sets show that CSE outperforms existing methods including multifactorial dimensionality reduction and Bayesian epistasis association mapping. In addition, on a real genome-wide data set related to Alzheimer's disease, CSE identified SNPs that are consistent with previously reported results, showing the utility of CSE for application to genome-wide data.

6.
The data from genome-wide association studies (GWAS) in humans are still predominantly analyzed using single-marker association methods. As an alternative to single-marker analysis (SMA), all or subsets of markers can be tested simultaneously. This approach requires a form of penalized regression (PR) as the number of SNPs is much larger than the sample size. Here we review PR methods in the context of GWAS, extend them to perform penalty parameter and SNP selection by false discovery rate (FDR) control, and assess their performance in comparison with SMA. PR methods were compared with SMA, using realistically simulated GWAS data with a continuous phenotype and real data. Based on these comparisons our analytic FDR criterion may currently be the best approach to SNP selection using PR for GWAS. We found that PR with FDR control provides substantially more power than SMA with genome-wide type-I error control but somewhat less power than SMA with Benjamini–Hochberg FDR control (SMA-BH). PR with FDR-based penalty parameter selection controlled the FDR somewhat conservatively while SMA-BH may not achieve FDR control in all situations. Differences among PR methods seem quite small when the focus is on SNP selection with FDR control. Incorporating linkage disequilibrium into the penalization by adapting penalties developed for covariates measured on graphs can improve power but also generate more false positives or wider regions for follow-up. We recommend the elastic net with a mixing weight for the Lasso penalty near 0.5 as the best method.
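
The Benjamini–Hochberg step-up rule behind the SMA-BH comparison can be illustrated with a short sketch. This is a generic textbook version of the procedure, not the authors' implementation; the function name `benjamini_hochberg` and the example p-values are assumptions made for illustration.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return the indices of hypotheses rejected at FDR level q.

    Generic Benjamini-Hochberg step-up rule (illustrative sketch):
    sort the p-values, find the largest rank k with p_(k) <= k * q / m,
    and reject the k hypotheses with the smallest p-values.
    """
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            cutoff = rank  # keep the largest rank that passes
    return sorted(order[:cutoff])


# Hypothetical single-marker p-values for ten SNPs.
example = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(example, q=0.05))
```

Because the rule is step-up, a p-value that fails its own threshold can still be rejected when a larger p-value passes a later threshold.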

7.
The objective of the study was to identify interacting genes contributing to rheumatoid arthritis (RA) susceptibility and identify SNPs that discriminate between RA patients who were anti-cyclic citrullinated protein positive and healthy controls. We analyzed two independent cohorts from the North American Rheumatoid Arthritis Consortium. A cohort of 908 RA cases and 1,260 controls was used to discover pairwise interactions among SNPs and to identify a set of single nucleotide polymorphisms (SNPs) that predict RA status, and a second cohort of 952 cases and 1,760 controls was used to validate the findings. After adjusting for HLA-shared epitope alleles, we identified and replicated seven SNP pairs within the HLA class II locus with significant interaction effects. We failed to replicate significant pairwise interactions among non-HLA SNPs. The machine learning approach “random forest” applied to a set of SNPs selected from single-SNP and pairwise interaction tests identified 93 SNPs that distinguish RA cases from controls with 70% accuracy. HLA SNPs provide the most classification information, and inclusion of non-HLA SNPs improved classification. While specific gene–gene interactions are difficult to validate using genome-wide SNP data, a stepwise approach combining association and classification methods identifies candidate interacting SNPs that distinguish RA cases from healthy controls.

8.
The extended Simes’ test (known as GATES) and scaled chi-square test were proposed to combine a set of dependent genome-wide association signals at multiple single-nucleotide polymorphisms (SNPs) for assessing the overall significance of association at the gene or pathway levels. The two tests use different strategies to combine association p values and can outperform each other when the number of and linkage disequilibrium between SNPs vary. In this paper, we introduce a hybrid set-based test (HYST) combining the two tests for genome-wide association studies (GWASs). We describe how HYST can be used to evaluate statistical significance for association at the protein-protein interaction (PPI) level in order to increase power for detecting disease-susceptibility genes of moderate effect size. Computer simulations demonstrated that HYST had a reasonable type 1 error rate and was generally more powerful than its parents and other alternative tests to detect a PPI pair where both genes are associated with the disease of interest. We applied the method to three complex disease GWAS data sets in the public domain; the method detected a number of highly connected significant PPI pairs involving multiple confirmed disease-susceptibility genes not found in the SNP- and gene-based association analyses. These results indicate that HYST can be effectively used to examine a collection of predefined SNP sets based on prior biological knowledge for revealing additional disease-predisposing genes of modest effects in GWASs.
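
To make the idea of combining per-SNP p values into one set-level p value concrete, here is a minimal sketch of the plain Simes combination. It is not GATES or HYST: the extended Simes' test is described above as handling dependent signals, and this sketch makes no adjustment for linkage disequilibrium. The function name `simes_p` and the input values are assumptions for illustration.

```python
def simes_p(p_values):
    """Combine p-values with the plain (unextended) Simes procedure.

    With sorted p-values p_(1) <= ... <= p_(m), the combined p-value is
    min over j of m * p_(j) / j. No correction for correlation between
    SNPs is applied here; that part is deliberately left out.
    """
    m = len(p_values)
    if m == 0:
        raise ValueError("at least one p-value is required")
    ordered = sorted(p_values)
    return min(m * p / rank for rank, p in enumerate(ordered, start=1))


# Hypothetical p-values for three SNPs mapped to one gene.
print(simes_p([0.01, 0.04, 0.03]))
```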

9.
The increased numbers of genetic markers produced by genomic techniques have the potential to both identify hybrid individuals and localize chromosomal regions responding to selection and contributing to introgression. We used restriction-site-associated DNA sequencing to identify a dense set of candidate SNP loci with fixed allelic differences between introduced rainbow trout (Oncorhynchus mykiss) and native westslope cutthroat trout (Oncorhynchus clarkii lewisi). We distinguished candidate SNPs from homeologs (paralogs resulting from whole-genome duplication) by detecting excessively high observed heterozygosity and deviations from Hardy-Weinberg proportions. We identified 2923 candidate species-specific SNPs from a single Illumina sequencing lane containing 24 barcode-labelled individuals. Published sequence data and ongoing genome sequencing of rainbow trout will allow physical mapping of SNP loci for genome-wide scans and will also provide flanking sequence for design of qPCR-based TaqMan assays for high-throughput, low-cost hybrid identification using a subset of 50-100 loci. This study demonstrates that it is now feasible to identify thousands of informative SNPs in nonmodel species quickly and at reasonable cost, even if no prior genomic information is available.
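
The two signals mentioned above for flagging homeologs, excess observed heterozygosity and deviation from Hardy-Weinberg proportions, can be computed from genotype counts at one biallelic locus. The sketch below is a generic chi-square goodness-of-fit calculation, not the authors' pipeline; the function name `hwe_summary` and the genotype counts are hypothetical, and no filtering threshold is implied.

```python
def hwe_summary(n_aa, n_ab, n_bb):
    """Summarize one biallelic locus from its genotype counts.

    Returns observed heterozygosity, the heterozygosity expected under
    Hardy-Weinberg proportions, and the chi-square goodness-of-fit
    statistic comparing observed and expected genotype counts.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotyped individuals")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p                      # frequency of allele B
    observed = (n_aa, n_ab, n_bb)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    # Skip genotype classes with zero expected count (monomorphic loci).
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    return {"obs_het": n_ab / n, "exp_het": 2 * p * q, "chi2": chi2}


# Hypothetical locus where all 24 individuals appear heterozygous,
# the pattern expected when reads from two duplicated copies collapse.
print(hwe_summary(0, 24, 0))
```

A locus at which every individual appears heterozygous gives an observed heterozygosity of 1.0 against an expectation of 0.5, which is the kind of excess the abstract describes.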

10.
Next-generation sequencing has transformed the fields of ecological and evolutionary genetics by allowing for cost-effective identification of genome-wide variation. Single nucleotide polymorphism (SNP) arrays, or “SNP chips”, enable very large numbers of individuals to be consistently genotyped at a selected set of these identified markers, and also offer the advantage of being able to analyse samples of variable DNA quality. We used reduced representation restriction-aided digest sequencing (RAD-seq) of 31 birds of the threatened hihi (Notiomystis cincta; stitchbird) and low-coverage whole genome sequencing (WGS) of 10 of these birds to develop an Affymetrix 50 K SNP chip. We overcame the limitations of having no hihi reference genome and a low quantity of sequence data by separate and pooled de novo assembly of each of the 10 WGS birds. Reads from all individuals were mapped back to these de novo assemblies to identify SNPs. A subset of RAD-seq and WGS SNPs were selected for inclusion on the chip, prioritising SNPs with the highest quality scores whose flanking sequence uniquely aligned to the zebra finch (Taeniopygia guttata) genome. Of the 58,466 SNPs manufactured on the chip, 72% passed filtering metrics and were polymorphic. By genotyping 1,536 hihi on the array, we found that SNPs detected in multiple assemblies were more likely to successfully genotype, representing a cost-effective approach to identify SNPs for genotyping. Here, we demonstrate the utility of the SNP chip by describing the high rates of linkage disequilibrium in the hihi genome, reflecting the history of population bottlenecks in the species.

11.
Genetic mutations may interact to increase the risk of human complex diseases. Mapping of multiple interacting disease loci in the human genome has recently shown promise in detecting genes with little main effect. The power of interaction association mapping, however, can be greatly influenced by the set of single nucleotide polymorphisms (SNPs) genotyped in a case-control study. Previous imputation methods only focus on imputation of individual SNPs without considering their joint distribution of possible interactions. We present a new method that simultaneously detects multilocus interaction associations and imputes missing SNPs from a full Bayesian model. Our method treats both the case-control sample and the reference data as random observations. The output of our method is the posterior probabilities of SNPs for their marginal and interacting associations with the disease. Using simulations, we show that the method produces accurate and robust imputation with few overfitting problems. We further show that, with the type I error rate maintained at a common level, SNP imputation can consistently and sometimes substantially improve the power of detecting disease interaction associations. We use a data set of inflammatory bowel disease to demonstrate the application of our method.

12.
GWAS have emerged as popular tools for identifying genetic variants that are associated with disease risk. Standard analysis of a case-control GWAS involves assessing the association between each individual genotyped SNP and disease risk. However, this approach suffers from limited reproducibility and difficulties in detecting multi-SNP and epistatic effects. As an alternative analytical strategy, we propose grouping SNPs together into SNP sets on the basis of proximity to genomic features such as genes or haplotype blocks, then testing the joint effect of each SNP set. Testing of each SNP set proceeds via the logistic kernel-machine-based test, which is based on a statistical framework that allows for flexible modeling of epistatic and nonlinear SNP effects. This flexibility and the ability to naturally adjust for covariate effects are important features of our test that make it appealing in comparison to individual SNP tests and existing multimarker tests. Using simulated data based on the International HapMap Project, we show that SNP-set testing can have improved power over standard individual-SNP analysis under a wide range of settings. In particular, we find that our approach has higher power than individual-SNP analysis when the median correlation between the disease-susceptibility variant and the genotyped SNPs is moderate to high. When the correlation is low, both individual-SNP analysis and the SNP-set analysis tend to have low power. We apply SNP-set analysis to analyze the Cancer Genetic Markers of Susceptibility (CGEMS) breast cancer GWAS discovery-phase data.

13.
14.
GWAS have greatly facilitated the discovery of risk SNPs associated with complex diseases. Traditional methods analyze SNPs individually and are limited by low power and reproducibility since correction for multiple comparisons is necessary. Several methods have been proposed based on grouping SNPs into SNP sets using biological knowledge and/or genomic features. In this article, we compare the linear kernel machine based test (LKM) and the principal components analysis based approach (PCA) using simulated datasets under scenarios of 0 to 3 causal SNPs, as well as simple and complex linkage disequilibrium (LD) structures of the simulated regions. Our simulation study demonstrates that both LKM and PCA can control the type I error at the significance level of 0.05. If the causal SNP is in strong LD with the genotyped SNPs, both the PCA with a small number of principal components (PCs) and the LKM with a linear or identical-by-state kernel are valid tests. However, if the LD structure is complex, such as several LD blocks in the SNP set, or when the causal SNP is not in the LD block in which most of the genotyped SNPs reside, more PCs should be included to capture the information of the causal SNP. Simulation studies also demonstrate the ability of LKM and PCA to combine information from multiple causal SNPs and to provide increased power over individual SNP analysis. We also apply LKM and PCA to analyze two SNP sets extracted from an actual GWAS dataset on non-small cell lung cancer.

15.
16.
A great promise of publicly sharing genome-wide association data is the potential to create composite sets of controls. However, studies often use different genotyping arrays, and imputation to a common set of SNPs has shown substantial bias: a problem which has no broadly applicable solution. Based on the idea that using differing genotyped SNP sets as inputs creates differential imputation errors and thus bias in the composite set of controls, we examined the degree to which each of the following occurs: (1) imputation based on the union of genotyped SNPs (i.e., SNPs available on one or more arrays) results in bias, as evidenced by spurious associations (type 1 error) between imputed genotypes and arbitrarily assigned case/control status; (2) imputation based on the intersection of genotyped SNPs (i.e., SNPs available on all arrays) does not evidence such bias; and (3) imputation quality varies by the size of the intersection of genotyped SNP sets. Imputations were conducted in European Americans and African Americans with reference to HapMap phase II and III data. Imputation based on the union of genotyped SNPs across the Illumina 1M and 550v3 arrays showed spurious associations for 0.2% of SNPs: ~2,000 false positives per million SNPs imputed. Biases remained problematic for very similar arrays (550v1 vs. 550v3) and were substantial for dissimilar arrays (Illumina 1M vs. Affymetrix 6.0). In all instances, imputing based on the intersection of genotyped SNPs (as few as 30% of the total SNPs genotyped) eliminated such bias while still achieving good imputation quality.

17.
There has been increased interest in discovering combinations of single-nucleotide polymorphisms (SNPs) that are strongly associated with a phenotype even if each SNP has little individual effect. Efficient approaches have been proposed for searching two-locus combinations from genome-wide datasets. However, for high-order combinations, existing methods either adopt a brute-force search which only handles a small number of SNPs (up to a few hundred), or use heuristic search that may miss informative combinations. In addition, existing approaches lack statistical power because of the use of statistics with high degrees of freedom and the huge number of hypotheses tested during combinatorial search. Due to these challenges, functional interactions in high-order combinations have not been systematically explored. We leverage discriminative-pattern-mining algorithms from the data-mining community to search for high-order combinations in case-control datasets. The substantially improved efficiency and scalability demonstrated on synthetic and real datasets with several thousands of SNPs allows the study of several important mathematical and statistical properties of SNP combinations with order as high as eleven. We further explore functional interactions in high-order combinations and reveal a general connection between the increase in discriminative power of a combination over its subsets and the functional coherence among the genes comprising the combination, supported by multiple datasets. Finally, we study several significant high-order combinations discovered from a lung-cancer dataset and a kidney-transplant-rejection dataset in detail to provide novel insights on the complex diseases. Interestingly, many of these associations involve combinations of common variations that occur in small fractions of the population. Thus, our approach is an alternative methodology for exploring the genetics of rare diseases for which the current focus is on individually rare variations.

18.
19.
Genome-wide association studies (GWAS) provide a powerful approach for identifying quantitative trait loci without prior knowledge of location or function. To identify loci associated with wool production traits, we performed a genome-wide association study on a total of 765 Chinese Merino sheep (JunKen type) genotyped with 50 K single nucleotide polymorphisms (SNPs). In the present study, five wool production traits were examined: fiber diameter, fiber diameter coefficient of variation, fineness dispersion, staple length and crimp. We detected 28 genome-wide significant SNPs for fiber diameter, fiber diameter coefficient of variation, fineness dispersion, and crimp in the Chinese Merino sheep. About 43% of the significant SNP markers were located within known or predicted genes, including YWHAZ, KRTCAP3, TSPEAR, PIK3R4, KIF16B, PTPN3, GPRC5A, DDX47, TCF9, TPTE2, EPHA5 and NBEA. Our results not only confirm the results of previous reports, but also provide a suite of novel SNP markers and candidate genes associated with wool traits. Our findings will be useful for exploring the genetic control of wool traits in sheep.

20.
MOTIVATION: Using simulation studies for quantitative trait loci (QTL), we evaluate the prediction quality of regression models that include as covariates single-nucleotide polymorphism (SNP) genetic markers which did not achieve genome-wide significance in the original genome-wide association study, but were among the SNPs with the smallest P-value for the selected association test. We compare the results of such regression models to the standard approach which is to include only SNPs that achieve genome-wide significance. Using mean square prediction error as the model metric, our simulation results suggest that by using the coefficient of determination (R²) value as a guideline to increase or reduce the number of SNPs included in the regression model, we can achieve better prediction quality than the standard approach. However, important parameters such as trait heritability, the approximate number of QTLs, etc. have to be determined from previous studies or have to be estimated accurately.
