We applied three approaches for the identification of polymorphisms explaining the linkage evidence to the Genetic Analysis Workshop 14 simulated data: 1) the genotype-IBD sharing test (GIST); 2) an approach suggested by Horikawa and colleagues; and 3) the homozygote sharing test (HST). These tests were compared with a family-based association test. Two linked regions with the highest nonparametric linkage scores were selected for applying these methods. In the first region, Horikawa's method identified the most SNPs within the region containing the disease susceptibility locus, while HST performed best in the second region. However, Horikawa's method also had the most type I errors. These methods show potential as additional tools to complement family-based association tests for the identification of disease susceptibility variants.

Multifactor dimensionality reduction (MDR) is a model-free approach that can identify gene × gene or gene × environment effects in a case-control study. Here we explore several modifications of the MDR method. We extended MDR to provide model selection without cross-validation, and to use a chi-square statistic as an alternative to prediction error (PE). We also modified the permutation test to provide different levels of stringency. The extended MDR (EMDR) includes three permutation tests (fixed, non-fixed, and omnibus) to obtain p-values of multilocus models. The goal of this study was to compare the different approaches implemented in the EMDR method and evaluate their ability to identify genetic effects in the Genetic Analysis Workshop 14 simulated data. We used three replicates from the simulated family data, generating matched pairs from family triads. The results showed: 1) chi-square and PE statistics give nearly consistent results; 2) results of EMDR without cross-validation matched those of EMDR with 10-fold cross-validation; 3) the fixed permutation test reports false-positive results in data from loci unrelated to the disease, but the non-fixed and omnibus permutation tests perform well in preventing false positives, with the omnibus test being the most conservative. We conclude that the non-cross-validation test can provide accurate results with the advantage of high efficiency compared to 10-fold cross-validation, and the non-fixed permutation test provides a good compromise between power and false-positive rate.
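As a rough illustration of the ideas in this abstract, the sketch below labels each multilocus genotype cell as high- or low-risk, scores the resulting 2×2 table with a Pearson chi-square statistic, and obtains a p-value by permuting case/control labels. It is a minimal sketch under stated assumptions, not the EMDR software: the function names, the data layout (one tuple of genotype codes per subject, status coded 0/1), and the generic label-permutation scheme are all hypothetical, and the scheme does not reproduce the fixed, non-fixed, or omnibus tests described above.

```python
import random
from collections import Counter


def mdr_labels(genotypes, status):
    """Label each multilocus genotype cell high-risk (True) or low-risk (False).

    genotypes: list of tuples, one tuple of genotype codes per subject
    status:    list of 0/1 values (control/case), same length as genotypes
    Assumes at least one case and one control are present.
    """
    cases = Counter(g for g, s in zip(genotypes, status) if s == 1)
    controls = Counter(g for g, s in zip(genotypes, status) if s == 0)
    # A cell is high-risk when its case:control ratio exceeds the overall ratio.
    threshold = sum(cases.values()) / sum(controls.values())
    return {cell: cases[cell] > threshold * controls[cell]
            for cell in set(genotypes)}


def chi_square_2x2(genotypes, status):
    """Pearson chi-square for the table (high/low risk) x (case/control)."""
    labels = mdr_labels(genotypes, status)
    a = b = c = d = 0
    for g, s in zip(genotypes, status):
        if labels[g] and s == 1:
            a += 1          # high-risk case
        elif labels[g]:
            b += 1          # high-risk control
        elif s == 1:
            c += 1          # low-risk case
        else:
            d += 1          # low-risk control
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def permutation_p_value(genotypes, status, n_perm=1000, seed=0):
    """Hypothetical label-permutation test: the risk labels are refitted
    on every shuffled data set before the statistic is recomputed."""
    rng = random.Random(seed)
    observed = chi_square_2x2(genotypes, status)
    shuffled = list(status)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if chi_square_2x2(genotypes, shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)
```

On a toy data set of 20 subjects in which nine of the ten (0, 0) carriers are cases and nine of the ten (1, 1) carriers are controls, `chi_square_2x2` returns 12.8. Because the statistic is computed on the full data set here, no cross-validation split is involved.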

The purposes of this study were 1) to examine the performance of a new multimarker regression approach for model-free linkage analysis in comparison to a conventional multipoint approach, and 2) to determine whether a conditioning strategy would improve the performance of the conventional multipoint method when applied to data from two interacting loci. Linkage analysis of the Kofendrerd Personality Disorder phenotype to chromosomes 1 and 3 was performed in three populations for all 100 replicates of the Genetic Analysis Workshop 14 simulated data. Three approaches were used: a conventional multipoint analysis using the Zlr statistic as calculated in the program ALLEGRO; a conditioning approach in which the per-family contribution on one chromosome was weighted according to evidence for linkage on the other chromosome; and a novel multimarker regression approach. The multipoint and multimarker approaches were generally successful in localizing known susceptibility loci on chromosomes 1 and 3, and were found to give broadly similar results. No advantage was found with the per-family conditioning approach. The effect on power and type I error of different choices of weighting scheme (to account for different numbers of affected siblings) in the multimarker approach was examined.

We used our newly developed linkage disequilibrium (LD) plotting software, JLIN, to plot linkage disequilibrium between pairs of single-nucleotide polymorphisms (SNPs) for three chromosomes of the Genetic Analysis Workshop 14 Aipotu simulated population to assess the effect of missing data on LD calculations. Our haplotype analysis program, SIMHAP, was used to assess the effect of missing data on haplotype-phenotype association. Genotype data were removed at random, at levels of 1%, 5%, and 10%, and the LD calculations and haplotype association results for these levels of missingness were compared to those for the complete dataset. It was concluded that ignoring individuals with missing data substantially affects the number of regions of LD detected, which, in turn, could affect the tagging SNPs chosen to generate haplotypes.
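To make the LD calculations concrete, the sketch below computes the standard pairwise measures D, D′, and r² for two biallelic SNPs, dropping any haplotype with a missing allele (a complete-case analysis of the kind whose consequences the abstract examines). It is a minimal sketch and not JLIN: it assumes phased haplotypes are already available, alleles are coded 0/1 with `None` for missing, and the function names are hypothetical.

```python
def ld_measures(p_ab, p_a, p_b):
    """Return (D, D', r^2) from the A-B haplotype frequency and the
    frequencies of allele A at SNP 1 and allele B at SNP 2."""
    d = p_ab - p_a * p_b
    # D' scales D by its maximum attainable magnitude given the allele
    # frequencies; the bound depends on the sign of D.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = 0.0 if d_max == 0 else d / d_max
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = 0.0 if denom == 0 else d * d / denom
    return d, d_prime, r2


def ld_from_haplotypes(haplotypes):
    """haplotypes: list of (allele_at_snp1, allele_at_snp2) pairs coded
    0/1, with None marking a missing allele. Incomplete pairs are dropped."""
    complete = [(a, b) for a, b in haplotypes
                if a is not None and b is not None]
    n = len(complete)
    if n == 0:
        raise ValueError("no complete haplotypes")
    p_a = sum(a for a, _ in complete) / n
    p_b = sum(b for _, b in complete) / n
    p_ab = sum(1 for a, b in complete if a == 1 and b == 1) / n
    return ld_measures(p_ab, p_a, p_b)
```

Calling `ld_from_haplotypes` on the same data before and after setting a random fraction of alleles to `None` gives a simple way to see how the estimates shift with missingness.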

We combined the results of whole-genome linkage and association analyses to determine which markers were most strongly associated with Kofendrerd Personality Disorder. Using replicate 1 from the Genetic Analysis Workshop 14 Aipotu, Karangar, Danacaa, and New York City simulated populations, we determined that several markers showed significant linkage and association with disease status. We used both SNP and microsatellite markers to determine patterns and chromosomal regions of markers. Three consistently associated markers were C01R0050, C03R0280, and C10R0882. Using generalized linear mixed models, we modelled the effect of the three predefined phenotypic categories on disease status and concluded that the phenotypes defining the "anxiety-related" category best predicted the outcome.

For mapping complex disease traits, linkage studies are often followed by a case-control association strategy in order to identify disease-associated genes/single-nucleotide polymorphisms (SNPs). Substantial effort is required to select the most informative cases from a large collection of affected individuals in order to maximize the power of the study, while taking study cost into consideration. In this article, we applied three case-selection strategies that use allele-sharing information in families with multiple affected offspring to select the most informative cases, and extended them to use additional information on disease severity. Our results revealed that the most significant associations, as measured by the lowest p-values, were obtained from a strategy that selected the case with the most allele sharing with other affected sibs from linked families ("linked-best"), despite the reduction in sample size resulting from discarding unlinked families. Moreover, information on disease severity appears to improve the ability to detect associations between markers and disease loci.

A genetic analysis of age of onset of alcoholism was performed on the Collaborative Study on the Genetics of Alcoholism data released for Genetic Analysis Workshop 14. Our study illustrates an application of the log-normal age of onset model in our software Genetic Epidemiology Models (GEMs). The phenotype ALDX1 of alcoholism was studied. The analysis strategy was to first find the markers of the Affymetrix SNP dataset with significant association with age of onset, and then to perform linkage analysis on them. ALDX1 revealed strong evidence of linkage for marker tsc0041591 on chromosome 2 and suggestive linkage for marker tsc0894042 on chromosome 3. The largest separation in mean age of onset of ALDX1 was between male smokers who carry the risk allele of tsc0041591 (19.76 years) and those who do not (24.41 years). Hence, male smokers who carry the risk allele of marker tsc0041591 on chromosome 2 have an average onset of ALDX1 almost 5 years earlier than non-carriers.

Recent studies have suggested that a high-density single-nucleotide polymorphism (SNP) marker set could provide equivalent or even superior information compared with currently used microsatellite (STR) marker sets for gene mapping by linkage. The focus of this study was to compare results obtained from linkage analyses involving extended pedigrees with STR and SNP marker sets. We also wanted to compare the performance of current linkage programs in the presence of high marker density and extended pedigree structures. One replicate of the Genetic Analysis Workshop 14 (GAW14) simulated extended pedigrees (n = 50) from New York City was analyzed to identify the major gene D2. Four marker sets with varying information content and density on chromosome 3 (STR [7.5 cM]; SNP [3 cM, 1 cM, 0.3 cM]) were analyzed to detect two traits, the original affection status, and a redefined trait more closely correlated with D2. Multipoint parametric and nonparametric linkage analyses (NPL) were performed using programs GENEHUNTER, MERLIN, SIMWALK2, and S.A.G.E. SIBPAL. Our results suggested that the densest SNP map (0.3 cM) had the greatest power to detect linkage for the original trait (genetic heterogeneity), with the highest LOD score/NPL score and mapping precision. However, no significant improvement in linkage signals was observed with the densest SNP map compared with STR or SNP-1 cM maps for the redefined affection status (genetic homogeneity), possibly due to the extremely high information content of all maps. Finally, our results suggested that each linkage program had limitations in handling the large, complex pedigrees as well as a high-density SNP marker set.



GAW20 working group 5 brought together researchers who contributed 7 papers with the aim of evaluating methods to detect genetic by epigenetic interactions. GAW20 distributed real data from the Genetics of Lipid Lowering Drugs and Diet Network (GOLDN) study, including single-nucleotide polymorphism (SNP) markers, methylation (cytosine-phosphate-guanine [CpG]) markers, and phenotype information on up to 995 individuals. In addition, a simulated data set based on the real data was provided.


The 7 contributed papers analyzed these data sets with a number of different statistical methods, including generalized linear mixed models, mediation analysis, machine learning, W-test, and sparsity-inducing regularized regression. These methods generally appeared to perform well. Several papers confirmed a number of causative SNPs in either the large number of simulation sets or the real data on chromosome 11. Findings were also reported for different SNPs, CpG sites, and SNP–CpG site interaction pairs.


In the simulation (200 replications), power appeared generally good for large interaction effects, but smaller effects will require larger studies or consortium collaboration to achieve sufficient power.



Multiple layers of genetic and epigenetic variability are being simultaneously explored in an increasing number of health studies. We summarize here different approaches applied in the Data Mining and Machine Learning group at the GAW20 to integrate genome-wide genotype and methylation array data.


We provide a non-intimidating introduction to some frequently used methods to investigate high-dimensional molecular data and compare the different approaches tried by group members: random forest, deep learning, cluster analysis, mixed models, and gene-set enrichment analysis. Group contributions were quite heterogeneous regarding investigated data sets (real vs simulated), conducted data quality control and assessed phenotypes (eg, metabolic syndrome vs relative differences of log-transformed triglyceride concentrations before and after fenofibrate treatment). However, some common technical issues were detected, leading to practical recommendations.


Different sources of correlation were identified by group members, including population stratification, family structure, batch effects, linkage disequilibrium and correlation of methylation values at neighboring cytosine-phosphate-guanine (CpG) sites, and the majority of applied approaches were able to take into account identified correlation structures. The ability to efficiently deal with high-dimensional omics data, and the model-free nature of the approaches that did not require detailed model specifications were clearly recognized as the main strengths of applied methods. A limitation of random forest is its sensitivity to highly correlated variables. The parameter setup and the interpretation of results from deep learning methods, in particular deep neural networks, can be extremely challenging. Cluster analysis and mixed models may need some prior dimension reduction based on existing literature, data filtering, and supplementary statistical methods, and gene-set enrichment analysis requires biological insight.

The adequacy of various phenetic and phylogenetic estimation methods was evaluated using simulated data sets. Two parsimony programs were used to construct maximum parsimony trees (WAGNER 78 and HENNIG 86). The CAFCA program was used to perform group-compatibility analysis. Four UPGMA clustering strategies were employed. The simulation model GENESIS was used to generate data sets under different evolutionary conditions. The effects of input parameters and tree properties on the accuracy of the estimated trees were evaluated. UPGMA based on product moment correlations of unstandardized characters appeared to perform best, under all evolutionary conditions tested. The effect of input parameters on the accuracy was not very significant. Among the tree statistics the stemminess of the true tree appeared to be the most important estimator of accuracy.
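For readers unfamiliar with UPGMA, the sketch below shows the core of the algorithm: repeatedly merge the two closest clusters and recompute distances to the merged cluster as a size-weighted average. It is a minimal sketch, not one of the four strategies evaluated in the study: it assumes a symmetric distance matrix as input (a similarity such as a product moment correlation would first have to be converted to a distance, for example 1 − r), it returns only the tree topology without branch lengths, and the function name is hypothetical.

```python
def upgma(names, matrix):
    """Cluster taxa by UPGMA.

    names:  list of taxon labels
    matrix: symmetric list-of-lists of pairwise distances, same order
    Returns the tree topology as nested tuples.
    """
    trees = list(names)
    sizes = [1] * len(names)
    dist = [row[:] for row in matrix]   # work on a copy
    while len(trees) > 1:
        n = len(trees)
        # Find the closest pair of current clusters (i < j).
        i, j = min(((i, j) for i in range(n) for j in range(i + 1, n)),
                   key=lambda ij: dist[ij[0]][ij[1]])
        keep = [k for k in range(n) if k not in (i, j)]
        # Distance from the merged cluster to each remaining cluster is the
        # average over all member pairs, i.e. a size-weighted mean.
        merged_row = [
            (sizes[i] * dist[i][k] + sizes[j] * dist[j][k])
            / (sizes[i] + sizes[j])
            for k in keep
        ]
        # Rebuild the matrix with the merged cluster as the last row/column.
        dist = [[dist[r][c] for c in keep] + [merged_row[x]]
                for x, r in enumerate(keep)]
        dist.append(merged_row + [0.0])
        merged_tree = (trees[i], trees[j])
        merged_size = sizes[i] + sizes[j]
        trees = [trees[k] for k in keep] + [merged_tree]
        sizes = [sizes[k] for k in keep] + [merged_size]
    return trees[0]
```

With four taxa whose distances are A–B = 2, A–C = B–C = 6, and 10 from D to each of the others, the function joins A and B first, then C, then D.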

In electron tomography, the reconstructed density function is typically corrupted by noise and artifacts. Under those conditions, separating the meaningful regions of the reconstructed density function is not trivial. Despite development efforts that specifically target electron tomography, manual segmentation continues to be the preferred method. Based on previous good experience with a segmentation based on fuzzy logic principles (fuzzy segmentation) in settings where the reconstructed density functions also have a low signal-to-noise ratio, we applied it to electron tomographic reconstructions. We demonstrate the usefulness of the fuzzy segmentation algorithm by evaluating it on electron tomograms of selectively stained, plastic-embedded spiny dendrites. The results produced by the fuzzy segmentation algorithm within the framework presented are encouraging.

Recently, alcohol-related traits have been shown to have a genetic component. Here, we study the association of specific genetic measures in one of the three sets of electrophysiological measures in families with alcoholism distributed as part of the Genetic Analysis Workshop 14 data, the NTTH (non-target case of Visual Oddball experiment for 4 electrode placements) phenotypes: ntth1, ntth2, ntth3, and ntth4. We focused on the analysis of the 786 Affymetrix markers on chromosome 4. Our desire was to find at least a partial answer to the question of whether ntth1, ntth2, ntth3, and ntth4 are separately or jointly genetically controlled, so we studied the principal components that explain most of the covariation of the four quantitative traits. The first principal component, which explains 70% of the covariation, showed association but not genetic linkage to two markers: tsc0272102 and tsc0560854. On the other hand, ntth1 appeared to be the trait driving the variation in the second principal component, which showed association and genetic linkage at markers in four regions: tsc0045058, tsc1213381, tsc0055068, and tsc0051777 at map distances 53.26, 85.42, 89.31, and 172.86, respectively. These results show that the partial answer to our starting question for this brief analysis is that the NTTH phenotypes are not jointly genetically controlled. The component ntth1 displays marked genetic linkage.

Multivariate phenotypes underlie complex traits. Thus, instead of using the end-point trait, it may be statistically more powerful to use a multivariate phenotype correlated to the end-point trait for detecting linkage. In this study, we develop a reverse regression method to analyze linkage of Kofendrerd Personality Disorder affection status in the New York population of the Genetic Analysis Workshop 14 (GAW14) simulated dataset. When we used the multivariate phenotype, we obtained significant evidence of linkage near four of the six putative loci in at least 25% of the replicates. On the other hand, the linkage analysis based on Kofendrerd Personality Disorder status as a phenotype produced significant findings only near two of the loci and in a smaller proportion of replicates.

Viral evolution remains a main obstacle to the effectiveness of antiviral treatments. The ability to predict this evolution will help in the early detection of drug-resistant strains and will potentially facilitate the design of more efficient antiviral treatments. Various tools have been utilized in genome studies to achieve this goal. One of these tools is machine learning, which facilitates the study of structure-activity relationships, secondary and tertiary structure evolution prediction, and sequence error correction. This work proposes a novel machine learning technique for the prediction of the possible point mutations that appear on alignments of primary RNA sequence structure. It predicts the genotype of each nucleotide in the RNA sequence, and proves that a nucleotide in an RNA sequence changes based on the other nucleotides in the sequence. A neural network technique is utilized to predict new strains, then a rough set theory based algorithm is introduced to extract these point mutation patterns. This algorithm is applied to a number of aligned RNA isolates time-series species of the Newcastle virus. Two different data sets from two sources are used in the validation of these techniques. The results show that the accuracy of this technique in predicting the nucleotides in the new generation is as high as 75%. The mutation rules are visualized for the analysis of the correlation between different nucleotides in the same RNA sequence.

The goal of this study is to evaluate, compare, and contrast several standard and new linkage analysis methods. First, we compare a recently proposed confidence set approach with MAPMAKER/SIBS. Then, we evaluate a new Bayesian approach that accounts for heterogeneity. Finally, the newly developed software SIMPLE is compared with GENEHUNTER. We apply these methods to several replicates of the Genetic Analysis Workshop 13 simulated data to assess their ability to detect the high blood pressure genes on chromosome 21, whose positions were known to us prior to the analyses. In contrast to the standard methods, most of the new approaches are able to identify at least one of the disease genes in all the replicates considered.

BACKGROUND: The detection of genomic copy number alterations (CNA) in cancer based on SNP arrays requires methods that take into account tumour-specific factors such as normal cell contamination and tumour heterogeneity. A number of tools have been recently developed but their performance has yet to be thoroughly assessed. To this end, a comprehensive model that integrates the factors of normal cell contamination and intra-tumour heterogeneity and that can be translated to synthetic data on which to perform benchmarks is indispensable. METHODS: We propose such a model and implement it in an R package called CnaGen to synthetically generate a wide range of alterations under different normal cell contamination levels. Six recently published methods for CNA and loss of heterozygosity (LOH) detection on tumour samples were assessed on these synthetic data and on a dilution series of a breast cancer cell-line: ASCAT, GAP, GenoCNA, GPHMM, MixHMM and OncoSNP. We report the recall rates in terms of normal cell contamination levels and alteration characteristics: length, copy number and LOH state, as well as the false discovery rate distribution for each copy number under different normal cell contamination levels. RESULTS: Assessed methods are in general better at detecting alterations with low copy number and under low normal cell contamination levels. All methods except GPHMM, which failed to recognize the alteration pattern in the cell-line samples, provided similar results for the synthetic and cell-line sample sets. MixHMM and GenoCNA are the poorest-performing methods, while GAP and ASCAT, the two segmentation-based methods, generally performed better. This supports the viability of approaches other than the common hidden Markov model (HMM)-based methods. CONCLUSIONS: We devised and implemented a comprehensive model to generate data that simulate tumoural samples genotyped using SNP arrays. The validity of the model is supported by the similarity of the results obtained with synthetic and real data. Based on these results and on the software implementation of the methods, we recommend GAP for advanced users, ASCAT for users of basic R and GPHMM for a fully driven analysis.

Oldfield TJ 《Proteins》2002,49(4):510-528
The protein databank contains a vast wealth of structural and functional information. The analysis of this macromolecular information has been the subject of considerable work in order to advance knowledge beyond the collection of molecular coordinates. This article presents a method that determines local structural information within proteins using mathematical data mining techniques. The mine program described returns many known configurations of residues such as the catalytic triad, metal binding sites and the N-linked glycosylation site; as well as many other multiple residue interactions not previously categorized. Because mathematical constructs are used as targets, this method can identify new information not previously known, and also provide unbiased results of typical structure and their expected deviations. Because the results are defined mathematically, they cannot indicate the biological implications of the results. Therefore two support programs are described that provide insight into the biological context for the mine results. The first allows a weighted RMSD search between a template set of coordinates and a list of PDB files, and the second allows the labeling of a protein with the template results from mining to aid in the classification of this protein.

