首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 531 毫秒
1.
Zhang J 《PloS one》2010,5(11):e13734
Identification of a small panel of population structure informative markers can reduce genotyping cost and is useful in various applications, such as ancestry inference in association mapping, forensics and evolutionary theory in population genetics. Traditional methods to ascertain ancestral informative markers usually require the prior knowledge of individual ancestry and have difficulty for admixed populations. Recently Principal Components Analysis (PCA) has been employed with success to select SNPs which are highly correlated with top significant principal components (PCs) without use of individual ancestral information. The approach is also applicable to admixed populations. Here we propose a novel approach based on our recent result on summarizing population structure by graph laplacian eigenfunctions, which differs from PCA in that it is geometric and robust to outliers. Our approach also takes advantage of the priori sparseness of informative markers in the genome. Through simulation of a ring population and the real global population sample HGDP of 650K SNPs genotyped in 940 unrelated individuals, we validate the proposed algorithm at selecting most informative markers, a small fraction of which can recover the similar underlying population structure efficiently. Employing a standard Support Vector Machine (SVM) to predict individuals' continental memberships on HGDP dataset of seven continents, we demonstrate that the selected SNPs by our method are more informative but less redundant than those selected by PCA. Our algorithm is a promising tool in genome-wide association studies and population genetics, facilitating the selection of structure informative markers, efficient detection of population substructure and ancestral inference.  相似文献   

2.
Population structure and kinship are widespread confounding factors in genome-wide association studies (GWAS). It has been standard practice to include principal components of the genotypes in a regression model in order to account for population structure. More recently, the linear mixed model (LMM) has emerged as a powerful method for simultaneously accounting for population structure and kinship. The statistical theory underlying the differences in empirical performance between modeling principal components as fixed versus random effects has not been thoroughly examined. We undertake an analysis to formalize the relationship between these widely used methods and elucidate the statistical properties of each. Moreover, we introduce a new statistic, effective degrees of freedom, that serves as a metric of model complexity and a novel low rank linear mixed model (LRLMM) to learn the dimensionality of the correction for population structure and kinship, and we assess its performance through simulations. A comparison of the results of LRLMM and a standard LMM analysis applied to GWAS data from the Multi-Ethnic Study of Atherosclerosis (MESA) illustrates how our theoretical results translate into empirical properties of the mixed model. Finally, the analysis demonstrates the ability of the LRLMM to substantially boost the strength of an association for HDL cholesterol in Europeans.  相似文献   

3.
Lee S  Wright FA  Zou F 《Biometrics》2011,67(3):967-974
In genome-wide association studies, population stratification is recognized as producing inflated type I error due to the inflation of test statistics. Principal component-based methods applied to genotypes provide information about population structure, and have been widely used to control for stratification. Here we explore the precise relationship between genotype principal components and inflation of association test statistics, thereby drawing a connection between principal component-based stratification control and the alternative approach of genomic control. Our results provide an inherent justification for the use of principal components, but call into question the popular practice of selecting principal components based on significance of eigenvalues alone. We propose a new approach, called EigenCorr, which selects principal components based on both their eigenvalues and their correlation with the (disease) phenotype. Our approach tends to select fewer principal components for stratification control than does testing of eigenvalues alone, providing substantial computational savings and improvements in power. Analyses of simulated and real data demonstrate the usefulness of the proposed approach.  相似文献   

4.
Association mapping in structured populations   总被引:43,自引:0,他引:43       下载免费PDF全文
The use, in association studies, of the forthcoming dense genomewide collection of single-nucleotide polymorphisms (SNPs) has been heralded as a potential breakthrough in the study of the genetic basis of common complex disorders. A serious problem with association mapping is that population structure can lead to spurious associations between a candidate marker and a phenotype. One common solution has been to abandon case-control studies in favor of family-based tests of association, such as the transmission/disequilibrium test (TDT), but this comes at a considerable cost in the need to collect DNA from close relatives of affected individuals. In this article we describe a novel, statistically valid, method for case-control association studies in structured populations. Our method uses a set of unlinked genetic markers to infer details of population structure, and to estimate the ancestry of sampled individuals, before using this information to test for associations within subpopulations. It provides power comparable with the TDT in many settings and may substantially outperform it if there are conflicting associations in different subpopulations.  相似文献   

5.
Over the past few years, new high-throughput DNA sequencing technologies have dramatically increased speed and reduced sequencing costs. However, the use of these sequencing technologies is often challenged by errors and biases associated with the bioinformatical methods used for analyzing the data. In particular, the use of naïve methods to identify polymorphic sites and infer genotypes can inflate downstream analyses. Recently, explicit modeling of genotype probability distributions has been proposed as a method for taking genotype call uncertainty into account. Based on this idea, we propose a novel method for quantifying population genetic differentiation from next-generation sequencing data. In addition, we present a strategy for investigating population structure via principal components analysis. Through extensive simulations, we compare the new method herein proposed to approaches based on genotype calling and demonstrate a marked improvement in estimation accuracy for a wide range of conditions. We apply the method to a large-scale genomic data set of domesticated and wild silkworms sequenced at low coverage. We find that we can infer the fine-scale genetic structure of the sampled individuals, suggesting that employing this new method is useful for investigating the genetic relationships of populations sampled at low coverage.  相似文献   

6.
A major aim of landscape genetics is to understand how landscapes resist gene flow and thereby influence population genetic structure. An empirical understanding of this process provides a wealth of information that can be used to guide conservation and management of species in fragmented landscapes and also to predict how landscape change may affect population viability. Statistical approaches to infer the true model among competing alternatives are based on the strength of the relationship between pairwise genetic distances and landscape distances among sampled individuals in a population. A variety of methods have been devised to quantify individual genetic distances, but no study has yet compared their relative performance when used for model selection in landscape genetics. In this study, we used population genetic simulations to assess the accuracy of 16 individual‐based genetic distance metrics under varying sample sizes and degree of population genetic structure. We found most metrics performed well when sample size and genetic structure was high. However, it was much more challenging to infer the true model when sample size and genetic structure was low. Under these conditions, we found genetic distance metrics based on principal components analysis were the most accurate (although several other metrics performed similarly), but only when they were derived from multiple principal components axes (the optimal number varied depending on the degree of population genetic structure). Our results provide guidance for which genetic distance metrics maximize model selection accuracy and thereby better inform conservation and management decisions based upon landscape genetic analysis.  相似文献   

7.
Although inversions have occasionally been found to be associated with disease susceptibility through interrupting a gene or its regulatory region, or by increasing the risk for deleterious secondary rearrangements, no association study has been specifically conducted for risks associated with inversions, mainly because existing approaches to detecting and genotyping inversions do not readily scale to a large number of samples. Based on our recently proposed approach to identifying and genotyping inversions using principal components analysis (PCA), we herein develop a method of detecting association between inversions and disease in a genome-wide fashion. Our method uses genotype data for single nucleotide polymorphisms (SNPs), and is thus cost-efficient and computationally fast. For an inversion polymorphism, local PCA around the inversion region is performed to infer the inversion genotypes of all samples. For many inversions, we found that some of the SNPs inside an inversion region are fixed in the two lineages of different orientations and thus can serve as surrogate markers. Our method can be applied to case–control and quantitative trait association studies to identify inversions that may interrupt a gene or the connection between a gene and its regulatory agents. Our method also offers a new venue to identify inversions that are responsible for disease-causing secondary rearrangements. We illustrated our proposed approach to case–control data for psoriasis and identified novel associations with a few inversion polymorphisms.  相似文献   

8.
The brain is one of the most studied and highly complex systems in the biological world. While much research has concentrated on studying the brain directly, our focus is the structure of the brain itself: at its core an interconnected network of nodes (neurons). A better understanding of the structural connectivity of the brain should elucidate some of its functional properties. In this paper we analyze the connectome of the nematode Caenorhabditis elegans. Consisting of only 302 neurons, it is one of the better-understood neural networks. Using a Laplacian Matrix of the 279-neuron “giant component” of the network, we use an eigenvalue counting function to look for fractal-like self similarity. This matrix representation is also used to plot visualizations of the neural network in eigenfunction coordinates. Small-world properties of the system are examined, including average path length and clustering coefficient. We test for localization of eigenfunctions, using graph energy and spacial variance on these functions. To better understand results, all calculations are also performed on random networks, branching trees, and known fractals, as well as fractals which have been “rewired” to have small-world properties. We propose algorithms for generating Laplacian matrices of each of these graphs.  相似文献   

9.
There are many instances in genetics in which we wish to determine whether two candidate populations are distinguishable on the basis of their genetic structure. Examples include populations which are geographically separated, case-control studies and quality control (when participants in a study have been genotyped at different laboratories). This latter application is of particular importance in the era of large scale genome wide association studies, when collections of individuals genotyped at different locations are being merged to provide increased power. The traditional method for detecting structure within a population is some form of exploratory technique such as principal components analysis. Such methods, which do not utilise our prior knowledge of the membership of the candidate populations. are termed unsupervised. Supervised methods, on the other hand are able to utilise this prior knowledge when it is available.In this paper we demonstrate that in such cases modern supervised approaches are a more appropriate tool for detecting genetic differences between populations. We apply two such methods, (neural networks and support vector machines) to the classification of three populations (two from Scotland and one from Bulgaria). The sensitivity exhibited by both these methods is considerably higher than that attained by principal components analysis and in fact comfortably exceeds a recently conjectured theoretical limit on the sensitivity of unsupervised methods. In particular, our methods can distinguish between the two Scottish populations, where principal components analysis cannot. We suggest, on the basis of our results that a supervised learning approach should be the method of choice when classifying individuals into pre-defined populations, particularly in quality control for large scale genome wide association studies.  相似文献   

10.
The purpose of many microarray studies is to find the association between gene expression and sample characteristics such as treatment type or sample phenotype. There has been a surge of efforts developing different methods for delineating the association. Aside from the high dimensionality of microarray data, one well recognized challenge is the fact that genes could be complicatedly inter-related, thus making many statistical methods inappropriate to use directly on the expression data. Multivariate methods such as principal component analysis (PCA) and clustering are often used as a part of the effort to capture the gene correlation, and the derived components or clusters are used to describe the association between gene expression and sample phenotype. We propose a method for patient population dichotomization using maximally selected test statistics in combination with the PCA method, which shows favorable results. The proposed method is compared with a currently well-recognized method.  相似文献   

11.
Genome-wide association study (GWAS) has become an obvious general approach for studying traits of agricultural importance in higher plants, especially crops. Here, we present a GWAS of 32 morphologic and 10 agronomic traits in a collection of 615 barley cultivars genotyped by genome-wide polymorphisms from a recently developed barley oligonucleotide pool assay. Strong population structure effect related to mixed sampling based on seasonal growth habit and ear row number is present in this barley collection. Comparison of seven statistical approaches in a genome-wide scan for significant associations with or without correction for confounding by population structure, revealed that in reducing false positive rates while maintaining statistical power, a mixed linear model solution outperforms genomic control, structured association, stepwise regression control and principal components adjustment. The present study reports significant associations for sixteen morphologic and nine agronomic traits and demonstrates the power and feasibility of applying GWAS to explore complex traits in highly structured plant samples.  相似文献   

12.
基于Fiedler向量的基因表达谱数据分类方法   总被引:1,自引:0,他引:1  
尝试将一种基于图的Fiedler向量的聚类算法引入到基因表达谱数据的肿瘤分类中来。该方法将分属不同类的所有样本通过高斯权构造Laplace完全图,经SVD分解后获得Fiedler向量,利用各样本所对应的Fiedler向量分量的符号差异来进行基因表达谱数据的分类。通过模拟数据仿真实验和对白血病两个亚型(ALL与AML)及结肠癌真实数据实验,证明了这一方法的有效性。  相似文献   

13.
This paper presents a novel method to detect side-chain clusters in protein three-dimensional structures using a graph spectral approach. Protein side-chain interactions are represented by a labeled graph in which the nodes of the graph represent the Cbeta atoms and the edges represent the distance between the Cbeta atoms. The distance information and the non-bonded connectivity of the residues are represented in the form of a matrix called the Laplacian matrix. The constructed matrix is diagonalized and clustering information is obtained from the vector components associated with the second lowest eigenvalue and cluster centers are obtained from the vector components associated with the top eigenvalues. The method uses global information for clustering and a single numeric computation is required to detect clusters of interest. The approach has been adopted here to detect a variety of side-chain clusters and identify the residue which makes the largest number of interactions among the residues forming the cluster (cluster centers). Detecting such clusters and cluster centers are important from a protein structure and folding point of view. The crucial residues which are important in the folding pathway as determined by PhiF values (which is a measure of the effect of a mutation on the stability of the transition state of folding) as obtained from protein engineering methods, can be identified from the vector components corresponding to the top eigenvalues. Expanded clusters are detected near the active and binding site of the protein, supporting the nucleation condensation hypothesis for folding. The method is also shown to detect domains in protein structures and conserved side-chain clusters in topologically similar proteins.  相似文献   

14.
Inferring the structure of populations has many applications for genetic research. In addition to providing information for evolutionary studies, it can be used to account for the bias induced by population stratification in association studies. To this end, many algorithms have been proposed to cluster individuals into genetically homogeneous sub-populations. The parametric algorithms, such as Structure, are very popular but their underlying complexity and their high computational cost led to the development of faster parametric alternatives such as Admixture. Alternatives to these methods are the non-parametric approaches. Among this category, AWclust has proven efficient but fails to properly identify population structure for complex datasets. We present in this article a new clustering algorithm called Spectral Hierarchical clustering for the Inference of Population Structure (SHIPS), based on a divisive hierarchical clustering strategy, allowing a progressive investigation of population structure. This method takes genetic data as input to cluster individuals into homogeneous sub-populations and with the use of the gap statistic estimates the optimal number of such sub-populations. SHIPS was applied to a set of simulated discrete and admixed datasets and to real SNP datasets, that are data from the HapMap and Pan-Asian SNP consortium. The programs Structure, Admixture, AWclust and PCAclust were also investigated in a comparison study. SHIPS and the parametric approach Structure were the most accurate when applied to simulated datasets both in terms of individual assignments and estimation of the correct number of clusters. The analysis of the results on the real datasets highlighted that the clusterings of SHIPS were the more consistent with the population labels or those produced by the Admixture program. The performances of SHIPS when applied to SNP data, along with its relatively low computational cost and its ease of use make this method a promising solution to infer fine-scale genetic patterns.  相似文献   

15.
Large-scale association studies are being undertaken with the hope of uncovering the genetic determinants of complex disease. We describe a computationally efficient method for inferring genealogies from population genotype data and show how these genealogies can be used to fine map disease loci and interpret association signals. These genealogies take the form of the ancestral recombination graph (ARG). The ARG defines a genealogical tree for each locus, and, as one moves along the chromosome, the topologies of consecutive trees shift according to the impact of historical recombination events. There are two stages to our analysis. First, we infer plausible ARGs, using a heuristic algorithm, which can handle unphased and missing data and is fast enough to be applied to large-scale studies. Second, we test the genealogical tree at each locus for a clustering of the disease cases beneath a branch, suggesting that a causative mutation occurred on that branch. Since the true ARG is unknown, we average this analysis over an ensemble of inferred ARGs. We have characterized the performance of our method across a wide range of simulated disease models. Compared with simpler tests, our method gives increased accuracy in positioning untyped causative loci and can also be used to estimate the frequencies of untyped causative alleles. We have applied our method to Ueda et al.'s association study of CTLA4 and Graves disease, showing how it can be used to dissect the association signal, giving potentially interesting results of allelic heterogeneity and interaction. Similar approaches analyzing an ensemble of ARGs inferred using our method may be applicable to many other problems of inference from population genotype data.  相似文献   

16.
We study nonlinear electrical oscillator networks, the smallest example of which consists of a voltage-dependent capacitor, an inductor, and a resistor driven by a pure tone source. By allowing the network topology to be that of any connected graph, such circuits generalize spatially discrete nonlinear transmission lines/lattices that have proven useful in high-frequency analog devices. For such networks, we develop two algorithms to compute the steady-state response when a subset of nodes are driven at the same fixed frequency. The algorithms we devise are orders of magnitude more accurate and efficient than stepping towards the steady-state using a standard numerical integrator. We seek to enhance a given network''s nonlinear behavior by altering the eigenvalues of the graph Laplacian, i.e., the resonances of the linearized system. We develop a Newton-type method that solves for the network inductances such that the graph Laplacian achieves a desired set of eigenvalues; this method enables one to move the eigenvalues while keeping the network topology fixed. Running numerical experiments using three different random graph models, we show that shrinking the gap between the graph Laplacian''s first two eigenvalues dramatically improves a network''s ability to (i) transfer energy to higher harmonics, and (ii) generate large-amplitude signals. Our results shed light on the relationship between a network''s structure, encoded by the graph Laplacian, and its function, defined in this case by the presence of strongly nonlinear effects in the frequency response.  相似文献   

17.

Background  

Understanding the mechanisms that control species genetic structure has always been a major objective in evolutionary studies. The association between genetic structure and species attributes has received special attention. As species attributes are highly taxonomically constrained, phylogenetically controlled methods are necessary to infer causal relationships. In plants, a previous study controlling for phylogenetic signal has demonstrated that Wright's F ST, a measure of genetic differentiation among populations, is best predicted by the mating system (outcrossing, mixed-mating or selfing) and that plant traits such as perenniality and growth form have only an indirect influence on F ST via their association with the mating system. The objective of this study is to further outline the determinants of plant genetic structure by distinguishing the effects of mating system on gene flow and on genetic drift. The association of biparental inbreeding and inbreeding depression with population genetic structure, mating system and plant traits are also investigated.  相似文献   

18.
The Eigenstrat method, based on principal components analysis (PCA), is commonly used both to quantify population relationships in population genetics and to correct for population stratification in genome-wide association studies. However, it can be difficult to make appropriate inference about population relationships from the principal component (PC) scatter plot. Here, to better understand the working mechanism of the Eigenstrat method, we consider its theoretical or “population” formulation. The eigen-equation for samples from an arbitrary number () of populations is reduced to that of a matrix of dimension , the elements of which are determined by the variance-covariance matrix for the random vector of the allele frequencies. Solving the reduced eigen-equation is numerically trivial and yields eigenvectors that are the axes of variation required for differentiating the populations. Using the reduced eigen-equation, we investigate the within-population fluctuations around the axes of variation on the PC scatter plot for simulated datasets. Specifically, we show that there exists an asymptotically stable pattern of the PC plot for large sample size. Our results provide theoretical guidance for interpreting the pattern of PC plot in terms of population relationships. For applications in genetic association tests, we demonstrate that, as a method of correcting for population stratification, regressing out the theoretical PCs corresponding to the axes of variation is equivalent to simply removing the population mean of allele counts and works as well as or better than the Eigenstrat method.  相似文献   

19.
In recent years it has emerged that structural variants have a substantial impact on genomic variation. Inversion polymorphisms represent a significant class of structural variant, and despite the challenges in their detection, data on inversions in the human genome are increasing rapidly. Statistical methods for inferring parameters such as the recombination rate and the selection coefficient have generally been developed without accounting for the presence of inversions. Here we exploit new software for simulating inversions in population genetic data, invertFREGENE, to assess the potential impact of inversions on such methods. Using data simulated by invertFREGENE, as well as real data from several sources, we test whether large inversions have a disruptive effect on widely applied population genetics methods for inferring recombination rates, for detecting selection, and for controlling for population structure in genome-wide association studies (GWAS). We find that recombination rates estimated by LDhat are biased downward at inversion loci relative to the true contemporary recombination rates at the loci but that recombination hotspots are not falsely inferred at inversion breakpoints as may have been expected. We find that the integrated haplotype score (iHS) method for detecting selection appears robust to the presence of inversions. Finally, we observe a strong bias in the genome-wide results of principal components analysis (PCA), used to control for population structure in GWAS, in the presence of even a single large inversion, confirming the necessity to thin SNPs by linkage disequilibrium at large physical distances to obtain unbiased results.  相似文献   

20.
Our cognition relies on the ability of the brain to segment hierarchically structured events on multiple scales. Recent evidence suggests that the brain performs this event segmentation based on the structure of state-transition graphs behind sequential experiences. However, the underlying circuit mechanisms are poorly understood. In this paper we propose an extended attractor network model for graph-based hierarchical computation which we call the Laplacian associative memory. This model generates multiscale representations for communities (clusters) of associative links between memory items, and the scale is regulated by the heterogenous modulation of inhibitory circuits. We analytically and numerically show that these representations correspond to graph Laplacian eigenvectors, a popular method for graph segmentation and dimensionality reduction. Finally, we demonstrate that our model exhibits chunked sequential activity patterns resembling hippocampal theta sequences. Our model connects graph theory and attractor dynamics to provide a biologically plausible mechanism for abstraction in the brain.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号