1.

Ancestral informative marker selection and population structure visualization using sparse Laplacian eigenfunctions

Zhang J 《PloS one》2010,5(11):e13734

Identification of a small panel of population structure informative markers can reduce genotyping cost and is useful in various applications, such as ancestry inference in association mapping, forensics and evolutionary theory in population genetics. Traditional methods to ascertain ancestral informative markers usually require prior knowledge of individual ancestry and have difficulty with admixed populations. Recently Principal Components Analysis (PCA) has been employed with success to select SNPs which are highly correlated with top significant principal components (PCs) without use of individual ancestral information. The approach is also applicable to admixed populations. Here we propose a novel approach based on our recent result on summarizing population structure by graph Laplacian eigenfunctions, which differs from PCA in that it is geometric and robust to outliers. Our approach also takes advantage of the a priori sparseness of informative markers in the genome. Through simulation of a ring population and the real global population sample HGDP of 650K SNPs genotyped in 940 unrelated individuals, we validate the proposed algorithm at selecting the most informative markers, a small fraction of which can recover similar underlying population structure efficiently. Employing a standard Support Vector Machine (SVM) to predict individuals' continental memberships on the HGDP dataset of seven continents, we demonstrate that the SNPs selected by our method are more informative but less redundant than those selected by PCA. Our algorithm is a promising tool in genome-wide association studies and population genetics, facilitating the selection of structure informative markers, efficient detection of population substructure and ancestral inference.

2.

A genome‐wide association study indicates LCORL/NCAPG as a candidate locus for withers height in German Warmblood horses

J. Tetens P. Widmann C. Kühn G. Thaller 《Animal genetics》2013,44(4):467-471

A genome‐wide association scan for loci affecting withers height was conducted in 782 German Warmblood stallions, which were genotyped using the Illumina EquineSNP50 Bead Chip. A principal components approach was applied to correct for population structure. The analysis revealed a single major QTL on ECA3 explaining ~18 per cent of the phenotypic variance, which is in concordance with recent reports from other horse populations. The LCORL/NCAPG locus represents a strong candidate gene for this QTL. This locus is among a small number that have consistently been identified to influence human height in several large meta‐analyses. Furthermore, a mutation within the NCAPG gene was found to affect growth and body frame size in cattle. Together with the results of this study in German Warmbloods, these findings strongly indicate LCORL/NCAPG as a candidate locus for withers height in horses. Further studies are, however, needed to confirm this.

3.

Analysis and application of European genetic substructure using 300 K SNP information

Tian C Plenge RM Ransom M Lee A Villoslada P Selmi C Klareskog L Pulver AE Qi L Gregersen PK Seldin MF 《PLoS genetics》2008,4(1):e4

European population genetic substructure was examined in a diverse set of >1,000 individuals of European descent, each genotyped with >300 K SNPs. Both STRUCTURE and principal component analyses (PCA) showed the largest division/principal component (PC) differentiated northern from southern European ancestry. A second PC further separated Italian, Spanish, and Greek individuals from those of Ashkenazi Jewish ancestry as well as distinguishing among northern European populations. In separate analyses of northern European participants other substructure relationships were discerned showing a west to east gradient. Application of this substructure information was critical in examining a real dataset in whole genome association (WGA) analyses for rheumatoid arthritis in European Americans to reduce false positive signals. In addition, two sets of European substructure ancestry informative markers (ESAIMs) were identified that provide substantial substructure information. The results provide further insight into European population genetic substructure and show that this information can be used for improving error rates in association testing of candidate genes and in replication studies of WGA scans.

4.

LANDSCAPE GENOMICS IN ATLANTIC SALMON (SALMO SALAR): SEARCHING FOR GENE–ENVIRONMENT INTERACTIONS DRIVING LOCAL ADAPTATION

Bourret Vincent Mélanie Dionne Matthew P. Kent Sigbjørn Lien Louis Bernatchez 《Evolution; international journal of organic evolution》2013,67(12):3469-3487

A growing number of studies are examining the factors driving historical and contemporary evolution in wild populations. By combining surveys of genomic variation with a comprehensive assessment of environmental parameters, such studies can increase our understanding of the genomic and geographical extent of local adaptation in wild populations. We used a large‐scale landscape genomics approach to examine adaptive and neutral differentiation across 54 North American populations of Atlantic salmon representing seven previously defined genetically distinct regional groups. Over 5500 genome‐wide single nucleotide polymorphisms were genotyped in 641 individuals and 28 bulk assays of 25 pooled individuals each. Genome scans, linkage map, and 49 environmental variables were combined to conduct an innovative landscape genomic analysis. Our results provide valuable insight into the links between environmental variation and both neutral and potentially adaptive genetic divergence. In particular, we identified markers potentially under divergent selection, as well as associated selective environmental factors and biological functions with the observed adaptive divergence. Multivariate landscape genetic analysis revealed strong associations of both genetic and environmental structures. We found an enrichment of growth‐related functions among outlier markers. Climate (temperature–precipitation) and geological characteristics were significantly associated with both potentially adaptive and neutral genetic divergence and should be considered as candidate loci involved in adaptation at the regional scale in Atlantic salmon. Hence, this study significantly contributes to the improvement of tools used in modern conservation and management schemes of Atlantic salmon wild populations.

5.

Genetic Structure of Europeans: A View from the North–East

《PloS one》2009,4(5)

Using principal component (PC) analysis, we studied the genetic constitution of 3,112 individuals from Europe as portrayed by more than 270,000 single nucleotide polymorphisms (SNPs) genotyped with the Illumina Infinium platform. In cohorts where the sample size was >100, one hundred randomly chosen samples were used for analysis to minimize the sample size effect, resulting in a total of 1,564 samples. This analysis revealed that the genetic structure of the European population correlates closely with geography. The first two PCs highlight the genetic diversity corresponding to the northwest to southeast gradient and position the populations according to their approximate geographic origin. The resulting genetic map forms a triangular structure with a) Finland, b) the Baltic region, Poland and Western Russia, and c) Italy as its vertexes, and with d) Central- and Western Europe in its centre. Inter- and intra- population genetic differences were quantified by the inflation factor lambda (λ) (ranging from 1.00 to 4.21), fixation index (F_st) (ranging from 0.000 to 0.023), and by the number of markers exhibiting significant allele frequency differences in pair-wise population comparisons. The estimated lambda was used to assess the real diminishing impact to association statistics when two distinct populations are merged directly in an analysis. When the PC analysis was confined to the 1,019 Estonian individuals (0.1% of the Estonian population), a fine structure emerged that correlated with the geography of individual counties. With at least two cohorts available from several countries, genetic substructures were investigated in Czech, Finnish, German, Estonian and Italian populations. Together with previously published data, our results allow the creation of a comprehensive European genetic map that will greatly facilitate inter-population genetic studies including genome wide association studies (GWAS).
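The inflation factor λ mentioned in this abstract is conventionally estimated as the median of the observed 1-df association test statistics divided by the median expected under the null (about 0.455 for a chi-square distribution with one degree of freedom). The following is a minimal sketch of that conventional estimator, not the authors' code; the function name and the simulated statistics are illustrative assumptions.

```python
import numpy as np

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def inflation_factor(chi2_stats):
    """Return lambda: observed median statistic / expected null median."""
    chi2_stats = np.asarray(chi2_stats, dtype=float)
    return float(np.median(chi2_stats) / CHI2_1DF_MEDIAN)


# Simulated example (hypothetical data): squared standard normals follow
# a chi-square(1) distribution, i.e. the null with no stratification.
rng = np.random.default_rng(0)
null_stats = rng.standard_normal(100_000) ** 2
# Uniformly scaled statistics mimic a sample with population stratification.
inflated_stats = 1.5 * null_stats

lam_null = inflation_factor(null_stats)
lam_inflated = inflation_factor(inflated_stats)
print(round(lam_null, 2), round(lam_inflated, 2))
```

A value near 1 suggests little inflation, while values well above 1 (the abstract reports up to 4.21 for merged populations) indicate that association statistics are inflated by structure.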

6.

A genome‐wide significant association on chromosome 2 for footrot resistance/susceptibility in Swiss White Alpine sheep

A. Niggeler J. Tetens A. Stäuble A. Steiner C. Drögemüller 《Animal genetics》2017,48(6):712-715

Footrot is one of the most important causes of lameness in global sheep populations and is characterized by a bacterial infection of the interdigital skin. As a multifactorial disease, its clinical representation depends not only on pathogen factors and environmental components but also on the individual resistance/susceptibility of the host. A genetic component has been shown in previous studies; however, so far no causative genetic variant influencing the risk of developing footrot has been identified. In this study, we genotyped 373 Swiss White Alpine sheep, using the ovine high‐density 600k SNP chip, in order to run a DNA‐based comparison of individuals with known clinical footrot status. We performed a case–control genome‐wide association study, which revealed a genome‐wide significant association for SNP rs418747104 on ovine chromosome 2 at 81.2 Mb. The three best associated SNP markers were located at the MPDZ gene, which codes for the multiple PDZ domain crumbs cell polarity complex component protein, also known as multi‐PDZ domain protein 1 (MUPP1). This protein is possibly involved in maintaining the barrier function and integrity of tight junctions. Therefore, we speculate that individuals carrying MPDZ variants may differ in their footrot resistance/susceptibility due to modified horn and interdigital skin integrity. In conclusion, our study reveals that MPDZ might represent a functional candidate gene, and further research is needed to explore its role in footrot affected sheep.

7.

Substantial differences in bias between single‐digest and double‐digest RAD‐seq libraries: A case study

Sarah P. Flanagan Adam G. Jones 《Molecular ecology resources》2018,18(2):264-280

The trade‐offs of using single‐digest vs. double‐digest restriction site‐associated DNA sequencing (RAD‐seq) protocols have been widely discussed. However, no direct empirical comparisons of the two methods have been conducted. Here, we sampled a single population of Gulf pipefish (Syngnathus scovelli) and genotyped 444 individuals using RAD‐seq. Sixty individuals were subjected to single‐digest RAD‐seq (sdRAD‐seq), and the remaining 384 individuals were genotyped using a double‐digest RAD‐seq (ddRAD‐seq) protocol. We analysed the resulting Illumina sequencing data and compared the two genotyping methods when reads were analysed either together or separately. Coverage statistics, observed heterozygosity, and allele frequencies differed significantly between the two protocols, as did the results of selection components analysis. We also performed an in silico digestion of the Gulf pipefish genome and modelled five major sources of bias: PCR duplicates, polymorphic restriction sites, shearing bias, asymmetric sampling (i.e., genotyping fewer individuals with sdRAD‐seq than with ddRAD‐seq) and higher major allele frequencies. This combination of approaches allowed us to determine that polymorphic restriction sites, an asymmetric sampling scheme, mean allele frequencies and to some extent PCR duplicates all contribute to different estimates of allele frequencies between samples genotyped using sdRAD‐seq versus ddRAD‐seq. Our finding that sdRAD‐seq and ddRAD‐seq can result in different allele frequencies has implications for comparisons across studies and techniques that endeavour to identify genomewide signatures of evolutionary processes in natural populations.

8.

In search of polymorphic Alu insertions with restricted geographic distributions

Cordaux R Srikanta D Lee J Stoneking M Batzer MA 《Genomics》2007,90(1):154-158

Alu elements are transposable elements that have reached over one million copies in the human genome. Some Alu elements inserted in the genome so recently that they are still polymorphic for insertion presence or absence in human populations. Recently, there has been an increasing interest in using Alu variation for studies of human population genetic structure and inference of individual geographic origin. Currently, this requires a high number of Alu loci. Here, we used a linker-mediated polymerase chain reaction method to preferentially identify low-frequency Alu elements in various human DNA samples with different geographic origins. The candidate Alu loci were subsequently genotyped in 18 worldwide human populations (approximately 370 individuals), resulting in the identification of two new Alu insertions restricted to populations of African ancestry. Our results suggest that it may ultimately become possible to correctly infer the geographic affiliation of unknown samples with high levels of confidence without having to genotype as many as 100 Alu loci. This is desirable if Alu insertion polymorphisms are to be used for human evolution studies or forensic applications.

9.

PPC: an algorithm for accurate estimation of SNP allele frequencies in small equimolar pools of DNA using data from high density microarrays (total citations: 2; self-citations: 1; citations by others: 1)

Brohede J Dunne R McKay JD Hannan GN 《Nucleic acids research》2005,33(17):e142

Robust estimation of allele frequencies in pools of DNA has the potential to reduce genotyping costs and/or increase the number of individuals contributing to a study where hundreds of thousands of genetic markers need to be genotyped in very large population sample sets, such as genome wide association studies. In order to make accurate allele frequency estimations from pooled samples, a correction for unequal allele representation must be applied. We have developed the polynomial based probe specific correction (PPC), which is a novel correction algorithm for accurate estimation of allele frequencies in data from high-density microarrays. This algorithm was validated through comparison of allele frequencies from a set of 10 individually genotyped DNAs and frequencies estimated from pools of these 10 DNAs using GeneChip 10K Mapping Xba 131 arrays. Our results demonstrate that when using the PPC to correct for allelic biases the accuracy of the allele frequency estimates increases dramatically.

10.

Enhanced Localization of Genetic Samples through Linkage-Disequilibrium Correction

Yael Baran Inés Quintela Ángel Carracedo Bogdan Pasaniuc Eran Halperin 《American journal of human genetics》2013,92(6):882-894

Characterizing the spatial patterns of genetic diversity in human populations has a wide range of applications, from detecting genetic mutations associated with disease to inferring human history. Current approaches, including the widely used principal-component analysis, are not suited for the analysis of linked markers, and local and long-range linkage disequilibrium (LD) can dramatically reduce the accuracy of spatial localization when unaccounted for. To overcome this, we have introduced an approach that performs spatial localization of individuals on the basis of their genetic data and explicitly models LD among markers by using a multivariate normal distribution. By leveraging external reference panels, we derive closed-form solutions to the optimization procedure to achieve a computationally efficient method that can handle large data sets. We validate the method on empirical data from a large sample of European individuals from the POPRES data set, as well as on a large sample of individuals of Spanish ancestry. First, we show that by modeling LD, we achieve accuracy superior to that of existing methods. Importantly, whereas other methods show decreased performance when dense marker panels are used in the inference, our approach improves in accuracy as more markers become available. Second, we show that accurate localization of genetic data can be achieved with only a part of the genome, and this could potentially enable the spatial localization of admixed samples that have a fraction of their genome originating from a given continent. Finally, we demonstrate that our approach is resistant to distortions resulting from long-range LD regions; such distortions can dramatically bias the results when unaccounted for.

11.

Association Tests for a Censored Quantitative Trait and Candidate Genes in Structured Populations with Multilevel Genetic Relatedness

Meijuan Li Cavan Reilly Tim Hanson 《Biometrics》2010,66(3):925-933

Summary: Several statistical methods for detecting associations between quantitative traits and candidate genes in structured populations have been developed for fully observed phenotypes. However, many experiments are concerned with failure‐time phenotypes, which are usually subject to censoring. In this article, we propose statistical methods for detecting associations between a censored quantitative trait and candidate genes in structured populations with complex multiple levels of genetic relatedness among sampled individuals. The proposed methods correct for continuous population stratification using both population structure variables as covariates and the frailty terms attributable to kinship. The relationship between the time‐at‐onset data and genotypic scores at a candidate marker is modeled via a parametric Weibull frailty accelerated failure time (AFT) model as well as a semiparametric frailty AFT model, where the baseline survival function is flexibly modeled as a mixture of Polya trees centered around a family of Weibull distributions. For both parametric and semiparametric models, the frailties are modeled via an intrinsic Gaussian conditional autoregressive prior distribution with the kinship matrix being the adjacency matrix connecting subjects. Simulation studies and applications to the Arabidopsis thaliana line flowering time data sets demonstrated the advantage of the new proposals over existing approaches.

12.

Are shed hair genomes the most effective noninvasive resource for estimating relationships in the wild?

Anubhab Khan Kaushalkumar Patel Subhadeep Bhattacharjee Sudarshan Sharma Anup N. Chugani Karthikeyan Sivaraman Vinayak Hosawad Yogesh Kumar Sahu Goddilla V. Reddy Uma Ramakrishnan 《Ecology and evolution》2020,10(11):4583-4594

Knowledge of relationships in wild populations is critical for better understanding mating systems and inbreeding scenarios to inform conservation strategies for endangered species. To delineate pedigrees in wild populations, study genetic connectivity, study genotype‐phenotype associations, trace individuals, or track wildlife trade, many identified individuals need to be genotyped at thousands of loci, mostly from noninvasive samples. This requires us to (a) identify the most common noninvasive sample available from identified individuals, (b) assess the ability to acquire genome‐wide data from such samples, and (c) evaluate the quality of such genome‐wide data, and its ability to reconstruct relationships between animals within a population.
We followed identified individuals from a wild endangered tiger population and found that shed hair samples were the most common compared to scat samples, opportunistically found carcasses, and opportunistic invasive samples. We extracted DNA from these samples, prepared whole genome sequencing libraries, and sequenced genomes from these.
Whole genome sequencing methods resulted in between 25%–98% of the genome sequenced for five such samples. Exploratory population genetic analyses revealed that these data were free of holistic biases and could recover expected population structure and relatedness. Mitochondrial genomes recovered matrilineages in accordance with long‐term monitoring data. Even with just five samples, we were able to uncover the matrilineage for three individuals with unknown ancestry.
In summary, we demonstrated that noninvasive shed hair samples yield adequate quality and quantity of DNA in conjunction with sensitive library preparation methods, and provide reliable data from hundreds of thousands of SNPs across the genome. This makes shed hair an ideal noninvasive resource for studying individual‐based genetics of elusive endangered species in the wild.

13.

Analytical correction for multiple testing in admixture mapping

Sha Q Zhang X Zhu X Zhang S 《Human heredity》2006,62(2):55-63

Admixture mapping, using unrelated individuals from the admixture populations that result from recent mating between members of each parental population, is an efficient approach to localize disease-causing variants that differ in frequency between two or more historically separated populations. Recently, several methods have been proposed to test linkage between a susceptibility gene and a disease locus by using admixture-generated linkage disequilibrium (LD) for each of the genotyped markers. In a genome scan, admixture mapping usually tests 2,000 to 3,000 markers across the genome. Currently, either a very conservative Sidak (or Bonferroni) correction or a very time consuming simulation-based method is used to correct for the multiple tests and evaluate the overall p value. In this report, we propose a computationally efficient analytical approach for correction of the multiple tests and for calculating the overall p value for an admixture genome scan. Except for the Sidak (or Bonferroni) correction, our proposed method is the first analytical approach for correction of the multiple tests and for calculating the overall p value for a genome scan. Our simulation studies show that the proposed method gives correct overall type I error rates for genome scans in all cases, and is much more computationally efficient than simulation-based methods.
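For context, the two baseline corrections named in this abstract can be written down directly. The sketch below shows only the standard Bonferroni and Sidak formulas for the 2,000 to 3,000 markers the abstract mentions; it does not implement the authors' analytical method, and the function names and the 0.05 significance level are illustrative assumptions.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level under the Bonferroni correction."""
    return alpha / n_tests


def sidak_threshold(alpha, n_tests):
    """Per-test significance level under the Sidak correction.

    Exact for independent tests; conservative when tests are
    positively correlated, as linked markers in a genome scan are.
    """
    return 1.0 - (1.0 - alpha) ** (1.0 / n_tests)


def sidak_overall_p(min_p, n_tests):
    """Overall (scan-wide) p value implied by the smallest per-test p value."""
    return 1.0 - (1.0 - min_p) ** n_tests


# Per-marker thresholds for scans of the size quoted in the abstract.
for n_markers in (2000, 3000):
    print(n_markers,
          bonferroni_threshold(0.05, n_markers),
          sidak_threshold(0.05, n_markers))
```

Both corrections treat the markers as independent tests, which is why they are conservative for the correlated markers in an admixture scan.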

14.

Detection of linkage between quantitative trait loci and restriction fragment length polymorphisms using inbred lines (total citations: 5; self-citations: 0; citations by others: 5)

S. P. Simpson 《TAG. Theoretical and applied genetics. Theoretische und angewandte Genetik》1989,77(6):815-819

Summary: In segregating populations, large numbers of individuals are needed to detect linkage between markers, such as restriction fragment length polymorphisms (RFLPs), and quantitative trait loci (QTL), limiting the potential use of such markers for detecting linkage. Fewer individuals from inbred lines are needed to detect linkage. Simulation data were used to test the utility of two methods to detect linkage: maximum likelihood and comparison of marker genotype means. When there is tight linkage, the two methods have similar power, but when there is loose linkage, maximum likelihood is much more powerful. Once inbred lines have been established, they can be screened rapidly to detect QTL for several traits simultaneously. If there is sufficient coverage of the genome with RFLPs, several QTL for each trait may be detected.

15.

Development of two multiplex mini-sequencing panels of ancestry informative SNPs for studies in Latin Americans: an application to populations of the State of Minas Gerais (Brazil)

Silva MC Zuccherato LW Soares-Souza GB Vieira ZM Cabrera L Herrera P Balqui J Romero C Jahuira H Gilman RH Martins ML Tarazona-Santos E 《Genetics and molecular research : GMR》2010,9(4):2069-2085

Admixture occurs when individuals from parental populations that have been isolated for hundreds of generations form a new hybrid population. Currently, interest in measuring biogeographic ancestry has spread from anthropology to forensic sciences, direct-to-consumer personal genomics, and civil rights issues of minorities, and it is critical for genetic epidemiology studies of admixed populations. Markers with highly differentiated frequencies among human populations are informative of ancestry and are called ancestry informative markers (AIMs). For tri-hybrid Latin American populations, ancestry information is required for Africans, Europeans and Native Americans. We developed two multiplex panels of AIMs (for 14 SNPs) to be genotyped by two mini-sequencing reactions, suitable for investigators of medium-small laboratories to estimate admixture of Latin American populations. We tested the performance of these AIMs by comparing results obtained with our 14 AIMs with those obtained using 108 AIMs genotyped in the same individuals, for which DNA samples are available for other investigators. We emphasize that this type of comparison should be made when new admixture/population structure panels are developed. At the population level, our 14 AIMs were useful to estimate European admixture, though they overestimated African admixture and underestimated Native American admixture. Combined with more AIMs, our panel could be used to infer individual admixture. We used our panel to infer the pattern of admixture in two urban populations (Montes Claros and Manhuaçu) of the State of Minas Gerais (southeastern Brazil), obtaining a snapshot of their genetic structure in the context of their demographic history.

16.

A Robust Test for Two‐Stage Design in Genome‐Wide Association Studies

Minjung Kwak Jungnam Joo Gang Zheng 《Biometrics》2009,65(4):1288-1295

Summary: A two‐stage design is cost‐effective for genome‐wide association studies (GWAS) testing hundreds of thousands of single nucleotide polymorphisms (SNPs). In this design, each SNP is genotyped in stage 1 using a fraction of case–control samples. Top‐ranked SNPs are selected and genotyped in stage 2 using additional samples. A joint analysis, combining statistics from both stages, is applied in the second stage. Follow‐up studies can be regarded as a two‐stage design. Once some potential SNPs are identified, independent samples are further genotyped and analyzed separately or jointly with previous data to confirm the findings. When the underlying genetic model is known, an asymptotically optimal trend test (TT) can be used at each analysis. In practice, however, genetic models for SNPs with true associations are usually unknown. In this case, the existing methods for analysis of the two‐stage design and follow‐up studies are not robust across different genetic models. We propose a simple robust procedure with genetic model selection for the two‐stage GWAS. Our results show that, if the optimal TT has about 80% power when the genetic model is known, then the existing methods for analysis of the two‐stage design have a minimum power of about 20% across the four common genetic models (when the true model is unknown), while our robust procedure has a minimum power of about 70% across the same genetic models. The results can also be applied to follow‐up and replication studies with a joint analysis.

17.

A simulation-based evaluation of methods for inferring linear barriers to gene flow

Blair C Weigel DE Balazik M Keeley AT Walker FM Landguth E Cushman S Murphy M Waits L Balkenhol N 《Molecular ecology resources》2012,12(5):822-833

Different analytical techniques used on the same data set may lead to different conclusions about the existence and strength of genetic structure. Therefore, reliable interpretation of the results from different methods depends on the efficacy and reliability of different statistical methods. In this paper, we evaluated the performance of multiple analytical methods to detect the presence of a linear barrier dividing populations. We were specifically interested in determining if simulation conditions, such as dispersal ability and genetic equilibrium, affect the power of different analytical methods for detecting barriers. We evaluated two boundary detection methods (Monmonier's algorithm and WOMBLING), two spatial Bayesian clustering methods (TESS and GENELAND), an aspatial clustering approach (STRUCTURE), and two recently developed, non-Bayesian clustering methods [PSMIX and discriminant analysis of principal components (DAPC)]. We found that clustering methods had higher success rates than boundary detection methods and also detected the barrier more quickly. All methods detected the barrier more quickly when dispersal was long distance in comparison to short-distance dispersal scenarios. Bayesian clustering methods performed best overall, both in terms of highest success rates and lowest time to barrier detection, with GENELAND showing the highest power. None of the methods suggested a continuous linear barrier when the data were generated under an isolation-by-distance (IBD) model. However, the clustering methods had higher potential for leading to incorrect barrier inferences under IBD unless strict criteria for successful barrier detection were implemented. Based on our findings and those of previous simulation studies, we discuss the utility of different methods for detecting linear barriers to gene flow.

18.

Imputation of missing genotypes from low‐ to high‐density SNP panel in different population designs

S. He S. Wang W. Fu X. Ding Q. Zhang 《Animal genetics》2015,46(1):1-7

Imputation of missing genotypes, in particular from low density to high density, is an important issue in genomic selection and genome‐wide association studies. Given the marker densities, the most important factors affecting imputation accuracy are the size of the reference population and the relationship between individuals in the reference (genotyped with high‐density panel) and study (genotyped with low‐density panel) populations. In this study, we investigated the imputation accuracies when the reference population (genotyped with Illumina BovineSNP50 SNP panel) contained sires, halfsibs, or both sires and halfsibs of the individuals in the study population (genotyped with Illumina BovineLD SNP panel) using three imputation programs (fimpute v2.2, findhap v2, and beagle v3.3.2). Two criteria, correlation between true and imputed genotypes and missing rate after imputation, were used to evaluate the performance of the three programs in different scenarios. Our results showed that fimpute performed the best in all cases, with correlations from 0.921 to 0.978 when imputing from sires to their daughters or between halfsibs. In general, the accuracies of imputing between halfsibs or from sires to their daughters were higher than were those imputing between non‐halfsibs or from sires to non‐daughters. Including both sires and halfsibs in the reference population did not improve the imputation performance in comparison with when only including halfsibs in the reference population for all the three programs.
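The two evaluation criteria named in this abstract, correlation between true and imputed genotypes and missing rate after imputation, can be computed as in the following sketch. It assumes genotypes coded as 0/1/2 allele counts with NaN marking genotypes left uncalled after imputation; that coding, the function name, and the toy data are assumptions for illustration, not details taken from the study.

```python
import numpy as np


def imputation_metrics(true_geno, imputed_geno):
    """Return (correlation, missing_rate) for one set of masked genotypes.

    Genotypes are allele counts (0, 1, 2); NaN in `imputed_geno` marks a
    genotype the imputation program left uncalled.
    """
    true_geno = np.asarray(true_geno, dtype=float)
    imputed_geno = np.asarray(imputed_geno, dtype=float)
    called = ~np.isnan(imputed_geno)
    # Fraction of genotypes still missing after imputation.
    missing_rate = 1.0 - called.mean()
    # Pearson correlation over the genotypes that were actually imputed.
    corr = np.corrcoef(true_geno[called], imputed_geno[called])[0, 1]
    return float(corr), float(missing_rate)


# Toy example (hypothetical data): one wrong call and one uncalled genotype.
true_geno = [0, 1, 2, 1, 0, 2, 1, 0]
imputed_geno = [0, 1, 2, 2, 0, 2, np.nan, 0]
corr, missing_rate = imputation_metrics(true_geno, imputed_geno)
print(corr, missing_rate)
```

Reporting both numbers together matters because a program can raise its correlation simply by declining to call difficult genotypes, which shows up as a higher missing rate.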

19.

PCA-correlated SNPs for structure identification in worldwide human populations (total citations: 1; self-citations: 0; citations by others: 1)

Paschou P Ziv E Burchard EG Choudhry S Rodriguez-Cintron W Mahoney MW Drineas P 《PLoS genetics》2007,3(9):1672-1686

Existing methods to ascertain small sets of markers for the identification of human population structure require prior knowledge of individual ancestry. Based on Principal Components Analysis (PCA), and recent results in theoretical computer science, we present a novel algorithm that, applied on genomewide data, selects small subsets of SNPs (PCA-correlated SNPs) to reproduce the structure found by PCA on the complete dataset, without use of ancestry information. Evaluating our method on a previously described dataset (10,805 SNPs, 11 populations), we demonstrate that a very small set of PCA-correlated SNPs can be effectively employed to assign individuals to particular continents or populations, using a simple clustering algorithm. We validate our methods on the HapMap populations and achieve perfect intercontinental differentiation with 14 PCA-correlated SNPs. The Chinese and Japanese populations can be easily differentiated using fewer than 100 PCA-correlated SNPs ascertained after evaluating 1.7 million SNPs from HapMap. We show that, in general, structure informative SNPs are not portable across geographic regions. However, we manage to identify a general set of 50 PCA-correlated SNPs that effectively assigns individuals to one of nine different populations. Compared to analysis with the measure of informativeness, our methods, although unsupervised, achieved similar results. We proceed to demonstrate that our algorithm can be effectively used for the analysis of admixed populations without having to trace the origin of individuals. Analyzing a Puerto Rican dataset (192 individuals, 7,257 SNPs), we show that PCA-correlated SNPs can be used to successfully predict structure and ancestry proportions. We subsequently validate these SNPs for structure identification in an independent Puerto Rican dataset.
The algorithm that we introduce runs in seconds and can be easily applied on large genome-wide datasets, facilitating the identification of population substructure, stratification assessment in multi-stage whole-genome association studies, and the study of demographic history in human populations.
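One way to picture selecting SNPs that are correlated with the top principal components is to score each SNP by its squared loadings on the top k right singular vectors of the centered genotype matrix and keep the highest-scoring SNPs. The sketch below illustrates that general idea under stated assumptions; it is not claimed to be the authors' exact procedure, and the function names, the toy two-population data, and the choice of k = 1 are all hypothetical.

```python
import numpy as np


def pca_snp_scores(genotypes, n_components):
    """Score each SNP by its summed squared loadings on the top PCs.

    `genotypes` is an individuals x SNPs matrix of allele counts (0/1/2).
    """
    x = np.asarray(genotypes, dtype=float)
    x = x - x.mean(axis=0)  # center each SNP column
    # Rows of vt are the right singular vectors (SNP loadings per PC).
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    return (vt[:n_components] ** 2).sum(axis=0)


def select_top_snps(genotypes, n_components, n_select):
    """Return indices of the `n_select` highest-scoring SNPs."""
    scores = pca_snp_scores(genotypes, n_components)
    return np.argsort(scores)[::-1][:n_select]


# Toy data (hypothetical): two populations of 50 individuals, 100 SNPs.
# Only the first five SNPs differ strongly in allele frequency.
rng = np.random.default_rng(1)
n_per_pop, n_snps = 50, 100
freq_a = np.full(n_snps, 0.5)
freq_b = np.full(n_snps, 0.5)
freq_a[:5], freq_b[:5] = 0.05, 0.95
pop_a = rng.binomial(2, freq_a, size=(n_per_pop, n_snps))
pop_b = rng.binomial(2, freq_b, size=(n_per_pop, n_snps))
genotypes = np.vstack([pop_a, pop_b])

selected = select_top_snps(genotypes, n_components=1, n_select=5)
print(sorted(selected.tolist()))
```

No population labels are passed to the scoring function, which mirrors the unsupervised setting the abstract describes: the structure-informative SNPs emerge from the genotype matrix alone.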

20.

Evaluating supervised and unsupervised background noise correction in human gut microbiome data

Leah Briscoe Brunilda Balliu Sriram Sankararaman Eran Halperin Nandita R. Garud 《PLoS computational biology》2022,18(2)

The ability to predict human phenotypes and identify biomarkers of disease from metagenomic data is crucial for the development of therapeutics for microbiome-associated diseases. However, metagenomic data is commonly affected by technical variables unrelated to the phenotype of interest, such as sequencing protocol, which can make it difficult to predict phenotype and find biomarkers of disease. Supervised methods to correct for background noise, originally designed for gene expression and RNA-seq data, are commonly applied to microbiome data but may be limited because they cannot account for unmeasured sources of variation. Unsupervised approaches address this issue, but current methods are limited because they are ill-equipped to deal with the unique aspects of microbiome data, which is compositional, highly skewed, and sparse. We perform a comparative analysis of the ability of different denoising transformations in combination with supervised correction methods as well as an unsupervised principal component correction approach that is presently used in other domains but has not been applied to microbiome data to date. We find that the unsupervised principal component correction approach has comparable ability in reducing false discovery of biomarkers as the supervised approaches, with the added benefit of not needing to know the sources of variation a priori. However, in prediction tasks, it appears to only improve prediction when technical variables contribute to the majority of variance in the data. As new and larger metagenomic datasets become increasingly available, background noise correction will become essential for generating reproducible microbiome analyses.