Similar articles
 20 similar articles found (search time: 15 ms)
1.
The recent dramatic cost reduction of next-generation sequencing technology enables investigators to assess most variants in the human genome to identify risk variants for complex diseases. However, sequencing large samples remains very expensive. For a study sample with existing genotype data, such as array data from genome-wide association studies, a cost-effective approach is to sequence a subset of the study sample and then to impute the rest of the study sample, using the sequenced subset as a reference panel. The use of such an internal reference panel identifies population-specific variants and avoids the problem of a substantial mismatch in ancestry background between the study population and the reference population. To efficiently select an internal panel, we introduce an idea of phylogenetic diversity from mathematical phylogenetics and comparative genomics. We propose the “most diverse reference panel”, defined as the subset with the maximal “phylogenetic diversity”, thereby incorporating individuals that span a diverse range of genotypes within the sample. Using data both from simulations and from the 1000 Genomes Project, we show that the most diverse reference panel can substantially improve the imputation accuracy compared to randomly selected reference panels, especially for the imputation of rare variants. The improvement in imputation accuracy holds across different marker densities, reference panel sizes, and lengths for the imputed segments. We thus propose a novel strategy for planning sequencing studies on samples with existing genotype data.

2.
3.
A growing variety of “genotype-by-sequencing” (GBS) methods use restriction enzymes and high throughput DNA sequencing to generate data for a subset of genomic loci, allowing the simultaneous discovery and genotyping of thousands of polymorphisms in a set of multiplexed samples. We evaluated a “double-digest” restriction-site associated DNA sequencing (ddRAD-seq) protocol by 1) comparing results for a zebra finch (Taeniopygia guttata) sample with in silico predictions from the zebra finch reference genome; 2) assessing data quality for a population sample of indigobirds (Vidua spp.); and 3) testing for consistent recovery of loci across multiple samples and sequencing runs. Comparison with in silico predictions revealed that 1) over 90% of predicted, single-copy loci in our targeted size range (178–328 bp) were recovered; 2) short restriction fragments (38–178 bp) were carried through the size selection step and sequenced at appreciable depth, generating unexpected but nonetheless useful data; 3) amplification bias favored shorter, GC-rich fragments, contributing to among locus variation in sequencing depth that was strongly correlated across samples; 4) our use of restriction enzymes with a GC-rich recognition sequence resulted in an up to four-fold overrepresentation of GC-rich portions of the genome; and 5) star activity (i.e., non-specific cutting) resulted in thousands of “extra” loci sequenced at low depth. Results for three species of indigobirds show that a common set of thousands of loci can be consistently recovered across both individual samples and sequencing runs. In a run with 46 samples, we genotyped 5,996 loci in all individuals and 9,833 loci in 42 or more individuals, resulting in <1% missing data for the larger data set. We compare our approach to similar methods and discuss the range of factors (fragment library preparation, natural genetic variation, bioinformatics) influencing the recovery of a consistent set of loci among samples.

4.
Keightley PD, Halligan DL. Genetics, 2011, 188(4): 931-940
Sequencing errors and random sampling of nucleotide types among sequencing reads at heterozygous sites present challenges for accurate, unbiased inference of single-nucleotide polymorphism genotypes from high-throughput sequence data. Here, we develop a maximum-likelihood approach to estimate the frequency distribution of the number of alleles in a sample of individuals (the site frequency spectrum), using high-throughput sequence data. Our method assumes binomial sampling of nucleotide types in heterozygotes and random sequencing error. By simulations, we show that close to unbiased estimates of the site frequency spectrum can be obtained if the error rate per base read does not exceed the population nucleotide diversity. We also show that these estimates are reasonably robust if errors are nonrandom. We then apply the method to infer site frequency spectra for zerofold degenerate, fourfold degenerate, and intronic sites of protein-coding genes using the low coverage human sequence data produced by the 1000 Genomes Project phase-one pilot. By fitting a model to the inferred site frequency spectra that estimates parameters of the distribution of fitness effects of new mutations, we find evidence for significant natural selection operating on fourfold sites. We also find that a model with variable effects of mutations at synonymous sites fits the data significantly better than a model with equal mutational effects. Under the variable effects model, we infer that 11% of synonymous mutations are subject to strong purifying selection.
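
The binomial read-sampling assumption in the abstract above can be illustrated with a small sketch. This is not the authors' estimator: it assumes a simplified two-allele model in which every read is miscalled as the other allele with the same probability, and the function name `genotype_likelihoods` and its parameters are hypothetical.

```python
from math import comb


def genotype_likelihoods(alt_reads, total_reads, error_rate):
    """Binomial likelihood of the read counts under each diploid genotype.

    Returns a dict mapping the number of alternate alleles (0, 1, 2) to
    P(alt_reads | total_reads, genotype). Assumption for illustration: two
    alleles only, and each read is miscalled as the other allele with
    probability error_rate.
    """
    if not 0 <= alt_reads <= total_reads:
        raise ValueError("alt_reads must lie between 0 and total_reads")
    likelihoods = {}
    for alt_copies in (0, 1, 2):
        true_frac = alt_copies / 2  # share of reads drawn from the alt allele
        # Probability that a single read is observed as the alternate allele.
        p_alt = true_frac * (1 - error_rate) + (1 - true_frac) * error_rate
        likelihoods[alt_copies] = (
            comb(total_reads, alt_reads)
            * p_alt ** alt_reads
            * (1 - p_alt) ** (total_reads - alt_reads)
        )
    return likelihoods
```

With 2 alternate reads out of 4 and a 1% error rate, the heterozygous genotype receives by far the highest likelihood, because both homozygous genotypes would need two sequencing errors to explain the data.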

5.
In this study, we characterize changes in the genome during a swift evolutionary adaptation, by combining experimental selection with high-throughput sequencing. We imposed strong experimental selection on an ecologically relevant trait, parasitoid resistance in Drosophila melanogaster against Asobara tabida. Replicated selection lines rapidly evolved towards enhanced immunity. Larval survival after parasitization increased twofold after just five generations of selection. Whole-genome sequencing revealed that the fast and strong selection response in innate immunity produced multiple, highly localized genomic changes. We identified narrow genomic regions carrying a significant signature of selection, which were present across all chromosomes and covered in total less than 5% of the whole D. melanogaster genome. We identified segregating sites with highly significant changes in frequency between control and selection lines that fell within these narrow ‘selected regions’. These segregating sites were associated with 42 genes that constitute possible targets of selection. A region on chromosome 2R was highly enriched in significant segregating sites and may be of major effect on parasitoid defence. The high genetic variability and small linkage blocks in our base population are likely responsible for allowing this complex trait to evolve without causing widespread erosive effects in the genome, even under such a fast and strong selective regime.

6.
Siu H, Jin L, Xiong M. PLoS ONE, 2012, 7(1): e29901
The dimension of the population genetics data produced by next-generation sequencing platforms is extremely high. However, the "intrinsic dimensionality" of sequence data, which determines the structure of populations, is much lower. This motivates us to use locally linear embedding (LLE), which projects high-dimensional genomic data into a low-dimensional, neighborhood-preserving embedding, as a general framework for population structure and historical inference. To facilitate application of the LLE to population genetic analysis, we systematically investigate several important properties of the LLE and reveal the connection between the LLE and principal component analysis (PCA). Identifying a set of markers and genomic regions which could be used for population structure analysis will provide invaluable information for population genetics and association studies. In addition to identifying the LLE-correlated or PCA-correlated structure-informative markers, we have developed a new statistic that integrates genomic information content in a genomic region for collectively studying its association with the population structure, and a LASSO algorithm to search for such regions across the genome. We applied the developed methodologies to a low-coverage pilot dataset in the 1000 Genomes Project and a Phase III Mexico dataset of the HapMap. We observed that 25.1%, 44.9% and 21.4% of the common variants and 89.2%, 92.4% and 75.1% of the rare variants were the LLE-correlated markers in CEU, YRI and ASI, respectively. This showed that rare variants, which are often private to specific populations, have much higher power to identify population substructure than common variants. The preliminary results demonstrated that next-generation sequencing offers a rich resource and LLE provides a powerful tool for population structure analysis.

7.
Soil fungi play a major role in terrestrial ecosystem functioning through interactions with soil structure, plants, micro- and mesofauna, and nutrient cycling through predation, pathogenesis, mutualistic, and saprotrophic roles. The diversity of soil fungi was assessed by sequencing their 28S rRNA gene in Alaskan permafrost and Oklahoma tallgrass prairie soils at experimental sites where the effect of climate warming is under investigation. A total of 226,695 reads were classified into 1,063 genera, covering 62% of the reference data set. Using the Bayesian Classifier offered by the Ribosomal Database Project (RDP) with 50% bootstrapping classification confidence, approximately 70% of sequences were returned as “unclassified” at the genus level, although the majority (∼65%) were classified at the class level, which provided insight into these lesser-known fungal lineages. Those unclassified at the genus level were subjected to BLAST analysis against the ARB-SILVA database, where ∼50% most closely matched nonfungal taxa. Compared to the more abundant sequences, a higher proportion of rare operational taxonomic units (OTUs) were successfully classified to genera at 50% bootstrap confidence, indicating that the fungal rare biosphere in these sites is not composed of sequencing artifacts. There was no significant effect after 1 year of warming on the fungal community structure at either site, except perhaps for a few minor members, but there was a significant effect of sample depth in the permafrost soils. Despite overall significant community structure differences driven by variations in OTU dominance, the prairie and permafrost soils shared 90% and 63% of all fungal sequences, respectively, indicating a fungal “seed bank” common between both sites.

8.
Accurate sex identification is crucial for elucidating the biology of a species. In the absence of directly observable sexual characteristics, sex identification of wild fauna can be challenging, if not impossible. Molecular sexing offers a powerful alternative to morphological sexing approaches. Here, we present SeXY, a novel sex‐identification pipeline, for very low‐coverage shotgun sequencing data from a single individual. SeXY was designed to utilize low‐effort screening data for sex identification and does not require a conspecific sex‐chromosome assembly as reference. We assess how the accuracy of our pipeline depends on data quantity, by downsampling sequencing data from 100,000 to 1000 mapped reads, and on reference genome selection, by mapping to a variety of reference genomes of various qualities and phylogenetic distances. We show that our method is 100% accurate when mapping to a high‐quality (highly contiguous N50 > 30 Mb) conspecific genome, even down to 1000 mapped reads. For lower‐quality reference assemblies (N50 < 30 Mb), our method is 100% accurate with 50,000 mapped reads, regardless of reference assembly quality or phylogenetic distance. The SeXY pipeline provides several advantages over previously implemented methods; SeXY (i) requires sequencing data from only a single individual, (ii) does not require assembled conspecific sex chromosomes, or even a conspecific reference assembly, (iii) takes into account variation in coverage across the genome, and (iv) is accurate with only 1000 mapped reads in many cases.
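
The abstract above does not describe SeXY's internals, so the following is only a generic sketch of coverage-based sexing, not the SeXY pipeline. It assumes an XY sex-determination system and that reads have already been assigned to X-linked and autosomal sequence; the function name `call_sex`, its parameters, and the 0.75 threshold are all illustrative assumptions.

```python
def call_sex(x_reads, autosome_reads, x_length, autosome_length,
             xx_threshold=0.75):
    """Return (ratio, call) from length-normalized X versus autosome depth.

    The ratio is expected near 1.0 when two X copies are present and near 0.5
    when only one is. xx_threshold is an illustrative cut-off midway between
    those two expectations, not a value taken from the SeXY paper.
    """
    if min(x_reads, autosome_reads) < 0:
        raise ValueError("read counts must be non-negative")
    if min(x_length, autosome_length) <= 0:
        raise ValueError("sequence lengths must be positive")
    if autosome_reads == 0:
        raise ValueError("no autosomal reads; the ratio is undefined")
    x_depth = x_reads / x_length                    # reads per base on X
    autosome_depth = autosome_reads / autosome_length
    ratio = x_depth / autosome_depth
    call = "XX" if ratio >= xx_threshold else "XY"
    return ratio, call
```

With very few mapped reads the ratio becomes noisy, which is presumably why the abstract evaluates accuracy at different read counts; a real pipeline would also need to quantify that uncertainty rather than apply a single fixed threshold.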

9.
Multiple disease resistance has important implications for plant fitness, given the selection pressure that many pathogens exert directly on natural plant populations and indirectly via crop improvement programs. Evidence of a locus conditioning resistance to multiple pathogens was found in bin 1.06 of the maize genome with the allele from inbred line “Tx303” conditioning quantitative resistance to northern leaf blight (NLB) and qualitative resistance to Stewart’s wilt. To dissect the genetic basis of resistance in this region and to refine candidate gene hypotheses, we mapped resistance to the two diseases. Both resistance phenotypes were localized to overlapping regions, with the Stewart’s wilt interval refined to a 95.9-kb segment containing three genes and the NLB interval to a 3.60-Mb segment containing 117 genes. Regions of the introgression showed little to no recombination, suggesting structural differences between the inbred lines Tx303 and “B73,” the parents of the fine-mapping population. We examined copy number variation across the region using next-generation sequencing data, and found large variation in read depth in Tx303 across the region relative to the reference genome of B73. In the fine-mapping region, association mapping for NLB implicated candidate genes, including a putative zinc finger and pan1. We tested mutant alleles and found that pan1 is a susceptibility gene for NLB and Stewart’s wilt. Our data strongly suggest that structural variation plays an important role in resistance conditioned by this region, and pan1, a gene conditioning susceptibility for NLB, may underlie the QTL.

10.
Determination of sequence variation within a genetic locus to develop clinically relevant databases is critical for molecular assay design and clinical test interpretation, so multisample pooling for Illumina genome analyzer (GA) sequencing was investigated using the RET proto-oncogene as a model. Samples were Sanger-sequenced for RET exons 10, 11, and 13–16. Ten samples with 13 known unique variants (“singleton variants” within the pool) and seven common changes were amplified and then equimolar-pooled before sequencing on a single flow cell lane, generating 36-base reads. For comparison, a single “control” sample was run in a different lane. After alignment, a 24-base quality score-screening threshold and 3′ read-end trimming of three bases yielded low background error rates with a 27% decrease in aligned read coverage. Sequencing data were evaluated using an established variant detection method (percent variant reads), by the presented subtractive correction method, and with SNPSeeker software. In total, 41 variants (of which 23 were singleton variants) were detected in the 10-sample pool data, which included all Sanger-identified variants. The 23 singleton variants were detected near the expected 5% allele frequency (average 5.17%±0.90% variant reads), well above the highest background error (1.25%). Based on background error rates, read coverage, simulated 30-, 40-, and 50-sample pool data, expected singleton allele frequencies within pools, and variant detection methods, ≥30 samples (which demonstrated a minimum 1% variant reads for singletons) could be pooled to reliably detect singleton variants by GA sequencing.
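
The "expected 5% allele frequency" for a singleton in the 10-sample pool follows from simple arithmetic, sketched below. The sketch assumes diploid samples, a heterozygous singleton carried by exactly one sample, and perfectly equimolar pooling with unbiased amplification; the function name `expected_singleton_fraction` is hypothetical.

```python
def expected_singleton_fraction(pool_size, ploidy=2):
    """Expected fraction of reads carrying a variant found on one chromosome.

    Assumes every sample contributes equally to the pool, so the variant sits
    on 1 of the ploidy * pool_size chromosome copies that were sequenced.
    """
    if pool_size < 1 or ploidy < 1:
        raise ValueError("pool_size and ploidy must be at least 1")
    return 1 / (ploidy * pool_size)


if __name__ == "__main__":
    # Pool sizes mentioned in the abstract: the real pool and the simulated ones.
    for pool_size in (10, 30, 40, 50):
        fraction = expected_singleton_fraction(pool_size)
        print(f"{pool_size} samples: {fraction:.2%} expected variant reads")
```

Under these assumptions the expectation is 1/20 = 5% for 10 samples and falls to 1/100 = 1% for 50 samples, which is why the background error rate limits how many samples can be pooled.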

11.
Mounting evidence suggests that natural populations can harbor extensive fitness diversity with numerous genomic loci under selection. It is also known that genealogical trees for populations under selection are quantifiably different from those expected under neutral evolution and described statistically by Kingman’s coalescent. While differences in the statistical structure of genealogies have long been used as a test for the presence of selection, the full extent of the information that they contain has not been exploited. Here we demonstrate that the shape of the reconstructed genealogical tree for a moderately large number of random genomic samples taken from a fitness diverse, but otherwise unstructured, asexual population can be used to predict the relative fitness of individuals within the sample. To achieve this we define a heuristic algorithm, which we test in silico, using simulations of a Wright–Fisher model for a realistic range of mutation rates and selection strength. Our inferred fitness ranking is based on a linear discriminator that identifies rapidly coalescing lineages in the reconstructed tree. Inferred fitness ranking correlates strongly with actual fitness, with a genome in the top 10% ranked being in the top 20% fittest with false discovery rate of 0.1–0.3, depending on the mutation/selection parameters. The ranking also enables us to predict the genotypes that future populations inherit from the present one. While the inference accuracy increases monotonically with sample size, samples of 200 nearly saturate the performance. We propose that our approach can be used for inferring relative fitness of genomes obtained in single-cell sequencing of tumors and in monitoring viral outbreaks.

12.
Problem: NHS patients requiring elective surgery usually have to wait before being treated and are usually told when a date becomes available. Design: 18 month pilot programme to enable day case patients to book date of hospital admission at time of decision to operate. Background and setting: 24 pilot sites in England with relatively short waiting times and some experience of booking appointments. Key measures for improvement: Proportion of patients with booked or “to come in” date during and after pilot programme, proportion not attending for admission, and proportion waiting ≥ 6 months. Comparison of pilot sites with non-pilot sites. Strategies for change: National Patients' Access Team established to help pilot sites enable patients to book admission dates. Provision of £9.9m to pilot sites to employ project managers, purchase equipment, buy extra time from clinical and other staff, and invest in information and communications technology. Effects of change: Proportion of patients with booked or “to come in” date increased from 51.1% to 72.7% between end of March 1999 and end of March 2000, and then fell to 66.2% by end of March 2001. Over the same periods, the proportion of patients waiting ≥ 6 months fell from 10.9% to 10.5% and then increased to 11.9%. The proportion of patients failing to attend fell from 5.7% to 3.1% between the first quarter of 1999 and the first quarter of 2000, and then increased to 4.0% in the first quarter of 2001. Pilot sites varied widely in performance during and after the pilot phase. Pilot sites had higher proportions of patients with booked or “to come in” date than non-pilot sites at end of each period. Lessons learnt: Increasing the proportion of patients who book their date of hospital admission is possible, but there are difficulties in sustaining this. Several factors facilitated or hindered the implementation of booking, and the roll out of the programme across the NHS is seeking to incorporate these factors.

13.
Across the life sciences, processing next generation sequencing data commonly relies upon a computationally expensive process where reads are mapped onto a reference sequence. Prior to such processing, however, there is a vast amount of information that can be ascertained from the reads, potentially obviating the need for processing, or allowing optimized mapping approaches to be deployed. Here, we present a method termed FlexTyper which facilitates a “reverse mapping” approach in which high throughput sequence queries, in the form of k-mer searches, are run against indexed short-read datasets in order to extract useful information. This reverse mapping approach enables the rapid counting of target sequences of interest. We demonstrate FlexTyper’s utility for recovering depth of coverage, and accurate genotyping of SNP sites across the human genome. We show that genotyping unmapped reads can correctly inform a sample’s population, sex, and relatedness in a family setting. Detection of pathogen sequences within RNA-seq data was sensitive and accurate, performing comparably to existing methods, but with increased flexibility. We present two examples of ways in which this flexibility allows the analysis of genome features not well-represented in a linear reference. First, we analyze contigs from African genome sequencing studies, showing how they distribute across families from three distinct populations. Second, we show how gene-marking k-mers for the killer immune receptor locus allow allele detection in a region that is challenging for standard read mapping pipelines. The future adoption of the reverse mapping approach represented by FlexTyper will be enabled by more efficient methods for FM-index generation and biology-informed collections of reference queries. In the long-term, selection of population-specific references or weighting of edges in pan-population reference genome graphs will be possible using the FlexTyper approach. 
FlexTyper is available at https://github.com/wassermanlab/OpenFlexTyper.
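
The "reverse mapping" idea, in which query k-mers are looked up in the reads rather than reads being aligned to a reference, can be sketched in a few lines. This is not FlexTyper's code or interface: FlexTyper searches an FM-index of the reads, whereas this illustration simply scans each read, and the names `reverse_complement` and `count_query_kmers` are hypothetical. It assumes uppercase A/C/G/T sequences.

```python
from collections import Counter


def reverse_complement(seq):
    """Reverse complement of an uppercase A/C/G/T sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def count_query_kmers(reads, queries, k):
    """Count how often each query k-mer occurs in the reads on either strand.

    A plain scan over the reads stands in for the indexed search that a real
    tool would use; it is only practical for small inputs.
    """
    # Map each searchable k-mer (query or its reverse complement) to the
    # queries it should be credited to.
    lookup = {}
    for query in queries:
        if len(query) != k:
            raise ValueError(f"query {query!r} is not of length {k}")
        for form in {query, reverse_complement(query)}:
            lookup.setdefault(form, set()).add(query)

    counts = Counter({query: 0 for query in queries})
    for read in reads:
        for start in range(len(read) - k + 1):
            for query in lookup.get(read[start:start + k], ()):
                counts[query] += 1
    return counts
```

Counting a k-mer that spans the reference allele of a SNP and another that spans the alternate allele gives the two read counts needed for a simple genotype call, which is the kind of use the abstract describes.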

14.
  1. Spatial capture–recapture (SCR) models have increasingly been used as a basis for combining capture–recapture data types with variable levels of individual identity information to estimate population density and other demographic parameters. Recent examples are the unmarked SCR (or spatial count model), where no individual identities are available and spatial mark–resight (SMR) where individual identities are available for only a marked subset of the population. Currently lacking, though, is a model that allows unidentified samples to be combined with identified samples when there are no separate classes of “marked” and “unmarked” individuals and when the two sample types cannot be considered as arising from two independent observation models. This is a common scenario when using noninvasive sampling methods, for example, when analyzing data on identified and unidentified photographs or scats from the same sites.
  2. Here we describe a “random thinning” SCR model that utilizes encounters of both known and unknown identity samples using a natural mechanistic dependence between samples arising from a single observation model. Our model was fitted in a Bayesian framework using NIMBLE.
  3. We investigate the improvement in parameter estimates by including the unknown identity samples, which was notable (up to 79% more precise) in low‐density populations with a low rate of identified encounters. We then applied the random thinning SCR model to a noninvasive genetic sampling study of brown bear (Ursus arctos) density in the eastern Cantabrian Mountains (northern Spain).
  4. Our model can improve density estimation for noninvasive sampling studies for low‐density populations with low rates of individual identification, by making use of available data that might otherwise be discarded.
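
The abstract does not spell out the thinning mechanism, so the simulation below rests on an assumption: that every sample is identified independently with one fixed probability, and that unidentified samples are only known per detector. It is a data-generating sketch, not the authors' NIMBLE model, and the names `thin_encounters`, `true_counts` and `id_prob` are hypothetical.

```python
import random


def thin_encounters(true_counts, id_prob, seed=None):
    """Split true encounter counts into identified and unidentified samples.

    true_counts[i][j] is the number of samples individual i left at detector j.
    Each sample is kept as "identified" with probability id_prob (assumed to be
    constant); the rest lose their identity and are pooled per detector.
    Returns (identified, unidentified_per_detector).
    """
    if not 0.0 <= id_prob <= 1.0:
        raise ValueError("id_prob must lie between 0 and 1")
    rng = random.Random(seed)
    n_detectors = len(true_counts[0]) if true_counts else 0
    identified = []
    unidentified = [0] * n_detectors
    for row in true_counts:
        identified_row = []
        for detector, n_samples in enumerate(row):
            # One Bernoulli draw per sample, i.e. a binomial thinning.
            kept = sum(rng.random() < id_prob for _ in range(n_samples))
            identified_row.append(kept)
            unidentified[detector] += n_samples - kept
        identified.append(identified_row)
    return identified, unidentified
```

Because both outputs come from the same true counts, they are dependent by construction, which matches the abstract's point that the two sample types cannot be treated as independent observation models.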

15.
Untranslated gene regions (UTRs) play an important role in controlling gene expression. 3′-UTRs are primarily targeted by microRNA (miRNA) molecules that form complex gene regulatory networks. Cancer genomes are replete with non-coding mutations, many of which are connected to changes in tumor gene expression that accompany the development of cancer and are associated with resistance to therapy. Therefore, variants that arise in 3′-UTRs during cancer progression should be analysed to predict their phenotypic effect on gene expression, e.g., by evaluating their impact on miRNA target sites. Here, we analyze 3′-UTR variants in DICER1 and DROSHA genes in the context of myelodysplastic syndrome (MDS) development. The key features of this analysis include an assessment of both “canonical” and “non-canonical” types of mRNA-miRNA binding and tissue-specific profiling of miRNA interactions with wild-type and mutated genes. As a result, we obtained a list of DICER1 and DROSHA variants likely altering the miRNA sites and, therefore, potentially leading to the observed tissue-specific gene downregulation. All identified variants have low population frequency consistent with their potential association with pathology progression.

16.
Imputation of genotypes in a study sample can make use of sequenced or densely genotyped external reference panels consisting of individuals that are not from the study sample. It also can employ internal reference panels, incorporating a subset of individuals from the study sample itself. Internal panels offer an advantage over external panels because they can reduce imputation errors arising from genetic dissimilarity between a population of interest and a second, distinct population from which the external reference panel has been constructed. As the cost of next-generation sequencing decreases, internal reference panel selection is becoming increasingly feasible. However, it is not clear how best to select individuals to include in such panels. We introduce a new method for selecting an internal reference panel, minimizing the average distance to the closest leaf (ADCL), and compare its performance relative to an earlier algorithm: maximizing phylogenetic diversity (PD). Employing both simulated data and sequences from the 1000 Genomes Project, we show that ADCL provides a significant improvement in imputation accuracy, especially for imputation of sites with low-frequency alleles. This improvement in imputation accuracy is robust to changes in reference panel size, marker density, and length of the imputation target region.
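
The ADCL criterion named above can be made concrete with a small sketch. Two assumptions are made for illustration: distances come from a precomputed symmetric pairwise matrix rather than from a phylogenetic tree, and the average is taken over all individuals, with panel members contributing zero. The greedy search is a simple heuristic, not the authors' optimization algorithm, and the names `adcl` and `greedy_min_adcl` are hypothetical.

```python
def adcl(distances, panel):
    """Average distance from every individual to its closest panel member.

    distances is a symmetric matrix (list of lists) with zeros on the diagonal;
    panel is a non-empty collection of row indices.
    """
    panel = list(panel)
    if not panel:
        raise ValueError("the panel must contain at least one individual")
    n = len(distances)
    return sum(min(distances[i][p] for p in panel) for i in range(n)) / n


def greedy_min_adcl(distances, panel_size):
    """Grow a panel one individual at a time, each time minimizing ADCL.

    A heuristic stand-in for exact minimization; ties go to the lowest index.
    """
    n = len(distances)
    if not 0 < panel_size <= n:
        raise ValueError("panel_size must be between 1 and the sample size")
    panel = []
    for _ in range(panel_size):
        candidates = [i for i in range(n) if i not in panel]
        best = min(candidates, key=lambda c: adcl(distances, panel + [c]))
        panel.append(best)
    return panel
```

A low ADCL means every individual to be imputed has a close relative in the panel, whereas maximizing phylogenetic diversity tends to favor the most divergent individuals; the two criteria can therefore pick quite different panels.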

17.
Understanding the assembly processes of symbiont communities, including viromes and microbiomes, is important for improving predictions on symbionts’ biogeography and disease ecology. Here, we use phylogenetic, functional, and geographic filters to predict the similarity between symbiont communities, using as a test case the assembly process in viral communities of Mexican bats. We construct generalized linear models to predict viral community similarity, as measured by the Jaccard index, as a function of differences in host phylogeny, host functionality, and spatial co‐occurrence, evaluating the models using the Akaike information criterion. Two model classes are constructed: a “known” model, where virus–host relationships are based only on data reported in Mexico, and a “potential” model, where viral reports of all the Americas are used, but then applied only to bat species that are distributed in Mexico. Although the “known” model shows only weak dependence on any of the filters, the “potential” model highlights the importance of all three filter types (phylogeny, functional traits, and co‐occurrence) in the assemblage of viral communities. The differences between the “known” and “potential” models highlight the utility of modeling at different “scales” so as to compare and contrast known information at one scale to another one, where, for example, virus information associated with bats is much scarcer.
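
The Jaccard index used above as the response variable is the size of the intersection of two sets divided by the size of their union. A minimal sketch follows; the function name `jaccard_index` and the placeholder virus labels in the usage note are illustrative, not data from the study.

```python
def jaccard_index(community_a, community_b):
    """Jaccard similarity between two communities given as collections of taxa.

    Returns |A ∩ B| / |A ∪ B|, a value between 0 (no shared taxa) and 1
    (identical communities).
    """
    a, b = set(community_a), set(community_b)
    union = a | b
    if not union:
        raise ValueError("the Jaccard index is undefined for two empty communities")
    return len(a & b) / len(union)
```

For two hypothetical bat hosts carrying {"virus_1", "virus_2", "virus_3"} and {"virus_2", "virus_3", "virus_4"}, two viruses are shared out of four in total, so the index is 0.5.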

18.
Few areas of science have benefited more from the expansion in sequencing capability than the study of microbial communities. Can sequence data, besides providing hypotheses of the functions the members possess, detect the evolutionary and ecological processes that are occurring? For example, can we determine if a species is adapting to one niche, or if it is diversifying into multiple specialists that inhabit distinct niches? Fortunately, adaptation of populations in the laboratory can serve as a model to test our ability to make such inferences about evolution and ecology from sequencing. Even adaptation to a single niche can give rise to complex temporal dynamics due to the transient presence of multiple competing lineages. If there are multiple niches, this complexity is augmented by segmentation of the population into multiple specialists that can each continue to evolve within their own niche. For a known example of parallel diversification that occurred in the laboratory, sequencing data gave surprisingly few obvious, unambiguous signs of the ecological complexity present. Whereas experimental systems are open to direct experimentation to test hypotheses of selection or ecological interaction, the difficulty in “seeing ecology” from sequencing for even such a simple system suggests translation to communities like the human microbiome will be quite challenging. This will require both improved empirical methods to enhance the depth and time resolution for the relevant polymorphisms and novel statistical approaches to rigorously examine time-series data for signs of various evolutionary and ecological phenomena within and between species.

19.
The host‐associated microbiome is an important player in the ecology and evolution of species. Despite growing interest in the medical, veterinary, and conservation communities, there remain numerous questions about the primary factors underlying microbiota, particularly in wildlife. We bridged this knowledge gap by leveraging microbial, genetic, and observational data collected in a wild, pedigreed population of gray wolves (Canis lupus) inhabiting Yellowstone National Park. We characterized body site‐specific microbes across six haired and mucosal body sites (and two fecal samples) using 16S rRNA amplicon sequencing. At the phylum level, we found that the microbiome of gray wolves primarily consists of Actinobacteria, Bacteroidetes, Firmicutes, Fusobacteria, and Proteobacteria, consistent with previous studies within Mammalia and Canidae. At the genus level, we documented body site‐specific microbiota with functions relevant to microenvironment and local physiological processes. We additionally employed observational and RAD sequencing data to examine genetic, demographic, and environmental correlates of skin and gut microbiota. We surveyed individuals across several levels of pedigree relationships, generations, and social groups, and found that social environment (i.e., pack) and genetic relatedness were two primary factors associated with microbial community composition to differing degrees between body sites. We additionally reported body condition and coat color as secondary factors underlying gut and skin microbiomes, respectively. We concluded that gray wolf microbiota resemble similar host species, differ between body sites, and are shaped by numerous endogenous and exogenous factors. These results provide baseline information for this long‐term study population and yield important insights into the evolutionary history, ecology, and conservation of wild wolves and their associated microbes.

20.
Studies in ecology, evolution, and conservation often rely on noninvasive samples, making it challenging to generate large amounts of high‐quality genetic data for many elusive and at‐risk species. We developed and optimized a Genotyping‐in‐Thousands by sequencing (GT‐seq) panel using noninvasive samples to inform the management of invasive Sitka black‐tailed deer (Odocoileus hemionus sitkensis) in Haida Gwaii (Canada). We validated our panel using paired high‐quality tissue and noninvasive fecal and hair samples to simultaneously distinguish individuals, identify sex, and reconstruct kinship among deer sampled across the archipelago, then provided a proof‐of‐concept application using field‐collected feces on SGang Gwaay, an island of high ecological and cultural value. Genotyping success across 244 loci was high (90.3%) and comparable to that of high‐quality tissue samples genotyped using restriction‐site associated DNA sequencing (92.4%), while genotyping discordance between paired high‐quality tissue and noninvasive samples was low (0.50%). The panel will be used to inform future invasive species operations in Haida Gwaii by providing individual and population information to inform management. More broadly, our GT‐seq workflow that includes quality control analyses for targeted SNP selection and a modified protocol may be of wider utility for other studies and systems where noninvasive genetic sampling is employed.
