Similar articles
20 similar articles found
1.
With the rapid decline in sequencing cost, researchers today rush to embrace whole genome sequencing (WGS) or whole exome sequencing (WES) as the next powerful tool for relating genetic variants to human diseases and phenotypes. A fundamental step in analyzing WGS and WES data is mapping short sequencing reads back to the reference genome. This is an important issue because incorrectly mapped reads affect the downstream variant discovery, genotype calling and association analysis. Although many read mapping algorithms have been developed, the majority of them use the universal reference genome and do not take sequence variants into consideration. Given that genetic variants are ubiquitous, it is highly desirable to factor them into the read mapping procedure. In this work, we developed a novel strategy that utilizes genotypes obtained a priori to customize the universal haploid reference genome into a personalized diploid reference genome. The new strategy is implemented in a program named RefEditor. When applying RefEditor to real data, we achieved encouraging improvements in read mapping, variant discovery and genotype calling. Compared to standard approaches, RefEditor can significantly increase genotype calling consistency (from 43% to 61% at 4X coverage; from 82% to 92% at 20X coverage) and reduce Mendelian inconsistency across various sequencing depths. Because many WGS and WES studies are conducted on cohorts that have been genotyped using array-based genotyping platforms previously or concurrently, we believe the proposed strategy will be of high value in practice; it can also be applied to the scenario where multiple NGS experiments are conducted on the same cohort. The RefEditor sources are available at https://github.com/superyuan/refeditor.
This is a PLOS Computational Biology Software Article.
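The idea of turning a haploid reference into a personalized diploid one can be illustrated with a toy sketch. This is not RefEditor's implementation: it assumes the a priori genotypes are already phased single-nucleotide variants given as 0-based positions, and the function name `personalize` is hypothetical.

```python
def personalize(reference, phased_snvs):
    """Return two haplotype sequences built from one haploid reference.

    reference   -- reference sequence as a string
    phased_snvs -- dict mapping 0-based position -> (allele_a, allele_b);
                   assumed phased, single-nucleotide substitutions only
    """
    hap_a = list(reference)
    hap_b = list(reference)
    for pos, (allele_a, allele_b) in phased_snvs.items():
        hap_a[pos] = allele_a  # allele carried on the first haplotype
        hap_b[pos] = allele_b  # allele carried on the second haplotype
    return "".join(hap_a), "".join(hap_b)


# Toy example: one heterozygous site (position 1) and one homozygous
# non-reference site (position 6).
hap_a, hap_b = personalize("ACGTACGT", {1: ("C", "T"), 6: ("A", "A")})
```

Reads could then be mapped against both haplotypes instead of the single universal reference; handling of indels and unphased genotypes is deliberately left out of this sketch.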

2.
3.
Genetic variants and de novo mutations in regulatory regions of the genome are typically discovered by whole-genome sequencing (WGS); however, WGS is expensive and most WGS reads come from non-regulatory regions. The Assay for Transposase-Accessible Chromatin (ATAC-seq) generates reads from regulatory sequences and could potentially be used as a low-cost ‘capture’ method for regulatory variant discovery, but its use for this purpose has not been systematically evaluated. Here we apply seven variant callers to bulk and single-cell ATAC-seq data and evaluate their ability to identify single nucleotide variants (SNVs) and insertions/deletions (indels). In addition, we develop an ensemble classifier, VarCA, which combines features from individual variant callers to predict variants. The Genome Analysis Toolkit (GATK) is the best-performing individual caller with precision/recall on a bulk ATAC test dataset of 0.92/0.97 for SNVs and 0.87/0.82 for indels within ATAC-seq peak regions with at least 10 reads. On bulk ATAC-seq reads, VarCA achieves superior performance with precision/recall of 0.99/0.95 for SNVs and 0.93/0.80 for indels. On single-cell ATAC-seq reads, VarCA attains precision/recall of 0.98/0.94 for SNVs and 0.82/0.82 for indels. In summary, ATAC-seq reads can be used to accurately discover non-coding regulatory variants in the absence of whole-genome sequencing data and our ensemble method, VarCA, has the best overall performance.
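One way to compare the quoted precision/recall pairs on a single scale is the F1 score, the harmonic mean of precision and recall. The abstract does not report F1; the sketch below derives it from the quoted figures purely as an illustration, and the dictionary layout is an assumption made for this example.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (caller, data type, variant class) -> (precision, recall), as quoted above
reported = {
    ("GATK", "bulk", "SNV"): (0.92, 0.97),
    ("GATK", "bulk", "indel"): (0.87, 0.82),
    ("VarCA", "bulk", "SNV"): (0.99, 0.95),
    ("VarCA", "bulk", "indel"): (0.93, 0.80),
    ("VarCA", "single-cell", "SNV"): (0.98, 0.94),
    ("VarCA", "single-cell", "indel"): (0.82, 0.82),
}

f1_scores = {key: f1(p, r) for key, (p, r) in reported.items()}

for (caller, data, variant), score in sorted(f1_scores.items()):
    print(f"{caller:5s} {data:11s} {variant:5s} F1 = {score:.3f}")
```

On the bulk data the derived F1 is higher for VarCA than for GATK for both SNVs and indels, which is consistent with the abstract's claim of better overall performance.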

4.
The proliferation of genomic sequencing approaches has significantly impacted the field of phylogenetics. Target capture approaches provide a cost-effective, fast and easily applied strategy for phylogenetic inference of non-model organisms. However, several existing target capture processing pipelines are incapable of incorporating whole genome sequencing (WGS). Here, we develop a new pipeline for capture and de novo assembly of the targeted regions using whole genome re-sequencing reads. This new pipeline captured targeted loci accurately and, given its unbiased nature, can be used with any target capture probe set. Moreover, due to its low computational demand, this new pipeline may be ideal for users with limited resources and when high-coverage sequencing outputs are required. We demonstrate the utility of our approach by incorporating WGS data into the first comprehensive phylogenomic reconstruction of the freshwater mussel family Margaritiferidae. We also provide a catalogue of well-curated functional annotations of these previously uncharacterized freshwater mussel-specific target regions, representing a complementary tool for scrutinizing phylogenetic inferences while expanding future applications of the probe set.

5.
Ciliates are unicellular eukaryotes with separate germline and somatic genomes and diverse life cycles, which make them a unique model to improve our understanding of population genetics through the detection of genetic variations. However, traditional sequencing methods cannot be directly applied to ciliates because the majority are uncultivated. Single‐cell whole‐genome sequencing (WGS) is a powerful tool for studying genetic variation in microbes, but no studies have been performed in ciliates. We compared the use of single‐cell WGS and bulk DNA WGS to detect genetic variation, specifically single nucleotide polymorphisms (SNPs), in the model ciliate Tetrahymena thermophila. Our analyses showed that (i) single‐cell WGS has excellent performance regarding mapping rate and genome coverage but lower sequencing uniformity compared with bulk DNA WGS due to amplification bias (which was reproducible); (ii) false‐positive SNP sites detected by single‐cell WGS tend to occur in genomic regions with particularly high sequencing depth and a high rate of C:G to T:A base changes; (iii) SNPs detected in three or more cells should be reliable (a detection efficiency of 83.4–97.4% was obtained for combined data from three cells). This analytical method could be adapted to measure genetic variation in other ciliates and broaden research into ciliate population genetics.

6.

Background

Less than two percent of the human genome is protein coding, yet that small fraction harbours the majority of known disease causing mutations. Despite rapidly falling whole genome sequencing (WGS) costs, much research and increasingly the clinical use of sequence data is likely to remain focused on the protein coding exome. We set out to quantify and understand how WGS compares with the targeted capture and sequencing of the exome (exome-seq), for the specific purpose of identifying single nucleotide polymorphisms (SNPs) in exome targeted regions.

Results

We have compared polymorphism detection sensitivity and systematic biases using a set of tissue samples that have been subject to both deep exome and whole genome sequencing. The scoring of detection sensitivity was based on sequence down sampling and reference to a set of gold-standard SNP calls for each sample. Despite evidence of incremental improvements in exome capture technology over time, whole genome sequencing has greater uniformity of sequence read coverage and reduced biases in the detection of non-reference alleles than exome-seq. Exome-seq achieves 95% SNP detection sensitivity at a mean on-target depth of 40 reads, whereas WGS only requires a mean of 14 reads. Known disease causing mutations are not biased towards easy or hard to sequence areas of the genome for either exome-seq or WGS.

Conclusions

From an economic perspective, WGS is at parity with exome-seq for variant detection in the targeted coding regions. WGS offers benefits in uniformity of read coverage and more balanced allele ratio calls, both of which can in most cases be offset by deeper exome-seq, with the caveat that some exome-seq targets will never achieve sufficient mapped read depth for variant detection due to technical difficulties or probe failures. As WGS is intrinsically richer data that can provide insight into polymorphisms outside coding regions and reveal genomic rearrangements, it is likely to progressively replace exome-seq for many applications.

Electronic supplementary material

The online version of this article (doi:10.1186/1471-2105-15-247) contains supplementary material, which is available to authorized users.

7.
Whole-genome sequencing (WGS) is becoming a fast and cost-effective method to pinpoint molecular lesions in mutagenized genetic model systems, such as Caenorhabditis elegans. As mutagenized strains contain a significant mutational load, it is often still necessary to map mutations to a chromosomal interval to elucidate which of the WGS-identified sequence variants is the phenotype-causing one. We describe here our experience in setting up and testing a simple strategy that incorporates a rapid SNP-based mapping step into the WGS procedure. In this strategy, a mutant retrieved from a genetic screen is crossed with a polymorphic C. elegans strain, individual F2 progeny from this cross are selected for the mutant phenotype, and the progeny of these F2 animals are pooled and then whole-genome-sequenced. The density of polymorphic SNP markers is decreased in the region of the phenotype-causing sequence variant and therefore enables its identification in the WGS data. As a proof of principle, we use this strategy to identify the molecular lesion in a mutant strain that produces an excess of dopaminergic neurons. We find that the molecular lesion resides in the Pax-6/Eyeless ortholog vab-3. The strategy described here will further reduce the time between mutant isolation and identification of the molecular lesion.
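The mapping step described above, locating the interval where polymorphic SNP markers become sparse, can be sketched as a simple windowed count. This is a simplified illustration and not the authors' pipeline: it assumes the polymorphic SNPs detected in the pooled sample are available as 0-based positions on one chromosome, and the function names and toy numbers are hypothetical.

```python
def snp_counts_per_window(positions, chrom_length, window):
    """Count detected polymorphic SNPs in consecutive fixed-size windows."""
    n_windows = (chrom_length + window - 1) // window  # ceiling division
    counts = [0] * n_windows
    for pos in positions:
        counts[pos // window] += 1
    return counts


def lowest_density_window(positions, chrom_length, window):
    """Return (start, end, count) of the window with the fewest SNPs.

    Ties are resolved in favour of the leftmost window.
    """
    counts = snp_counts_per_window(positions, chrom_length, window)
    idx = min(range(len(counts)), key=counts.__getitem__)
    return idx * window, min((idx + 1) * window, chrom_length), counts[idx]


# Toy chromosome of length 120 with no polymorphic SNPs between 40 and 80.
toy_positions = [5, 15, 25, 35, 95, 105, 115]
candidate = lowest_density_window(toy_positions, chrom_length=120, window=40)
```

In practice the final window of a chromosome may be shorter than the others, so counts would need normalizing by window length (or by expected marker density) before intervals are compared.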

8.
Genomics, 2022, 114(4): 110389
Disorders of sex development (DSDs) are congenital malformations defined as discrepancies between sex chromosomes and phenotypical sex. Testicular or ovotesticular XX DSDs are frequently observed in female dogs, while monogenic XY DSDs are less frequent. Here, we applied whole genome sequencing (WGS) to search for causative mutations in XX DSD females in French Bulldogs (FB) and American Staffordshire Terriers (AST) and in XY DSD Yorkshire Terriers (YT). The WGS results were validated by Sanger sequencing and ddPCR. It was shown that a missense SNP of the PADI6 gene is significantly associated with the XX DSD (SRY-negative) phenotype in AST (P = 0.0051) and FB (P = 0.0306). In contrast, we did not find any variant associated with XY DSD in YTs. Our study suggests that the genetic background of the XX DSD may be more complex and breed-specific.

9.
White blood cell count (WBC) is an important clinical marker that varies among different ethnic groups. African Americans are known to have a lower WBC than European Americans. We surveyed the entire genome for loci underlying this difference in WBC by using admixture mapping. We analyzed data from African American participants in the Health, Aging, and Body Composition Study and the Jackson Heart Study. Participants of both studies were genotyped across ≥ 1322 single nucleotide polymorphisms that were pre-selected to be informative for African versus European ancestry and span the entire genome. We used these markers to estimate genetic ancestry in each chromosomal region and then tested the association between WBC and genetic ancestry at each locus. We found a locus on chromosome 1q strongly associated with WBC (p < 10^-12). The strongest association was with a marker known to affect the expression of the Duffy blood group antigen. Participants who had both copies of the common West African allele had a mean WBC of 4.9 (SD 1.3); participants who had both common European alleles had a mean WBC of 7.1 (SD 1.3). This variant explained ~20% of population variation in WBC. We used admixture mapping, a novel method for conducting genetic-association studies, to find a region that was significantly associated with WBC on chromosome 1q. Additional studies are needed to determine the biological mechanism for this effect and its clinical implications.

10.
The determination of the relationship between a pair of individuals is a fundamental application of genetics. Previously, we and others have demonstrated that identity-by-descent (IBD) information generated from high-density single-nucleotide polymorphism (SNP) data can greatly improve the power and accuracy of genetic relationship detection. Whole-genome sequencing (WGS) marks the final step in increasing genetic marker density by assaying all single-nucleotide variants (SNVs), and thus has the potential to further improve relationship detection by enabling more accurate detection of IBD segments and more precise resolution of IBD segment boundaries. However, WGS introduces new complexities that must be addressed in order to achieve these improvements in relationship detection. To evaluate these complexities, we estimated genetic relationships from WGS data for 1490 known pairwise relationships among 258 individuals in 30 families along with 46 population samples as controls. We identified several genomic regions with excess pairwise IBD in both the pedigree and control datasets using three established IBD methods: GERMLINE, fastIBD, and ISCA. These spurious IBD segments produced a 10-fold increase in the rate of detected false-positive relationships among controls compared to high-density microarray datasets. To address this issue, we developed a new method to identify and mask genomic regions with excess IBD. This method, implemented in ERSA 2.0, fully resolved the inflated cryptic relationship detection rates while improving relationship estimation accuracy. ERSA 2.0 detected all 1st through 6th degree relationships, and 55% of 9th through 11th degree relationships in the 30 families. We estimate that WGS data provides a 5% to 15% increase in relationship detection power relative to high-density microarray data for distant relationships. 
Our results identify regions of the genome that are highly problematic for IBD mapping and introduce new software to accurately detect 1st through 9th degree relationships from whole-genome sequence data.

11.
Entropion is a known congenital disorder in sheep presumed to be heritable, but no causative genetic variant has been reported. Affected lambs show a variable inward rolling of the lower eyelids leading to blindness in severe cases. In Switzerland, the Swiss White Alpine (SWA) breed showed a significantly higher prevalence of entropion than other breeds. A GWAS using 150 SWA sheep (90 affected lambs and 60 controls), based on 600k SNP data, revealed a genome-wide significant signal on chromosome 15. The 0.2 Mb associated region contains two functional candidate genes, SMTNL1 and CTNND1. Pathogenic variants in human CTNND1 cause blepharocheilodontic syndrome 2, a rare disorder including eyelid anomalies, and SMTNL1 regulates contraction and relaxation of skeletal and smooth muscle. WGS of a single entropion-affected lamb revealed two private missense variants in SMTNL1 and CTNND1. Subsequent genotyping of both variants in 231 phenotyped SWA sheep was performed. The SMTNL1 variant p.(Asp452Asn) affects an evolutionarily conserved residue within an important domain and represents a rare allele, which also occurred in controls. The p.(Glu943Lys) variant in CTNND1 represents a common variant unlikely to cause entropion as the mutant allele occurred more frequently in non-affected sheep. Therefore, we propose that these protein-changing variants are unlikely to explain the phenotype. Additionally, WGS of three further discordant pairs of full siblings was carried out but revealed no obvious causative variant. Finally, we conclude that entropion represents a more complex disease caused by different non-coding regulatory variants.

12.
Accounting for historical demographic features, such as the strength and timing of gene flow and divergence times between closely related lineages, is vital for many inferences in evolutionary biology. Approximate Bayesian computation (ABC) is one method commonly used to estimate demographic parameters. However, the DNA sequences used as input for this method, often microsatellites or RADseq loci, usually represent a small fraction of the genome. Whole genome sequencing (WGS) data, on the other hand, have been used less often with ABC, and questions remain about the potential benefit of, and how to best implement, this type of data; we used pseudo‐observed data sets to explore such questions. Specifically, we addressed the potential improvements in parameter estimation accuracy that could be associated with WGS data in multiple contexts; namely, we quantified the effects of (a) more data, (b) haplotype‐based summary statistics, and (c) locus length. Compared with a hypothetical RADseq data set with 2.5 Mbp of data, using a 1 Gbp data set consisting of 100 Kbp sequences led to substantial gains in the accuracy of parameter estimates, which was mostly due to haplotype statistics and increased data. We also quantified the effects of including (a) locus‐specific recombination rates, and (b) background selection information in ABC analyses. Importantly, assuming uniform recombination or ignoring background selection had a negative effect on accuracy in many cases. Software and results from this method validation study should be useful for future demographic history analyses.

13.
A draft sequence of the chicken genome will be available by early 2004. This event conveniently marks the start of the second century of poultry genetics, coming 100 years after the use of the chicken to demonstrate Mendelian inheritance in animals by William Bateson. How will the second, post-genomic century of poultry genetics differ from the first? A whole genome shotgun (WGS) approach is being used to obtain the chicken sequence, with the goal of generating approximately six-fold coverage of the genome. Bacterial artificial chromosome (BAC) and fosmid clone end sequences, along with a BAC contig map integrated with genetic linkage and radiation hybrid maps, will form the platform for assembly of the WGS data. Rapid progress in global analysis of chicken gene expression patterns is also being made. Comparative genomics will link these new discoveries to the knowledge base for all other animal species. It's hoped that the genome sequence will also provide common ground on which to unite studies of the chicken as a model species with those aimed at agriculturally-relevant applications. The current status of chicken genomics will be assessed with projections for its near and long term future.

14.
As a result of improvements in genome assembly algorithms and the ever decreasing costs of high-throughput sequencing technologies, new high quality draft genome sequences are published at a striking pace. With well-established methodologies, larger and more complex genomes are being tackled, including polyploid plant genomes. Given the similarity between multiple copies of a basic genome in polyploid individuals, assembly of such data usually results in collapsed contigs that represent a variable number of homoeologous genomic regions. Unfortunately, such collapse is often not ideal, as keeping contigs separate can lead both to improved assembly and also insights about how haplotypes influence phenotype. Here, we describe a first step in avoiding inappropriate collapse during assembly. In particular, we describe ConPADE (Contig Ploidy and Allele Dosage Estimation), a probabilistic method that estimates the ploidy of any given contig/scaffold based on its allele proportions. In the process, we report findings regarding errors in sequencing. The method can be used for whole genome shotgun (WGS) sequencing data. We also show applicability of the method for variant calling and allele dosage estimation. Results for simulated and real datasets are discussed and provide evidence that ConPADE performs well as long as enough sequencing coverage is available, or the true contig ploidy is low. We show that ConPADE may also be used for related applications, such as the identification of duplicated genes in fragmented assemblies, although refinements are needed.

15.
Tularaemia, caused by the bacterium Francisella tularensis, is endemic in Sweden and is poorly understood. The aim of this study was to evaluate the effectiveness of three different genetic typing systems to link a genetic type to the source and place of tularaemia infection in Sweden. Canonical single nucleotide polymorphisms (canSNPs), MLVA including five variable number of tandem repeat loci and PmeI-PFGE were tested on 127 F. tularensis positive specimens collected from Swedish case-patients. All three typing methods identified two major genetic groups with near-perfect agreement. Higher genetic resolution was obtained with canSNP and MLVA compared to PFGE; F. tularensis samples were first assigned into ten phylogroups based on canSNPs followed by 33 unique MLVA types. Phylogroups were geographically analysed to reveal complex phylogeographic patterns in Sweden. The extensive phylogenetic diversity found within individual counties posed a challenge to linking specific genetic types with specific geographic locations. Despite this, a single phylogroup (B.22), defined by a SNP marker specific to a lone Swedish sequenced strain, did link genetic type with a likely geographic place. This result suggests that SNP markers, highly specific to a particular reference genome, may be found most frequently among samples recovered from the same location where the reference genome originated. This insight compels us to consider whole-genome sequencing (WGS) as the appropriate tool for effectively linking specific genetic type to geography. Comparing the WGS of an unknown sample to WGS databases of archived Swedish strains maximizes the likelihood of revealing those rare geographically informative SNPs.

16.
Restriction‐site associated DNA sequencing (RADSeq) facilitates rapid generation of thousands of genetic markers at relatively low cost; however, several sources of error specific to RADSeq methods often lead to biased estimates of allele frequencies and thereby to erroneous population genetic inference. Estimating the distribution of sample allele frequencies without calling genotypes was shown to improve population inference from whole genome sequencing data, but the ability of this approach to account for RADSeq‐specific biases remains unexplored. Here we assess to what extent genotype‐free methods of allele frequency estimation affect demographic inference from empirical RADSeq data. Using the well‐studied pied flycatcher (Ficedula hypoleuca) as a study system, we compare allele frequency estimation and demographic inference from whole genome sequencing data with that from RADSeq data matched for samples using both genotype‐based and genotype‐free methods. The demographic history of pied flycatchers as inferred from RADSeq data was highly congruent with that inferred from whole genome resequencing (WGS) data when allele frequencies were estimated directly from the read data. In contrast, when allele frequencies were derived from called genotypes, RADSeq‐based estimates of most model parameters fell outside the 95% confidence interval of estimates derived from WGS data. Notably, more stringent filtering of the genotype calls tended to increase the discrepancy between parameter estimates from WGS and RADSeq data. The results from this study demonstrate the ability of genotype‐free methods to improve allele frequency spectrum‐ (AFS‐) based demographic inference from empirical RADSeq data and highlight the need to account for uncertainty in NGS data regardless of sequencing method.

17.
Whole genome sequencing (WGS) is increasingly used to diagnose medical conditions of genetic origin. While both coding and non-coding DNA variants contribute to a wide range of diseases, most patients who receive a WGS-based diagnosis today harbour a protein-coding mutation. Functional interpretation and prioritization of non-coding variants represents a persistent challenge, and disease-causing non-coding variants remain largely unidentified. Depending on the disease, WGS fails to identify a candidate variant in 20–80% of patients, severely limiting the usefulness of sequencing for personalised medicine. Here we present FINSURF, a machine-learning approach to predict the functional impact of non-coding variants in regulatory regions. FINSURF outperforms state-of-the-art methods, owing in particular to optimized selection of control variants during training. In addition to ranking candidate variants, FINSURF breaks down the score for each variant into contributions from individual annotations, facilitating the evaluation of their functional relevance. We applied FINSURF to a diverse set of 30 diseases with described causative non-coding mutations, and correctly identified the disease-causative non-coding variant within the ten top hits in 22 cases. FINSURF is implemented as an online server as well as custom browser tracks, and provides a quick and efficient solution to prioritize candidate non-coding variants in realistic clinical settings.

18.
19.

Background

An understanding of linkage disequilibrium (LD) structures in the human genome underpins much of medical genetics and provides a basis for disease gene mapping and investigating biological mechanisms such as recombination and selection. Whole genome sequencing (WGS) provides the opportunity to determine LD structures at maximal resolution.

Results

We compare LD maps constructed from WGS data with LD maps produced from the array-based HapMap dataset, for representative European and African populations. WGS provides up to 5.7-fold greater SNP density than array-based data and achieves much greater resolution of LD structure, allowing for identification of up to 2.8-fold more regions of intense recombination. The absence of ascertainment bias in variant genotyping improves the population representativeness of the WGS maps, and highlights the extent of uncaptured variation using array genotyping methodologies. The complete capture of LD patterns using WGS allows for higher genome-wide association study (GWAS) power compared to array-based GWAS, with WGS also allowing for the analysis of rare variation. The impact of marker ascertainment issues in arrays has been greatest for Sub-Saharan African populations where larger sample sizes and substantially higher marker densities are required to fully resolve the LD structure.

Conclusions

WGS provides the best possible resource for LD mapping due to the maximal marker density and lack of ascertainment bias. WGS LD maps provide a rich resource for medical and population genetics studies. The increasing availability of WGS data for large populations will allow for improved research utilising LD, such as GWAS and recombination biology studies.

Electronic supplementary material

The online version of this article (doi:10.1186/s12864-015-1854-0) contains supplementary material, which is available to authorized users.

20.

Background

Understanding Mycobacterium tuberculosis (Mtb) transmission is essential to guide efficient tuberculosis control strategies. Traditional strain typing lacks sufficient discriminatory power to resolve large outbreaks. Here, we tested the potential of using next generation genome sequencing for identification of outbreak-related transmission chains.

Methods and Findings

During long-term (1997 to 2010) prospective population-based molecular epidemiological surveillance comprising a total of 2,301 patients, we identified a large outbreak caused by an Mtb strain of the Haarlem lineage. The main performance outcome measure of whole genome sequencing (WGS) analyses was the degree of correlation of the WGS analyses with contact tracing data and the spatio-temporal distribution of the outbreak cases. WGS analyses of the 86 isolates revealed 85 single nucleotide polymorphisms (SNPs), subdividing the outbreak into seven genome clusters (two to 24 isolates each), plus 36 unique SNP profiles. WGS results showed that the first outbreak isolates detected in 1997 were falsely clustered by classical genotyping. In 1998, one clone (termed “Hamburg clone”) started expanding, apparently independently from differences in the social environment of early cases. Genome-based clustering patterns were in better accordance with contact tracing data and the geographical distribution of the cases than clustering patterns based on classical genotyping. A maximum of three SNPs were identified in eight confirmed human-to-human transmission chains, involving 31 patients. We estimated the Mtb genome evolutionary rate at 0.4 mutations per genome per year. This rate suggests that Mtb grows in its natural host with a doubling time of approximately 22 h (400 generations per year). Based on the genome variation discovered, emergence of the Hamburg clone was dated back to a period between 1993 and 1997, hence shortly before the discovery of the outbreak through epidemiological surveillance.
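The generation-time arithmetic in the paragraph above can be checked with a short sketch. It assumes a 365-day year and uses only the figures quoted in the abstract; the per-generation mutation rate is a derived illustration, not a value reported by the authors, and the function names are hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: 365-day year, i.e. 8760 hours


def generations_per_year(doubling_time_h):
    """Number of doublings that fit in one year at a constant doubling time."""
    return HOURS_PER_YEAR / doubling_time_h


def mutations_per_generation(rate_per_year, doubling_time_h):
    """Convert a per-genome-per-year rate into a per-genome-per-generation rate."""
    return rate_per_year / generations_per_year(doubling_time_h)


gens = generations_per_year(22.0)                 # about 398, i.e. roughly 400
per_gen = mutations_per_generation(0.4, 22.0)     # about 0.001 per genome
print(f"{gens:.0f} generations/year, {per_gen:.4f} mutations/genome/generation")
```

Under these assumptions a 22 h doubling time gives about 398 generations per year, in line with the "400 generations per year" quoted above.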

Conclusions

Our findings suggest that WGS is superior to conventional genotyping for Mtb pathogen tracing and investigating micro-epidemics. WGS provides a measure of Mtb genome evolution over time in its natural host context. Please see later in the article for the Editors' Summary.
