Found 20 similar articles (search time: 0 ms)
1.
Various algorithms have been developed for variant calling using next-generation sequencing data, and various methods have been applied to reduce the associated false positive and false negative rates. Few variant calling programs, however, utilize the pedigree information when the family-based sequencing data are available. Here, we present a program, FamSeq, which reduces both false positive and false negative rates by incorporating the pedigree information from the Mendelian genetic model into variant calling. To accommodate variations in data complexity, FamSeq consists of four distinct implementations of the Mendelian genetic model: the Bayesian network algorithm, a graphics processing unit version of the Bayesian network algorithm, the Elston-Stewart algorithm and the Markov chain Monte Carlo algorithm. To make the software efficient and applicable to large families, we parallelized the Bayesian network algorithm that copes with pedigrees with inbreeding loops without losing calculation precision on an NVIDIA graphics processing unit. In order to compare the difference in the four methods, we applied FamSeq to pedigree sequencing data with family sizes that varied from 7 to 12. When there is no inbreeding loop in the pedigree, the Elston-Stewart algorithm gives analytical results in a short time. If there are inbreeding loops in the pedigree, we recommend the Bayesian network method, which provides exact answers. To improve the computing speed of the Bayesian network method, we parallelized the computation on a graphics processing unit. This allowed the Bayesian network method to process the whole genome sequencing data of a family of 12 individuals within two days, which was a 10-fold time reduction compared to the time required for this computation on a central processing unit.
This is a PLOS Computational Biology Software Article.
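To make the idea of pedigree-aware calling concrete, here is a minimal sketch for a single father-mother-child trio. It is not FamSeq's code: the function names, the Hardy-Weinberg prior, the allele frequency, and the genotype likelihoods are all assumptions chosen for illustration. It enumerates the 27 joint genotypes by brute force; the algorithms named in the abstract can be read as more scalable ways of organizing this kind of sum over larger pedigrees.

```python
from itertools import product

GENOTYPES = (0, 1, 2)  # number of alternate alleles carried


def hardy_weinberg_prior(alt_freq):
    """Founder genotype prior (assumption: Hardy-Weinberg proportions)."""
    p = alt_freq
    return {0: (1 - p) ** 2, 1: 2 * p * (1 - p), 2: p ** 2}


def transmission(child, father, mother):
    """P(child genotype | parental genotypes) under Mendelian inheritance."""
    # Probability that a parent with genotype g passes on the alternate allele.
    pass_alt = {0: 0.0, 1: 0.5, 2: 1.0}
    pf, pm = pass_alt[father], pass_alt[mother]
    return {
        0: (1 - pf) * (1 - pm),
        1: pf * (1 - pm) + (1 - pf) * pm,
        2: pf * pm,
    }[child]


def child_posterior(lik_father, lik_mother, lik_child, alt_freq):
    """Posterior over the child's genotype, summing over both parents."""
    prior = hardy_weinberg_prior(alt_freq)
    joint = {g: 0.0 for g in GENOTYPES}
    for f, m, c in product(GENOTYPES, repeat=3):
        joint[c] += (prior[f] * prior[m] * transmission(c, f, m)
                     * lik_father[f] * lik_mother[m] * lik_child[c])
    total = sum(joint.values())
    return {g: joint[g] / total for g in GENOTYPES}


# Hypothetical read-based likelihoods P(reads | genotype), not real data.
confident_hom_ref = {0: 0.99, 1: 0.01, 2: 1e-6}
ambiguous_child = {0: 0.5, 1: 0.5, 2: 1e-6}

posterior = child_posterior(confident_hom_ref, confident_hom_ref,
                            ambiguous_child, alt_freq=0.01)
```

In this toy case the child's own reads cannot separate genotype 0 from genotype 1, but two confidently homozygous-reference parents pull the posterior strongly toward genotype 0, which is the kind of false positive reduction the abstract describes.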
2.
3.
Shunichi Kosugi Satoshi Natsume Kentaro Yoshida Daniel MacLean Liliana Cano Sophien Kamoun Ryohei Terauchi 《PloS one》2013,8(10)
Accurate identification of DNA polymorphisms using next-generation sequencing technology is challenging because of a high rate of sequencing error and incorrect mapping of reads to reference genomes. Currently available short read aligners and DNA variant callers suffer from these problems. We developed the Coval software to improve the quality of short read alignments. Coval is designed to minimize the incidence of spurious alignment of short reads, by filtering mismatched reads that remained in alignments after local realignment and error correction of mismatched reads. The error correction is executed based on the base quality and allele frequency at the non-reference positions for an individual or pooled sample. We demonstrated the utility of Coval by applying it to simulated genomes and experimentally obtained short-read data of rice, nematode, and mouse. Moreover, we found an unexpectedly large number of incorrectly mapped reads in ‘targeted’ alignments, where the whole genome sequencing reads had been aligned to a local genomic segment, and showed that Coval effectively eliminated such spurious alignments. We conclude that Coval significantly improves the quality of short-read sequence alignments, thereby increasing the calling accuracy of currently available tools for SNP and indel identification. Coval is available at http://sourceforge.net/projects/coval105/.
4.
Tianshi Lu Seongoh Park James Zhu Yunguan Wang Xiaowei Zhan Xinlei Wang Li Wang Hao Zhu Tao Wang 《Cell reports》2021,34(1):108589
5.
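Adjacent inverted-repeat DNA fragments tend to form intrastrand secondary structures, which makes them a difficult template for sequencing. This study aimed to solve the problem of sequencing the inverted-repeat fragments inserted into RNAi vectors, laying a foundation for sequence verification of such vectors. An RNAi vector expressing tandem inverted-repeat fragments of wheat TaATG2 was constructed by conventional molecular cloning, and two strategies were designed to sequence-verify the vectors that had been preliminarily identified by colony PCR: the first sequenced the intact vector plasmid as the template; the second first digested the vector with restriction enzymes to excise one of the inverted-repeat fragments and then sequenced the linearized vector retaining the other fragment. The results showed that the first strategy was affected by the intrastrand secondary structure formed by the tandem inverted repeats: the sequencing signal decayed or produced disordered peaks at the inverted-repeat fragments, and the sequence could not be read. The second strategy eliminated the interference between the two inverted-repeat fragments, and the fragment retained on the vector gave a clear sequencing signal and an accurate sequence. With the approach of excising one fragment by restriction digestion before sequencing, two digestions and two sequencing runs are enough to sequence each of the two inverted-repeat fragments on the vector separately and thus confirm that the vector was constructed correctly.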
6.
7.
Jean-Michel Claverie 《Genomics》1994,23(3)
The random (shotgun) DNA sequencing strategy is used for most large-scale sequencing projects, including the identification of human disease genes after positional cloning. The principle of the method, sequence assembly from overlap, requires the candidate gene region to be partitioned into 15- to 20-kb pieces (usually λ inserts), themselves randomly subcloned into M13 prior to sequencing with a 6- to 8-fold redundancy. Most often, a time-consuming directed strategy must be invoked to close the remaining gaps. Ultimately, computer-based methods are invoked to locate putative coding exons within the finished genomic sequence. Given the small average size of vertebrate exons, I show here that they can be detected from the computer analysis of the individual runs, well before completion of contiguity. However, the successful assessment of coding potential from the raw data depends on a combination of new sequence masking techniques. When the identification of coding exons is the primary goal, the usual random sequencing strategy can thus be greatly optimized. The streamlined approach requires only a 2- to 2.5-fold sequencing redundancy, can dispense with the subcloning in λ and the closure of gaps, and can be fully automated. The feasibility of this strategy is demonstrated using data from the X-linked Kallmann syndrome gene region.
8.
Seunggeun Lee Tanya M. Teslovich Michael Boehnke Xihong Lin 《American journal of human genetics》2013,93(1):42-53
We propose a general statistical framework for meta-analysis of gene- or region-based multimarker rare variant association tests in sequencing association studies. In genome-wide association studies, single-marker meta-analysis has been widely used to increase statistical power by combining results via regression coefficients and standard errors from different studies. In analysis of rare variants in sequencing studies, region-based multimarker tests are often used to increase power. We propose meta-analysis methods for commonly used gene- or region-based rare variants tests, such as burden tests and variance component tests. Because estimation of regression coefficients of individual rare variants is often unstable or not feasible, the proposed method avoids this difficulty by calculating score statistics instead that only require fitting the null model for each study and then aggregating these score statistics across studies. Our proposed meta-analysis rare variant association tests are conducted based on study-specific summary statistics, specifically score statistics for each variant and between-variant covariance-type (linkage disequilibrium) relationship statistics for each gene or region. The proposed methods are able to incorporate different levels of heterogeneity of genetic effects across studies and are applicable to meta-analysis of multiple ancestry groups. We show that the proposed methods are essentially as powerful as joint analysis by directly pooling individual level genotype data. We conduct extensive simulations to evaluate the performance of our methods by varying levels of heterogeneity across studies, and we apply the proposed methods to meta-analysis of rare variant effects in a multicohort study of the genetics of blood lipid levels.
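The aggregation of study-level score statistics described above can be sketched as follows. This is a textbook-style burden score test under the assumption of homogeneous effects across studies, not necessarily the exact statistic used in the paper; the function name, the weights, and the toy numbers are all hypothetical. Each study contributes a vector of per-variant score statistics and a between-variant covariance matrix; these are summed across studies, collapsed with the weights, and compared against a chi-square distribution with one degree of freedom.

```python
import math


def meta_burden_test(scores, covariances, weights):
    """Burden-type score test from per-study summary statistics.

    scores:      list of per-study score vectors (one entry per variant)
    covariances: list of per-study variant-by-variant covariance matrices
    weights:     one weight per variant
    Returns (statistic, p_value); the statistic is treated as chi-square
    with 1 degree of freedom under the null.
    """
    m = len(weights)
    # Pool the summary statistics across studies.
    pooled_score = [sum(s[j] for s in scores) for j in range(m)]
    pooled_cov = [[sum(c[i][j] for c in covariances) for j in range(m)]
                  for i in range(m)]
    # Collapse variants into a single weighted burden score and its variance.
    burden = sum(weights[j] * pooled_score[j] for j in range(m))
    variance = sum(weights[i] * pooled_cov[i][j] * weights[j]
                   for i in range(m) for j in range(m))
    statistic = burden ** 2 / variance
    # Upper tail of chi-square(1): P(|Z| > sqrt(q)) = erfc(sqrt(q / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Two hypothetical studies, two variants each (made-up numbers).
toy_scores = [[1.0, 2.0], [2.0, 1.0]]
toy_covariances = [[[2.0, 0.5], [0.5, 2.0]],
                   [[2.0, 0.5], [0.5, 2.0]]]
statistic, p_value = meta_burden_test(toy_scores, toy_covariances, [1.0, 1.0])
```

Note that only the null model has to be fitted in each study to obtain these inputs, which is why no per-variant regression coefficients appear anywhere in the sketch.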
9.
Martina Mijušković Stuart M. Brown Zuojian Tang Cory R. Lindsay Efstratios Efstathiadis Ludovic Deriano David B. Roth 《PloS one》2012,7(10)
Defining the architecture of a specific cancer genome, including its structural variants, is essential for understanding tumor biology, mechanisms of oncogenesis, and for designing effective personalized therapies. Short read paired-end sequencing is currently the most sensitive method for detecting somatic mutations that arise during tumor development. However, mapping structural variants using this method leads to a large number of false positive calls, mostly due to the repetitive nature of the genome and the difficulty of assigning correct mapping positions to short reads. This study describes a method to efficiently identify large tumor-specific deletions, inversions, duplications and translocations from low coverage data using SVDetect or BreakDancer software and a set of novel filtering procedures designed to reduce false positive calls. Applying our method to a spontaneous T cell lymphoma arising in a core RAG2/p53-deficient mouse, we identified 40 validated tumor-specific structural rearrangements supported by as few as 2 independent read pairs.
10.
Qing Xie Qi Liu Fengbiao Mao Wanshi Cai Honghu Wu Mingcong You Zhen Wang Bingyu Chen Zhong Sheng Sun Jinyu Wu 《PLoS computational biology》2014,10(9)
High-throughput bisulfite sequencing technologies have provided a comprehensive and well-fitted way to investigate DNA methylation at single-base resolution. However, there are substantial bioinformatic challenges to distinguish precisely methylcytosines from unconverted cytosines based on bisulfite sequencing data. The challenges arise, at least in part, from cell heterozygosis caused by multicellular sequencing and the still limited number of statistical methods that are available for methylcytosine calling based on bisulfite sequencing data. Here, we present an algorithm, termed Bycom, a new Bayesian model that can perform methylcytosine calling with high accuracy. Bycom considers cell heterozygosis along with sequencing errors and bisulfite conversion efficiency to improve calling accuracy. Bycom performance was compared with the performance of Lister, the method most widely used to identify methylcytosines from bisulfite sequencing data. The results showed that the performance of Bycom was better than that of Lister for data with high methylation levels. Bycom also showed higher sensitivity and specificity for low methylation level samples (<1%) than Lister. A validation experiment based on reduced representation bisulfite sequencing data suggested that Bycom had a false positive rate of about 4% while maintaining an accuracy of close to 94%. This study demonstrated that Bycom had a low false calling rate at any methylation level and accurate methylcytosine calling at high methylation levels. Bycom will contribute significantly to studies aimed at recalibrating the methylation level of genomic regions based on the presence of methylcytosines.
11.
12.
13.
Rocco Piazza Vera Magistroni Alessandra Pirola Sara Redaelli Roberta Spinelli Serena Redaelli Marta Galbiati Simona Valletta Giovanni Giudici Giovanni Cazzaniga Carlo Gambacorti-Passerini 《PloS one》2013,8(10)
Copy number alterations (CNA) are common events occurring in leukaemias and solid tumors. Comparative Genome Hybridization (CGH) is currently the gold standard technique to analyze CNAs; however, CGH analysis requires dedicated instruments and is able to perform only low resolution Loss of Heterozygosity (LOH) analyses. Here we present CEQer (Comparative Exome Quantification analyzer), a new graphical, event-driven tool for CNA/allelic-imbalance (AI) coupled analysis of exome sequencing data. By using case-control matched exome data, CEQer performs a comparative digital exonic quantification to generate CNA data and couples this information with exome-wide LOH and allelic imbalance detection. This data is used to build mixed statistical/heuristic models allowing the identification of CNA/AI events. To test our tool, we initially used in silico generated data, then we performed whole-exome sequencing from 20 leukemic specimens and corresponding matched controls and we analyzed the results using CEQer. Taken globally, these analyses showed that the combined use of comparative digital exon quantification and LOH/AI allows generating very accurate CNA data. Therefore, we propose CEQer as an efficient, robust and user-friendly graphical tool for the identification of CNA/AI in the context of whole-exome sequencing data.
14.
Erin L. Crowgey Deborah L. Stabley Chuming Chen Hongzhan Huang Katherine M. Robbins Shawn W. Polson Katia Sol-Church Cathy H. Wu 《Journal of biomolecular techniques》2015,26(1):19-28
Next-generation sequencing (NGS) technologies provide the potential for developing high-throughput and low-cost platforms for clinical diagnostics. A limiting factor to clinical applications of genomic NGS is downstream bioinformatics analysis for data interpretation. We have developed an integrated approach for end-to-end clinical NGS data analysis from variant detection to functional profiling. Robust bioinformatics pipelines were implemented for genome alignment, single nucleotide polymorphism (SNP), small insertion/deletion (InDel), and copy number variation (CNV) detection of whole exome sequencing (WES) data from the Illumina platform. Quality-control metrics were analyzed at each step of the pipeline by use of a validated training dataset to ensure data integrity for clinical applications. We annotate the variants with data regarding the disease population and variant impact. Custom algorithms were developed to filter variants based on criteria, such as quality of variant, inheritance pattern, and impact of variant on protein function. The developed clinical variant pipeline links the identified rare variants to Integrated Genome Viewer for visualization in a genomic context and to the Protein Information Resource’s iProXpress for rich protein and disease information. With the application of our system of annotations, prioritizations, inheritance filters, and functional profiling and analysis, we have created a unique methodology for downstream variant filtering that empowers clinicians and researchers to interpret more effectively the relevance of genomic alterations within a rare genetic disease.
15.
Matthew Flickinger Goo Jun Gonçalo R. Abecasis Michael Boehnke Hyun Min Kang 《American journal of human genetics》2015,97(2):284-290
DNA sample contamination is a frequent problem in DNA sequencing studies and can result in genotyping errors and reduced power for association testing. We recently described methods to identify within-species DNA sample contamination based on sequencing read data, showed that our methods can reliably detect and estimate contamination levels as low as 1%, and suggested strategies to identify and remove contaminated samples from sequencing studies. Here we propose methods to model contamination during genotype calling as an alternative to removal of contaminated samples from further analyses. We compare our contamination-adjusted calls to calls that ignore contamination and to calls based on uncontaminated data. We demonstrate that, for moderate contamination levels (5%–20%), contamination-adjusted calls eliminate 48%–77% of the genotyping errors. For lower levels of contamination, our contamination correction methods produce genotypes nearly as accurate as those based on uncontaminated data. Our contamination correction methods are useful generally, but are particularly helpful for sample contamination levels from 2% to 20%.
16.
Philip C. Zuzarte Robert E. Denroche Gordon Fehringer Hagit Katzov-Eckert Rayjean J. Hung John D. McPherson 《PloS one》2014,9(4)
We describe a method for pooling and sequencing DNA from a large number of individual samples while preserving information regarding sample identity. DNA from 576 individuals was arranged into four 12 row by 12 column matrices and then pooled by row and by column resulting in 96 total pools with 12 individuals in each pool. Pooling of DNA was carried out in a two-dimensional fashion, such that DNA from each individual is present in exactly one row pool and exactly one column pool. By considering the variants observed in the rows and columns of a matrix we are able to trace rare variants back to the specific individuals that carry them. The pooled DNA samples were enriched over a 250 kb region previously identified by GWAS to significantly predispose individuals to lung cancer. All 96 pools (12 row and 12 column pools from 4 matrices) were barcoded and sequenced on an Illumina HiSeq 2000 instrument with an average depth of coverage greater than 4,000×. Verification based on Ion PGM sequencing confirmed the presence of 91.4% of confidently classified SNVs assayed. In this way, each individual sample is sequenced in multiple pools providing more accurate variant calling than a single pool or a multiplexed approach. This provides a powerful method for rare variant detection in regions of interest at a reduced cost to the researcher.
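The row-and-column decoding step can be illustrated with a short sketch. The sample naming scheme and the function name are hypothetical; the only facts taken from the abstract are the 12 by 12 layout and the rule that each individual sits in exactly one row pool and one column pool of its matrix. A variant seen in a single row pool and a single column pool points to one individual; when several rows or columns are positive, the intersections give a candidate set that may contain non-carriers.

```python
def candidate_carriers(positive_rows, positive_cols, layout):
    """Samples lying at the intersection of variant-positive row and column pools.

    layout[r][c] is the sample placed at row r, column c of one pooling matrix.
    """
    return [layout[r][c]
            for r in sorted(positive_rows)
            for c in sorted(positive_cols)]


# One 12 x 12 matrix with hypothetical sample identifiers S0 ... S143.
SIZE = 12
layout = [[f"S{r * SIZE + c}" for c in range(SIZE)] for r in range(SIZE)]

# A rare variant detected only in row pool 3 and column pool 7
# resolves to exactly one individual.
single = candidate_carriers({3}, {7}, layout)

# Two carriers at (1, 2) and (4, 5) light up rows {1, 4} and columns {2, 5},
# so four candidates are returned and two of them are not true carriers.
ambiguous = candidate_carriers({1, 4}, {2, 5}, layout)
```

The ambiguity in the second case is why the design suits rare variants: the rarer the allele, the more likely that each matrix contains at most one carrier.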
17.
Mutant screens have proven powerful for genetic dissection of a myriad of biological processes, but subsequent identification and isolation of the causative mutations are usually complex and time consuming. We have made the process easier by establishing a novel strategy that employs whole-genome sequencing to simultaneously map and identify mutations without the need for any prior genetic mapping.

The challenges posed by the identification of a causal mutation in a mutant of interest have in effect restricted the use of forward genetics to those organisms benefiting from a solid genetic toolbox. Whole-genome sequencing (WGS) is promising to revolutionize the way phenotypic traits are assigned to genes. However, current strategies to identify causal mutations using WGS require first the identification of an approximate genomic location containing the mutation of interest (Sarin et al. 2008; Smith et al. 2008; Srivatsan et al. 2008; Blumenstiel et al. 2009; Irvine et al. 2009). This is because genomes contain many natural sequence variations (Denver et al. 2004; Hillier et al. 2008; Sarin et al. 2010), which, along with mutagen-induced ones, complicate the identification of the causal mutation when an approximate genomic location has not been previously identified. Mapping has previously been achieved with time-consuming and laborious techniques that, in addition, rely on an organism's single-nucleotide polymorphism (SNP) map and established variant strains. For example, traditional SNP-based mapping (Wicks et al. 2001; Davis et al. 2005) has previously been used in Caenorhabditis elegans to narrow down the genomic region containing the mutation of interest, prior to conducting WGS (Sarin et al. 2008). In Arabidopsis, simultaneous SNP mapping and mutation identification has been achieved with WGS, but this requires the generation of a mapping population of up to 500 F2 progeny to identify only one allele (Schneeberger et al. 2009). This is a challenging prospect for many model systems. Indeed, if the mutant phenotype is subtle, the isolation of such numbers of recombinants is very tedious. Furthermore, it is not applicable in those organisms where a mapping population cannot be generated, simply because of a lack of intercrossable variants or because of life cycles (parasitic organisms, for example) that would make it extremely difficult to follow and isolate many recombinant individuals.

Here, we describe a strategy to simultaneously and rapidly locate and identify multiple mutations from a mutagenesis screen with WGS that circumvents these limitations. This powerful and straightforward method directly uses mutagen-induced nucleotide changes that are linked to the causal mutation to identify its specific genomic location, thus negating the construction of genetic mapping populations and subsequent mapping.

Treatment of organisms with a chemical mutagen induces nucleotide changes throughout the genome. Following mutagenesis, backcrossing or outcrossing of the mutagenized organism to unmutagenized counterparts is performed to eliminate mutagen-induced mutations (Figure 1A; supporting information, File S2). The phenotype-causing mutation remains as only backcrossed individuals showing the phenotype of interest are retained. In addition, mutagen-induced nucleotide changes that are genetically linked to the causal mutation and physically surround it on the chromosome will remain, in contrast to unlinked nucleotide changes (Figure 1A). As a result of this genetic linkage, a high-density cluster of typical mutagen-induced variants is visualized from sequence data obtained by WGS, which is positioned around the causal mutation. By locating such high-density regions, one maps the approximate genomic location of the causal mutation and subsequently identifies the affected gene within this region.

Figure 1. Mapping mutations on the basis of density of mutagen-induced DNA damage across the genome. (A) Visual representation of our WGS cloning strategy. Mutagen treatment induces point mutations throughout the genome (red asterisks). Backcrossing to the original unmutated parent strain removes much of the mutagen-induced nucleotide changes except for the causal mutation (green asterisk) and those genetically linked to it. WGS can be used to detect canonical mutagen-induced point mutations, thus revealing a physical position for the causal mutation. Shared background variants (yellow crosses) are filtered out from WGS data by comparing the sequences of mutants sequenced side-by-side, revealing a high-density variant cluster in only one genomic region. Importantly, genomic sequences of mutants derived from the same starting strain must be compared, to allow subtraction of nucleotide variants that are common to this particular strain, through sequence comparison. (B) Physical map of total nucleotide variations per megabase across the genome compared to the wild-type reference genome for each mutant (fp6, fp9, and fp12) after WGS. (C) After sequence quality filtering, subtraction of common variants between the 3 mutants, and filtering out noncanonical EMS nucleotide changes, high-density variant peaks are obtained in one genomic location for each mutant (red boxes). Steps 1 and 3 are essential for clear visualization of the high-density peaks whereas step 2 improves visualization. (D) Close-up of variants on chromosome III for fp6. Within this peak we identified only 6 candidate mutations that could potentially affect a protein sequence. We confirmed that the missense mutation in egl-5 was the causal mutation (Figure S2). For fp9 and fp12 we identified only 10 (9 missense and 1 3′-UTR) and 4 (2 premature stop and 2 missense) candidate mutations, respectively, within each mutant's EMS-based mapped region. Thus, our method consistently allowed precise mapping in 3 different mutants to a region small enough to contain only a handful of candidate mutations.

As a proof-of-principle, we simultaneously mapped and sequenced the causal mutations of multiple C. elegans mutants isolated from an EMS mutagenesis screen using this strategy. The mutagenesis screen itself was undertaken to identify genes that controlled the reprogramming of a single cell called Y into another cell called PDA during C. elegans development (Jarriault et al. 2008). After EMS treatment, three distinct mutant alleles (fp6, fp9, and fp12) were backcrossed to the original unmutagenized strain 4–6×. It is important to note that a backcrossing or outcrossing step is necessary for the analysis of mutants obtained from all mutagenesis screens, irrespective of the type of mutant identification strategy used or the type of mutagen or organism used (and, as such, does not represent an extra step introduced by our method). The mutants then underwent WGS side-by-side (Table S1, Table S2, Figure S1, and File S2). After alignment to the wild-type N2 reference genome using MAQgene software (Bigelow et al. 2009), the sequencing data obtained for each mutant were compared, and we subtracted common nucleotide variants that were shared between at least two of our three mutants (File S1). These shared variants, which are very unlikely to be either the causal mutation or EMS-induced mutations from the screen itself, represent strain differences between the N2 used to generate the reference genome and the PS3662 strain used here for mutagenesis. Note that this step eliminated ∼2000 point mutations as potential candidates for our causal mutation. This result strongly emphasizes the advantage of conducting WGS on two or more mutants side-by-side, as reference genomes may contain many nucleotide variations when compared to organisms sequenced from the laboratory (Denver et al. 2004; Hillier et al. 2008; Sarin et al. 2010; this study) and as such would confound mutation identification.

To identify EMS-induced changes linked to the causal mutation and expose its location, we looked only at variants that matched the canonical EMS-induced G/C > A/T transitions (Drake and Baltz 1976), revealing localized peaks of high-density variation on a single chromosome for each mutant (Figure 1, B and C). These peaks correspond to regions of high mutagen-induced damage that were not removed during backcrossing and therefore are most likely genetically linked to the causal mutation. We therefore focused our attention on these physical regions to identify candidate mutations within them. We localized fp6 to a 4.29-Mb region on chromosome III, fp9 to a 7.11-Mb region on chromosome X, and fp12 to a 1.28-Mb region on a different part of chromosome X (Figure 1C).

As a proof of principle, we further examined the nucleotide changes present in the interval to which fp6 was linked. Taking into consideration all variant types (point mutations and indels), we identified only six candidate mutations that potentially affected a gene's function (Figure 1D and Table S3). One of these, affecting the egl-5/hox gene, lies almost perfectly in the middle of the predicted EMS-based mapped region. We confirmed the existence of the mutation in egl-5 by manual resequencing. Both egl-5 targeted RNAi and noncomplementation with the egl-5(n945) null allele confirmed that fp6 affected egl-5 and caused the Y-to-PDA reprogramming defect (Figure S2). fp9 and fp12 each map to distinct regions on chromosome X that also contain only a handful of candidate mutations (10 and 4, respectively) (Figure 1C). Thus, our method consistently allowed precise mapping in 3 different mutants to a region small enough to contain only a handful of candidate mutations and subsequent identification of the causal mutation.

We calculated that comparison of WGS data for only two mutants of the same mutagenesis screen is sufficient to localize and sequence the causal mutation (Table S4). Thirteen times sequence coverage has been found to be sufficient to identify a mutation in a pre-SNP mapped C. elegans mutant (Shen et al. 2008). Here, we tested the sequence coverage necessary to perform simultaneous mapping and mutant identification using our strategy and found that 13× was more than enough (Table S4). In addition, by performing longer reads and/or paired-end sequencing, our method can be scaled up to bigger genomes or allow multiple mutant sequencing on each flow cell lane [e.g., using multiplex WGS (Cronn et al. 2008)]. Furthermore, because direct sequence comparison is ultimately made between two mutants sequenced side-by-side, the quality of an organism's reference genome (which is used only for alignment purposes) does not have a bearing on the mapping or mutant identification outcome. Moreover, recent advances in de novo alignment of short reads generated from next generation sequencing platforms (Li et al. 2010; Nowrousian et al. 2010; Webb and Rosenthal 2010; Young et al. 2010) suggest that a reference genome may not even be required to perform mutagen-based mapping and mutant identification with WGS. We predict that technical advances in these areas will make it possible to perform mutagenesis screens on any nonsequenced and genetically uncharacterized organism and use our strategy to quickly identify the causal mutation of an interesting mutant.

We found that all of the minimal requirements tested here were more than adequate to use our mapping strategy. Therefore, it is possible that fewer backcrosses and less sequencing coverage may suffice than is shown here. For example, for genomes with a similar size to C. elegans (∼100 Mb), this method can easily be scaled up by sequencing eight mutants per flow cell. As for any WGS experiments, total cost depends on genome size.

By eliminating any prior work except for back/outcrossing, a necessary step for any mutant characterization, our simple and quick strategy provides a significant saving of time and labor as the time needed to map and identify a candidate causal mutation is trimmed down to the sequencing time (currently 7 days) and sequence analysis time (<1 day; see Table 1).
TABLE 1
Summary of WGS cloning strategy

| | Conditions used | Minimal requirements tested |
|---|---|---|
| Backcrossing | 4–6× | 4× enough |
| No. of mutants sequenced | 3 | 2 enough |
| Sequencing of mutant | 2× flow cell lanes, paired-end reads (57mer) | 1× flow cell lane enough, single-end reads (57mer) enough |
| Average sequence coverage | 52.2–55.3× | 13.6× enough |
| Advantages | | |
| Any SNP or genetic map information is not necessary | | |
| No prior wet lab work necessary: generation of a recombinant mapping population is not necessary | | |
| Multiple alleles identified at once | | |
| Amenable to scaling up: can be equally used for bigger genomes | | |
| Fast: 7 days sequencing, 12 hr MAQGene alignment, and 1 hr mapping | | |
| Modest sequence coverage requirements limit cost | | |
| Reference genome sequence quality is not important and may not even be necessary | | |
| Very straightforward without any specialized software | | |
| Requirement | | |
| Species must be amenable to mutagenesis and backcrossing | | |
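The filtering and mapping steps described in the article (subtract variants shared between mutants, keep only canonical EMS-type changes, then look for the densest region) can be sketched as below. This is an illustrative reconstruction, not the authors' pipeline: the function names, the tuple layout `(chromosome, position, ref, alt)`, the 1-Mb bin size, the mutant labels, and every coordinate in the toy data are assumptions made for the example.

```python
from collections import Counter

# Canonical EMS-induced changes as read on the reference strand: G>A or C>T.
EMS_TRANSITIONS = {("G", "A"), ("C", "T")}


def private_ems_variants(variants_by_mutant):
    """Drop variants seen in two or more mutants, then keep EMS-type changes."""
    times_seen = Counter()
    for variants in variants_by_mutant.values():
        times_seen.update(set(variants))
    return {
        mutant: sorted(
            v for v in set(variants)
            if times_seen[v] == 1 and (v[2], v[3]) in EMS_TRANSITIONS
        )
        for mutant, variants in variants_by_mutant.items()
    }


def densest_bin(variants, bin_size=1_000_000):
    """Return ((chromosome, bin index), count) for the most variant-dense bin."""
    density = Counter((chrom, pos // bin_size)
                      for chrom, pos, _ref, _alt in variants)
    return density.most_common(1)[0]


# Invented variant calls for two hypothetical mutants from the same screen.
shared_background = ("II", 1_000_000, "C", "T")
toy_calls = {
    "mutant_a": [("III", 7_100_000, "G", "A"), ("III", 7_400_000, "C", "T"),
                 ("III", 7_900_000, "G", "A"), ("I", 2_000_000, "G", "A"),
                 ("X", 5_000_000, "A", "G"), shared_background],
    "mutant_b": [("X", 3_200_000, "C", "T"), ("X", 3_600_000, "G", "A"),
                 ("V", 9_000_000, "T", "C"), shared_background],
}

filtered = private_ems_variants(toy_calls)
peaks = {mutant: densest_bin(calls) for mutant, calls in filtered.items()}
```

With real data one would inspect the whole density profile rather than a single maximum, since the linked interval in the article spans several megabases; `densest_bin` also raises `IndexError` if a mutant has no variants left after filtering.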
18.
Yi-Juan Hu Peizhou Liao H. Richard Johnston Andrew S. Allen Glen A. Satten 《PLoS genetics》2016,12(5)
Next-generation sequencing of DNA provides an unprecedented opportunity to discover rare genetic variants associated with complex diseases and traits. However, the common practice of first calling underlying genotypes and then treating the called values as known is prone to false positive findings, especially when genotyping errors are systematically different between cases and controls. This happens whenever cases and controls are sequenced at different depths, on different platforms, or in different batches. In this article, we provide a likelihood-based approach to testing rare variant associations that directly models sequencing reads without calling genotypes. We consider the (weighted) burden test statistic, which is the (weighted) sum of the score statistic for assessing effects of individual variants on the trait of interest. Because variant locations are unknown, we develop a simple, computationally efficient screening algorithm to estimate the loci that are variants. Because our burden statistic may not have mean zero after screening, we develop a novel bootstrap procedure for assessing the significance of the burden statistic. We demonstrate through extensive simulation studies that the proposed tests are robust to a wide range of differential sequencing qualities between cases and controls, and are at least as powerful as the standard genotype calling approach when the latter controls type I error. An application to the UK10K data reveals novel rare variants in gene BTBD18 associated with childhood onset obesity. The relevant software is freely available.
19.
20.
Genome-wide association studies have usually been analyzed in a univariate manner. The commonly used univariate tests have one degree of freedom and assume an additive mode of inheritance. The experiment-wise significance of these univariate statistics is obtained by adjusting for multiple testing. Next generation sequencing studies, which assay 10–20 million variants, are beginning to come online. For these studies, the strategy of additive univariate testing and multiple testing adjustment is likely to result in a loss of power due to (1) the substantial multiple testing burden and (2) the possibility of a non-additive causal mode of inheritance. To reduce the power loss, we propose a new method (1) to summarize in a single statistic the strength of the association signals coming from all not-very-rare variants in a linkage disequilibrium block and (2) to incorporate, in any linkage disequilibrium block statistic, the strength of the association signals under multiple modes of inheritance. The proposed linkage disequilibrium block test consists of the sum of squares of nominally significant univariate statistics. We compare the performance of this method to the performance of existing linkage disequilibrium block/gene-based methods. Simulations show that (1) extending methods to combine testing for multiple modes of inheritance leads to substantial power gains, especially for a recessive mode of inheritance, and (2) the proposed method has a good overall performance. Based on simulation results, we provide practical advice on choosing suitable methods for applied analyses.
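The block statistic named in the abstract, a sum of squares of nominally significant univariate statistics, can be written down in a few lines. The sketch below assumes the univariate statistics are z-scores and that "nominally significant" means a two-sided p-value below 0.05 (|z| ≥ 1.96); both choices, the function name, and the toy values are assumptions, since the abstract does not fix them.

```python
def ld_block_statistic(z_scores, threshold=1.96):
    """Sum of squared univariate z-scores that pass a nominal threshold."""
    return sum(z * z for z in z_scores if abs(z) >= threshold)


# Hypothetical z-scores for four variants in one linkage disequilibrium block.
toy_z = [0.5, 2.0, -3.0, 1.0]
block_stat = ld_block_statistic(toy_z)
```

Because variants are selected on their own significance and are correlated within the block, this sum does not follow a standard chi-square distribution; its significance would have to be assessed by some other route, such as permutation, which is not shown here.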