Similar literature
1.
Whole genome sequencing studies are essential to obtain a comprehensive understanding of the vast pattern of human genomic variations. Here we report the results of a high-coverage whole genome sequencing study for 44 unrelated healthy Caucasian adults, each sequenced to over 50-fold coverage (averaging 65.8×). We identified approximately 11 million single nucleotide polymorphisms (SNPs), 2.8 million short insertions and deletions, and over 500,000 block substitutions. We showed that, although previous studies, including the 1000 Genomes Project Phase 1 study, have catalogued the vast majority of common SNPs, many of the low-frequency and rare variants remain undiscovered. For instance, approximately 1.4 million SNPs and 1.3 million short indels that we found were novel to both the dbSNP and the 1000 Genomes Project Phase 1 data sets, the majority of which (∼96%) have a minor allele frequency less than 5%. On average, each individual genome carried ∼3.3 million SNPs and ∼492,000 indels/block substitutions, including approximately 179 variants that were predicted to cause loss of function of the gene products. Moreover, each individual genome carried an average of 44 such loss-of-function variants in a homozygous state, which would completely “knock out” the corresponding genes. Across all the 44 genomes, a total of 182 genes were “knocked-out” in at least one individual genome, among which 46 genes were “knocked out” in over 30% of our samples, suggesting that a number of genes are commonly “knocked-out” in general populations. Gene ontology analysis suggested that these commonly “knocked-out” genes are enriched in biological processes related to antigen processing and immune response. Our results contribute towards a comprehensive characterization of human genomic variation, especially for less-common and rare variants, and provide an invaluable resource for future genetic studies of human variation and diseases.

2.
Genome and exome sequencing yield extensive catalogues of human genetic variation. However, pinpointing the few phenotypically causal variants among the many variants present in human genomes remains a major challenge, particularly for rare and complex traits wherein genetic information alone is often insufficient. Here, we review approaches to estimate the deleteriousness of single nucleotide variants (SNVs), which can be used to prioritize disease-causal variants. We describe recent advances in comparative and functional genomics that enable systematic annotation of both coding and non-coding variants. Application and optimization of these methods will be essential to find the genetic answers that sequencing promises to hide in plain sight.

3.
Yang H, Wu Y, Feng J, Yang S, Tian D. Genomics, 2009, 93(1):90-97
Mutations, which can alter amino acid constitution, contribute greatly to protein evolution. However, little is reported of their pattern during protein structural evolution. We investigated the distribution of non-synonymous single nucleotide polymorphisms (nsSNPs) and insertions/deletions (indels) along mammal and fruit fly proteins. We found the nsSNPs (and d(N)) and indels increased in protein boundary regions, and this pattern is inversely correlated with the distribution of protein domain density. Additionally, synonymous substitutions (and d(S)) are reduced in 5' and 3' regions, indicating more variable protein boundaries, compared with the central interior. All evidence suggests that the inner part of coding sequences (CDSs) is comparatively conserved, whereas the 5' and 3' regions, with higher evolution rates, are more variable. We assumed that due to greater frequencies of nsSNPs and indels in adaptive regions of CDSs it could be easier to ultimately alter, gain, or lose amino acids, thus becoming the front line of protein evolution.

4.
Recent advances in genomics technologies have spurred unprecedented efforts in genome and exome re-sequencing aiming to unravel the genetic component of rare and complex disorders. While in rare disorders this allowed the identification of novel causal genes, the missing heritability paradox in complex diseases remains so far elusive. Despite rapid advances of next-generation sequencing, both the technology and the analysis of the data it produces are in their infancy. At present there is abundant knowledge pertaining to the role of rare single nucleotide variants (SNVs) in rare disorders and of common SNVs in common disorders. Although the 1000 Genomes Project has clearly highlighted the prevalence of rare variants and more complex variants (e.g. insertions, deletions), their role in disease is as yet far from elucidated. We set out to analyse the properties of sequence variants identified in a comprehensive collection of exome re-sequencing studies performed on samples from patients affected by a broad range of complex and rare diseases (N = 173). Given the known potential for Loss of Function (LoF) variants to be false positive, we performed an extensive validation of the common, rare and private LoF variants identified, which indicated that most of the private and rare variants identified were indeed true, while common novel variants had a significantly higher false positive rate. Our results indicated a strong enrichment of very low-frequency insertion/deletion variants, so far under-investigated, which might be difficult to capture with low coverage and imputation approaches and for which most study designs would be under-powered. These insertions and deletions might play a significant role in disease genetics, contributing specifically to the underlying rare and private variation predicted to be discovered through next generation sequencing.

5.
The genetic architecture of ischemic stroke is complex and is likely to include rare or low frequency variants with high penetrance and large effect sizes. Such variants are likely to provide important insights into disease pathogenesis compared to common variants with small effect sizes. Because a significant portion of human functional variation may derive from the protein-coding portion of genes, we undertook a pilot study to identify variation across the human exome (i.e., the coding exons across the entire human genome) in 10 ischemic stroke cases. Our efforts focused on evaluating the feasibility and identifying the difficulties in this type of research as it applies to ischemic stroke. The cases included 8 African-Americans and 2 Caucasians selected on the basis of similar stroke subtypes and by implementing a case selection algorithm that emphasized the genetic contribution to stroke risk. Following construction of paired-end sequencing libraries, all predicted human exons in each sample were captured and sequenced. Sequencing generated an average of 25.5 million read pairs (75 bp×2) and 3.8 Gbp per sample. After passing quality filters, screening the exomes against dbSNP demonstrated an average of 2839 novel SNPs among African-Americans and 1105 among Caucasians. In an aggregate analysis, 48 genes were identified to have at least one rare variant across all stroke cases. One gene, CSN3, identified by screening our prior GWAS results in conjunction with our exome results, was found to contain an interesting coding polymorphism as well as containing excess rare variation as compared with the other genes evaluated. In conclusion, while rare coding variants may predispose to the risk of ischemic stroke, this fact has yet to be definitively proven. Our study demonstrates the complexities of such research and highlights that while exome data can be obtained, the optimal analytical methods have yet to be determined.
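
The reported per-sample yield can be checked against the read counts with a few lines of arithmetic. The following is only an illustrative sketch: the function name `total_bases` is hypothetical, and it assumes every read is exactly 75 bp, with no trimming or filtering.

```python
def total_bases(read_pairs: int, read_length_bp: int) -> int:
    """Bases produced by paired-end sequencing: two reads per pair."""
    return read_pairs * 2 * read_length_bp


# Figures taken from the abstract: 25.5 million read pairs, 75 bp x 2.
yield_bp = total_bases(read_pairs=25_500_000, read_length_bp=75)
yield_gbp = yield_bp / 1e9
print(f"{yield_gbp:.3f} Gbp per sample")
```

Under these assumptions the result is about 3.8 Gbp, in line with the figure quoted in the abstract.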

6.
The domestic dog serves as an excellent model to investigate the genetic basis of disease. More than 400 heritable traits analogous to human diseases have been described in dogs. To further canine medical genetics research, we established the Dog Biomedical Variant Database Consortium (DBVDC) and present a comprehensive list of functionally annotated genome variants that were identified with whole genome sequencing of 582 dogs from 126 breeds and eight wolves. The genomes used in the study have a minimum coverage of 10× and an average coverage of ~24×. In total, we identified 23 133 692 single‐nucleotide variants (SNVs) and 10 048 038 short indels, including 93% undescribed variants. On average, each individual dog genome carried ~4.1 million single‐nucleotide and ~1.4 million short‐indel variants with respect to the reference genome assembly. About 2% of the variants were located in coding regions of annotated genes and loci. Variant effect classification showed 247 141 SNVs and 99 562 short indels having moderate or high impact on 11 267 protein‐coding genes. On average, each genome contained heterozygous loss‐of‐function variants in 30 potentially embryonic lethal genes and 97 genes associated with developmental disorders. More than 50 inherited disorders and traits have been unravelled using the DBVDC variant catalogue, enabling genetic testing for breeding and diagnostics. This resource of annotated variants and their corresponding genotype frequencies constitutes a highly useful tool for the identification of potential variants causative for rare inherited disorders in dogs.

7.
We report an algorithm to detect structural variation and indels from 1 base pair (bp) to 1 Mbp within exome sequence data sets. Splitread uses one end-anchored placements to cluster the mappings of subsequences of unanchored ends to identify the size, content and location of variants with high specificity and sensitivity. The algorithm discovers indels, structural variants, de novo events and copy number-polymorphic processed pseudogenes missed by other methods.

8.
The diploid genome sequence of an individual human   Cited by: 4 (self-citations: 1, citations by others: 3)
Presented here is a genome sequence of an individual human. It was produced from ∼32 million random DNA fragments, sequenced by Sanger dideoxy technology and assembled into 4,528 scaffolds, comprising 2,810 million bases (Mb) of contiguous sequence with approximately 7.5-fold coverage for any given region. We developed a modified version of the Celera assembler to facilitate the identification and comparison of alternate alleles within this individual diploid genome. Comparison of this genome and the National Center for Biotechnology Information human reference assembly revealed more than 4.1 million DNA variants, encompassing 12.3 Mb. These variants (of which 1,288,319 were novel) included 3,213,401 single nucleotide polymorphisms (SNPs), 53,823 block substitutions (2–206 bp), 292,102 heterozygous insertion/deletion events (indels) (1–571 bp), 559,473 homozygous indels (1–82,711 bp), 90 inversions, as well as numerous segmental duplications and copy number variation regions. Non-SNP DNA variation accounts for 22% of all events identified in the donor; however, these events involve 74% of all variant bases. This suggests an important role for non-SNP genetic alterations in defining the diploid genome structure. Moreover, 44% of genes were heterozygous for one or more variants. Using a novel haplotype assembly strategy, we were able to span 1.5 Gb of genome sequence in segments >200 kb, providing further precision to the diploid nature of the genome. These data depict a definitive molecular portrait of a diploid human genome that provides a starting point for future genome comparisons and enables an era of individualized genomic information.
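
The 22% figure for non-SNP events can be reproduced from the counts listed in the abstract. The sketch below is illustrative only: the dictionary keys and the function name `non_snp_share` are hypothetical, and it assumes that only the five categories with explicit counts contribute to the total (segmental duplications and copy number variation regions are left out because no counts are given for them).

```python
# Variant event counts as reported in the abstract above.
counts = {
    "snp": 3_213_401,
    "block_substitution": 53_823,
    "heterozygous_indel": 292_102,
    "homozygous_indel": 559_473,
    "inversion": 90,
}


def non_snp_share(variant_counts: dict) -> float:
    """Return the fraction of variant events that are not SNPs."""
    total = sum(variant_counts.values())
    non_snp = total - variant_counts["snp"]
    return non_snp / total


total_events = sum(counts.values())
share = non_snp_share(counts)
print(f"{total_events:,} events, {share:.1%} non-SNP")
```

With these counts the total comes to a little over 4.1 million events, and the non-SNP share rounds to 22%, consistent with the abstract.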

9.
10.

Background

Several genomes have now been sequenced, with millions of genetic variants annotated. While significant progress has been made in mapping single nucleotide polymorphisms (SNPs) and small (<10 bp) insertion/deletions (indels), the annotation of larger structural variants has been less comprehensive. It is still unclear to what extent a typical genome differs from the reference assembly, and the analysis of the genomes sequenced to date has shown varying results for copy number variation (CNV) and inversions.

Results

We have combined computational re-analysis of existing whole genome sequence data with novel microarray-based analysis, and detect 12,178 structural variants covering 40.6 Mb that were not reported in the initial sequencing of the first published personal genome. We estimate a total non-SNP variation content of 48.8 Mb in a single genome. Our results indicate that this genome differs from the consensus reference sequence by approximately 1.2% when considering indels/CNVs, 0.1% by SNPs and approximately 0.3% by inversions. The structural variants impact 4,867 genes, and >24% of structural variants would not be imputed by SNP-association.

Conclusions

Our results indicate that a large number of structural variants have been unreported in the individual genomes published to date. This significant extent and complexity of structural variants, as well as the growing recognition of their medical relevance, necessitate that they be actively studied in health-related analyses of personal genomes. The new catalogue of structural variants generated for this genome provides a crucial resource for future comparison studies.

11.
Here we present an adaptation of NimbleGen 2.1M-probe array sequence capture for whole exome sequencing using the Illumina Genome Analyzer (GA) platform. The protocol involves two-stage library construction. The specificity of exome enrichment was approximately 80% with 95.6% even coverage of the 34 Mb target region at an average sequencing depth of 33-fold. Comparison of our results with whole genome shotgun resequencing results showed that the exome SNP calls gave only 0.97% false positive and 6.27% false negative variants. Our protocol is also well suited for use with whole genome amplified DNA. The results presented here indicate that there is a promising future for large-scale population genomics and medical studies using a whole exome sequencing approach.

12.
Over the next few years, the efficient use of next-generation sequencing (NGS) in human genetics research will depend heavily upon effective mechanisms for the selective enrichment of genomic regions of interest. Recently, comprehensive exome capture arrays have become available for targeting approximately 33 Mb or ∼180,000 coding exons across the human genome. Selective genomic enrichment of the human exome offers an attractive option for new experimental designs aiming to quickly identify potential disease-associated genetic variants, especially in family-based studies. We have evaluated a 2.1 M feature human exome capture array on eight individuals from a three-generation family pedigree. We were able to cover up to 98% of the targeted bases at a long-read sequence read depth of ≥3, 86% at a read depth of ≥10, and over 50% of all targets were covered with ≥20 reads. We identified up to 14,284 SNPs and small indels per individual exome, with up to 1,679 of these representing putative novel polymorphisms. Applying the conservative genotype calling approach HCDiff, the average rate of detection of a variant allele based on Illumina 1 M BeadChips genotypes was 95.2% at ≥10x sequence. Further, we propose an advantageous genotype calling strategy for low-coverage targets that empirically determines cut-off thresholds at a given coverage depth based on existing genotype data. Application of this method was able to detect >99% of SNPs covered ≥8x. Our results offer guidance for “real-world” applications in human genetics and provide further evidence that microarray-based exome capture is an efficient and reliable method to enrich for chromosomal regions of interest in next-generation sequencing experiments.

13.
Exome sequencing - the targeted sequencing of the subset of the human genome that is protein coding - is a powerful and cost-effective new tool for dissecting the genetic basis of diseases and traits that have proved to be intractable to conventional gene-discovery strategies. Over the past 2 years, experimental and analytical approaches relating to exome sequencing have established a rich framework for discovering the genes underlying unsolved Mendelian disorders. Additionally, exome sequencing is being adapted to explore the extent to which rare alleles explain the heritability of complex diseases and health-related traits. These advances also set the stage for applying exome and whole-genome sequencing to facilitate clinical diagnosis and personalized disease-risk profiling.

14.
PLoS ONE, 2014, 9(8)
The genetic sequence variation of people from the Indian subcontinent, who comprise one-quarter of the world's population, is not well described. We carried out whole genome sequencing of 168 South Asians, along with whole-exome sequencing of 147 South Asians to provide deeper characterisation of coding regions. We identify 12,962,155 autosomal sequence variants, including 2,946,861 new SNPs and 312,738 novel indels. This catalogue of SNPs and indels amongst South Asians provides the first comprehensive map of genetic variation in this major human population, and reveals evidence for selective pressures on genes involved in skin biology, metabolism, infection and immunity. Our results will accelerate the search for the genetic variants underlying susceptibility to disorders such as type-2 diabetes and cardiovascular disease which are highly prevalent amongst South Asians.

15.
Genomics, 2020, 112(5):3722-3728
Whole exome sequencing is an adept method to reveal novel and disease-related SNPs and INDELs as it screens the actionable areas of the genome. We evaluated the exome sequenced datasets of patients with Parkinson's disease (PD) of South African ethnic origin. The primary focus of this study was to discover the SNP and INDEL patterns responsible for PD. The variant discovery was performed with the Genome Analysis Toolkit (GATK) best practices variant detection pipelines. The SNPs were linked to the genes and categorized based on the filter-based annotation from ANNOVAR. We identified a total of 7955 SNPs and 9952 INDELs in all seven datasets together. A total of 130 missense nsSNPs were prioritized based on their damaging effect predicted from SIFT and Polyphen2 annotation. We noticed a novel nsSNP rs111655870 in gene LRRK2 that shows the mutation of a Leucine to Phenylalanine at position 208 which can alter the protein function. The study also filtered seven nsSNPs in genes NAGA, SULT4A1, MYH8, FLNA, TPM3, ATP13A1, CLN8 that have potentially deleterious effects predicted by various computational tools. This analysis suggested that the above filtered nsSNPs and INDELs have a functional impact and provide the footing for genetic studies related to PD. Further screening of these variations provides deeper insight into the molecular mechanism of disease progression.

16.
Whole genome sequencing of buffalo is yet to be completed, and in the near future it may not be possible to identify an exome (coding region of genome) through bioinformatics for designing probes to capture it. In the present study, we employed in solution hybridization to sequence tissue specific temporal exomes (TST exome) in buffalo. We utilized cDNA prepared from buffalo muscle tissue as a probe to capture TST exomes from the buffalo genome. This resulted in a prominent reduction of repeat sequences (up to 40%) and an enrichment of coding sequences (up to 60%). Enriched targets were sequenced on a 454 pyro-sequencing platform, generating 101,244 reads containing 24,127,779 high quality bases. The data revealed 40,100 variations, of which 403 were indels and 39,218 SNPs containing 195 nonsynonymous candidate SNPs in protein-coding regions. The study has indicated that 80% of the total genes identified from capture data were expressed in muscle tissue. The present study is the first of its kind to sequence TST exomes captured by use of cDNA molecules for SNPs found in the coding region without any prior sequence information of targeted molecules.

17.
Proteins are under selection to maintain central functions and to accommodate needs that arise in ever‐changing environments. The positive selection and neutral drift that preserve functions result in a diversity of protein variants. The amount of diversity differs between proteins: multifunctional or disease‐related proteins tend to have fewer variants than proteins involved in some aspects of immunity. Our work focuses on the extensively studied protein Vitellogenin (Vg), which in honey bees (Apis mellifera) is multifunctional and highly expressed and plays roles in immunity. Yet, almost nothing is known about the natural variation in the coding sequences of this protein or how amino acid‐altering variants might impact structure–function relationships. Here, we map out allelic variation in honey bee Vg using biological samples from 15 countries. The successful barcoded amplicon Nanopore sequencing of 543 bees revealed 121 protein variants, indicating a high level of diversity in Vg. We find that the distribution of non‐synonymous single nucleotide polymorphisms (nsSNPs) differs between protein regions with different functions; domains involved in DNA and protein–protein interactions contain fewer nsSNPs than the protein's lipid binding cavities. We outline how the central functions of the protein can be maintained in different variants and how the variation pattern may inform about selection from pathogens and nutrition.

18.
The main objectives of this study were to identify and functionally classify SNPs and indels by exome sequencing of animals of the racing line of Quarter Horses. Based on the individual genomic estimated breeding values (GEBVs) for maximum speed index (SImax) obtained for 349 animals, two groups of 20 extreme animals were formed. Of these individuals, 20 animals with high GEBVs for SImax and 19 with low GEBVs for SImax had their exons and 5′ and 3′ UTRs sequenced. Considering SNPs and indels, 105 182 variants were identified in the expressed regions of the Quarter Horse genome. Of these, 72 166 variants were already known and 33 016 are new variants and were deposited in a database. The analysis of the set of gene variants significantly related (Padjusted < 0.05) to extreme animals in conjunction with the predicted impact of the changes and the physiological role of protein product pointed to two candidate genes potentially related to racing performance: SLC3A1 on ECA15 and CCN6 on ECA10.

19.
Indels in the coding regions of a gene can either cause frameshifts or amino acid insertions/deletions. Frameshifting indels are indels that have a length that is not divisible by 3 and subsequently cause frameshifts. Indels that have a length divisible by 3 cause amino acid insertions/deletions or block substitutions; we call these 3n indels. The new amino acid changes resulting from 3n indels could potentially affect protein function. Therefore, we construct a SIFT Indel prediction algorithm for 3n indels which achieves 82% accuracy, 81% sensitivity, 82% specificity, 82% precision, 0.63 MCC, and 0.87 AUC by 10-fold cross-validation. We have previously published a prediction algorithm for frameshifting indels. The rules for the prediction of 3n indels are different from the rules for the prediction of frameshifting indels and reflect the biological differences of these two different types of variations. SIFT Indel was applied to human 3n indels from the 1000 Genomes Project and the Exome Sequencing Project. We found that common variants are less likely to be deleterious than rare variants. The SIFT indel prediction algorithm for 3n indels is available at http://sift-dna.org/
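
The length rule that separates frameshifting indels from 3n indels can be sketched in a few lines. This is a minimal illustration only, not the SIFT Indel implementation: the function name `classify_coding_indel` and its labels are hypothetical, and it assumes alleles are given as plain reference/alternate strings lying entirely within a coding region.

```python
def classify_coding_indel(ref: str, alt: str) -> str:
    """Classify a coding indel by the net length change between alleles.

    Hypothetical helper: a net change divisible by 3 keeps the reading
    frame ("3n"); any other non-zero change shifts it ("frameshift").
    """
    length_change = abs(len(alt) - len(ref))
    if length_change == 0:
        return "not_an_indel"  # same-length substitution, no net gain or loss
    if length_change % 3 == 0:
        return "3n"
    return "frameshift"


# Illustrative alleles, not taken from any data set.
for ref, alt in [("A", "ATGC"), ("ATG", "A"), ("ACCTGGA", "A"), ("A", "G")]:
    print(ref, alt, classify_coding_indel(ref, alt))
```

Predicting whether a given 3n indel is deleterious, as the abstract describes, needs far more than this length check; the sketch only covers the initial split into the two classes.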

20.