Similar articles
 20 similar articles found (search time: 578 ms)
1.
Brandström M  Ellegren H 《Genetics》2007,176(3):1691-1701
It is increasingly recognized that insertions and deletions (indels) are an important source of genetic as well as phenotypic divergence and diversity. We analyzed length polymorphisms identified through partial (0.25x) shotgun sequencing of three breeds of domestic chicken made by the International Chicken Polymorphism Map Consortium. A data set of 140,484 short indel polymorphisms in unique DNA was identified after filtering for microsatellite structures. There was a significant excess of tandem duplicates at indel sites, with deletions of a duplicate motif outnumbering the generation of duplicates through insertion. Indel density was lower in microchromosomes than in macrochromosomes, in the Z chromosome than in autosomes, and in 100 bp of upstream sequence, 5'-UTR, and first introns than in intergenic DNA and in other introns. Indel density was highly correlated with single nucleotide polymorphism (SNP) density. The mean density of indels in pairwise sequence comparisons was 1.9 × 10⁻⁴ indel events/bp, approximately 5% of the density of SNPs segregating in the chicken genome. The great majority of indels involved a limited number of nucleotides (median 1 bp), with A-rich motifs being overrepresented at indel sites. The overrepresentation of deletions at tandem duplicates indicates that replication slippage in duplicate sequences is a common mechanism behind indel mutation. The correlation between indel and SNP density indicates common effects of mutation and/or selection on the occurrence of indels and point mutations.
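The tandem-duplicate observation in this abstract can be illustrated with a small check. This is a minimal sketch, not the authors' pipeline: the function name and the three-string input layout (left flank, indel motif, right flank) are assumptions made for illustration.

```python
def is_tandem_duplicate(left_flank: str, motif: str, right_flank: str) -> bool:
    """Return True if a copy of the indel motif sits directly next to the indel site.

    left_flank / right_flank are the sequences immediately before and after the
    inserted or deleted motif (hypothetical input layout).
    """
    left_flank, motif, right_flank = (s.upper() for s in (left_flank, motif, right_flank))
    if not motif:
        return False
    # A neighbouring copy on either side makes the indel a tandem duplicate event.
    return left_flank.endswith(motif) or right_flank.startswith(motif)


if __name__ == "__main__":
    print(is_tandem_duplicate("ACGTCA", "CA", "GGT"))  # copy on the left
    print(is_tandem_duplicate("ACGT", "CA", "GGT"))    # no adjacent copy
```

A deletion that removes one of two adjacent copies is the pattern the abstract attributes to replication slippage.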

2.
Li JG  Liljedahl U  Heng CK 《Genomics》2006,87(1):151-157
This study demonstrates an array-based platform to genotype simultaneously single nucleotide polymorphisms (SNPs) and some short insertions/deletions (indels) by the integration of the universal tag/anti-tag (TAT) system, liquid-phase primer extension (LIPEX), and a novel two-color detection strategy on an array format (TATLIPEXA). The TAT system permits a universal chip to be used for many applications, and the LIPEX simplifies the sample preparation but improves the sensitivity significantly. More importantly, all SNPs and some short indels can be interrogated in a single reaction with only two fluorescent ddNTPs. The concept of TATLIPEXA is demonstrated for nine SNPs (eight point mutations and one single-base insertion), and genotypes obtained show a remarkable concordance rate of 100% with both DNA sequencing and restriction fragment length polymorphism. Moreover, TATLIPEXA is able to provide quantitative information on allele frequency in pooled DNA samples, which could serve as a rapid screening tool for SNPs associated with diseases.

3.
As next-generation sequencing projects generate massive genome-wide sequence variation data, bioinformatics tools are being developed to provide computational predictions on the functional effects of sequence variations and narrow down the search of causal variants for disease phenotypes. Different classes of sequence variations at the nucleotide level are involved in human diseases, including substitutions, insertions, deletions, frameshifts, and non-sense mutations. Frameshifts and non-sense mutations are likely to cause a negative effect on protein function. Existing prediction tools primarily focus on studying the deleterious effects of single amino acid substitutions through examining amino acid conservation at the position of interest among related sequences, an approach that is not directly applicable to insertions or deletions. Here, we introduce a versatile alignment-based score as a new metric to predict the damaging effects of variations not limited to single amino acid substitutions but also in-frame insertions, deletions, and multiple amino acid substitutions. This alignment-based score measures the change in sequence similarity of a query sequence to a protein sequence homolog before and after the introduction of an amino acid variation to the query sequence. Our results showed that the scoring scheme performs well in separating disease-associated variants (n = 21,662) from common polymorphisms (n = 37,022) for UniProt human protein variations, and also in separating deleterious variants (n = 15,179) from neutral variants (n = 17,891) for UniProt non-human protein variations. In our approach, the area under the receiver operating characteristic curve (AUC) for the human and non-human protein variation datasets is ∼0.85. We also observed that the alignment-based score correlates with the deleteriousness of a sequence variation.
In summary, we have developed a new algorithm, PROVEAN (Protein Variation Effect Analyzer), which provides a generalized approach to predict the functional effects of protein sequence variations including single or multiple amino acid substitutions, and in-frame insertions and deletions. The PROVEAN tool is available online at http://provean.jcvi.org.
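The idea of scoring a variant by the change in alignment similarity before and after the variation can be sketched as follows. This is a toy illustration only: the match/mismatch/gap values and the function names are assumptions, it uses a single homolog, and it does not reproduce PROVEAN's actual scoring matrix, gap model, or aggregation over homologs.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -2) -> int:
    """Needleman-Wunsch global alignment score with a linear gap penalty (toy scoring)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def delta_score(query: str, variant: str, homolog: str) -> int:
    """Change in similarity to the homolog caused by the variation (negative = less similar)."""
    return global_alignment_score(variant, homolog) - global_alignment_score(query, homolog)


if __name__ == "__main__":
    homolog = "MKTAYIAK"   # hypothetical sequences, not from the paper
    query = "MKTAYIAK"
    print(delta_score(query, "MKTAYIAR", homolog))  # single substitution
    print(delta_score(query, "MKTYIAK", homolog))   # single-residue deletion
```

Because the score is computed on whole sequences, the same function handles substitutions, in-frame insertions, and deletions without special cases, which is the property the abstract emphasizes.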

4.
Bemisia tabaci (Gennadius) (Hemiptera: Aleyrodidae) Middle East-Asia Minor 1 (MEAM1) is invasive and adaptive to varied environments throughout the world. The adaptability is closely related to genomic variation such as single nucleotide polymorphisms (SNPs) and insertions/deletions (indels). In order to elucidate the features of SNPs and indels in MEAM1, and reveal the association between SNPs/indels and adaptive capacity to various environments, a computational approach with QualitySNP was used to identify reliable SNPs and indels on the basis of 9110 expressed sequence tags of MEAM1 present in the NCBI database. There were 575 SNPs detected with a density of 10.1 SNPs/kb and 6.4 SNPs/contig. Also, 237 transitions (39.3%) and 366 transversions (60.7%) were obtained, where the ratio of transitions to transversions was 0.65:1. In addition, 581 indels with a density of 14.1 indels/kb and 9.2 indels/contig were detected. Collectively, the results showed that invasive MEAM1 has a high SNP density and a higher SNP percentage than non-invasive B. tabaci species. A high SNP density/percentage in MEAM1 yielded a high genomic variation that might have allowed it to adapt to varied environments, which provides some support for understanding the invasive nature of MEAM1 at the genomic level. High levels of genomic variation are implicated in the level of adaptive capacity, and invasive species are thought to exhibit higher levels of adaptive capacity than non-invasive species.
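The transition/transversion ratio reported above can be reproduced from the stated counts. The sketch below is illustrative and is not part of QualitySNP; the function names are assumptions.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(ref: str, alt: str) -> str:
    """Classify a single-base change as a transition or a transversion."""
    ref, alt = ref.upper(), alt.upper()
    bases = PURINES | PYRIMIDINES
    if ref not in bases or alt not in bases or ref == alt:
        raise ValueError(f"not a valid single-base substitution: {ref}>{alt}")
    # Purine<->purine or pyrimidine<->pyrimidine is a transition; anything else is a transversion.
    same_class = (ref in PURINES) == (alt in PURINES)
    return "transition" if same_class else "transversion"


def ts_tv_ratio(n_transitions: int, n_transversions: int) -> float:
    """Ratio of transitions to transversions."""
    return n_transitions / n_transversions


if __name__ == "__main__":
    # Counts taken from the abstract: 237 transitions, 366 transversions.
    print(round(ts_tv_ratio(237, 366), 2))  # 0.65
```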

5.
6.
U87MG is a commonly studied grade IV glioma cell line that has been analyzed in at least 1,700 publications over four decades. In order to comprehensively characterize the genome of this cell line and to serve as a model of broad cancer genome sequencing, we have generated greater than 30× genomic sequence coverage using a novel 50-base mate paired strategy with a 1.4kb mean insert library. A total of 1,014,984,286 mate-end and 120,691,623 single-end two-base encoded reads were generated from five slides. All data were aligned using a custom designed tool called BFAST, allowing optimal color space read alignment and accurate identification of DNA variants. The aligned sequence reads and mate-pair information identified 35 interchromosomal translocation events, 1,315 structural variations (>100 bp), 191,743 small (<21 bp) insertions and deletions (indels), and 2,384,470 single nucleotide variations (SNVs). Among these observations, the known homozygous mutation in PTEN was robustly identified, and genes involved in cell adhesion were overrepresented in the mutated gene list. Data were compared to 219,187 heterozygous single nucleotide polymorphisms assayed by Illumina 1M Duo genotyping array to assess accuracy: 93.83% of all SNPs were reliably detected at filtering thresholds that yield greater than 99.99% sequence accuracy. Protein coding sequences were disrupted predominantly in this cancer cell line due to small indels, large deletions, and translocations. In total, 512 genes were homozygously mutated, including 154 by SNVs, 178 by small indels, 145 by large microdeletions, and 35 by interchromosomal translocations to reveal a highly mutated cell line genome. Of the small homozygously mutated variants, 8 SNVs and 99 indels were novel events not present in dbSNP. These data demonstrate that routine generation of broad cancer genome sequence is possible outside of genome centers. 
The sequence analysis of U87MG provides an unparalleled level of mutational resolution compared to any cell line to date.

7.
Iengar P 《Nucleic acids research》2012,40(14):6401-6413
Cancer-associated mutations in cancer genes constitute a diverse set of mutations associated with the disease. To gain insight into features of the set, substitution, deletion and insertion mutations were analysed at the nucleotide level, from the COSMIC database. The most frequent substitutions were c → t, g → a, g → t, and the most frequent codon changes were to termination codons. Deletions more than insertions, FS (frameshift) indels more than I-F (in-frame) ones, and single-nucleotide indels, were frequent. FS indels cause loss of significant fractions of proteins. The 5'-cut in FS deletions, and 5'-ligation in FS insertions, often occur between pairs of identical bases. Interestingly, the cut-site and 3'-ligation in insertions, and 3'-cut and join-pair in deletions, were each found to be the same significantly often (p < 0.001). It is suggested that these features aid the incorporation of indel mutations. Tumor suppressors undergo larger numbers of mutations, especially disruptive ones, over the entire protein length, to inactivate two alleles. Proto-oncogenes undergo fewer, less-disruptive mutations, in selected protein regions, to activate a single allele. Finally, catalogues, in ranked order, of genes mutated in each cancer, and cancers in which each gene is mutated, were created. The study highlights the nucleotide level preferences and disruptive nature of cancer mutations.
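The frameshift (FS) versus in-frame (I-F) distinction used in this abstract depends only on whether the length change is a multiple of three. A minimal sketch, assuming the indel lies within coding sequence and is given as VCF-style reference and alternate alleles (both assumptions for illustration, not the study's method):

```python
def classify_indel(ref: str, alt: str) -> str:
    """Label a coding indel as a frameshift or in-frame insertion/deletion.

    ref and alt are allele strings; only their lengths are used.
    """
    diff = len(alt) - len(ref)
    if diff == 0:
        raise ValueError("alleles have equal length; not an indel")
    kind = "insertion" if diff > 0 else "deletion"
    # A length change divisible by 3 preserves the reading frame.
    frame = "in-frame" if abs(diff) % 3 == 0 else "frameshift"
    return f"{frame} {kind}"


if __name__ == "__main__":
    print(classify_indel("A", "AT"))     # frameshift insertion
    print(classify_indel("ATGC", "A"))   # in-frame deletion
```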

8.
Whole genome sequencing studies are essential to obtain a comprehensive understanding of the vast pattern of human genomic variations. Here we report the results of a high-coverage whole genome sequencing study for 44 unrelated healthy Caucasian adults, each sequenced to over 50-fold coverage (averaging 65.8×). We identified approximately 11 million single nucleotide polymorphisms (SNPs), 2.8 million short insertions and deletions, and over 500,000 block substitutions. We showed that, although previous studies, including the 1000 Genomes Project Phase 1 study, have catalogued the vast majority of common SNPs, many of the low-frequency and rare variants remain undiscovered. For instance, approximately 1.4 million SNPs and 1.3 million short indels that we found were novel to both the dbSNP and the 1000 Genomes Project Phase 1 data sets, and the majority of these (∼96%) have a minor allele frequency less than 5%. On average, each individual genome carried ∼3.3 million SNPs and ∼492,000 indels/block substitutions, including approximately 179 variants that were predicted to cause loss of function of the gene products. Moreover, each individual genome carried an average of 44 such loss-of-function variants in a homozygous state, which would completely “knock out” the corresponding genes. Across all the 44 genomes, a total of 182 genes were “knocked-out” in at least one individual genome, among which 46 genes were “knocked out” in over 30% of our samples, suggesting that a number of genes are commonly “knocked-out” in general populations. Gene ontology analysis suggested that these commonly “knocked-out” genes are enriched in biological processes related to antigen processing and immune response. Our results contribute towards a comprehensive characterization of human genomic variation, especially for less-common and rare variants, and provide an invaluable resource for future genetic studies of human variation and diseases.

9.
Three types of sequence variation, namely single-nucleotide polymorphisms (SNPs), insertions and deletions (indels), and short tandem repeats (STRs), have been extensively reported in mammalian genomes. In this study, we discovered a novel type of sequence variation, i.e., multiple-nucleotide length polymorphisms (MNLPs) in bovine UCN3 (Urocortin 3) and its receptor CRHR2 (corticotropin-releasing hormone receptor 2) genes. Both MNLPs featured involvement of multiple-nucleotide length polymorphisms (5-18 bases), low sequence identity, and 1.7- to 11-fold changes in promoter activity between two alleles. Therefore, this novel genetic complexity would contribute significantly to the evolutionary, functional, and phenotypic complexity of genomes within or among species.

10.
Single nucleotide polymorphisms (SNPs) and insertions/deletions (indels) are increasingly used for cultivar identification, construction of genetic maps, genetic diversity assessment, association mapping and marker-assisted breeding. Although there are several highly sensitive methods for the detection of polymorphisms, most of them are often beyond the budget of medium-throughput academic laboratories or seed companies. Heteroduplex analysis by enzymatic cleavage (CEL1CH) or denaturing high-performance liquid chromatography (dHPLC) has been successfully used to examine genetic variation in several plant and animal species. In this work, we assess and compare the performance of both methods in sunflower by genotyping SNPs from a set of 24 selected polymorphic candidate genes. The CEL1CH method allowed us to accurately detect allele differences in 10 out of 24 regions using an in-house prepared CEL1 enzyme (celery single strand endonuclease 1, Apium graveolens L.). Similarly, a total of 11 regions were successfully optimized for dHPLC analysis. As a scaling-up approach, both strategies were tested to genotype either 42 SNPs/indels in 22 sunflower accessions from the local germplasm bank or 33 SNPs/indels in 90 recombinant inbred lines (RILs) for genetic mapping purposes. In summary, a total of 601 genotypes were efficiently analyzed either with CEL1CH (110) or dHPLC (491). In conclusion, CEL1CH and dHPLC proved to be robust, complementary methods, allowing medium-scale laboratories to scale up the number of both SNPs and individuals to be included in genetic studies and targeted germplasm diversity characterization (EcoTILLING).

11.
Single nucleotide polymorphisms (SNPs) were discovered in common bean (Phaseolus vulgaris L.) via resequencing of sequence-tagged sites (STSs) developed by PCR primers previously designed to soybean shotgun and bacterial artificial chromosome (BAC) end sequences, and by primers designed to common bean genes and microsatellite flanking regions. DNA fragments harboring SNPs were identified in single amplicons from six contrasting P. vulgaris genotypes of the Andean (Jalo EEP 558, G 19833, and AND 277) and Mesoamerican (BAT 93, DOR 364, and Rudá) gene pools. These genotypes are the parents of three common bean recombinant inbred line mapping populations. From an initial set of 1,880 PCR primer pairs tested, 265 robust STSs were obtained, which could be sequenced in each one of the six common bean genotypes. In the resulting 131,120 bp of aligned sequence, a total of 677 SNPs were identified, including 555 single-base changes (295 transitions and 260 transversions) and 122 small nucleotide insertions/deletions (indels). The frequency of SNPs was 5.16 SNPs/kb and the mean nucleotide diversity, expressed as Halushka's theta, was 0.00226. This work represents one of the first efforts aimed at detecting SNPs in P. vulgaris. The SNPs identified should be an important resource for common bean geneticists and breeders for quantitative trait locus discovery, marker-assisted selection, and map-based cloning. These SNPs will also be useful for diversity analysis and microsynteny studies among legume species.

12.
In this work, we examined the genetic diversity and evolution of the WAG-2 gene based on new WAG-2 alleles isolated from wheat and its relatives. Only single nucleotide polymorphisms (SNPs) and no insertions and deletions (indels) were found in exon sequences of WAG-2 from different species. More SNPs and indels occurred in introns than in exons. For exons, exons+introns and introns, the nucleotide polymorphism π decreased from diploid and tetraploid genotypes to hexaploid genotypes. This finding indicated that the diversity of WAG-2 in diploids was greater than in hexaploids because of the strong selection pressure on the latter. All dn/ds ratios were < 1.0, indicating that WAG-2 is a conserved gene affected by negative selection. Thirty-nine of the 57 particular SNPs and eight of the 10 indels were detected in diploid species. The degree of divergence in intron length among WAG-2 clones and phylogenetic tree topology suggested the existence of three homoeologs in the A, B or D genome of common wheat. Wheat AG-like genes were divided into WAG-1 and WAG-2 clades. The latter clade contained WAG-2, OsMADS3 and ZMM2 genes, indicating functional homoeology among them.

13.

Background

The very recent availability of fully sequenced individual human genomes is a major revolution in biology which is certainly going to provide new insights into genetic diseases and genomic rearrangements.

Results

We mapped the insertions, deletions and SNPs (single nucleotide polymorphisms) that are present in Craig Venter's genome, more precisely on chromosomes 17 to 22, and compared them with the human reference genome hg17. Our results show that insertions and deletions are almost absent in L1 and generally scarce in L2 isochore families (GC-poor L1+L2 isochores represent slightly over half of the human genome), whereas they increase in GC-rich isochores, largely paralleling the densities of genes, retroviral integrations and Alu sequences. The distributions of insertions/deletions are in striking contrast with those of SNPs which exhibit almost the same density across all isochore families with, however, a trend for lower concentrations in gene-rich regions.

Conclusions

Our study strongly suggests that the distribution of insertions/deletions is due to the structure of chromatin which is mostly open in gene-rich, GC-rich isochores, and largely closed in gene-poor, GC-poor isochores. The different distributions of insertions/deletions and SNPs are clearly related to the two different responsible mechanisms, namely recombination and point mutations.

14.

Background

Several genomes have now been sequenced, with millions of genetic variants annotated. While significant progress has been made in mapping single nucleotide polymorphisms (SNPs) and small (<10 bp) insertion/deletions (indels), the annotation of larger structural variants has been less comprehensive. It is still unclear to what extent a typical genome differs from the reference assembly, and analyses of the genomes sequenced to date have shown varying results for copy number variation (CNV) and inversions.

Results

We have combined computational re-analysis of existing whole genome sequence data with novel microarray-based analysis, and detect 12,178 structural variants covering 40.6 Mb that were not reported in the initial sequencing of the first published personal genome. We estimate a total non-SNP variation content of 48.8 Mb in a single genome. Our results indicate that this genome differs from the consensus reference sequence by approximately 1.2% when considering indels/CNVs, 0.1% by SNPs and approximately 0.3% by inversions. The structural variants impact 4,867 genes, and >24% of structural variants would not be imputed by SNP-association.

Conclusions

Our results indicate that a large number of structural variants have been unreported in the individual genomes published to date. This significant extent and complexity of structural variants, as well as the growing recognition of their medical relevance, necessitate they be actively studied in health-related analyses of personal genomes. The new catalogue of structural variants generated for this genome provides a crucial resource for future comparison studies.

15.
Over 16,000 high quality expressed sequence tags (ESTs) from red junglefowl (RJ) and White Leghorn (WL) brain and testis cDNA libraries were generated. Here, we have used this resource for detection of single nucleotide polymorphisms (SNPs), and also completed full-length sequencing of 46 pairs of clones, representing the same gene from both the RJ and WL libraries. From the main set of ESTs, which were assembled using Phrap, 746 putative SNPs were identified, of which 76% were transitions and 24% were transversions. A subset of SNPs was evaluated by sequence analysis of five RJ and five WL birds. Nine of 12 SNPs were verified in this limited sample, suggesting that a majority of the putative polymorphisms documented in this study represent real SNPs. During full-length sequencing of the 46 RJ/WL clones 100 SNPs were identified, which translated to a frequency of 1.90 SNPs/1000 bp. The number of transitions and transversions were 77% and 23%, respectively, and the proportion of non-synonymous vs. synonymous SNPs was 20% and 80%, respectively. Four large insertions/deletions were identified between the RJ and WL full-length sequences, and they appear to represent different splice variants.

16.
Genetic epidemiological studies of complex diseases often rely on data from the International HapMap Consortium for identification of single nucleotide polymorphisms (SNPs), particularly those that tag haplotypes. However, little is known about the relevance of the African populations used to collect HapMap data for study populations conducted elsewhere in Africa. Toll-like receptor (TLR) genes play a key role in susceptibility to various infectious diseases, including tuberculosis. We conducted full-exon sequencing in samples obtained from Uganda (n = 48) and South Africa (n = 48), in four genes in the TLR pathway: TLR2, TLR4, TLR6, and TIRAP. We identified one novel TIRAP SNP (with minor allele frequency [MAF] 3.2%) and a novel TLR6 SNP (MAF 8%) in the Ugandan population, and a TLR6 SNP that is unique to the South African population (MAF 14%). These SNPs were also not present in the 1000 Genomes data. Genotype and haplotype frequencies and linkage disequilibrium patterns in Uganda and South Africa were similar to African populations in the HapMap datasets. Multidimensional scaling analysis of polymorphisms in all four genes suggested broad overlap of all of the examined African populations. Based on these data, we propose that there is enough similarity among African populations represented in the HapMap database to justify initial SNP selection for genetic epidemiological studies in Uganda and South Africa. We also discovered three novel polymorphisms that appear to be population-specific and would only be detected by sequencing efforts.

17.
Large indels greatly impact the observable phenotypes in different organisms including plants and humans. Hence, extracting large indels with high precision and sensitivity is important. Here, we developed IndelEnsembler to detect large indels in 1047 Arabidopsis whole-genome sequencing data. IndelEnsembler identified 34 093 deletions, 12 913 tandem duplications and 9773 insertions. Our large indel dataset was more comprehensive and accurate compared with the previous dataset of AthCNV (1). We captured nearly twice as many ground-truth deletions and on average 27% more ground-truth duplications compared with AthCNV, though our dataset contains fewer large indels than AthCNV. Our large indels were positively correlated with transposon elements across the Arabidopsis genome. Non-homologous recombination events were the major formation mechanism of deletions in the Arabidopsis genome. The neighbor-joining (NJ) tree constructed based on IndelEnsembler's deletions clearly divided the geographic subgroups of 1047 Arabidopsis. More importantly, our large indels represent a previously unassessed source of genetic variation. Approximately 49% of the deletions have low linkage disequilibrium (LD) with surrounding single nucleotide polymorphisms. Some of them could affect trait performance. For instance, using deletion-based genome-wide association study (DEL-GWAS), the accessions containing a 182-bp deletion in AT1G11520 had delayed flowering time and all accessions in north Sweden had the 182-bp deletion. We also found the accessions with a 65-bp deletion in the first exon of AT4G00650 (FRI) flowered earlier than those without it. These two deletions cannot be detected in AthCNV and, interestingly, they do not co-occur in any Arabidopsis thaliana accession. By SNP-GWAS, surrounding SNPs of these two deletions do not correlate with flowering time.
This example demonstrated that existing large indel datasets miss phenotypic variations and our large indel dataset filled in the gap.

18.
Nucleotide insertions and deletions (indels) are responsible for gaps in sequence alignments. Indels are one of the major sources of evolutionary change at the molecular level. We have examined the patterns of insertions and deletions in 19 mammalian genomes, and found that deletion events are more common than insertions in the mammalian genomes. The numbers of both insertions and deletions decrease rapidly as the gap length increases, and single-nucleotide indels are the most frequent of all indel events. The frequencies of both insertions and deletions can be described well by a power law.
Key Words: Insertion, deletion, gap, indel, mammalian genome.
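A power-law description of indel length frequencies, count ≈ c · length^(−α), can be fitted by ordinary least squares on log-log axes. The sketch below is one simple way to do this and is not necessarily the fitting procedure used in the study; the function name is an assumption and the example data are synthetic, not taken from the paper.

```python
import math
from typing import Sequence, Tuple


def fit_power_law(lengths: Sequence[float], counts: Sequence[float]) -> Tuple[float, float]:
    """Fit counts ~ c * length**(-alpha); return (alpha, c).

    Uses least-squares regression of log(count) on log(length).
    """
    xs = [math.log(length) for length in lengths]
    ys = [math.log(count) for count in counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return -slope, math.exp(intercept)


if __name__ == "__main__":
    # Synthetic counts generated from an exact power law with alpha = 2, c = 1000.
    gap_lengths = [1, 2, 3, 4, 5]
    gap_counts = [1000 * length ** -2 for length in gap_lengths]
    alpha, c = fit_power_law(gap_lengths, gap_counts)
    print(alpha, c)
```

Log-log regression is easy to follow but can be biased for noisy tails; maximum-likelihood fitting is a common alternative.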

19.
We have developed a computer based method to identify candidate single nucleotide polymorphisms (SNPs) and small insertions/deletions from expressed sequence tag data. Using a redundancy-based approach, valid SNPs are distinguished from erroneous sequence by their representation multiple times in an alignment of sequence reads. A second measure of validity was also calculated based on the cosegregation of the SNP pattern between multiple SNP loci in an alignment. The utility of this method was demonstrated by applying it to 102,551 maize (Zea mays) expressed sequence tag sequences. A total of 14,832 candidate polymorphisms were identified with an SNP redundancy score of two or greater. Segregation of these SNPs with haplotype indicates that candidate SNPs with high redundancy and cosegregation confidence scores are likely to represent true SNPs. This was confirmed by validation of 264 candidate SNPs from 27 loci, with a range of redundancy and cosegregation scores, in four inbred maize lines. The SNP transition/transversion ratio and insertion/deletion size frequencies correspond to those observed by direct sequencing methods of SNP discovery and suggest that the majority of predicted SNPs and insertion/deletions identified using this approach represent true genetic variation in maize.
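The redundancy idea can be sketched on a toy alignment. The abstract does not define the redundancy score precisely, so this sketch assumes it is the number of reads supporting the second most common base in an alignment column; the function names and the input format (equal-length aligned read strings, with `-` for gaps) are likewise assumptions.

```python
from collections import Counter
from typing import List, Sequence


def redundancy_score(column: str) -> int:
    """Reads supporting the second most common base in a column (assumed definition)."""
    counts = Counter(base for base in column.upper() if base in "ACGT")
    ranked = counts.most_common()
    return ranked[1][1] if len(ranked) > 1 else 0


def candidate_snps(reads: Sequence[str], min_redundancy: int = 2) -> List[int]:
    """Return 0-based alignment columns whose minor allele is seen at least min_redundancy times."""
    return [
        index
        for index, column in enumerate(zip(*reads))
        if redundancy_score("".join(column)) >= min_redundancy
    ]


if __name__ == "__main__":
    aligned_reads = ["ACGTA", "ACGTA", "ATGTA", "ATGTC"]  # toy data
    print(candidate_snps(aligned_reads))  # [1]
```

In the toy data, column 4 differs in only one read and is treated as a likely sequencing error, while column 1 has two reads per allele and passes the threshold of two.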

20.
Next-generation sequencing (NGS) is a powerful tool for massive detection of DNA sequence variants such as single nucleotide polymorphisms (SNPs), multi-nucleotide polymorphisms (MNPs) and insertions/deletions (indels). For routine screening of numerous samples, these variants are often converted into cleaved amplified polymorphic sequence (CAPS) markers which are based on the presence versus absence of restriction sites within PCR products. Current computational tools for SNP to CAPS conversion are limited and usually infeasible to use for large datasets such as those generated with NGS. Moreover, there is no available tool for massive conversion of MNPs and indels into CAPS markers. Here, we present VCF2CAPS, a new software tool for identification of restriction endonucleases that recognize SNP/MNP/indel-containing sequences from NGS experiments. Additionally, the program contains filtration utilities not available in other SNP to CAPS converters: selection of markers with a single polymorphic cut site within a user-specified sequence length, and selection of markers that differentiate up to three user-defined groups of individuals from the analyzed population. Performance of VCF2CAPS was tested on a thoroughly analyzed dataset from a genotyping-by-sequencing (GBS) experiment. A selection of CAPS markers picked by the program was subjected to experimental verification. CAPS markers, also referred to as PCR-RFLPs, belong to basic tools exploited in plant, animal and human genetics. Our new software, VCF2CAPS, fills the gap in the current inventory of genetic software by high-throughput CAPS marker design from next-generation sequencing (NGS) data. The program should be of interest to geneticists involved in molecular diagnostics. In this paper we show a successful exemplary application of VCF2CAPS and we believe that its usefulness is guaranteed by the growing availability of NGS services.

This is a PLOS Computational Biology Software paper.
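The core idea behind CAPS marker design, finding an enzyme whose recognition site is present in one allele but not the other, can be sketched in a few lines. This is not VCF2CAPS code: the function names are hypothetical, the three-enzyme table is a tiny illustrative subset, and the sketch only handles palindromic sites on a single strand, without the filtering options described in the abstract.

```python
from typing import Dict, List

# Well-known palindromic recognition sequences (illustrative subset only).
ENZYMES: Dict[str, str] = {
    "EcoRI": "GAATTC",
    "BamHI": "GGATCC",
    "HindIII": "AAGCTT",
}


def count_sites(sequence: str, site: str) -> int:
    """Count occurrences of a recognition site, including overlapping ones."""
    return sum(
        1
        for start in range(len(sequence) - len(site) + 1)
        if sequence[start:start + len(site)] == site
    )


def caps_enzymes(ref_seq: str, alt_seq: str, enzymes: Dict[str, str] = ENZYMES) -> List[str]:
    """Return enzymes whose number of cut sites differs between the two alleles."""
    ref_seq, alt_seq = ref_seq.upper(), alt_seq.upper()
    return sorted(
        name
        for name, site in enzymes.items()
        if count_sites(ref_seq, site) != count_sites(alt_seq, site)
    )


if __name__ == "__main__":
    reference = "TTGAATTCAA"   # toy amplicon containing an EcoRI site
    alternate = "TTGAGTTCAA"   # SNP A>G destroys the site
    print(caps_enzymes(reference, alternate))  # ['EcoRI']
```

An enzyme returned here would cut the PCR product from one allele but not the other, which is what makes the variant scorable on a gel.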
