Similar articles
 20 similar articles found; search took 62 ms
1.
Fluorescence-based sequencing is playing an increasingly important role in efforts to identify DNA polymorphisms and mutations of biological and medical interest. The application of this technology in generating the reference sequence of simple and complex genomes is also driving the development of new computer programs to automate base calling (Phred), sequence assembly (Phrap) and sequence assembly editing (Consed) in high throughput settings. In this report we describe a new computer program known as PolyPhred that automatically detects the presence of heterozygous single nucleotide substitutions by fluorescence-based sequencing of PCR products. Its operations are integrated with the use of the Phred, Phrap and Consed programs and together these tools generate a high throughput system for detecting DNA polymorphisms and mutations by large scale fluorescence-based resequencing. Analysis of sequences containing known DNA variants demonstrates that the accuracy of PolyPhred with single pass data is >99% when the sequences are generated with fluorescent dye-labeled primers and approximately 90% for those prepared with dye-labeled terminators.

2.
马俊平  杨犀  律娜  刘飞  陈燕  朱宝利 《遗传》2015,37(6):568-574
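Animal T cell receptor (TCR) genes comprise multiple distinct, highly homologous gene families, which makes it difficult to obtain accurate gene sequences and gene order from whole-genome sequencing. Using chicken TCR γ chain (TCRγ, or TRG) gene segment sequences published in NCBI, this study located the region containing the chicken TRG genes and identified the bacterial artificial chromosome (BAC) clone (CH261-174P24) corresponding to the chicken TRG locus. High-throughput re-sequencing and assembly of this clone yielded a draft sequence of 10 scaffolds that covers the chicken TRG locus and its flanking regions fairly completely. PCR amplification and sequencing confirmed that the internal structure of the scaffolds was correct, and corrected one erroneous sequence near a variable gene and one near a gap in the TRG locus of the chicken reference genome, as well as multiple sequence errors in the variable gene region. By correcting the sequence of the TRG locus in the chicken reference genome, the study provides a new approach for genomic sequence analysis of the chicken TRA/D and TRB loci.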

3.
DNA sequencing can be used to gain important information on genes, genetic variation and gene function for biological and medical studies. The growing collection of publicly available reference genome sequences will underpin a new era of whole genome re-sequencing, but sequencing costs need to fall and throughput needs to rise by several orders of magnitude. Novel technologies are being developed to meet this need by generating massive amounts of sequence that can be aligned to the reference sequence. The challenge is to maintain the high standards of accuracy and completeness that are hallmarks of the previous genome projects. One or more new sequencing technologies are expected to become the mainstay of future research, and to make DNA sequencing centre stage as a routine tool in genetic research in the coming years.

4.
ABSTRACT: BACKGROUND: A genome-wide set of single nucleotide polymorphisms (SNPs) is a valuable resource in genetic research and breeding and is usually developed by re-sequencing a genome. If a genome sequence is not available, an alternative strategy must be used. We previously reported the development of a pipeline (AGSNP) for genome-wide SNP discovery in coding sequences and other single-copy DNA without a complete genome sequence in self-pollinating (autogamous) plants. Here we updated this pipeline for SNP discovery in outcrossing (allogamous) species and demonstrated its efficacy in SNP discovery in walnut (Juglans regia L.). RESULTS: The first step in the original implementation of the AGSNP pipeline was the construction of a reference sequence and the identification of single-copy sequences in it. To identify single-copy sequences, multiple genome equivalents of short SOLiD reads of another individual were mapped to shallow genome coverage of long Sanger or Roche 454 reads making up the reference sequence. The relative depth of SOLiD reads was used to filter out repeated sequences from single-copy sequences in the reference sequence. The second step was a search for SNPs between SOLiD reads and the reference sequence. Polymorphism within the mapped SOLiD reads would have precluded SNP discovery; hence both individuals had to be homozygous. The AGSNP pipeline was updated here for using SOLiD or other type of short reads of a heterozygous individual for these two principal steps. A total of 32.6X walnut genome equivalents of SOLiD reads of vegetatively propagated walnut scion cultivar 'Chandler' were mapped to 48,661 'Chandler' bacterial artificial chromosome (BAC) end sequences (BESs) produced by Sanger sequencing during the construction of a walnut physical map. A total of 22,799 putative SNPs were initially identified. 
A total of 6,000 Infinium II type SNPs evenly distributed along the walnut physical map were selected for the construction of an Infinium BeadChip, which was used to genotype a walnut mapping population having 'Chandler' as one of the parents. Genotyping results were used to adjust the filtering parameters of the updated AGSNP pipeline. With the adjusted filtering criteria, 69.6% of SNPs discovered with the updated pipeline were real and could be mapped on the walnut genetic map. A total of 13,439 SNPs were discovered by BES re-sequencing. BESs harboring SNPs were in 677 FPC contigs covering 98% of the physical map of the walnut genome. CONCLUSION: The updated AGSNP pipeline is a versatile SNP discovery tool for a high-throughput, genome-wide SNP discovery in both autogamous and allogamous species. With this pipeline, a large set of SNPs were identified in a single walnut cultivar.

5.
6.
Standard methods of DNA sequence analysis assume that sequences evolve independently, yet this assumption may not be appropriate for segmental duplications that exchange variants via interlocus gene conversion (IGC). Here, we use high quality multiple sequence alignments from well-annotated segmental duplications to systematically identify IGC signals in the human reference genome. Our analysis combines two complementary methods: (i) a paralog quartet method that uses DNA sequence simulations to identify a statistical excess of sites consistent with inter-paralog exchange, and (ii) the alignment-based method implemented in the GENECONV program. One-quarter (25.4%) of the paralog families in our analysis harbor clear IGC signals by the quartet approach. Using GENECONV, we identify 1477 gene conversion tracks that cumulatively span 1.54 Mb of the genome. Our analyses confirm the previously reported high rates of IGC in subtelomeric regions and Y-chromosome palindromes, and identify multiple novel IGC hotspots, including the pregnancy specific glycoproteins and the neuroblastoma breakpoint gene families. Although the duplication history of a paralog family is described by a single tree, we show that IGC has introduced incredible site-to-site variation in the evolutionary relationships among paralogs in the human genome. Our findings indicate that IGC has left significant footprints in patterns of sequence diversity across segmental duplications in the human genome, out-pacing the contributions of single base mutation by orders of magnitude. Collectively, the IGC signals we report comprise a catalog that will provide a critical reference for interpreting observed patterns of DNA sequence variation across duplicated genomic regions, including targets of recent adaptive evolution in humans.

7.
8.
Comparisons between haplotypes from affected patients and the human reference genome are frequently used to identify candidates for disease-causing mutations, even though these alignments are expected to reveal a high level of background neutral polymorphism. This limits the scope of genetic studies to relatively small genomic intervals, because current methods for distinguishing potential causal mutations from neutral variation are inefficient. Here we describe a new strategy for detecting mutations that is based on comparing affected haplotypes with closely matched control sequences from healthy individuals, rather than with the human reference genome. We use theory, simulation, and a real data set to show that this approach is expected to reduce the number of sequence variants that must be subjected to follow-up analysis by at least a factor of 20 when closely matched control sequences are selected from a reference panel with as few as 100 control genomes. We also define a reference data resource that would allow efficient application of this strategy to large critical intervals across the genome.

9.

Background  

Next-generation sequencing (NGS) offers a unique opportunity for high-throughput genomics and has potential to replace Sanger sequencing in many fields, including de-novo sequencing, re-sequencing, meta-genomics, and characterisation of infectious pathogens, such as viral quasispecies. Although methodologies and software for whole genome assembly and genome variation analysis have been developed and refined for NGS data, reconstructing a viral quasispecies using NGS data remains a challenge. This application would be useful for analysing intra-host evolutionary pathways in relation to immune responses and antiretroviral therapy exposures. Here we introduce a set of formulae for the combinatorial analysis of a quasispecies, given a NGS re-sequencing experiment and an algorithm for quasispecies reconstruction. We require that sequenced fragments are aligned against a reference genome, and that the reference genome is partitioned into a set of sliding windows (amplicons). The reconstruction algorithm is based on combinations of multinomial distributions and is designed to minimise the reconstruction of false variants, called in-silico recombinants.

10.
Sequencing and analysis of an Irish human genome   Total citations: 1, self-citations: 0, citations by others: 1

Background

Recent studies generating complete human sequences from Asian, African and European subgroups have revealed population-specific variation and disease susceptibility loci. Here, choosing a DNA sample from a population of interest due to its relative geographical isolation and genetic impact on further populations, we extend the above studies through the generation of 11-fold coverage of the first Irish human genome sequence.

Results

Using sequence data from a branch of the European ancestral tree as yet unsequenced, we identify variants that may be specific to this population. Through comparisons with HapMap and previous genetic association studies, we identified novel disease-associated variants, including a novel nonsense variant putatively associated with inflammatory bowel disease. We describe a novel method for improving SNP calling accuracy at low genome coverage using haplotype information. This analysis has implications for future re-sequencing studies and validates the imputation of Irish haplotypes using data from the current Human Genome Diversity Cell Line Panel (HGDP-CEPH). Finally, we identify gene duplication events as constituting significant targets of recent positive selection in the human lineage.

Conclusions

Our findings show that there remains utility in generating whole genome sequences to illustrate both general principles and reveal specific instances of human biology. With increasing access to low cost sequencing we would predict that even armed with the resources of a small research group a number of similar initiatives geared towards answering specific biological questions will emerge.

11.
Recent advances in genomics technologies have spurred unprecedented efforts in genome and exome re-sequencing aiming to unravel the genetic component of rare and complex disorders. While in rare disorders this allowed the identification of novel causal genes, the missing heritability paradox in complex diseases remains so far elusive. Despite rapid advances of next-generation sequencing, both the technology and the analysis of the data it produces are in their infancy. At present there is abundant knowledge pertaining to the role of rare single nucleotide variants (SNVs) in rare disorders and of common SNVs in common disorders. Although the 1,000 genome project has clearly highlighted the prevalence of rare variants and more complex variants (e.g. insertions, deletions), their role in disease is as yet far from elucidated. We set out to analyse the properties of sequence variants identified in a comprehensive collection of exome re-sequencing studies performed on samples from patients affected by a broad range of complex and rare diseases (N = 173). Given the known potential for Loss of Function (LoF) variants to be false positive, we performed an extensive validation of the common, rare and private LoF variants identified, which indicated that most of the private and rare variants identified were indeed true, while common novel variants had a significantly higher false positive rate. Our results indicated a strong enrichment of very low-frequency insertion/deletion variants, so far under-investigated, which might be difficult to capture with low coverage and imputation approaches and for which most study designs would be under-powered. These insertions and deletions might play a significant role in disease genetics, contributing specifically to the underlying rare and private variation predicted to be discovered through next generation sequencing.

12.
Revisiting the mouse mitochondrial DNA sequence   Total citations: 9, self-citations: 1, citations by others: 8
The existence of reliable mtDNA reference sequences for each species is of great relevance in a variety of fields, from phylogenetic and population genetics studies to pathogenetic determination of mtDNA variants in humans or in animal models of mtDNA-linked diseases. We present compelling evidence for the existence of sequencing errors on the current mouse mtDNA reference sequence. This includes the deletion of a full codon in two genes, the substitution of one amino acid on five occasions and also the involvement of tRNA and rRNA genes. The conclusions are supported by: (i) the re-sequencing of the original cell line used by Bibb and Clayton, the LA9 cell line, (ii) the sequencing of a second L-derivative clone (L929), and (iii) the comparison with 12 other mtDNA sequences from live mice, 10 of them maternally related with the mouse from which the L cells were generated. Two of the latest sequences are reported for the first time in this study (Balb/cJ and C57BL/6J). In addition, we found that both the LA9 and L929 mtDNAs also contain private clone polymorphic variants that, at least in the case of L929, promote functional impairment of the oxidative phosphorylation system. Consequently, the mtDNA of the strain used for the mouse genome project (C57BL/6J) is proposed as the new standard for the mouse mtDNA sequence.

13.
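This study performed high-throughput sequencing of 1382R3, an RNAi transformant strain of Flammulina velutipes (the enoki mushroom), and, using the wild-type W23 genome data previously obtained in our laboratory as a reference, analyzed the transgene insertion sites and copy number in this transformant. Strain 1382R3 was obtained by Agrobacterium-mediated transformation of an F. velutipes strain with the fv-hmg1-RNAi vector, followed by PCR screening for the selectable marker. The transformant's sequencing reads were mapped with BLAST to both the exogenous vector and the genome to find junction reads that contain both genomic sequence (GS) and exogenous vector sequence (ES); based on these reads, a PERL program successfully located two insertion sites in strain 1382R3. Sequence analysis of the two sites showed that the T-DNA fragment was partially inserted at insertion site 1 and fully inserted into the genome at insertion site 2. Both insertions interfered to some extent with the expression of endogenous genes. This method broadens the application of high-throughput sequencing by applying it to the study of insertion positions and copy number in genetic transformation, which benefits functional genomics and genetic engineering research in edible fungi.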

14.
15.
J K Bonfield  C Rada    R Staden 《Nucleic acids research》1998,26(14):3404-3409
The final step in the detection of mutations is to determine the sequence of the suspected mutant and to compare it with that of the wild-type, and for this fluorescence-based sequencing instruments are widely used. We describe some simple algorithms for comparing sequence traces which, as part of our sequence assembly and analysis package, are proving useful for the discovery of mutations and which may also help to identify misplaced readings in sequence assembly projects. The mutations can be detected automatically by a new program called TRACE_DIFF and new types of trace display in our program GAP4 greatly simplify visual checking of the assigned changes. To assess the accuracy of the automatic mutation detection algorithm we analysed 214 sequence readings from hypermutating DNA comprising a total of 108 497 bases. After the readings were assembled there were 1232 base differences, including 392 Ns and 166 alignment characters. Visual inspection of the traces established that of the 1232 differences, 353 were real mutations while the rest were due to base calling errors. The TRACE_DIFF algorithm automatically identified all but 36, with 28 false positives. Further information about the software can be obtained from http://www.mrc-lmb.cam.ac.uk/pubseq/
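
The detection figures in this abstract can be turned into summary rates. The following is a minimal sketch, assuming that "all but 36" means 36 of the 353 real mutations were missed and that the 28 false positives are additional calls; the variable names are illustrative and are not part of TRACE_DIFF.

```python
# Counts taken from the abstract; the interpretation is an assumption (see above).
real_mutations = 353      # differences confirmed as real by visual inspection
missed = 36               # real mutations not flagged automatically
false_positives = 28      # automatic calls that were not real mutations

true_positives = real_mutations - missed
sensitivity = true_positives / real_mutations
precision = true_positives / (true_positives + false_positives)

print(f"true positives = {true_positives}")
print(f"sensitivity    = {sensitivity:.3f}")
print(f"precision      = {precision:.3f}")
```

Under these assumptions the program recovers roughly nine in ten real mutations, and roughly nine in ten of its calls are real.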

16.
In turkeys, spontaneous cardiomyopathy or round heart (RH) disease is characterised by dilated ventricles and cardiac muscle hypertrophy. Although the aetiology of RH is still unknown, the disease can have a significant economic impact on turkey producers. In an initial attempt to identify genomic regions associated with RH, we utilised the chicken genome sequence to target short DNA sequences (sequence-characterised amplified regions, SCARs) identified in previous studies that had significant differences in frequency distribution between RH+ and RH- turkeys. SCARs were comparatively aligned with the chicken whole-genome sequence to identify flanking regions for primer design. Primers from 32 alignments were tested and target sequences were successfully amplified for 30 loci (94%). Comparative re-sequencing identified putative SNPs in 20 of the 30 loci (67%). Genetically informative SNPs at 16 loci were genotyped in the UMN/NTBF turkey mapping population. As a result of this study, 34 markers were placed on the turkey/chicken comparative map and 15 markers were added to the turkey genetic linkage map. The position of these markers relative to cardiac-related genes is presented. In addition, analysis of genotypes at 109 microsatellite loci presumed to flank the SCAR sequences in the turkey genome identified four significant associations with RH.

17.
Construction of a human standard transcript dataset based on the RefSeq database   Total citations: 5, self-citations: 0, citations by others: 5

18.
Salmonid genomes are considered to be in a pseudo‐tetraploid state as a result of a genome duplication event that occurred between 25 and 100 Ma. This situation complicates single‐nucleotide polymorphism (SNP) discovery in rainbow trout as many putative SNPs are actually paralogous sequence variants (PSVs) and not simple allelic variants. To differentiate PSVs from simple allelic variants, we used 19 homozygous doubled haploid (DH) lines that represent a wide geographical range of rainbow trout populations. In the first phase of the study, we analysed SbfI restriction‐site associated DNA (RAD) sequence data from all the 19 lines and selected 11 lines for an extended SNP discovery. In the second phase, we conducted the extended SNP discovery using PstI RAD sequence data from the selected 11 lines. The complete data set is composed of 145 168 high‐quality putative SNPs that were genotyped in at least nine of the 11 lines, of which 71 446 (49%) had minor allele frequencies (MAF) of at least 18% (i.e. at least two of the 11 lines). Approximately 14% of the RAD SNPs in this data set are from expressed or coding rainbow trout sequences. Our comparison of the current data set with previous SNP discovery data sets revealed that 99% of our SNPs are novel. In the support files for this resource, we provide annotation to the positions of the SNPs in the working draft of the rainbow trout reference genome, provide the genotypes of each sample in the discovery panel and identify SNPs that are likely to be in coding sequences.
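
The 18% threshold in this abstract follows from the panel size: with 11 homozygous lines, two lines carrying the minor allele give 2/11. A minimal sketch of that calculation, assuming biallelic SNPs and one allele call per homozygous line; the function name and the example genotypes are hypothetical.

```python
from collections import Counter

def minor_allele_frequency(genotypes):
    """MAF across homozygous lines; None marks a line without a genotype call."""
    called = [g for g in genotypes if g is not None]
    counts = Counter(called).most_common()
    if len(counts) < 2:
        return 0.0  # monomorphic among the called lines
    # The second most common allele is the minor allele at a biallelic site.
    return counts[1][1] / len(called)

# Hypothetical SNP: nine lines carry A, two lines carry G.
all_called = ["A"] * 9 + ["G"] * 2
# Same SNP genotyped in only nine of the 11 lines.
nine_called = ["A"] * 7 + ["G"] * 2 + [None] * 2

print(round(minor_allele_frequency(all_called), 4))
print(round(minor_allele_frequency(nine_called), 4))
```

Because each doubled haploid line is homozygous, the frequency moves in steps of one line, so the smallest non-singleton MAF on a fully genotyped panel of 11 is about 0.18.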

19.
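EMS (ethyl methanesulfonate) mutagenesis of the indica rice variety Huanghuazhan yielded a recessive genic male-sterile mutant, osms55. Genetic analysis showed that the sterility is recessive and controlled by a single nuclear gene. The mutant's genetic background was checked with the high-throughput Illumina Infinium iSelect SNP (50 K) chip and confirmed to be consistent with that of Huanghuazhan. Using an improved MutMap method, the study successfully cloned the male sterility gene; co-segregation analysis of the mutation site and the mutant phenotype showed that LOC_Os02g40450 (MER3) is the gene controlling male sterility in osms55. A mutation at a splice recognition site of this gene causes aberrant splicing and the loss of 15 bases from exon 5, resulting in male sterility. The improved MutMap method does not require a precisely assembled wild-type genome sequence as a control. Instead, re-sequencing data from a DNA pool of plants showing the mutant phenotype in the mapping population and from wild-type plant DNA are each aligned to the Nipponbare reference genome, and the SNPs that differ between mutant and wild type are then compared to identify candidate genes. This greatly reduces the cost of sequencing and assembling the wild-type genome and further extends the range of application of MutMap.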
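
The comparison step described in this abstract (align both samples to a shared reference, then keep the SNPs that differ between them) can be illustrated with a set difference. This is a simplified sketch, not the authors' pipeline: it assumes SNP calls are already available as (chromosome, position, alternate base) tuples, and all coordinates shown are made up for illustration.

```python
def candidate_snps(mutant_pool_snps, wild_type_snps):
    """Return SNPs called in the mutant pool but absent from the wild type."""
    return sorted(set(mutant_pool_snps) - set(wild_type_snps))

# Hypothetical SNP calls against the shared reference genome.
mutant_pool = [("chr02", 1200, "T"), ("chr02", 5600, "A"), ("chr07", 300, "G")]
wild_type = [("chr02", 1200, "T"), ("chr07", 300, "G")]

# Variants shared with the wild type reflect background differences from
# the reference; only the remainder are candidates for the causal mutation.
print(candidate_snps(mutant_pool, wild_type))
```

A real analysis would also weigh allele frequencies within the pool; the sketch only shows why a common reference removes the need for an assembled wild-type genome.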

20.
With the advent of DNA sequencing technologies, more and more reference genome sequences are available for many organisms. Analyzing sequence variation and understanding its biological importance are becoming a major research aim. However, how to store and process the huge amount of eukaryotic genome data, such as those of the human, mouse and rice, has become a challenge to biologists. Currently available bioinformatics tools used to compress genome sequence data have some limitations, such as the requirement of the reference single nucleotide polymorphisms (SNPs) map and information on deletions and insertions. Here, we present a novel compression tool for storing and analyzing Genome ReSequencing data, named GRS. GRS is able to process the genome sequence data without the use of the reference SNPs and other sequence variation information and automatically rebuild the individual genome sequence data using the reference genome sequence. When its performance was tested on the first Korean personal genome sequence data set, GRS was able to achieve ~159-fold compression, reducing the size of the data from 2986.8 to 18.8 MB. While being tested against the sequencing data from rice and Arabidopsis thaliana, GRS compressed the 361.0 MB rice genome data to 4.4 MB, and the A. thaliana genome data from 115.1 MB to 6.5 KB. This de novo compression tool is available at http://gmdd.shgmo.org/Computational-Biology/GRS.
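
The general idea of reference-based compression, storing only where an individual sequence departs from the reference and rebuilding it on demand, can be sketched as follows. This is a toy illustration under strong assumptions (equal-length sequences, substitutions only) and is not the GRS algorithm; the function names and sequences are hypothetical.

```python
def diff_against_reference(reference, individual):
    """Record (position, base) wherever the individual differs from the reference."""
    if len(reference) != len(individual):
        raise ValueError("this sketch handles substitutions only")
    return [(i, b) for i, (a, b) in enumerate(zip(reference, individual)) if a != b]

def rebuild(reference, diffs):
    """Reconstruct the individual sequence from the reference and stored differences."""
    seq = list(reference)
    for position, base in diffs:
        seq[position] = base
    return "".join(seq)

reference = "ACGTACGTAC"
individual = "ACGAACGTTC"

diffs = diff_against_reference(reference, individual)
print(diffs)
print(rebuild(reference, diffs) == individual)

# Compression ratio reported in the abstract for the Korean genome data set.
ratio = 2986.8 / 18.8
print(round(ratio))
```

The stored differences grow with the number of variants rather than with genome length, which is why closely related sequences compress so strongly.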
