首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 46 毫秒
1.
Direct sequencing of total plant DNA using next generation sequencing technologies generates a whole chloroplast genome sequence that has the potential to provide a barcode for use in plant and food identification. Advances in DNA sequencing platforms may make this an attractive approach for routine plant identification. The HiSeq (Illumina) and Ion Torrent (Life Technology) sequencing platforms were used to sequence total DNA from rice to identify polymorphisms in the whole chloroplast genome sequence of a wild rice plant relative to cultivated rice (cv. Nipponbare). Consensus chloroplast sequences were produced by mapping sequence reads to the reference rice chloroplast genome or by de novo assembly and mapping of the resulting contigs to the reference sequence. A total of 122 polymorphisms (SNPs and indels) between the wild and cultivated rice chloroplasts were predicted by these different sequencing and analysis methods. Of these, a total of 102 polymorphisms including 90 SNPs were predicted by both platforms. Indels were more variable with different sequencing methods, with almost all discrepancies found in homopolymers. The Ion Torrent platform gave no apparent false SNP but was less reliable for indels. The methods should be suitable for routine barcoding using appropriate combinations of sequencing platform and data analysis.  相似文献   

2.
Next generation sequencing (NGS) platforms are replacing traditional molecular biology protocols like cloning and Sanger sequencing. However, accuracy of NGS platforms has rarely been measured when quantifying relative frequencies of genotypes or taxa within populations. Here we developed a new bioinformatic pipeline (QRS) that pools similar sequence variants and estimates their frequencies in NGS data sets from populations or communities. We tested whether the estimated frequency of representative sequences, generated by 454 amplicon sequencing, differs significantly from that obtained by Sanger sequencing of cloned PCR products. This was performed by analysing sequence variation of the highly variable first internal transcribed spacer (ITS1) of the ichthyosporean Caullerya mesnili, a microparasite of cladocerans of the genus Daphnia. This analysis also serves as a case example of the usage of this pipeline to study within‐population variation. Additionally, a public Illumina data set was used to validate the pipeline on community‐level data. Overall, there was a good correspondence in absolute frequencies of C. mesnili ITS1 sequences obtained from Sanger and 454 platforms. Furthermore, analyses of molecular variance (amova ) revealed that population structure of Cmesnili differs across lakes and years independently of the sequencing platform. Our results support not only the usefulness of amplicon sequencing data for studies of within‐population structure but also the successful application of the QRS pipeline on Illumina‐generated data. The QRS pipeline is freely available together with its documentation under GNU Public Licence version 3 at http://code.google.com/p/quantification-representative-sequences .  相似文献   

3.
With the advent of DNA sequencing technologies, more and more reference genome sequences are available for many organisms. Analyzing sequence variation and understanding its biological importance are becoming a major research aim. However, how to store and process the huge amount of eukaryotic genome data, such as those of the human, mouse and rice, has become a challenge to biologists. Currently available bioinformatics tools used to compress genome sequence data have some limitations, such as the requirement of the reference single nucleotide polymorphisms (SNPs) map and information on deletions and insertions. Here, we present a novel compression tool for storing and analyzing Genome ReSequencing data, named GRS. GRS is able to process the genome sequence data without the use of the reference SNPs and other sequence variation information and automatically rebuild the individual genome sequence data using the reference genome sequence. When its performance was tested on the first Korean personal genome sequence data set, GRS was able to achieve ~159-fold compression, reducing the size of the data from 2986.8 to 18.8 MB. While being tested against the sequencing data from rice and Arabidopsis thaliana, GRS compressed the 361.0 MB rice genome data to 4.4 MB, and the A. thaliana genome data from 115.1 MB to 6.5 KB. This de novo compression tool is available at http://gmdd.shgmo.org/Computational-Biology/GRS.  相似文献   

4.
Suzuki S  Ono N  Furusawa C  Ying BW  Yomo T 《PloS one》2011,6(5):e19534
Next-generation sequencing technologies enable the rapid cost-effective production of sequence data. To evaluate the performance of these sequencing technologies, investigation of the quality of sequence reads obtained from these methods is important. In this study, we analyzed the quality of sequence reads and SNP detection performance using three commercially available next-generation sequencers, i.e., Roche Genome Sequencer FLX System (FLX), Illumina Genome Analyzer (GA), and Applied Biosystems SOLiD system (SOLiD). A common genomic DNA sample obtained from Escherichia coli strain DH1 was applied to these sequencers. The obtained sequence reads were aligned to the complete genome sequence of E. coli DH1, to evaluate the accuracy and sequence bias of these sequence methods. We found that the fraction of "junk" data, which could not be aligned to the reference genome, was largest in the data set of SOLiD, in which about half of reads could not be aligned. Among data sets after alignment to the reference, sequence accuracy was poorest in GA data sets, suggesting relatively low fidelity of the elongation reaction in the GA method. Furthermore, by aligning the sequence reads to the E. coli strain W3110, we screened sequence differences between two E. coli strains using data sets of three different next-generation platforms. The results revealed that the detected sequence differences were similar among these three methods, while the sequence coverage required for the detection was significantly small in the FLX data set. These results provided valuable information on the quality of short sequence reads and the performance of SNP detection in three next-generation sequencing platforms.  相似文献   

5.
Application of high‐throughput sequencing platforms in the field of ecology and evolutionary biology is developing quickly with the introduction of efficient methods to reduce genome complexity. Numerous approaches for genome complexity reduction have been developed using different combinations of restriction enzymes, library construction strategies and fragment size selection. As a result, the choice of which techniques to use may become cumbersome, because it is difficult to anticipate the number of loci resulting from each method. We developed SimRAD, an R package that performs in silico restriction enzyme digests and fragment size selection as implemented in most restriction site associated DNA polymorphism and genotyping by sequencing methods. In silico digestion is performed on a reference genome or on a randomly generated DNA sequence when no reference genome sequence is available. SimRAD accurately predicts the number of loci under alternative protocols when a reference genome sequence is available for the targeted species (or a close relative) but may be unreliable when no reference genome is available. SimRAD is also useful for fine‐tuning a given protocol to adjust the number of targeted loci. Here, we outline the functionality of SimRAD and provide an illustrative example of the use of the package (available on the CRAN at http://cran.r-project.org/web/packages/SimRAD ).  相似文献   

6.
Despite major advances in next-generation sequencing, assembly of sequencing data, especially data from novel microorganisms or re-emerging pathogens, remains constrained by the lack of suitable reference sequences. De novo assembly is the best approach to achieve an accurate finished sequence, but multiple sequencing platforms or paired-end libraries are often required to achieve full genome coverage. In this study, we demonstrated a method to assemble complete bacterial genome sequences by integrating shotgun Roche 454 pyrosequencing with optical whole genome mapping (WGM). The whole genome restriction map (WGRM) was used as the reference to scaffold de novo assembled sequence contigs through a stepwise process. Large de novo contigs were placed in the correct order and orientation through alignment to the WGRM. De novo contigs that were not aligned to WGRM were merged into scaffolds using contig branching structure information. These extended scaffolds were then aligned to the WGRM to identify the overlaps to be eliminated and the gaps and mismatches to be resolved with unused contigs. The process was repeated until a sequence with full coverage and alignment with the whole genome map was achieved. Using this method we were able to achieved 100% WGRM coverage without a paired-end library. We assembled complete sequences for three distinct genetic components of a clinical isolate of Providencia stuartii: a bacterial chromosome, a novel bla NDM-1 plasmid, and a novel bacteriophage, without separately purifying them to homogeneity.  相似文献   

7.
Mapping‐by‐sequencing analyses have largely required a complete reference sequence and employed whole genome re‐sequencing. In species such as wheat, no finished genome reference sequence is available. Additionally, because of its large genome size (17 Gb), re‐sequencing at sufficient depth of coverage is not practical. Here, we extend the utility of mapping by sequencing, developing a bespoke pipeline and algorithm to map an early‐flowering locus in einkorn wheat (Triticum monococcum L.) that is closely related to the bread wheat genome A progenitor. We have developed a genomic enrichment approach using the gene‐rich regions of hexaploid bread wheat to design a 110‐Mbp NimbleGen SeqCap EZ in solution capture probe set, representing the majority of genes in wheat. Here, we use the capture probe set to enrich and sequence an F2 mapping population of the mutant. The mutant locus was identified in T. monococcum, which lacks a complete genome reference sequence, by mapping the enriched data set onto pseudo‐chromosomes derived from the capture probe target sequence, with a long‐range order of genes based on synteny of wheat with Brachypodium distachyon. Using this approach we are able to map the region and identify a set of deleted genes within the interval.  相似文献   

8.
The Mouse Genome Database (MGD) is the community database resource for the laboratory mouse, a key model organism for interpreting the human genome and for understanding human biology and disease (http://www.informatics.jax.org). MGD provides standard nomenclature and consensus map positions for mouse genes and genetic markers; it provides a curated set of mammalian homology records, user-defined chromosomal maps, experimental data sets and the definitive mouse 'gene to sequence' reference set for the research community. The integration and standardization of these data sets facilitates the transition between mouse DNA sequence, gene and phenotype annotations. A recent focus on allele and phenotype representations enhances the ability of MGD to organize and present data for exploring the relationship between genotype and phenotype. This link between the genome and the biology of the mouse is especially important as phenotype information grows from large mutagenesis projects and genotype information grows from large-scale sequencing projects.  相似文献   

9.
Sequence assembly of large and repeat-rich plant genomes has been challenging, requiring substantial computational resources and often several complementary sequence assembly and genome mapping approaches. The recent development of fast and accurate long-read sequencing by circular consensus sequencing (CCS) on the PacBio platform may greatly increase the scope of plant pan-genome projects. Here, we compare current long-read sequencing platforms regarding their ability to rapidly generate contiguous sequence assemblies in pan-genome studies of barley (Hordeum vulgare). Most long-read assemblies are clearly superior to the current barley reference sequence based on short-reads. Assemblies derived from accurate long reads excel in most metrics, but the CCS approach was the most cost-effective strategy for assembling tens of barley genomes. A downsampling analysis indicated that 20-fold CCS coverage can yield very good sequence assemblies, while even five-fold CCS data may capture the complete sequence of most genes. We present an updated reference genome assembly for barley with near-complete representation of the repeat-rich intergenic space. Long-read assembly can underpin the construction of accurate and complete sequences of multiple genomes of a species to build pan-genome infrastructures in Triticeae crops and their wild relatives.

A greatly improved reference genome sequence of barley was assembled from accurate long reads.  相似文献   

10.
Major progress has been made in catfish genomics including construction of high-density genetic linkage maps, BAC-based physical maps, and integration of genetic linkage and physical maps. Large numbers of ESTs have been generated from both channel catfish and blue catfish. Microarray platforms have been developed for the analysis of genome expression. Genome repeat structures are studied, laying grounds for whole genome sequencing. USDA recently approved funding of the whole genome sequencing project of catfish using the next generation sequencing technologies. Generation of the whole genome sequence is a historical landmark of catfish research as it opens the real first step of the long march toward genetic enhancement. The research community needs to be focused on aquaculture performance and production traits, take advantage of the unprecedented genome information and technology, and make real progress toward genetic improvements of aquaculture brood stocks.  相似文献   

11.
The advent of Next Generation Sequencing Technology (NGST) has revolutionized molecular biology research, allowing for rapid gene/genome sequencing from a multitude of diverse species. As high throughput sequencing becomes more accessible, more efficient workflows must be developed to deal with the amounts of data produced and better assemble the genomes of de novo lineages. We combine traditional laboratory methods with Illumina NGST to amplify and sequence the largest mammalian multigene family, the Olfactory Receptor gene family, for species with and without a reference genome. We develop novel assembly methods to annotate and filter these data, which can be utilized for any gene family or any species. We find no significant difference between the ratio of genes within their respective gene families of our data compared with available genomic data. Using simulated data we explore the limitations of short‐read sequence data and our assembly in recovering this gene family. We highlight the benefits and shortcomings of these methods. Compared with data generated from traditional polymerase chain reaction, cloning and Sanger sequencing methodologies, sequence data generated using our pipeline increases yield and sequencing efficiency without reducing the number of unique genes amplified. A cloning step is not required, therefore shortening data generation time. The novel downstream methodologies and workflows described provide a tool to be utilized by many fields of biology, to access and analyze the vast quantities of data generated. By combining laboratory and in silico methods, we provide a means of extracting genomic information for multigene families without complete genome sequencing.  相似文献   

12.
Zhang  Hui  Wang  Yuexing  Deng  Ce  Zhao  Sheng  Zhang  Peng  Feng  Jie  Huang  Wei  Kang  Shujing  Qian  Qian  Xiong  Guosheng  Chang  Yuxiao 《中国科学:生命科学英文版》2022,65(2):398-411

High-quality rice reference genomes have accelerated the comprehensive identification of genome-wide variations and research on functional genomics and breeding. Tian-you-hua-zhan has been a leading hybrid in China over the past decade. Here, de novo genome assembly strategy optimization for the rice indica lines Huazhan (HZ) and Tianfeng (TF), including sequencing platforms, assembly pipelines and sequence depth, was carried out. The PacBio and Nanopore platforms for long-read sequencing were utilized, with the Canu, wtdbg2, SMARTdenovo, Flye, Canu-wtdbg2, Canu-SMARTdenovo and Canu-Flye assemblers. The combination of PacBio and Canu was optimal, considering the contig N50 length, contig number, assembled genome size and polishing process. The assembled contigs were scaffolded with Hi-C data, resulting in two “golden quality” rice reference genomes, and evaluated using the scaffold N50, BUSCO, and LTR assembly index. Furthermore, 42,625 and 41,815 non-transposable element genes were annotated for HZ and TF, respectively. Based on our assembly of HZ and TF, as well as Zhenshan97, Minghui63, Shuhui498 and 9311, comprehensive variations were identified using Nipponbare as a reference. The de novo assembly strategy for rice we optimized and the “golden quality” rice genomes we produced for HZ and TF will benefit rice genomics and breeding research, especially with respect to uncovering the genomic basis of the elite traits of HZ and TF.

  相似文献   

13.
Since the development of technologies that can determine the base-pair sequence of DNA, the ability to sequence genes has contributed much to science and medicine. However, it has remained a relatively costly and laborious process, hindering its use as a routine biomedical tool. Recent times are seeing rapid developments in this field, both in the availability of novel sequencing platforms, as well as supporting technologies involved in processes such as targeting and data analysis. This is leading to significant reductions in the cost of sequencing a human genome and the potential for its use as a routine biomedical tool. This review is a snapshot of this rapidly moving field examining the current state of the art, forthcoming developments and some of the issues still to be resolved prior to the use of new sequencing technologies in routine clinical diagnosis.  相似文献   

14.
For the robust practice of genomic medicine, sequencing results must be compatible, regardless of the sequencing technologies and algorithms used. Presently, genome sequencing is still an imprecise science and is complicated by differences in the chemistry, coverage, alignment, and variant-calling algorithms. We identified ∼3.33 million single nucleotide variants (SNVs) and ∼3.62 million SNVs in the SJK genome using SOLiD and Illumina data, respectively. Approximately 3 million SNVs were concordant between the two platforms while 68,532 SNVs were discordant; 219,616 SNVs were SOLiD-specific and 516,080 SNVs were Illumina-specific (i.e., platform-specific). Concordant, discordant, and platform-specific SNVs were further analyzed and characterized. Overall, a large portion of heterozygous SNVs that were discordant with genotyping calls of single nucleotide polymorphism chips were highly confident. Approximately 70% of the platform-specific SNVs were located in regions containing repetitive sequences. Such platform-specificity may arise from differences between platforms, with regard to read length (36 bp and 72 bp vs. 50 bp), insert size (∼100–300 bp vs. ∼1–2 kb), sequencing chemistry (sequencing-by-synthesis using single nucleotides vs. ligation-based sequencing using oligomers), and sequencing quality. When data from the two platforms were merged for variant calling, the proportion of callable regions of the reference genome increased to 99.66%, which was 1.43% higher than the average callability of the two platforms, representing ∼40 million bases. In this study, we compared the differences in sequencing results between two sequencing platforms. Approximately 90% of the SNVs were concordant between the two platforms, yet ∼10% of the SNVs were either discordant or platform-specific, indicating that each platform had its own strengths and weaknesses. When data from the two platforms were merged, both the overall callability of the reference genome and the overall accuracy of the SNVs improved, demonstrating the likelihood that a re-sequenced genome can be revised using complementary data.  相似文献   

15.
Using genomic databases for sequence-based biological discovery   总被引:1,自引:0,他引:1  
The inherent potential underlying the sequence data produced by the International Human Genome Sequencing Consortium and other systematic sequencing projects is, obviously, tremendous. As such, it becomes increasingly important that all biologists have the ability to navigate through and cull important information from key publicly available databases. The continued rapid rise in available sequence information, particularly as model organism data is generated at breakneck speed, also underscores the necessity for all biologists to learn how to effectively make their way through the expanding "sequence information space." This review discusses some of the more commonly used tools for sequence discovery; tools have been developed for the effective and efficient mining of sequence information. These include LocusLink, which provides a gene-centric view of sequence-based information, as well as the 3 major genome browsers: the National Center for Biotechnology Information Map Viewer, the University of California Santa Cruz Genome Browser, and the European Bioinformatics Institute's Ensembl system. An overview of the types of information available through each of these front-ends is given, as well as information on tutorials and other documentation intended to increase the reader's familiarity with these tools.  相似文献   

16.
Target sequence capture is an efficient technique to enrich specific genomic regions for high‐throughput sequencing in ecological and evolutionary studies. In recent years, many sequence capture approaches have been proposed, but most of them rely on commercial synthetic baits which make the experiment expensive. Here, we present a novel sequence capture approach called AFLP‐based genome sequence capture (AFLP Capture). This method uses the AFLP (amplified fragment length polymorphism) technique to generate homemade capture baits without the need for prior genome information, thus is applicable to any organisms. In this approach, biotinylated AFLP fragments representing a random fraction of the genome are used as baits to capture the homologous fragments from genomic shotgun sequencing libraries. In a trial study, by using AFLP Capture, we successfully obtained 511 orthologous loci (>700,000 bp in total length) from 11 Odorrana species and more than 100,000 single nucleotide polymorphisms (SNPs) in four analyzed individuals of an Odorrana species. This result shows that our method can be used to address questions of various evolutionary depths (from interspecies level to intraspecies level). We also discuss the flexibility in bait preparation and how the sequencing data are analyzed. In summary, AFLP Capture is a rapid and flexible tool and can significantly reduce the experimental cost for phylogenetic studies that require analyzing genome‐scale data (hundreds or thousands of loci).  相似文献   

17.
Mass spectrometry‐based proteomics is a popular and powerful method for precise and highly multiplexed protein identification. The most common method of analyzing untargeted proteomics data is called database searching, where the database is simply a collection of protein sequences from the target organism, derived from genome sequencing. Experimental peptide tandem mass spectra are compared to simplified models of theoretical spectra calculated from the translated genomic sequences. However, in several interesting application areas, such as forensics, archaeology, venomics, and others, a genome sequence may not be available, or the correct genome sequence to use is not known. In these cases, de novo peptide identification can play an important role. De novo methods infer peptide sequence directly from the tandem mass spectrum without reference to a sequence database, usually using graph‐based or machine learning algorithms. In this review, we provide a basic overview of de novo peptide identification methods and applications, briefly covering de novo algorithms and tools, and focusing in more depth on recent applications from venomics, metaproteomics, forensics, and characterization of antibody drugs.  相似文献   

18.
The assembly of a reference genome sequence of bread wheat is challenging due to its specific features such as the genome size of 17 Gbp, polyploid nature and prevalence of repetitive sequences. BAC‐by‐BAC sequencing based on chromosomal physical maps, adopted by the International Wheat Genome Sequencing Consortium as the key strategy, reduces problems caused by the genome complexity and polyploidy, but the repeat content still hampers the sequence assembly. Availability of a high‐resolution genomic map to guide sequence scaffolding and validate physical map and sequence assemblies would be highly beneficial to obtaining an accurate and complete genome sequence. Here, we chose the short arm of chromosome 7D (7DS) as a model to demonstrate for the first time that it is possible to couple chromosome flow sorting with genome mapping in nanochannel arrays and create a de novo genome map of a wheat chromosome. We constructed a high‐resolution chromosome map composed of 371 contigs with an N50 of 1.3 Mb. Long DNA molecules achieved by our approach facilitated chromosome‐scale analysis of repetitive sequences and revealed a ~800‐kb array of tandem repeats intractable to current DNA sequencing technologies. Anchoring 7DS sequence assemblies obtained by clone‐by‐clone sequencing to the 7DS genome map provided a valuable tool to improve the BAC‐contig physical map and validate sequence assembly on a chromosome‐arm scale. Our results indicate that creating genome maps for the whole wheat genome in a chromosome‐by‐chromosome manner is feasible and that they will be an affordable tool to support the production of improved pseudomolecules.  相似文献   

19.
《Genomics》2020,112(3):2379-2384
Haploid cell lines are a valuable research tool with broad applicability for genetic assays. As such the fully haploid human cell line, eHAP1, has been used in a wide array of studies. However, the absence of a corresponding reference genome sequence for this cell line has limited the potential for more widespread applications to experiments dependent on available sequence, like capture-clone methodologies. We generated ~15× coverage Nanopore long reads from ten GridION flowcells and utilized this data to assemble a de novo draft genome using minimap and miniasm and subsequently polished using Racon. This assembly was further polished using previously generated, low-coverage, Illumina short reads with Pilon and ntEdit. This resulted in a hybrid eHAP1 assembly with >90% complete BUSCO scores. We further assessed the eHAP1 long read data for structural variants using Sniffles and identify a variety of rearrangements, including a previously established Philadelphia translocation. Finally, we demonstrate how some of these variants overlap open chromatin regions, potentially impacting regulatory regions. By integrating both long and short reads, we generated a high-quality reference assembly for eHAP1 cells. The union of long and short reads demonstrates the utility in combining sequencing platforms to generate a high-quality reference genome de novo solely from low coverage data. We expect the resulting eHAP1 genome assembly to provide a useful resource to enable novel experimental applications in this important model cell line.  相似文献   

20.
PlasmoDB (http://PlasmoDB.org) is the official database of the Plasmodium falciparum genome sequencing consortium. This resource incorporates finished and draft genome sequence data and annotation emerging from Plasmodium sequencing projects. PlasmoDB currently houses information from five parasite species and provides tools for cross-species comparisons. Sequence information is also integrated with other genomic-scale data emerging from the Plasmodium research community, including gene expression analysis from EST, SAGE and microarray projects. The relational schemas used to build PlasmoDB [Genomics Unified Schema (GUS) and RNA Abundance Database (RAD)] employ a highly structured format to accommodate the diverse data types generated by sequence and expression projects. A variety of tools allow researchers to formulate complex, biologically based queries of the database. A version of the database is also available on CD-ROM (Plasmodium GenePlot), facilitating access to the data in situations where Internet access is difficult (e.g. by malaria researchers working in the field). The goal of PlasmoDB is to enhance utilization of the vast quantities of data emerging from genome-scale projects by the global malaria research community.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号