Similar articles
 Found 20 similar articles; search took 62 ms
1.
Pipelines for the analysis of Next-Generation Sequencing (NGS) data are generally composed of a set of different publicly available software, configured together in order to map short reads of a genome and call variants. The fidelity of pipelines is variable. We have developed ArtificialFastqGenerator, which takes a reference genome sequence as input and outputs artificial paired-end FASTQ files containing Phred quality scores. Since these artificial FASTQs are derived from the reference genome, they provide a gold standard for read alignment and variant calling, thereby enabling the performance of any NGS pipeline to be evaluated. The user can customise DNA template/read length, the modelling of coverage based on GC content, whether to use real Phred base quality scores taken from existing FASTQ files, and whether to simulate sequencing errors. Detailed coverage and error summary statistics are output. Here we describe ArtificialFastqGenerator and illustrate its implementation in evaluating a typical bespoke NGS analysis pipeline under different experimental conditions. ArtificialFastqGenerator was released in January 2012. Source code, example files and binaries are freely available under the terms of the GNU General Public License v3.0 from https://sourceforge.net/projects/artfastqgen/.
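The relationship between a Phred quality score and a simulated sequencing error can be illustrated with a minimal Python sketch. It is not ArtificialFastqGenerator's code: the function names, the uniform choice of substitute base and the Phred+33 encoding are assumptions made for illustration. It relies only on the standard definition Q = -10 log10(p), where p is the probability that a base call is wrong.

```python
import random


def phred_to_error_prob(q: int) -> float:
    """Convert a Phred score Q into an error probability, p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def encode_phred33(quals) -> str:
    """Encode integer Phred scores as a FASTQ quality string (Phred+33, assumed)."""
    return "".join(chr(q + 33) for q in quals)


def simulate_read(reference: str, start: int, length: int, quals, rng) -> str:
    """Copy `length` bases from the reference, substituting a wrong base
    with the probability implied by each position's Phred score."""
    bases = []
    for offset, q in zip(range(length), quals):
        base = reference[start + offset]
        if rng.random() < phred_to_error_prob(q):
            # Hypothetical error model: pick one of the other three bases uniformly.
            base = rng.choice([b for b in "ACGT" if b != base])
        bases.append(base)
    return "".join(bases)


def fastq_record(name: str, seq: str, quals) -> str:
    """Format one four-line FASTQ record."""
    return f"@{name}\n{seq}\n+\n{encode_phred33(quals)}\n"


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the sketch is repeatable
    reference = "ACGTACGTACGTACGTACGT"
    quals = [30] * 10
    read = simulate_read(reference, 5, 10, quals, rng)
    print(fastq_record("artificial_read_1", read, quals), end="")
```

Because the true origin of every read is known, any difference between an aligner's placement and `start`, or between a called variant and the injected errors, can be counted directly.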

2.
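Porting bioinformatics software from Unix/Linux to Windows using Cygwin   Total citations: 2, self-citations: 0, citations by others: 2
Cygwin emulates and supports a Unix/Linux environment on Windows and offers a fairly complete Unix/Linux toolkit and programming environment. Using Cygwin, we recompiled commonly used bioinformatics data-analysis software, including Sim4, FASTA, Phred/Phrap/RepeatMasker, EMBOSS, HMMER and ClustalW, and found that this approach yields executables that run under Windows. This provides a useful reference for developing cross-platform bioinformatics data-analysis platforms while retaining the advantages of the Windows environment.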

3.
Fluorescence-based sequencing is playing an increasingly important role in efforts to identify DNA polymorphisms and mutations of biological and medical interest. The application of this technology in generating the reference sequence of simple and complex genomes is also driving the development of new computer programs to automate base calling (Phred), sequence assembly (Phrap) and sequence assembly editing (Consed) in high throughput settings. In this report we describe a new computer program known as PolyPhred that automatically detects the presence of heterozygous single nucleotide substitutions by fluorescence-based sequencing of PCR products. Its operations are integrated with the use of the Phred, Phrap and Consed programs and together these tools generate a high throughput system for detecting DNA polymorphisms and mutations by large scale fluorescence-based resequencing. Analysis of sequences containing known DNA variants demonstrates that the accuracy of PolyPhred with single pass data is >99% when the sequences are generated with fluorescent dye-labeled primers and approximately 90% for those prepared with dye-labeled terminators.

4.
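Construction and application of a PC/Linux-based nucleic acid sequence analysis system   Total citations: 13, self-citations: 2, citations by others: 11
Using the Phred/Phrap/Consed and Blast software on a PC running Linux, we built a system for large-scale automated analysis of nucleic acid sequences. The system automatically converts sequencing chromatograms into nucleic acid sequences, removes vector sequences, assembles sequences, identifies repeat sequences and performs sequence similarity analysis, which can speed up the analysis and use of large-scale sequencing data.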

5.
Genomes are becoming heavily annotated with important features. Analysis of these features often employs oligonucleotides that hybridize at defined locations. When the defined location lies in a poor sequence context, traditional design strategies may fail. Locked Nucleic Acid (LNA) can enhance oligonucleotide affinity and specificity. Though LNA has been used in many applications, formal design rules are still being defined. To further this effort we have investigated the effect of LNA on the performance of sequencing and PCR primers in AT-rich regions, where short primers yield poor sequencing reads or PCR yields. LNA was used in three positional patterns: near the 5′ end (LNA-5′), near the 3′ end (LNA-3′) and distributed throughout (LNA-Even). Quantitative measures of sequencing read length (Phred Q30 count) and real-time PCR signal (cycle threshold, CT) were characterized using two-way ANOVA. LNA-5′ increased the average Phred Q30 score by 60% and it was never observed to decrease performance. LNA-5′ generated cycle thresholds in quantitative PCR that were comparable to high-yielding conventional primers. In contrast, LNA-3′ and LNA-Even did not improve read lengths or CT. ANOVA demonstrated the statistical significance of these results and identified significant interaction between the positional design rule and primer sequence.

6.
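Using Phred/Phrap/Consed, cross.match, RepeatMasker, Blast and in-house programs on Linux, we built an EST sequence analysis system for forest trees. The system converts sequencing chromatograms into nucleic acid sequences, removes vector sequences, identifies repeat sequences, classifies and assembles ESTs, performs functional annotation and functional classification of ESTs, and mines SSRs and SNPs. Scripts written in Perl with the bioperl module automate the analysis, so that large batches of forest-tree EST data can be analysed quickly, providing useful information for functional genomics research in forest trees.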

7.
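Although second-generation genome sequencing is becoming increasingly popular, Sanger sequencing remains the gold standard for SNP identification and analysis. Sanger sequencing results have traditionally been analysed with software such as Seq Man. However, most software of this kind relies on manual work to identify and record SNP sites in the sequencing results, which is inefficient and error-prone. In addition, when many individuals are sequenced, such software cannot manage or export population-level data, which is inconvenient for researchers. Phred/Phrap/Consed/Polyphred is a software package for Unix-like platforms developed at the University of Washington. It offers powerful functions for managing large-scale sequencing data and for automatically identifying, tagging and exporting SNPs. However, because it is relatively complicated to install and use, it is seldom used in China. This study introduces the functions, workflow and characteristics of the package, and provides it installed on Ubuntu 12.04 inside a VMware virtual machine so that geneticists can download and use it easily.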

8.
MOTIVATION: Insertion mutagenesis, using transgenes or endogenous transposons, is a popular method for generating null mutations (knockouts) in model organisms. Insertions are mapped to specific genes by amplifying (via TAIL-PCR) and sequencing genomic regions flanking the inserted DNA. The presence of multiple TAIL-PCR templates in one sequencing reaction results in chimeric sequence of intermittently low quality. Standard processing of this sequence by applying Phred quality requirements results in loss of informative sequence, whereas not trimming low-quality sequence causes inclusion of low-complexity homopolymers from the ends of sequence runs. Accurate mapping of the flanking sequences is complicated by the presence of gene families. RESULTS: Methods for extracting informative regions from sequence traces obtained by sequencing multiple TAIL-PCR fragments in a single reaction are described. The completely sequenced Arabidopsis genome was used to identify informative TAIL-PCR sequence regions. Methods were devised to define and select high quality matches and precisely map each insert to the correct genome location. These methods were used to analyze sequence of TAIL-PCR-amplified flanking regions of the inserts from individual plants in a T-DNA-mutagenized population of Arabidopsis thaliana, and are applicable to similar situations where a reference genome can be used to extract information from poor-quality sequence.

9.
Fluorescence-based capillary DNA sequencing has facilitated the early completion of several complex sequencing projects. While capillary systems offer great benefits in terms of ease of use and automation, we find that they are sufficiently different from slab gel separation methodologies to demand re-examination of the protocols used to generate and use DNA sequencing templates. We have recently initiated a large-scale Human Open Reading Frame EST project involving 30 laboratories feeding 11 MegaBace 1000 capillary sequencers. The group has already produced more than 300,000 valid sequences. The most successful template preparation protocol we have found is described here. We have found that a crucial step is the standardization of the quantity and quality of the templates, which has been achieved by overnight bacterial culture followed by PCR using limiting amounts of primers. Using this protocol, there is no need for post-PCR purification, and the final preparation cost is US $0.09/template. After sequencing 10,848 templates using this protocol, 78% of the reads were accepted (after discarding vectors without inserts and inserts smaller than 100 nucleotides), and 85% of the total number of bases had Phred scores of 15 or above.
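A summary statistic of the kind reported above (the share of bases at or above a Phred threshold) can be sketched as follows. This is an illustrative Python example, not the authors' procedure: the function name is hypothetical, and the input is assumed to be one list of integer Phred scores per read, as might be parsed from a quality file.

```python
def fraction_at_or_above(reads_quals, threshold=15):
    """Return the fraction of all bases whose Phred score is >= threshold.

    reads_quals: iterable of per-read lists of integer Phred scores
    (an assumed input shape for this sketch).
    """
    total = 0
    passing = 0
    for quals in reads_quals:
        total += len(quals)
        passing += sum(1 for q in quals if q >= threshold)
    # Avoid dividing by zero when no bases were supplied.
    return passing / total if total else 0.0


if __name__ == "__main__":
    toy_reads = [[10, 15, 20, 30], [5, 40]]  # made-up scores for illustration
    print(f"{fraction_at_or_above(toy_reads, 15):.2%} of bases have Q >= 15")
```

The same function with `threshold=30` gives a Q30-style summary; only the cutoff changes.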

10.

Background  

Trace or chromatogram files (raw data) are produced by automatic nucleic acid sequencing equipment or sequencers. Each file contains information which can be interpreted by specialised software to reveal the sequence (base calling). This is done by the sequencer proprietary software or publicly available programs. Depending on the size of a sequencing project the number of trace files can vary from just a few to thousands of files. Sequencing quality assessment on various criteria is important at the stage preceding clustering and contig assembly. Two major publicly available packages, Phred and Staden, are used by preAssemble to perform sequence quality processing.

11.
We developed an automated pipeline for the detection of single nucleotide polymorphisms (SNPs) in expressed sequence tag (EST) data sets, by combining three DNA sequence analysis programs: Phred, Phrap and PolyBayes. This application requires access to the individual electropherogram traces. First, a reference set of 65 SNPs was obtained from the sequencing of 30 gametes in 13 maritime pine (Pinus pinaster Ait.) gene fragments (6671 bp), resulting in a frequency of 1 SNP every 102.6 bp. Second, parameters of the three programs were optimized in order to retrieve as many true SNPs as possible, while keeping the rate of false positives as low as possible. Overall, the efficiency of detection of true SNPs was 83.1%. However, this rate varied largely as a function of the rare SNP allele frequency: down to 41% for rare SNP alleles (frequency < 10%), up to 98% for allele frequencies above 10%. Third, the detection method was applied to the 18498 assembled maritime pine (Pinus pinaster Ait.) ESTs, allowing the identification of a total of 1400 candidate SNPs, in contigs containing between 4 and 20 sequence reads. These genetic resources, described for the first time in a forest tree species, were made available at http://www.pierroton.inra/genetics/Pinesnps. We also derived an analytical expression for the SNP detection probability as a function of the SNP allele frequency, the number of haploid genomes used to generate the EST sequence database, and the sample size of the contigs considered for SNP detection. The frequency of the SNP allele was shown to be the main factor influencing the probability of SNP detection.
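The arithmetic behind the reported SNP frequency, and the reason allele frequency dominates detection, can be illustrated with a short Python sketch. The abstract does not give the authors' analytical expression, so `prob_minor_allele_seen` below is a deliberately simple stand-in: it assumes reads in a contig are independent binomial draws and ignores the number of haploid genomes in the library. Both function names are hypothetical.

```python
from math import comb


def snp_density(total_bp: int, snp_count: int) -> float:
    """Average number of base pairs per SNP."""
    return total_bp / snp_count


def prob_minor_allele_seen(freq: float, n_reads: int, min_copies: int = 1) -> float:
    """Probability that the minor allele appears in at least `min_copies`
    of `n_reads` reads, under an assumed binomial sampling model."""
    p_fewer = sum(
        comb(n_reads, k) * freq**k * (1 - freq) ** (n_reads - k)
        for k in range(min_copies)
    )
    return 1 - p_fewer


if __name__ == "__main__":
    # 65 SNPs over 6671 bp, as stated in the abstract.
    print(f"1 SNP every {snp_density(6671, 65):.1f} bp")
    # Contig depths of 4 and 20 reads match the range quoted in the abstract.
    for freq in (0.05, 0.10, 0.30):
        for depth in (4, 20):
            p = prob_minor_allele_seen(freq, depth)
            print(f"freq={freq:.2f} depth={depth:2d} P(seen at least once)={p:.3f}")
```

Under this simplified model a rare allele is often absent from a shallow contig altogether, which is consistent in direction with the lower detection rate reported for alleles below 10%.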

12.
Using the Phred/Phrap/Polyphred/Consed pipeline established in the National Livestock Research Institute of Korea, we predicted candidate coding single nucleotide polymorphisms (cSNPs) from 7,600 expressed sequence tags (ESTs) derived from three cDNA libraries (liver, M. longissimus dorsi, and intermuscular fat) of Hanwoo (Korean native cattle) steers. From the 7,600 ESTs, 829 contigs comprising more than two EST reads were assembled using the Phrap assembler. Based on the contig analysis, 201 candidate cSNPs were identified in 129 contigs, in which transitions (69%) outnumbered transversions (31%). To verify whether the predicted cSNPs are real, 17 SNPs involved in lipid and energy metabolism were selected from the ESTs. Twelve of these were confirmed to be real while five were identified as artifacts, possibly due to expressed sequence tag sequence error. Further analysis of the 12 verified cSNPs was performed using the program BLASTX. Five were identified as nonsynonymous cSNPs, five were synonymous cSNPs, and two SNPs were located in 3'-UTRs. Our data indicated that a relatively high SNP prediction rate (71%) from a large EST database could produce abundant cSNPs rapidly, which can be used as valuable genetic markers in cattle.
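The transition/transversion split and the 71% figure can be reproduced in outline with the following Python sketch. It is illustrative only: the function name is hypothetical, and the classification uses the standard definition (a transition stays within purines A/G or within pyrimidines C/T; every other substitution is a transversion).

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
VALID_BASES = PURINES | PYRIMIDINES


def classify_substitution(ref: str, alt: str) -> str:
    """Classify a single-base substitution as a transition or a transversion."""
    ref, alt = ref.upper(), alt.upper()
    if ref not in VALID_BASES or alt not in VALID_BASES or ref == alt:
        raise ValueError(f"not a single-base substitution: {ref!r} -> {alt!r}")
    pair = {ref, alt}
    # Same chemical class on both sides means a transition.
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "transition"
    return "transversion"


if __name__ == "__main__":
    for ref, alt in [("A", "G"), ("C", "T"), ("A", "C"), ("G", "T")]:
        print(f"{ref}>{alt}: {classify_substitution(ref, alt)}")
    # 12 of the 17 tested SNPs were confirmed, as stated in the abstract.
    print(f"validation rate: {12 / 17:.0%}")
```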

13.
We provide a novel method, DRISEE (duplicate read inferred sequencing error estimation), to assess sequencing quality (alternatively referred to as "noise" or "error") within and/or between sequencing samples. DRISEE provides positional error estimates that can be used to inform read trimming within a sample. It also provides global (whole sample) error estimates that can be used to identify samples with high or varying levels of sequencing error that may confound downstream analyses, particularly in the case of studies that utilize data from multiple sequencing samples. For shotgun metagenomic data, we believe that DRISEE provides estimates of sequencing error that are more accurate and less constrained by technical limitations than existing methods that rely on reference genomes or the use of scores (e.g. Phred). Here, DRISEE is applied to (non amplicon) data sets from both the 454 and Illumina platforms. The DRISEE error estimate is obtained by analyzing sets of artifactual duplicate reads (ADRs), a known by-product of both sequencing platforms. We present DRISEE as an open-source, platform-independent method to assess sequencing error in shotgun metagenomic data, and utilize it to discover previously uncharacterized error in de novo sequence data from the 454 and Illumina sequencing platforms.
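The general idea of estimating error from duplicate reads, without a reference genome or quality scores, can be sketched in Python as below. This is a simplified illustration and not the published DRISEE implementation. Grouping reads by an identical prefix, the `prefix_len` and `min_bin_size` values, and taking the majority base as consensus are all assumptions made for this example.

```python
from collections import Counter, defaultdict


def duplicate_read_error_profile(reads, prefix_len=20, min_bin_size=3):
    """Estimate a per-position mismatch rate from groups of presumed duplicate reads.

    Reads sharing the same first `prefix_len` bases are treated as duplicates
    (an assumption of this sketch). Beyond the prefix, each base is compared
    with the majority base of its group at that position.
    Returns {position: mismatches / bases_observed}.
    """
    bins = defaultdict(list)
    for read in reads:
        if len(read) > prefix_len:
            bins[read[:prefix_len]].append(read)

    mismatches = Counter()
    observed = Counter()
    for group in bins.values():
        if len(group) < min_bin_size:
            continue  # too few duplicates to call a consensus
        longest = max(len(read) for read in group)
        for pos in range(prefix_len, longest):
            column = [read[pos] for read in group if len(read) > pos]
            if len(column) < min_bin_size:
                continue
            consensus, _count = Counter(column).most_common(1)[0]
            observed[pos] += len(column)
            mismatches[pos] += sum(1 for base in column if base != consensus)

    return {pos: mismatches[pos] / observed[pos] for pos in sorted(observed)}


if __name__ == "__main__":
    toy_reads = ["AAAACG", "AAAACG", "AAAACT"]  # made-up reads for illustration
    print(duplicate_read_error_profile(toy_reads, prefix_len=4, min_bin_size=3))
```

A positional profile of this kind could inform where to trim reads, and averaging it over positions gives a single whole-sample figure, mirroring the two kinds of estimate the abstract describes.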

14.
A lack of pliant software tools that support small- to medium-scale DNA sequencing efforts is a major hindrance for recording and using laboratory workflow information to monitor the overall quality of data production. Here we describe VSQual, a set of Perl programs intended to provide simple and powerful tools to check several quality features of the sequencing data generated by automated DNA sequencing machines. The core program of VSQual is a flexible Perl-based pipeline, designed to be accessible and useful for both programmers and non-programmers. This pipeline directs the processing steps and can be easily customized for laboratory needs. Basically, the raw DNA sequencing trace files are processed by Phred and Cross_match, then the outputs are parsed, reformatted into Web-based graphical reports, and added to a Web site structure. The result is a set of real-time sequencing reports that are easily accessible to and understood by laboratory staff. These reports facilitate the monitoring of DNA sequencing as well as the management of laboratory workflow, significantly reducing operational costs and ensuring high quality and scientifically reliable results.

15.
The Homo sapiens major histocompatibility complex (MHC) class I chain related gene A (MICA) was scanned for novel single nucleotide polymorphisms (SNPs) using a panel of DNA samples from African-, Japanese- and Mexican-Americans. Overlapping primer-pairs were used to amplify products in the size range of 300 to 400 bp that were sequenced and scanned for SNPs using Phred, Phrap, Polyphred and Consed sequence analysis programs. A total of 16 SNPs were detected, six of which represent new variant nucleotides in the Homo sapiens MICA gene. Three of the variants also represent amino acid changes in the MICA protein. Differences among the three ethnic panels in the frequency of the variant nucleotides observed were inconsistent, but significant for seven of the SNPs detected. Although the sample size is small, this study represents the first multi-population based analysis of the frequency and distribution of SNPs in the MICA gene, a locus that may be essential in the antigenic recognition by gammadelta T cells.

16.
17.
18.
Diagnostic re-sequencing plays a central role in medical and evolutionary genetics. In this report we describe a process that applies fluorescence-based re-sequencing and an integrated set of analysis tools to automate and simplify the identification of DNA variations using the human mitochondrial genome as a model system. Two programs used in genome sequence analysis (Phred, a base-caller, and Phrap, a sequence assembler) are applied to assess the quality of each base call across the sequence. Potential DNA variants are automatically identified and 'tagged' by comparing the assembled sequence with a reference sequence. We also show that employing the Consed program to display a set of highly annotated reference sequences greatly simplifies data analysis by providing a visual database containing information on the location of the PCR primers, coding and regulatory sequences and previously known DNA variants. Among the 12 genomes sequenced, 378 variants, including 29 new variants, were identified, along with two heteroplasmic sites automatically detected by the PolyPhred program. Overall we document the ease and speed of performing high quality and accurate fluorescence-based re-sequencing on long tracts of DNA as well as the application of new approaches to automatically find and view DNA variants among these sequences.

19.
Objective: To identify and functionally characterize single-nucleotide polymorphisms (SNPs) in melanin-concentrating hormone (MCH)-R1 and -R2. Research Methods and Procedures: The entire coding regions and intron/exon splice junction regions of MCH-R1 and MCH-R2 were sequenced from anonymous white (n = 45) and African-American (n = 46) individuals. DNA was analyzed, and SNPs were identified using Phred, Phrap, and Consed software. DNA constructs containing MCH-R1 and MCH-R2 SNPs were generated and expressed in CHO cells. The effects of the SNPs in MCH-R1 and MCH-R2 were assessed in receptor binding assays and functional assays measuring changes in intracellular cAMP and Ca2+ levels. Results: We identified 12 SNPs in the MCH-R1 gene. Two of these SNPs are in coding regions, and one produces an arginine-for-glycine substitution at residue 34 in the MCH-R1 sequence. This SNP is present at a minor allele frequency of 15% in the African-American population tested in this study. We identified eight SNPs in the MCH-R2 gene. Four of these SNPs are in coding regions, and two produce amino acid substitutions. Lysine substitutes for arginine at residue 63 of the African-American population, and glutamine substitutes for arginine at residue 152 in whites (minor allele frequency of 2% for both SNPs). No changes in receptor binding or functional signaling were observed with the SNP mutations in MCH-R1 or MCH-R2. Discussion: These data indicate that potential therapeutics designed to act at the MCH receptor are unlikely to have altered effects in subpopulations that express variant forms of MCH-R1 or MCH-R2.

20.
To rapidly and cost-effectively generate gene expression data, we developed an annotated unigene database of common bean (Phaseolus vulgaris L.). In this study, 3 cDNA libraries were constructed from the bean breeding line SEL1308, 1 from young leaf and 2 from seedlings inoculated or not inoculated with the fungal pathogen Colletotrichum lindemuthianum (Sacc. & Magnus) Briosi & Cavara, which causes anthracnose in common bean. To date, 5255 single-pass sequences have been included in the database after selection based on sequence quality. These ESTs were trimmed and clustered using the computer programs Phred and CAP3 to form a unigene collection of 3126 unique sequences. Within clusters, 318 single nucleotide polymorphisms (SNPs) and 68 insertions-deletions (indels) were found, indicating the presence of paralogous gene families in our database. Each unigene sequence was analyzed for possible function using its similarity to known genes represented in the GenBank database and classified into 14 categories. Only 314 unigenes showed significant similarities to Phaseolus genomic sequences and P. vulgaris ESTs, which indicates that 90% (2818 unigenes) of our database represent newly discovered common bean genes. In addition, 12% (387 unigenes) were shown to be specific to common bean. This study represents a first step towards the discovery of novel genes in beans and a valuable source of molecular markers for expressed gene tagging and mapping.

