The use of a priori knowledge in the alignment of targeted sequencing data is investigated using computational experiments. By adapting a Needleman-Wunsch algorithm to incorporate the genomic position information from the targeted capture, we demonstrate that alignment can be restricted to the target region of interest. When direct string comparison is used in addition, alignment is up to 8 times faster than with the fastest conventional aligner (Bowtie). This results in a total alignment time in targeted sequencing of around 7 min for approximately 56 million captured reads. For conventional aligners such as Bowtie, BWA or MAQ, alignment to just the target region is not feasible, as experiments show that it yields 88% additional SNP calls, the vast majority of which (~92%) are false positives.
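
The idea of trying a direct string comparison first and falling back to Needleman-Wunsch only when it fails can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the scoring values (match=1, mismatch=-1, gap=-2) and the `slack` window are all assumptions chosen for the example, and a production aligner would more likely use a semi-global variant that does not penalize flanking reference bases.

```python
def needleman_wunsch_score(read, target, match=1, mismatch=-1, gap=-2):
    """Global alignment score of `read` against `target` (classic
    Needleman-Wunsch with a linear gap penalty; scores are illustrative)."""
    # prev holds the previous row of the dynamic-programming matrix.
    prev = [j * gap for j in range(len(target) + 1)]
    for i in range(1, len(read) + 1):
        curr = [i * gap]
        for j in range(1, len(target) + 1):
            same = read[i - 1] == target[j - 1]
            diag = prev[j - 1] + (match if same else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def align_to_target(read, reference, expected_pos, slack=2):
    """Align a captured read near its expected position (hypothetical helper).

    Returns (method, score). The cheap exact comparison is tried first;
    dynamic programming runs only when the read differs from the reference.
    """
    window = reference[expected_pos:expected_pos + len(read)]
    if window == read:
        return "exact", len(read)
    # Fall back to alignment against a slightly widened window.
    start = max(0, expected_pos - slack)
    end = min(len(reference), expected_pos + len(read) + slack)
    return "needleman-wunsch", needleman_wunsch_score(read, reference[start:end])
```

Because most captured reads are expected to match the reference exactly at the capture position, the quadratic-time fallback should be needed only for the minority of reads carrying variants or sequencing errors, which is one plausible reading of where the reported speed-up comes from.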

Genomic structural variation (SV), a common hallmark of cancer, has important predictive and therapeutic implications. However, accurately detecting SV using high-throughput sequencing data remains challenging, especially for ‘targeted’ resequencing efforts. This is critically important in the clinical setting, where targeted resequencing is frequently applied to rapidly assess clinically actionable mutations in tumor biopsies in a cost-effective manner. We present BreaKmer, a novel approach that uses a ‘kmer’ strategy to assemble misaligned sequence reads for predicting insertions, deletions, inversions, tandem duplications and translocations at base-pair resolution in targeted resequencing data. Variants are predicted by realigning an assembled consensus sequence created from sequence reads that were abnormally aligned to the reference genome. Using targeted resequencing data from tumor specimens with orthogonally validated SV, non-tumor samples and whole-genome sequencing data, BreaKmer had a 97.4% overall sensitivity for known events and predicted 17 positively validated, novel variants. Relative to four publicly available algorithms, BreaKmer detected SV with increased sensitivity and limited calls in non-tumor samples, key features for variant analysis of tumor specimens in both the clinical and research settings.

The advent of next generation sequencing has influenced every aspect of biological research. Many labs are now using whole genome sequencing in Arabidopsis thaliana as a means to quickly identify EMS-generated mutations present in isolated mutants. Following identification of these mutations, examination of T-DNA insertional alleles defective in candidate genes, or complementation of the mutant phenotype with a wild-type copy of candidate genes, can be used to verify which mutation is causative for the phenotype of interest. Here, we discuss the benefits and pitfalls of using this method to identify mutations underlying phenotypes.

Paired-end sequencing is a common approach for identifying structural variation (SV) in genomes. Discrepancies between the observed and expected alignments indicate potential SVs. Most SV detection algorithms use only one of the possible signals and ignore reads with multiple alignments. This results in reduced sensitivity to detect SVs, especially in repetitive regions. We introduce GASVPro, an algorithm combining both paired read and read depth signals into a probabilistic model which can analyze multiple alignments of reads. GASVPro outperforms existing methods with a 50-90% improvement in specificity on deletions and a 50% improvement on inversions.
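
The statement that discrepancies between observed and expected paired-end alignments indicate potential SVs can be illustrated with a small sketch. It shows only the generic paired-read signal, not GASVPro's probabilistic model or its read depth component. The function name, the expected insert size (300 bp), its standard deviation (30 bp) and the three-standard-deviation cutoff are all assumptions made for the example.

```python
def classify_pair(left_pos, right_pos, proper_orientation,
                  expected_insert=300, insert_sd=30, n_sd=3.0):
    """Label one read pair by comparing its mapped span with the library's
    expected insert size (all thresholds here are illustrative assumptions)."""
    if not proper_orientation:
        # Mates that map in an unexpected relative orientation hint at an inversion.
        return "inversion_candidate"
    span = right_pos - left_pos
    if span > expected_insert + n_sd * insert_sd:
        # Mates map farther apart than expected: sequence missing in the sample.
        return "deletion_candidate"
    if span < expected_insert - n_sd * insert_sd:
        # Mates map closer together than expected: extra sequence in the sample.
        return "insertion_candidate"
    return "concordant"


# Hypothetical pairs: (left position, right position, orientation is as expected)
pairs = [(1000, 1310, True), (1000, 2500, True), (1000, 1300, False)]
labels = [classify_pair(*p) for p in pairs]
```

A single discordant pair is weak evidence on its own, which is why methods in this area cluster many such pairs and, as the abstract describes for GASVPro, combine them with further signals.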

One of the goals of gene expression experiments is the identification of differentially expressed genes among populations that could be used as markers. For this purpose, we implemented a model-free Bayesian approach in a user-friendly and freely available web-based tool called BayBoots. In spite of a common misunderstanding that Bayesian and model-free approaches are incompatible, we merged them in the BayBoots implementation using the kernel density estimator and Rubin's Bayesian Bootstrap. We used the Bayes error rate (BER) instead of the usual P values as an alternative statistical index to rank a class marker's discriminative potential, since it can be visualized by a simple graphical representation and has an intuitive interpretation. Subsequently, the Bayesian Bootstrap was used to assess the BER's credibility. We tested BayBoots on microarray data to look for markers for Trypanosoma cruzi strains isolated from cardiac and asymptomatic patients. We found that the three most frequently used methods in microarray analysis (the t-test, the non-parametric Wilcoxon test and correlation methods) yielded several markers that were discarded by a time-consuming visual check. In contrast, the BayBoots graphical output and ranking automatically identified markers for which classification performance was consistent. BayBoots is available at: http://www.vision.ime.usp.br/~rvencio/BayBoots.
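
How a kernel density estimate, the Bayes error rate and the Bayesian Bootstrap fit together can be sketched in a few lines. This is an illustration under stated assumptions rather than the BayBoots code: equal class priors, a Gaussian kernel with a fixed bandwidth of 0.5, midpoint-rule integration and all function names are choices made for the example.

```python
import math
import random


def weighted_kde(x, sample, weights, bandwidth):
    """Gaussian kernel density estimate at x; weights must sum to 1."""
    norm = 1.0 / (bandwidth * math.sqrt(2.0 * math.pi))
    return sum(w * norm * math.exp(-0.5 * ((x - s) / bandwidth) ** 2)
               for s, w in zip(sample, weights))


def bayes_error_rate(a, b, wa=None, wb=None, bandwidth=0.5, grid_points=400):
    """BER = integral of min(0.5 * f_a, 0.5 * f_b), assuming equal priors.
    `a` and `b` are lists of expression values for the two classes."""
    wa = wa or [1.0 / len(a)] * len(a)
    wb = wb or [1.0 / len(b)] * len(b)
    lo = min(a + b) - 4 * bandwidth
    hi = max(a + b) + 4 * bandwidth
    step = (hi - lo) / grid_points
    total = 0.0
    for i in range(grid_points):
        x = lo + (i + 0.5) * step  # midpoint rule
        total += step * min(0.5 * weighted_kde(x, a, wa, bandwidth),
                            0.5 * weighted_kde(x, b, wb, bandwidth))
    return total


def dirichlet_weights(n, rng):
    """One Bayesian Bootstrap draw: weights from a flat Dirichlet(1, ..., 1)."""
    draws = [rng.gammavariate(1.0, 1.0) for _ in range(n)]
    total = sum(draws)
    return [d / total for d in draws]


def bootstrap_ber_interval(a, b, n_rep=200, seed=0):
    """Approximate 95% credible interval for the BER from reweighted samples."""
    rng = random.Random(seed)
    reps = sorted(
        bayes_error_rate(a, b, dirichlet_weights(len(a), rng),
                         dirichlet_weights(len(b), rng))
        for _ in range(n_rep))
    return reps[int(0.025 * n_rep)], reps[int(0.975 * n_rep) - 1]
```

Under these assumptions a BER near 0 means the two class densities barely overlap (a good marker), while a BER near 0.5 means the gene cannot separate the classes. The interval from the reweighted replicates indicates how stable that ranking is for the given sample size.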

Small nucleolar RNAs (snoRNAs) are noncoding RNAs that direct 2′-O-methylation or pseudouridylation on ribosomal RNAs or spliceosomal small nuclear RNAs. These modifications are needed to modulate the activity of ribosomes and spliceosomes. A comprehensive repertoire of snoRNAs is needed to expand the knowledge of these modifications. The sequences corresponding to snoRNAs in 18–26-nt small RNA sequencing data have been rarely explored and remain as a hidden treasure for snoRNA annotation. Here, we showed the enrichment of small RNAs at Arabidopsis snoRNA termini and developed a computational approach to identify snoRNAs on the basis of this characteristic. The approach successfully uncovered the full-length sequences of 144 known Arabidopsis snoRNA genes, including some snoRNAs with improved 5′- or 3′-end annotation. In addition, we identified 27 and 17 candidates for novel box C/D and box H/ACA snoRNAs, respectively. Northern blot analysis and sequencing data from parallel analysis of RNA ends confirmed the expression and the termini of the newly predicted snoRNAs. Our study especially expanded on the current knowledge of box H/ACA snoRNAs and snoRNA species targeting snRNAs. In this study, we demonstrated that the use of small RNA sequencing data can increase the complexity and the accuracy of snoRNA annotation.

Personal-genomics endeavors, such as the 1000 Genomes project, are generating maps of genomic structural variants by analyzing ends of massively sequenced genome fragments. To process these we developed Paired-End Mapper (PEMer; http://sv.gersteinlab.org/pemer). This comprises an analysis pipeline, compatible with several next-generation sequencing platforms; simulation-based error models, yielding confidence-values for each structural variant; and a back-end database. The simulations demonstrated high structural variant reconstruction efficiency for PEMer's coverage-adjusted multi-cutoff scoring-strategy and showed its relative insensitivity to base-calling errors.

Detection of antibodies in serum has many important applications. Our goal was to develop a facile general experimental approach for identifying antibody-specific peptide ligands that could be used as reagents for antibody detection. Our emphasis was on an approach that would allow identification of peptide ligands for antibodies in serum without the need to isolate the target antibody or to know the identity of its antigen. We combined ribosome display (RD) with the analysis of peptide libraries by next generation sequencing (NGS) of their coding RNA to facilitate identification of antibody-specific peptide ligands from a random-sequence peptide library. We first demonstrated, using purified antibodies, that with our approach, specific peptide ligands for antibodies with simple linear epitopes, as well as peptide mimotopes for antibodies recognizing complex epitopes, were readily identified. Inclusion of NGS analysis reduced the number of RD selection rounds that were required to identify specific ligands and facilitated discrimination between specific and spurious nonspecific sequences. We then used a model of human serum spiked with a known target antibody to develop an NGS-based analysis that allowed identification of specific ligands for a target antibody in the context of an overwhelming amount of unrelated immunoglobulins present in serum.

Rapid development of next generation sequencing technology has enabled the identification of genomic alterations from short sequencing reads. A number of software pipelines are available for calling single nucleotide variants from genomic DNA, but there are no comprehensive pipelines to identify, annotate and prioritize expressed SNVs (eSNVs) from non-directional paired-end RNA-Seq data. We have developed eSNV-Detect, a novel computational system that uses data from multiple aligners to call variants from RNA-Seq, even at low read depths, and to rank them. Multi-platform comparisons with the eSNV-Detect variant candidates were performed. The method was first applied to RNA-Seq from a lymphoblastoid cell-line, achieving 99.7% precision and 91.0% sensitivity in the expressed SNPs for the matching HumanOmni2.5 BeadChip data. Comparison of RNA-Seq eSNV candidates from 25 ER+ breast tumors from The Cancer Genome Atlas (TCGA) project with whole exome coding data showed 90.6–96.8% precision and 91.6–95.7% sensitivity. Contrasting single-cell mRNA-Seq variants with matching traditional multicellular RNA-Seq data for the MDA-MB-231 breast cancer cell-line delineated variant heterogeneity among the single cells. Further, Sanger sequencing validation was performed for an ER+ breast tumor with paired normal adjacent tissue, validating 29 out of 31 candidate eSNVs. The source code and user manuals of the eSNV-Detect pipeline for Sun Grid Engine and virtual machine are available at http://bioinformaticstools.mayo.edu/research/esnv-detect/.



The high-dimensional data produced by functional genomic (FG) studies make it difficult to visualize relationships between gene products and experimental conditions (i.e., assays). Although dimensionality reduction methods such as principal component analysis (PCA) have been very useful, their application to identify assay-specific signatures has been limited by the lack of appropriate methodologies. This article proposes a new and powerful PCA-based method for the identification of assay-specific gene signatures in FG studies.

Evolve and resequence (E&R) is a new approach to investigate the genomic responses to selection during experimental evolution. By using whole genome sequencing of pools of individuals (Pool-Seq), this method can identify selected variants in controlled and replicable experimental settings. Reviewing the current state of the field, we show that E&R can be powerful enough to identify causative genes and possibly even single-nucleotide polymorphisms. We also discuss how the experimental design and the complexity of the trait could result in a large number of false positive candidates. We suggest experimental and analytical strategies to maximize the power of E&R to uncover the genotype–phenotype link and serve as an important research tool for a broad range of evolutionary questions.

Experimental evolution has a long tradition in biology (Garland and Rose, 2009). By exposing an evolving population to conditions chosen by the researcher, it is possible to study the response to this selection regime. A recent review highlighted the broad range of applications that have been investigated with this methodology and concluded that the breadth of research questions is only limited by the creativity of the experimenter (Kawecki et al., 2012). In addition to the great diversity of experimental designs, experimental evolution provides a unique advantage compared with other evolutionary analyses: the ability to replicate an experiment under identical conditions. Through this replication, experimenters are able to distinguish between stochastic and deterministic effects. Until recently, experimental evolution has mainly focused on phenotypes, sometimes combined with the analysis of a small number of markers (see, for example, Nuzhdin et al., 1993; Teotonio et al., 2009). In the wake of the latest sequencing technologies and the ongoing drop in DNA sequencing costs, however, the ultimate goal to connect the phenotypic response to the underlying genetic changes during an experimental evolution study has now come within reach.

Depending on the starting population, two conceptually different approaches of experimental evolution can be distinguished. Either the experiment starts from a genetically homogeneous (invariable) population or from a polymorphic population. In the first approach, adaptation occurs through the accumulation of new beneficial mutations during the experiment (Elena and Lenski, 2003). These experiments therefore require very large population sizes and many generations to ensure a sufficient mutation supply and are thus largely restricted to microorganisms. Alternatively, experiments starting with a polymorphic population do not require novel mutations as selection can act on beneficial alleles that are already present at the beginning of the experiment. Given the massive genetic variation that is present in the starting population, the key challenge for this approach is distinguishing between selected and neutral variants. Neither randomly selected markers nor whole genome sequencing of a few representative individuals can provide sufficient information about the true target(s) of selection. Rather, genome-wide polymorphism data are needed.

As whole genome sequencing is still not feasible for large numbers of individuals, experimental evolution studies starting from polymorphic base populations rely on a modified next-generation sequencing approach. Rather than sequencing individuals separately, DNA of multiple individuals from a population are sequenced together (Pool-Seq). This method is more cost effective than sequencing of individuals (Futschik and Schlötterer, 2010) and yields highly accurate genome-wide allele frequency estimates (reviewed in Rellstab et al., 2013; Schlötterer et al., 2014). The combination of experimental evolution with Pool-Seq is also known as Evolve and Resequence (E&R; Turner et al., 2011; Figure 1). Here, we review the state of the art of whole genome polymorphism analysis in experimental evolution studies relying primarily on segregating variation in the starting population.

Figure 1. Overview of E&R studies. (a) A population of flies is exposed for 60 generations to ultraviolet (UV) radiation (purple arrows). We assume here, for the sake of illustration, that darker pigmentation is beneficial in high UV environments, whereby darker flies will increase in frequency. (b) At the genotypic level, the allele frequency of the causative allele (dark brown) will increase, more so than hitchhiking variants (dark gray background) that will be recombined onto other backgrounds (breaks between dark and light gray background). (c) The allele frequencies of the starting population and the selected population are measured with Pool-Seq. (d) Causative variants can be identified by contrasting the allele frequencies between base and selected population and visualized with Manhattan plots. A full color version of this figure is available at the Heredity journal online.

In many experimental evolution studies, researchers select for a well-defined trait in a controlled environment. This assures that both the phenotypic and the underlying genomic response are triggered either directly or indirectly by the selection regime applied during the experiment. Thus, E&R studies provide a complementary approach to genome-wide association studies (GWASs) and linkage mapping experiments as strategies to connect genotype and phenotype.
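
The contrast of allele frequencies between base and selected populations described in the Figure 1 caption can be sketched as a simple calculation on Pool-Seq read counts. This is a minimal illustration under assumptions: the function names, the site labels and the read counts are invented for the example, and the raw frequency difference stands in for the replicate-aware statistical tests that a real E&R analysis would need to separate selected from neutral variants.

```python
def allele_frequency(alt_reads, total_reads):
    """Estimate the frequency of one allele from pooled read counts."""
    if total_reads <= 0:
        raise ValueError("site has no coverage")
    return alt_reads / total_reads


def frequency_changes(base_counts, evolved_counts):
    """Rank sites by absolute allele frequency change between two pools.

    Both arguments map a site label to (alt_reads, total_reads).
    Only sites covered in both pools are compared.
    """
    changes = []
    for site in base_counts.keys() & evolved_counts.keys():
        delta = (allele_frequency(*evolved_counts[site])
                 - allele_frequency(*base_counts[site]))
        changes.append((site, delta))
    # Largest absolute change first: these are the candidate sites.
    return sorted(changes, key=lambda item: abs(item[1]), reverse=True)


# Hypothetical counts for two sites in the base and the evolved population.
base = {"2L:100": (10, 100), "2L:200": (50, 100)}
evolved = {"2L:100": (80, 100), "2L:200": (55, 100)}
ranked = frequency_changes(base, evolved)
```

In this toy input the first site rises from 0.1 to 0.8 and ranks first. Plotting such per-site values along the genome gives the kind of Manhattan plot mentioned in panel (d), although hitchhiking variants near a selected site will also show large changes.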

