Similar articles
 Found 20 similar articles (search time: 15 ms)
1.
Accurate estimation of expression levels from RNA-Seq data entails precise mapping of the sequence reads to a reference genome. Because the standard reference genome contains only one allele at any given locus, reads overlapping polymorphic loci that carry a non-reference allele are at least one mismatch away from the reference and, hence, are less likely to be mapped. This bias in read mapping leads to inaccurate estimates of allele-specific expression (ASE). To address this read-mapping bias, we propose the construction of an enhanced reference genome that includes the alternative alleles at known polymorphic loci. We show that mapping to this enhanced reference reduced the read-mapping biases, leading to more reliable estimates of ASE. Experiments on simulated data show that the proposed strategy reduced the number of loci with mapping bias by ≥63% when compared with a previous approach that relies on masking the polymorphic loci and by ≥18% when compared with the standard approach that uses an unaltered reference. When we applied our strategy to actual RNA-Seq data, we found that it mapped up to 15% more reads than the previous approaches and identified many seemingly incorrect inferences made by them.
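The idea of an enhanced reference can be illustrated with a minimal Python sketch. The abstract does not say how the authors build theirs, so everything here is an assumption made for illustration: the function name `alt_allele_segments`, the `(position, ref, alt)` tuple layout, the segment naming, and the choice to emit one short flanked segment per SNP.

```python
from typing import Dict, List, Tuple


def alt_allele_segments(
    reference: str,
    snps: List[Tuple[int, str, str]],
    flank: int,
) -> Dict[str, str]:
    """Return extra reference segments that carry the alternative allele.

    Each SNP is (0-based position, reference allele, alternative allele).
    The segment spans `flank` bases on each side of the SNP, clipped at
    the ends of the reference. This layout is hypothetical.
    """
    segments: Dict[str, str] = {}
    for pos, ref_allele, alt_allele in snps:
        if reference[pos] != ref_allele:
            raise ValueError(
                f"reference base at {pos} is {reference[pos]}, not {ref_allele}"
            )
        start = max(0, pos - flank)
        end = min(len(reference), pos + flank + 1)
        name = f"alt_{pos}_{ref_allele}>{alt_allele}"
        # Same flanking sequence as the reference, with the SNP base swapped.
        segments[name] = reference[start:pos] + alt_allele + reference[pos + 1:end]
    return segments


reference = "ACGTACGTACGT"
segments = alt_allele_segments(reference, [(5, "C", "T")], flank=3)
```

Appending such segments to the standard reference lets a read carrying the non-reference allele align without a mismatch. In practice the flank would likely be chosen relative to the read length, which is also an assumption here.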

2.
3.
Motivation: We present an algorithm to identify allelic variation given a Whole Genome Shotgun (WGS) assembly of haploid sequences, and to produce a set of haploid consensus sequences rather than a single consensus sequence. Existing WGS assemblers take a column-by-column approach to consensus generation, and produce a single consensus sequence which can be inconsistent with the underlying haploid alleles, and inconsistent with any of the aligned sequence reads. Our new algorithm uses a dynamic windowing approach. It detects alleles by simultaneously processing the portions of aligned reads spanning a region of sequence variation, assigns reads to their respective alleles, phases adjacent variant alleles and generates a consensus sequence corresponding to each confirmed allele. This algorithm was used to produce the first diploid genome sequence of an individual human. It can also be applied to assemblies of multiple diploid individuals and hybrid assemblies of multiple haploid organisms. Results: Applied to the individual human genome assembly, the new algorithm detects exactly two confirmed alleles and reports two consensus sequences in 98.98% of the 2 033 311 detected regions of sequence variation. In 33 269 out of 460 373 detected regions of size >1 bp, it fixes the constructed errors of a mosaic haploid representation of a diploid locus as produced by the original Celera Assembler consensus algorithm. Using an optimized procedure calibrated against 1 506 344 known SNPs, it detects 438 814 new heterozygous SNPs with false positive rate 12%. Availability: The open source code is available at: http://wgs-assembler.cvs.sourceforge.net/wgs-assembler/ Contact: gdenisov{at}jcvi.org Associate Editor: John Quackenbush

4.
5.
6.
7.
8.
BM-map: Bayesian mapping of multireads for next-generation sequencing data   Cited by: 1 (self-citations: 0; citations by others: 1)
Ji Y, Xu Y, Zhang Q, Tsui KW, Yuan Y, Norris C, Liang S, Liang H. Biometrics, 2011, 67(4): 1215-1224
Next-generation sequencing (NGS) technology generates millions of short reads, which provide valuable information for various aspects of cellular activities and biological functions. A key step in NGS applications (e.g., RNA-Seq) is to map short reads to correct genomic locations within the source genome. While most reads are mapped to a unique location, a significant proportion of reads align to multiple genomic locations with equal or similar numbers of mismatches; these are called multireads. The ambiguity in mapping the multireads may lead to bias in downstream analyses. Currently, most practitioners discard the multireads in their analysis, resulting in a loss of valuable information, especially for the genes with similar sequences. To refine the read mapping, we develop a Bayesian model that computes the posterior probability of mapping a multiread to each competing location. The probabilities are used for downstream analyses, such as the quantification of gene expression. We show through simulation studies and RNA-Seq analysis of real-life data that the Bayesian method yields better mapping than the current leading methods. We provide a C++ program for download, which is being packaged into user-friendly software.
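To make the notion of a posterior mapping probability concrete, here is a minimal Python sketch. It is not the BM-map model, whose details the abstract does not give. It assumes a deliberately simple likelihood in which each base is miscalled independently with a fixed `error_rate`, and it takes the prior weight of each candidate location as an input. The function name and parameters are hypothetical.

```python
from typing import List


def multiread_posteriors(
    mismatches: List[int],
    read_length: int,
    error_rate: float,
    priors: List[float],
) -> List[float]:
    """Posterior probability that a multiread came from each candidate location.

    Assumed likelihood for a location with m mismatches:
        error_rate**m * (1 - error_rate)**(read_length - m)
    The posterior is prior * likelihood, normalised over the candidates.
    """
    weights = [
        prior * (error_rate ** m) * ((1.0 - error_rate) ** (read_length - m))
        for m, prior in zip(mismatches, priors)
    ]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("all candidate locations have zero weight")
    return [w / total for w in weights]


# Three candidate locations for one 36-base read: two with one mismatch,
# one with two mismatches. The priors are illustrative values only.
posteriors = multiread_posteriors([1, 1, 2], 36, 0.01, [0.5, 0.25, 0.25])
```

Under these assumptions, locations with the same number of mismatches are separated only by their priors, and each extra mismatch lowers the weight by a factor of `error_rate / (1 - error_rate)`. A downstream expression estimate could then count each multiread fractionally by its posterior rather than discarding it.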

9.
Experimentally characterized enhancer regions have previously been shown to display specific patterns of enrichment for several different histone modifications. We modelled these enhancer chromatin profiles in the human genome and used them to guide the search for novel enhancers derived from transposable element (TE) sequences. To do this, a computational approach was taken to analyze the genome-wide histone modification landscape characterized by the ENCODE project in two human hematopoietic cell types, GM12878 and K562. We predicted the locations of 2,107 and 1,448 TE-derived enhancers in the GM12878 and K562 cell lines respectively. A vast majority of these putative enhancers are unique to each cell line; only 3.5% of the TE-derived enhancers are shared between the two. We evaluated the functional effect of TE-derived enhancers by associating them with the cell-type specific expression of nearby genes, and found that the number of TE-derived enhancers is strongly positively correlated with the expression of nearby genes in each cell line. Furthermore, genes that are differentially expressed between the two cell lines also possess a divergent number of TE-derived enhancers in their vicinity. As such, genes that are up-regulated in the GM12878 cell line and down-regulated in K562 have significantly more TE-derived enhancers in their vicinity in the GM12878 cell line and vice versa. These data indicate that human TE-derived sequences are likely to be involved in regulating cell-type specific gene expression on a broad scale and suggest that the enhancer activity of TE-derived sequences is mediated by epigenetic regulatory mechanisms.

10.
Next generation technologies enable massive-scale cDNA sequencing (so-called RNA-Seq). Mainly because of the difficulty of aligning short reads on exon-exon junctions, no attempts have been made so far to use RNA-Seq for building gene models de novo, that is, in the absence of a set of known genes and/or splicing events. We present G-Mo.R-Se (Gene Modelling using RNA-Seq), an approach aimed at building gene models directly from RNA-Seq and demonstrate its utility on the grapevine genome.

11.
12.
Yoon Byung-Jun, Qian Xiaoning, Kahveci Tamer, Pal Ranadip. BMC Genomics, 2020, 21(9): 1-3
Background

Haplotypes, the ordered lists of single nucleotide variations that distinguish chromosomal sequences from their homologous pairs, may reveal an individual’s susceptibility to hereditary and complex diseases and affect how our bodies respond to therapeutic drugs. Reconstructing haplotypes of an individual from short sequencing reads is an NP-hard problem that becomes even more challenging in the case of polyploids. While increasing lengths of sequencing reads and insert sizes helps improve accuracy of reconstruction, it also exacerbates computational complexity of the haplotype assembly task. This has motivated the pursuit of algorithmic frameworks capable of accurate yet efficient assembly of haplotypes from high-throughput sequencing data.

Results

We propose a novel graphical representation of sequencing reads and pose the haplotype assembly problem as an instance of community detection on a spatial random graph. To this end, we construct a graph where each read is a node with an unknown community label associating the read with the haplotype it samples. Haplotype reconstruction can then be thought of as a two-step procedure: first, one recovers the community labels on the nodes (i.e., the reads), and then uses the estimated labels to assemble the haplotypes. Based on this observation, we propose ComHapDet, a novel assembly algorithm for diploid and polyploid haplotypes which allows both biallelic and multi-allelic variants.
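The two-step view described above (label the reads, then assemble the haplotypes) can be sketched in a few lines of Python. This is not ComHapDet. The label-recovery step below is a simple greedy heuristic swapped in for illustration, restricted to the diploid, biallelic case, and the names `edge_weight`, `greedy_labels` and `assemble` are hypothetical.

```python
from typing import Dict, List, Optional

Read = Dict[int, int]  # variant site index -> observed allele (0 or 1)


def edge_weight(a: Read, b: Read) -> int:
    """Agreements minus disagreements at the sites both reads cover."""
    shared = a.keys() & b.keys()
    return sum(1 if a[s] == b[s] else -1 for s in shared)


def greedy_labels(reads: List[Read]) -> List[int]:
    """Step 1 (heuristic): place each read on haplotype 0 or 1 in input order."""
    labels: List[int] = []
    for i, read in enumerate(reads):
        # A positive score means the read agrees more with the reads already
        # on haplotype 0 (and disagrees more with those on haplotype 1).
        score = sum(
            edge_weight(read, reads[j]) * (1 if labels[j] == 0 else -1)
            for j in range(i)
        )
        labels.append(0 if score >= 0 else 1)
    return labels


def assemble(
    reads: List[Read], labels: List[int], n_sites: int
) -> List[List[Optional[int]]]:
    """Step 2: majority vote per site within each label (ties go to allele 1)."""
    haplotypes: List[List[Optional[int]]] = []
    for hap in (0, 1):
        calls: List[Optional[int]] = []
        for site in range(n_sites):
            votes = [r[site] for r, lab in zip(reads, labels) if lab == hap and site in r]
            calls.append(int(2 * sum(votes) >= len(votes)) if votes else None)
        haplotypes.append(calls)
    return haplotypes


# Five error-free reads sampled from haplotypes 0101 and 1010 (toy data).
reads = [{0: 0, 1: 1}, {1: 0, 2: 1}, {1: 1, 2: 0, 3: 1}, {2: 1, 3: 0}, {0: 1, 1: 0}]
labels = greedy_labels(reads)
haplotypes = assemble(reads, labels, n_sites=4)
```

On this toy input the greedy pass separates the reads cleanly. With sequencing errors or sparse overlaps, an order-dependent heuristic like this one can go wrong, which is one reason a proper community detection method is attractive.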

Conclusions

Performance of the proposed algorithm is benchmarked on simulated as well as experimental data obtained by sequencing Chromosome 5 of tetraploid biallelic Solanum tuberosum (potato). The results demonstrate the efficacy of the proposed method and show that it compares favorably with the existing techniques.


13.
14.
15.
16.
We report an analysis of allele-specific expression (ASE) and parent-of-origin expression in adult mouse liver using next generation sequencing (RNA-Seq) of reciprocal crosses of heterozygous F1 mice from the parental strains C57BL/6J and DBA/2J. We found a 60% overlap between genes exhibiting ASE and putative cis-acting expression quantitative trait loci (cis-eQTL) identified in an intercross between the same strains. We discuss the various biological and technical factors that contribute to the differences. We also identify genes exhibiting parental imprinting and complex expression patterns. Our study demonstrates the importance of biological replicates to limit the number of false positives with RNA-Seq data.

17.
Journal of Bryology, 2013, 35(3): 479-485
Abstract

Evidence is presented showing that the chromosomes of diploid and triploid races of Atrichum undulatum are significantly shorter than those of haploid plants. Relative DNA contents of the three cytotypes have been estimated and they differ significantly from an expected 1:2:3 ratio in haploid, diploid and triploid races.

18.
19.
In certain segments of human DNA, the methylation of deoxycytidine residues has been found to be highly specific and interindividually conserved. Imprinted DNA sequences in diploid primary cells show allele-specific differences in DNA methylation, usually with the active chromosomal regions being unmethylated and the inactive regions being methylated. We show here that DNA from spermatozoa exhibits variations in allelic methylation patterns. Since germ cells are haploid, individual spermatozoa can differ in DNA methylation patterns not only in the maternally or paternally derived allele, but also within each allele.

20.
Unchained base reads on self-assembling DNA nanoarrays have recently emerged as a promising approach to low-cost, high-quality resequencing of human genomes. Because of unique characteristics of these mated pair reads, existing computational methods for resequencing assembly, such as those based on map-consensus calling, are not adequate for accurate variant calling. We describe novel computational methods developed for accurate calling of SNPs and short substitutions and indels (<100 bp); the same methods apply to evaluation of hypothesized larger, structural variations. We use an optimization process that iteratively adjusts the genome sequence to maximize its a posteriori probability given the observed reads. For each candidate sequence, this probability is computed using Bayesian statistics with a simple read generation model and simplifying assumptions that make the problem computationally tractable. The optimization process iteratively applies one-base substitutions, insertions, and deletions until convergence is achieved to an optimum diploid sequence. A local de novo assembly procedure that generalizes approaches based on De Bruijn graphs is used to seed the optimization process in order to reduce the chance of converging to local optima. Finally, a correlation-based filter is applied to reduce the false positive rate caused by the presence of repetitive regions in the reference genome.
