Similar literature
Found 20 similar articles; search took 15 ms
1.
Correctly estimating isoform-specific gene expression is important for understanding complicated biological mechanisms and for mapping disease susceptibility genes. However, estimating isoform-specific gene expression is challenging because various biases present in RNA-Seq (RNA sequencing) data complicate the analysis and, if not appropriately corrected, can affect isoform expression estimation and downstream analysis. In this article, we present PennSeq, a statistical method that allows each isoform to have its own non-uniform read distribution. Instead of making parametric assumptions, we give adequate weight to the underlying data by the use of a non-parametric approach. Our rationale is that regardless of which factors lead to non-uniformity, whether hexamer priming bias, local sequence bias, positional bias, RNA degradation, mapping bias or other unknown reasons, the probability that a fragment is sampled from a particular region will be reflected in the aligned data. This empirical approach thus maximally reflects the true underlying non-uniform read distribution. We evaluate the performance of PennSeq using both simulated data with known ground truth and two real Illumina RNA-Seq data sets, including one with quantitative real time polymerase chain reaction measurements. Our results indicate superior performance of PennSeq over existing methods, particularly for isoforms demonstrating severe non-uniformity. PennSeq is freely available for download at http://sourceforge.net/projects/pennseq.

2.
3.
Remote sensing (RS) data may play an important role in the development of cost-effective means for modelling, mapping, planning and conserving biodiversity. Specifically, at the landscape scale, spatial models for the occurrences of species of conservation concern may be improved by the inclusion of RS-based predictors, to help managers to better meet different conservation challenges. In this study, we examine whether predicted distributions of 28 red-listed plant species in north-eastern Finland at the resolution of 25 ha are improved when advanced RS-variables are included as unclassified continuous predictor variables, in addition to more commonly used climate and topography variables. Using generalized additive models (GAMs), we studied whether the spatial predictions of the distribution of red-listed plant species in boreal landscapes are improved by incorporating advanced RS (normalized difference vegetation index, normalized difference soil index and Tasseled Cap transformations) information into species-environment models. Models were fitted using three different sets of explanatory variables: (1) climate-topography only; (2) remote sensing only; and (3) combined climate-topography and remote sensing variables, and evaluated by four-fold cross-validation with the area under the curve (AUC) statistics. The inclusion of RS variables improved both the explanatory power (on average 8.1% improvement) and cross-validation performance (2.5%) of the models. Hybrid models produced ecologically more reliable distribution maps than models using only climate-topography variables, especially for mire and shore species. In conclusion, Landsat ETM+ data integrated with climate and topographical information has the potential to improve biodiversity and rarity assessments in northern landscapes, especially in predictive studies covering extensive and remote areas.
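
The evaluation scheme named in this abstract (four-fold cross-validation scored by AUC) can be illustrated with a minimal sketch. The function names `auc` and `k_fold_indices` are hypothetical and the interleaved fold layout is an assumption; the abstract does not say how the authors built their folds or which software computed the AUC.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney form).

    labels: 1 for presence, 0 for absence; scores: predicted suitability.
    Ties between a presence and an absence score count as half a win.
    """
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one presence and one absence")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


def k_fold_indices(n_samples, k=4):
    """Split indices 0..n_samples-1 into k interleaved held-out folds.

    Assumption for illustration: folds are taken every k-th sample.
    """
    return [list(range(fold, n_samples, k)) for fold in range(k)]
```

In use, a model would be fitted on three folds, scored with `auc` on the held-out fold, and the four scores averaged.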

4.

Background

A typical human genome differs from the reference genome at 4-5 million sites. This diversity is increasingly catalogued in repositories such as ExAC/gnomAD, consisting of >15,000 whole genomes and >126,000 exome sequences from different individuals. Despite this enormous diversity, resequencing data workflows are still based on a single human reference genome. Identification and genotyping of genetic variants is typically carried out on short-read data aligned to a single reference, disregarding the underlying variation.

Results

We propose a new unified framework for variant calling with short-read data utilizing a representation of human genetic variation – a pan-genomic reference. We provide a modular pipeline that can be seamlessly incorporated into existing sequencing data analysis workflows. Our tool is open source and available online: https://gitlab.com/dvalenzu/PanVC.

Conclusions

Our experiments show that by replacing a standard human reference with a pan-genomic one we achieve an improvement in single-nucleotide variant calling accuracy and in short indel calling accuracy over the widely adopted Genome Analysis Toolkit (GATK) in difficult genomic regions.

5.
An investigation of mushroom phylogeny using the largest subunit of RNA polymerase II gene sequences (RPB1) was conducted in comparison with nuclear ribosomal large subunit RNA gene sequences (nLSU) for the same set of taxa in the genus Inocybe (Agaricales, Basidiomycota). The two data sets, though not significantly incongruent, conflict in the placement of two taxa that exhibit long branches in the nLSU data set. In contrast, RPB1 terminal branch lengths are rather uniform. Bootstrap support is increased for clades in RPB1. Combined data sets increase the degree of confidence for several relationships. Overall, nLSU data do not yield a robust phylogeny when independently assessed by RPB1 sequences. This multigene study indicates that Inocybe is a monophyletic group composed of at least four distinct lineages: subgenus Mallocybe, section Cervicolores, section Rimosae, and subgenus Inocybe sensu Kühner, Kuyper, non Singer. Within subgenus Inocybe, two additional lineages, one composed of species with smooth basidiospores (clade I) and a second characterized by nodulose-spored species (clade II), are recovered by RPB1 and combined data. The nLSU data recover only clade I. The genera Astrosporina and Inocybella cannot be recognized phylogenetically. "Supersections" Cortinatae and Marginatae are not monophyletic groups.

6.
7.
8.
9.
10.
Hubbard AE, Laan MJ. Biometrika, 2008, 95(1):35-47
We propose a new causal parameter, which is a natural extension of existing approaches to causal inference such as marginal structural models. Modelling approaches are proposed for the difference between a treatment-specific counterfactual population distribution and the actual population distribution of an outcome in the target population of interest. Relevant parameters describe the effect of a hypothetical intervention on such a population and therefore we refer to these models as population intervention models. We focus on intervention models estimating the effect of an intervention in terms of a difference and ratio of means, called risk difference and relative risk if the outcome is binary. We provide a class of inverse-probability-of-treatment-weighted and doubly-robust estimators of the causal parameters in these models. The finite-sample performance of these new estimators is explored in a simulation study.
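
As a rough illustration of the inverse-probability-of-treatment-weighted idea described in this abstract, the sketch below estimates the difference between a treatment-specific counterfactual mean and the observed population mean. It is a simplified sketch, not the authors' estimator: the function name is hypothetical, the treatment probabilities are assumed known (in practice they would be estimated from covariates), and no model for the parameter is fitted.

```python
def iptw_population_intervention_effect(outcomes, treatments, propensities, level=0):
    """Estimate E[Y_a] - E[Y] by inverse-probability-of-treatment weighting.

    outcomes:     observed outcome Y_i for each subject
    treatments:   observed treatment A_i for each subject
    propensities: assumed-known P(A = level | W_i) for each subject
    level:        treatment value a imposed by the hypothetical intervention
    """
    n = len(outcomes)
    # Subjects who actually received `level` are up-weighted by 1 / propensity;
    # everyone else contributes zero to the counterfactual mean.
    counterfactual_mean = sum(
        y / g
        for y, a, g in zip(outcomes, treatments, propensities)
        if a == level
    ) / n
    observed_mean = sum(outcomes) / n
    return counterfactual_mean - observed_mean
```

For a binary outcome the returned value is on the risk-difference scale; dividing the two means instead of subtracting them would give the relative-risk analogue.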

11.
The 2007 Energy Independence and Security Act mandates a five-fold increase in US biofuel production by 2022. Given this ambitious policy target, there is a need for spatially explicit estimates of landscape suitability for growing biofuel feedstocks. We developed a suitability modeling approach for two major US biofuel crops, corn (Zea mays) and switchgrass (Panicum virgatum), based upon the use of two presence-only species distribution models (SDMs): maximum entropy (Maxent) and support vector machines (SVM). SDMs are commonly used for modeling animal and plant distributions in natural environments, but have rarely been used to develop landscape models for cultivated crops. AUC, Kappa, and correlation measures derived from test data indicate that SVM slightly outperformed Maxent in modeling US corn production, although both models produced significantly accurate results. When compared with results from a mechanistic switchgrass model recently developed by Oak Ridge National Laboratory (ORNL), SVM results showed higher correlation than Maxent results with models fit using county-scale point inputs of switchgrass production derived from expert opinion estimates. However, Maxent results for an alternative switchgrass model developed with point inputs from research trial sites showed higher correlation to the ORNL model than the corresponding results obtained from SVM. Further analysis indicates that both modeling approaches were effective in predicting county-scale increases in corn production from 2006 to 2007, a time period in which US corn production increased by 24%. We conclude that presence-only methods are a powerful first-cut tool for estimating relative land suitability across geographic regions in which candidate biofuel feedstocks can be grown, and may also provide important insight into potential land-use change patterns likely to be associated with increased biofuel demand.

12.
13.
MOTIVATION: Cellular processes cause changes over time. Observing and measuring those changes over time allows insights into the how and why of regulation. The experimental platform for doing the appropriate large-scale experiments to obtain time courses of expression levels is provided by microarray technology. However, the proper way of analyzing the resulting time course data is still very much an issue under investigation. The inherent time dependencies in the data suggest that clustering techniques which reflect those dependencies yield improved performance. RESULTS: We propose to use Hidden Markov Models (HMMs) to account for the horizontal dependencies along the time axis in time course data and to cope with the prevalent errors and missing values. The HMMs are used within a model-based clustering framework. We are given a number of clusters, each represented by one Hidden Markov Model from a finite collection encompassing typical qualitative behavior. Then, our method finds in an iterative procedure cluster models and an assignment of data points to these models that maximizes the joint likelihood of clustering and models. Partially supervised learning (adding groups of labeled data to the initial collection of clusters) is supported. A graphical user interface allows querying an expression profile dataset for time courses similar to a prototype graphically defined as a sequence of levels and durations. We also propose a heuristic approach to automate determination of the number of clusters. We evaluate the method on published yeast cell cycle and fibroblasts serum response datasets, and compare the results favorably with those of the autoregressive curves method.
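
The assignment half of the iterative procedure described in this abstract can be sketched as follows: each time course is scored under every cluster's HMM with the forward algorithm and assigned to the model giving the highest likelihood. This is a simplified illustration under stated assumptions, not the authors' implementation: observations are assumed to be discrete symbols (for example 0 = down, 1 = up), each model is a hypothetical `(start, trans, emit)` tuple of probability tables, and the re-estimation of the models after each assignment is omitted.

```python
import math


def log_likelihood(observations, start, trans, emit):
    """Scaled forward algorithm: log P(observations | HMM).

    start[s]    : probability of starting in state s
    trans[r][s] : probability of moving from state r to state s
    emit[s][o]  : probability that state s emits symbol o
    """
    n_states = len(start)
    alpha = [start[s] * emit[s][observations[0]] for s in range(n_states)]
    total = 0.0
    for t in range(len(observations)):
        if t > 0:
            alpha = [
                sum(alpha[r] * trans[r][s] for r in range(n_states))
                * emit[s][observations[t]]
                for s in range(n_states)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # the model cannot produce this sequence
        total += math.log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
    return total


def assign_to_clusters(profiles, models):
    """Hard-assign each profile to the index of its most likely model."""
    return [
        max(range(len(models)), key=lambda k: log_likelihood(p, *models[k]))
        for p in profiles
    ]
```

A full clustering loop would alternate this assignment step with re-training each model on the profiles assigned to it until the assignments stop changing.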

14.
Yuan Rong, Zeng Xinhua, Zhao Shengbo, Wu Gang, Yan Xiaohong. Plant Molecular Biology Reporter, 2019, 37(4):347-364
Plant Molecular Biology Reporter - Plant stems are involved in supporting the entire plant body, thus having an important effect on the yield of oilseed rape. The current understanding of the...

15.
16.
17.
18.
Pairwise likelihood methods for inference in image models (total citations: 3; self-citations: 0; citations by others: 3)
Nott DJ, Ryden T. Biometrika, 1999, 86(3):661-676

19.
A flexible statistical framework is developed for the analysis of read counts from RNA-Seq gene expression studies. It provides the ability to analyse complex experiments involving multiple treatment conditions and blocking variables while still taking full account of biological variation. Biological variation between RNA samples is estimated separately from the technical variation associated with sequencing technologies. Novel empirical Bayes methods allow each gene to have its own specific variability, even when there are relatively few biological replicates from which to estimate such variability. The pipeline is implemented in the edgeR package of the Bioconductor project. A case study analysis of carcinoma data demonstrates the ability of generalized linear model methods (GLMs) to detect differential expression in a paired design, and even to detect tumour-specific expression changes. The case study demonstrates the need to allow for gene-specific variability, rather than assuming a common dispersion across genes or a fixed relationship between abundance and variability. Genewise dispersions de-prioritize genes with inconsistent results and allow the main analysis to focus on changes that are consistent between biological replicates. Parallel computational approaches are developed to make non-linear model fitting faster and more reliable, making the application of GLMs to genomic data more convenient and practical. Simulations demonstrate the ability of adjusted profile likelihood estimators to return accurate estimates of biological variability in complex situations. When variation is gene-specific, empirical Bayes estimators provide an advantageous compromise between the extremes of assuming common dispersion or separate genewise dispersion. The methods developed here can also be applied to count data arising from DNA-Seq applications, including ChIP-Seq for epigenetic marks and DNA methylation analyses.

20.
Motif detection based on Gibbs sampling is a common procedure used to retrieve regulatory motifs in silico. Using a species-specific background model was previously shown to increase the robustness of the algorithm. Here, we demonstrate that selecting a non-species-adapted background model can have an adverse effect on the results of motif detection. The large differences in the average nucleotide composition of prokaryotic sequences exacerbate the problem of exchanging background models. Therefore, we have developed complex background models for all prokaryotic species with available genome sequences.
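
Why the choice of background model matters can be seen in a small sketch of the log-odds score that motif samplers typically use to compare a candidate site against the background. This is an illustration under simplifying assumptions: the background here is zero-order (single-nucleotide frequencies), whereas the abstract refers to more complex models, and the function names and the pseudocount are hypothetical.

```python
import math
from collections import Counter


def background_frequencies(sequences, pseudocount=1.0):
    """Zero-order background: smoothed A/C/G/T frequencies of the sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(base for base in seq.upper() if base in "ACGT")
    total = sum(counts[b] for b in "ACGT") + 4 * pseudocount
    return {b: (counts[b] + pseudocount) / total for b in "ACGT"}


def log_odds(site, motif, background):
    """Sum over positions of log2(P(base | motif column) / P(base | background)).

    motif is a list of per-position dicts mapping each base to its probability.
    """
    return sum(
        math.log2(column[base] / background[base])
        for base, column in zip(site, motif)
    )
```

The same AT-rich site scores much higher against a GC-rich background than against an AT-rich one, which is one way a background model taken from a species with a different nucleotide composition can distort which sites look like motifs.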
