首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
As the use of RNA-seq has popularized, there is an increasing consciousness of the importance of experimental design, bias removal, accurate quantification and control of false positives for proper data analysis. We introduce the NOISeq R-package for quality control and analysis of count data. We show how the available diagnostic tools can be used to monitor quality issues, make pre-processing decisions and improve analysis. We demonstrate that the non-parametric NOISeqBIO efficiently controls false discoveries in experiments with biological replication and outperforms state-of-the-art methods. NOISeq is a comprehensive resource that meets current needs for robust data-aware analysis of RNA-seq differential expression.  相似文献   

2.
Competitive gene set tests are commonly used in molecular pathway analysis to test for enrichment of a particular gene annotation category amongst the differential expression results from a microarray experiment. Existing gene set tests that rely on gene permutation are shown here to be extremely sensitive to inter-gene correlation. Several data sets are analyzed to show that inter-gene correlation is non-ignorable even for experiments on homogeneous cell populations using genetically identical model organisms. A new gene set test procedure (CAMERA) is proposed based on the idea of estimating the inter-gene correlation from the data, and using it to adjust the gene set test statistic. An efficient procedure is developed for estimating the inter-gene correlation and characterizing its precision. CAMERA is shown to control the type I error rate correctly regardless of inter-gene correlations, yet retains excellent power for detecting genuine differential expression. Analysis of breast cancer data shows that CAMERA recovers known relationships between tumor subtypes in very convincing terms. CAMERA can be used to analyze specified sets or as a pathway analysis tool using a database of molecular signatures.  相似文献   

3.
ABSTRACT: BACKGROUND: Gene-set enrichment analyses (GEA or GSEA) are commonly used for biological characterization of an experimental gene-set. This is done by finding known functional categories, such as pathways or Gene Ontology terms, that are over-represented in the experimental set; the assessment is based on an overlap statistic. Rich biological information in terms of gene interaction network is now widely available, but this topological information is not used by GEA, so there is a need for methods that exploit this type of information in high-throughput data analysis. RESULTS: We developed a method of network enrichment analysis (NEA) that extends the overlap statistic in GEA to network links between genes in the experimental set and those in the functional categories. For the crucial step in statistical inference, we developed a fast network randomization algorithm in order to obtain the distribution of any network statistic under the null hypothesis of no association between an experimental gene-set and a functional category. We illustrate the NEA method using gene and protein expression data from a lung cancer study. CONCLUSIONS: The results indicate that the NEA method is more powerful than the traditional GEA, primarily because the relationships between gene sets were more strongly captured by network connectivity rather than by simple overlaps.  相似文献   

4.
Enrichment analysis of gene sets is a popular approach that provides a functional interpretation of genome-wide expression data. Existing tests are affected by inter-gene correlations, resulting in a high Type I error. The most widely used test, Gene Set Enrichment Analysis, relies on computationally intensive permutations of sample labels to generate a null distribution that preserves gene–gene correlations. A more recent approach, CAMERA, attempts to correct for these correlations by estimating a variance inflation factor directly from the data. Although these methods generate P-values for detecting gene set activity, they are unable to produce confidence intervals or allow for post hoc comparisons. We have developed a new computational framework for Quantitative Set Analysis of Gene Expression (QuSAGE). QuSAGE accounts for inter-gene correlations, improves the estimation of the variance inflation factor and, rather than evaluating the deviation from a null hypothesis with a P-value, it quantifies gene-set activity with a complete probability density function. From this probability density function, P-values and confidence intervals can be extracted and post hoc analysis can be carried out while maintaining statistical traceability. Compared with Gene Set Enrichment Analysis and CAMERA, QuSAGE exhibits better sensitivity and specificity on real data profiling the response to interferon therapy (in chronic Hepatitis C virus patients) and Influenza A virus infection. QuSAGE is available as an R package, which includes the core functions for the method as well as functions to plot and visualize the results.  相似文献   

5.
Pathway analysis of genome-wide association studies (GWAS) offer a unique opportunity to collectively evaluate genetic variants with effects that are too small to be detected individually. We applied a pathway analysis to a bladder cancer GWAS containing data from 3,532 cases and 5,120 controls of European background (n = 5 studies). Thirteen hundred and ninety-nine pathways were drawn from five publicly available resources (Biocarta, Kegg, NCI-PID, HumanCyc, and Reactome), and we constructed 22 additional candidate pathways previously hypothesized to be related to bladder cancer. In total, 1421 pathways, 5647 genes and ∼90,000 SNPs were included in our study. Logistic regression model adjusting for age, sex, study, DNA source, and smoking status was used to assess the marginal trend effect of SNPs on bladder cancer risk. Two complementary pathway-based methods (gene-set enrichment analysis [GSEA], and adapted rank-truncated product [ARTP]) were used to assess the enrichment of association signals within each pathway. Eighteen pathways were detected by either GSEA or ARTP at P≤0.01. To minimize false positives, we used the I2 statistic to identify SNPs displaying heterogeneous effects across the five studies. After removing these SNPs, seven pathways (‘Aromatic amine metabolism’ [PGSEA = 0.0100, PARTP = 0.0020], ‘NAD biosynthesis’ [PGSEA = 0.0018, PARTP = 0.0086], ‘NAD salvage’ [PARTP = 0.0068], ‘Clathrin derived vesicle budding’ [PARTP = 0.0018], ‘Lysosome vesicle biogenesis’ [PGSEA = 0.0023, PARTP<0.00012], ’Retrograde neurotrophin signaling’ [PGSEA = 0.00840], and ‘Mitotic metaphase/anaphase transition’ [PGSEA = 0.0040]) remained. These pathways seem to belong to three fundamental cellular processes (metabolic detoxification, mitosis, and clathrin-mediated vesicles). Identification of the aromatic amine metabolism pathway provides support for the ability of this approach to identify pathways with established relevance to bladder carcinogenesis.  相似文献   

6.

Background

Polycystic ovary syndrome (PCOS) is one of the most common endocrine disorders in women of reproductive age, and it is affected by both environmental and genetic factors. Although the genetic component of PCOS is evident, studies aiming to identify susceptibility genes have shown controversial results. This study conducted a pathway-based analysis using a dataset obtained through a genome-wide association study (GWAS) to elucidate the biological pathways that contribute to PCOS susceptibility and the associated genes.

Methods

We used GWAS data on 636,797 autosomal single nucleotide polymorphisms (SNPs) from 1,221 individuals (432 PCOS patients and 789 controls) for analysis. A pathway analysis was conducted using meta-analysis gene-set enrichment of variant associations (MAGENTA). Top-ranking pathways or gene sets associated with PCOS were identified, and significant genes within the pathways were analyzed.

Results

The pathway analysis of the GWAS dataset identified significant pathways related to oocyte meiosis and the regulation of insulin secretion by acetylcholine and free fatty acids (all nominal gene-set enrichment analysis (GSEA) P-values < 0.05). In addition, INS, GNAQ, STXBP1, PLCB3, PLCB2, SMC3 and PLCZ1 were significant genes observed within the biological pathways (all gene P-values < 0.05).

Conclusions

By applying MAGENTA pathway analysis to PCOS GWAS data, we identified significant pathways and candidate genes involved in PCOS. Our findings may provide new leads for understanding the mechanisms underlying the development of PCOS.  相似文献   

7.

Background  

Gene-set analysis evaluates the expression of biological pathways, or a priori defined gene sets, rather than that of individual genes, in association with a binary phenotype, and is of great biologic interest in many DNA microarray studies. Gene Set Enrichment Analysis (GSEA) has been applied widely as a tool for gene-set analyses. We describe here some critical problems with GSEA and propose an alternative method by extending the individual-gene analysis method, Significance Analysis of Microarray (SAM), to gene-set analyses (SAM-GS).  相似文献   

8.
9.
RNA-Seq technologies are quickly revolutionizing genomic studies, and statistical methods for RNA-seq data are under continuous development. Timely review and comparison of the most recently proposed statistical methods will provide a useful guide for choosing among them for data analysis. Particular interest surrounds the ability to detect differential expression (DE) in genes. Here we compare four recently proposed statistical methods, edgeR, DESeq, baySeq, and a method with a two-stage Poisson model (TSPM), through a variety of simulations that were based on different distribution models or real data. We compared the ability of these methods to detect DE genes in terms of the significance ranking of genes and false discovery rate control. All methods compared are implemented in freely available software. We also discuss the availability and functions of the currently available versions of these software.  相似文献   

10.
RNA-seq is now the technology of choice for genome-wide differential gene expression experiments, but it is not clear how many biological replicates are needed to ensure valid biological interpretation of the results or which statistical tools are best for analyzing the data. An RNA-seq experiment with 48 biological replicates in each of two conditions was performed to answer these questions and provide guidelines for experimental design. With three biological replicates, nine of the 11 tools evaluated found only 20%–40% of the significantly differentially expressed (SDE) genes identified with the full set of 42 clean replicates. This rises to >85% for the subset of SDE genes changing in expression by more than fourfold. To achieve >85% for all SDE genes regardless of fold change requires more than 20 biological replicates. The same nine tools successfully control their false discovery rate at ≲5% for all numbers of replicates, while the remaining two tools fail to control their FDR adequately, particularly for low numbers of replicates. For future RNA-seq experiments, these results suggest that at least six biological replicates should be used, rising to at least 12 when it is important to identify SDE genes for all fold changes. If fewer than 12 replicates are used, a superior combination of true positive and false positive performances makes edgeR and DESeq2 the leading tools. For higher replicate numbers, minimizing false positives is more important and DESeq marginally outperforms the other tools.  相似文献   

11.
Fistulifera sp. strain JPCC DA0580 is a newly sequenced pennate diatom that is capable of simultaneously growing and accumulating lipids. This is a unique trait, not found in other related microalgae so far. It is able to accumulate between 40 to 60% of its cell weight in lipids, making it a strong candidate for the production of biofuel. To investigate this characteristic, we used RNA-Seq data gathered at four different times while Fistulifera sp. strain JPCC DA0580 was grown in oil accumulating and non-oil accumulating conditions. We then adapted gene set enrichment analysis (GSEA) to investigate the relationship between the difference in gene expression of 7,822 genes and metabolic functions in our data. We utilized information in the KEGG pathway database to create the gene sets and changed GSEA to use re-sampling so that data from the different time points could be included in the analysis. Our GSEA method identified photosynthesis, lipid synthesis and amino acid synthesis related pathways as processes that play a significant role in oil production and growth in Fistulifera sp. strain JPCC DA0580. In addition to GSEA, we visualized the results by creating a network of compounds and reactions, and plotted the expression data on top of the network. This made existing graph algorithms available to us which we then used to calculate a path that metabolizes glucose into triacylglycerol (TAG) in the smallest number of steps. By visualizing the data this way, we observed a separate up-regulation of genes at different times instead of a concerted response. We also identified two metabolic paths that used less reactions than the one shown in KEGG and showed that the reactions were up-regulated during the experiment. The combination of analysis and visualization methods successfully analyzed time-course data, identified important metabolic pathways and provided new hypotheses for further research.  相似文献   

12.

Background

Multiple layers of genetic and epigenetic variability are being simultaneously explored in an increasing number of health studies. We summarize here different approaches applied in the Data Mining and Machine Learning group at the GAW20 to integrate genome-wide genotype and methylation array data.

Results

We provide a non-intimidating introduction to some frequently used methods to investigate high-dimensional molecular data and compare the different approaches tried by group members: random forest, deep learning, cluster analysis, mixed models, and gene-set enrichment analysis. Group contributions were quite heterogeneous regarding investigated data sets (real vs simulated), conducted data quality control and assessed phenotypes (eg, metabolic syndrome vs relative differences of log-transformed triglyceride concentrations before and after fenofibrate treatment). However, some common technical issues were detected, leading to practical recommendations.

Conclusions

Different sources of correlation were identified by group members, including population stratification, family structure, batch effects, linkage disequilibrium and correlation of methylation values at neighboring cytosine-phosphate-guanine (CpG) sites, and the majority of applied approaches were able to take into account identified correlation structures. The ability to efficiently deal with high-dimensional omics data, and the model free nature of the approaches that did not require detailed model specifications were clearly recognized as the main strengths of applied methods. A limitation of random forest is its sensitivity to highly correlated variables. The parameter setup and the interpretation of results from deep learning methods, in particular deep neural networks, can be extremely challenging. Cluster analysis and mixed models may need some predimension reduction based on existing literature, data filtering, and supplementary statistical methods, and gene-set enrichment analysis requires biological insight.
  相似文献   

13.
14.
A test-statistic typically employed in the gene set enrichment analysis (GSEA) prevents this method from being genuinely multivariate. In particular, this statistic is insensitive to changes in the correlation structure of the gene sets of interest. The present paper considers the utility of an alternative test-statistic in designing the confirmatory component of the GSEA. This statistic is based on a pertinent distance between joint distributions of expression levels of genes included in the set of interest. The null distribution of the proposed test-statistic, known as the multivariate N-statistic, is obtained by permuting group labels. Our simulation studies and analysis of biological data confirm the conjecture that the N-statistic is a much better choice for multivariate significance testing within the framework of the GSEA. We also discuss some other aspects of the GSEA paradigm and suggest new avenues for future research.  相似文献   

15.
16.
Pathway analysis has been proposed as a complement to single SNP analyses in GWAS. This study compared pathway analysis methods using two lung cancer GWAS data sets based on four studies: one a combined data set from Central Europe and Toronto (CETO); the other a combined data set from Germany and MD Anderson (GRMD). We searched the literature for pathway analysis methods that were widely used, representative of other methods, and had available software for performing analysis. We selected the programs EASE, which uses a modified Fishers Exact calculation to test for pathway associations, GenGen (a version of Gene Set Enrichment Analysis (GSEA)), which uses a Kolmogorov-Smirnov-like running sum statistic as the test statistic, and SLAT, which uses a p-value combination approach. We also included a modified version of the SUMSTAT method (mSUMSTAT), which tests for association by averaging χ2 statistics from genotype association tests. There were nearly 18000 genes available for analysis, following mapping of more than 300,000 SNPs from each data set. These were mapped to 421 GO level 4 gene sets for pathway analysis. Among the methods designed to be robust to biases related to gene size and pathway SNP correlation (GenGen, mSUMSTAT and SLAT), the mSUMSTAT approach identified the most significant pathways (8 in CETO and 1 in GRMD). This included a highly plausible association for the acetylcholine receptor activity pathway in both CETO (FDR≤0.001) and GRMD (FDR = 0.009), although two strong association signals at a single gene cluster (CHRNA3-CHRNA5-CHRNB4) drive this result, complicating its interpretation. Few other replicated associations were found using any of these methods. Difficulty in replicating associations hindered our comparison, but results suggest mSUMSTAT has advantages over the other approaches, and may be a useful pathway analysis tool to use alongside other methods such as the commonly used GSEA (GenGen) approach.  相似文献   

17.
Objective: We aimed to explore the prognostic value of a disintegrin and metalloproteinase with thrombospondin motifs (ADAMTS) genes in gastric cancer (GC). Methods: The RNA-sequencing (RNA-seq) expression data for 351 GC patients and other relevant clinical data were acquired from The Cancer Genome Atlas (TCGA). Survival analysis and a genome-wide gene set enrichment analysis (GSEA) were performed to define the underlying molecular value of the ADAMTS genes in GC development. Besides, qRT-PCR and immunohistochemistry were all employed to validate the relationship between the expression of these genes and GC patient prognosis. Results: The Log rank test with both Cox regression and Kaplan–Meier survival analyses showed that ADAMTS6 expression profile correlated with the GC patients clinical outcome. Patients with a high expression of ADAMTS6 were associated with poor overall survival (OS). Comprehensive survival analysis of the ADAMTS genes suggests that ADAMTS6 might be an independent predictive factor for the OS in patients with GC. Besides, GSEA demonstrated that ADAMTS6 might be involved in multiple biological processes and pathways, such as the vascular endothelial growth factor A (VEGFA), kirsten rat sarcoma viral oncogene (KRAS), tumor protein P53, c-Jun N-terminal kinase (JNK), cadherin (CDH1) or tumor necrosis factor (TNF) pathways. It was also confirmed by immunohistochemistry and qRT-PCR that ADAMTS6 is highly expressed in GC, which may be related to the prognosis of GC patients. Conclusion: In summary, our study demonstrated that ADAMTS6 gene could be used as a potential molecular marker for GC prognosis.  相似文献   

18.
19.

Background

Deviations in the amount of genomic content that arise during tumorigenesis, called copy number alterations, are structural rearrangements that can critically affect gene expression patterns. Additionally, copy number alteration profiles allow insight into cancer discrimination, progression and complexity. On data obtained from high-throughput sequencing, improving quality through GC bias correction and keeping false positives to a minimum help build reliable copy number alteration profiles.

Results

We introduce seqCNA, a parallelized R package for an integral copy number analysis of high-throughput sequencing cancer data. The package includes novel methodology on (i) filtering, reducing false positives, and (ii) GC content correction, improving copy number profile quality, especially under great read coverage and high correlation between GC content and copy number. Adequate analysis steps are automatically chosen based on availability of paired-end mapping, matched normal samples and genome annotation.

Conclusions

seqCNA, available through Bioconductor, provides accurate copy number predictions in tumoural data, thanks to the extensive filtering and better GC bias correction, while providing an integrated and parallelized workflow.

Electronic supplementary material

The online version of this article (doi:10.1186/1471-2164-15-178) contains supplementary material, which is available to authorized users.  相似文献   

20.
Molecular identification of mixed‐species pollen samples has a range of applications in various fields of research. To date, such molecular identification has primarily been carried out via amplicon sequencing, but whole‐genome shotgun (WGS) sequencing of pollen DNA has potential advantages, including (1) more genetic information per sample and (2) the potential for better quantitative matching. In this study, we tested the performance of WGS sequencing methodology and publicly available reference sequences in identifying species and quantifying their relative abundance in pollen mock communities. Using mock communities previously analyzed with DNA metabarcoding, we sequenced approximately 200Mbp for each sample using Illumina HiSeq and MiSeq. Taxonomic identifications were based on the Kraken k‐mer identification method with reference libraries constructed from full‐genome and short read archive data from the NCBI database. We found WGS to be a reliable method for taxonomic identification of pollen with near 100% identification of species in mixtures but generating higher rates of false positives (reads not identified to the correct taxon at the required taxonomic level) relative to rbcL and ITS2 amplicon sequencing. For quantification of relative species abundance, WGS data provided a stronger correlation between pollen grain proportion and sequence read proportion, but diverged more from a 1:1 relationship, likely due to the higher rate of false positives. Currently, a limitation of WGS‐based pollen identification is the lack of representation of plant diversity in publicly available genome databases. As databases improve and costs drop, we expect that eventually genomics methods will become the methods of choice for species identification and quantification of mixed‐species pollen samples.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号