DupChecker: a bioconductor package for checking high-throughput genomic data redundancy in meta-analysis
Authors: Quanhu Sheng, Yu Shyr, Xi Chen
Institution: Department of Cancer Biology, Vanderbilt University School of Medicine, Nashville, TN 37232, USA; Center for Quantitative Sciences, Vanderbilt University School of Medicine, Nashville, TN 37232, USA; Department of Biostatistics, Vanderbilt University School of Medicine, Nashville, TN 37232, USA
Abstract:

Background

Meta-analysis has become a popular approach for high-throughput genomic data analysis because it can often significantly increase the power to detect biological signals or patterns in datasets. However, when publicly available databases are used for meta-analysis, sample duplication is a frequently encountered problem, especially for gene expression data. Failing to remove duplicates can lead to false-positive findings, misleading clustering patterns, or model over-fitting in the subsequent data analysis.

Results

We developed a Bioconductor package, DupChecker, that efficiently identifies duplicated samples by generating MD5 fingerprints for raw data. A real-data example demonstrates the usage and output of the package.
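
The fingerprinting idea can be illustrated with a minimal Python sketch. This is not the DupChecker API (the package itself is written for R/Bioconductor); the function names `md5_fingerprint` and `find_duplicates` are hypothetical, and the sketch assumes each sample is stored as one raw data file on disk. Files whose bytes are identical receive the same MD5 digest, so grouping files by digest exposes duplicated samples.

```python
import hashlib
from collections import defaultdict


def md5_fingerprint(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(paths):
    """Group file paths by MD5 fingerprint; keep only groups with more than one file."""
    groups = defaultdict(list)
    for path in paths:
        groups[md5_fingerprint(path)].append(str(path))
    return {fp: sorted(files) for fp, files in groups.items() if len(files) > 1}
```

Calling `find_duplicates` on the raw files pooled from several datasets returns a dictionary that maps each shared fingerprint to the files carrying it. A byte-level digest only matches files that are exactly identical, so a sample that was re-processed or re-exported in another format would not be flagged by this sketch.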

Conclusions

Researchers may not pay enough attention to checking for and removing duplicated samples, and the resulting data contamination could make the results or conclusions of a meta-analysis questionable. We suggest applying DupChecker to examine all gene expression datasets before any data analysis step.

Electronic supplementary material

The online version of this article (doi:10.1186/1471-2105-15-323) contains supplementary material, which is available to authorized users.