1.

Recursive Cluster Elimination (RCE) for classification and feature selection from gene expression data

Malik Yousef Segun Jung Louise C Showe Michael K Showe 《BMC bioinformatics》2007,8(1):144

Background

Classification studies using gene expression datasets are usually based on small numbers of samples and tens of thousands of genes. Selecting the genes that are important for distinguishing the sample classes being compared poses a challenging problem in high-dimensional data analysis. We describe a new procedure for selecting significant genes that uses recursive cluster elimination (RCE) rather than recursive feature elimination (RFE). We have tested this algorithm on six datasets and compared its performance with that of two related classification procedures with RFE.

2.

Classification and biomarker identification using gene network modules and support vector machines

Malik Yousef Mohamed Ketany Larry Manevitz Louise C Showe Michael K Showe 《BMC bioinformatics》2009,10(1):337

Background

Classification using microarray datasets is usually based on a small number of samples for which tens of thousands of gene expression measurements have been obtained. The selection of the genes most significant to the classification problem is a challenging issue in high-dimensional data analysis and interpretation. A previous study with SVM-RCE (Recursive Cluster Elimination) suggested that classification based on groups of correlated genes sometimes exhibits better performance than classification using single genes. Large databases of gene interaction networks provide an important resource for the analysis of genetic phenomena and for classification studies using interacting genes.

3.

Missing value imputation for microarray gene expression data using histone acetylation information

Qian Xiang Xianhua Dai Yangyang Deng Caisheng He Jiang Wang Jihua Feng Zhiming Dai 《BMC bioinformatics》2008,9(1):252

Background

Accurately estimating missing values in microarray data is an important pre-processing step, because complete datasets are required in numerous expression profile analyses in bioinformatics. Although several methods have been suggested, their performance is not satisfactory for datasets with high missing percentages.

4.

SplicerAV: a tool for mining microarray expression data for changes in RNA processing

Timothy J Robinson Michaela A Dinan Mark Dewhirst Mariano A Garcia-Blanco James L Pearson 《BMC bioinformatics》2010,11(1):108

Background

Over the past two decades more than fifty thousand unique clinical and biological samples have been assayed using the Affymetrix HG-U133 and HG-U95 GeneChip microarray platforms. This substantial repository has been used extensively to characterize changes in gene expression between biological samples, but has not been previously mined en masse for changes in mRNA processing. We explored the possibility of using HG-U133 microarray data to identify changes in alternative mRNA processing in several available archival datasets.

5.

Probabilistic prediction and ranking of human protein-protein interactions

Michelle S Scott Geoffrey J Barton 《BMC bioinformatics》2007,8(1):239

Background

Although the prediction of protein-protein interactions has been extensively investigated for yeast, few such datasets exist for the far larger human proteome. Furthermore, it has recently been estimated that the overall average false positive rate of available computational and high-throughput experimental interaction datasets is as high as 90%.

6.

GenomeGraphs: integrated genomic data visualization with R

Steffen Durinck James Bullard Paul T Spellman Sandrine Dudoit 《BMC bioinformatics》2009,10(1):2-9

Background

Biological studies involve a growing number of distinct high-throughput experiments to characterize samples of interest. There is a lack of methods to visualize these different genomic datasets in a versatile manner. In addition, genomic data analysis requires integrated visualization of experimental data along with constantly changing genomic annotation and statistical analyses.

7.

FastGroupII: A web-based bioinformatics platform for analyses of large 16S rDNA libraries

Yanan Yu Mya Breitbart Pat McNairnie Forest Rohwer 《BMC bioinformatics》2006,7(1):57

Background

High-throughput sequencing makes it possible to rapidly obtain thousands of 16S rDNA sequences from environmental samples. Bioinformatic tools for the analyses of large 16S rDNA sequence databases are needed to comprehensively describe and compare these datasets.

8.

Asymmetric microarray data produces gene lists highly predictive of research literature on multiple cancer types

Noor B Dawany Aydin Tozeren 《BMC bioinformatics》2010,11(1):483

Background

Much of the public access cancer microarray data is asymmetric, belonging to datasets containing no samples from normal tissue. Asymmetric data cannot be used in standard meta-analysis approaches (such as the inverse variance method) to obtain large sample sizes for statistical power enrichment. Noting that plenty of normal tissue microarray samples exist in studies not involving cancer, we investigated the viability and accuracy of an integrated microarray analysis approach based on significance analysis of microarrays (merged SAM) using a collection of data from separate diseased and normal samples.

9.

Alkahest NuclearBLAST: a user-friendly BLAST management and analysis system

Stephen E Diener Thomas D Houfek Sam E Kalat DE Windham Mark Burke Charles Opperman Ralph A Dean 《BMC bioinformatics》2005,6(1):147

Background

Sequencing of EST and BAC end datasets is no longer limited to large research groups. Drops in per-base pricing have made high throughput sequencing accessible to individual investigators. However, there are few options available which provide a free and user-friendly solution to the BLAST result storage and data mining needs of biologists.

10.

A computational method for estimating the PCR duplication rate in DNA and RNA-seq experiments

Vikas Bansal 《BMC bioinformatics》2017,18(3):43

Background

PCR amplification is an important step in the preparation of DNA sequencing libraries prior to high-throughput sequencing. PCR amplification introduces redundant reads in the sequence data, and estimating the PCR duplication rate is important to assess the frequency of such reads. Existing computational methods do not distinguish PCR duplicates from “natural” read duplicates that represent independent DNA fragments and therefore overestimate the PCR duplication rate for DNA-seq and RNA-seq experiments.

Results

In this paper, we present a computational method to estimate the average PCR duplication rate of high-throughput sequence datasets that accounts for natural read duplicates by leveraging heterozygous variants in an individual genome. Analysis of simulated data and exome sequence data from the 1000 Genomes project demonstrated that our method can accurately estimate the PCR duplication rate on paired-end as well as single-end read datasets which contain a high proportion of natural read duplicates. Further, analysis of exome datasets prepared using the Nextera library preparation method indicated that 45–50% of read duplicates correspond to natural read duplicates likely due to fragmentation bias. Finally, analysis of RNA-seq datasets from individuals in the 1000 Genomes project demonstrated that 70–95% of read duplicates observed in such datasets correspond to natural duplicates sampled from genes with high expression and identified outlier samples with a 2-fold greater PCR duplication rate than other samples.

Conclusions

The method described here is a useful tool for estimating the PCR duplication rate of high-throughput sequence datasets and for assessing the fraction of read duplicates that correspond to natural read duplicates. An implementation of the method is available at https://github.com/vibansal/PCRduplicates.

11.

DeBi: Discovering Differentially Expressed Biclusters using a Frequent Itemset Approach

Serin A Vingron M 《Algorithms for molecular biology : AMB》2011,6(1):18-12

12.

Stratification bias in low signal microarray studies

Brian J Parker Simon Günter Justin Bedo 《BMC bioinformatics》2007,8(1):326

Background

When analysing microarray and other small sample size biological datasets, care is needed to avoid various biases. We analyse a form of bias, stratification bias, that can substantially affect analyses using sample-reuse validation techniques and lead to inaccurate results. This bias is due to imperfect stratification of samples in the training and test sets and the dependency between these stratification errors, i.e. the variations in class proportions in the training and test sets are negatively correlated.
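The negative dependence between training-set and test-set class proportions can be illustrated with a small toy experiment. The sketch below is not taken from the paper: it assumes a perfectly balanced two-class dataset, leave-one-out cross-validation, and a hypothetical classifier (`loocv_majority_accuracy` is an illustrative name) that ignores the features and always predicts the majority class of its training fold. Removing one sample from a class makes that class the minority in the training fold, so the prediction is always wrong for the held-out sample.

```python
def loocv_majority_accuracy(labels):
    """Leave-one-out accuracy of a classifier that always predicts the
    majority class of its training fold (ties go to the smaller label)."""
    correct = 0
    for i, held_out in enumerate(labels):
        # Training fold: every sample except the held-out one.
        train = labels[:i] + labels[i + 1:]
        counts = {}
        for y in train:
            counts[y] = counts.get(y, 0) + 1
        # Majority class of the training fold; ties broken by label order.
        prediction = min(counts, key=lambda c: (-counts[c], c))
        correct += prediction == held_out
    return correct / len(labels)


labels = [0] * 20 + [1] * 20  # balanced toy dataset, 20 samples per class
print(loocv_majority_accuracy(labels))  # 0.0, not the 0.5 expected by chance
```

In this toy case an uninformative classifier scores 0% instead of the 50% chance level, purely because each held-out class is under-represented in its own training fold.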

13.

AutoSOME: a clustering method for identifying gene expression modules without prior knowledge of cluster number

Aaron M Newman James B Cooper 《BMC bioinformatics》2010,11(1):117

Background

Clustering the information content of large high-dimensional gene expression datasets has widespread application in "omics" biology. Unfortunately, the underlying structure of these natural datasets is often fuzzy, and the computational identification of data clusters generally requires knowledge about cluster number and geometry.

14.

Characteristics of predictor sets found using differential prioritization

Chia Huey Ooi Madhu Chetty Shyh Wei Teng 《Algorithms for molecular biology : AMB》2007,2(1):7-21

Background

Feature selection plays an undeniably important role in classification problems involving high dimensional datasets such as microarray datasets. For filter-based feature selection, two well-known criteria used in forming predictor sets are relevance and redundancy. However, there is a third criterion which is at least as important as the other two in affecting the efficacy of the resulting predictor sets. This criterion is the degree of differential prioritization (DDP), which varies the emphases on relevance and redundancy depending on the value of the DDP. Previous empirical works on publicly available microarray datasets have confirmed the effectiveness of the DDP in molecular classification. We now propose to establish the fundamental strengths and merits of the DDP-based feature selection technique. This is to be done through a simulation study which involves vigorous analyses of the characteristics of predictor sets found using different values of the DDP from toy datasets designed to mimic real-life microarray datasets.

15.

Differential prioritization between relevance and redundancy in correlation-based feature selection techniques for multiclass gene expression data

Chia Huey Ooi Madhu Chetty Shyh Wei Teng 《BMC bioinformatics》2006,7(1):320

Background

Due to the large number of genes in a typical microarray dataset, feature selection looks set to play an important role in reducing noise and computational cost in gene expression-based tissue classification while improving accuracy at the same time. Surprisingly, this does not appear to be the case for all multiclass microarray datasets. The reason is that many feature selection techniques applied on microarray datasets are either rank-based and hence do not take into account correlations between genes, or are wrapper-based, which require high computational cost, and often yield difficult-to-reproduce results. In studies where correlations between genes are considered, attempts to establish the merit of the proposed techniques are hampered by evaluation procedures which are less than meticulous, resulting in overly optimistic estimates of accuracy.

16.

An improved statistical model for taxonomic assignment of metagenomics

Yujing Yao Zhezhen Jin Joseph H Lee 《BMC genetics》2018,19(1):98

Background

With the advances in the next-generation sequencing technologies, researchers can now rapidly examine the composition of samples from humans and their surroundings. To enhance the accuracy of taxonomy assignments in metagenomic samples, we developed a method that allows multiple mismatch probabilities from different genomes.

Results

We extended the algorithm of taxonomic assignment of metagenomic sequence reads (TAMER) by developing an improved method that can set a different mismatch probability for each genome rather than imposing a single parameter for all genomes, thereby obtaining a greater degree of accuracy. This method, which we call TADIP (Taxonomic Assignment of metagenomics based on DIfferent Probabilities), was comprehensively tested on simulated and real datasets. The results indicate that TADIP improved on the performance of TAMER, especially in large sample size datasets with high complexity.

Conclusions

TADIP was developed as a statistical model to improve the estimation accuracy of taxonomy assignments. Based on its varying mismatch probability setting and correlated variance matrix setting, its performance was enhanced for high complexity samples when compared with TAMER.

17.

A benchmark for statistical microarray data analysis that preserves actual biological and technical variance

Benoît De Hertogh Bertrand De Meulder Fabrice Berger Michael Pierre Eric Bareke Anthoula Gaigneaux Eric Depiereux 《BMC bioinformatics》2010,11(1):17

Background

Recent reanalysis of spike-in datasets underscored the need for new and more accurate benchmark datasets for statistical microarray analysis. We present here a fresh method using biologically-relevant data to evaluate the performance of statistical methods.

18.

ITS as an environmental DNA barcode for fungi: an in silico approach reveals potential PCR biases

Eva Bellemain Tor Carlsen Christian Brochmann Eric Coissac Pierre Taberlet Håvard Kauserud 《BMC microbiology》2010,10(1):189

Background

During the last 15 years the internal transcribed spacer (ITS) of nuclear DNA has been used as a target for analyzing fungal diversity in environmental samples, and has recently been selected as the standard marker for fungal DNA barcoding. In this study we explored the potential amplification biases that various commonly utilized ITS primers might introduce during amplification of different parts of the ITS region in samples containing mixed templates ('environmental barcoding'). We performed in silico PCR analyses with commonly used primer combinations using various ITS datasets obtained from public databases as templates.

19.

An enhanced probabilistic LDA for multi-class brain computer interface

Xu P Yang P Lei X Yao D 《PloS one》2011,6(1):e14634

Background

There is a growing interest in the study of signal processing and machine learning methods, which may make the brain computer interface (BCI) a new communication channel. A variety of classification methods have been utilized to convert the brain information into control commands. However, most of the methods only produce uncalibrated values and uncertain results.

Methodology/Principal Findings

In this study, we presented a probabilistic method, “enhanced BLDA” (EBLDA), for multi-class motor imagery BCI, which utilized Bayesian linear discriminant analysis (BLDA) with probabilistic output to improve the classification performance. EBLDA builds a new classifier that enlarges the training dataset by adding test samples with high probability. EBLDA is based on the hypothesis that unlabeled samples with high probability provide valuable information to enhance the learning process and generate a classifier with refined decision boundaries. To investigate the performance of EBLDA, we first used carefully designed simulated datasets to study how EBLDA works. Then, we adopted a real BCI dataset for further evaluation. The current study shows that: 1) probabilistic information can improve the performance of BCI for subjects with a high kappa coefficient; 2) with supplementary training samples from the test samples of high probability, EBLDA is significantly better than BLDA in classification, especially for small training datasets, in which EBLDA can obtain a refined decision boundary by a shift of the BLDA decision boundary with the support of the information from test samples.

Conclusions/Significance

The proposed EBLDA could potentially reduce training effort. It is therefore valuable for realizing an effective online BCI system, especially for multi-class BCI systems.
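The idea of enlarging the training set with confidently classified test samples can be sketched as a single self-training step. The code below is an illustration under stated assumptions, not the authors' EBLDA: it replaces BLDA with a much simpler one-dimensional classifier (class means, equal priors, a shared variance `var`), and the function names and the `threshold` value of 0.9 are hypothetical choices made for the example.

```python
import math


def fit_means(xs, ys):
    """Estimate one mean per class from 1-D features."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {c: sums[c] / counts[c] for c in sums}


def posterior(x, means, var=1.0):
    """Class posteriors assuming equal priors and a shared variance."""
    logs = {c: -(x - m) ** 2 / (2 * var) for c, m in means.items()}
    top = max(logs.values())  # subtract the max for numerical stability
    weights = {c: math.exp(v - top) for c, v in logs.items()}
    z = sum(weights.values())
    return {c: w / z for c, w in weights.items()}


def self_train(train_x, train_y, test_x, threshold=0.9):
    """Refit the class means after adding confidently labeled test samples."""
    means = fit_means(train_x, train_y)
    aug_x, aug_y = list(train_x), list(train_y)
    for x in test_x:
        post = posterior(x, means)
        label, prob = max(post.items(), key=lambda kv: kv[1])
        if prob >= threshold:  # keep only high-probability predictions
            aug_x.append(x)
            aug_y.append(label)
    return fit_means(aug_x, aug_y)


initial = fit_means([0.0, 4.0], [0, 1])
refined = self_train([0.0, 4.0], [0, 1], [-1.0, -0.5, 5.0, 5.5, 2.0])
print(initial, refined)
```

In this toy run the ambiguous test point at 2.0 has posterior 0.5 for each class and is left out, while the four confident points are added. The midpoint between the class means, which acts as the decision boundary here, moves from 2.0 to roughly 2.17, a small-scale analogue of the boundary shift the abstract describes.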

20.

MASPECTRAS: a platform for management and analysis of proteomics LC-MS/MS data

Jürgen Hartler Gerhard G Thallinger Gernot Stocker Alexander Sturn Thomas R Burkard Erik Körner Robert Rader Andreas Schmidt Karl Mechtler Zlatko Trajanoski 《BMC bioinformatics》2007,8(1):197

Background

The advancements of proteomics technologies have led to a rapid increase in the number, size and rate at which datasets are generated. Managing and extracting valuable information from such datasets requires the use of data management platforms and computational approaches.