首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 46 毫秒
1.

Introduction

A common problem in metabolomics data analysis is the existence of a substantial number of missing values, which can complicate, bias, or even prevent certain downstream analyses. One of the most widely-used solutions to this problem is imputation of missing values using a k-nearest neighbors (kNN) algorithm to estimate missing metabolite abundances. kNN implicitly assumes that missing values are uniformly distributed at random in the dataset, but this is typically not true in metabolomics, where many values are missing because they are below the limit of detection of the analytical instrumentation.

Objectives

Here, we explore the impact of nonuniformly distributed missing values (missing not at random, or MNAR) on imputation performance. We present a new model for generating synthetic missing data and a new algorithm, No-Skip kNN (NS-kNN), that accounts for MNAR values to provide more accurate imputations.

Methods

We compare the imputation errors of the original kNN algorithm using two distance metrics, NS-kNN, and a recently developed algorithm KNN-TN, when applied to multiple experimental datasets with different types and levels of missing data.

Results

Our results show that NS-kNN typically outperforms kNN when at least 20–30% of missing values in a dataset are MNAR. NS-kNN also has lower imputation errors than KNN-TN on realistic datasets when at least 50% of missing values are MNAR.

Conclusion

Accounting for the nonuniform distribution of missing values in metabolomics data can significantly improve the results of imputation algorithms. The NS-kNN method imputes missing metabolomics data more accurately than existing kNN-based approaches when used on realistic datasets.
  相似文献   

2.

Background

Single-cell RNA sequencing (scRNA-seq) technology provides an effective way to study cell heterogeneity. However, due to the low capture efficiency and stochastic gene expression, scRNA-seq data often contains a high percentage of missing values. It has been showed that the missing rate can reach approximately 30% even after noise reduction. To accurately recover missing values in scRNA-seq data, we need to know where the missing data is; how much data is missing; and what are the values of these data.

Methods

To solve these three problems, we propose a novel model with a hybrid machine learning method, namely, missing imputation for single-cell RNA-seq (MISC). To solve the first problem, we transformed it to a binary classification problem on the RNA-seq expression matrix. Then, for the second problem, we searched for the intersection of the classification results, zero-inflated model and false negative model results. Finally, we used the regression model to recover the data in the missing elements.

Results

We compared the raw data without imputation, the mean-smooth neighbor cell trajectory, MISC on chronic myeloid leukemia data (CML), the primary somatosensory cortex and the hippocampal CA1 region of mouse brain cells. On the CML data, MISC discovered a trajectory branch from the CP-CML to the BC-CML, which provides direct evidence of evolution from CP to BC stem cells. On the mouse brain data, MISC clearly divides the pyramidal CA1 into different branches, and it is direct evidence of pyramidal CA1 in the subpopulations. In the meantime, with MISC, the oligodendrocyte cells became an independent group with an apparent boundary.

Conclusions

Our results showed that the MISC model improved the cell type classification and could be instrumental to study cellular heterogeneity. Overall, MISC is a robust missing data imputation model for single-cell RNA-seq data.
  相似文献   

3.

Background

Non-Negative Matrix factorization has become an essential tool for feature extraction in a wide spectrum of applications. In the present work, our objective is to extend the applicability of the method to the case of missing and/or corrupted data due to outliers.

Results

An essential property for missing data imputation and detection of outliers is that the uncorrupted data matrix is low rank, i.e. has only a small number of degrees of freedom. We devise a new version of the Bregman proximal idea which preserves nonnegativity and mix it with the Augmented Lagrangian approach for simultaneous reconstruction of the features of interest and detection of the outliers using a sparsity promoting ? 1 penality.

Conclusions

An application to the analysis of gene expression data of patients with bladder cancer is finally proposed.
  相似文献   

4.

Introduction

The generic metabolomics data processing workflow is constructed with a serial set of processes including peak picking, quality assurance, normalisation, missing value imputation, transformation and scaling. The combination of these processes should present the experimental data in an appropriate structure so to identify the biological changes in a valid and robust manner.

Objectives

Currently, different researchers apply different data processing methods and no assessment of the permutations applied to UHPLC-MS datasets has been published. Here we wish to define the most appropriate data processing workflow.

Methods

We assess the influence of normalisation, missing value imputation, transformation and scaling methods on univariate and multivariate analysis of UHPLC-MS datasets acquired for different mammalian samples.

Results

Our studies have shown that once data are filtered, missing values are not correlated with m/z, retention time or response. Following an exhaustive evaluation, we recommend PQN normalisation with no missing value imputation and no transformation or scaling for univariate analysis. For PCA we recommend applying PQN normalisation with Random Forest missing value imputation, glog transformation and no scaling method. For PLS-DA we recommend PQN normalisation, KNN as the missing value imputation method, generalised logarithm transformation and no scaling. These recommendations are based on searching for the biologically important metabolite features independent of their measured abundance.

Conclusion

The appropriate choice of normalisation, missing value imputation, transformation and scaling methods differs depending on the data analysis method and the choice of method is essential to maximise the biological derivations from UHPLC-MS datasets.
  相似文献   

5.

Background

Untargeted mass spectrometry (MS)-based metabolomics data often contain missing values that reduce statistical power and can introduce bias in biomedical studies. However, a systematic assessment of the various sources of missing values and strategies to handle these data has received little attention. Missing data can occur systematically, e.g. from run day-dependent effects due to limits of detection (LOD); or it can be random as, for instance, a consequence of sample preparation.

Methods

We investigated patterns of missing data in an MS-based metabolomics experiment of serum samples from the German KORA F4 cohort (n?=?1750). We then evaluated 31 imputation methods in a simulation framework and biologically validated the results by applying all imputation approaches to real metabolomics data. We examined the ability of each method to reconstruct biochemical pathways from data-driven correlation networks, and the ability of the method to increase statistical power while preserving the strength of established metabolic quantitative trait loci.

Results

Run day-dependent LOD-based missing data accounts for most missing values in the metabolomics dataset. Although multiple imputation by chained equations performed well in many scenarios, it is computationally and statistically challenging. K-nearest neighbors (KNN) imputation on observations with variable pre-selection showed robust performance across all evaluation schemes and is computationally more tractable.

Conclusion

Missing data in untargeted MS-based metabolomics data occur for various reasons. Based on our results, we recommend that KNN-based imputation is performed on observations with variable pre-selection since it showed robust results in all evaluation schemes.
  相似文献   

6.

Background

The current literature establishes the importance of gene functional category and expression in promoting or suppressing duplicate gene loss after whole genome doubling in plants, a process known as fractionation. Inspired by studies that have reported gene expression to be the dominating factor in preventing duplicate gene loss, we analyzed the relative effect of functional category and expression.

Methods

We use multivariate methods to study data sets on gene retention, function and expression in rosids and asterids to estimate effects and assess their interaction.

Results

Our results suggest that the effect on duplicate gene retention fractionation by functional category and expression are independent and have no statistical interaction.

Conclusion

In plants, functional category is the more dominant factor in explaining duplicate gene loss.
  相似文献   

7.

Background

Microarray gene expression data are accumulating in public databases. The expression profiles contain valuable information for understanding human gene expression patterns. However, the effective use of public microarray data requires integrating the expression profiles from heterogeneous sources.

Results

In this study, we have compiled a compendium of microarray expression profiles of various human tissue samples. The microarray raw data generated in different research laboratories have been obtained and combined into a single dataset after data normalization and transformation. To demonstrate the usefulness of the integrated microarray data for studying human gene expression patterns, we have analyzed the dataset to identify potential tissue-selective genes. A new method has been proposed for genome-wide identification of tissue-selective gene targets using both microarray intensity values and detection calls. The candidate genes for brain, liver and testis-selective expression have been examined, and the results suggest that our approach can select some interesting gene targets for further experimental studies.

Conclusion

A computational approach has been developed in this study for combining microarray expression profiles from heterogeneous sources. The integrated microarray data can be used to investigate tissue-selective expression patterns of human genes.
  相似文献   

8.

Background

Microarray technologies produced large amount of data. The hierarchical clustering is commonly used to identify clusters of co-expressed genes. However, microarray datasets often contain missing values (MVs) representing a major drawback for the use of the clustering methods. Usually the MVs are not treated, or replaced by zero or estimated by the k-Nearest Neighbor (kNN) approach. The topic of the paper is to study the stability of gene clusters, defined by various hierarchical clustering algorithms, of microarrays experiments including or not MVs.

Results

In this study, we show that the MVs have important effects on the stability of the gene clusters. Moreover, the magnitude of the gene misallocations is depending on the aggregation algorithm. The most appropriate aggregation methods (e.g. complete-linkage and Ward) are highly sensitive to MVs, and surprisingly, for a very tiny proportion of MVs (e.g. 1%). In most of the case, the MVs must be replaced by expected values. The MVs replacement by the kNN approach clearly improves the identification of co-expressed gene clusters. Nevertheless, we observe that kNN approach is less suitable for the extreme values of gene expression.

Conclusion

The presence of MVs (even at a low rate) is a major factor of gene cluster instability. In addition, the impact depends on the hierarchical clustering algorithm used. Some methods should be used carefully. Nevertheless, the kNN approach constitutes one efficient method for restoring the missing expression gene values, with a low error level. Our study highlights the need of statistical treatments in microarray data to avoid misinterpretation.
  相似文献   

9.

Background

Identification of common genes associated with comorbid diseases can be critical in understanding their pathobiological mechanism. This work presents a novel method to predict missing common genes associated with a disease pair. Searching for missing common genes is formulated as an optimization problem to minimize network based module separation from two subgraphs produced by mapping genes associated with disease onto the interactome.

Results

Using cross validation on more than 600 disease pairs, our method achieves significantly higher average receiver operating characteristic ROC Score of 0.95 compared to a baseline ROC score 0.60 using randomized data.

Conclusion

Missing common genes prediction is aimed to complete gene set associated with comorbid disease for better understanding of biological intervention. It will also be useful for gene targeted therapeutics related to comorbid diseases. This method can be further considered for prediction of missing edges to complete the subgraph associated with disease pair.
  相似文献   

10.
11.
Xing  Linlin  Guo  Maozu  Liu  Xiaoyan  Wang  Chunyu  Wang  Lei  Zhang  Yin 《BMC genomics》2017,18(9):844-30

Background

The reconstruction of gene regulatory network (GRN) from gene expression data can discover regulatory relationships among genes and gain deep insights into the complicated regulation mechanism of life. However, it is still a great challenge in systems biology and bioinformatics. During the past years, numerous computational approaches have been developed for this goal, and Bayesian network (BN) methods draw most of attention among these methods because of its inherent probability characteristics. However, Bayesian network methods are time consuming and cannot handle large-scale networks due to their high computational complexity, while the mutual information-based methods are highly effective but directionless and have a high false-positive rate.

Results

To solve these problems, we propose a Candidate Auto Selection algorithm (CAS) based on mutual information and breakpoint detection to restrict the search space in order to accelerate the learning process of Bayesian network. First, the proposed CAS algorithm automatically selects the neighbor candidates of each node before searching the best structure of GRN. Then based on CAS algorithm, we propose a globally optimal greedy search method (CAS + G), which focuses on finding the highest rated network structure, and a local learning method (CAS + L), which focuses on faster learning the structure with little loss of quality.

Conclusion

Results show that the proposed CAS algorithm can effectively reduce the search space of Bayesian networks through identifying the neighbor candidates of each node. In our experiments, the CAS + G method outperforms the state-of-the-art method on simulation data for inferring GRNs, and the CAS + L method is significantly faster than the state-of-the-art method with little loss of accuracy. Hence, the CAS based methods effectively decrease the computational complexity of Bayesian network and are more suitable for GRN inference.
  相似文献   

12.

Introduction

Data processing is one of the biggest problems in metabolomics, given the high number of samples analyzed and the need of multiple software packages for each step of the processing workflow.

Objectives

Merge in the same platform the steps required for metabolomics data processing.

Methods

KniMet is a workflow for the processing of mass spectrometry-metabolomics data based on the KNIME Analytics platform.

Results

The approach includes key steps to follow in metabolomics data processing: feature filtering, missing value imputation, normalization, batch correction and annotation.

Conclusion

KniMet provides the user with a local, modular and customizable workflow for the processing of both GC–MS and LC–MS open profiling data.
  相似文献   

13.

Background

Differential gene expression is important to understand the biological differences between healthy and diseased states. Two common sources of differential gene expression data are microarray studies and the biomedical literature.

Methods

With the aid of text mining and gene expression analysis we have examined the comparative properties of these two sources of differential gene expression data.

Results

The literature shows a preference for reporting genes associated to higher fold changes in microarray data, rather than genes that are simply significantly differentially expressed. Thus, the resemblance between the literature and microarray data increases when the fold-change threshold for microarray data is increased. Moreover, the literature has a reporting preference for differentially expressed genes that (1) are overexpressed rather than underexpressed; (2) are overexpressed in multiple diseases; and (3) are popular in the biomedical literature at large. Additionally, the degree to which diseases are similar depends on whether microarray data or the literature is used to compare them. Finally, vaguely-qualified reports of differential expression magnitudes in the literature have only small correlation with microarray fold-change data.

Conclusions

Reporting biases of differential gene expression in the literature can be affecting our appreciation of disease biology and of the degree of similarity that actually exists between different diseases.
  相似文献   

14.

Background

With the rapid accumulation of genomic data, it has become a challenge issue to annotate and interpret these data. As a representative, Gene set enrichment analysis has been widely used to interpret large molecular datasets generated by biological experiments. The result of gene set enrichment analysis heavily relies on the quality and integrity of gene set annotations. Although several methods were developed to annotate gene sets, there is still a lack of high quality annotation methods. Here, we propose a novel method to improve the annotation accuracy through combining the GO structure and gene expression data.

Results

We propose a novel approach for optimizing gene set annotations to get more accurate annotation results. The proposed method filters the inconsistent annotations using GO structure information and probabilistic gene set clusters calculated by a range of cluster sizes over multiple bootstrap resampled datasets. The proposed method is employed to analyze p53 cell lines, colon cancer and breast cancer gene expression data. The experimental results show that the proposed method can filter a number of annotations unrelated to experimental data and increase gene set enrichment power and decrease the inconsistent of annotations.

Conclusions

A novel gene set annotation optimization approach is proposed to improve the quality of gene annotations. Experimental results indicate that the proposed method effectively improves gene set annotation quality based on the GO structure and gene expression data.
  相似文献   

15.
Tang J  Guo S  Sun Q  Deng Y  Zhou D 《BMC genomics》2010,11(Z2):S9

Background

Ultrasound imaging technology has wide applications in cattle reproduction and has been used to monitor individual follicles and determine the patterns of follicular development. However, the speckles in ultrasound images affect the post-processing, such as follicle segmentation and finally affect the measurement of the follicles. In order to reduce the effect of speckles, a bilateral filter is developed in this paper.

Results

We develop a new bilateral filter for speckle reduction in ultrasound images for follicle segmentation and measurement. Different from the previous bilateral filters, the proposed bilateral filter uses normalized difference in the computation of the Gaussian intensity difference. We also present the results of follicle segmentation after speckle reduction. Experimental results on both synthetic images and real ultrasound images demonstrate the effectiveness of the proposed filter.

Conclusions

Compared with the previous bilateral filters, the proposed bilateral filter can reduce speckles in both high-intensity regions and low intensity regions in ultrasound images. The segmentation of the follicles in the speckle reduced images by the proposed method has higher performance than the segmentation in the original ultrasound image, and the images filtered by Gaussian filter, the conventional bilateral filter respectively.
  相似文献   

16.

Background

Modern approaches to treating genetic disorders, cancers and even epidemics rely on a detailed understanding of the underlying gene signaling network. Previous work has used time series microarray data to infer gene signaling networks given a large number of accurate time series samples. Microarray data available for many biological experiments is limited to a small number of arrays with little or no time series guarantees. When several samples are averaged to examine differences in mean value between a diseased and normal state, information from individual samples that could indicate a gene relationship can be lost.

Results

Asynchronous Inference of Regulatory Networks (AIRnet) provides gene signaling network inference using more practical assumptions about the microarray data. By learning correlation patterns for the changes in microarray values from all pairs of samples, accurate network reconstructions can be performed with data that is normally available in microarray experiments.

Conclusions

By focussing on the changes between microarray samples, instead of absolute values, increased information can be gleaned from expression data.
  相似文献   

17.

Background

Randomised controlled trials (RCTs) are perceived as the gold-standard method for evaluating healthcare interventions, and increasingly include quality of life (QoL) measures. The observed results are susceptible to bias if a substantial proportion of outcome data are missing. The review aimed to determine whether imputation was used to deal with missing QoL outcomes.

Methods

A random selection of 285 RCTs published during 2005/6 in the British Medical Journal, Lancet, New England Journal of Medicine and Journal of American Medical Association were identified.

Results

QoL outcomes were reported in 61 (21%) trials. Six (10%) reported having no missing data, 20 (33%) reported ≤ 10% missing, eleven (18%) 11%–20% missing, and eleven (18%) reported >20% missing. Missingness was unclear in 13 (21%). Missing data were imputed in 19 (31%) of the 61 trials. Imputation was part of the primary analysis in 13 trials, but a sensitivity analysis in six. Last value carried forward was used in 12 trials and multiple imputation in two. Following imputation, the most common analysis method was analysis of covariance (10 trials).

Conclusion

The majority of studies did not impute missing data and carried out a complete-case analysis. For those studies that did impute missing data, researchers tended to prefer simpler methods of imputation, despite more sophisticated methods being available.
  相似文献   

18.
19.

Background

Superpositioning is an important problem in structural biology. Determining an optimal superposition requires a one-to-one correspondence between the atoms of two proteins structures. However, in practice, some atoms are missing from their original structures. Current superposition implementations address the missing data crudely by ignoring such atoms from their structures.

Results

In this paper, we propose an effective method for superpositioning pairwise and multiple structures without sequence alignment. It is a two-stage procedure including data reduction and data registration.

Conclusions

Numerical experiments demonstrated that our method is effective and efficient. The code package of protein structure superposition method for addressing the cases with missing data is implemented by MATLAB, and it is freely available from: http://sourceforge.net/projects/pssm123/files/?source=navbar
  相似文献   

20.
设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号