Similar articles
20 similar articles found
1.

Background

Single-cell RNA sequencing (scRNA-seq) technology provides an effective way to study cell heterogeneity. However, due to the low capture efficiency and stochastic gene expression, scRNA-seq data often contain a high percentage of missing values. It has been shown that the missing rate can reach approximately 30% even after noise reduction. To accurately recover missing values in scRNA-seq data, we need to know where the missing data are, how much data is missing, and what the values of these data are.

Methods

To solve these three problems, we propose a novel model with a hybrid machine learning method, namely, missing imputation for single-cell RNA-seq (MISC). To solve the first problem, we transformed it to a binary classification problem on the RNA-seq expression matrix. Then, for the second problem, we searched for the intersection of the classification results, zero-inflated model and false negative model results. Finally, we used the regression model to recover the data in the missing elements.

Results

We compared the raw data without imputation, the mean-smooth neighbor cell trajectory, and MISC on chronic myeloid leukemia (CML) data and on cells from the primary somatosensory cortex and the hippocampal CA1 region of the mouse brain. On the CML data, MISC discovered a trajectory branch from the CP-CML to the BC-CML, which provides direct evidence of evolution from CP to BC stem cells. On the mouse brain data, MISC clearly divides the pyramidal CA1 cells into different branches, which is direct evidence of subpopulations within pyramidal CA1. In addition, with MISC, the oligodendrocyte cells became an independent group with an apparent boundary.

Conclusions

Our results showed that the MISC model improved the cell type classification and could be instrumental in studying cellular heterogeneity. Overall, MISC is a robust missing data imputation model for single-cell RNA-seq data.

2.

Background

Gene expression time series data are usually in the form of high-dimensional arrays. Unfortunately, the data may sometimes contain missing values: either the expression values of some genes at some time points, or the entire set of expression values at a single time point or at several consecutive time points. This significantly affects the performance of many algorithms for gene expression analysis that take the complete matrix of gene expression measurements as input. For instance, previous works have shown that gene regulatory interactions can be estimated from the complete matrix of gene expression measurements. Yet, to date, few algorithms have been proposed for the inference of gene regulatory networks from gene expression data with missing values.

Results

We describe a nonlinear dynamic stochastic model for the evolution of gene expression. The model captures the structural, dynamical, and the nonlinear natures of the underlying biomolecular systems. We present point-based Gaussian approximation (PBGA) filters for joint state and parameter estimation of the system with one-step or two-step missing measurements. The PBGA filters use Gaussian approximation and various quadrature rules, such as the unscented transform (UT), the third-degree cubature rule and the central difference rule for computing the related posteriors. The proposed algorithm is evaluated with satisfying results for synthetic networks, in silico networks released as a part of the DREAM project, and the real biological network, the in vivo reverse engineering and modeling assessment (IRMA) network of yeast Saccharomyces cerevisiae.

Conclusion

PBGA filters are proposed to elucidate the underlying gene regulatory network (GRN) from time series gene expression data that contain missing values. In our state-space model, we proposed a measurement model that incorporates the effect of the missing data points into the sequential algorithm. This approach produces a better inference of the model parameters and hence, more accurate prediction of the underlying GRN compared to when using the conventional Gaussian approximation (GA) filters ignoring the missing data points.

3.

Introduction

A common problem in metabolomics data analysis is the existence of a substantial number of missing values, which can complicate, bias, or even prevent certain downstream analyses. One of the most widely-used solutions to this problem is imputation of missing values using a k-nearest neighbors (kNN) algorithm to estimate missing metabolite abundances. kNN implicitly assumes that missing values are uniformly distributed at random in the dataset, but this is typically not true in metabolomics, where many values are missing because they are below the limit of detection of the analytical instrumentation.

Objectives

Here, we explore the impact of nonuniformly distributed missing values (missing not at random, or MNAR) on imputation performance. We present a new model for generating synthetic missing data and a new algorithm, No-Skip kNN (NS-kNN), that accounts for MNAR values to provide more accurate imputations.

Methods

We compare the imputation errors of the original kNN algorithm using two distance metrics, NS-kNN, and a recently developed algorithm KNN-TN, when applied to multiple experimental datasets with different types and levels of missing data.

Results

Our results show that NS-kNN typically outperforms kNN when at least 20–30% of missing values in a dataset are MNAR. NS-kNN also has lower imputation errors than KNN-TN on realistic datasets when at least 50% of missing values are MNAR.

Conclusion

Accounting for the nonuniform distribution of missing values in metabolomics data can significantly improve the results of imputation algorithms. The NS-kNN method imputes missing metabolomics data more accurately than existing kNN-based approaches when used on realistic datasets.
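
The baseline kNN imputation that this abstract builds on can be sketched roughly as follows. This is a minimal illustration of plain kNN imputation only, not of NS-kNN or KNN-TN, whose details the abstract does not give. The function name `knn_impute`, the use of a root-mean-square distance over shared observed columns, and the choice to average the k nearest values are all assumptions made for the example.

```python
import numpy as np

def knn_impute(X, k=2):
    """Fill NaN entries with the mean of the k nearest rows (hypothetical helper).

    Distance between two rows is the root-mean-square difference over the
    columns that both rows have observed. Only rows that have a value in the
    target column can serve as neighbors.
    """
    X = np.asarray(X, dtype=float)
    out = X.copy()
    for i, j in zip(*np.where(np.isnan(X))):
        candidates = []
        for r in range(X.shape[0]):
            if r == i or np.isnan(X[r, j]):
                continue
            shared = ~np.isnan(X[i]) & ~np.isnan(X[r])
            if not shared.any():
                continue
            dist = np.sqrt(np.mean((X[i, shared] - X[r, shared]) ** 2))
            candidates.append((dist, X[r, j]))
        if candidates:
            candidates.sort(key=lambda c: c[0])
            out[i, j] = np.mean([value for _, value in candidates[:k]])
    return out

data = [
    [1.0, 2.0, 3.0],
    [1.0, 2.0, np.nan],   # value to impute
    [10.0, 20.0, 30.0],
    [1.1, 2.1, 3.2],
]
filled = knn_impute(data, k=2)
print(filled[1, 2])
```

Because the neighbors are chosen only from observed values, a value that is missing for being below the detection limit tends to be imputed too high, which is the bias the abstract attributes to MNAR data.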

4.
Multiple imputation (MI) is increasingly popular for handling multivariate missing data. Two general approaches are available in standard computer packages: MI based on the posterior distribution of incomplete variables under a multivariate (joint) model, and fully conditional specification (FCS), which imputes missing values using univariate conditional distributions for each incomplete variable given all the others, cycling iteratively through the univariate imputation models. In the context of longitudinal or clustered data, it is not clear whether these approaches result in consistent estimates of regression coefficient and variance component parameters when the analysis model of interest is a linear mixed effects model (LMM) that includes both random intercepts and slopes and either the covariates alone or both the covariates and the outcome contain missing information. In the current paper, we compare the performance of seven different MI methods for handling missing values in longitudinal and clustered data in the context of fitting LMMs with both random intercepts and slopes. We study the theoretical compatibility between specific imputation models fitted under each of these approaches and the LMM, and also conduct simulation studies in both the longitudinal and clustered data settings. Simulations were motivated by analyses of the association between body mass index (BMI) and quality of life (QoL) in the Longitudinal Study of Australian Children (LSAC). Our findings showed that the relative performance of MI methods varies according to whether the incomplete covariate has fixed or random effects and whether there is missingness in the outcome variable. We showed via simulation that compatible imputation and analysis models resulted in consistent estimation of both regression parameters and variance components. We illustrate our findings with the analysis of LSAC data.

5.
Mixed cultures submitted to acetate "feast" and "famine" cycles are able to store high quantities of polyhydroxybutyrate (PHB) intracellularly. It was demonstrated in a previous study that the intracellular PHB content can be increased up to 78.5% (g HB/gVSS) of cell dry weight in a sequencing batch reactor (SBR) with optimised operating conditions. The specific PHB formation rate was also shown to be higher for mixed cultures than for pure cultures. Such high intracellular PHB contents and specific productivity open new perspectives for the industrial production of polyhydroxyalkanoates (PHA) using mixed cultures instead of pure cultures. The main goal in this work was to develop a mathematical model of mixed cultures with a view to optimising PHB production. A relatively simple two-compartment cell model was developed based on experimental observations and other models proposed in the literature. Suitable experimental planning made it possible to identify the kinetic parameters and yield coefficients. Experiments were performed with and without ammonia limitation, enabling the analysis of PHB formation independently of the cell growth process. The experimental true yields partially confirm the theoretical values proposed in the literature. The final model exhibited high accuracy in describing the process state of most experiments performed, thus opening good perspectives for future model-based optimisation studies.

6.
A mathematical model has been analysed describing uridine uptake in mammalian cells as a tandem process that involves membrane transport and uridine phosphorylation within the cell. The measurement of kinetic parameters of uridine uptake in 3T6 cells showed that the transport system possesses a low affinity for uridine (Kt = 145 μM) and a high velocity (Vt = 10 μM/sec), whereas the phosphorylation system possesses a high affinity for uridine (Ke = 10 μM) and a low velocity (Ve = 0.17 μM/sec). A method of constructing "ideal" curves describing the time dependence of uridine uptake was proposed, which helps to verify the values of the kinetic parameters obtained. On the basis of the theoretical analysis and generalization of experimental data it was concluded that the optimum conditions for measuring uridine transport parameters at 25 °C are a uridine concentration in the medium of 20-200 μM and a cell incubation time of 2-20 sec, while the optimum conditions for measuring uridine phosphorylation parameters are a uridine concentration in the medium of 5-20 μM and a cell incubation longer than 1 minute.
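
To make the contrast between the two steps concrete, the sketch below evaluates each step's rate at a few substrate concentrations. It assumes that both transport and phosphorylation follow simple Michaelis-Menten kinetics, which the abstract suggests through its affinity and velocity constants but does not state outright. The function names are hypothetical, and evaluating both steps at the same concentration is a simplification, since phosphorylation acts on intracellular rather than medium uridine.

```python
def michaelis_menten(s, v_max, k_m):
    """Rate (uM/sec) at substrate concentration s (uM), assuming Michaelis-Menten kinetics."""
    return v_max * s / (k_m + s)

# Constants quoted in the abstract for 3T6 cells.
KT, VT = 145.0, 10.0   # transport: low affinity, high velocity
KE, VE = 10.0, 0.17    # phosphorylation: high affinity, low velocity

def step_rates(s_um):
    """Return (transport rate, phosphorylation rate) at the same concentration (illustrative)."""
    return michaelis_menten(s_um, VT, KT), michaelis_menten(s_um, VE, KE)

for s in (5.0, 20.0, 200.0):
    transport, phosphorylation = step_rates(s)
    print(f"S = {s:6.1f} uM  transport = {transport:.3f}  phosphorylation = {phosphorylation:.3f}")
```

Under these assumptions the phosphorylation step is already near saturation at concentrations where transport is still far from it, which is consistent with the abstract recommending different concentration ranges and incubation times for measuring the two sets of parameters.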

7.
Chen B, Zhou XH. Biometrics, 2011, 67(3): 830-842
Longitudinal studies often feature incomplete response and covariate data. Likelihood-based methods such as the expectation-maximization algorithm give consistent estimators for model parameters when data are missing at random (MAR) provided that the response model and the missing covariate model are correctly specified; however, we do not need to specify the missing data mechanism. An alternative method is the weighted estimating equation, which gives consistent estimators if the missing data and response models are correctly specified; however, we do not need to specify the distribution of the covariates that have missing values. In this article, we develop a doubly robust estimation method for longitudinal data with missing response and missing covariate when data are MAR. This method is appealing in that it can provide consistent estimators if either the missing data model or the missing covariate model is correctly specified. Simulation studies demonstrate that this method performs well in a variety of situations.

8.
We have developed a quadrupole magnetic flow sorter (QMS) to facilitate high-throughput binary cell separation. Optimized QMS operation requires the adjustment of three flow parameters based on the immunomagnetic characteristics of the target cell sample. To overcome the inefficiency of semiempirical operation/optimization of QMS flow parameters, a theoretical model of the QMS sorting process was developed. Application of this model requires measurement of the magnetophoretic mobility distribution of the cell sample by the cell tracking velocimetry (CTV) technique developed in our laboratory. In this work, the theoretical model was experimentally tested using breast carcinoma cells (HCC1954) overexpressing the HER-2/neu gene, and peripheral blood leukocytes (PBLs). The magnetophoretic mobility distribution of immunomagnetically labeled HCC1954 cells was measured using the CTV technique, and then theoretical predictions of sorting recoveries were calculated. Mean magnetophoretic mobilities of (1-3) × 10⁻⁴ mm³/(T·A·s) were obtained depending on the labeling conditions. Labeled HCC1954 cells were mixed with unlabeled PBLs to form a "spiked" sample to be separated by the QMS. Fractional recoveries of cells for different flow parameters were examined and compared with theoretical predictions. Experimental results showed that the theoretical model accurately predicted fractional recoveries of HCC1954 cells. High-throughput (3.29 × 10⁵ cells/s) separations with high recovery (0.89) of HCC1954 cells were achieved.

9.
A Bayesian missing value estimation method for gene expression profile data
MOTIVATION: Gene expression profile analyses have been used in numerous studies covering a broad range of areas in biology. When unreliable measurements are excluded, missing values are introduced in gene expression profiles. Although existing multivariate analysis methods have difficulty with the treatment of missing values, this problem has received little attention. There are many options for dealing with missing values, each of which yields drastically different results. Ignoring missing values is the simplest method and is frequently applied. This approach, however, has its flaws. In this article, we propose an estimation method for missing values, which is based on Bayesian principal component analysis (BPCA). Although the methodology of estimating a probabilistic model and latent variables simultaneously within the framework of Bayes inference is not new in principle, an actual BPCA implementation that makes it possible to estimate arbitrary missing variables is new in terms of statistical methodology. RESULTS: When applied to DNA microarray data from various experimental conditions, the BPCA method exhibited markedly better estimation ability than other recently proposed methods, such as singular value decomposition and K-nearest neighbors. While the estimation performance of existing methods depends on model parameters whose determination is difficult, our BPCA method is free from this difficulty. Accordingly, the BPCA method provides accurate and convenient estimation for missing values. AVAILABILITY: The software is available at http://hawaii.aist-nara.ac.jp/~shige-o/tools/.

10.
A simple theoretical model is hypothesized to describe the steady state behavior of a differentiating cell system as exemplified by blood cells. The cell system consists of several morphologically distinguishable cell classes which develop sequentially. Each cell class except the last one is mitotically capable. Mitosis is assumed to be either heteromorphogenic, homomorphogenic, or asymmetric. Some algebraic equations are derived which are conservation equations describing the flux of cells from one class to another. The theoretical considerations have been applied to some experimental observations in humans concerning neutrophil production, particularly in reference to relative cell numbers and mitotic fractions of the myeloblast, promyelocyte, and myelocyte cell classes. These observations are utilized to help determine the values of the parameters which characterize the model. Among these parameters are the generation times of the various cell classes, and the predicted values of the generation times are found to be in excellent agreement with observed grain-count halving times. However, the predicted mitotic times are in disagreement with their observed values.

11.
Multiple imputation has become a widely accepted technique to deal with the problem of incomplete data. Typically, imputation of missing values and the statistical analysis are performed separately. Therefore, the imputation model has to be consistent with the analysis model. If the data are analyzed with a mixture model, the parameter estimates are usually obtained iteratively. Thus, if the data are missing not at random, parameter estimation and treatment of missingness should be combined. We solve both problems by simultaneously imputing values using the data augmentation method and estimating parameters using the EM algorithm. This iterative procedure ensures that the missing values are properly imputed given the current parameter estimates. Properties of the parameter estimates were investigated in a simulation study. The results are illustrated using data from the National Health and Nutrition Examination Survey.

12.
Summary. The present article deals with informative missing (IM) exposure data in matched case–control studies. When the missingness mechanism depends on the unobserved exposure values, modeling the missing data mechanism is inevitable. Therefore, a full likelihood-based approach for handling IM data has been proposed by positing a model for selection probability, and a parametric model for the partially missing exposure variable among the control population along with a disease risk model. We develop an EM algorithm to estimate the model parameters. Three special cases are discussed in detail: (a) binary exposure variable, (b) normally distributed exposure variable, and (c) lognormally distributed exposure variable. The method is illustrated by analyzing a real matched case–control data set with a missing exposure variable. The performance of the proposed method is evaluated through simulation studies, and the robustness of the proposed method to violation of different types of model assumptions has been considered.

13.
The present article deals with informative missing (IM) exposure data in matched case-control studies. When the missingness mechanism depends on the unobserved exposure values, modeling the missing data mechanism is inevitable. Therefore, a full likelihood-based approach for handling IM data has been proposed by positing a model for selection probability, and a parametric model for the partially missing exposure variable among the control population along with a disease risk model. We develop an EM algorithm to estimate the model parameters. Three special cases are discussed in detail: (a) binary exposure variable, (b) normally distributed exposure variable, and (c) lognormally distributed exposure variable. The method is illustrated by analyzing a real matched case-control data set with a missing exposure variable. The performance of the proposed method is evaluated through simulation studies, and the robustness of the proposed method to violation of different types of model assumptions has been considered.

14.
Phenomenological parameters from a mathematical model of cell motility are used to quantitatively characterize chemosensory migration responses of rat alveolar macrophages migrating to C5a in the linear under-agarose assay, simultaneously at the levels of both single cells and cell populations. This model provides theoretical relationships between single-cell and cell-population motility parameters. Our experiments offer a critical test of these theoretical linking relationships, by comparison of results obtained at the cell population level to results obtained at the single-cell level. Random motility of a cell population is characterized by the random motility coefficient, μ (analogous to a particle diffusion coefficient), whereas single-cell random motility is described by cell speed, s, and persistence time, P (related to the period of time that a cell moves in one direction before changing direction). Population chemotaxis is quantified by the chemotactic sensitivity, χ₀, which provides a measure of the minimum attractant gradient necessary to elicit a specified chemotactic response. Single-cell chemotaxis is characterized by the chemotactic index, CI, which ranges from 0 for purely random motility to 1 for perfectly directed motility. Measurements of cell number versus migration distance were analyzed in conjunction with the phenomenological model to determine the population parameters while paths of individual cells in the same experiment were analyzed in order to determine the single-cell parameters. The parameter μ shows a biphasic dependence on C5a concentration with a maximum of 1.9 × 10⁻⁸ cm²/sec at 10⁻¹¹ M C5a and relative minima of 0.86 × 10⁻⁸ cm²/sec at 10⁻⁷ M C5a and 1.1 × 10⁻⁸ cm²/sec in the absence of C5a; s and P remain fairly constant with C5a concentration, with s ranging from 2.1 to 2.5 μm/min and P varying from 22 to 32 min. χ₀ is equal to 1.0 × 10⁻⁶ cm/receptor for all C5a concentrations tested, corresponding to 60% correct orientation for a difference of 500 bound C5a receptors across a 20 μm cell length. The maximum CI measured was 0.2. Values for the population parameters μ and χ₀ were calculated from single-cell parameter values using the aforementioned theoretical linking relationships. The values of μ and χ₀ calculated from single-cell parameters agreed with values of μ and χ₀ determined independently from population migrations, over the full range of C5a concentrations, confirming the validity of the linking equations. Experimental confirmation of such relationships between single-cell and cell-population parameters has not previously been reported.
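
The abstract does not write out its linking equations. As a rough plausibility check, the sketch below uses the relation commonly quoted for a persistent random walk in two dimensions, in which the random motility coefficient (mu) equals s²P/2. That formula, the function name, and the unit conversions are assumptions made for illustration, not the authors' stated method.

```python
def random_motility_coefficient(speed_um_per_min, persistence_min):
    """Return mu in cm^2/s, assuming mu = s^2 * P / 2 (2-D persistent random walk)."""
    speed_cm_per_s = speed_um_per_min * 1e-4 / 60.0  # um/min -> cm/s
    persistence_s = persistence_min * 60.0           # min -> s
    return speed_cm_per_s ** 2 * persistence_s / 2.0

# Extremes of the single-cell ranges reported in the abstract.
mu_low = random_motility_coefficient(2.1, 22.0)
mu_high = random_motility_coefficient(2.5, 32.0)
print(f"{mu_low:.2e} to {mu_high:.2e} cm^2/s")
```

Under this assumed relation, the reported single-cell speeds and persistence times give values on the order of 10⁻⁸ cm²/s, the same order of magnitude as the population-level coefficients quoted in the abstract.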

15.
MOTIVATION: Microarray data are used in a range of application areas in biology, although they often contain considerable numbers of missing values. These missing values can significantly affect subsequent statistical analysis and machine learning algorithms, so there is a strong motivation to estimate these values as accurately as possible before using these algorithms. While many imputation algorithms have been proposed, more robust techniques need to be developed so that further analysis of biological data can be accurately undertaken. In this paper, an innovative missing value imputation algorithm called collateral missing value estimation (CMVE) is presented which uses multiple covariance-based imputation matrices for the final prediction of missing values. The matrices are computed and optimized using least square regression and linear programming methods. RESULTS: The new CMVE algorithm has been compared with existing estimation techniques including Bayesian principal component analysis imputation (BPCA), least square impute (LSImpute) and K-nearest neighbour (KNN). All these methods were rigorously tested to estimate missing values in three separate non-time series (ovarian cancer based) datasets and one time series (yeast sporulation) dataset. Each method was quantitatively analyzed using the normalized root mean square (NRMS) error measure, covering a wide range of randomly introduced missing value probabilities from 0.01 to 0.2. Experiments were also undertaken on the yeast dataset, which comprised 1.7% actual missing values, to test the hypothesis that CMVE performed better not only for randomly occurring but also for a real distribution of missing values. The results confirmed that CMVE consistently demonstrated superior and robust estimation capability of missing values compared with other methods for both series types of data, for the same order of computational complexity. A concise theoretical framework has also been formulated to validate the improved performance of the CMVE algorithm. AVAILABILITY: The CMVE software is available upon request from the authors.
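
The abstract names the NRMS error but does not give its formula. One common definition, assumed here, divides the root of the summed squared differences between estimated and true values by the root of the summed squared true values. The function name `nrms_error` and the toy matrices are hypothetical.

```python
import numpy as np

def nrms_error(true_values, estimated_values):
    """Normalized root mean square error: ||est - true|| / ||true|| (assumed definition)."""
    true_values = np.asarray(true_values, dtype=float)
    estimated_values = np.asarray(estimated_values, dtype=float)
    return np.linalg.norm(estimated_values - true_values) / np.linalg.norm(true_values)

truth = np.array([[3.0, 4.0], [0.0, 0.0]])
perfect = truth.copy()
all_zero = np.zeros_like(truth)

print(nrms_error(truth, perfect))   # identical estimate
print(nrms_error(truth, all_zero))  # estimate that ignores the data entirely
```

With this definition a perfect imputation scores 0 and an all-zero estimate scores 1, so lower values indicate better recovery of the entries that were masked out.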

16.
Analysis of nucleated cell size in a minicell-producing strain of Escherichia coli and in its parental strain shows that the two distributions are considerably different. A model is proposed to account for this difference. The model states that: (i) in the mutant population, the cell poles are available as potential division sites in addition to the normally located division sites; (ii) the probability of a division occurring at any of the potential division sites is equal; and (iii) only enough "division factor" arises at each unit cell doubling to permit a single division. This factor is utilized entirely in the formation of a single septum. Thus, the occurrence of a polar division with the production of an anucleate minicell (which occurs only in the mutant strain) prevents the occurrence of a non-polar division, with the result that the average nucleated cell length is increased in minicell-producing strains. The model has been used to construct a theoretical population, and a number of parameters of the real and theoretical populations have been compared. The two populations are very similar in all of the parameters measured.

17.
The dynamics of the size of a population of proliferating cells under a periodic phase-specific cytotoxic effect, with and without blocking action, was studied on the basis of a mathematical model. It has been shown that, for realistic values of the model parameters, the population size depends exponentially on time after the effects begin. The dependence of the population dynamics on integral parameters of the cell cycle and the exposure regime was studied. It has been shown that at certain periods a resonance decrease in the damage to the population cells must be observed. The values of the periods corresponding to the resonance decrease in damage are determined mainly by the mean duration of the cell cycle and the time of the blocking action; when the blocking action is short, they are approximately multiples of the average cell cycle time. The theoretical predictions are confirmed in experiments determining the relationship between the damage to the small intestine epithelium and mouse survival and the period of repeated periodic injections of the S-phase-specific cytotoxic agent hydroxyurea. A distinct resonance increase in mouse survival and decreased damage to the epithelium were observed under injections of hydroxyurea with periods near the mean and the doubled mean cell cycle time of crypt enterocytes. The results obtained not only support the correctness of the theoretical predictions, but also make it possible to estimate the parameters of the stem cell cycle of mouse small intestine epithelium. They also show that this approach can be used for reducing the aftereffects of chemotherapy with phase-specific agents.

18.
Longitudinal studies, in which individuals are measured repeatedly in time, are often incomplete. We model continuous-time longitudinal data from the Multicenter AIDS Cohort Study using a diffusion model in which the diffusion parameters are functions of the covariates. These data are jointly modeled with the process of time-to-death due to AIDS. We show that, even for large data sets with a large number of missing variables, a Bayesian analysis is feasible using Gibbs sampling and compare a complete case analysis with a Bayesian treatment of missing values.

19.

Background

The small intestinal epithelium is a dynamic system with specialized cell types. The various cell populations of this tissue are continually renewed and replenished from stem cells that reside in the small intestinal crypt. The cell types and their locations in the crypt and villus are well known, but the details of the kinetics of stem cell division, and precursor cell proliferation and differentiation into mature enterocytes and secretory cells are still being studied. These proliferation and differentiation events have been extensively modeled with a variety of computational approaches in the past.

Methods

A compartmental population kinetics model, incorporating experimentally measured proliferation rates for various intestinal epithelial cell types, is implemented for a previously reported scheme for the intestinal cell dynamics. A sensitivity analysis is performed to determine the effect that varying the model parameters has upon the model outputs, the steady-state cell populations.

Results

The model is unable to reproduce the experimentally known timescale of renewal of the intestinal epithelium if literature values for the proliferation rates of stem cells and transit amplifying cells are employed. Unphysically large rates of proliferation result when these parameters are allowed to vary to reproduce this timescale and the steady-state populations of terminally differentiated intestinal epithelial cells. Sensitivity analysis reveals that the strongest contributor to the steady-state populations is the transit amplifying cell proliferation rate when literature values are used, but that the differentiation rate of transit amplifying cells to secretory progenitor cells dominates when all parameters are allowed to vary.

Conclusions

A compartmental population kinetics model of proliferation and differentiation of cells of the intestinal epithelium can provide a simplifying means of understanding a complicated multistep process. However, when literature values for proliferation rates of the crypt based columnar and transit amplifying cell populations are employed in the model, it cannot reproduce the experimentally known timescale of intestinal epithelial renewal. Nevertheless, it remains a valuable conceptual tool, and its sensitivity analysis provides important clues for which events in the process are the most important in controlling the steady-state populations of specialized intestinal epithelial cells.

20.
The birefringence of tropomyosin crystals was measured in the temperature range 5-35 °C. The experimental results are compared with a simple model calculation based on the theory developed by Wiener for the optical properties of colloidal systems. The difference between experimental and theoretical values is less than 15%, which denotes a good agreement given the simplicity of the model. A value of 0.011 was obtained for the intrinsic birefringence of the tropomyosin molecule. The temperature dependence of the crystal birefringence could be accounted for in part by a change of the unit cell parameters; this change was experimentally observed by others in x-ray diffraction experiments.
