Similar literature
 Found 20 similar articles (search time: 31 ms)
1.
Gan X  Liew AW  Yan H 《Nucleic acids research》2006,34(5):1608-1619
Gene expression data measured using microarrays usually suffer from the missing value problem. However, many data analysis methods require a complete data matrix. Although existing missing value imputation algorithms have shown good performance in dealing with missing values, they also have limitations. For example, some algorithms perform well only when strong local correlation exists in the data, while others provide the best estimate when the data are dominated by global structure. In addition, these algorithms do not take into account any biological constraints in their imputation. In this paper, we propose a set theoretic framework based on projection onto convex sets (POCS) for missing data imputation. POCS allows us to incorporate different types of a priori knowledge about missing values into the estimation process. The main idea of POCS is to formulate every piece of prior knowledge as a corresponding convex set and then use a convergence-guaranteed iterative procedure to obtain a solution in the intersection of all these sets. In this work, we design several convex sets, taking into consideration the biological characteristics of the data: the first set mainly exploits the local correlation structure among genes in microarray data, while the second set captures the global correlation structure among arrays. The third set (actually a series of sets) exploits the biological phenomenon of synchronization loss in microarray experiments. In cyclic systems, synchronization loss is a common phenomenon and we construct a series of sets based on this phenomenon for our POCS imputation algorithm. Experiments show that our algorithm can achieve a significant reduction of error compared to the KNNimpute, SVDimpute and LSimpute methods.
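
The iterative procedure described in this abstract can be illustrated with a minimal sketch. The two convex sets used here (a box and a hyperplane) are hypothetical stand-ins chosen for simplicity; they are not the gene-, array-, or synchronization-based sets designed in the paper, and all function names are assumptions.

```python
# Illustrative POCS sketch: cycle through projections onto convex sets.
# The sets below (a box and a hyperplane) are hypothetical examples only.

def project_box(x, lo, hi):
    """Project x onto the box {v : lo <= v_i <= hi} by clipping."""
    return [min(max(v, lo), hi) for v in x]

def project_hyperplane(x, a, b):
    """Project x onto the hyperplane {v : a . v = b}."""
    dot = sum(ai * xi for ai, xi in zip(a, x))
    norm2 = sum(ai * ai for ai in a)
    step = (dot - b) / norm2
    return [xi - step * ai for ai, xi in zip(a, x)]

def pocs(x0, a, b, lo, hi, iters=200):
    """Alternate the two projections for a fixed number of iterations."""
    x = list(x0)
    for _ in range(iters):
        x = project_hyperplane(x, a, b)
        x = project_box(x, lo, hi)
    return x

# Find a point with entries in [0, 1] whose entries sum to 1.5.
x = pocs([5.0, -3.0, 0.5], a=[1.0, 1.0, 1.0], b=1.5, lo=0.0, hi=1.0)
```

Each projection encodes one piece of prior knowledge; when the intersection of the sets is non-empty, the alternating scheme approaches a point that satisfies all of them.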

2.
Microarray experiments generate data sets with information on the expression levels of thousands of genes in a set of biological samples. Unfortunately, such experiments often produce multiple missing expression values, normally due to various experimental problems. As many algorithms for gene expression analysis require a complete data matrix as input, the missing values have to be estimated in order to analyze the available data. Alternatively, genes and arrays can be removed until no missing values remain. However, for genes or arrays with only a small number of missing values, it is desirable to impute those values. For the subsequent analysis to be as informative as possible, it is essential that the estimates for the missing gene expression values are accurate. A small number of badly estimated missing values in the data might be enough for clustering methods, such as hierarchical clustering or K-means clustering, to produce misleading results. Thus, accurate methods for missing value estimation are needed. We present novel methods for estimation of missing values in microarray data sets that are based on the least squares principle, and that utilize correlations between both genes and arrays. For this set of methods, we use the common reference name LSimpute. We compare the estimation accuracy of our methods with the widely used KNNimpute on three complete data matrices from public data sets by randomly knocking out data (labeling as missing). From these tests, we conclude that our LSimpute methods produce estimates that are consistently more accurate than those obtained using KNNimpute. Additionally, we examine a more classic approach to missing value estimation based on expectation maximization (EM). We refer to our EM implementations as EMimpute, and the estimation errors of the EMimpute methods are compared with those of our novel methods. The results indicate that on average, the estimates from our best performing LSimpute method are at least as accurate as those from the best EMimpute algorithm.
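
The least squares principle mentioned above can be sketched as follows. This is a deliberately simplified illustration, assuming a single missing entry and using only the one most correlated gene as the regressor; the LSimpute methods in the abstract combine gene- and array-based estimates, which this sketch does not attempt. The function names and toy data are hypothetical.

```python
# Sketch of least-squares imputation for one missing entry (marked None).
# Simplification: regress the target gene on its single best-correlated gene.

def mean(values):
    return sum(values) / len(values)

def ls_impute(target, candidates, j):
    """Estimate target[j] from the candidate row best correlated with target."""
    observed = [k for k, v in enumerate(target) if v is not None]
    y = [target[k] for k in observed]
    my = mean(y)
    syy = sum((b - my) ** 2 for b in y)
    best = None
    for row in candidates:
        x = [row[k] for k in observed]
        mx = mean(x)
        sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sxx = sum((a - mx) ** 2 for a in x)
        if sxx == 0 or syy == 0:
            continue  # constant profile: correlation undefined
        r2 = sxy * sxy / (sxx * syy)  # squared Pearson correlation
        if best is None or r2 > best[0]:
            slope = sxy / sxx
            best = (r2, slope, my - slope * mx, row)
    _, slope, intercept, row = best
    return intercept + slope * row[j]

genes = [[1.0, 2.0, 3.0, 4.0], [4.0, 1.0, 3.0, 2.0]]
target = [2.1, 4.0, None, 8.1]
estimate = ls_impute(target, genes, 2)
```

In this toy data the target gene is roughly twice the first candidate gene, so the estimate for the missing entry lands close to 6.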

3.
4.
5.
Missing value estimation methods for DNA microarrays   Cited by: 39 (self-citations: 0; citations by others: 39)
MOTIVATION: Gene expression microarray experiments can generate data sets with multiple missing expression values. Unfortunately, many algorithms for gene expression analysis require a complete matrix of gene array values as input. For example, methods such as hierarchical clustering and K-means clustering are not robust to missing data, and may lose effectiveness even with a few missing values. Methods for imputing missing data are needed, therefore, to minimize the effect of incomplete data sets on analyses, and to increase the range of data sets to which these algorithms can be applied. In this report, we investigate automated methods for estimating missing data. RESULTS: We present a comparative study of several methods for the estimation of missing values in gene microarray data. We implemented and evaluated three methods: a Singular Value Decomposition (SVD) based method (SVDimpute), weighted K-nearest neighbors (KNNimpute), and row average. We evaluated the methods using a variety of parameter settings and over different real data sets, and assessed the robustness of the imputation methods to the amount of missing data over the range of 1-20% missing values. We show that KNNimpute appears to provide a more robust and sensitive method for missing value estimation than SVDimpute, and both SVDimpute and KNNimpute surpass the commonly used row average method (as well as filling missing values with zeros). We report results of the comparative experiments and provide recommendations and tools for accurate estimation of missing microarray data under a variety of conditions.
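
A weighted K-nearest-neighbors estimate of the kind named in this abstract might look like the sketch below. The distance measure (root mean squared difference over shared columns) and the inverse-distance weighting are assumptions made for illustration, not details confirmed by the abstract; `knn_impute` and the toy matrix are hypothetical.

```python
import math

# Sketch of weighted K-nearest-neighbor imputation; None marks a missing value.
def knn_impute(matrix, i, j, k=2):
    """Estimate matrix[i][j] from the k nearest rows that have column j observed."""
    target = matrix[i]
    scored = []
    for r, row in enumerate(matrix):
        if r == i or row[j] is None:
            continue
        shared = [c for c in range(len(row))
                  if c != j and row[c] is not None and target[c] is not None]
        if not shared:
            continue
        # Root mean squared difference over columns observed in both rows.
        dist = math.sqrt(sum((row[c] - target[c]) ** 2 for c in shared) / len(shared))
        scored.append((dist, row[j]))
    scored.sort(key=lambda pair: pair[0])
    nearest = scored[:k]
    # Inverse-distance weights; the small epsilon avoids division by zero.
    weights = [1.0 / (d + 1e-9) for d, _ in nearest]
    return sum(w * v for w, (_, v) in zip(weights, nearest)) / sum(weights)

data = [
    [1.0, 2.0, None],
    [1.1, 2.1, 3.0],
    [0.9, 1.9, 2.8],
    [5.0, 6.0, 9.0],
]
estimate = knn_impute(data, 0, 2, k=2)
```

The two rows closest to row 0 are equally distant here, so the estimate is approximately the average of their values in the missing column; the distant fourth row is ignored.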

6.
This study compared the molecular lipidomic profile of LDL in patients with nondiabetic advanced renal disease and no evidence of CVD to that of age-matched controls, with the hypothesis that it would reveal proatherogenic lipid alterations. LDL was isolated from 10 normocholesterolemic patients with stage 4/5 renal disease and 10 controls, and lipids were analyzed by accurate mass LC/MS. Top-down lipidomics analysis and manual examination of the data identified 352 lipid species, and automated comparative analysis demonstrated alterations in lipid profile in disease. The total lipid and cholesterol content was unchanged, but levels of triacylglycerides and N-acyltaurines were significantly increased, while phosphatidylcholines, plasmenyl ethanolamines, sulfatides, ceramides, and cholesterol sulfate were significantly decreased in chronic kidney disease (CKD) patients. Chemometric analysis of individual lipid species showed very good discrimination of control and disease samples despite the small cohorts and identified individual unsaturated phospholipids and triglycerides mainly responsible for the discrimination. These findings illustrate the point that although the clinical biochemistry parameters may not appear abnormal, there may be important underlying lipidomic changes that contribute to disease pathology. The lipidomic profile of CKD LDL offers potential for new biomarkers and novel insights into lipid metabolism and cardiovascular risk in this disease.

7.
Fistulifera sp. strain JPCC DA0580 is a newly sequenced pennate diatom that is capable of simultaneously growing and accumulating lipids. This is a unique trait, not found in other related microalgae so far. It is able to accumulate between 40 and 60% of its cell weight in lipids, making it a strong candidate for the production of biofuel. To investigate this characteristic, we used RNA-Seq data gathered at four different times while Fistulifera sp. strain JPCC DA0580 was grown in oil accumulating and non-oil accumulating conditions. We then adapted gene set enrichment analysis (GSEA) to investigate the relationship between the difference in gene expression of 7,822 genes and metabolic functions in our data. We utilized information in the KEGG pathway database to create the gene sets and changed GSEA to use re-sampling so that data from the different time points could be included in the analysis. Our GSEA method identified photosynthesis, lipid synthesis and amino acid synthesis related pathways as processes that play a significant role in oil production and growth in Fistulifera sp. strain JPCC DA0580. In addition to GSEA, we visualized the results by creating a network of compounds and reactions, and plotted the expression data on top of the network. This made existing graph algorithms available to us, which we then used to calculate a path that metabolizes glucose into triacylglycerol (TAG) in the smallest number of steps. By visualizing the data this way, we observed a separate up-regulation of genes at different times instead of a concerted response. We also identified two metabolic paths that used fewer reactions than the one shown in KEGG and showed that the reactions were up-regulated during the experiment. The combination of analysis and visualization methods successfully analyzed time-course data, identified important metabolic pathways and provided new hypotheses for further research.

8.
Autosomal recessive polycystic kidney disease (ARPKD) is a severe, monogenetically inherited kidney and liver disease. PCK rats carrying the orthologous mutant gene serve as a model of human disease, and alterations in lipid profiles in PCK rats suggest that defined subsets of lipids may be useful as molecular disease markers. Whereas MALDI protein imaging mass spectrometry (IMS) has become a promising tool for disease classification, widely applicable workflows that link MALDI lipid imaging and identification as well as structural characterization of candidate disease-classifying marker lipids are lacking. Here, we combine selective MALDI imaging of sulfated kidney lipids and Fisher discriminant analysis (FDA) of imaging data sets for identification of candidate markers of progressive disease in PCK rats. Our study highlights strong increases in lower mass lipids as main classifiers of cystic disease. Structure determination by high-resolution mass spectrometry identifies these altered lipids as taurine-conjugated bile acids. These sulfated lipids are selectively elevated in the PCK rat model but not in models of related hepatorenal fibrocystic diseases, suggesting that they may be molecular markers of the disease and that a combination of MALDI imaging with high-resolution MS methods and Fisher discriminant data analysis may be applicable for lipid marker discovery.

9.
Hopke PK  Liu C  Rubin DB 《Biometrics》2001,57(1):22-33
Many chemical and environmental data sets are complicated by the existence of fully missing values or censored values known to lie below detection thresholds. For example, week-long samples of airborne particulate matter were obtained at Alert, NWT, Canada, between 1980 and 1991, where some of the concentrations of 24 particulate constituents were coarsened in the sense of being either fully missing or below detection limits. To facilitate scientific analysis, it is appealing to create complete data by filling in missing values so that standard complete-data methods can be applied. We briefly review commonly used strategies for handling missing values and focus on the multiple-imputation approach, which generally leads to valid inferences when faced with missing data. Three statistical models are developed for multiply imputing the missing values of airborne particulate matter. We expect that these models are useful for creating multiple imputations in a variety of incomplete multivariate time series data sets.

10.
Two-dimensional SDS-PAGE gel electrophoresis using post-run staining is widely used to measure the abundances of thousands of protein spots simultaneously. Usually, the protein abundances of two or more biological groups are compared using biological and technical replicates. After gel separation and staining, the spots are detected, spot volumes are quantified, and spots are matched across gels. There are almost always many missing values in the resulting data set. The missing values arise either because the corresponding proteins have very low abundances (or are absent) or because of experimental errors such as incomplete/over focusing in the first dimension or varying run times in the second dimension as well as faulty spot detection and matching. In this study, we show that the probability for a spot to be missing can be modeled by a logistic regression function of the logarithm of the volume. Furthermore, we present an algorithm that takes a set of gels with technical and biological replicates as input and estimates the average protein abundances in the biological groups from the number of missing spots and measured volumes of the present spots using a maximum likelihood approach. Confidence intervals for abundances and p-values for differential expression between two groups are calculated using bootstrap sampling. The algorithm is compared to two standard approaches, one that discards missing values and one that sets all missing values to zero. We have evaluated this approach in two different gel data sets of different biological origin. An R program implementing the algorithm is freely available at http://bioinfo.thep.lu.se/MissingValues2Dgels.html.

11.
Alzheimer’s disease is the most common cause of dementia worldwide, affecting the elderly population. It is characterized by the hallmark pathology of amyloid-β deposition, neurofibrillary tangle formation, and extensive neuronal degeneration in the brain. A wealth of data related to Alzheimer’s disease has been generated to date; nevertheless, the molecular mechanism underlying the etiology and pathophysiology of the disease is still unknown. Here we describe a method for the combined analysis of multiple types of genome-wide data aimed at revealing convergent evidence that would not be captured by a standard molecular approach. Lists of Alzheimer-related genes (seed genes) were obtained from different sets of data on gene expression, SNPs, and molecular targets of drugs. Network analysis was applied to identify the regions of the human protein-protein interaction network showing a significant enrichment in seed genes, and ultimately, in genes associated with Alzheimer’s disease, due to the cumulative effect of different combinations of the starting data sets. The functional properties of these enriched modules were characterized, effectively considering the role of both Alzheimer-related seed genes and genes that closely interact with them. This approach allowed us to present evidence in favor of one of the competing theories about the processes underlying AD, specifically evidence supporting a predominant role of metabolism-associated biological process terms, including autophagy, insulin and fatty acid metabolic processes in Alzheimer’s disease, with a focus on AMP-activated protein kinase. This central regulator of cellular energy homeostasis regulates a series of brain functions altered in Alzheimer’s disease and could link genetic perturbation with neuronal transmission and energy regulation, representing a potential candidate to be targeted by therapy.

12.

Background

Gene expression time series data are usually in the form of high-dimensional arrays. Unfortunately, the data may sometimes contain missing values: either the expression values of some genes at some time points, or the entire set of expression values at a single time point or at several consecutive time points. This significantly affects the performance of many algorithms for gene expression analysis that take the complete matrix of gene expression measurements as input. For instance, previous works have shown that gene regulatory interactions can be estimated from the complete matrix of gene expression measurements. Yet, to date, few algorithms have been proposed for the inference of gene regulatory networks from gene expression data with missing values.

Results

We describe a nonlinear dynamic stochastic model for the evolution of gene expression. The model captures the structural, dynamical, and the nonlinear natures of the underlying biomolecular systems. We present point-based Gaussian approximation (PBGA) filters for joint state and parameter estimation of the system with one-step or two-step missing measurements. The PBGA filters use Gaussian approximation and various quadrature rules, such as the unscented transform (UT), the third-degree cubature rule and the central difference rule for computing the related posteriors. The proposed algorithm is evaluated with satisfying results for synthetic networks, in silico networks released as a part of the DREAM project, and the real biological network, the in vivo reverse engineering and modeling assessment (IRMA) network of yeast Saccharomyces cerevisiae.

Conclusion

PBGA filters are proposed to elucidate the underlying gene regulatory network (GRN) from time series gene expression data that contain missing values. In our state-space model, we proposed a measurement model that incorporates the effect of the missing data points into the sequential algorithm. This approach produces a better inference of the model parameters and hence, more accurate prediction of the underlying GRN compared to when using the conventional Gaussian approximation (GA) filters ignoring the missing data points.

13.
We focus on the problem of generalizing a causal effect estimated on a randomized controlled trial (RCT) to a target population described by a set of covariates from observational data. Available methods such as inverse propensity sampling weighting are not designed to handle missing values, which are however common in both data sources. In addition to coupling the assumptions for causal effect identifiability and for the missing values mechanism and to defining appropriate estimation strategies, one difficulty is to consider the specific structure of the data with two sources and treatment and outcome only available in the RCT. We propose three multiple imputation strategies to handle missing values when generalizing treatment effects, each handling the multisource structure of the problem differently (separate imputation, joint imputation with fixed effect, joint imputation ignoring source information). As an alternative to multiple imputation, we also propose a direct estimation approach that treats incomplete covariates as semidiscrete variables. The multiple imputation strategies and the latter alternative rely on different sets of assumptions concerning the impact of missing values on identifiability. We discuss these assumptions and assess the methods through an extensive simulation study. This work is motivated by the analysis of a large registry of over 20,000 major trauma patients and an RCT studying the effect of tranexamic acid administration on mortality in major trauma patients admitted to intensive care units. The analysis illustrates how the missing values handling can impact the conclusion about the effect generalized from the RCT to the target population.

14.
MOTIVATION: Clustering techniques are used to find groups of genes that show similar expression patterns under multiple experimental conditions. Nonetheless, the results obtained by cluster analysis are influenced by the existence of missing values that commonly arise in microarray experiments. Because a clustering method requires a complete data matrix as an input, previous studies have estimated the missing values using an imputation method in the preprocessing step of clustering. However, a common limitation of these conventional approaches is that once the estimates of missing values are fixed in the preprocessing step, they are not changed during subsequent processes of clustering; badly estimated missing values obtained in data preprocessing are likely to deteriorate the quality and reliability of clustering results. Thus, a new clustering method is required for improving missing values during the iterative clustering process. RESULTS: We present a method for Clustering Incomplete data using Alternating Optimization (CIAO) in which a prior imputation method is not required. To reduce the influence of imputation in preprocessing, we take an alternating optimization approach to find better estimates during the iterative clustering process. This method improves the estimates of missing values by exploiting the cluster information such as cluster centroids and all available non-missing values in each iteration. To test the performance of the CIAO, we applied the CIAO and conventional imputation-based clustering methods, e.g. k-means based on KNNimpute, for clustering two yeast incomplete data sets, and compared the clustering result of each method using the Saccharomyces Genome Database annotations. The clustering results of the CIAO method are more significantly relevant to the biological gene annotations than those of other methods, indicating its effectiveness and potential for clustering incomplete gene expression data. AVAILABILITY: The software was developed in Java and can be executed on any platform that runs a JVM (Java Virtual Machine). It is available from the authors upon request.
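
The general idea of refreshing missing-value estimates inside the clustering loop can be sketched with a k-means-style alternation. This is not the CIAO algorithm itself, whose update rules are not given in the abstract; the column-mean initialization, the centroid-based refresh, and all names below are assumptions made for illustration.

```python
# Sketch: k-means-style clustering that re-estimates missing entries (None)
# from the assigned cluster centroid in every iteration.

def kmeans_incomplete(rows, centroids, iters=10):
    """Alternate between assignment, centroid update, and missing-value refresh."""
    ncol = len(rows[0])
    missing = [[v is None for v in row] for row in rows]
    # Initial fill: mean of the observed values in each column.
    col_mean = []
    for c in range(ncol):
        obs = [row[c] for row in rows if row[c] is not None]
        col_mean.append(sum(obs) / len(obs))
    filled = [[col_mean[c] if v is None else v for c, v in enumerate(row)]
              for row in rows]
    centroids = [list(cen) for cen in centroids]
    labels = [0] * len(rows)
    for _ in range(iters):
        # Step 1: assign each row to its nearest centroid.
        for r, row in enumerate(filled):
            dists = [sum((a - b) ** 2 for a, b in zip(row, cen)) for cen in centroids]
            labels[r] = dists.index(min(dists))
        # Step 2: recompute each centroid from its current members.
        for k in range(len(centroids)):
            members = [filled[r] for r in range(len(rows)) if labels[r] == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
        # Step 3: overwrite missing entries with the assigned centroid's value.
        for r, row in enumerate(filled):
            for c in range(ncol):
                if missing[r][c]:
                    row[c] = centroids[labels[r]][c]
    return labels, filled

rows = [[1.0, 1.0], [1.2, None], [5.0, 5.0], [None, 5.2]]
labels, filled = kmeans_incomplete(rows, centroids=[[0.0, 0.0], [6.0, 6.0]])
```

In this toy run the initial column-mean fill for the second row is poor, but it is pulled toward the observed value of its cluster mate as the iterations proceed, which is the behavior a fixed preprocessing imputation cannot provide.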

15.
Synthetic lipids with a nitroxide or a fluorescent probe have been extensively used during the last 30 years to determine the transmembrane diffusion of phospholipids in artificial or biological membranes. However, the relevance of data obtained with these modified lipids has sometimes been questioned. Besides possible artefacts introduced by the reporter probe, synthetic lipids used in cells often contain a short fatty acid chain in the sn-2 position, which gives them higher water solubility than naturally occurring lipids. In the present review, we have attempted to give a critical appraisal. Main strategies are recalled and important discoveries obtained with lipid probes on transmembrane lipid traffic in eukaryotic cells are briefly summarized. Examples of artefacts caused by lipid probes are given. Comparisons between data obtained by different techniques such as ESR and fluorescence allow us to emphasize the complementary character of the two approaches and more generally show the necessity to use several probes before drawing conclusions concerning endogenous lipids. In spite of these pitfalls, overall, lipid probes have provided a wealth of useful information that, to date, cannot be obtained with unlabeled lipids.

16.
Dairy fat contains high amounts of saturated fatty acids (FA), which are associated with cardiovascular disease (CVD) risk. Manipulating the nutrition of dairy cows can decrease the saturated FA content of milk fat, and is associated with increases either in conjugated linoleic acid (CLA) and trans-11-C18:1 contents, or in trans-10-C18:1 content. CLA putatively exhibits beneficial properties on CVD risk, whereas trans FA are suspected to be detrimental. The present study compared the effects of a trans-10-C18:1-rich butter (T10 butter), a trans-11-C18:1+CLA-rich butter (T11-CLA butter) and a standard butter (S butter) on lipid parameters linked to CVD risk and fatty streaks. Thirty-six White New Zealand rabbits were fed one of the three butters (12% of the diet, plus 0.2% cholesterol) for 6 (experiment 1) or 12 (experiment 2) weeks. Liver lipids, plasma lipids and lipoprotein concentrations (experiments 1 and 2) and aortic lipid deposition (experiment 2) were determined. The T10 butter increased VLDL-cholesterol compared with the two others, and total and LDL-cholesterol compared with the T11-CLA butter (P < 0.05). The T10 butter also increased the non-HDL/HDL ratio and aortic lipid deposition compared with the T11-CLA butter (P < 0.05). The T11-CLA butter non-significantly reduced aortic lipid deposition compared with the S butter, and decreased HDL-cholesterol and increased liver triacylglycerols compared with the two other butters (P < 0.05). These results suggest that, compared with the S butter, the T10 butter had detrimental effects on plasma lipid and lipoprotein metabolism in rabbits, whereas the T11-CLA butter was neutral or tended to reduce the aortic lipid deposition.

17.
18.
This paper investigates the utility of the Lomb-Scargle periodogram for the analysis of biological rhythms. This method is particularly suited to detect periodic components in unequally sampled time-series and data sets with missing values, but restricts all calculations to actually measured values. The Lomb-Scargle method was tested on both real and simulated time-series with even and uneven sampling, and compared to a standard method in biomedical rhythm research, the Chi-square periodogram. Results indicate that the Lomb-Scargle algorithm shows a clearly better detection efficiency and accuracy in the presence of noise, and avoids possible bias or erroneous results that may arise from replacement of missing data by interpolation techniques. Hence, the Lomb-Scargle periodogram may serve as a useful method for the study of biological rhythms, especially when applied to telemetrical or observational time-series obtained from free-living animals, i.e., data sets that notoriously lack points.
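
For readers unfamiliar with the method, the following is a minimal sketch of the classical (unnormalized) Lomb-Scargle power evaluated on unevenly sampled data. The sampling pattern, the simulated 24-hour rhythm, and the period grid are hypothetical choices for illustration and are not taken from the paper.

```python
import math

def lomb_scargle(t, y, omega):
    """Unnormalized Lomb-Scargle power at angular frequency omega."""
    mean_y = sum(y) / len(y)
    yc = [v - mean_y for v in y]
    # The time offset tau makes the sine and cosine terms orthogonal
    # on the actual (possibly uneven) sample times.
    s2 = sum(math.sin(2 * omega * ti) for ti in t)
    c2 = sum(math.cos(2 * omega * ti) for ti in t)
    tau = math.atan2(s2, c2) / (2 * omega)
    cos_t = [math.cos(omega * (ti - tau)) for ti in t]
    sin_t = [math.sin(omega * (ti - tau)) for ti in t]
    y_cos = sum(a * b for a, b in zip(yc, cos_t))
    y_sin = sum(a * b for a, b in zip(yc, sin_t))
    cc = sum(c * c for c in cos_t)
    ss = sum(s * s for s in sin_t)
    return 0.5 * (y_cos ** 2 / cc + y_sin ** 2 / ss)

# Unevenly spaced sample times (hours) of a simulated 24-hour rhythm.
t = [1.7 * i + 0.3 * ((7 * i) % 5) for i in range(200)]
y = [math.sin(2 * math.pi * ti / 24.0) for ti in t]

# Scan candidate periods from 12 to 36 hours and keep the strongest one.
periods = list(range(12, 37))
best_period = max(periods, key=lambda p: lomb_scargle(t, y, 2 * math.pi / p))
```

Only the measured time points enter the sums, so no interpolation onto a regular grid is needed, which is the property the abstract emphasizes.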

20.
Advances in proteomic technologies continue to substantially accelerate capability for generating experimental data on protein levels, states, and activities in biological samples. For example, studies on receptor tyrosine kinase signaling networks can now capture the phosphorylation state of hundreds to thousands of proteins across multiple conditions. However, little is known about the function of many of these protein modifications, or the enzymes responsible for modifying them. To address this challenge, we have developed an approach that enhances the power of clustering techniques to infer functional and regulatory meaning of protein states in cell signaling networks. We have created a new computational framework for applying clustering to biological data in order to overcome the typical dependence on specific a priori assumptions and expert knowledge concerning the technical aspects of clustering. Multiple clustering analysis methodology ('MCAM') employs an array of diverse data transformations, distance metrics, set sizes, and clustering algorithms, in a combinatorial fashion, to create a suite of clustering sets. These sets are then evaluated based on their ability to produce biological insights through statistical enrichment of metadata relating to knowledge concerning protein functions, kinase substrates, and sequence motifs. We applied MCAM to a set of dynamic phosphorylation measurements of the ERBB network to explore the relationships between algorithmic parameters and the biological meaning that could be inferred and report on interesting biological predictions. Further, we applied MCAM to multiple phosphoproteomic datasets for the ERBB network, which allowed us to compare independent and incomplete overlapping measurements of phosphorylation sites in the network. We report specific and global differences of the ERBB network stimulated with different ligands and with changes in HER2 expression. Overall, we offer MCAM as a broadly applicable approach for analysis of proteomic data which may help increase the current understanding of molecular networks in a variety of biological problems.
