Similar Articles
20 similar articles found.
1.
Gene expression data usually contain a large number of genes but only a small number of samples. Feature selection for gene expression data aims at finding a set of genes that best discriminates biological samples of different types. Traditional gene selection based on empirical mutual information suffers from data sparseness due to the small number of samples. To overcome this sparseness, we propose a model-based approach that estimates the entropy of the class variables on the model rather than on the data themselves. We fit the data with multivariate normal distributions, because they have maximum entropy among all real-valued distributions with a specified mean and covariance and are widely used to approximate various distributions. Given that the data follow a multivariate normal distribution, the conditional distribution of the class variables given the selected features is also normal, so its entropy can be computed from the log-determinant of its covariance matrix. Because of the large number of genes, computing all possible log-determinants is inefficient, and we propose several algorithms that greatly reduce the computational cost. Experiments on seven gene expression data sets and a comparison with five other approaches show the accuracy of the multivariate Gaussian generative model for feature selection and the efficiency of our algorithms.
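The log-determinant computation at the heart of this approach is compact in code. A minimal sketch (assuming NumPy; the function name is mine, not the authors') of the differential entropy of a multivariate normal computed from its covariance matrix:

```python
import numpy as np

def gaussian_entropy(cov):
    """Differential entropy (in nats) of a multivariate normal with
    covariance `cov`: h = 0.5 * log((2*pi*e)^d * det(cov)).
    Uses slogdet for numerical stability with many dimensions."""
    cov = np.atleast_2d(cov)
    d = cov.shape[0]
    sign, logdet = np.linalg.slogdet(cov)
    assert sign > 0, "covariance must be positive definite"
    return 0.5 * (d * np.log(2 * np.pi * np.e) + logdet)

# Entropy of a 2-D standard normal: log(2*pi*e) ~ 2.8379 nats
h = gaussian_entropy(np.eye(2))
```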

2.
What is the role of higher-order spike correlations in neuronal information processing? Common data analysis methods that address this question are devised for application to spike recordings from multiple single neurons. Here, we present a new method that evaluates the subthreshold membrane potential fluctuations of one neuron and infers higher-order correlations among the neurons that constitute its presynaptic population. This has two important advantages: very large populations of up to several thousand neurons can be studied, and no spike sorting is required. Moreover, this approach emphasizes the functional aspects of higher-order statistics, since we infer exactly those correlations that are seen by a neuron. Our approach represents the subthreshold membrane potential fluctuations as presynaptic activity filtered with a fixed kernel, as would be the case for a leaky integrator neuron model. This allows us to adapt the recently proposed method CuBIC (cumulant based inference of higher-order correlations from the population spike count; Staude et al., J Comput Neurosci 29(1–2):327–350, 2010c), with which the maximal order of correlation can be inferred. By numerical simulation we show that our new method is reasonably sensitive to weak higher-order correlations, and that only short stretches of membrane potential are required for their reliable inference. Finally, we demonstrate its remarkable robustness against violations of the simplifying assumptions made in its construction, and discuss how it can be employed to analyze in vivo intracellular recordings of membrane potentials.

3.
4.
Precise spike coordination between the spiking activities of multiple neurons is suggested as an indication of coordinated network activity in active cell assemblies. Spike correlation analysis aims to identify such cooperative network activity by detecting excess spike synchrony in simultaneously recorded multiple neural spike sequences. Cooperative activity is expected to organize dynamically during behavior and cognition; therefore currently available analysis techniques must be extended to enable the estimation of multiple time-varying spike interactions between neurons simultaneously. In particular, new methods must take advantage of the simultaneous observations of multiple neurons by addressing their higher-order dependencies, which cannot be revealed by pairwise analyses alone. In this paper, we develop a method for estimating time-varying spike interactions by means of a state-space analysis. Discretized parallel spike sequences are modeled as multivariate binary processes using a log-linear model that provides a well-defined measure of higher-order spike correlation in an information geometry framework. We construct a recursive Bayesian filter/smoother for the extraction of spike interaction parameters. This method can simultaneously estimate the dynamic pairwise spike interactions of multiple single neurons, thereby extending the Ising/spin-glass model analysis of multiple neural spike train data to a nonstationary analysis. Furthermore, the method can estimate dynamic higher-order spike interactions. To validate the inclusion of the higher-order terms in the model, we construct an approximation method to assess the goodness-of-fit to spike data. In addition, we formulate a test method for the presence of higher-order spike correlation even in nonstationary spike data, e.g., data from awake behaving animals. The utility of the proposed methods is tested using simulated spike data with known underlying correlation dynamics. Finally, we apply the methods to neural spike data simultaneously recorded from the motor cortex of an awake monkey and demonstrate that the higher-order spike correlation organizes dynamically in relation to a behavioral demand.
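For a pair of binarized spike trains in the stationary case, the information-geometric interaction measure of such a log-linear model reduces to a log odds ratio of the joint pattern probabilities. A minimal sketch under that simplification (NumPy assumed; the function name and pseudocount are illustrative, and the paper's state-space filter/smoother is not reproduced here):

```python
import numpy as np

def pairwise_interaction(spikes_i, spikes_j):
    """Log-linear (information-geometric) pairwise interaction between two
    binarized spike sequences: theta = log(p11*p00 / (p10*p01)).
    theta == 0 iff the two neurons fire independently."""
    si = np.asarray(spikes_i, dtype=bool)
    sj = np.asarray(spikes_j, dtype=bool)
    eps = 1e-12  # pseudocount so empty pattern cells do not blow up the log
    p11 = np.mean(si & sj) + eps
    p10 = np.mean(si & ~sj) + eps
    p01 = np.mean(~si & sj) + eps
    p00 = np.mean(~si & ~sj) + eps
    return float(np.log(p11 * p00 / (p10 * p01)))
```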

5.
Estimating causal interactions between neurons is important for better understanding the functional connectivity of neuronal networks. We propose a method called normalized permutation transfer entropy (NPTE) to evaluate the temporal causal interaction between spike trains; it quantifies the fraction of ordinal information in one neuron that is also present in another. The performance of this method is evaluated on spike trains generated by an Izhikevich neuronal model. Results show that NPTE can effectively estimate the causal interaction between two neurons regardless of data length. Considering both the precision of the estimated time delay and the robustness of the estimated information flow against neuronal firing rate, NPTE is superior to other information-theoretic methods, including normalized transfer entropy, symbolic transfer entropy and permutation conditional mutual information. To test the performance of NPTE on simulated, biophysically realistic synapses, an Izhikevich cortical network based on the same neuronal model is employed. NPTE is able to characterize mutual interactions and to exactly identify spurious causality in a network of three neurons. We conclude that the proposed method yields more reliable comparisons of interactions between different pairs of neurons and is a promising tool for uncovering further details of the neural code.
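The ordinal-pattern symbolization that all permutation-based measures share is easy to illustrate, even though NPTE itself is not specified in the abstract. A sketch (NumPy assumed; function names are mine) that encodes sliding windows as rank patterns and computes the entropy of their distribution:

```python
import itertools
import numpy as np

def ordinal_patterns(x, order=3):
    """Map each length-`order` window of series `x` to the index of its
    ordinal (rank) pattern -- the symbolization used by permutation-based
    measures such as permutation entropy and NPTE-style estimators."""
    x = np.asarray(x, dtype=float)
    perm_index = {p: i for i, p in enumerate(itertools.permutations(range(order)))}
    return np.array([perm_index[tuple(np.argsort(x[t:t + order]))]
                     for t in range(len(x) - order + 1)])

def permutation_entropy(x, order=3):
    """Shannon entropy (nats) of the ordinal-pattern distribution of `x`."""
    syms = ordinal_patterns(x, order)
    _, counts = np.unique(syms, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())
```

A monotonically increasing series produces a single pattern and hence zero entropy; irregular series spread mass over more patterns.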

6.
Advances in DNA sequencing technology have revolutionized the field of molecular analysis of trophic interactions, and it is now possible to recover counts of food DNA sequences from a wide range of dietary samples. But what do these counts mean? To obtain an accurate estimate of a consumer's diet, should we work strictly with data sets summarizing frequency of occurrence of different food taxa, or is it possible to use the relative number of sequences? Both approaches are applied to obtain semi-quantitative diet summaries, but occurrence data are often promoted as a more conservative and reliable option due to taxa-specific biases in recovery of sequences. We explore representative dietary metabarcoding data sets and point out that diet summaries based on occurrence data often overestimate the importance of food consumed in small quantities (potentially including low-level contaminants) and are sensitive to the count threshold used to define an occurrence. Our simulations indicate that using relative read abundance (RRA) information often provides a more accurate view of population-level diet even with moderate recovery biases incorporated; however, RRA summaries are sensitive to recovery biases impacting common diet taxa. Both approaches are more accurate when the mean number of food taxa in samples is small. The ideas presented here highlight the need to consider all sources of bias and to justify the methods used to interpret count data in dietary metabarcoding studies. We encourage researchers to continue addressing methodological challenges and acknowledge unanswered questions to help spur future investigations in this rapidly developing area of research.
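The contrast between occurrence-based and read-abundance summaries can be made concrete with a toy count table. A hedged sketch (NumPy assumed; function name and data are illustrative only):

```python
import numpy as np

def diet_summaries(counts, occurrence_threshold=1):
    """Two population-level diet summaries from a samples-by-taxa read-count
    table: frequency of occurrence (FOO) and relative read abundance (RRA)."""
    counts = np.asarray(counts, dtype=float)
    # FOO: fraction of samples in which a taxon reaches the count threshold
    foo = np.mean(counts >= occurrence_threshold, axis=0)
    # RRA: per-sample relative read proportions, averaged over samples
    rra = np.mean(counts / counts.sum(axis=1, keepdims=True), axis=0)
    return foo, rra

# A taxon recovered in trace amounts from every sample (column 2) gets the
# same FOO as the dominant taxon, but a far smaller RRA -- the
# overestimation effect discussed above.
table = np.array([[98.0, 2.0],
                  [95.0, 5.0],
                  [99.0, 1.0]])
foo, rra = diet_summaries(table)
```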

7.
The statistical analysis of neuronal spike trains by models of point processes often relies on the assumption of constant process parameters. However, it is a well-known problem that the parameters of empirical spike trains can be highly variable, for example the firing rate. In order to test the null hypothesis of a constant rate and to estimate the change points, a Multiple Filter Test (MFT) and a corresponding algorithm (MFA) have been proposed that can be applied under the assumption of independent inter-spike intervals (ISIs). As empirical spike trains often show weak dependencies in the correlation structure of their ISIs, we extend the MFT here to point processes with short-range dependencies. By explicitly estimating serial dependencies in the test statistic, we show that the new MFT can be applied to a variety of empirical firing patterns, including positive and negative serial correlations as well as tonic and bursty firing. The new MFT is applied to a data set of empirical spike trains with serial correlations, and simulations show improved performance over methods that assume independence. In the case of positive correlations, the new MFT is necessary to reduce the number of false positives, which can be greatly inflated when independence is falsely assumed. For the frequent case of negative correlations, the new MFT shows an improved detection probability for change points and thus a higher potential for signal extraction from noisy spike trains.
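The serial dependencies the extended MFT accounts for are typically summarized by the lag-k correlation of inter-spike intervals. A small diagnostic sketch (NumPy assumed; this is only the ISI statistic, not the multiple filter test itself):

```python
import numpy as np

def isi_serial_correlation(spike_times, lag=1):
    """Lag-`lag` serial correlation of inter-spike intervals (ISIs).
    Negative values indicate short ISIs tend to follow long ones and
    vice versa; positive values indicate clustering of similar ISIs."""
    isis = np.diff(np.sort(np.asarray(spike_times, dtype=float)))
    a, b = isis[:-lag], isis[lag:]
    return float(np.corrcoef(a, b)[0, 1])

# Perfectly alternating short/long ISIs give a strongly negative correlation
times = np.cumsum([0.0, 1, 3, 1, 3, 1, 3, 1, 3])
rho = isi_serial_correlation(times)
```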

8.
Analysis of sensory neurons' processing characteristics requires simultaneous measurement of presented stimuli and concurrent spike responses. The functional transformation from high-dimensional stimulus space to the binary space of spike and non-spike responses is commonly described with linear-nonlinear models, whose linear filter component describes the neuron's receptive field. From a machine learning perspective, this corresponds to the binary classification problem of discriminating spike-eliciting from non-spike-eliciting stimulus examples. The classification-based receptive field (CbRF) estimation method proposed here adapts a linear large-margin classifier to optimally predict experimental stimulus-response data and subsequently interprets the learned classifier weights as the neuron's receptive field filter. Computational learning theory provides a theoretical framework for learning from data and guarantees optimality in the sense that the risk of erroneously assigning a spike-eliciting stimulus example to the non-spike class (and vice versa) is minimized. Efficacy of the CbRF method is validated with simulations and for auditory spectro-temporal receptive field (STRF) estimation from experimental recordings in the auditory midbrain of Mongolian gerbils. Acoustic stimulation is performed with frequency-modulated tone complexes that mimic properties of natural stimuli, specifically a non-Gaussian amplitude distribution and higher-order correlations. Results demonstrate that the proposed approach successfully identifies correct underlying STRFs, even in cases where second-order methods based on the spike-triggered average (STA) do not. Applied to small data samples, the method is shown to converge on smaller amounts of experimental recordings and with lower estimation variance than the generalized linear model and recent information theoretic methods. Thus, CbRF estimation may prove useful for the investigation of neuronal processes in response to natural stimuli and in settings where rapid adaptation is induced by experimental design.

9.
MOTIVATION: Arrays allow measurements of the expression levels of thousands of mRNAs to be made simultaneously. The resulting data sets are information rich but require extensive mining to enhance their usefulness. Information theoretic methods are capable of assessing similarities and dissimilarities between data distributions and may be suited to the analysis of gene expression experiments. The purpose of this study was to investigate information theoretic data mining approaches to discover temporal patterns of gene expression from array-derived gene expression data. RESULTS: The Kullback-Leibler (KL) divergence, an information-theoretic distance that measures the relative dissimilarity between two data distribution profiles, was used in conjunction with an unsupervised self-organizing map algorithm. Two published, array-derived gene expression data sets were analyzed. The patterns obtained with the KL clustering method were found to be superior to those obtained with the hierarchical clustering algorithm using the Pearson correlation distance measure. The biological significance of the results was also examined. AVAILABILITY: Software code is available by request from the authors. All programs were written in ANSI C and Matlab (Mathworks Inc., Natick, MA).
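For discrete expression profiles, the KL divergence used as the distance measure here is a one-liner. A sketch (NumPy assumed; inputs are taken to be normalized and strictly positive, whereas a production implementation would also need to handle zero entries):

```python
import numpy as np

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits between two discrete
    distribution profiles. Asymmetric: D(p||q) != D(q||p) in general."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log2(p / q)))

d = kl_divergence([0.5, 0.5], [0.25, 0.75])
```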

10.
We develop a general minimally coupled subspace approach (MCSA) to compute absolute entropies of macromolecules, such as proteins, from computer generated canonical ensembles. Our approach overcomes limitations of current estimates such as the quasi-harmonic approximation which neglects non-linear and higher-order correlations as well as multi-minima characteristics of protein energy landscapes. Here, Full Correlation Analysis, adaptive kernel density estimation, and mutual information expansions are combined and high accuracy is demonstrated for a number of test systems ranging from alkanes to a 14 residue peptide. We further computed the configurational entropy for the full 67-residue cofactor of the TATA box binding protein illustrating that MCSA yields improved results also for large macromolecular systems.

11.
The multidimensional computations performed by many biological systems are often characterized with limited information about the correlations between inputs and outputs. Given this limitation, our approach is to construct the maximum noise entropy response function of the system, leading to a closed-form and minimally biased model consistent with a given set of constraints on the input/output moments; the result is equivalent to conditional random field models from machine learning. For systems with binary outputs, such as neurons encoding sensory stimuli, the maximum noise entropy models are logistic functions whose arguments depend on the constraints. A constraint on the average output turns the binary maximum noise entropy models into minimum mutual information models, allowing for the calculation of the information content of the constraints and an information theoretic characterization of the system's computations. We use this approach to analyze the nonlinear input/output functions in macaque retina and thalamus; although these systems have been previously shown to be responsive to two input dimensions, the functional form of the response function in this reduced space had not been unambiguously identified. A second order model based on the logistic function is found to be both necessary and sufficient to accurately describe the neural responses to naturalistic stimuli, accounting for an average of 93% of the mutual information with a small number of parameters. Thus, despite the fact that the stimulus is highly non-Gaussian, the vast majority of the information in the neural responses is related to first and second order correlations. Our results suggest a principled and unbiased way to model multidimensional computations and determine the statistics of the inputs that are being encoded in the outputs.

12.
Gene set analysis methods are popular tools for identifying differentially expressed gene sets in microarray data. Most existing methods use a permutation test to assess significance for each gene set. The permutation test's assumption of exchangeable samples is often not satisfied for time-series data and complex experimental designs, and in addition it requires a certain number of samples to compute p-values accurately. The method presented here uses a rotation test rather than a permutation test to assess significance. The rotation test can compute accurate p-values also for very small sample sizes. The method can handle complex designs and is particularly suited for longitudinal microarray data where the samples may have complex correlation structures. Dependencies between genes, modeled with the use of gene networks, are incorporated in the estimation of correlations between samples. In addition, the method can test for both gene sets that are differentially expressed and gene sets that show strong time trends. We show on simulated longitudinal data that the ability to identify important gene sets may be improved by taking the correlation structure between samples into account. Applied to real data, the method identifies both gene sets with constant expression and gene sets with strong time trends.
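The key ingredient of a rotation test is a Haar-distributed random orthogonal matrix: under a Gaussian null hypothesis, rotated data have the same distribution as the original, so statistics computed on rotated data trace out the null distribution. A sketch of the standard QR construction (NumPy assumed; the full gene-set procedure is not reproduced here):

```python
import numpy as np

def random_orthogonal(n, rng):
    """Draw an n-by-n orthogonal matrix uniformly (Haar measure) via the QR
    decomposition of a Gaussian matrix. The column sign fix removes the
    bias introduced by the QR sign convention."""
    A = rng.standard_normal((n, n))
    Q, R = np.linalg.qr(A)
    Q *= np.sign(np.diag(R))  # flip columns so the draw is Haar-distributed
    return Q

Q = random_orthogonal(5, np.random.default_rng(0))
```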

13.
Recent developments in electrophysiological and optical recording techniques enable the simultaneous observation of large numbers of neurons. A meaningful interpretation of the resulting multivariate data, however, presents a serious challenge. In particular, the estimation of higher-order correlations that characterize the cooperative dynamics of groups of neurons is impeded by the combinatorial explosion of the parameter space. The resulting requirements with respect to sample size and recording time have rendered the detection of coordinated neuronal groups exceedingly difficult. Here we describe a novel approach to infer higher-order correlations in massively parallel spike trains that is less susceptible to these problems. Based on the superimposed activity of all recorded neurons, the cumulant-based inference of higher-order correlations (CuBIC) presented here exploits the fact that the absence of higher-order correlations also imposes strong constraints on correlations of lower order. Thus, estimates of only a few lower-order cumulants suffice to infer higher-order correlations in the population. As a consequence, CuBIC is much better compatible with the constraints of in vivo recordings than previous approaches, which is shown by a systematic analysis of its parameter dependence.
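CuBIC works from low-order cumulants of the superimposed population activity. A sketch of the simplest ingredient, naive estimators of the first three sample cumulants (NumPy assumed; the published test uses bias-corrected k-statistics and an explicit compound-Poisson null model, neither of which is reproduced here):

```python
import numpy as np

def sample_cumulants(pop_count):
    """Naive estimates of the first three cumulants of a binned population
    spike count: mean, variance, and third central moment (which equals
    the third cumulant)."""
    z = np.asarray(pop_count, dtype=float)
    mu = z.mean()
    k1 = mu
    k2 = float(np.mean((z - mu) ** 2))   # variance
    k3 = float(np.mean((z - mu) ** 3))   # third central moment = 3rd cumulant
    return k1, k2, k3

k1, k2, k3 = sample_cumulants([0, 1, 2, 3])
```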

14.
We study the equilibrium in the use of synonymous codons by eukaryotic organisms and find five equations involving substitution rates that we believe embody the important implications of equilibrium for the process of silent substitution. We then combine these five equations with additional criteria to determine sets of substitution rates applicable to eukaryotic organisms. One method employs the equilibrium equations and a principle of maximum entropy to find the most uniform set of rates consistent with equilibrium. In a second method we combine the equilibrium equations with data on the man-mouse divergence to determine that set of rates that is most neutral yet consistent with both types of data (i.e., equilibrium and divergence data). Simulations show this second method to be quite reliable in spite of significant saturation in the substitution process. We find that when divergence data are included in the calculation of rates, even though these rates are chosen to be as neutral as possible, the strength of selection inferred from the nonuniformity of the rates is approximately doubled. Both sets of rates are applied to estimate the human-mouse divergence time based on several independent subsets of the divergence data consisting of the quartet, C- or T-ending duet, and A- or G-ending duet codon sets. Both rate sets produce patterns of divergence times that are shortest for the quartet data, intermediate for the CT-ending duets, and longest for the AG-ending duets. This indicates that rates of transitions in the duet-codon sets are significantly higher than those in the quartet-codon sets; this effect is especially marked for A→G, the rate of which in duets must be about double that in quartets.

15.
Wennekers T, Ay N, Andras P. Biosystems, 2007, 89(1–3): 190–197
It has been argued that information processing in the cortex is optimised with regard to certain information theoretic principles. We have, for instance, recently shown that spike-timing dependent plasticity can improve an information-theoretic measure called spatio-temporal stochastic interaction, which captures how strongly a set of neurons cooperates in space and time. Systems with high stochastic interaction reveal Poisson spike trains but nonetheless occupy only a strongly reduced area in their global phase space; they reveal repeating but complex global activation patterns, and they can be interpreted as computational systems operating on selected sets of collective patterns or "global states" in a rule-like manner. In the present work we investigate stochastic interaction in high-resolution EEG data from cat auditory cortex. Using Kohonen maps to reduce the high-dimensional dynamics of the system, we are able to detect repeating system states and estimate the stochastic interaction in the data, which turns out to be fairly high. This suggests an organised cooperation in the neural networks underlying the data and may reflect generic intrinsic computational capabilities of the cortex.

16.
Microarray experiments generate data sets with information on the expression levels of thousands of genes in a set of biological samples. Unfortunately, such experiments often produce multiple missing expression values, normally due to various experimental problems. As many algorithms for gene expression analysis require a complete data matrix as input, the missing values have to be estimated in order to analyze the available data. Alternatively, genes and arrays can be removed until no missing values remain. However, for genes or arrays with only a small number of missing values, it is desirable to impute those values. For the subsequent analysis to be as informative as possible, it is essential that the estimates for the missing gene expression values are accurate. A small amount of badly estimated missing values in the data might be enough for clustering methods, such as hierarchical clustering or K-means clustering, to produce misleading results. Thus, accurate methods for missing value estimation are needed. We present novel methods for estimation of missing values in microarray data sets that are based on the least squares principle, and that utilize correlations between both genes and arrays. For this set of methods, we use the common reference name LSimpute. We compare the estimation accuracy of our methods with the widely used KNNimpute on three complete data matrices from public data sets by randomly knocking out data (labeling it as missing). From these tests, we conclude that our LSimpute methods produce estimates that are consistently more accurate than those obtained using KNNimpute. Additionally, we examine a more classic approach to missing value estimation based on expectation maximization (EM). We refer to our EM implementations as EMimpute, and the estimation errors of the EMimpute methods are compared with those produced by our novel methods. The results indicate that, on average, the estimates from our best performing LSimpute method are at least as accurate as those from the best EMimpute algorithm.
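The least-squares idea can be illustrated with a single-predictor toy version: regress the gene carrying the missing value on its most correlated neighbor over the observed arrays. This is only a sketch of the principle (NumPy assumed; function name is mine, and LSimpute proper combines several correlated genes and both gene- and array-based estimates):

```python
import numpy as np

def ls_impute_one(X, gene, array):
    """Estimate X[gene, array] (assumed missing) by least-squares regression
    on the gene most correlated with `gene` across the remaining arrays."""
    mask = np.ones(X.shape[1], dtype=bool)
    mask[array] = False                 # hold out the array with the gap
    y = X[gene, mask]
    best, best_r = None, 0.0
    for g in range(X.shape[0]):         # pick the best predictor gene
        if g == gene:
            continue
        r = np.corrcoef(X[g, mask], y)[0, 1]
        if abs(r) > abs(best_r):
            best, best_r = g, r
    slope, intercept = np.polyfit(X[best, mask], y, 1)
    return slope * X[best, array] + intercept

# Gene 1 is exactly 2 * gene 0 + 1, so the imputed value is recovered exactly
X = np.array([[1.0, 2.0, 3.0, 4.0],
              [3.0, 5.0, 7.0, 9.0],
              [5.0, 1.0, 4.0, 2.0]])
value = ls_impute_one(X, gene=1, array=3)
```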

17.
Is the Rapoport effect widespread? Null models revisited
Aim  To test the Rapoport effect using null models and data sets taken from the literature. We propose an improvement on an existing method, testing the Rapoport effect in elevational and latitudinal distributions when distributions are restricted by sampling.
Location  Global.
Methods  First, we hypothesized that real range size distributions are similar to those expected by null assumptions (expected by only imposing boundaries to species distributions). When these distributions were different from those expected under the null assumptions, we tested the hypothesis that these distributions correspond to those expected when a Rapoport effect occurs. We used two simulation methods, random and pseudo-random, which differed only in that the latter one assumes fixed species mid-points, coinciding with real mid-points. Observed correlations between range size and mid-point were compared with the frequency distribution of 1000 simulations, using both simulation methods. We compared the correlation curves generated by 1000 simulations with those of the observed distributions, testing whether correlations indicated a Rapoport effect.
Results  Several significant patterns of correlations between range size and mid-point were observed in the data sets when compared with random and pseudo-random simulations. However, few of these correlations were consistent with a Rapoport effect.
Main conclusions  Although some recent studies are consistent with a Rapoport effect, our results suggest that the Rapoport effect is not a widespread pattern in global ecology.

18.
The shape of stimulus onset is a distinct feature of many acoustic communication signals. In some grasshopper species the steepness of the amplitude rise of the pulses that comprise the song subunits is sexually dimorphic and a major criterion of sex recognition. Here, we describe potential mechanisms by which auditory interneurons could transmit information on onset steepness from the metathoracic ganglion to the brain of the grasshopper. Since no single interneuron unequivocally encoded onset steepness, this information must reside in the relative spike counts or the relative spike timing of a small group of ascending auditory interneurons. The decisive component of this mechanism seems to be the steepness-dependent leading inhibition displayed by two interneurons (AN3, AN4). The inhibition increased with increasing onset steepness, thus delaying the excitatory response, and in one interneuron it also strongly reduced the spike count. Other ascending interneurons, whose responses were little affected by onset steepness, could serve as reference neurons (AN6, AN12). Thus, our results suggest that a comparison of both spike count and first-spike timing within a small set of ascending interneurons could yield information on signal onset steepness, that is, on the sex of the sender.

20.
The geochemical evaluation methodology described in this paper is used to distinguish contaminated samples from those that contain only naturally occurring levels of inorganic constituents. Site-to-background comparisons of trace elements in soil based solely on statistical techniques are prone to high false positive indications. Trace element distributions in soil tend to span a wide range of concentrations and are highly right-skewed, approximating lognormal distributions, and background data sets are typically too small to capture this range. Geochemical correlations of trace versus major elements are predicated on the natural elemental associations in soil. Linear trends with positive slopes are expected for scatter plots of specific trace versus major elements in uncontaminated samples. Individual samples that may contain a component of contamination are identified by their positions off the trend formed by uncontaminated samples. In addition to pinpointing which samples may be contaminated, this technique provides mechanistic explanations for naturally elevated element concentrations, information that a purely statistical approach cannot provide. These geochemical evaluations have been successfully performed at numerous facilities across the United States. Removing naturally occurring constituents from consideration early in a site investigation reduces or eliminates unnecessary investigation and risk assessment, and focuses remediation efforts.
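The scatter-plot screening step can be sketched as a linear fit of trace versus major element concentrations with flagging of large positive residuals. A simplified illustration (NumPy assumed; function name, threshold and data are hypothetical, and real evaluations also weigh mechanistic geochemical arguments rather than residuals alone):

```python
import numpy as np

def flag_enriched_samples(major, trace, n_sigma=3.0):
    """Fit the natural linear trend of a trace element against a major
    element and flag samples lying more than `n_sigma` residual standard
    deviations ABOVE the trend as possibly contaminated."""
    major = np.asarray(major, dtype=float)
    trace = np.asarray(trace, dtype=float)
    slope, intercept = np.polyfit(major, trace, 1)
    resid = trace - (slope * major + intercept)
    return resid > n_sigma * resid.std()

# Nine background samples on a trend plus one spiked sample; the lenient
# threshold suits this tiny illustrative data set.
major = np.arange(10.0)
trace = 2.0 * major + 1.0
trace[7] += 50.0
flags = flag_enriched_samples(major, trace, n_sigma=2.0)
```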
