Similar Articles
20 similar articles retrieved.
1.
Gene expression data usually contain a large number of genes but a small number of samples. Feature selection for gene expression data aims at finding a set of genes that best discriminate biological samples of different types. In machine learning, traditional gene selection based on empirical mutual information suffers from a data sparseness issue due to the small number of samples. To overcome the sparseness issue, we propose a model-based approach that estimates the entropy of the class variables on the model instead of on the data themselves. Here, we use multivariate normal distributions to fit the data, because multivariate normal distributions have maximum entropy among all real-valued distributions with a specified mean and covariance and are widely used to approximate various distributions. Given that the data follow a multivariate normal distribution, the conditional distribution of the class variables given the selected features is a normal distribution, so its entropy can be computed from the log-determinant of its covariance matrix. Because of the large number of genes, computing all possible log-determinants is not efficient, and we propose several algorithms to greatly reduce the computational cost. Experiments on seven gene expression data sets and comparisons with five other approaches show the accuracy of the multivariate Gaussian generative model for feature selection and the efficiency of our algorithms.
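A minimal, illustrative sketch (not the authors' exact algorithm) of the model-based idea: entropies are computed from a fitted Gaussian's covariance via its log-determinant rather than from empirical histograms, and features are scored by the resulting mutual information. The data, class labels, and greedy selection loop are synthetic assumptions.

```python
import numpy as np

def gaussian_entropy(cov):
    """Differential entropy 0.5*log((2*pi*e)^d * det(cov)) of a Gaussian."""
    d = cov.shape[0]
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (d * np.log(2.0 * np.pi * np.e) + logdet)

def mutual_information(X, y, features):
    """I(X_S; Y) = H(X_S) - sum_c p(c) H(X_S | Y=c), all under Gaussian fits."""
    S = list(features)
    ridge = 1e-6 * np.eye(len(S))                      # stabilises small-sample covariances
    h_marginal = gaussian_entropy(np.cov(X[:, S], rowvar=False) + ridge)
    h_cond = 0.0
    for c in np.unique(y):
        Xc = X[y == c][:, S]
        h_cond += np.mean(y == c) * gaussian_entropy(np.cov(Xc, rowvar=False) + ridge)
    return h_marginal - h_cond

# Toy greedy selection on synthetic "expression" data: 60 samples, 200 genes,
# with only the first 5 genes carrying class information.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=60)
X = rng.normal(size=(60, 200)) + 1.5 * y[:, None] * (np.arange(200) < 5)
selected = []
for _ in range(3):
    gains = [mutual_information(X, y, selected + [j]) if j not in selected else -np.inf
             for j in range(X.shape[1])]
    selected.append(int(np.argmax(gains)))
print("selected genes:", selected)
```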

2.
Simultaneous spike counts of neural populations are typically modeled by a Gaussian distribution. On short time scales, however, this distribution is too restrictive to describe and analyze multivariate distributions of discrete spike counts. We present an alternative that is based on copulas and can account for arbitrary marginal distributions, including Poisson and negative binomial distributions, as well as second- and higher-order interactions. We describe maximum-likelihood procedures for fitting copula-based models to spike-count data, and we derive a so-called flashlight transformation that makes it possible to move the tail dependence of an arbitrary copula into an arbitrary orthant of the multivariate probability distribution. Mixtures of copulas that combine different dependence structures, and thereby model different driving processes simultaneously, are also introduced. First, we apply copula-based models to populations of integrate-and-fire neurons receiving partially correlated input and show that the best-fitting copulas provide information about the functional connectivity of coupled neurons, which can be extracted using the flashlight transformation. We then apply the new method to data recorded from macaque prefrontal cortex using a multi-tetrode array. We find that copula-based distributions with negative binomial marginals provide a more appropriate stochastic model for the multivariate spike-count distributions than the multivariate Poisson latent-variable distribution and the often-used multivariate normal distribution. The dependence structure of these distributions provides evidence for common inhibitory input to all recorded stimulus-encoding neurons. Finally, we show that copula-based models can be successfully used to evaluate neural codes, e.g., to characterize stimulus-dependent spike-count distributions with information measures. This demonstrates that copula-based models are not only a versatile class of models for multivariate distributions of spike counts, but can also be exploited to understand functional dependencies.
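A hedged sketch of one construction in this family: correlated spike counts drawn from a Gaussian copula with negative binomial marginals. It is illustrative only; the paper also covers other copula families, the flashlight transformation, and mixtures, and the parameters below are arbitrary.

```python
import numpy as np
from scipy import stats

def sample_gaussian_copula_nb(n, corr, nb_r, nb_p, seed=0):
    """Draw n joint spike-count vectors.
    corr : latent Gaussian correlation matrix (d x d)
    nb_r, nb_p : per-neuron negative binomial parameters (length d)
    """
    rng = np.random.default_rng(seed)
    d = corr.shape[0]
    z = rng.multivariate_normal(np.zeros(d), corr, size=n)   # latent Gaussian dependence
    u = stats.norm.cdf(z)                                     # uniform marginals, copula retained
    counts = np.column_stack([stats.nbinom.ppf(u[:, i], nb_r[i], nb_p[i])
                              for i in range(d)])             # NB marginals via inverse CDF
    return counts.astype(int)

# Two hypothetical neurons with moderate latent correlation.
corr = np.array([[1.0, 0.5], [0.5, 1.0]])
counts = sample_gaussian_copula_nb(10000, corr, nb_r=[3.0, 5.0], nb_p=[0.4, 0.5])
print("empirical count correlation:", np.corrcoef(counts.T)[0, 1])
```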

3.
Simulation software programs continue to evolve to meet the needs of risk analysts. In the past several years, two spreadsheet add-in programs have added the capability of fitting distributions to data to their tool kits using classical statistical (i.e., non-Bayesian) methods. Crystal Ball version 4.0 now contains this capability in its standard program (and in Crystal Ball Pro version 4.0), while the BestFit software program is a component of the @RISK Decision Tools Suite that can also be purchased as a stand-alone program. Both programs will automatically fit distributions to continuous data using maximum likelihood estimators and provide goodness-of-fit statistics based on chi-squared, Kolmogorov-Smirnov, and Anderson-Darling tests. BestFit will also fit discrete distributions, and for all distributions it offers the option of optimizing the fit based on the goodness-of-fit statistics. Analysts should be wary of placing too much emphasis on the goodness-of-fit statistics, given their limitations and the fact that only some of the statistics are appropriately corrected for the distribution parameters having been estimated from the same data. These programs dramatically simplify efforts to use maximum likelihood estimation to fit distributions. However, the fact that a program was used to fit distributions should not be viewed as validation that the data have been fitted and interpreted correctly. Both programs rely heavily on the analyst's judgment and will allow analysts to fit inappropriate distributions. Currently, both programs could be improved by adding the ability to perform extensive basic exploratory data analysis and to provide the regression diagnostics needed to satisfy critical analysts or reviewers. Given that Bayesian methods are central to risk analysis, adding the capability of fitting distributions by combining data with prior information would greatly increase the utility of these programs.
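An illustrative sketch of what these packages automate, with scipy standing in for the commercial tools: maximum likelihood fits of candidate continuous distributions plus goodness-of-fit statistics. The data and candidate models are assumptions for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
data = rng.lognormal(mean=1.0, sigma=0.6, size=500)   # hypothetical risk-analysis data

# MLE fits of two candidate models.
lognorm_params = stats.lognorm.fit(data, floc=0)
gamma_params = stats.gamma.fit(data, floc=0)

# Goodness of fit: Kolmogorov-Smirnov, plus Anderson-Darling on log-data for normality.
# Caveat noted in the abstract: these statistics are not automatically corrected
# for the fact that the parameters were estimated from the same data.
print("lognormal KS:", stats.kstest(data, "lognorm", args=lognorm_params).statistic)
print("gamma     KS:", stats.kstest(data, "gamma", args=gamma_params).statistic)
print("lognormal AD:", stats.anderson(np.log(data), dist="norm").statistic)
```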

4.
Current methods for analyzing data from studies of protein-protein interactions using fluorescence resonance energy transfer (FRET) emerged from several decades of research using wide-field microscopes and spectrofluorometers to measure fluorescence from individual cells or cell populations. Inherent in most measurements is an averaging of the distributions of FRET efficiencies over large populations of protein complexes, which washes out information regarding the stoichiometry and structure of the protein complexes. Although the introduction of laser-scanning microscopes could in principle facilitate quantification of the distributions of FRET efficiencies in live cells, only comparatively recently did this potential fully materialize, through the development of spectral- or lifetime-based approaches. To exploit this new opportunity in molecular imaging, it is necessary to further develop theoretical models and methods of data analysis. Using Monte Carlo simulations, we investigated FRET in homogeneous and inhomogeneous spatial distributions of molecules. Our results indicate that an analysis based on distributions of FRET efficiencies presents significant advantages over the average-based approach, including allowing for proper identification of biologically relevant FRET. This study provides insights into the effect of molecular crowding on FRET, and it offers a basis for extracting information from distributions of FRET efficiencies using simulation-based data fitting.
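A toy Monte Carlo sketch of the point made above: the distribution of FRET efficiencies carries information that the population average washes out. It uses the standard relation E = 1 / (1 + (r/R0)^6); the two sub-populations, distances, and Förster radius are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
R0 = 5.0  # nm, hypothetical Forster radius

# Two sub-populations of complexes with different donor-acceptor distances.
r = np.concatenate([rng.normal(4.0, 0.3, 5000),    # tight complexes
                    rng.normal(7.0, 0.5, 5000)])   # loose complexes / bystanders
E = 1.0 / (1.0 + (r / R0) ** 6)

print("population-average FRET efficiency:", E.mean())   # one blurred number
hist, edges = np.histogram(E, bins=20, range=(0, 1))
print("efficiency histogram (counts per bin):", hist)    # clearly bimodal
```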

5.
The Catria is one of the 22 native Italian horse breeds that survive today from a once larger number. Thirty individuals representative of the Catria horse were analyzed at 11 microsatellites and compared with data from 10 breeds reared in Italy. Three different approaches (genetic distances, correspondence analysis, and clustering methods) were used to study genetic relationships between the Catria and the other horse populations. Genetic differentiation among breeds was highly significant (P < 0.01) for all loci. Average F(ST) values indicate that around 10% of the total genetic variation is explained by between-breed differences, and the three approaches gave similar results. The Italian native breeds are clearly separated from the other breeds examined. However, in the correspondence analysis the Catria appears closer to the Maremmano and the Murgese. Bayesian approaches provide further information, indicating a common origin of the Catria with the Maremmano and the Italian Heavy Draught. Genetic relationships between the Catria and the other breeds are consistent with the breed's documented history. The data and information reported here can be used in the organization of conservation programmes planned to reduce inbreeding and to minimize loss of genetic variability.

6.
Multiple imputation (MI) is increasingly popular for handling multivariate missing data. Two general approaches are available in standard computer packages: MI based on the posterior distribution of the incomplete variables under a multivariate (joint) model, and fully conditional specification (FCS), which imputes missing values using univariate conditional distributions for each incomplete variable given all the others, cycling iteratively through the univariate imputation models. In the context of longitudinal or clustered data, it is not clear whether these approaches yield consistent estimates of regression coefficients and variance component parameters when the analysis model of interest is a linear mixed effects model (LMM) with both random intercepts and slopes, and either the covariates alone or both the covariates and the outcome contain missing information. In the current paper, we compared the performance of seven different MI methods for handling missing values in longitudinal and clustered data in the context of fitting LMMs with both random intercepts and slopes. We study the theoretical compatibility between the specific imputation models fitted under each of these approaches and the LMM, and we also conduct simulation studies in both the longitudinal and the clustered data settings. The simulations were motivated by analyses of the association between body mass index (BMI) and quality of life (QoL) in the Longitudinal Study of Australian Children (LSAC). Our findings show that the relative performance of MI methods varies according to whether the incomplete covariate has fixed or random effects and whether there is missingness in the outcome variable. The simulations also show that compatible imputation and analysis models result in consistent estimation of both regression parameters and variance components. We illustrate our findings with the analysis of the LSAC data.
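A small sketch of the fully conditional specification (FCS) idea: each incomplete variable is imputed from a model given all the others, cycling until convergence. scikit-learn's IterativeImputer is used here purely as a stand-in; it is not one of the seven MI methods compared in the paper, the BMI/QoL data are simulated, and a single completed data set is produced rather than multiple imputations.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (activates IterativeImputer)
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(3)
n = 200
bmi = rng.normal(25, 4, n)
qol = 80 - 0.8 * bmi + rng.normal(0, 3, n)        # hypothetical QoL ~ BMI relation
X = np.column_stack([bmi, qol])
X[rng.random(n) < 0.3, 0] = np.nan                # 30% of BMI values missing

# sample_posterior=True draws imputations from a predictive distribution;
# for proper MI one would repeat with different random_state values and
# pool the fitted-model estimates with Rubin's rules.
imputer = IterativeImputer(sample_posterior=True, random_state=0)
X_completed = imputer.fit_transform(X)
print("imputed BMI mean:", X_completed[:, 0].mean())
```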

7.
High-throughput identification of peptides in databases from tandem mass spectrometry data is a key technique in modern proteomics. Common approaches to interpreting large-scale peptide identification results are based on the statistical analysis of average score distributions, which are constructed from the set of best scores produced by large collections of MS/MS spectra using search engines such as SEQUEST. Other approaches calculate individual peptide identification probabilities on the basis of theoretical models or from single-spectrum score distributions constructed from the set of scores produced by each MS/MS spectrum. In this work, we study the mathematical properties of average SEQUEST score distributions by introducing the concept of spectrum quality and expressing these average distributions as compositions of single-spectrum distributions. We predict, and demonstrate in practice, that average score distributions are dominated by the quality distribution in the spectra collection, except in the low-probability region, where it is possible to predict the dependence of average probability on database size. Our analysis leads to a novel indicator, the probability ratio, which optimally takes into account the statistical information provided by the first and second best scores. The probability ratio is a non-parametric and robust indicator that makes spectrum classification by parameters such as charge state unnecessary and, judged by false discovery rates, yields peptide identification performance better than that obtained with other empirical statistical approaches. The probability ratio also compares favorably with statistical probability indicators obtained by constructing single-spectrum SEQUEST score distributions. These results make the robustness, conceptual simplicity, and ease of automation of the probability ratio algorithm a very attractive alternative for determining peptide identification confidences and error rates in high-throughput experiments.

8.
Ewing G, Nicholls G, Rodrigo A. Genetics 2004, 168(4): 2407-2420
We present a Bayesian statistical inference approach for simultaneously estimating the mutation rate, population sizes, and migration rates in an island-structured population, using temporal and spatial sequence data. Markov chain Monte Carlo is used to collect samples from the posterior probability distribution. We demonstrate that this chain implementation successfully reaches equilibrium and recovers the true parameter values for simulated data. A real HIV DNA sequence data set with two demes, semen and blood, is used as an example to demonstrate the method by fitting asymmetric migration rates and different population sizes. This data set exhibits a bimodal joint posterior distribution, with the modes favoring different preferred migration directions. The full data set was subsequently split temporally for further analysis. The qualitative behavior of one subset was similar to the bimodal distribution observed with the full data set. The temporally split data showed significant differences in the posterior distributions and estimates of parameter values over time.

9.

Aims

The fitting of statistical distributions to microbial sampling data is a common application in quantitative microbiology and risk assessment. An underlying assumption of most fitting techniques is that data are collected with simple random sampling, which is often not the case. This study develops a weighted maximum likelihood estimation framework that is appropriate for microbiological samples collected with unequal probabilities of selection.

Methods and Results

A weighted maximum likelihood estimation framework is proposed for microbiological samples that are collected with unequal probabilities of selection. Two examples, based on the collection of food samples during processing, are provided to demonstrate the method and highlight the magnitude of biases in the maximum likelihood estimator when data are inappropriately treated as a simple random sample.

Conclusions

Failure to properly weight samples to account for how data are collected can introduce substantial biases into inferences drawn from the data.

Significance and Impact of the Study

The proposed methodology will reduce or eliminate an important source of bias in inferences drawn from the analysis of microbial data. This will also make comparisons between studies and the combination of results from different studies more reliable, which is important for risk assessment applications.
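A hedged sketch of the weighted maximum likelihood idea described above: each observation's log-likelihood contribution is weighted by the inverse of its selection probability (a Horvitz-Thompson-style pseudo-likelihood). The lognormal concentration model, selection mechanism, and parameter values are illustrative assumptions, not the paper's examples.

```python
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(4)
x = rng.lognormal(mean=2.0, sigma=1.0, size=300)      # hypothetical microbial concentrations
pi = np.clip(0.05 + 0.9 * (x / x.max()), 0.05, 1.0)   # high concentrations sampled more often
sampled = rng.random(300) < pi
x_obs, w = x[sampled], 1.0 / pi[sampled]               # observed values and design weights

def neg_weighted_loglik(params):
    mu, sigma = params
    if sigma <= 0:
        return np.inf
    return -np.sum(w * stats.lognorm.logpdf(x_obs, sigma, scale=np.exp(mu)))

res = optimize.minimize(neg_weighted_loglik, x0=[1.0, 1.0], method="Nelder-Mead")
print("weighted MLE (mu, sigma):  ", res.x)

# Ignoring the design biases the fit toward the over-sampled high concentrations.
s_unw, _, scale_unw = stats.lognorm.fit(x_obs, floc=0)
print("unweighted MLE (mu, sigma):", np.log(scale_unw), s_unw)
```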

10.
We prove that the generalized Poisson distribution GP(θ, η) (η ≥ 0) is a mixture of Poisson distributions; this is a new property for a distribution that is the topic of the book by Consul (1989). Because the fits of the generalized Poisson and negative binomial distributions to count data are often similar, we compare their probability mass functions and skewnesses with the first two moments fixed in order to understand their differences. They differ only slightly in many situations, but their zero-inflated distributions, with the mass at zero, mean, and variance fixed, can differ more. These probabilistic comparisons are helpful in selecting a better-fitting distribution for modelling count data with long right tails. Through a real example of count data with a large zero fraction, we illustrate how the generalized Poisson and negative binomial distributions, as well as their zero-inflated distributions, can be discriminated.
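A sketch of the moment-matched comparison discussed above: the generalized Poisson GP(θ, η) pmf, P(X = x) = θ(θ + ηx)^(x-1) e^{-(θ + ηx)} / x!, with mean θ/(1-η) and variance θ/(1-η)^3 for 0 ≤ η < 1, set against a negative binomial with the same first two moments. The parameter values are arbitrary.

```python
import numpy as np
from scipy import stats
from scipy.special import gammaln

def gp_pmf(x, theta, eta):
    """Generalized Poisson pmf evaluated on the log scale for stability."""
    x = np.asarray(x, dtype=float)
    logp = (np.log(theta) + (x - 1) * np.log(theta + eta * x)
            - (theta + eta * x) - gammaln(x + 1))
    return np.exp(logp)

theta, eta = 2.0, 0.4
mean = theta / (1 - eta)
var = theta / (1 - eta) ** 3

# Negative binomial with matched mean and variance: var = mean + mean^2 / r.
r = mean ** 2 / (var - mean)
p = r / (r + mean)

x = np.arange(0, 15)
print(" x   GP pmf    NB pmf")
for xi, g, nb in zip(x, gp_pmf(x, theta, eta), stats.nbinom.pmf(x, r, p)):
    print(f"{xi:2d}  {g:.4f}   {nb:.4f}")
```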

11.
Csanády L. Biophysical Journal 2006, 90(10): 3523-3545
The distributions of log-likelihood ratios (ΔLL) obtained from fitting ion-channel dwell-time distributions with nested pairs of gating models (Ξ, full model; Ξ_R, submodel) were studied both theoretically and using simulated data. When Ξ is true, ΔLL is asymptotically normally distributed with predictable mean and variance that increase linearly with data length (n). When Ξ_R is true and corresponds to a distinct point in the full parameter space, ΔLL is gamma-distributed (2ΔLL is chi-square). However, when data generated by an l-component multiexponential distribution are fitted with l+1 components, Ξ_R corresponds to an infinite set of points in parameter space. The distribution of ΔLL is then a mixture of two components, one identically zero and the other approximated by a gamma distribution. This empirical distribution of ΔLL under Ξ_R allows construction of a valid log-likelihood ratio test. The log-likelihood ratio test, the Akaike information criterion, and the Schwarz criterion all produce asymmetrical Type I and II errors and inefficiently recognize Ξ, when true, from short datasets. A new decision strategy, which considers both the parameter estimates and ΔLL, yields more symmetrical errors and a larger discrimination power for small n. These observations are explained by the distributions of ΔLL when Ξ or Ξ_R is true.
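A small simulation in the spirit of the analysis above, not the paper's code: fit dwell times with a single exponential (submodel) and a two-component exponential mixture (full model), then compare 2ΔLL with the nominal chi-square threshold. The time constants and sample size are arbitrary.

```python
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(5)
t = rng.exponential(scale=2.0, size=1000)     # data generated by the 1-component submodel

# Submodel: single exponential, closed-form MLE.
tau_hat = t.mean()
ll_sub = np.sum(stats.expon.logpdf(t, scale=tau_hat))

# Full model: two-component exponential mixture, fitted numerically.
def neg_ll_mix(params):
    w, tau1, tau2 = params
    if not (0 < w < 1 and tau1 > 0 and tau2 > 0):
        return np.inf
    pdf = w / tau1 * np.exp(-t / tau1) + (1 - w) / tau2 * np.exp(-t / tau2)
    return -np.sum(np.log(pdf))

res = optimize.minimize(neg_ll_mix, x0=[0.5, 1.0, 4.0], method="Nelder-Mead")
ll_full = -res.fun

print("2*DeltaLL =", 2 * (ll_full - ll_sub))
print("nominal chi2(2) 95% threshold =", stats.chi2.ppf(0.95, df=2))
# As the abstract explains, the nominal chi-square reference is not strictly
# valid here, because the submodel is a non-identifiable subset of the full
# parameter space; the empirical ΔLL distribution must be used instead.
```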

12.
13.
Text-mining systems are indispensable tools for reducing the increasing flux of information in the scientific literature to topics pertinent to a particular interest. Most of the scientific literature is published as unstructured free text, complicating the development of data processing tools, which rely on structured information. To overcome the problems of free-text analysis, structured, hand-curated information derived from the literature is integrated into text-mining systems to improve precision and recall. In this paper, several text-mining approaches are reviewed and the next step in the development of text-mining systems, based on a concept of multiple lines of evidence, is described: results from literature analysis are combined with evidence from experiments and genome analysis to improve the accuracy of results and to generate additional knowledge beyond what is known solely from the literature.

14.
Statistical modelling of biological survey data in relation to remotely mapped environmental variables is a powerful technique for making more effective use of sparse data in regional conservation planning. Application of such modelling to planning in the northeast New South Wales (NSW) region of Australia represents one of the most extensive and longest-running case studies of this approach anywhere in the world. Since the early 1980s, statistical modelling has been used to extrapolate distributions of over 2300 species of plants and animals, and a wide variety of higher-level communities and assemblages. These modelled distributions have played a pivotal role in a series of major land-use planning processes, culminating in extensive additions to the region's protected area system. This paper provides an overview of the analytical methodology used to model distributions of individual species in northeast NSW, including approaches to: (1) developing a basic integrated statistical and geographical information system (GIS) framework to facilitate automated fitting and extrapolation of species models; (2) extending this basic approach to incorporate consideration of spatial autocorrelation, land-cover mapping and expert knowledge; and (3) evaluating the performance of species modelling, both in terms of predictive accuracy and in terms of the effectiveness with which such models function as general surrogates for biodiversity.

15.
In phase I clinical trials, experimental drugs are administered to healthy volunteers in order to establish their safety and to explore the relationship between the dose taken and the concentration found in plasma. Each volunteer receives a series of increasing single doses. In this paper a Bayesian decision procedure is developed for choosing the doses to give in the next round of the study, taking into account both prior information and the responses observed so far. The procedure seeks the optimal doses for learning about the dose-concentration relationship, subject to a constraint which reduces the risk of administering dangerously high doses. Individual volunteers receive more than one dose, and the pharmacokinetic responses observed are, after logarithmic transformation, treated as approximately normally distributed. Thus data analysis can be achieved by fitting linear mixed models. By expressing prior information as 'pseudo-data', and by maximizing over posterior distributions rather than taking expectations, a procedure which can be implemented using standard mixed model software is derived. Comparisons are made with existing approaches to the conduct of these studies, and the new method is illustrated using real and simulated data.

16.
Species distribution models (SDMs) have been broadly used in ecology to address theoretical and practical problems. Currently, there are two main approaches to generating SDMs: (i) correlative models, which are based on species occurrences and environmental predictor layers, and (ii) process-based models, which are constructed from species' functional traits and physiological tolerances. The distributions estimated by each approach are based on different components of the species' niche: predictions of correlative models approach the species' realized niche, while predictions of process-based models are more akin to the species' fundamental niche. Here, we integrated predictions of the fundamental and realized distributions of the freshwater turtle Trachemys dorbigni. The fundamental distribution was estimated using data on T. dorbigni's egg incubation temperature, and the realized distribution was estimated using species occurrence records. Both types of distributions were estimated using the same regression approaches (logistic regression and support vector machines), each considering macroclimatic and microclimatic temperatures. The realized distribution of T. dorbigni was generally nested within its fundamental distribution, reinforcing the theoretical expectation that a species' realized niche is a subset of its fundamental niche. Both modelling algorithms produced similar results, but microtemperature gave better results than macrotemperature for the incubation model. Finally, our results reinforce the conclusion that species' realized distributions are constrained by factors other than just thermal tolerances.
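A toy correlative-SDM sketch in the spirit of the regression approaches named above: occurrence records are fitted against a temperature predictor with logistic regression and an SVM, and habitat suitability is then predicted on a temperature grid. All data, thresholds, and the single-predictor setup are synthetic assumptions, not the T. dorbigni data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

rng = np.random.default_rng(6)
temp = rng.uniform(5, 35, size=(400, 1))                  # mean temperature at sampled sites
suitable = (temp[:, 0] > 18) & (temp[:, 0] < 30)          # hypothetical thermal envelope
presence = (rng.random(400) < np.where(suitable, 0.8, 0.05)).astype(int)

logreg = LogisticRegression().fit(temp, presence)
svm = SVC(probability=True).fit(temp, presence)

grid = np.linspace(5, 35, 7).reshape(-1, 1)
print("temperature grid:    ", grid.ravel())
print("logistic suitability:", logreg.predict_proba(grid)[:, 1].round(2))
print("SVM suitability:     ", svm.predict_proba(grid)[:, 1].round(2))
```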

17.
Analytical ultracentrifugation has reemerged as a widely used tool for the study of ensembles of biological macromolecules, for example to understand their size distribution and interactions in free solution. Such information can be obtained from the mathematical analysis of the concentration and signal gradients across the solution column, and their evolution in time, generated as a result of the gravitational force. In sedimentation velocity analytical ultracentrifugation, this analysis is frequently conducted using high-resolution, diffusion-deconvoluted sedimentation coefficient distributions. They are based on Fredholm integral equations, which are ill-posed unless stabilized by regularization. In many fields, maximum entropy and Tikhonov-Phillips regularization are well-established and powerful approaches that calculate the most parsimonious distribution consistent with the data and prior knowledge, in accordance with Occam's razor. In the implementations available in analytical ultracentrifugation to date, the implied basic assumption is that all sedimentation coefficients are equally likely and that the information retrieved should be condensed to the least amount possible. Frequently, however, more detailed distributions would be warranted by specific prior knowledge about the macromolecular ensemble under study, such as the expectation that the sample is monodisperse or paucidisperse, or the expectation that migration establishes a bimodal sedimentation pattern according to Gilbert-Jenkins theory for the migration of chemically reacting systems. So far, such prior knowledge has remained largely unused in the calculation of the sedimentation coefficient or molecular weight distributions, or has been applied only as constraints. In the present paper, we examine how prior expectations can be built directly into the computational data analysis, conservatively, in a way that honors the complete information of the experimental data, whether or not it is consistent with the prior expectation. Consistent with analogous results in other fields, we find that the use of available prior knowledge can have a dramatic effect on the resulting molecular weight, sedimentation coefficient, and size-and-shape distributions and can significantly increase both their sensitivity and their resolution. Further, the use of multiple alternative priors allows us to probe the range of possible interpretations consistent with the data.
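A schematic sketch of Tikhonov-regularized inversion of a discretized first-kind Fredholm equation, b = A c, where c is the unknown distribution over sedimentation coefficients and A is the forward model. The Gaussian kernel below stands in for the Lamm-equation-based kernel, and pulling the solution toward a prior vector c_prior is just one simple way to encode a prior expectation; it is not necessarily the paper's formulation.

```python
import numpy as np

rng = np.random.default_rng(7)
s_grid = np.linspace(0.5, 10, 80)                          # grid of "sedimentation coefficients"
obs = np.linspace(0, 1, 120)                               # observation coordinate

# Stand-in forward model: each species contributes a smooth signature.
A = np.exp(-((obs[:, None] - s_grid[None, :] / 10.0) ** 2) / 0.01)

c_true = np.exp(-((s_grid - 3.0) ** 2) / 0.1) + 0.5 * np.exp(-((s_grid - 6.0) ** 2) / 0.1)
b = A @ c_true + 0.01 * rng.normal(size=obs.size)          # noisy synthetic data

def tikhonov(A, b, lam, c_prior=None):
    """Minimize ||A c - b||^2 + lam * ||c - c_prior||^2 (c_prior = 0 is standard Tikhonov)."""
    n = A.shape[1]
    if c_prior is None:
        c_prior = np.zeros(n)
    lhs = A.T @ A + lam * np.eye(n)
    rhs = A.T @ b + lam * c_prior
    return np.linalg.solve(lhs, rhs)

c_flat = tikhonov(A, b, lam=1e-2)                             # uninformative prior: all s equally likely
c_informed = tikhonov(A, b, lam=1e-2, c_prior=0.8 * c_true)   # hypothetical bimodal prior expectation
print("reconstruction error, flat prior:    ", np.linalg.norm(c_flat - c_true))
print("reconstruction error, informed prior:", np.linalg.norm(c_informed - c_true))
```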

18.
19.
We present a method for giant lipid vesicle shape analysis that combines manually guided large-scale video microscopy and computer vision algorithms to enable the analysis of vesicle populations. The method retains the benefits of light microscopy and enables non-destructive analysis of vesicles from suspensions containing up to several thousand lipid vesicles (1–50 µm in diameter). For each sample, image analysis was employed to extract data on vesicle quantity and on the size distributions of their projected diameters and isoperimetric quotients (a measure of contour roundness). This process enables comparison of samples from the same population over time, or comparison of a treated population to a control. Although vesicles in suspension are heterogeneous in size and shape and are distinctly non-homogeneously distributed throughout the suspension, this method allows the capture and analysis of repeatable vesicle samples that are representative of the population inspected.
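A small sketch of the per-vesicle shape measures described above: the projected (equal-area) diameter and the isoperimetric quotient Q = 4πA / P², which equals 1 for a circle. A synthetic binary image stands in for a segmented microscopy frame, and OpenCV is assumed as the contour library; this is not the authors' pipeline.

```python
import numpy as np
import cv2

# Synthetic frame: two filled "vesicles", one round, one elongated.
img = np.zeros((200, 300), dtype=np.uint8)
cv2.circle(img, (80, 100), 40, 255, -1)
cv2.ellipse(img, (210, 100), (60, 25), 0, 0, 360, 255, -1)

# [-2] selects the contour list in both OpenCV 3.x and 4.x return conventions.
contours = cv2.findContours(img, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)[-2]
for i, cnt in enumerate(contours):
    area = cv2.contourArea(cnt)
    perim = cv2.arcLength(cnt, True)
    diameter = 2.0 * np.sqrt(area / np.pi)      # diameter of a circle with the same projected area
    q = 4.0 * np.pi * area / perim ** 2         # isoperimetric quotient (roundness)
    print(f"vesicle {i}: projected diameter = {diameter:.1f} px, Q = {q:.3f}")
```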

20.
Many distributions have been used in flood frequency analysis (FFA) for fitting the flood extremes data. However, as shown in the paper, the scatter of Polish data plotted on the moment ratio diagram shows that there is still room for a new model. In the paper, we study the usefulness of the generalized exponential (GE) distribution in flood frequency analysis for Polish Rivers. We investigate the fit of GE distribution to the Polish data of the maximum flows in comparison with the inverse Gaussian (IG) distribution, which in our previous studies showed the best fitting among several models commonly used in FFA. Since the use of a discrimination procedure without the knowledge of its performance for the considered probability density functions may lead to erroneous conclusions, we compare the probability of correct selection for the GE and IG distributions along with the analysis of the asymptotic model error in respect to the upper quantile values. As an application, both GE and IG distributions are alternatively assumed for describing the annual peak flows for several gauging stations of Polish Rivers. To find the best fitting model, four discrimination procedures are used. In turn, they are based on the maximized logarithm of the likelihood function (K procedure), on the density function of the scale transformation maximal invariant (QK procedure), on the Kolmogorov-Smirnov statistics (KS procedure) and the fourth procedure based on the differences between the ML estimate of 1% quantile and its value assessed by the method of moments and linear moments, in sequence (R procedure). Due to the uncertainty of choosing the best model, the method of aggregation is applied to estimate of the maximum flow quantiles.  相似文献   
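A hedged sketch of the basic model comparison above: fit the generalized (exponentiated) exponential, assumed here in the Gupta-Kundu form with CDF (1 - exp(-λx))^α, and the inverse Gaussian to annual peak flows, then compare maximized log-likelihoods and Kolmogorov-Smirnov statistics (roughly the K and KS procedures). The flow data are simulated, not Polish gauging records.

```python
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(8)
flows = stats.invgauss.rvs(2.0, scale=500.0, size=80, random_state=rng)  # synthetic annual maxima

# GE log-likelihood: f(x) = alpha*lam*(1 - exp(-lam*x))**(alpha-1) * exp(-lam*x).
def ge_negloglik(params):
    alpha, lam = params
    if alpha <= 0 or lam <= 0:
        return np.inf
    z = -lam * flows
    return -np.sum(np.log(alpha) + np.log(lam) + (alpha - 1) * np.log1p(-np.exp(z)) + z)

ge_fit = optimize.minimize(ge_negloglik, x0=[1.0, 1.0 / flows.mean()], method="Nelder-Mead")
alpha_hat, lam_hat = ge_fit.x
ge_cdf = lambda x: (1.0 - np.exp(-lam_hat * x)) ** alpha_hat

ig_params = stats.invgauss.fit(flows, floc=0)              # (mu, loc, scale)

print("GE  max log-likelihood:", -ge_fit.fun,
      " KS:", stats.kstest(flows, ge_cdf).statistic)
print("IG  max log-likelihood:", np.sum(stats.invgauss.logpdf(flows, *ig_params)),
      " KS:", stats.kstest(flows, "invgauss", args=ig_params).statistic)
```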
