Similar Articles
20 similar articles found.
1.
2.
MOTIVATION: Diffusible and non-diffusible gene products play a major role in body plan formation. A quantitative understanding of the spatio-temporal patterns formed during body plan formation, obtained with simulation models, is an important addition to experimental observation. The inverse modelling approach consists of describing body plan formation with a rule-based model and fitting the model parameters to observed data. In body plan formation, the data are usually obtained from fluorescent immunohistochemistry or in situ hybridizations. Inferring model parameters by comparing such data to simulation output is a major computational bottleneck. An important aspect of this process is the choice of method used for parameter estimation. When no information on the parameters is available, parameter estimation is mostly done by means of heuristic algorithms. RESULTS: We show that parameter estimation for pattern formation models can be performed efficiently using an evolution strategy (ES). As a case study we use a quantitative spatio-temporal model of the regulatory network for early development in Drosophila melanogaster. To estimate the parameters, the simulated results are compared to a time series of gene products involved in the network, obtained with immunohistochemistry. We demonstrate that a (μ,λ)-ES can be used to find good-quality solutions in the parameter estimation. We also show that an ES with multiple populations is 5-140 times as fast as parallel simulated annealing for this case study, and that combining the ES with a local search yields an efficient parameter estimation method.
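A minimal sketch of a (μ,λ)-ES of the kind described is given below, assuming a stand-in error function in place of the Drosophila model; the population sizes and mutation strength are illustrative choices, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def sphere_error(theta):
    """Stand-in objective; in the paper this is the model-vs-data comparison."""
    return float(np.sum(theta ** 2))

def mu_lambda_es(error_fn, dim, mu=5, lam=35, sigma=0.3, generations=200):
    # Parent population of mu candidate parameter vectors.
    parents = rng.normal(0.0, 1.0, size=(mu, dim))
    for _ in range(generations):
        # Comma selection: parents are discarded each generation and only
        # the lam offspring compete for the mu parent slots.
        idx = rng.integers(0, mu, size=lam)
        offspring = parents[idx] + rng.normal(0.0, sigma, size=(lam, dim))
        errors = np.array([error_fn(o) for o in offspring])
        parents = offspring[np.argsort(errors)[:mu]]
    best = min(parents, key=error_fn)
    return best, error_fn(best)

best, err = mu_lambda_es(sphere_error, dim=10)
print(f"best error: {err:.3e}")
```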

3.

Background

An important use of data obtained from microarray measurements is the classification of tumor types with respect to genes that are either up- or down-regulated in specific cancer types. A number of algorithms have been proposed to obtain such classifications. These algorithms usually require parameter optimization to obtain accurate results, depending on the type of data. Additionally, it is highly critical to find an optimal set of markers among those up- or down-regulated genes that can be clinically utilized to build assays for the diagnosis or to follow the progression of specific cancer types. In this paper, we employ a mixed-integer-programming-based classification algorithm named the hyper-box enclosure (HBE) method for the classification of some cancer types with a minimal set of predictor genes. This optimization-based method, a user-friendly and efficient classifier, may allow clinicians to diagnose and follow the progression of certain cancer types.

Methodology/Principal Findings

We apply the HBE algorithm to some well-known data sets, such as leukemia, prostate cancer, diffuse large B-cell lymphoma (DLBCL), and small round blue cell tumors (SRBCT), to find predictor genes that can be utilized for diagnosis and prognosis in a robust manner with high accuracy. Our approach does not require any modification or parameter optimization for each data set. Additionally, the information gain attribute evaluator, relief attribute evaluator, and correlation-based feature selection methods are employed for gene selection. The results are compared with those from other studies, and the biological roles of the selected genes in the corresponding cancer types are described.

Conclusions/Significance

Overall, the performance of our algorithm was better than that of the other algorithms reported in the literature and the classifiers found in the WEKA data-mining package. Since it requires no parameter optimization and performs consistently with a very high prediction rate on different types of data sets, the HBE method is an effective and consistent tool for cancer type prediction with a small number of gene markers.
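As a hedged illustration of the enclosure idea (not the paper's mixed-integer-programming formulation), the sketch below fits one axis-aligned box per class and assigns test points to the nearest box; the class structure and data are assumptions.

```python
import numpy as np

class OneBoxPerClass:
    """Simplified hyper-box classifier: one bounding box per class.
    The published HBE method solves a mixed-integer program; this greedy
    approximation only illustrates the enclosure idea."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        # Per-class bounding box: per-feature minima and maxima.
        self.lo_ = {c: X[y == c].min(axis=0) for c in self.classes_}
        self.hi_ = {c: X[y == c].max(axis=0) for c in self.classes_}
        return self

    def predict(self, X):
        preds = []
        for x in X:
            # Distance is 0 if x lies inside a class box; otherwise the
            # Euclidean distance to the box surface. Ties go to the first
            # class, so overlapping boxes are resolved arbitrarily here.
            dists = []
            for c in self.classes_:
                d = np.maximum(self.lo_[c] - x, 0) + np.maximum(x - self.hi_[c], 0)
                dists.append(np.linalg.norm(d))
            preds.append(self.classes_[int(np.argmin(dists))])
        return np.array(preds)
```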

4.
We consider the problem of using time-series data to inform a corresponding deterministic model and introduce genetic algorithms (GAs) as a tool for parameter estimation, providing instructions for an implementation of the method that does not require access to special toolboxes or software. As an example we give a model for cholera, a disease for which there is much mechanistic uncertainty in the literature. We use a GA to find parameter sets using available time-series data from the introduction of cholera in Haiti, and we discuss the value of comparing multiple parameter sets with similar performance in describing the data.
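The sketch below illustrates such a toolbox-free GA on a synthetic SIR-type incidence curve; the model, parameter bounds, and GA settings are placeholder assumptions, not the authors' cholera model or the Haiti data.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_sir(beta, gamma, days=100, n=1e6, i0=10):
    """Daily new infections from a simple Euler-stepped SIR model (dt = 1 day)."""
    s, i = n - i0, i0
    cases = []
    for _ in range(days):
        new_inf = beta * s * i / n
        s -= new_inf
        i += new_inf - gamma * i
        cases.append(new_inf)
    return np.array(cases)

observed = simulate_sir(0.4, 0.2) + rng.normal(0, 5, 100)   # synthetic "data"

def fitness(p):
    return -np.mean((simulate_sir(*p) - observed) ** 2)      # maximize

bounds = np.array([[0.05, 1.0], [0.05, 1.0]])                # beta, gamma
pop = rng.uniform(bounds[:, 0], bounds[:, 1], size=(50, 2))
for _ in range(100):
    scores = np.array([fitness(p) for p in pop])
    parents = pop[np.argsort(scores)[-25:]]                  # truncation selection
    mates = parents[rng.integers(0, 25, size=(50, 2))]
    pop = mates.mean(axis=1)                                 # blend crossover
    pop += rng.normal(0, 0.02, pop.shape)                    # mutation
    pop = np.clip(pop, bounds[:, 0], bounds[:, 1])
best = max(pop, key=fitness)
print("estimated beta, gamma:", best)
```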

5.
Effects of censoring on parameter estimates and power in genetic modeling. (Cited 5 times: 0 self-citations, 5 citations by others)
Genetic and environmental influences on variance in phenotypic traits may be estimated with normal-theory maximum likelihood (ML). However, when the assumption of multivariate normality is not met, this method may result in biased parameter estimates and incorrect likelihood-ratio tests. We simulated multivariate normally distributed twin data under the assumption of three different genetic models. Genetic model fitting was performed in six data sets: multivariate normal data, discrete uncensored data, censored data, square-root-transformed censored data, normal scores of censored data, and categorical data. Estimates were obtained with normal-theory ML (data sets 1-5) and with categorical data analysis (data set 6). Statistical power was examined by fitting reduced models to the data. When fitting an ACE model to censored data, an unbiased estimate of the additive genetic effect was obtained. However, the common environmental effect was underestimated and the unique environmental effect was overestimated. Transformations did not remove this bias. When fitting an ADE model, the additive genetic effect was underestimated while the dominance and unique environmental effects were overestimated. In all models, the correct parameter estimates were recovered with categorical data analysis. However, with categorical data analysis the statistical power decreased. Analysis of L-shaped distributed data with normal-theory ML thus results in biased parameter estimates; unbiased parameter estimates are obtained with categorical data analysis, but at the cost of power.
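A quick numeric illustration of why censoring is problematic for normal-theory estimation: naively estimating the variance of a trait censored at zero (an L-shaped distribution) is visibly biased. This is only a one-variable analogue of the twin-model bias described above, with made-up data.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(0.0, 1.0, 100_000)
censored = np.maximum(x, 0.0)   # values below 0 pile up at the censoring point

print("true variance:     1.000")
print(f"censored variance: {censored.var():.3f}")   # noticeably biased downward
```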

6.
Modern 3D electron microscopy approaches have recently allowed unprecedented insight into the 3D ultrastructural organization of cells and tissues, enabling the visualization of large macromolecular machines, such as adhesion complexes, as well as higher-order structures, such as the cytoskeleton and cellular organelles, in their respective cell and tissue context. Given the inherent complexity of cellular volumes, it is essential to first extract the features of interest in order to allow visualization, quantification, and therefore comprehension of their 3D organization. Each data set is defined by distinct characteristics, e.g., signal-to-noise ratio, crispness (sharpness) of the data, heterogeneity of its features, crowdedness of features, presence or absence of characteristic shapes that allow for easy identification, and the percentage of the entire volume that a specific region of interest occupies. All these characteristics need to be considered when deciding on which approach to take for segmentation. The six different 3D ultrastructural data sets presented were obtained by three different imaging approaches: electron tomography of resin-embedded stained samples, and focused ion beam and serial block-face scanning electron microscopy (FIB-SEM, SBF-SEM) of mildly and heavily stained samples, respectively. For these data sets, four different segmentation approaches have been applied: (1) fully manual model building followed solely by visualization of the model, (2) manual tracing segmentation of the data followed by surface rendering, (3) semi-automated approaches followed by surface rendering, or (4) automated custom-designed segmentation algorithms followed by surface rendering and quantitative analysis. Depending on the combination of data set characteristics, it was found that typically one of these four categorical approaches outperforms the others, but depending on the exact sequence of criteria, more than one approach may be successful. Based on these data, we propose a triage scheme that categorizes both objective data set characteristics and subjective personal criteria for the analysis of the different data sets.

7.
The increase in complexity of computational neuron models makes hand tuning of model parameters more difficult than ever. Fortunately, the parallel increase in computer power allows scientists to automate this tuning. Optimization algorithms need two essential components. The first is a function that measures the difference between the output of the model with a given set of parameters and the data. This error function, or fitness function, makes it possible to rank different parameter sets. The second component is a search algorithm that explores the parameter space to find the best parameter set in a minimal amount of time. In this review we distinguish three types of error functions: feature-based functions, point-by-point comparison of voltage traces, and multi-objective functions. We then detail several popular search algorithms, including brute-force methods, simulated annealing, genetic algorithms, evolution strategies, differential evolution, and particle-swarm optimization. Finally, we briefly describe Neurofitter, a free software package that combines a phase-plane trajectory density fitness function with several search algorithms.
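The sketch below shows minimal versions of the first two error-function types: a point-by-point voltage-trace comparison and a feature-based comparison. The spike threshold and feature weights are illustrative assumptions.

```python
import numpy as np

def pointwise_error(v_model, v_data):
    """Mean squared difference between two equally sampled voltage traces."""
    return float(np.mean((np.asarray(v_model) - np.asarray(v_data)) ** 2))

def extract_features(v, threshold=-20.0):
    """Toy feature vector: spike count and minimum (resting) potential."""
    v = np.asarray(v)
    # Upward threshold crossings approximate the spike count.
    spikes = int(np.sum((v[:-1] < threshold) & (v[1:] >= threshold)))
    return np.array([spikes, v.min()])

def feature_error(v_model, v_data, weights=(1.0, 0.1)):
    diff = extract_features(v_model) - extract_features(v_data)
    return float(np.sum(np.asarray(weights) * diff ** 2))
```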

8.
How best to summarize large and complex datasets is a problem that arises in many areas of science. We approach it from the point of view of seeking data summaries that minimize the average squared error of the posterior distribution for a parameter of interest under approximate Bayesian computation (ABC). In ABC, simulation under the model replaces computation of the likelihood, which is convenient for many complex models. Simulated and observed datasets are usually compared using summary statistics, typically chosen in practice on the basis of the investigator's intuition and established practice in the field. We propose two algorithms for automated choice of efficient data summaries. First, we motivate minimization of the estimated entropy of the posterior approximation as a heuristic for the selection of summary statistics. Second, we propose a two-stage procedure: the minimum-entropy algorithm is used to identify simulated datasets close to that observed, and these are each successively regarded as observed datasets for which the mean root integrated squared error of the ABC posterior approximation is minimized over sets of summary statistics. In a simulation study, we both singly and jointly inferred the scaled mutation and recombination parameters from a population sample of DNA sequences. The computationally fast minimum-entropy algorithm showed a modest improvement over existing methods, while our two-stage procedure showed substantial and highly significant further improvement for both univariate and bivariate inferences. We found that the optimal set of summary statistics was highly dataset-specific, suggesting that more generally there may be no globally optimal choice, which argues for a new selection for each dataset even if the model and target of inference are unchanged.
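For orientation, a minimal rejection-ABC sketch follows, with a fixed summary pair standing in for the automatically selected statistics; the simulator, prior, and acceptance quantile are placeholder assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

def simulate(theta, n=200):
    return rng.normal(theta, 1.0, n)             # stand-in simulator

def summaries(data):
    return np.array([data.mean(), data.var()])   # one candidate summary set

observed = simulate(2.5)
s_obs = summaries(observed)

# Rejection ABC: keep prior draws whose simulated summaries land near s_obs.
draws = rng.uniform(0.0, 5.0, 20_000)            # prior samples for theta
dist = np.array([np.linalg.norm(summaries(simulate(t)) - s_obs) for t in draws])
posterior = draws[dist < np.quantile(dist, 0.01)]   # accept the closest 1%
print(f"ABC posterior mean: {posterior.mean():.2f} (true 2.5)")
```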

9.
Modern data-rich analyses may call for fitting a large number of nonparametric quantile regressions. For example, growth charts may be constructed for each of a collection of variables, to identify those for which individuals with a disorder tend to fall in the tails of their age-specific distribution; such variables might serve as developmental biomarkers. When such a large set of analyses is carried out by penalized spline smoothing, reliable automatic selection of the smoothing parameter is particularly important. We show that two popular methods for smoothness selection may tend to overfit when estimating extreme quantiles as a smooth function of a predictor such as age, and that improved results can be obtained by multifold cross-validation or by a novel likelihood approach. A simulation study, and an application to a functional magnetic resonance imaging data set, demonstrate the favorable performance of our methods.
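A sketch of multifold cross-validation for a quantile-smoothing parameter follows, scored with the pinball (check) loss; a kernel-weighted quantile stands in for the paper's penalized splines, and the bandwidth grid, quantile level, and toy data are assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

def pinball(residual, tau):
    """Check loss for quantile tau; residual = y - prediction."""
    return np.mean(np.where(residual >= 0, tau * residual, (tau - 1) * residual))

def kernel_quantile(x_train, y_train, x_eval, bandwidth, tau):
    """Gaussian-kernel-weighted empirical quantile as a simple smoother."""
    order = np.argsort(y_train)
    preds = []
    for x0 in x_eval:
        w = np.exp(-0.5 * ((x_train - x0) / bandwidth) ** 2)
        cw = np.cumsum(w[order]) / w.sum()
        preds.append(y_train[order][np.searchsorted(cw, tau)])
    return np.array(preds)

x = rng.uniform(0, 10, 300)
y = np.sin(x) + rng.normal(0, 0.3 + 0.05 * x, 300)   # heteroscedastic toy data
tau, folds = 0.95, 5
idx = rng.permutation(len(x)) % folds                # balanced fold labels
for bw in (0.1, 0.5, 1.0, 2.0):
    loss = 0.0
    for k in range(folds):
        tr, te = idx != k, idx == k
        pred = kernel_quantile(x[tr], y[tr], x[te], bw, tau)
        loss += pinball(y[te] - pred, tau) * te.sum()
    print(f"bandwidth {bw}: CV pinball loss {loss / len(x):.4f}")
```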

10.
Fitting parameter sets of non-linear equations in cardiac single-cell ionic models to reproduce experimental behavior is a time-consuming process. The standard procedure is to adjust maximum channel conductances in ionic models to reproduce action potentials (APs) recorded in isolated cells. However, vastly different sets of parameters can produce similar APs. Furthermore, even with an excellent AP match for a single cell, tissue behaviour may be very different. We hypothesize that this uncertainty can be reduced by additionally fitting membrane resistance (Rm). To investigate the importance of Rm, we developed a genetic algorithm approach that incorporated Rm data calculated at a few points in the cycle, in addition to AP morphology. Performance was compared to a genetic algorithm using only AP morphology data. The optimal parameter sets and goodness of fit as computed by the different methods were compared. First, we fit an ionic model to itself, starting from a random parameter set. Next, we fit the AP of one ionic model to that of another. Finally, we fit an ionic model to experimentally recorded rabbit action potentials. Adding the extra objective (Rm at a few voltages) to the AP fit led to much better convergence. Typically, a smaller MSE (mean square error, defined as the average of the squared error between the target AP and the fitted AP) was achieved in one fifth of the number of generations compared to using only AP data. Importantly, the variability in fitted parameters was also greatly reduced, with many parameters showing an order-of-magnitude decrease in variability. Adding Rm to the objective function improves the robustness of fitting and better preserves tissue-level behavior, and should be incorporated.
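A minimal sketch of the combined objective, assuming a simple weighted sum of the AP-morphology error and the Rm error at a few sampled points; the weight w_rm is an illustrative assumption, not the paper's setting.

```python
import numpy as np

def combined_error(ap_model, ap_target, rm_model, rm_target, w_rm=1.0):
    """ap_*: equally sampled voltage traces; rm_*: Rm values at a few
    points in the cycle. A GA would use -combined_error as fitness."""
    mse_ap = np.mean((np.asarray(ap_model) - np.asarray(ap_target)) ** 2)
    mse_rm = np.mean((np.asarray(rm_model) - np.asarray(rm_target)) ** 2)
    return float(mse_ap + w_rm * mse_rm)
```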

11.
Mathematical modeling and simulation studies are playing an increasingly important role in helping researchers elucidate how living organisms function at the cellular level. In systems biology, researchers typically tune many parameters manually to achieve simulation results that are consistent with biological knowledge. This severely limits the size and complexity of the simulation models that can be built. To break this limitation, we propose a computational framework to automatically estimate kinetic parameters for a given network structure. We utilized an online (on-the-fly) model-checking technique (which saves resources compared to the offline approach), with a quantitative modeling and simulation architecture named hybrid functional Petri net with extension (HFPNe). We demonstrate the applicability of this framework by analysis of the underlying model for the neuronal cell fate decision model (ASE fate model) in Caenorhabditis elegans. First, we built a quantitative ASE fate model containing 3327 components emulating nine genetic conditions. Then, using our efficient online model checker, MIRACH 1.0, together with parameter estimation, we performed 20 million simulation runs and were able to locate 57 parameter sets for the 23 parameters in the model that are consistent with 45 biological rules extracted from published biological articles, without much manual intervention. To evaluate the robustness of these 57 parameter sets, we ran another 20 million simulation runs using different magnitudes of noise. Our simulation results indicated that, among these, one model is the most reasonable and robust owing to its high stability against stochastic noise. Our simulation results provide interesting biological findings that could be used for future wet-lab experiments.

12.

Background  

Accurate methods for extraction of meaningful patterns in high-dimensional data have become increasingly important with the recent generation of data types containing measurements across thousands of variables. Principal components analysis (PCA) is a linear dimensionality reduction (DR) method that is unsupervised in that it relies only on the data; projections are calculated in Euclidean or a similar linear space and do not use tuning parameters for optimizing the fit to the data. However, relationships within sets of nonlinear data types, such as biological networks or images, are frequently mis-rendered into a low-dimensional space by linear methods. Nonlinear methods, in contrast, attempt to model important aspects of the underlying data structure, often requiring parameters to be fitted to the data type of interest. In many cases, the optimal parameter values vary when different classification algorithms are applied to the same rendered subspace, making the results of such methods highly dependent upon the type of classifier implemented.
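For contrast with the tuned nonlinear methods, a minimal PCA sketch via the singular value decomposition follows; note the absence of any tuning parameter beyond the number of retained components.

```python
import numpy as np

def pca(X, n_components=2):
    """Project centred data onto its top principal axes via the SVD."""
    Xc = X - X.mean(axis=0)                      # centre each variable
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T              # scores in the reduced space
```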

13.
Model-based analysis of fMRI data is an important tool for investigating the computational role of different brain regions. With this method, theoretical models of behavior can be leveraged to find the brain structures underlying variables from specific algorithms, such as prediction errors in reinforcement learning. One potential weakness of this approach is that models often have free parameters, and thus the results of the analysis may depend on how these free parameters are set. In this work we asked whether this hypothetical weakness is a problem in practice. We first developed general closed-form expressions for the relationship between results of fMRI analyses using different regressors, e.g., one corresponding to the true process underlying the measured data and one a model-derived approximation of the true generative regressor. Then, as a specific test case, we examined the sensitivity of model-based fMRI to the learning-rate parameter in reinforcement learning, both in theory and in two previously published datasets. We found that even gross errors in the learning rate lead to only minute changes in the neural results. Our findings thus suggest that precise model fitting is not always necessary for model-based fMRI. They also highlight the difficulty in using fMRI data to arbitrate between different models or model parameters. While these specific results pertain only to the effect of the learning rate in simple reinforcement learning models, we provide a template for testing for effects of different parameters in other models.
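The sketch below illustrates the intuition behind the insensitivity result: Rescorla-Wagner prediction-error regressors generated with quite different learning rates are typically highly correlated, so the regression results barely change. The reward schedule and learning rates are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)

def prediction_errors(rewards, alpha):
    """Rescorla-Wagner prediction errors for a scalar value estimate."""
    v, pes = 0.0, []
    for r in rewards:
        pes.append(r - v)          # delta_t = r_t - V_t
        v += alpha * (r - v)       # V_{t+1} = V_t + alpha * delta_t
    return np.array(pes)

rewards = rng.binomial(1, 0.7, 500).astype(float)
pe_true = prediction_errors(rewards, alpha=0.30)
pe_wrong = prediction_errors(rewards, alpha=0.10)   # grossly wrong alpha
print(f"regressor correlation: {np.corrcoef(pe_true, pe_wrong)[0, 1]:.3f}")
```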

14.
Phylogenetic trees from multiple genes can be obtained in two fundamentally different ways. In one, gene sequences are concatenated into a super-gene alignment, which is then analyzed to generate the species tree. In the other, phylogenies are inferred separately from each gene, and a consensus of these gene phylogenies is used to represent the species tree. Here, we have compared these two approaches by means of computer simulation, using 448 parameter sets that varied in evolutionary rate, sequence length, base composition, and transition/transversion rate bias. In these simulations, we emphasized a worst-case scenario analysis in which 100 replicate datasets were generated for each evolutionary parameter set (gene), and the replicate dataset that produced a tree topology showing the largest number of phylogenetic errors was selected to represent that parameter set. Both randomly selected and worst-case replicates were utilized to compare the consensus and concatenation approaches, primarily using the neighbor-joining (NJ) method. We find that the concatenation approach yields more accurate trees, even when the concatenated sequences have evolved with very different substitution patterns and no attempt is made to accommodate these differences while inferring phylogenies. These results appear to hold true for parsimony and likelihood methods as well. The concatenation approach shows >95% accuracy with only 10 genes. However, this gain in accuracy is sometimes accompanied by reinforcement of certain systematic biases, resulting in spuriously high bootstrap support for incorrect partitions, whether we employ site, gene, or combined bootstrap resampling. Therefore, it will be prudent to report the number of individual genes supporting an inferred clade in the concatenated sequence tree, in addition to the bootstrap support.

15.
We apply a simple method for aligning protein sequences on the basis of 3D structure, on a large scale, to the proteins in the scop classification of fold families. This allows us to assess, understand, and improve our automatic method against an objective, manually derived standard, a type of comprehensive evaluation that has not yet been possible for other structural alignment algorithms. Our basic approach directly matches the backbones of two structures, using repeated cycles of dynamic programming and least-squares fitting to determine an alignment minimizing coordinate difference. Because of its simplicity, our method can be readily modified to take into account additional features of protein structure, such as the orientation of side chains or the location-dependent cost of opening a gap. Our basic method, augmented by such modifications, can find reasonable alignments for all but 1.5% of the known structural similarities in scop, i.e., all but 32 of the 2,107 superfamily pairs. We discuss the specific protein structural features that make these 32 pairs so difficult to align and show how our procedure effectively partitions the relationships in scop into different categories, depending on what aspects of protein structure are involved (e.g., depending on whether or not consideration of side-chain orientation is necessary for proper alignment). We also show how our pairwise alignment procedure can be extended to generate a multiple alignment for a group of related structures. We have compared these alignments in detail with corresponding manual ones culled from the literature. We find good agreement (to within 95% for the core regions), and detailed comparison highlights how particular protein structural features (such as certain strands) are problematic to align, giving somewhat ambiguous results. With these improvements and systematic tests, our procedure should be useful for the development of scop and the future classification of protein folds.
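A sketch of the least-squares fitting step (the Kabsch superposition used inside each refinement cycle) follows; the dynamic-programming step that pairs the residues is omitted, and the input is assumed to be already-paired coordinates.

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD between two (n, 3) coordinate sets with paired rows, after
    optimal rigid superposition (Kabsch algorithm)."""
    P = P - P.mean(axis=0)                   # remove translation
    Q = Q - Q.mean(axis=0)
    U, S, Vt = np.linalg.svd(P.T @ Q)        # SVD of the covariance matrix
    d = np.sign(np.linalg.det(U @ Vt))       # guard against reflections
    R = U @ np.diag([1.0, 1.0, d]) @ Vt      # optimal rotation
    return float(np.sqrt(np.mean(np.sum((P @ R - Q) ** 2, axis=1))))
```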

16.
Beyer J, May B. Molecular Ecology, 2003, 12(8): 2243-2250.
We present an algorithm to partition a single generation of individuals into full-sib families using single-locus co-dominant marker data. Pairwise likelihood ratios are used to create a graph that represents the full-sib relationships within the data set. Connected-component and minimum-cut algorithms from graph theory are then employed to find the full-sib families within the graph. The results of a large-scale simulation study show that the algorithm is able to produce accurate partitions when applied to data sets with eight or more loci. Although the algorithm performs best when the distribution of allele frequencies and family sizes in a data set is uniform, the inclusion of more loci or more alleles per locus allows accurate partitions to be created from data sets in which these distributions are highly skewed.
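A sketch of the connected-component step follows, assuming the pairwise likelihood-ratio test has already been thresholded into an edge list; the minimum-cut refinement of oversized components is omitted.

```python
def full_sib_components(n, edges):
    """n individuals; edges = iterable of (i, j) pairs passing the LR test.
    Returns candidate full-sib families as connected components."""
    adj = {i: set() for i in range(n)}
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)
    seen, families = set(), []
    for start in range(n):
        if start in seen:
            continue
        stack, comp = [start], set()       # depth-first search
        while stack:
            v = stack.pop()
            if v in comp:
                continue
            comp.add(v)
            stack.extend(adj[v] - comp)
        seen |= comp
        families.append(sorted(comp))
    return families

print(full_sib_components(6, [(0, 1), (1, 2), (4, 5)]))  # [[0, 1, 2], [3], [4, 5]]
```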

17.
In this paper, we present algorithms to find near-optimal sets of epidemic spreaders in complex networks. We extend the notion of local centrality, a centrality measure previously shown to correspond with a node's ability to spread an epidemic, to sets of nodes by introducing combinatorial local centrality. Though we prove that finding a set of nodes that maximizes this new measure is NP-hard, good approximations are available. We show that a strictly greedy approach obtains the best approximation ratio unless P = NP, and we then formulate a modified version of this approach that leverages qualities of the network to achieve a faster runtime while maintaining this theoretical guarantee. We perform an experimental evaluation on samples from several different network structures, demonstrating that our algorithm maximizes combinatorial local centrality and consistently chooses the most effective set of nodes to spread infection under the SIR model, relative to selecting the top nodes using many common centrality measures. We also demonstrate that the optimized algorithm we develop scales effectively.
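A generic sketch of the strictly greedy approach follows, with an abstract set-score function standing in for combinatorial local centrality; the toy adjacency dictionary and neighbour-coverage score are assumptions.

```python
def greedy_spreaders(nodes, score, k):
    """Greedily build a set of size k, adding whichever node gives the
    largest marginal gain in the set-centrality score."""
    chosen = set()
    for _ in range(k):
        best = max((n for n in nodes if n not in chosen),
                   key=lambda n: score(chosen | {n}))
        chosen.add(best)
    return chosen

# Toy score: number of nodes covered by the set and its neighbours.
adj = {0: {1, 2}, 1: {0, 3}, 2: {0}, 3: {1, 4}, 4: {3, 5}, 5: {4}}
cover = lambda s: len(s | set().union(*(adj[n] for n in s))) if s else 0
print(greedy_spreaders(adj, cover, k=2))   # e.g. {0, 4}
```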

18.
Adaptive quality-based clustering of gene expression profiles (Cited 17 times: 0 self-citations, 17 citations by others)
MOTIVATION: Microarray experiments generate a considerable amount of data which, when analyzed properly, can yield a wealth of biologically relevant information about global cellular behaviour. Clustering (grouping genes with similar expression profiles) is one of the first steps in the data analysis of high-throughput expression measurements. A number of clustering algorithms have proved useful for making sense of such data. These classical algorithms, though useful, suffer from several drawbacks (e.g. they require the predefinition of arbitrary parameters such as the number of clusters, and they force every gene into a cluster despite a low correlation with other cluster members). In the following we describe a novel adaptive quality-based clustering algorithm that tackles some of these drawbacks. RESULTS: We propose a heuristic, iterative two-step algorithm: first, we find in the high-dimensional representation of the data a sphere where the "density" of expression profiles is locally maximal, based on a preliminary estimate of the cluster radius (the quality-based step). In a second step, we derive an optimal radius of the cluster (the adaptive step) so that only the significantly coexpressed genes are included in the cluster. This estimation is achieved by fitting a model to the data using an EM algorithm. By inferring the radius from the data itself, the biologist is freed from finding an optimal value for this radius by trial and error. The computational complexity of this method is approximately linear in the number of gene expression profiles in the data set. Finally, our method is successfully validated using existing data sets. AVAILABILITY: http://www.esat.kuleuven.ac.be/~thijs/Work/Clustering.html
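A sketch of the first (quality-based) step follows: locating the expression profile whose fixed-radius neighbourhood is densest. The EM-based radius adaptation of the second step is omitted, and unlike the paper's approximately linear algorithm this brute-force version is quadratic in the number of profiles.

```python
import numpy as np

def densest_sphere(X, radius):
    """Return the profile whose radius-neighbourhood contains the most
    profiles, together with that count (brute-force O(n^2) sketch)."""
    X = np.asarray(X)
    counts = np.array([(np.linalg.norm(X - x, axis=1) <= radius).sum() for x in X])
    return X[int(np.argmax(counts))], int(counts.max())
```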

19.
Mathematical models in epidemiology are an indispensable tool for determining the dynamics and important characteristics of infectious diseases. Apart from their scientific merit, these models are often used to inform political decisions and interventional measures during an ongoing outbreak. However, reliably inferring the epidemic dynamics by connecting complex models to real data is still hard and requires either laborious manual parameter fitting or expensive optimization methods which have to be repeated from scratch for every application of a given model. In this work, we address this problem with a novel combination of epidemiological modeling and specialized neural networks. Our approach entails two computational phases: in an initial training phase, a mathematical model describing the epidemic is used as a coach for a neural network, which acquires global knowledge about the full range of possible disease dynamics. In the subsequent inference phase, the trained neural network processes the observed data of an actual outbreak and infers the parameters of the model in order to realistically reproduce the observed dynamics and reliably predict future progression. With its flexible framework, our simulation-based approach is applicable to a variety of epidemiological models. Moreover, since our method is fully Bayesian, it is designed to incorporate all available prior knowledge about plausible parameter values and returns complete joint posterior distributions over these parameters. Application of our method to the early COVID-19 outbreak phase in Germany demonstrates that we are able to obtain reliable probabilistic estimates for important disease characteristics, such as generation time, fraction of undetected infections, likelihood of transmission before symptom onset, and reporting delays, using a very moderate amount of real-world observations.
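The sketch below captures the two-phase idea with a point-estimating MLP standing in for the paper's fully Bayesian invertible networks; the SIR-type simulator, prior range, and noise model are placeholder assumptions, and no posterior distribution is produced.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(6)

def simulate_epidemic(beta, days=60, gamma=0.15, n=1e6, i0=50):
    """Daily new infections from a toy Euler-stepped SIR model."""
    s, i, curve = n - i0, i0, []
    for _ in range(days):
        new = beta * s * i / n
        s, i = s - new, i + new - gamma * i
        curve.append(new)
    return np.array(curve)

# Training phase: the mechanistic model "coaches" the network.
thetas = rng.uniform(0.1, 0.8, 3000)                       # prior draws of beta
curves = np.array([simulate_epidemic(b) for b in thetas])
net = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500, random_state=0)
net.fit(np.log1p(curves), thetas)

# Inference phase: apply the trained network to an "observed" outbreak.
obs = simulate_epidemic(0.35) * rng.lognormal(0, 0.05, 60)  # noisy observation
print(f"inferred beta: {net.predict(np.log1p(obs)[None, :])[0]:.3f} (true 0.35)")
```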

20.
Aim: Trait-based risk assessment for invasive species is becoming an important tool for identifying non-indigenous species that are likely to cause harm. Despite this, concerns remain that the invasion process is too complex for accurate predictions to be made. Our goal was to test risk assessment performance across a range of taxonomic and geographical scales, at different points in the invasion process, with a range of statistical and machine learning algorithms. Location: Regional to global data sets. Methods: We selected six data sets differing in size, geography, and taxonomic scope. For each data set, we created seven risk assessment tools using a range of statistical and machine learning algorithms. Performance of the tools was compared to determine the effects of data set size and scale and of the algorithm used, and to assess the overall performance of the trait-based risk assessment approach. Results: Risk assessment tools with good performance were generated for all data sets. Random forests (RF) and logistic regression (LR) consistently produced tools with high performance. Other algorithms had varied performance. Despite their greater power and flexibility, machine learning algorithms did not systematically outperform statistical algorithms. Neither the geographic scope nor the size of the data set systematically affected risk assessment performance. Main conclusions: Across six representative data sets, we were able to create risk assessment tools with high performance. Additional data sets could be generated for other taxonomic groups and regions, and these could support efforts to prevent the arrival of new invaders. RF and LR approaches performed well for all data sets and could be used as a standard approach to risk assessment development.
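A sketch of the core model comparison follows, fitting RF and LR risk-assessment tools and comparing AUC; the synthetic feature table stands in for a real trait-by-species data set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a trait table: rows = species, columns = traits,
# label = invasive (1) or not (0).
X, y = make_classification(n_samples=500, n_features=12, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for name, model in [("RF", RandomForestClassifier(random_state=0)),
                    ("LR", LogisticRegression(max_iter=1000))]:
    model.fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"{name}: AUC = {auc:.3f}")
```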
