Similar literature
20 similar articles found
1.
Recent advances in big data and analytics research have provided a wealth of large data sets that are too big to be analyzed in their entirety, due to restrictions on computer memory or storage size. New Bayesian methods have been developed for data sets that are large only due to large sample sizes. These methods partition big data sets into subsets and perform independent Bayesian Markov chain Monte Carlo analyses on the subsets. The methods then combine the independent subset posterior samples to estimate a posterior density given the full data set. These approaches were shown to be effective for Bayesian models including logistic regression models, Gaussian mixture models and hierarchical models. Here, we introduce the R package parallelMCMCcombine which carries out four of these techniques for combining independent subset posterior samples. We illustrate each of the methods using a Bayesian logistic regression model for simulation data and a Bayesian Gamma model for real data; we also demonstrate features and capabilities of the R package. The package assumes the user has carried out the Bayesian analysis and has produced the independent subposterior samples outside of the package. The methods are primarily suited to models with unknown parameters of fixed dimension that exist in continuous parameter spaces. We envision this tool will allow researchers to explore the various methods for their specific applications and will assist future progress in this rapidly developing field.
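
The combining step described in this abstract can be pictured with a minimal sketch. It assumes the simplest possible rule, averaging the t-th draw across subsets, and is not presented as the package's implementation or as any particular one of its four methods. The function name `combine_by_averaging` and the toy numbers are hypothetical.

```python
from statistics import mean


def combine_by_averaging(subposteriors):
    """Combine independent subset posterior draws by plain averaging.

    subposteriors: list of M lists, each holding draws of one scalar parameter
    obtained from a separate MCMC run on one data subset.
    Returns a list of combined draws; draw t is the mean of draw t over subsets.
    """
    # Use only as many draws as the shortest chain provides.
    n_draws = min(len(chain) for chain in subposteriors)
    return [mean(chain[t] for chain in subposteriors) for t in range(n_draws)]


# Toy input (made-up numbers, not taken from the paper).
subsets = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
combined = combine_by_averaging(subsets)
```

Plain averaging weights every subset equally; a weighted rule could be substituted by replacing `mean` with a weighted average, which is left out here to keep the sketch short.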

2.
Increasing concern over the implications of climate change for biodiversity has led to the use of species–climate envelope models to project species extinction risk under climate‐change scenarios. However, recent studies have demonstrated significant variability in model predictions and there remains a pressing need to validate models and to reduce uncertainties. Model validation is problematic as predictions are made for events that have not yet occurred. Resubstitution and data partitioning of present‐day data sets are, therefore, commonly used to test the predictive performance of models. However, these approaches suffer from the problems of spatial and temporal autocorrelation in the calibration and validation sets. Using observed distribution shifts among 116 British breeding‐bird species over the past ~20 years, we are able to provide a first independent validation of four envelope modelling techniques under climate change. Results showed good to fair predictive performance on independent validation, although rules used to assess model performance are difficult to interpret in a decision‐planning context. We also showed that measures of performance on nonindependent data provided optimistic estimates of models' predictive ability on independent data. Artificial neural networks and generalized additive models provided generally more accurate predictions of species range shifts than generalized linear models or classification tree analysis. Data for independent model validation and replication of this study are rare and we argue that perfect validation may not in fact be conceptually possible. We also note that usefulness of models is contingent on both the questions being asked and the techniques used. Implementations of species–climate envelope models for testing hypotheses and predicting future events may prove wrong, while being potentially useful if put into appropriate context.

3.
Two examples in quantitative biology are examined to emphasize the need for two-phase regression models: the osmotic behaviour of cells and the non-linear temperature kinetics of membrane-bound enzyme systems. Existing statistical techniques are inadequate to test the equality of break-points of two data sets for specific reasons. We suggest here a pragmatic solution by way of a computer programme useful in applying two-phase regression models to such data sets wherein a decision needs to be made whether the critical transition differs or not.

4.
1. Matrix population models are widely used to describe population dynamics, conduct population viability analyses and derive management recommendations for plant populations. For endangered or invasive species, management decisions are often based on small demographic data sets. Hence, there is a need for population models which accurately assess population performance from such small data sets.
2. We used demographic data on two perennial herbs with different life histories to compare the accuracy and precision of the traditional matrix population model and the recently developed integral projection model (IPM) in relation to the amount of data.
3. For large data sets both matrix models and IPMs produced identical estimates of population growth rate (λ). However, for small data sets containing fewer than 300 individuals, IPMs often produced smaller bias and variance for λ than matrix models despite different matrix structures and sampling techniques used to construct the matrix population models.
4. Synthesis and applications. Our results suggest that the smaller bias and variance of λ estimates make IPMs preferable to matrix population models for small demographic data sets with a few hundred individuals. These results are likely to be applicable to a wide range of herbaceous, perennial plant species where demographic fate can be modelled as a function of a continuous state variable such as size. We recommend the use of IPMs to assess population performance and management strategies particularly for endangered or invasive perennial herbs where little demographic data are available.
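
For readers unfamiliar with how a matrix population model yields the population growth rate λ, here is a minimal sketch. It assumes the standard reading that λ is the dominant eigenvalue of the stage-projection matrix and estimates it by power iteration. The function name `growth_rate` and the two-stage matrix are hypothetical and not taken from the study.

```python
def growth_rate(matrix, iterations=1000, tol=1e-12):
    """Estimate lambda, the dominant eigenvalue of a projection matrix.

    matrix: square list of lists with non-negative entries; entry [i][j] is
    the contribution of stage j this year to stage i next year.
    """
    n = len(matrix)
    vec = [1.0 / n] * n  # stage distribution, kept normalized to sum to 1
    lam = 0.0
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
        new_lam = sum(nxt)  # total growth in one step, since vec sums to 1
        vec = [x / new_lam for x in nxt]
        if abs(new_lam - lam) < tol:
            return new_lam
        lam = new_lam
    return lam


# Hypothetical two-stage matrix (juveniles, adults); the numbers are made up.
projection = [[0.5, 2.0],
              [0.3, 0.6]]
lam = growth_rate(projection)
```

A value of `lam` above 1 indicates a growing population and a value below 1 a declining one. The sketch assumes the matrix is primitive so that power iteration converges; a strictly periodic matrix would need a different solver.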

5.
For Genetic Analysis Workshop 19, 2 extensive data sets were provided, including whole genome and whole exome sequence data, gene expression data, and longitudinal blood pressure outcomes, together with nongenetic covariates. These data sets gave researchers the chance to investigate different aspects of more complex relationships within the data, and the contributions in our working group focused on statistical methods for the joint analysis of multiple phenotypes, which is part of the research field of data integration. The analysis of data from different sources poses challenges to researchers but provides the opportunity to model the real-life situation more realistically. Our 4 contributions all used the provided real data to identify genetic predictors for blood pressure. In the contributions, novel multivariate rare variant tests, copula models, structural equation models and a sparse matrix representation variable selection approach were applied. Each of these statistical models can be used to investigate specific hypothesized relationships, which are described together with their biological assumptions. The results showed that all methods are ready for application on a genome-wide scale and can be used or extended to include multiple omics data sets. The results provide potentially interesting genetic targets for future investigation and replication. Furthermore, all contributions demonstrated that the analysis of complex data sets could benefit from modeling correlated phenotypes jointly as well as by adding further bioinformatics information.

6.
When the fluorescence intensity of a chromophore attached to or bound in an enzyme relates to a specific reactive step in the enzymatic reaction, a single molecule fluorescence study of the process reveals a time sequence in the fluorescence emission that can be analyzed to derive kinetic and mechanistic information. Reports of various experimental results and corresponding theoretical studies have provided a basis for interpreting these data and understanding the methodology. We have found it useful to parallel experiments with Monte Carlo simulations of potential models hypothesized to describe the reaction kinetics. The simulations can be adapted to include experimental limitations, such as limited data sets, and complexities such as dynamic disorder, where reaction rates appear to change over time. By using models that are known a priori, the simulations reveal some of the challenges of interpreting finite single-molecule data sets by employing various statistical signatures that have been identified.

7.
Many organizations collect large passive acoustic monitoring (PAM) data sets that need to be efficiently and reliably analyzed. To determine appropriate methods for effective analysis of big PAM data sets, we undertook a literature review of baleen whale PAM analysis methods. Methodologies from 166 studies (published between 2000 and 2019) were summarized, and a detailed review was performed on the 94 studies that recorded more than 1,000 hr of acoustic data (“big data”). Analysis techniques for extracting baleen whale information from PAM data sets varied depending on the research observed. A spectrum of methodologies was used and ranged from manual analysis of all acoustic data by human experts to completely automated techniques with no manual validation. Based on this assessment, recommendations are provided to encourage robust research methods that are comparable across studies and sectors, achievable across research groups, and consistent with previous work. These include using automated techniques when possible to increase efficiency and repeatability, supplementing automation with manual review to calculate automated detector performance, and increasing consistency in terminology and presentation of results. This work can be used to facilitate discussion for minimum standards and best practices to be implemented in the field of marine mammal PAM.

8.
Given the need for parallel increases in food and energy production from crops in the context of global change, crop simulation models and data sets to feed these models with photosynthesis and respiration parameters are increasingly important. This study provides information on photosynthesis and respiration for three energy crops (sunflower, kenaf, and cynara), reviews relevant information for five other crops (wheat, barley, cotton, tobacco, and grape), and assesses how conserved photosynthesis parameters are among crops. Using large data sets and optimization techniques, the C(3) leaf photosynthesis model of Farquhar, von Caemmerer, and Berry (FvCB) and an empirical night respiration model for tested energy crops accounting for effects of temperature and leaf nitrogen were parameterized. Instead of the common approach of using information on net photosynthesis response to CO(2) at the stomatal cavity (A(n)-C(i)), the model was parameterized by analysing the photosynthesis response to incident light intensity (A(n)-I(inc)). Convincing evidence is provided that the maximum Rubisco carboxylation rate or the maximum electron transport rate was very similar whether derived from A(n)-C(i) or from A(n)-I(inc) data sets. Parameters characterizing Rubisco limitation, electron transport limitation, the degree to which light inhibits leaf respiration, night respiration, and the minimum leaf nitrogen required for photosynthesis were then determined. Model predictions were validated against independent sets. Only a few FvCB parameters were conserved among crop species, thus species-specific FvCB model parameters are needed for crop modelling. Therefore, information from readily available but underexplored A(n)-I(inc) data should be re-analysed, thereby expanding the potential of combining classical photosynthetic data and the biochemical model.

9.
10.
Diatoms and macroinvertebrates are both commonly used for biological assessment of stream condition. As the use of biological assessment techniques increases, resource managers will need to make decisions on which biological tool to use for a particular study. In a study of the Kiewa River, Victoria, Australia we assessed these two components of the biota—macroinvertebrates and diatoms—using indices and pattern analysis, and comparing them with an a priori landscape classification. We also assessed the relationship exhibited between the biological results and environmental variables which are usually significant in stream ecosystems. To make the data comparable we used categorical abundances for both data sets. The pattern analyses showed complementary results, with diatoms more closely related to water quality variables, whereas macroinvertebrates were primarily related to catchment and habitat features. An analysis of a combined data set (diatoms plus macroinvertebrates) showed no extra information was gained. Using categorisation to create consistency between data sets was shown to reduce the information and affect results from the diatom analyses. The results suggested that the locally derived bioassessment models and indices provided a more accurate assessment of the sites than the overseas-derived diatom index. The outcomes are complicated by issues of data weighting, whereby a presence/absence diatom index may have performed better than abundance-weighted indices due to strong dominance of one or two species at a site. Future comparisons will benefit from an increase in the knowledge of regional diatom taxonomy and autecology.

11.
Recent advances in high‐throughput methods of molecular analyses have led to an explosion of studies generating large‐scale ecological data sets. In particular, noticeable effect has been attained in the field of microbial ecology, where new experimental approaches provided in‐depth assessments of the composition, functions and dynamic changes of complex microbial communities. Because even a single high‐throughput experiment produces a large amount of data, powerful statistical techniques of multivariate analysis are well suited to analyse and interpret these data sets. Many different multivariate techniques are available, and often it is not clear which method should be applied to a particular data set. In this review, we describe and compare the most widely used multivariate statistical techniques including exploratory, interpretive and discriminatory procedures. We consider several important limitations and assumptions of these methods, and we present examples of how these approaches have been utilized in recent studies to provide insight into the ecology of the microbial world. Finally, we offer suggestions for the selection of appropriate methods based on the research question and data set structure.

12.
Modeling is a means of formulating and testing complex hypotheses. Useful modeling is now possible with biological laboratory microcomputers with which experimenters feel comfortable. Artificial intelligence (AI) is sufficiently similar to modeling that AI techniques, now becoming usable on microcomputers, are applicable to modeling. Microcomputer and AI applications to physiological system studies with multienzyme models and with kinetic models of isolated enzymes are described. Using an IBM PC microcomputer, we have been able to fit kinetic enzyme models; to extend this process to design kinetic experiments by determining the optimal conditions; and to construct an enzyme (hexokinase) kinetics data base. We have also used a PC to do most of the constructing of complex multienzyme models, initially with small simple BASIC programs; alternative methods with standard spreadsheet or data base programs have been defined. Formulating and solving differential equations in appropriate representational languages, and sensitivity analysis, are soon likely to be feasible with PCs. Much of the modeling process can be stated in terms of AI expert systems, using sets of rules for fitting and evaluating models and designing further experiments. AI techniques also permit critiquing and evaluating the data, experiments, and hypotheses being modeled, and can be extended to supervise the calculations involved.

13.
An algorithm has been developed for the determination of nucleotide sequence from data produced in fluorescence-based automated DNA sequencing instruments employing the four-color strategy. This algorithm takes advantage of object oriented programming techniques for modularity and extensibility. The algorithm is adaptive in that data sets from a wide variety of instruments and sequencing conditions can be used with good results. Confidence values are provided on the base calls as an estimate of accuracy. The algorithm iteratively employs confidence determinations from several different modules, each of which examines a different feature of the data for accurate peak identification. Modules within this system can be added or removed for increased performance or for application to a different task. In comparisons with commercial software, the algorithm performed well.

14.
Unbalanced samples are considered a drawback in predictive modelling of species' potential habitats, and a prevalence of 0.5 has been extensively recommended. We argue that unbalanced species distribution data are not such a problem from a statistical point of view, and that good models can be obtained provided that the right predictors and cut-off to convert probabilities into presence/absence are chosen. The effects of unbalanced prevalence should not be confused with those of low-quality data affected by false absences, low sample size, or unrepresentativeness of the environmental and spatial gradient. Finally, we point out the necessity of greater research effort aimed at improving both the quality of training data sets, and the processes of validating and testing of models.

15.
16.
Over the last two decades spatial point pattern analysis (SPPA) has become increasingly popular in ecological research. To direct future work in this area we review studies using SPPA techniques in ecology and related disciplines. We first summarize the key elements of SPPA in ecology (i.e. data types, summary statistics and their estimation, null models, comparison of data and models, and consideration of heterogeneity); second, we review how ecologists have used these key elements; and finally, we identify practical difficulties that are still commonly encountered and point to new methods that allow current key questions in ecology to be effectively addressed. Our review of 308 articles published over the period 1992–2012 reveals that a standard canon of SPPA techniques in ecology has been largely identified and that most of the earlier technical issues that occupied ecologists, such as edge correction, have been solved. However, the majority of studies underused the methodological potential offered by modern SPPA. More advanced techniques of SPPA offer the potential to address a variety of highly relevant ecological questions. For example, inhomogeneous summary statistics can quantify the impact of heterogeneous environments, mark correlation functions can include trait and phylogenetic information in the analysis of multivariate spatial patterns, and more refined point process models can be used to realistically characterize the structure of a wide range of patterns. Additionally, recent advances in fitting spatially‐explicit simulation models of community dynamics to point pattern summary statistics hold the promise for solving the longstanding problem of linking pattern to process. All these newer developments allow ecologists to keep up with the increasing availability of spatial data sets provided by newer technologies, which allow point patterns and environmental variables to be mapped over large spatial extents at increasingly higher image resolutions.  

17.
A duplication growth model of gene expression networks (cited 8 times in total: 0 self-citations, 8 citations by others)

18.
Training and testing of conventional machine learning models on binary classification problems depend on the proportions of the two outcomes in the relevant data sets. This may be especially important in practical terms when real-world applications of the classifier are either highly imbalanced or occur in unknown proportions. Intuitively, it may seem sensible to train machine learning models on data similar to the target data in terms of proportions of the two binary outcomes. However, we show that this is not the case using the example of prediction of deleterious and neutral phenotypes of human missense mutations in human genome data, for which the proportion of the binary outcome is unknown. Our results indicate that using balanced training data (50% neutral and 50% deleterious) results in the highest balanced accuracy (the average of True Positive Rate and True Negative Rate), Matthews correlation coefficient, and area under ROC curves, no matter what the proportions of the two phenotypes are in the testing data. Besides balancing the data by undersampling the majority class, other techniques in machine learning include oversampling the minority class, interpolating minority-class data points and various penalties for misclassifying the minority class. However, these techniques are not commonly used in either the missense phenotype prediction problem or in the prediction of disordered residues in proteins, where the imbalance problem is substantial. The appropriate approach depends on the amount of available data and the specific problem at hand.
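
Two of the metrics named in this abstract can be made concrete with a minimal sketch. It computes balanced accuracy (the average of true positive rate and true negative rate, as defined in the abstract) and the Matthews correlation coefficient from a confusion matrix. The function names and the confusion-matrix counts are hypothetical and are not results from the study.

```python
from math import sqrt


def balanced_accuracy(tp, fn, tn, fp):
    """Average of true positive rate and true negative rate."""
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)
    return (tpr + tnr) / 2


def matthews_corrcoef(tp, fn, tn, fp):
    """Matthews correlation coefficient; returns 0.0 when undefined."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Made-up imbalanced test set: 10 deleterious (positive), 90 neutral (negative).
tp, fn, tn, fp = 8, 2, 81, 9
bacc = balanced_accuracy(tp, fn, tn, fp)
mcc = matthews_corrcoef(tp, fn, tn, fp)
plain_acc = (tp + tn) / (tp + fn + tn + fp)
```

On an imbalanced test set, plain accuracy is dominated by the majority class, whereas balanced accuracy gives both classes equal weight. That property is why it suits comparisons across test sets with different class proportions.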

19.
It has been claimed that blending processes such as trade and exchange have always been more important in the evolution of cultural similarities and differences among human populations than the branching process of population fissioning. In this paper, we report the results of a novel comparative study designed to shed light on this claim. We fitted the bifurcating tree model that biologists use to represent the relationships of species to 21 biological data sets that have been used to reconstruct the relationships of species and/or higher level taxa and to 21 cultural data sets. We then compared the average fit between the biological data sets and the model with the average fit between the cultural data sets and the model. Given that the biological data sets can be confidently assumed to have been structured by speciation, which is a branching process, our assumption was that, if cultural evolution is dominated by blending processes, the fit between the bifurcating tree model and the cultural data sets should be significantly worse than the fit between the bifurcating tree model and the biological data sets. Conversely, if cultural evolution is dominated by branching processes, the fit between the bifurcating tree model and the cultural data sets should be no worse than the fit between the bifurcating tree model and the biological data sets. We found that the average fit between the cultural data sets and the bifurcating tree model was not significantly different from the fit between the biological data sets and the bifurcating tree model. This indicates that the cultural data sets are not less tree-like than are the biological data sets. As such, our analysis does not support the suggestion that blending processes have always been more important than branching processes in cultural evolution. 
We conclude from this that, rather than deciding how cultural evolution has proceeded a priori, researchers need to ascertain which model or combination of models is relevant in a particular case and why.

20.
Yang X, Belin TR, Boscardin WJ. Biometrics 2005, 61(2): 498-506
Across multiply imputed data sets, variable selection methods such as stepwise regression and other criterion-based strategies that include or exclude particular variables typically result in models with different selected predictors, thus presenting a problem for combining the results from separate complete-data analyses. Here, drawing on a Bayesian framework, we propose two alternative strategies to address the problem of choosing among linear regression models when there are missing covariates. One approach, which we call "impute, then select" (ITS) involves initially performing multiple imputation and then applying Bayesian variable selection to the multiply imputed data sets. A second strategy is to conduct Bayesian variable selection and missing data imputation simultaneously within one Gibbs sampling process, which we call "simultaneously impute and select" (SIAS). The methods are implemented and evaluated using the Bayesian procedure known as stochastic search variable selection for multivariate normal data sets, but both strategies offer general frameworks within which different Bayesian variable selection algorithms could be used for other types of data sets. A study of mental health services utilization among children in foster care programs is used to illustrate the techniques. Simulation studies show that both ITS and SIAS outperform complete-case analysis with stepwise variable selection and that SIAS slightly outperforms ITS.
