Similar literature
1.
Predictive phylogeography seeks to aggregate genetic, environmental and taxonomic data from multiple species in order to make predictions about unsampled taxa using machine-learning techniques such as Random Forests. To date, organismal trait data have infrequently been incorporated into predictive frameworks because of the difficulties inherent in scoring trait data across a taxonomically broad set of taxa. We refine predictive frameworks from two North American systems, the inland temperate rainforests of the Pacific Northwest and the Southwestern Arid Lands (SWAL), by incorporating a number of organismal trait variables. Our results indicate that incorporating life history traits as predictor variables improves the performance of the supervised machine-learning approach to predictive phylogeography, especially for the SWAL system, in which predictions made from only taxonomic and climate variables meet with only moderate success. Traits related to reproduction (e.g., reproductive mode; clutch size) and trophic level appear to be particularly informative to the predictive framework. Predictive frameworks offer an important mechanism for integrating organismal trait, environmental, and genetic data in phylogeographic studies.

2.
3.
Propensity score matching (PSM) and propensity score weighting (PSW) are popular tools to estimate causal effects in observational studies. We address two open issues: how to estimate propensity scores and how to assess covariate balance. Using simulations, we compare the performance of PSM and PSW based on logistic regression and machine learning algorithms (CART; Bagging; Boosting; Random Forest; Neural Networks; naive Bayes). Additionally, we consider several measures of covariate balance (Absolute Standardized Average Mean (ASAM) with and without interactions; measures based on quantile-quantile plots; ratio between variances of propensity scores; area under the curve (AUC)) and assess their ability to predict the bias of PSM and PSW estimators. We also investigate the importance of tuning machine learning parameters in the context of propensity score methods. Two simulation designs are employed. In the first, the generating processes are inspired by birth register data used to assess the effect of labor induction on the occurrence of caesarean section. The second exploits more general generating mechanisms. Overall, among the different techniques, random forests performed the best, especially in PSW. Logistic regression and neural networks also performed excellently, at a level similar to that of random forests. As for covariate balance, the simplest and most commonly used metric, the ASAM, showed a strong correlation with the bias of causal effect estimators. Our findings suggest that researchers should aim at obtaining an ASAM lower than 10% for as many variables as possible. In the empirical study we found that labor induction had a small and not statistically significant impact on caesarean section.
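To make the balance criterion in this abstract concrete, here is a minimal sketch of an absolute standardized difference in means for a single covariate, with optional weights such as those produced by propensity score weighting. It assumes standardization by the unweighted standard deviation of the treated group, which is one common convention and not necessarily the one used in the study. The function names and the toy covariate values are hypothetical.

```python
from math import sqrt


def weighted_mean(values, weights):
    """Weighted mean of a sequence of numbers."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def asam(x_treated, x_control, w_treated=None, w_control=None):
    """Absolute standardized difference in means for one covariate.

    Assumption: the difference is standardized by the unweighted sample
    standard deviation of the treated group.
    """
    w_treated = w_treated or [1.0] * len(x_treated)
    w_control = w_control or [1.0] * len(x_control)
    mean_t = weighted_mean(x_treated, w_treated)
    mean_c = weighted_mean(x_control, w_control)
    plain_mean_t = sum(x_treated) / len(x_treated)
    var_t = sum((v - plain_mean_t) ** 2 for v in x_treated) / (len(x_treated) - 1)
    return abs(mean_t - mean_c) / sqrt(var_t)


# Hypothetical covariate values, for illustration only.
treated = [30.0, 32.0, 34.0, 36.0]
control = [28.0, 30.0, 32.0, 34.0]
balance = asam(treated, control)
is_balanced = balance < 0.10  # the 10% threshold suggested in the abstract
```

In a weighted analysis the same function would be called once per covariate with the estimated weights, and the share of covariates below the threshold reported.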

4.
Ternary organic solar cells (OSCs) have progressed significantly in recent years owing to the efficient photon harvesting of a blend photoactive layer that includes three absorption-complementary materials. With the rapid development of highly efficient ternary OSCs in photovoltaics, the precise energy-level alignment of the three active components within ternary OSC devices should be taken into account. Machine learning is a computational approach that can learn effectively from historical data to build predictive models. In this study, a dataset of 124 fullerene derivative-based ternary OSCs is manually constructed from a diverse range of literature, along with their frontier molecular orbital energy levels and device structures. Different machine-learning algorithms are trained on these electronic parameters to predict photovoltaic efficiency. The Random Forest approach provides the best predictive capability among the machine-learning algorithms tested on this dataset. Furthermore, the Random Forest algorithm yields valuable insights into the crucial role of the lowest unoccupied molecular orbital energy levels of organic donors in the performance of ternary OSCs. The outcome of this study demonstrates a smart strategy for extracting underlying complex correlations in fullerene derivative-based ternary OSCs, thereby accelerating the development of ternary OSCs and related research fields.

5.
Lameness is one of the costliest health problems, as well as a welfare concern, in dairy cows. However, it is difficult to detect cows with possible lameness, or those at risk of becoming lame in, for example, the next week or so. In this study, we investigated the ability of three machine learning algorithms, Naïve Bayes (NB), Random Forest (RF) and Multilayer Perceptron (MLP), to predict cases of lameness using milk production and conformation traits. The performance of these algorithms was compared with logistic regression (LR) as the gold standard approach for binary classification. We had a total of 2 535 lameness scores (2 248 sound and 287 unsound) and 29 predictor features from nine dairy herds in Australia to predict lameness incidence. Training was done on 80% of the data within each herd, with the remainder used as the validation set. Our results indicated that, in terms of area under the receiver operating characteristic curve, there were negligible differences between LR (0.67) and NB (0.66), while MLP (0.62) and RF (0.61) underperformed compared to the other two methods. However, the F1-score of NB (27%) exceeded that of LR (1%), suggesting that NB could potentially be a more reliable method for predicting lameness in practice, provided that enough relevant data are available for proper training, which was a limitation in this study. Considering the small size of our dataset, the short time gap between production records and lameness scoring, and the lack of information about environmental conditions prior to the incidence of lameness, management practices, and farm characteristics, this study proved the concept of using machine learning predictive models to predict the incidence of lameness prior to its occurrence; such models may thus become a valuable decision support system for better lameness management in precision dairy farming.
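The gap between similar AUC values and very different F1-scores reported here is typical of imbalanced classes. As a minimal sketch, the F1-score can be computed from confusion counts as follows. The counts are hypothetical and chosen only to show how a classifier that rarely predicts the minority class ends up with an F1 near zero; they are not taken from the study.

```python
def f1_score(tp, fp, fn):
    """F1 = harmonic mean of precision and recall, from confusion counts."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical validation set with 57 lame cows (illustration only).
# A conservative classifier flags 2 cows, 1 of them correctly.
conservative = f1_score(tp=1, fp=1, fn=56)
# A more liberal classifier flags 150 cows, 30 of them correctly.
liberal = f1_score(tp=30, fp=120, fn=27)
```

Because F1 ignores true negatives, it penalizes the conservative classifier heavily even though its overall accuracy on the majority class would be high.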

6.
7.
Resource selection functions (RSFs) are tremendously valuable for ecologists and resource managers because they quantify spatial patterns in resource utilization by wildlife, thereby facilitating identification of critical habitat areas and characterizing specific habitat features that are selected or avoided. RSFs discriminate between known-use resource units (e.g., telemetry locations) and available (or randomly selected) resource units based on an array of environmental features, and in their standard form are fitted using logistic regression. As generalized linear models, standard RSFs have some notable limitations, such as difficulties in accommodating nonlinear (e.g., humped or threshold) relationships and complex interactions. Increasingly, ecologists are using flexible machine-learning methods (e.g., random forests, neural networks) to overcome these limitations. Herein, we investigate the seasonal resource selection patterns of mule deer (Odocoileus hemionus) by comparing a logistic regression framework with random forest (RF), a popular machine-learning algorithm. RF models detected nonlinear relationships (e.g., optimal ranges for slope and elevation) and complex interactions that would have been very challenging to discover and characterize using standard model-based approaches. Compared with standard RSF models, RF models exhibited improved predictive skill, provided novel insights about resource selection patterns of mule deer, and, when projected across a relevant geographic space, manifested notable differences in predicted habitat suitability. We recommend that wildlife researchers harness the strengths of machine-learning tools like RF in addition to "classical" tools (e.g., mixed-effects logistic regression) for evaluating resource selection, especially in cases where extensive telemetry data sets are available.

8.
In function approximation problems, one of the most common ways to evaluate a learning algorithm consists in partitioning the original data set (input/output data) into two sets: learning, used for building models, and test, applied for genuine out-of-sample evaluation. When the partition into learning and test sets does not take into account the variability and geometry of the original data, it might lead to unbalanced and unrepresentative learning and test sets and, thus, to wrong conclusions about the accuracy of the learning algorithm. How the partitioning is made is therefore a key issue, and it becomes more important when the data set is small, because of the need to reduce the pessimistic effects caused by the removal of instances from the original data set. Thus, in this work, we propose a deterministic data mining approach for distributing a data set (input/output data) into two representative and balanced sets of roughly equal size, taking the variability of the data set into consideration, with the aim of allowing a fair evaluation of learning accuracy and making reproducible those machine learning experiments that are usually based on random distributions. The sets are generated using a combination of a clustering procedure, especially suited for function approximation problems, and a distribution algorithm that distributes the data set into two sets within each cluster based on a nearest-neighbor approach. In the experiments section, the performance of the proposed methodology is reported in a variety of situations through an ANOVA-based statistical study of the results.

9.
Species distribution modeling often involves high-dimensional environmental data. Large amounts of data and multicollinearity among covariates pose challenges to statistical models in variable selection for reliable inferences about the effects of environmental factors on the spatial distribution of species. Few studies have evaluated and compared the performance of multiple machine learning (ML) models in handling multicollinearity. Here, we assessed the effectiveness of removing correlated covariates and of regularization in coping with multicollinearity in ML models for habitat suitability. Three machine learning algorithms, maximum entropy (MaxEnt), random forests (RFs), and support vector machines (SVMs), were applied to the original data (OD) of 27 landscape variables, reduced data (RD) with 14 highly correlated covariates removed, and 15 principal components (PC) of the OD accounting for 90% of the original variability. The performance of the three ML models was measured with the area under the curve and the continuous Boyce index. We collected 663 nonduplicated presence locations of Eastern wild turkeys (Meleagris gallopavo silvestris) across the state of Mississippi, United States. Of the total locations, 453 locations separated by a distance of ≥2 km were used to train the three ML algorithms on the OD, RD, and PC data, respectively. The remaining 210 locations were used to validate the trained ML models and measure ML performance. All three ML models had excellent performance on the RD and PC data. MaxEnt and SVMs had good performance on the OD data, indicating that the regularization of the default setting was adequate for multicollinearity. Weak learning of RFs through bagging appeared to alleviate multicollinearity and resulted in excellent performance on the OD data. Regularization of ML algorithms may help exploratory studies of the effects of environmental factors on the spatial distribution and habitat suitability of wildlife.
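The "reduced data" step described in this abstract, removing highly correlated covariates, can be sketched as a greedy correlation filter. This is an illustrative assumption about how such a filter might work, not the procedure used in the study: the 0.7 cutoff, the function names, and the toy variables are all hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def drop_correlated(columns, threshold=0.7):
    """Keep a variable only if |r| with every already-kept variable is
    below the threshold. `columns` maps a name to its list of values;
    the 0.7 default is an assumed cutoff, not one from the abstract."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical landscape variables, for illustration only.
landscape = {
    "elevation": [1.0, 2.0, 3.0, 4.0, 5.0],
    "temperature": [10.0, 8.0, 6.0, 4.0, 2.0],  # perfectly anti-correlated
    "canopy": [3.0, 1.0, 4.0, 1.0, 5.0],
}
kept = drop_correlated(landscape)
```

The result depends on the order in which variables are visited, so in practice the ordering would need an explicit rationale such as ecological relevance.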

10.
11.
12.
Owing to advancements in computational ability, enhanced technology and a reduction in the price of genotyping, more data are being generated for understanding genetic associations with diseases and disorders. However, with the availability of large data sets come the inherent challenges of developing new methods of statistical analysis and modeling. Considering that a complex phenotype may be the effect of a combination of multiple loci, various statistical methods have been developed for identifying genetic epistasis effects. Among these methods, logic regression (LR) is an intriguing approach incorporating tree-like structures. Various methods have built on the original LR to improve different aspects of the model. In this study, we review four variations of LR, namely Logic Feature Selection, Monte Carlo Logic Regression, Genetic Programming for Association Studies, and Modified Logic Regression-Gene Expression Programming, and investigate the performance of each method using simulated and real genotype data. We contrast these with another tree-like approach, namely Random Forests, and a Bayesian logistic regression with stochastic search variable selection.

13.
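At present, the functional annotation of genes using computational and mathematical methods is both a hot topic and a challenge, with machine learning methods being the most widely applied. Bioinformaticians continue to propose effective, fast, and accurate machine learning methods for gene function annotation, which has greatly advanced biomedicine. This paper reviews the application and progress of machine learning methods in gene function annotation. It introduces several commonly used methods, including support vector machines, the k-nearest neighbor algorithm, decision trees, random forests, neural networks, Markov random fields, logistic regression, clustering algorithms, and Bayesian classifiers, and discusses how to select data sources, how to improve algorithms, and how to raise predictive performance when applying machine learning methods to gene function annotation.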

14.
赵学彤  杨亚东  渠鸿竹  方向东 《遗传》2018,40(9):693-703
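With the continuing development of omics technologies, methods for acquiring biological data of different levels and types are becoming increasingly mature. Disease diagnosis and treatment generate large amounts of data, and it is essential to use machine learning and other artificial intelligence methods to analyze complex, multidimensional, multiscale disease big data, build clinical decision support tools, and help physicians find fast and effective diagnosis and treatment plans. In this process, the choice of machine learning and other artificial intelligence methods is particularly important. Accordingly, this paper first briefly reviews, from the perspectives of type and algorithm, the machine learning and related methods commonly used in clinical decision support, introducing support vector machines, logistic regression, clustering algorithms, Bagging, random forests, and deep learning. It then summarizes and classifies the applications of these methods in clinical decision support and discusses their respective strengths and weaknesses, providing a useful reference for choosing machine learning and other artificial intelligence methods in clinical decision support.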

15.
《IRBM》2022,43(6):678-686
Objectives: Feature selection in data sets is an important task that helps alleviate various machine learning and data mining issues. The main objectives of a feature selection method consist of building simpler and more understandable classifier models in order to improve data mining and processing performance. Therefore, a comparative evaluation of the Chi-square method, the recursive feature elimination method, and a tree-based method (using Random Forest), applied to three common machine learning methods (K-Nearest Neighbor, naïve Bayesian classifier and decision tree classifier), is performed to select the most relevant primitives from a large set of attributes. Furthermore, the most suitable pairing of feature selection method and machine learning method, i.e., the one that provides the best performance, is determined. Materials and methods: In this paper, an overview of the most common feature selection techniques is first provided: the Chi-Square method, the Recursive Feature Elimination method (RFE) and the tree-based method (using Random Forest). A comparative evaluation of the improvement brought by these feature selection methods to the three common machine learning methods (K-Nearest Neighbor, naïve Bayesian classifier and decision tree classifier) is performed. For evaluation purposes, the following measures are used on the stroke disease data set: micro-F1, accuracy and root mean square error. Results: The obtained results show that the proposed approach (i.e., the Tree-Based Method using Random Forest, TBM-RF, combined with the decision tree classifier, DTC) provides accuracy higher than 85% and an F1-score higher than 88%, and is thus better than KNN and NB using the Chi-Square, RFE and TBM-RF methods. Conclusion: This study shows that the pairing of the Tree-Based Method using Random Forest (TBM-RF) with the decision tree classifier successfully and efficiently helps to find the most relevant features and to predict and classify patients suffering from stroke.

16.
With the discovery of diverse roles for RNA, its centrality in cellular functions has become increasingly apparent. A number of algorithms have been developed to predict RNA secondary structure. Their performance has been benchmarked by comparing structure predictions to reference secondary structures. Generally, algorithms are compared against each other and one is selected as best without statistical testing to determine whether the improvement is significant. In this work, it is demonstrated that the prediction accuracies of methods correlate with each other over sets of sequences. One possible reason for this correlation is that many algorithms use the same underlying principles. A set of previously published benchmarks for programs that predict a structure common to three or more sequences is statistically analyzed as an example to show that it can be rigorously evaluated using paired two-sample t-tests. Finally, a pipeline of statistical analyses is proposed to guide the choice of data set size and performance assessment for benchmarks of structure prediction. The pipeline is applied using 5S rRNA sequences as an example.
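Because the accuracies of two methods correlate across the same sequences, the comparison is made on per-sequence differences. A minimal sketch of the paired t statistic follows, using only the Python standard library. The accuracy values and the function name are hypothetical and serve only to illustrate the calculation.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for a paired two-sample t-test.

    Each position in the two lists must refer to the same benchmark item.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have equal length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    standard_error = stdev(diffs) / sqrt(n)  # sample SD of the differences
    return mean(diffs) / standard_error, n - 1


# Hypothetical per-sequence accuracies of two prediction methods.
method_a = [0.80, 0.75, 0.90, 0.65, 0.70]
method_b = [0.78, 0.70, 0.91, 0.60, 0.66]
t_value, dof = paired_t_statistic(method_a, method_b)
```

Turning the statistic into a p-value additionally requires the Student t distribution, which the standard library does not provide.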

17.
Background: Previous epidemiological studies have examined the prevalence and risk factors for a variety of parasitic illnesses, including protozoan and soil-transmitted helminth (STH, e.g., hookworms and roundworms) infections. Despite advancements in machine learning for data analysis, the majority of these studies use traditional logistic regression to identify significant risk factors. Methods: In this study, we used data from a survey of 54 risk factors for intestinal parasitosis in 954 Ethiopian school children. We investigated whether machine learning approaches can supplement traditional logistic regression in identifying intestinal parasite infection risk factors. We used feature selection methods such as InfoGain (IG), ReliefF (ReF), Joint Mutual Information (JMI), and Minimum Redundancy Maximum Relevance (MRMR). Additionally, we predicted children's parasitic infection status using classifiers such as Logistic Regression (LR), Support Vector Machines (SVM), Random Forests (RF) and XGBoost (XGB), and compared their accuracy and area under the receiver operating characteristic curve (AUROC) scores. For optimal model training, we performed tenfold cross-validation and tuned the classifier hyperparameters. We balanced our dataset using the Synthetic Minority Oversampling Technique (SMOTE). Additionally, we used association rule learning to establish a link between risk factors and parasitic infections. Key findings: Our study demonstrated that machine learning could be used in conjunction with logistic regression. Using machine learning, we developed models that accurately predicted four parasitic infections: any parasitic infection at 79.9% accuracy, helminth infection at 84.9%, any STH infection at 95.9%, and protozoan infection at 94.2%. The Random Forests (RF) and Support Vector Machines (SVM) classifiers achieved the highest accuracy when either the top 20 risk factors selected by Joint Mutual Information (JMI) or all features were used. The best predictors of infection were socioeconomic, demographic, and hematological characteristics. Conclusions: We demonstrated that feature selection and association rule learning are useful strategies for detecting risk factors for parasite infection. Additionally, we showed that advanced classifiers might be utilized to predict children's parasitic infection status. When combined with standard logistic regression models, machine learning techniques can identify novel risk factors and predict infection risk.

18.
Tropical forests are significant carbon sinks and their soils' carbon storage potential is immense. However, little is known about the soil organic carbon (SOC) stocks of tropical mountain areas, whose complex soil-landscape and difficult accessibility pose a challenge to spatial analysis. The choice of methodology for spatial prediction is of high importance for improving the poor model results expected in cases of low predictor-response correlations. Four aspects were considered to improve model performance in predicting SOC stocks of the organic layer of a tropical mountain forest landscape: different spatial predictor settings, predictor selection strategies, various machine learning algorithms and model tuning. Five machine learning algorithms (random forests, artificial neural networks, multivariate adaptive regression splines, boosted regression trees and support vector machines) were trained and tuned to predict SOC stocks from predictors derived from a digital elevation model and a satellite image. Topographical predictors were calculated with a GIS search radius of 45 to 615 m. Finally, three predictor selection strategies were applied to the total set of 236 predictors. All machine learning algorithms, including the model tuning and predictor selection, were compared via five repetitions of a tenfold cross-validation. The boosted regression tree algorithm resulted in the overall best model. SOC stocks ranged from 0.2 to 17.7 kg m⁻², displaying huge variability, with diffuse insolation and curvatures of different scales guiding the spatial pattern. Predictor selection and model tuning improved the models' predictive performance in all five machine learning algorithms. The rather low number of selected predictors favours forward over backward selection procedures. Choosing predictors on the basis of their individual performance was outperformed by the two procedures that accounted for predictor interaction.
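The comparison scheme named in this abstract, five repetitions of a tenfold cross-validation, can be sketched as an index generator. This is a generic illustration using the Python standard library: the sample size of 23, the seed, and the function name are assumptions, and the study's own fold assignment is not described in the abstract.

```python
import random


def repeated_kfold(n_samples, n_folds=10, n_repeats=5, seed=0):
    """Yield (repeat, fold, train_indices, test_indices).

    Each repeat reshuffles the sample indices, then deals them into
    n_folds folds of nearly equal size; every sample is held out
    exactly once per repeat.
    """
    rng = random.Random(seed)  # fixed seed keeps the splits reproducible
    for repeat in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        folds = [indices[i::n_folds] for i in range(n_folds)]
        for fold, test_indices in enumerate(folds):
            held_out = set(test_indices)
            train_indices = [i for i in indices if i not in held_out]
            yield repeat, fold, train_indices, test_indices


# Hypothetical data set of 23 samples: 5 repeats x 10 folds = 50 splits.
splits = list(repeated_kfold(n_samples=23))
```

Model tuning and predictor selection would have to be repeated inside each training split for the held-out error to remain an honest estimate.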

19.
20.
Habitat suitability models, usually referred to as species distribution models (SDMs), are widely applied in ecology for many purposes, including species conservation, habitat discovery, and gaining evolutionary insights by estimating the distribution of species. Machine learning algorithms as well as statistical models have recently been used to predict the distribution of species. However, they appear to have some limitations arising from the data and the models used. Therefore, this study proposes a novel approach for assessing habitat suitability based on ensemble learning techniques. Three heterogeneous ensembles were built using the stacked generalization method to model the distribution of four wheatear species (Oenanthe deserti, Oenanthe leucopyga, Oenanthe leucura, and Oenanthe oenanthe) located in Morocco. Initially, a set of base learners was constructed by training six machine learning algorithms on each species' dataset (Multi-Layer Perceptron (MLP), Support Vector Classifier (SVC), K-nearest neighbors (KNN), Decision Trees (DT), Gradient Boosting Classifier (GB), and Random Forest (RF)). Then, the predictions of these base learners were fed as training data to train three meta-learners (Logistic Regression (LR), SVC, and MLP). To evaluate and assess the performance of the proposed approaches, we used: (1) six performance criteria (accuracy, recall, precision, F1-score, AUC, and TSS), (2) the Borda Count (BC) ranking method based on multiple criteria to rank the best-performing models, and (3) the Scott Knott (SK) test to statistically compare the performance of the presented models. The results based on the six evaluation metrics showed that stacked ensembles outperformed their single base learners in all species datasets, and the stacked model with SVC as a meta-learner outperformed the other two ensembles. The results showed the potential of ensemble learning techniques for modeling species distribution and support the use of stacked generalization as a combination strategy, since it gave better results than single models in the four wheatear species datasets. Moreover, to assess the impact of future climate changes on the distribution of the four wheatear species, the best-performing distribution model was selected and projected onto current and future climatic conditions. The distributions of the Moroccan wheatears were found to be only slightly affected by future climate changes.
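The Borda Count ranking mentioned in this abstract combines several performance criteria into one ordering. A minimal sketch follows, assuming that higher is better for every criterion and that, for each criterion, a model earns one point per model it outranks. The model names and scores are hypothetical and are not results from the study, and ties are broken by input order as a simplification.

```python
def borda_count(scores):
    """Return {model: Borda points} from {model: {criterion: value}}.

    Assumes every model reports the same criteria and that higher values
    are better. Ties are broken by input order (a simplification).
    """
    models = list(scores)
    criteria = list(next(iter(scores.values())))
    points = {model: 0 for model in models}
    for criterion in criteria:
        # Sort from worst to best; position equals the number outranked.
        ordered = sorted(models, key=lambda m: scores[m][criterion])
        for position, model in enumerate(ordered):
            points[model] += position
    return points


# Hypothetical scores for three models on three criteria.
example_scores = {
    "model_a": {"accuracy": 0.90, "AUC": 0.95, "TSS": 0.78},
    "model_b": {"accuracy": 0.92, "AUC": 0.96, "TSS": 0.83},
    "model_c": {"accuracy": 0.91, "AUC": 0.94, "TSS": 0.80},
}
points = borda_count(example_scores)
ranking = sorted(points, key=points.get, reverse=True)
```

A rank-based aggregate of this kind ignores how large the differences between models are, which is why it is paired in the abstract with a statistical test.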
