Similar Literature
1.
Background: The aim of the present study was to confirm the role of Brachyury in breast cancer and to verify whether four types of machine learning models can use Brachyury expression to predict patient survival. Methods: We conducted a retrospective review of medical records to obtain patient information and prepared tissue microarrays from the patients' paraffin-embedded tissue for staining analysis. We selected 303 patients and implemented four machine learning algorithms (multivariate logistic regression, decision tree, artificial neural network, and random forest), comparing the results of these models with each other using the area under the receiver operating characteristic (ROC) curve (AUC). Results: Chi-square tests of the relevant data indicated that Brachyury protein expression was significantly higher in cancer tissues than in paracancerous tissues (P=0.0335), and patients with breast cancer and high Brachyury expression had worse overall survival (OS) than patients with low Brachyury expression. We also found that Brachyury expression was associated with ER expression (P=0.0489). We then used the four machine learning models to verify the relationship between Brachyury expression and patient survival; the decision tree model performed best (AUC = 0.781). Conclusions: Brachyury is highly expressed in breast cancer and indicates a poor prognosis. Compared with conventional statistical methods, the decision tree model shows superior performance in predicting the survival status of patients with breast cancer.
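The four-model AUC comparison described above can be sketched with scikit-learn on synthetic data; this is a stand-in for the 303-patient cohort (which is not public), and all hyperparameters here are illustrative assumptions, not the study's configuration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 303-patient cohort: a few covariates
# (e.g., an expression marker plus clinical variables) and a binary
# survival label.
X, y = make_classification(n_samples=303, n_features=8, n_informative=4,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "neural_net": MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000,
                                random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

# As in the study, the held-out AUC is the comparison criterion.
aucs = {name: roc_auc_score(y_te, m.fit(X_tr, y_tr).predict_proba(X_te)[:, 1])
        for name, m in models.items()}
best = max(aucs, key=aucs.get)
```

Which model wins depends entirely on the data; on this synthetic set the ranking need not match the paper's result that the decision tree scored AUC = 0.781.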

2.
Resource selection functions (RSFs) are tremendously valuable for ecologists and resource managers because they quantify spatial patterns in resource utilization by wildlife, thereby facilitating identification of critical habitat areas and characterizing specific habitat features that are selected or avoided. RSFs discriminate between known-use resource units (e.g., telemetry locations) and available (or randomly selected) resource units based on an array of environmental features, and in their standard form are fit using logistic regression. As generalized linear models, standard RSFs have some notable limitations, such as difficulty accommodating nonlinear (e.g., humped or threshold) relationships and complex interactions. Increasingly, ecologists are using flexible machine-learning methods (e.g., random forests, neural networks) to overcome these limitations. Herein, we investigate the seasonal resource selection patterns of mule deer (Odocoileus hemionus) by comparing a logistic regression framework with random forest (RF), a popular machine-learning algorithm. RF models detected nonlinear relationships (e.g., optimal ranges for slope and elevation) and complex interactions that would have been very challenging to discover and characterize using standard model-based approaches. Compared with standard RSF models, RF models exhibited improved predictive skill, provided novel insights about resource selection patterns of mule deer, and, when projected across a relevant geographic space, manifested notable differences in predicted habitat suitability. We recommend that wildlife researchers harness the strengths of machine-learning tools like RF in addition to "classical" tools (e.g., mixed-effects logistic regression) for evaluating resource selection, especially in cases where extensive telemetry data sets are available.
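A minimal illustration of why a flexible learner can outperform a standard RSF when selection is humped: the synthetic "use vs. available" data below (hypothetical covariates, not the mule deer telemetry data) give the use probability an optimum at mid-elevation, a shape a linear-in-covariates logistic model cannot capture without hand-added terms:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
n = 2000
# Two habitat covariates; "use" probability is humped in elevation.
elev = rng.uniform(0, 1, n)
slope = rng.uniform(0, 1, n)
p_use = np.exp(-((elev - 0.5) ** 2) / 0.02)   # optimum near mid-elevation
used = (rng.random(n) < p_use).astype(int)    # 1 = used, 0 = available
X = np.column_stack([elev, slope])

half = n // 2
Xtr, Xte, ytr, yte = X[:half], X[half:], used[:half], used[half:]

# Standard RSF form: logistic regression, linear in the covariates.
glm = LogisticRegression().fit(Xtr, ytr)
# Flexible alternative: random forest.
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(Xtr, ytr)

auc_glm = roc_auc_score(yte, glm.predict_proba(Xte)[:, 1])
auc_rf = roc_auc_score(yte, rf.predict_proba(Xte)[:, 1])
```

Because the humped response makes the linear elevation coefficient nearly uninformative, the GLM's AUC stays near chance while the forest recovers the optimum.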

3.
We develop ways to predict the side chain orientations of residues within a protein structure by using several different statistical machine learning methods. Here the side chain orientation of a given residue i is measured by an angle Ωi between the vector pointing from the center of the protein structure to the Cα atom of residue i and the vector pointing from that Cα atom to the center of its side chain atoms. To predict the Ωi angles, we construct statistical models by using several different methods, such as general linear regression, a regression tree and bagging, a neural network, and a support vector machine. The root mean square errors for the different models range only from 36.67 to 37.60 degrees, and the correlation coefficients are all between 30% and 34%. The performances of the different models on the test set are thus quite similar, and show the relative predictive power of these models to be significant in comparison with random side chain orientations.
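The Ωi angle defined above is simply the angle between two vectors; a small sketch of that geometry (the function name and coordinates are hypothetical, for illustration only):

```python
import numpy as np

def omega_angle(protein_center, ca, sidechain_center):
    """Angle in degrees between the center->CA vector and the
    CA->side-chain-center vector, i.e. the Omega angle of the abstract."""
    v1 = np.asarray(ca, float) - np.asarray(protein_center, float)
    v2 = np.asarray(sidechain_center, float) - np.asarray(ca, float)
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

# A side chain pointing directly away from the protein center gives
# Omega = 0; one pointing straight back toward the center gives 180.
outward = omega_angle([0, 0, 0], [1, 0, 0], [2, 0, 0])
inward = omega_angle([0, 0, 0], [1, 0, 0], [0, 0, 0])
```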

4.
Market impact cost is the most significant portion of implicit transaction costs, and reducing it can reduce the overall transaction cost, although it cannot be measured directly. In this paper, we employed state-of-the-art nonparametric machine learning models (neural networks, a Bayesian neural network, a Gaussian process, and support vector regression) to predict market impact cost accurately and to provide a predictive model that is flexible in the number of input variables. We collected a large amount of real single-transaction data for the US stock market from the Bloomberg Terminal and generated three independent input variables. Most of the nonparametric machine learning models outperformed a state-of-the-art parametric benchmark, the I-star model, on four error measures. Although these models have difficulty separating the permanent and temporary cost components directly, nonparametric machine learning models can be good alternatives for reducing transaction costs through considerably improved prediction performance.
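As a hedged sketch of the regression setup, the snippet below fits one of the named nonparametric models (support vector regression) to synthetic impact-cost data generated from an assumed concave law; the three inputs are hypothetical stand-ins for the paper's variables, not its actual features:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.svm import SVR

rng = np.random.default_rng(2)
n = 500
# Three stylized inputs (hypothetical): relative order size, volatility,
# and participation rate.
size = rng.uniform(0.001, 0.1, n)
vol = rng.uniform(0.1, 0.5, n)
rate = rng.uniform(0.01, 0.3, n)
# Assumed concave (square-root-like) impact law plus noise.
cost = vol * np.sqrt(size) * rate ** 0.6 + rng.normal(0, 0.002, n)
X = np.column_stack([size, vol, rate])

half = n // 2
svr = SVR(C=10.0, epsilon=0.001).fit(X[:half], cost[:half])
mae = mean_absolute_error(cost[half:], svr.predict(X[half:]))

# Naive benchmark: always predict the training-set mean cost.
baseline = mean_absolute_error(cost[half:],
                               np.full(n - half, cost[:half].mean()))
```

The paper's actual benchmark is the parametric I-star model; a constant-mean predictor is used here only because I-star's calibrated form is not reproduced in the abstract.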

5.
6.
IRBM, 2021, 42(5): 345-352
Available clinical methods for heart failure (HF) diagnosis are expensive and require a high level of expert intervention. Recently, various machine learning models have been developed for HF prediction, and most of them suffer from over-fitting. Over-fitting occurs when a predictive model performs well on the training data yet performs poorly on the testing data. Developing a machine learning model with good generalization capabilities (such that the model performs well on both the training and the testing data sets) would minimize prediction errors overall, and such prediction models could be helpful to cardiologists for the effective diagnosis of HF. This paper proposes a two-stage decision support system to overcome the over-fitting issue and to optimize generalization. The first stage uses a mutual-information-based statistical model, while the second stage uses a neural network. We applied our approach to the HF subset of the publicly available Cleveland heart disease database. Our experimental results show that the proposed decision support system improves generalization and reduces the mean percent error (MPE) to 8.8%, which is significantly less than in recently published studies. In addition, our model exhibits a 93.33% accuracy rate, which is higher than twenty-eight recently developed HF risk prediction models that achieved accuracies in the range of 57.85% to 92.31%. We hope that our decision support system will be helpful to cardiologists if deployed in a clinical setting.

7.
Deriving predictive models in medicine typically relies on a population approach, in which a single model is developed from a dataset of individuals. In this paper we describe and evaluate a personalized approach in which we construct a new type of decision tree model, called a decision-path model, that takes advantage of the particular features of a given person of interest. We introduce three personalized methods that derive personalized decision-path models. We compared the performance of these methods with that of Classification And Regression Trees (CART), a population decision tree method, in predicting seven different outcomes across five medical datasets. Two of the three personalized methods performed statistically significantly better than CART on area under the ROC curve (AUC) and Brier skill score. Learning decision-path models is a new, personalized approach to predictive modeling that can perform better than a population approach.
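For a population tree, the decision path of one individual is directly inspectable in scikit-learn. The sketch below shows the object the personalized methods are built around (the sequence of tests a given person actually traverses); it is not the authors' algorithm itself:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

# Population model: one CART-style tree fit to everyone.
data = load_breast_cancer()
tree = DecisionTreeClassifier(max_depth=4, random_state=0)
tree.fit(data.data, data.target)

patient = data.data[0:1]                      # the person of interest
node_indicator = tree.decision_path(patient)  # sparse matrix of visited nodes
path_nodes = node_indicator.indices           # node ids along this path
leaf = tree.apply(patient)[0]                 # terminal node for this patient
```

A personalized method can then reason about (or re-derive) only this path for the individual, rather than the full population tree.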

8.
Autism spectrum disorder comprises several neurodevelopmental conditions presenting symptoms in social communication and restricted, repetitive behaviors. A major roadblock for drug development for autism is the lack of robust behavioral signatures predictive of clinical efficacy. To address this issue, we further characterized, in a uniform and rigorous way, mouse models of autism that are of interest because of their construct validity and wide availability to the scientific community. We implemented a broad behavioral battery that included but was not restricted to core autism domains, with the goal of identifying robust, reliable phenotypes amenable to further testing. Here we describe comprehensive findings from two known mouse models of autism, obtained at different developmental stages, using a systematic behavioral test battery combining standard tests as well as novel, quantitative, computer-vision-based systems. The first mouse model recapitulates a deletion in human chromosome 16p11.2, found in 1% of individuals with autism. The second mouse model harbors homozygous null mutations in Cntnap2, associated with autism and Pitt-Hopkins-like syndrome. Consistent with previous results, 16p11.2 heterozygous null mice, also known as Del(7Slx1b-Sept1)4Aam, weighed less than wild-type littermates, displayed hyperactivity, and showed no social deficits. Cntnap2 homozygous null mice were also hyperactive, froze less during testing, showed a mild gait phenotype, and exhibited deficits in the three-chamber social preference test, although these were less robust than previously published. In the open field test with exposure to urine of an estrous female, however, the Cntnap2 null mice showed reduced vocalizations. In addition, Cntnap2 null mice performed slightly better in a cognitive procedural learning test.
Although finding and replicating robust behavioral phenotypes in animal models is a challenging task, such functional readouts remain important in the development of therapeutics, and we anticipate that both our positive and negative findings will be utilized as a resource for the broader scientific community.

9.

10.
N4-methylcytosine (4mC) is an important epigenetic modification that plays key roles in DNA repair, expression, and replication. Accurately identifying 4mC sites helps elucidate their biological functions and mechanisms. Because experimental identification of 4mC sites is both time-consuming and expensive, especially given the rapid accumulation of genomic sequences, effective computational methods are urgently needed as a complement, and a fast, accurate online platform for 4mC site prediction is therefore highly desirable. To date, there has been no comprehensive analysis and evaluation of machine learning (ML) methods across the different features required to build the necessary predictive models. We constructed multiple feature sets and applied five ML methods (e.g., random forest, support vector machine, and ensemble learning) to propose a predictor called DNA4mcEL. Under random 10-fold cross-validation, DNA4mcEL improved prediction accuracy over existing predictors for six species: C. elegans, D. melanogaster, A. thaliana, E. coli, G. subterraneus, and G. pickeringii. The DNA4mcEL predictor based on this method significantly outperforms existing predictors on this task. We hope that this comprehensive survey, and the strategy of building more accurate models, can serve as a useful guide to inspire the future development of computational methods for N4-methylcytosine prediction and accelerate the discovery of new N4-methylcytosine sites. A standalone version of DNA4mcEL is freely available at https://github.com/kukuky00/DNA4mcEL.git.
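A toy version of the feature-plus-classifier pipeline, assuming k-mer frequency features and a random forest (one of the five ML method families mentioned). The sequences and labels are fabricated, with GC content as an artificial stand-in signal, so this only illustrates the pipeline shape, not real 4mC biology:

```python
from itertools import product

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def kmer_counts(seq, k=2):
    """Normalized frequency vector over all k-mers of the DNA alphabet."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {km: i for i, km in enumerate(kmers)}
    v = np.zeros(len(kmers))
    for i in range(len(seq) - k + 1):
        v[index[seq[i:i + k]]] += 1
    return v / max(len(seq) - k + 1, 1)

rng = np.random.default_rng(0)

def random_seq(n, gc):
    # Random DNA with a chosen GC content (A, C, G, T probabilities).
    return "".join(rng.choice(list("ACGT"),
                              p=[(1 - gc) / 2, gc / 2, gc / 2, (1 - gc) / 2])
                   for _ in range(n))

# Fabricated 41-nt windows: "positive" windows are GC-rich (toy signal).
pos = [random_seq(41, 0.7) for _ in range(100)]
neg = [random_seq(41, 0.3) for _ in range(100)]
X = np.array([kmer_counts(s) for s in pos + neg])
y = np.array([1] * 100 + [0] * 100)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X[::2], y[::2])          # even-indexed samples as training set
acc = clf.score(X[1::2], y[1::2])  # odd-indexed samples as test set
```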

11.
Predicting potential species distributions: comparing SVM and GARP
Species distributions are closely linked to environmental factors, so using environmental factors as variables in species distribution models is currently the most common modeling approach; however, the vast majority of species distribution models face the intractable "high-dimension, small-sample" problem. This study demonstrates, both theoretically and in practice, that the support vector machine (SVM) algorithm, based on the structural risk minimization principle, is very well suited to "high-dimension, small-sample" classification problems. Using 20 Rhododendron species endemic to China as test cases, with specimen data and 11 environmental data layers at 1 km × 1 km resolution as model variables, we predicted their potential distributions in China and compared model performance through a comprehensive evaluation: expert assessment, receiver operating characteristic (ROC) curves, and the area under the curve (AUC). We implemented a species distribution prediction system with SVM at its core and showed experimentally that it far outperforms the widely used Genetic Algorithm for Rule-set Prediction (GARP) system in both computational speed and predictive performance.
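A small sketch of the "high-dimension, small-sample" setting with an RBF-kernel SVM, using synthetic records in place of the Rhododendron specimen data (11 variables mirror the 11 environmental layers; the presence rule is a toy assumption):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Few occurrence records, comparatively many environmental variables:
# the regime the paper argues favours SVM.
rng = np.random.default_rng(3)
n, d = 40, 11
X = rng.normal(size=(n, d))                      # 11 synthetic env. layers
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)    # toy presence/absence rule

svm = SVC(kernel="rbf", C=1.0)
# AUC under cross-validation, echoing the paper's ROC/AUC evaluation.
cv_auc = cross_val_score(svm, X, y, cv=5, scoring="roc_auc").mean()
```

Structural risk minimization keeps the SVM from over-fitting despite having only ~32 training records per fold against 11 dimensions.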

12.
MOTIVATION: Chemical carcinogenicity is an important subject in health and environmental sciences, and a reliable method is needed to identify characteristic factors for carcinogenicity. The Predictive Toxicology Challenge (PTC) 2000-2001 provided the opportunity for various data mining methods to evaluate their performance. The cascade model, a data mining method developed by the author, can mine for local correlations in data sets with a large number of attributes. The current paper explores the effectiveness of the method on the problem of chemical carcinogenicity. RESULTS: Rodent carcinogenicity of 417 compounds examined by the National Toxicology Program (NTP) was used as the training set. Analysis by the cascade model could, for example, obtain the rule 'among halogenated alkanes and alkenes, highly flexible molecules with no hydrogen bond acceptors are carcinogenic'. The resulting rules were applied to predict the activity of 185 compounds examined by the FDA. The ROC analysis performed by the PTC organizers showed that the current method has excellent predictive power for the female rat data. AVAILABILITY: The binary program of DISCAS 2.1 and sample input data sets for Windows PC are available at http://www.clab.kwansei.ac.jp/mining/discas/discas.html upon request from the author. SUPPLEMENTARY INFORMATION: A summary of prediction results and cross-validations is accessible via http://www.clab.kwansei.ac.jp/~okada/BIJ/BIJsupple.htm. The rules used and the prediction results for each molecule are also provided.

13.
14.
Computational prediction of ligand binding is a difficult problem, and the more accurate methods are extremely computationally expensive. Machine learning could make drug binding predictions by leveraging biomedical big data in place of time-intensive simulations. This paper reviews current trends in the use of machine learning for drug binding predictions, data sources for developing machine learning algorithms, and potential problems that may lead to overfitting and ungeneralizable models. A few popular datasets that can be used to develop virtual high-throughput screening models are characterized using spatial statistics to quantify potential biases. Evaluating some common benchmarks shows that good performance correlates with high predicted bias scores, while models with low bias scores have little predictive power. A better understanding of the limits of available data sources, and of how to fix them, will yield more generalizable models and, ultimately, novel drug discovery.

15.
Artificial neural networks are becoming increasingly popular as predictive statistical tools in ecosystem ecology and as models of signal processing in behavioural and evolutionary ecology. We demonstrate here that a commonly used network in ecology, the three-layer feed-forward network, trained with the backpropagation algorithm, can be extremely sensitive to the stochastic variation in training data that results from random sampling of the same underlying statistical distribution, with networks converging to several distinct predictive states. Using a random walk procedure to sample error-weight space, and Sammon dimensional reduction of weight arrays, we demonstrate that these different predictive states are not artefactual, due to local minima, but lie at the base of major error troughs in the error-weight surface. We further demonstrate that various gross weight compositions can produce the same predictive state, suggesting the analogy of weight space as a 'patchwork' of multiple predictive states. Our results argue for increased inclusion of stochastic training replication and analysis into ecological and behavioural applications of artificial neural networks.
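The sensitivity described above can be demonstrated by fitting the same small network to independent samples of one underlying function; sklearn's MLPRegressor stands in here for a three-layer backpropagation network, and the target function, noise level, and architecture are all assumptions for illustration:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(4)

def sample(n=60):
    # Independent draws from the same underlying distribution:
    # y = sin(2x) plus observation noise.
    x = rng.uniform(-2, 2, n)
    return x.reshape(-1, 1), np.sin(2 * x) + rng.normal(0, 0.3, n)

grid = np.linspace(-2, 2, 50).reshape(-1, 1)
preds = []
for _ in range(3):
    X, y = sample()
    # Identical architecture and initialization; only the training
    # sample differs between replicates.
    net = MLPRegressor(hidden_layer_sizes=(10,), max_iter=5000,
                       random_state=0)
    preds.append(net.fit(X, y).predict(grid))

# Disagreement between replicate networks across the input range.
spread = np.max(np.std(np.vstack(preds), axis=0))
```

A nonzero spread with fixed initialization shows that sampling variation alone moves the fitted network between predictive states, which is the paper's argument for replicated training.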

16.
Ensembles are a well-established machine learning paradigm, leading to accurate and robust models, and are predominantly applied to predictive modeling tasks. Ensemble models comprise a finite set of diverse predictive models whose combined output is expected to yield improved predictive performance compared to an individual model. In this paper, we propose a new method for learning ensembles of process-based models of dynamic systems. The process-based modeling paradigm employs domain-specific knowledge to automatically learn models of dynamic systems from time-series observational data. Previous work has shown that ensembles based on sampling observational data (i.e., bagging and boosting) significantly improve the predictive performance of process-based models. However, this improvement comes at the cost of a substantial increase in the computational time needed for learning. To address this problem, this paper proposes a method for efficiently learning ensembles of process-based models while maintaining their accurate long-term predictive performance. This is achieved by constructing ensembles that sample domain-specific knowledge instead of sampling data. We apply the proposed method to problems of automated predictive modeling in three lake ecosystems, using a library of process-based knowledge for modeling population dynamics, and evaluate its performance. The experimental results identify the optimal design decisions for the learning algorithm. The results also show that the proposed ensembles yield significantly more accurate predictions of population dynamics than individual process-based models. Finally, while their predictive performance is comparable to that of ensembles obtained with the state-of-the-art methods of bagging and boosting, they are substantially more efficient.
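For reference, the data-sampling baseline (bagging) against which the proposed knowledge-sampling ensembles are compared looks like this in scikit-learn, on synthetic rather than lake-ecosystem data; the knowledge-sampling variant would keep the same prediction-averaging step but vary the domain knowledge instead of resampling the data:

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(5)
X = rng.uniform(-3, 3, (400, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, 400)   # noisy dynamic-like signal
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.5, random_state=0)

# One base model versus a bagged ensemble of 50 base models, each fit
# to a bootstrap resample of the training data.
single = DecisionTreeRegressor(random_state=0).fit(Xtr, ytr)
bagged = BaggingRegressor(DecisionTreeRegressor(), n_estimators=50,
                          random_state=0).fit(Xtr, ytr)

mse_single = mean_squared_error(yte, single.predict(Xte))
mse_bagged = mean_squared_error(yte, bagged.predict(Xte))
```

Averaging over bootstrap replicates reduces the variance of the unpruned base trees, which is the performance gain the paper's method tries to retain at lower computational cost.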

17.
18.
19.
Classical paper-and-pencil risk assessment questionnaires are often accompanied by online versions to reach a wider population. This study focuses on the loss, especially in risk estimation performance, that can be inflicted by directly transferring a risk estimation calculator from paper to the web while ignoring the more complex and accurate calculations that online calculators make possible. We empirically compare the risk estimation performance of four major diabetes risk calculators and two more advanced predictive models. National Health and Nutrition Examination Survey (NHANES) data from 1999-2012 were used to evaluate performance in detecting diabetes and pre-diabetes. The American Diabetes Association risk test achieved the best predictive performance in the category of classical paper-and-pencil tests, with an area under the ROC curve (AUC) of 0.699 for undiagnosed diabetes (0.662 for pre-diabetes) and 47% of persons selected for screening (47% for pre-diabetes). Our results demonstrate a significant difference in performance, with the additional benefit of fewer persons selected for screening, when statistical methods are used. The best overall AUC for diabetes risk prediction was obtained using logistic regression, with an AUC of 0.775 (0.734) and an average of 34% (48%) of persons selected for screening. However, generalized boosted regression models might be a better option from an economic point of view, as their proportion of persons selected for screening, 30% (47%), is significantly lower for diabetes risk assessment than with logistic regression (p < 0.001), with a significantly higher AUC (p < 0.001) of 0.774 (0.740) for the pre-diabetes group. Our results demonstrate a serious lack of predictive performance in four major online diabetes risk calculators.
Therefore, one should take great care and consider optimizing online versions of questionnaires that were primarily developed as classical paper questionnaires.
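A schematic of the comparison above, on synthetic data rather than NHANES: logistic regression and a boosted model are scored by AUC, and "persons selected for screening" is the fraction flagged above a risk cut-off (the 0.5 cut-off here is an assumption; the study's cut-offs are not given in the abstract):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for NHANES-style tabular risk factors with a
# minority positive class (undiagnosed diabetes).
X, y = make_classification(n_samples=3000, n_features=10, n_informative=6,
                           weights=[0.85, 0.15], random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)

lr = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
gbm = GradientBoostingClassifier(random_state=0).fit(Xtr, ytr)

auc_lr = roc_auc_score(yte, lr.predict_proba(Xte)[:, 1])
auc_gbm = roc_auc_score(yte, gbm.predict_proba(Xte)[:, 1])

# Screening burden: fraction of people whose predicted risk exceeds
# the (assumed) cut-off and who would therefore be sent for testing.
cutoff = 0.5
selected_gbm = (gbm.predict_proba(Xte)[:, 1] >= cutoff).mean()
```

The study's point is that two models with similar AUC can differ meaningfully in this selected-for-screening fraction, which drives the economic comparison.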

20.
For many years, psychiatrists have tried to understand the factors involved in response to medications or psychotherapies in order to personalize their treatment choices. There is now a broad and growing interest in the idea that we can develop models to personalize treatment decisions by applying new statistical approaches from the field of machine learning to larger volumes of data. In this pursuit, there has been a paradigm shift away from experimental studies designed to confirm or refute specific hypotheses and towards a focus on the overall explanatory power of a predictive model when tested on new, unseen datasets. In this paper, we review key studies using machine learning to predict treatment outcomes in psychiatry, ranging from medications and psychotherapies to digital interventions and neurobiological treatments. Next, we focus on some new sources of data being used to develop predictive models based on machine learning, such as electronic health records and smartphone and social media data, and on the potential utility of data from genetics, electrophysiology, neuroimaging and cognitive testing. Finally, we discuss how far the field has come towards implementing prediction tools in real-world clinical practice. Relatively few retrospective studies to date include appropriate external validation procedures, and there are even fewer prospective studies testing the clinical feasibility and effectiveness of predictive models. Applications of machine learning in psychiatry face some of the same ethical challenges posed by these techniques in other areas of medicine or computer science, which we discuss here. In short, machine learning is a nascent but important approach to improving the effectiveness of mental health care, and several prospective clinical studies suggest that it may be working already.
