Similar articles
1.

Background

The use of internet search data has been demonstrated to be effective at predicting influenza incidence. This approach may be more successful for dengue which has large variation in annual incidence and a more distinctive clinical presentation and mode of transmission.

Methods

We gathered freely-available dengue incidence data from Singapore (weekly incidence, 2004–2011) and Bangkok (monthly incidence, 2004–2011). Internet search data for the same period were downloaded from Google Insights for Search. Search terms were chosen to reflect three categories of dengue-related search: nomenclature, signs/symptoms, and treatment. We compared three models to predict incidence: a step-down linear regression, generalized boosted regression, and negative binomial regression. Logistic regression and Support Vector Machine (SVM) models were used to predict a binary outcome defined by whether dengue incidence exceeded a chosen threshold. Incidence prediction models were assessed using and Pearson correlation between predicted and observed dengue incidence. Logistic and SVM model performance were assessed by the area under the receiver operating characteristic curve. Models were validated using multiple cross-validation techniques.
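The two evaluation measures named above can be illustrated with a minimal sketch. It is not the authors' code: the function names (`pearson`, `percentile_cutoff`, `auc`) and the toy numbers are hypothetical, and the percentile uses linear interpolation, which is only one of several common conventions.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between predicted and observed values."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def percentile_cutoff(values, q):
    """q-th percentile (0-100) with linear interpolation between ranks."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def auc(labels, scores):
    """Area under the ROC curve as the probability that a randomly chosen
    positive case is scored above a randomly chosen negative case
    (ties count as one half)."""
    pos = [s for label, s in zip(labels, scores) if label]
    neg = [s for label, s in zip(labels, scores) if not label]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical incidence series: define "high incidence" by the 75th percentile.
observed = [10, 20, 30, 40, 50, 60, 70, 80]
predicted = [12, 18, 33, 35, 55, 58, 90, 75]
cutoff = percentile_cutoff(observed, 75)
high = [value > cutoff for value in observed]
print(pearson(predicted, observed), auc(high, predicted))
```

The pairwise form of the AUC is quadratic in the number of cases. That is fine for weekly or monthly series of this size, but a rank-based formula would be preferable for large datasets.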

Results

The linear model selected by AIC step-down was found to be superior to other models considered. In Bangkok, the model has an , and a correlation of 0.869 between fitted and observed. In Singapore, the model has an , and a correlation of 0.931. In both Singapore and Bangkok, SVM models outperformed logistic regression in predicting periods of high incidence. The AUC for the SVM models using the 75th percentile cutoff is 0.906 in Singapore and 0.960 in Bangkok.

Conclusions

Internet search terms predict incidence and periods of large incidence of dengue with high accuracy and may prove useful in areas with underdeveloped surveillance systems. The methods presented here use freely available data and analysis tools and can be readily adapted to other settings.

2.
An increasing number of genes have been experimentally confirmed in recent years as causative genes for various human diseases. The newly available knowledge can be exploited by machine learning methods to discover additional unknown genes that are likely to be associated with diseases. In particular, positive unlabeled learning (PU learning) methods, which require only a positive training set P (confirmed disease genes) and an unlabeled set U (the unknown candidate genes) instead of a negative training set N, have been shown to be effective in uncovering new disease genes in the current scenario. Using only a single source of data for prediction can be susceptible to bias due to incompleteness and noise in the genomic data, and a single machine learning predictor is prone to bias caused by the inherent limitations of individual methods. In this paper, we propose an effective PU learning framework that integrates multiple biological data sources and an ensemble of powerful machine learning classifiers for disease gene identification. Our proposed method integrates data from multiple biological sources for training PU learning classifiers. A novel ensemble-based PU learning method, EPU, is then used to integrate multiple PU learning classifiers to achieve accurate and robust disease gene predictions. Our evaluation experiments across six disease groups showed that EPU achieved significantly better results compared with various state-of-the-art prediction methods as well as ensemble learning classifiers. Through integrating multiple biological data sources for training and the outputs of an ensemble of PU learning classifiers for prediction, we are able to minimize the potential bias and errors in individual data sources and machine learning algorithms to achieve more accurate and robust disease gene predictions. In the future, our EPU method provides an effective framework to integrate additional biological and computational resources for better disease gene predictions.

3.
The prediction of the network of protein-protein interactions (PPI) of an organism is crucial for the understanding of biological processes and for the development of new drugs. Machine learning methods have been successfully applied to the prediction of PPI in yeast by the integration of multiple direct and indirect biological data sources. However, experimental data are not available for most organisms. We propose here an ensemble machine learning approach for the prediction of PPI that depends solely on features independent from experimental data. We developed new estimators of the coevolution between proteins and combined them in an ensemble learning procedure. We applied this method to a dataset of known co-complexed proteins in Escherichia coli and compared it to previously published methods. We show that our method allows prediction of PPI with an unprecedented precision of 95.5% for the first 200 sorted pairs of proteins compared to 28.5% on the same dataset with the previous best method. A close inspection of the best predicted pairs allowed us to detect new or recently discovered interactions between chemotactic components, the flagellar apparatus and RNA polymerase complexes in E. coli.
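The "precision for the first 200 sorted pairs" reported above is a precision-at-k measure. A minimal sketch of how such a figure can be computed follows. The function name `precision_at_k` and the toy protein identifiers are hypothetical, and pairs are treated as unordered, which is an assumption about the data.

```python
def precision_at_k(scored_pairs, known_interactions, k):
    """Fraction of the k highest-scoring pairs that are known interactions.

    scored_pairs: list of ((protein_a, protein_b), score) tuples.
    known_interactions: set of frozensets, so (a, b) and (b, a) match.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    ranked = sorted(scored_pairs, key=lambda item: item[1], reverse=True)
    top = ranked[:k]
    hits = sum(1 for pair, _ in top if frozenset(pair) in known_interactions)
    return hits / len(top)


# Hypothetical scores for four candidate pairs.
scored = [
    (("a", "b"), 0.9),
    (("c", "d"), 0.8),
    (("e", "f"), 0.7),
    (("g", "h"), 0.1),
]
known = {frozenset(("b", "a")), frozenset(("e", "f"))}
print(precision_at_k(scored, known, 2))
```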

4.
《Genomics》2020,112(1):837-847
Background

Glioma is the most lethal nervous system cancer. Recent studies have made great efforts to study the occurrence and development of glioma, but the molecular mechanisms are still unclear. This study was designed to reveal the molecular mechanisms of glioma based on protein-protein interaction networks combined with machine learning methods. Key differentially expressed genes (DEGs) were screened and selected using protein-protein interaction (PPI) networks.

Results

A total of 19 genes were selected between grade I and grade II, 21 genes between grade II and grade III, and 20 genes between grade III and grade IV. Five machine learning methods were then employed to predict glioma grade based on the selected key genes. After comparison, a Complement Naive Bayes classifier was used to build the prediction model for grade II-III, with an accuracy of 72.8%, and Random Forest was used to build the prediction models for grade I-II and grade III-IV, with accuracies of 97.1% and 83.2%, respectively. Finally, the selected genes were analyzed by PPI networks, Gene Ontology (GO) terms and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways, and the results improve our understanding of the biological functions of the selected DEGs involved in glioma growth. We expect that the key genes identified have guiding significance for the occurrence of gliomas or, at the very least, that they are useful for tumor researchers.

Conclusion

Machine learning combined with PPI networks, GO and KEGG analyses of selected DEGs improves our understanding of the biological functions involved in glioma growth.

5.
Background and objective

Differentiating tropical infections is difficult because their clinical and laboratory presentations are similar. Sophisticated differential tests and prediction tools are better ways to tackle this issue. Here, we aimed to develop a clinician-assisted decision-making tool to differentiate the common tropical infections.

Methodology

A cross-sectional study using a 9-item self-administered questionnaire was performed to understand the need for a decision-making tool and its parameters. The most significant differential parameters among the identified infections were measured through a retrospective study, and a decision tree was developed. Based on the parameters identified, a multinomial logistic regression model and a machine learning model were developed to better differentiate the infections.

Results

A total of 40 physicians involved in the management of tropical infections were included in the need analysis. Dengue, malaria, leptospirosis and scrub typhus were the common tropical infections in our settings. Sodium, total bilirubin, albumin, lymphocytes and platelets were the laboratory parameters, and abdominal pain, arthralgia, myalgia and urine output were the clinical presentations, identified as better predictors. Multinomial logistic regression analysis with dengue as the reference revealed a predictability of 60.7%, 62.5% and 66% for dengue, malaria and leptospirosis, respectively, whereas scrub typhus showed a predictability of only 38%. The multi-class machine learning model was observed to have an overall predictability of 55–60%, whereas binary classification machine learning algorithms showed an average of 79–84% for one-vs-other and 69–88% for one-vs-one disease categories.

Conclusion

This is a first-of-its-kind study in which statistical and machine learning approaches were explored simultaneously for differentiating tropical infections. Machine learning techniques in healthcare sectors will aid in early detection and better patient care.

6.
Purpose

To predict the incidence of radiation-induced hypothyroidism (RHT) in nasopharyngeal carcinoma (NPC) patients, prediction models based on dosiomics features were established.

Materials and methods

A total of 145 NPC patients treated with radiotherapy from January 2012 to January 2015 were included. Dosiomics features of the dose distribution within the thyroid gland were extracted. The minimal-redundancy-maximal-relevance (mRMR) criterion was used to rank the extracted features and select the most relevant ones. Machine learning (ML) algorithms including logistic regression (LR), support vector machine (SVM), random forest (RF), and k-nearest neighbor (KNN) were each used to establish prediction models. Nested sampling and hyper-tuning methods were adopted to train and validate the prediction models. The dosiomics-based (DO) prediction models were evaluated by comparison with the dose-volume factor-based (DV) models in terms of the area under the receiver operating characteristic (ROC) curve (AUC). The demographic factors (age and gender) were included in both the DO and DV models.

Results

Age, V45 and 37 dosiomics features exhibited significant correlations with RHT in univariate analysis. For prediction performance, the DO prediction models performed better, with a best AUC of 0.7 compared with 0.61 for the DV prediction models. In the DO prediction models, the AUC first rose and then fell as the number of selected features increased. The highest AUC was achieved when the number of selected features was 3. No similar trend was observed in the DV prediction model.

Conclusion

This study established a prediction model based on dosiomics features that performed better than conventional dose-volume factors, which may enable early prediction of possible RHT among NPC patients who have received radiotherapy and allow precautionary measures to be taken.
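The mRMR ranking step can be sketched as a greedy loop: pick the feature most relevant to the outcome, then repeatedly add the feature whose relevance minus its mean redundancy with the already selected features is largest. The sketch below is an illustration under stated assumptions, not the study's implementation. It uses absolute Pearson correlation as a stand-in for both relevance and redundancy (mRMR is usually defined with mutual information), and the feature names and values are hypothetical.

```python
import math


def abs_corr(xs, ys):
    """Absolute Pearson correlation; returns 0.0 for a constant column."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return abs(cov) / math.sqrt(var_x * var_y)


def mrmr_rank(features, target, n_select):
    """Greedy mRMR-style selection (difference form).

    features: dict mapping feature name -> list of values.
    target: list of outcome values (e.g. 0/1 for RHT).
    """
    relevance = {name: abs_corr(col, target) for name, col in features.items()}
    remaining = list(features)
    selected = []

    def score(name):
        if not selected:
            return relevance[name]
        redundancy = sum(abs_corr(features[name], features[s]) for s in selected)
        return relevance[name] - redundancy / len(selected)

    while remaining and len(selected) < n_select:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical toy data: the second feature nearly duplicates the first.
toy_features = {
    "dose_mean": [1, 2, 3, 4, 5, 6],
    "dose_mean_noisy": [1, 2, 3, 4, 5, 7],
    "texture": [0, 1, 0, 1, 0, 1],
}
toy_outcome = [0, 0, 0, 1, 1, 1]
print(mrmr_rank(toy_features, toy_outcome, 2))
```

In this toy case the near-duplicate feature is individually more relevant than `texture`, but it is passed over in the second step because it adds almost no new information.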

7.
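Machine learning algorithms and their applications in environmental microbiology research   Total citations: 1, self-citations: 0, citations by others: 1
Chen He  Tao Ye  Mao Zhendu  Xing Peng 《Acta Microbiologica Sinica》(微生物学报) 2022,62(12):4646-4662
Microorganisms are ubiquitous in the environment. They are not only key participants in biogeochemical cycles and environmental evolution, but also play important roles in environmental monitoring, ecological remediation, and conservation. With the development of high-throughput technologies, large volumes of microbial data are being generated. Using machine learning to model and analyze environmental microbial big data is of great significance for both scientific research and societal applications in areas such as microbial biomarker identification, pollutant prediction, and environmental quality prediction. Machine learning can be divided into two broad categories: supervised learning and unsupervised learning. In microbiome research, unsupervised learning efficiently learns the features of the input data through methods such as clustering and dimensionality reduction, and thereby integrates and groups microbial data. Supervised learning trains models on microbial datasets that have both features and labels. When presented with data that have features but no labels, the model can infer the labels, which enables classification, identification, and prediction for new data. However, complex machine learning algorithms usually focus on predictive accuracy at the expense of interpretability. Machine learning models can often be regarded as "black boxes" that predict a particular outcome, meaning that little is known about how the model arrives at its predictions. To apply machine learning more widely in microbiome research and improve our ability to extract valuable microbial information, it is particularly important to understand machine learning algorithms in depth and to improve model interpretability. This article introduces the machine learning algorithms commonly used in environmental microbiology and the steps for building machine learning models from microbiome data, including feature selection, algorithm selection, model construction, and evaluation. It also reviews applications of various machine learning models in environmental microbiology, explores the associations between microbiomes and their surrounding environment, discusses methods for improving model interpretability, and provides a scientific reference for future environmental monitoring and environmental health prediction.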

8.
Objective

Talaromycosis is a serious regional disease endemic in Southeast Asia. In China, Talaromyces marneffei (T. marneffei) infections are mainly concentrated in the southern region, especially in Guangxi, and cause considerable in-hospital mortality in HIV-infected individuals. Currently, the factors that influence in-hospital death of HIV/AIDS patients with T. marneffei infection are not completely clear. Existing machine learning techniques can be used to develop a predictive model that identifies relevant prognostic factors to predict death, which appears to be essential to reducing in-hospital mortality.

Methods

We prospectively enrolled HIV/AIDS patients with talaromycosis in the Fourth People's Hospital of Nanning, Guangxi, from January 2012 to June 2019. Clinical features were selected and used to train four different machine learning models (logistic regression, XGBoost, KNN, and SVM) to predict the treatment outcome of hospitalized patients, and 30% internal validation was used to evaluate the performance of the models. Machine learning model performance was assessed according to a range of learning metrics, including the area under the receiver operating characteristic curve (AUC). The SHapley Additive exPlanations (SHAP) tool was used to explain the model.

Results

A total of 1927 HIV/AIDS patients with T. marneffei infection were included. The average in-hospital mortality rate was 13.3% (256/1927) from 2012 to 2019. The most common complications/coinfections were pneumonia (68.9%), followed by oral candida (47.5%) and tuberculosis (40.6%). Deceased patients showed higher CD4/CD8 ratios, aspartate aminotransferase (AST) levels, creatinine levels, urea levels, uric acid (UA) levels, lactate dehydrogenase (LDH) levels, total bilirubin levels, creatine kinase levels, white blood cell (WBC) counts, neutrophil counts, procalcitonin levels and C-reactive protein (CRP) levels, and lower CD3+ T-cell counts, CD8+ T-cell counts, lymphocyte counts, platelet (PLT) counts, high-density lipoprotein cholesterol (HDL) levels and hemoglobin (Hb) levels than surviving patients. The predictive XGBoost model exhibited 0.71 sensitivity, 0.99 specificity, and 0.97 AUC in the training dataset, and our outcome prediction model provided robust discrimination in the testing dataset, showing an AUC of 0.90 with 0.69 sensitivity and 0.96 specificity. The other three models were ruled out due to poor performance. Septic shock and respiratory failure were the most important predictive features, followed by uric acid, urea, platelets, and the AST/ALT ratio.

Conclusion

The XGBoost machine learning model is a good predictor of the hospitalization outcome of HIV/AIDS patients with T. marneffei infection. The model may have potential application in mortality prediction and high-risk factor identification in the talaromycosis population.
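Sensitivity and specificity, as reported above, follow directly from the confusion matrix of predicted versus actual outcomes. A minimal sketch with hypothetical labels (1 = died in hospital, 0 = survived); the function name is illustrative and not taken from the study.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels coded 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative case")
    sensitivity = tp / (tp + fn)  # share of actual deaths that were flagged
    specificity = tn / (tn + fp)  # share of survivors correctly cleared
    return sensitivity, specificity


# Hypothetical outcomes for seven patients.
actual = [1, 1, 1, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 0, 0, 1]
print(sensitivity_specificity(actual, predicted))
```

With an in-hospital mortality near 13%, a model can reach high specificity while still missing a sizeable share of deaths, so the two figures are best read together.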

9.
Background

Previous epidemiological studies have examined the prevalence and risk factors for a variety of parasitic illnesses, including protozoan and soil-transmitted helminth (STH, e.g., hookworms and roundworms) infections. Despite advancements in machine learning for data analysis, the majority of these studies use traditional logistic regression to identify significant risk factors.

Methods

In this study, we used data from a survey of 54 risk factors for intestinal parasitosis in 954 Ethiopian school children. We investigated whether machine learning approaches can supplement traditional logistic regression in identifying intestinal parasite infection risk factors. We used feature selection methods such as InfoGain (IG), ReliefF (ReF), Joint Mutual Information (JMI), and Minimum Redundancy Maximum Relevance (MRMR). Additionally, we predicted children's parasitic infection status using classifiers such as Logistic Regression (LR), Support Vector Machines (SVM), Random Forests (RF) and XGBoost (XGB), and compared their accuracy and area under the receiver operating characteristic curve (AUROC) scores. For optimal model training, we performed tenfold cross-validation and tuned the classifier hyperparameters. We balanced our dataset using the Synthetic Minority Oversampling (SMOTE) method. Additionally, we used association rule learning to establish a link between risk factors and parasitic infections.

Key findings

Our study demonstrated that machine learning could be used in conjunction with logistic regression. Using machine learning, we developed models that accurately predicted four parasitic infections: any parasitic infection at 79.9% accuracy, helminth infection at 84.9%, any STH infection at 95.9%, and protozoan infection at 94.2%. The Random Forests (RF) and Support Vector Machines (SVM) classifiers achieved the highest accuracy when the top 20 risk factors were considered using Joint Mutual Information (JMI) or when all features were used. The best predictors of infection were socioeconomic, demographic, and hematological characteristics.

Conclusions

We demonstrated that feature selection and association rule learning are useful strategies for detecting risk factors for parasite infection. Additionally, we showed that advanced classifiers might be utilized to predict children's parasitic infection status. When combined with standard logistic regression models, machine learning techniques can identify novel risk factors and predict infection risk.
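Tenfold cross-validation, as used above, partitions the samples into ten folds and holds each fold out once for testing. A minimal sketch of the index bookkeeping follows. The function name `k_fold_indices` and the seed are hypothetical, and the split is not stratified by infection status, which a real analysis might prefer.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed for reproducibility
    folds = [indices[i::k] for i in range(k)]  # fold sizes differ by at most 1
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 954 matches the number of children in the survey described above.
splits = list(k_fold_indices(954, k=10, seed=0))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

When oversampling such as SMOTE is combined with cross-validation, it is commonly applied to the training indices of each fold only, so that synthetic samples do not leak into the held-out fold. Whether the study did this is not stated in the abstract.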

10.
Background

In the past few decades, several researchers have proposed highly accurate prediction models that have typically relied on climate parameters. However, climate factors can be unreliable and can lower the effectiveness of prediction when they are applied in locations where climate factors do not differ significantly. The purpose of this study was to improve a dengue surveillance system in areas with similar climate by exploiting the infection rate in the Aedes aegypti mosquito and using the support vector machine (SVM) technique for forecasting the dengue morbidity rate.

Conclusions

The infection rates of Ae. aegypti female mosquitoes and larvae improved morbidity rate forecasting more than the climate parameters used in classical frameworks. We demonstrated that the SVM-R-based model has high generalization performance and obtained the highest prediction performance compared to classical models, as measured by accuracy, sensitivity, specificity, and mean absolute error (MAE).

11.
12.
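Using genomic data and bioinformatics methods to rapidly identify resistance genes and predict resistance phenotypes provides a powerful aid for monitoring bacterial antimicrobial resistance. Dozens of resistance databases and associated analysis tools are currently available, providing data and technical means for identifying bacterial resistance genes and predicting resistance phenotypes. As bacterial genome data continue to grow and resistance phenotype data continue to accumulate, big data and machine learning can better establish the correlation between resistance phenotypes and genomic information, and building efficient resistance phenotype prediction models has therefore become a research hotspot. Focusing on the identification of bacterial resistance genes and the prediction of resistance phenotypes, this article discusses resistance-related databases, theories and methods for identifying resistance features, and machine learning and phenotype prediction on resistance data, with the aim of providing tools and ideas for research on bacterial resistance.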

13.
Protein function prediction with high-throughput data   Total citations: 1, self-citations: 0, citations by others: 1
Zhao XM  Chen L  Aihara K 《Amino acids》2008,35(3):517-530

14.
《IRBM》2020,41(2):71-79
Objectives

Heart failure is a group of complex clinical syndromes in which abnormal heart structure or function impairs ventricular filling or ejection. Difficult treatment, poor prognosis and high mortality are the main characteristics of heart failure. Using admission data and medication history, we aimed to predict the 30-day mortality of patients with heart failure and to determine the main features affecting it.

Material and methods

From data collected at the Shanxi Academy of Medical Sciences between April 2016 and July 2018, we selected by screening the records of 4,682 heart failure patients, 539 of whom died in hospital. We built a 30-day mortality prediction model for patients with heart failure. The model fuses clinical data and text data through multiple kernel learning and feeds the fused data into a recurrent attention model. It not only predicts the 30-day mortality of patients with heart failure but also identifies the factors influencing their prognosis.

Results

The prediction accuracy of the recurrent attention network is clearly higher than that of other machine learning models, reaching 93.4%. The area under the ROC curve (AUC) of the model reaches 87%, which is clearly higher than that of traditional machine learning models such as decision tree, naive Bayes and support vector machine. In addition, the model indicates that New York heart function classification, age, NT-proBNP, LVEF, β-blockers, ventricular arrhythmia, high blood pressure, coronary heart disease (CHD) and bronchitis were independent risk factors for death. Patients treated with revascularization, ACEI/ARB drugs, β-blockers or spironolactone had a better prognosis than non-users. This provides an important reference for doctors to better treat and manage patients with heart failure.

Conclusion

Experiments show that the prognostic performance of the recurrent attention model is significantly higher than that of other traditional machine learning models. Because the model adds an attention mechanism, it identifies the important features affecting the prognostic results, which enables doctors to prescribe drugs according to the symptoms, take timely precautions and help patients receive treatment in time.

15.
MOTIVATION: Both small interfering RNAs (siRNAs) and antisense oligonucleotides can selectively block gene expression. Although the two methods rely on different cellular mechanisms, these methods share the common property that not all oligonucleotides (oligos) are equally effective. That is, if mRNA target sites are picked at random, many of the antisense or siRNA oligos will not be effective. Algorithms that can reliably predict the efficacy of candidate oligos can greatly reduce the cost of knockdown experiments, but previous attempts to predict the efficacy of antisense oligos have had limited success. Machine learning has not previously been used to predict siRNA efficacy. RESULTS: We develop a genetic programming based prediction system that shows promising results on both antisense and siRNA efficacy prediction. We train and evaluate our system on a previously published database of antisense efficacies and our own database of siRNA efficacies collected from the literature. The best models gave an overall correlation between predicted and observed efficacy of 0.46 on both antisense and siRNA data. As a comparison, the best correlations of support vector machine classifiers trained on the same data were 0.40 and 0.30, respectively.

16.
To predict rice blast, many machine learning methods have been proposed. As the quality and quantity of input data are essential for machine learning techniques, this study develops three artificial neural network (ANN)-based rice blast prediction models by combining two ANN models, the feed-forward neural network (FFNN) and long short-term memory (LSTM), with diverse input datasets, and compares their performance. The Blast_Weather_FFNN model had the highest recall score (66.3%) for rice blast prediction. This model requires two types of input data: blast occurrence data for the last 3 years and weather data (daily maximum temperature, relative humidity, and precipitation) between January and July of the prediction year. This study showed that the performance of an ANN-based disease prediction model was improved by applying suitable machine learning techniques together with the optimization of hyperparameter tuning involving input data. Moreover, we highlight the importance of the systematic collection of long-term disease data.

17.
Yuan Z  Burrage K  Mattick JS 《Proteins》2002,48(3):566-570
A Support Vector Machine learning system has been trained to predict protein solvent accessibility from the primary structure. Different kernel functions and sliding window sizes have been explored to find how they affect the prediction performance. Using a cut-off threshold of 15% that splits the dataset evenly (an equal number of exposed and buried residues), this method was able to achieve a prediction accuracy of 70.1% for single sequence input and 73.9% for multiple alignment sequence input, respectively. The prediction of three and more states of solvent accessibility was also studied and compared with other methods. The prediction accuracies are better than, or comparable to, those obtained by other methods such as neural networks, Bayesian classification, multiple linear regression, and information theory. In addition, our results further suggest that this system may be combined with other prediction methods to achieve more reliable results, and that the Support Vector Machine method is a very useful tool for biological sequence analysis.
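The sliding-window input and the two-state labels described above can be sketched as follows. This is an illustration only: the padding symbol `X`, the function names, and the choice to treat a residue at exactly 15% as buried are assumptions, not details from the paper.

```python
def sliding_windows(sequence, window_size, pad="X"):
    """Return one window per residue, centred on that residue.

    The sequence is padded at both ends so that terminal residues
    also get a full-length window. window_size must be odd.
    """
    if window_size % 2 == 0:
        raise ValueError("window_size must be odd")
    half = window_size // 2
    padded = pad * half + sequence + pad * half
    return [padded[i:i + window_size] for i in range(len(sequence))]


def binarize_accessibility(relative_accessibility, cutoff=15.0):
    """Label residues as exposed or buried using a percentage cutoff."""
    return ["exposed" if value > cutoff else "buried"
            for value in relative_accessibility]


# Hypothetical four-residue sequence and accessibility values (percent).
print(sliding_windows("ACDE", 3))
print(binarize_accessibility([5.0, 15.0, 40.0, 22.5]))
```

Each window would then be encoded numerically (for example one-hot per position) before being passed to a classifier. That encoding step is omitted here.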

18.
BackgroundMachine learning (ML) has been gradually integrated into oncologic research but seldom applied to predict cervical cancer (CC), and no model has been reported to predict survival and site-specific recurrence simultaneously. Thus, we aimed to develop ML models to predict survival and site-specific recurrence in CC and to guide individual surveillance.MethodsWe retrospectively collected data on CC patients from 2006 to 2017 in four hospitals. The survival or recurrence predictive value of the variables was analyzed using multivariate Cox, principal component, and K-means clustering analyses. The predictive performances of eight ML models were compared with logistic or Cox models. A novel web-based predictive calculator was developed based on the ML algorithms.ResultsThis study included 5112 women for analysis (268 deaths, 343 recurrences): (1) For site-specific recurrence, larger tumor size was associated with local recurrence, while positive lymph nodes were associated with distant recurrence. (2) The ML models exhibited better prognostic predictive performance than traditional models. (3) The ML models were superior to traditional models when multiple variables were used. (4) A novel predictive web-based calculator was developed and externally validated to predict survival and site-specific recurrence.ConclusionML models might be a better analytic approach in CC prognostic prediction than traditional models as they can predict survival and site-specific recurrence simultaneously, especially when using multiple variables. Moreover, our novel web-based calculator may provide clinicians with useful information and help them make individual postoperative follow-up plans and further treatment strategies.  相似文献   

19.
Prediction of protein–protein interactions (PPIs) commonly involves a significant computational component. Rapid recent advances in the power of computational methods for protein interaction prediction motivate a review of the state-of-the-art. We review the major approaches, organized according to the primary source of data utilized: protein sequence, protein structure, and protein co-abundance. The advent of deep learning (DL) has brought with it significant advances in interaction prediction, and we show how DL is used for each source data type. We review the literature taxonomically, present example case studies in each category, and conclude with observations about the strengths and weaknesses of machine learning methods in the context of the principal sources of data for protein interaction prediction.

20.
Pluripotent stem cells are able to self-renew, and to differentiate into all adult cell types. Many studies report data describing these cells, and characterize them in molecular terms. Machine learning yields classifiers that can accurately identify pluripotent stem cells, but there is a lack of studies yielding minimal sets of best biomarkers (genes/features). We assembled gene expression data of pluripotent stem cells and non-pluripotent cells from the mouse. After normalization and filtering, we applied machine learning, classifying samples into pluripotent and non-pluripotent with high cross-validated accuracy. Furthermore, to identify minimal sets of best biomarkers, we used three methods: information gain, random forests and a wrapper of genetic algorithm and support vector machine (GA/SVM). We demonstrate that the GA/SVM biomarkers work best in combination with each other; pathway and enrichment analyses show that they cover the widest variety of processes implicated in pluripotency. The GA/SVM wrapper yields best biomarkers, no matter which classification method is used. The consensus best biomarker based on the three methods is Tet1, implicated in pluripotency just recently. The best biomarker based on the GA/SVM wrapper approach alone is Fam134b, possibly a missing link between pluripotency and some standard surface markers of unknown function processed by the Golgi apparatus.
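Of the three biomarker-ranking methods named above, information gain is the simplest to illustrate: it measures how much splitting the samples on a gene's expression level reduces the entropy of the class labels. A minimal sketch follows, assuming a single expression threshold per gene. The threshold, the function names and the toy values are hypothetical.

```python
import math


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def information_gain(expression, labels, threshold):
    """Entropy reduction from splitting samples at an expression threshold."""
    low = [l for value, l in zip(expression, labels) if value <= threshold]
    high = [l for value, l in zip(expression, labels) if value > threshold]
    n = len(labels)
    weighted = (len(low) / n) * entropy(low) + (len(high) / n) * entropy(high)
    return entropy(labels) - weighted


# Hypothetical data: 1 = pluripotent, 0 = non-pluripotent.
labels = [1, 1, 0, 0]
informative_gene = [5.0, 6.0, 1.0, 2.0]
uninformative_gene = [1.0, 5.0, 2.0, 6.0]
print(information_gain(informative_gene, labels, 3.0))
print(information_gain(uninformative_gene, labels, 3.0))
```

Information gain scores each gene on its own, which is consistent with the abstract's observation that a wrapper method can find genes that work best in combination.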
