Similar literature
 17 similar documents found
1.
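Machine learning algorithms and their applications in environmental microbiology research   Total citations: 1, self-citations: 0, citations by others: 1
Chen He (陈鹤), Tao Ye (陶晔), Mao Zhendu (毛振镀), Xing Peng (邢鹏). Acta Microbiologica Sinica (《微生物学报》), 2022, 62(12): 4646-4662
Microorganisms are ubiquitous in the environment. They are key participants in biogeochemical cycles and environmental evolution, and they play important roles in environmental monitoring, ecological remediation, and conservation. With the development of high-throughput technologies, large volumes of microbial data have been generated. Using machine learning to model and analyze environmental microbial big data is of great significance for both scientific research and societal applications in areas such as microbial biomarker identification, pollutant prediction, and environmental quality prediction. Machine learning can be divided into two broad categories: supervised learning and unsupervised learning. In microbiome research, unsupervised learning efficiently learns the features of the input data through methods such as clustering and dimensionality reduction, and thereby integrates and groups microbial data. Supervised learning trains models on microbial datasets that have both features and labels; when presented with data that have features but no labels, the model can infer the labels, enabling classification, identification, and prediction for new data. However, complex machine learning algorithms usually focus on predictive accuracy at the expense of interpretability. Machine learning models can often be regarded as "black boxes" that predict specific outcomes, meaning that little is known about how a model arrives at its predictions. To apply machine learning more widely in microbiome research and to improve our ability to extract valuable microbial information, it is especially important to understand machine learning algorithms in depth and to improve model interpretability. This article introduces the machine learning algorithms commonly used in environmental microbiology and the steps for building machine learning models based on microbiome data, including feature selection, algorithm selection, model construction, and evaluation. It also reviews applications of various machine learning models in environmental microbiology, explores the associations between microbiomes and their surrounding environment, discusses methods for improving model interpretability, and provides a scientific reference for future environmental monitoring and environmental health prediction.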

2.
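Matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF MS) is an emerging high-throughput technology that has been widely applied to the rapid identification of clinical, food, and aquatic microorganisms. A major challenge currently facing the technology is how to further improve the resolution of MALDI-TOF MS in microbial identification. To process large volumes of high-dimensional microbial MALDI-TOF MS data efficiently, a variety of machine learning algorithms have been applied. This article reviews the application of machine learning to microbial identification by MALDI-TOF MS. It first introduces the workflow of machine learning in microbial MALDI-TOF MS classification, then describes the characteristics of MALDI-TOF MS data, MALDI-TOF MS databases, data preprocessing, and model performance evaluation. It then discusses the application of typical machine learning classification algorithms and ensemble learning algorithms. Simple machine learning algorithms struggle to meet the high-resolution demands of microbial MALDI-TOF MS classification, whereas combining different machine learning algorithms and ensemble learning algorithms can achieve better microbial classification performance. For preprocessing MALDI-TOF MS data, wavelet algorithms and genetic algorithms are the most widely used; they...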

3.
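With the development of second-generation DNA sequencing technology, researchers have accumulated large amounts of gut microbiota data. Studies have shown that the gut microbiota is closely linked to host health, so how to model and analyze complex, high-dimensional gut microbiota data is an important challenge in current bioinformatics research. The rise of artificial intelligence has made it possible to process gut microbiota data and reveal the complex relationships between the gut microbiota and host phenotypes. This article reviews current research on the relationship between gut microbiota and host phenotypes, focusing on the theoretical principles of five commonly used machine learning algorithms (linear regression, support vector machine, K-nearest neighbors, random forest, and artificial neural network) and their applications in related research. It offers recommendations on selecting machine learning algorithms for predicting host phenotypes and discusses future developments in the field, with the aim of providing a reference for using machine learning to predict host phenotypes from gut microbiota.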
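As an illustration of one of the five algorithms listed in this abstract, the following is a minimal sketch of a K-nearest-neighbors classifier. It is not taken from the reviewed article: the function name `knn_predict`, the three-taxon abundance profiles, and the "healthy"/"disease" labels are all hypothetical, and Euclidean distance is an assumed choice of metric.

```python
from collections import Counter
from math import dist


def knn_predict(train_X, train_y, query, k=3):
    """Return the majority label among the k training samples closest to `query`."""
    # Rank training samples by Euclidean distance to the query profile.
    ranked = sorted(zip(train_X, train_y), key=lambda pair: dist(pair[0], query))
    # Count the labels of the k nearest neighbours and take the most frequent.
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical relative abundances of three taxa per sample (made-up numbers).
train_X = [
    (0.60, 0.30, 0.10),
    (0.55, 0.35, 0.10),
    (0.65, 0.25, 0.10),
    (0.20, 0.30, 0.50),
    (0.25, 0.25, 0.50),
    (0.15, 0.35, 0.50),
]
train_y = ["healthy", "healthy", "healthy", "disease", "disease", "disease"]

prediction = knn_predict(train_X, train_y, (0.58, 0.32, 0.10), k=3)
print(prediction)
```

Real microbiome profiles are compositional and sparse, so a practical analysis would likely need a transformation or a different distance before applying this kind of classifier.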

4.
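In recent years, with continual breakthroughs in computer hardware, software tools, and data abundance, artificial intelligence technologies, represented by machine learning, have been increasingly applied to and integrated with biology, basic medicine, pharmacy, and related fields, greatly advancing these fields and, in particular, transforming drug discovery and development. The identification of drug-target interactions (DTI) is an important problem in drug discovery and a popular direction for integration with artificial intelligence. Researchers have done a great deal of work on DTI prediction, building many important databases and developing or extending various machine learning algorithms and software tools. This article introduces the basic workflow of machine-learning-based DTI prediction, reviews studies that use machine learning to predict DTI, and briefly summarizes the advantages and disadvantages of different machine learning methods for DTI prediction, with the aim of aiding the development of more effective prediction algorithms and of DTI prediction in general.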

5.
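A review of machine-learning-based modeling and analysis of gut microbiota data   Total citations: 1, self-citations: 0, citations by others: 1
The human gut microbiota is closely related to human health and disease. Modeling and analyzing metagenomic data from the gut microbiota is of great significance for scientific research and societal applications in disease prediction and diagnosis. From the perspective of big data analysis and machine learning, this article reviews the principles and procedures of algorithms for modeling, analyzing, and predicting from human gut microbiota data, along with typical research applications. It aims to promote research on gut microbiota analysis, to explore effective ways of combining machine learning algorithms with gut microbiota analysis, to provide a reference for developing new diagnostic and therapeutic approaches based on gut microbiota data, and to advance precision medicine in China.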

6.
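With the rapid development of high-throughput sequencing technology, plant genomics research has accumulated massive amounts of multi-omics data. How to develop and improve software tools that make effective use of these data to uncover useful biological information has therefore become an important scientific problem that urgently needs solving. Machine learning methods, with their notable capabilities in prediction, classification, data mining, and integration, have attracted wide attention in this field. This article systematically reviews the basic principles and workflows of different types of machine learning methods and the research progress of these methods in plant genome function prediction. It focuses on applications of machine learning models in predicting plant molecular interactions, predicting important functional sites, functional annotation, and crop breeding, and discusses future directions and application prospects for the field. The review should help plant researchers quickly understand and apply machine learning methods, thereby advancing research on the mechanisms of plant genetics and the improvement of crop traits.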

7.
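To compare differences in the diversity and structural composition of fecal microbial communities between diarrheic and healthy subadult Chaidamu horses (柴达木马) on the Qinghai-Tibet Plateau, we used 16S rRNA sequencing to analyze and compare the bacterial composition and distribution of fecal samples collected from diarrheic (n = 3) and healthy (n = 13) individuals, and used real-time fluorescence quantitative PCR to measure the abundance of relevant bacterial genera. The results showed that, in both healthy and diarrheic animals, Firmicutes, Bacteroidetes, Verrucomicrobia, Proteobacteria, and Spirochaetes were the dominant phyla in the feces of subadult Chaidamu horses. Compared with the healthy group, the Alpha diversity of the fecal microbiota in the diarrhea group decreased significantly (P < 0.05); the relative abundance of Firmicutes decreased, while that of Proteobacteria increased significantly (P < 0.05). We infer that an imbalance in the abundance of genera in these two phyla, such as Clostridium, Prevotella, and Fibrobacter, may be one of the causes of diarrhea in Chaidamu horses. In addition, the random forest machine learning algorithm identified 12 characteristic genera that strongly influence the differences in fecal microbiota between healthy and diarrheic subadult Chaidamu horses, including Methanobrevibacter, Fibrobacter, Paludibacter, Carnobacterium, and Elusimicrobium. The study reveals changes in the fecal microbiome of healthy and diarrheic subadult Chaidamu horses and provides data to support further research on livestock diarrhea on the Qinghai-Tibet Plateau.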
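The abstract reports a drop in Alpha diversity without naming the index used. As one common example, the sketch below computes the Shannon index from per-taxon read counts. The choice of Shannon is an assumption, and the function name `shannon_index` and both count vectors are hypothetical illustrations, not data from the study.

```python
from math import log


def shannon_index(counts):
    """Shannon diversity H' = -sum(p_i * ln p_i) over taxa with non-zero counts."""
    total = sum(counts)
    if total == 0:
        raise ValueError("sample has no reads")
    # Taxa with zero reads contribute nothing to the sum, so they are skipped.
    return -sum((c / total) * log(c / total) for c in counts if c > 0)


# Made-up read counts for four taxa in two samples.
even_sample = [25, 25, 25, 25]    # reads spread evenly across taxa
skewed_sample = [97, 1, 1, 1]     # one taxon dominates

print(shannon_index(even_sample))    # equals ln(4) for four equally abundant taxa
print(shannon_index(skewed_sample))  # lower, because one taxon dominates
```

A community dominated by a few taxa scores lower than an even one with the same number of taxa, which is the kind of shift a reduced Alpha diversity describes.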

8.
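Microbial lipids are an important potential resource for future fuels and edible oils. In recent years, with the rapid development of systems biology technologies, understanding the physiological metabolism and lipid accumulation characteristics of oleaginous microorganisms from a global perspective has become a research hotspot. Omics technologies, as important tools in systems biology, have been widely used to reveal the mechanisms of efficient lipid production in oleaginous microorganisms, providing a foundation for their rational genetic modification and for fermentation process control. This article reviews the application of omics technologies to oleaginous microorganisms, introduces the sample pretreatment and data analysis methods commonly used in omics analysis of oleaginous microorganisms, and reviews research that uses multiple omics technologies, including genomics, transcriptomics, proteomics (and protein modification omics), and metabolomics (and lipidomics), as well as mathematical models built on omics data, to reveal the mechanisms of efficient lipid production in oleaginous microorganisms. It also discusses prospects for future development and application.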

9.
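The rapid development of high-throughput technologies has led to major breakthroughs in microbial ecology and sparked a wave of research in metagenomics. Metagenomics is usually defined as the analysis of DNA sequences from microbial communities in uncultured environmental samples. As microbiome data grow rapidly, the efficient management and analysis of microbial big data is receiving increasing attention from researchers. How to mine scientifically valuable information from massive microbiome data and apply it to practical problems has become a current research hotspot. Many computational biology tools and databases are now available for analyzing and managing metagenomic data. This article reviews the major international microbiome projects and microbiome data platforms that have emerged with advances in high-throughput sequencing, such as the Human Microbiome Project (HMP), the Earth Microbiome Project (EMP), the European Union's Metagenomics of the Human Intestinal Tract (MetaHIT) project, MG-RAST, iMicrobe, Integrated Microbial Genomes (IMG), and EBI Metagenomics; introduces the main workflows and tools for microbial data analysis; and argues for the necessity of building a system for managing and analyzing multi-source, heterogeneous microbial ecology data.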

10.
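Genome-wide association studies (GWAS) have identified a large number of variants associated with diseases and complex traits. The vast majority lie in non-coding regions of the genome and participate in gene expression regulation and phenotype formation in a variety of ways. In recent years, how to systematically annotate and identify these variants has been a major challenge in disease genomics. The rapid development of machine learning algorithms has provided new opportunities for this work. By combining features from multi-omics data, machine learning methods can annotate and predict non-coding variants at large scale and with high accuracy, which is valuable for revealing the specific pathogenic mechanisms of variants and for guiding downstream experimental validation. This article reviews progress in applying machine learning algorithms to the annotation of non-coding variants and discusses the shortcomings of current research and future research directions, with the aim of providing a reference for related work.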

11.
Research on the microbiome has been actively conducted worldwide, and the results have shown that the human gut bacterial environment significantly impacts the immune system, psychological conditions, cancers, obesity, and metabolic diseases. Thanks to the development of sequencing technology, microbiome studies with large numbers of samples are now feasible at an acceptable cost. Large samples allow more sophisticated modeling using machine learning approaches to study relationships between the microbiome and various traits. This article provides an overview of machine learning methods for non-data scientists interested in the association analysis of microbiomes and host phenotypes. Once the genomic features of the microbiome are determined, various analysis methods can be used to explore the relationship between the microbiome and host phenotypes, including penalized regression, support vector machine (SVM), random forest, and artificial neural network (ANN). Deep neural network methods are also touched upon. The analysis procedure, from environment setup to extraction of analysis results, is presented in the Python programming language.

12.
Current advances in next-generation sequencing techniques have allowed researchers to conduct comprehensive research on the microbiome and human diseases, with recent studies identifying associations between the human microbiome and health outcomes for a number of chronic conditions. However, microbiome data structure, characterized by sparsity and skewness, presents challenges to building effective classifiers. To address this, we present an innovative approach for distance-based classification using mixture distributions (DCMD). The method aims to improve classification performance using microbiome community data, where the predictors are composed of sparse and heterogeneous count data. This approach models the inherent uncertainty in sparse counts by estimating a mixture distribution for the sample data and representing each observation as a distribution, conditional on observed counts and the estimated mixture; these distributions are then used as inputs for distance-based classification. The method is implemented within k-means classification and k-nearest neighbours frameworks. We develop two distance metrics that produce optimal results. The performance of the model is assessed using simulated and human microbiome study data, with results compared against a number of existing machine learning and distance-based classification approaches. The proposed method is competitive when compared to the other machine learning approaches, and shows a clear improvement over commonly used distance-based classifiers, underscoring the importance of modelling sparsity for achieving optimal results. The range of applicability and robustness make the proposed method a viable alternative for classification using sparse microbiome count data. The source code is available at https://github.com/kshestop/DCMD for academic use.

13.
Identification of environment-specific marker features is one of the key objectives of many metagenomic studies. The aim is to identify features in microbiome datasets that may serve as markers of contrasting or comparable states. Hypothesis testing and black-box machine-learnt models, which are conventionally used to identify these features, are generally not exhaustive, especially because they do not provide any quantifiable relevance (context) of or between the identified features. We present the MarkerML web server, which seeks to leverage the emergence of interpretable machine learning to facilitate the contextual discovery of metagenomic features of interest. It does so through a comprehensive and automated application of Shapley Additive Explanations alongside compositionality-aware hypothesis testing for multivariate microbiome datasets. MarkerML not only helps identify marker features but also provides insight into the role and interdependence of the identified features in driving the decisions of the supervised machine-learnt model. High-quality, intuitive visualizations, spanning prediction effect plots, model performance reports, feature dependency plots, Shapley- and abundance-informed cladograms (Sungrams), and hypothesis-tested violin plots, along with provisions for excluding participant bias and ensuring reproducibility of results, further seek to make the platform a useful asset for scientists in the microbiome field (and beyond). The MarkerML web server is freely available to the academic community at https://microbiome.igib.res.in/markerml/.
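To make the idea behind Shapley Additive Explanations concrete, the sketch below computes exact Shapley values for a toy scoring function by averaging each feature's marginal contribution over all feature orderings. This is not MarkerML's implementation: the names `shapley_values` and `toy_model`, the three taxa, and the effect sizes are hypothetical, and practical tools approximate these values rather than enumerating every ordering.

```python
from itertools import permutations


def shapley_values(value_fn, features):
    """Exact Shapley values: average marginal contribution over all feature orderings."""
    totals = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        present = set()
        for f in order:
            before = value_fn(frozenset(present))
            present.add(f)
            # Marginal contribution of f given the features added before it.
            totals[f] += value_fn(frozenset(present)) - before
    return {f: total / len(orderings) for f, total in totals.items()}


def toy_model(present):
    """Hypothetical score: two additive effects plus an A-B interaction; taxon_C is inert."""
    score = 0.0
    if "taxon_A" in present:
        score += 2.0
    if "taxon_B" in present:
        score += 1.0
    if {"taxon_A", "taxon_B"} <= present:
        score += 1.0  # interaction term, split equally between A and B
    return score


values = shapley_values(toy_model, ["taxon_A", "taxon_B", "taxon_C"])
print(values)  # taxon_A: 2.5, taxon_B: 1.5, taxon_C: 0.0
```

The attributions sum to the difference between the full-model score and the empty-model score (4.0 here), which is the additivity property that makes such values usable as per-feature explanations.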

14.
Microbes play an essential role in the decomposition process, but their succession and behaviour are poorly understood. Previous research has shown that microbes exhibit predictable behaviour that starts at death and changes during the decomposition process. Studying this behaviour enhances the understanding of decomposition and helps in estimating the postmortem interval (PMI) in forensic investigations, which is critical but faces multiple challenges. In this study, we combined microbial community characterization, microbiome sequencing from different organs (i.e. brain, heart and cecum) and machine learning algorithms [random forest (RF), support vector machine (SVM) and artificial neural network (ANN)] to investigate the microbial succession pattern during corpse decomposition and to estimate PMI in a mouse corpse system. Microbial communities exhibited significant differences between the death point and advanced decay stages. Enterococcus faecalis, Anaerosalibacter bizertensis, Lactobacillus reuteri, and others were identified as the most informative species in the decomposition process. Furthermore, the ANN model combined with the postmortem microbial data set from the cecum, which was the best combination among all candidates, yielded a mean absolute error of 1.5 ± 0.8 h within 24-h decomposition and 14.5 ± 4.4 h within 15-day decomposition. This integrated model can serve as a reliable and accurate technology in PMI estimation.

15.
Data mining in bioinformatics using Weka   Total citations: 8, self-citations: 0, citations by others: 8
The Weka machine learning workbench provides a general-purpose environment for automatic classification, regression, clustering and feature selection, which are common data mining problems in bioinformatics research. It contains an extensive collection of machine learning algorithms and data pre-processing methods, complemented by graphical user interfaces for data exploration and the experimental comparison of different machine learning techniques on the same problem. Weka can process data given in the form of a single relational table. Its main objectives are to (a) assist users in extracting useful information from data and (b) enable them to easily identify a suitable algorithm for generating an accurate predictive model from it. AVAILABILITY: http://www.cs.waikato.ac.nz/ml/weka.

16.
With big data becoming widely available in healthcare, machine learning algorithms such as random forest (RF), which ignores time-to-event information, and random survival forest (RSF), which handles right-censored data, are used for individual risk prediction as alternatives to the Cox proportional hazards (Cox-PH) model. We aimed to systematically compare RF and RSF with Cox-PH. RSF with three split criteria [log-rank (RSF-LR), log-rank score (RSF-LRS), maximally selected rank statistics (RSF-MSR)], RF, Cox-PH, and Cox-PH with splines (Cox-S) were evaluated through a simulation study based on real data. One hundred eighty scenarios were investigated, assuming different associations between the predictors and the outcome (linear/linear and interactions/nonlinear/nonlinear and interactions), training sample sizes (500/1000/5000), censoring rates (50%/75%/93%), hazard functions (increasing/decreasing/constant), and numbers of predictors (seven, or 15 including noise variables). The methods' performance was evaluated with the time-dependent area under the curve and the integrated Brier score. In all scenarios, RF had the worst performance. In scenarios with a low number of events (⩽70), Cox-PH was at least noninferior to RSF, whereas under the linearity assumption it outperformed RSF. In the presence of interactions, RSF performed better than Cox-PH as the number of events increased, whereas Cox-S reached at least similar performance to RSF under nonlinear effects. RSF-LRS performed slightly worse than RSF-LR and RSF-MSR when noise variables and interaction effects were included. When applied to real data, models incorporating survival time performed better. Although RSF algorithms are a promising alternative to conventional Cox-PH as data complexity increases, they require a higher number of events for training. In time-to-event analysis, algorithms that consider survival time should be used.
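For readers unfamiliar with the integrated Brier score named in this abstract, a simplified form is sketched here. It assumes no censoring; with right-censored data, inverse-probability-of-censoring weights are added, and they are omitted here. The symbols $N$, $T_i$, $x_i$, $\hat{S}$ and $t_{\max}$ are notation introduced for this sketch, not taken from the paper.

```latex
\mathrm{BS}(t) = \frac{1}{N} \sum_{i=1}^{N} \left( \mathbb{1}\{T_i > t\} - \hat{S}(t \mid x_i) \right)^2,
\qquad
\mathrm{IBS} = \frac{1}{t_{\max}} \int_{0}^{t_{\max}} \mathrm{BS}(t)\, dt
```

Here $T_i$ is the observed event time of subject $i$ and $\hat{S}(t \mid x_i)$ is the model's predicted probability of surviving beyond time $t$; lower values indicate better-calibrated survival predictions.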

17.
Aim: Trait-based risk assessment for invasive species is becoming an important tool for identifying non-indigenous species that are likely to cause harm. Despite this, concerns remain that the invasion process is too complex for accurate predictions to be made. Our goal was to test risk assessment performance across a range of taxonomic and geographical scales, at different points in the invasion process, with a range of statistical and machine learning algorithms. Location: Regional to global data sets. Methods: We selected six data sets differing in size, geography and taxonomic scope. For each data set, we created seven risk assessment tools using a range of statistical and machine learning algorithms. The performance of the tools was compared to determine the effects of data set size and scale and of the algorithm used, and to determine the overall performance of the trait-based risk assessment approach. Results: Risk assessment tools with good performance were generated for all data sets. Random forests (RF) and logistic regression (LR) consistently produced tools with high performance. Other algorithms had varied performance. Despite their greater power and flexibility, machine learning algorithms did not systematically outperform statistical algorithms. The geographic scope and size of the data set did not systematically affect risk assessment performance. Main conclusions: Across six representative data sets, we were able to create risk assessment tools with high performance. Additional data sets could be generated for other taxonomic groups and regions, and these could support efforts to prevent the arrival of new invaders. Random forests and LR approaches performed well for all data sets and could be used as a standard approach to risk assessment development.
