Similar literature
 Found 20 similar articles; search took 31 ms
1.
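Machine learning algorithms and their applications in environmental microbiology research   Total citations: 1, self-citations: 0, citations by others: 1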
陈鹤  陶晔  毛振镀  邢鹏 《微生物学报》2022,62(12):4646-4662
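Microorganisms are ubiquitous in the environment. They are key participants in biogeochemical cycles and environmental evolution, and they play important roles in environmental monitoring, ecological remediation, and conservation. With the development of high-throughput technologies, large volumes of microbial data are being generated. Using machine learning to model and analyze environmental microbial big data is of great significance for both scientific research and societal applications in areas such as microbial biomarker identification, pollutant prediction, and environmental quality prediction. Machine learning can be divided into two broad categories: supervised learning and unsupervised learning. In microbiome research, unsupervised learning efficiently learns the features of the input data through methods such as clustering and dimensionality reduction, and thereby integrates and groups microbial data. Supervised learning trains models on microbial datasets that have both features and labels, so that labels can be inferred for data that have features but no labels, enabling classification, identification, and prediction for new data. However, complex machine learning algorithms usually focus on predictive accuracy at the expense of interpretability. A machine learning model can often be regarded as a "black box" that predicts a particular outcome, meaning that little is known about how the model arrives at its predictions. To apply machine learning more widely in microbiome research and improve our ability to extract valuable microbial information, it is particularly important to understand machine learning algorithms in depth and to improve model interpretability. This article introduces the machine learning algorithms commonly used in environmental microbiology and the steps for building machine learning models from microbiome data, including feature selection, algorithm selection, model construction, and evaluation. It also reviews applications of various machine learning models in environmental microbiology, explores the associations between microbiomes and their surrounding environments, discusses methods for improving model interpretability, and provides a scientific reference for future environmental monitoring and environmental health prediction.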

2.
Penicillin-binding proteins are peptidases that play an important role in bacterial cell-wall biogenesis and thus in sustaining bacterial infections. A wide class of β-lactam drugs is known to act on these proteins and inhibit bacterial infections by disrupting the cell-wall biogenesis pathway. Penicillin-binding proteins have recently gained importance with the increase in the number of multi-drug-resistant bacteria. In this work, we collected a dataset of over 700 penicillin-binding and non-penicillin-binding proteins and extracted various sequence-related features. We then created models to classify the proteins as penicillin-binding or non-binding using supervised machine learning algorithms, namely Support Vector Machines and Random Forest. We obtained good classification performance with both methods.
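
The abstract above does not say which sequence-related features were extracted. As a rough sketch only, amino-acid composition is one simple feature of this kind; the function name `aa_composition` and the choice of composition as the feature are assumptions made for illustration, not the authors' method. The resulting 20-value vector is the sort of input a Support Vector Machine or Random Forest could be trained on.

```python
# Hypothetical illustration: amino-acid composition as one possible
# "sequence-related feature". The abstract does not name the features used.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def aa_composition(sequence):
    """Return the fraction of each standard amino acid in `sequence`."""
    counts = {aa: 0 for aa in AMINO_ACIDS}
    for residue in sequence.upper():
        # Non-standard symbols (for example X) are skipped.
        if residue in counts:
            counts[residue] += 1
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Toy sequence, not a real protein.
features = aa_composition("MKKLAAX")
```

Each protein becomes a fixed-length vector regardless of sequence length, which is what most off-the-shelf classifiers expect.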

3.
  1. Insect populations are changing rapidly, and monitoring these changes is essential for understanding the causes and consequences of such shifts. However, large‐scale insect identification projects are time‐consuming and expensive when done solely by human identifiers. Machine learning offers a possible solution to help collect insect data quickly and efficiently.
  2. Here, we outline a methodology for training classification models to identify pitfall trap‐collected insects from image data and then apply the method to identify ground beetles (Carabidae). All beetles were collected by the National Ecological Observatory Network (NEON), a continental scale ecological monitoring project with sites across the United States. We describe the procedures for image collection, image data extraction, data preparation, and model training, and compare the performance of five machine learning algorithms and two classification methods (hierarchical vs. single‐level) identifying ground beetles from the species to subfamily level. All models were trained using pre‐extracted feature vectors, not raw image data. Our methodology allows for data to be extracted from multiple individuals within the same image thus enhancing time efficiency, utilizes relatively simple models that allow for direct assessment of model performance, and can be performed on relatively small datasets.
  3. The best performing algorithm, linear discriminant analysis (LDA), reached an accuracy of 84.6% at the species level when naively identifying species, which was further increased to >95% when classifications were limited by known local species pools. Model performance was negatively correlated with taxonomic specificity, with the LDA model reaching an accuracy of ~99% at the subfamily level. When classifying carabid species not included in the training dataset at higher taxonomic levels, the models performed significantly better than if classifications were made randomly. We also observed greater performance when classifications were made using the hierarchical classification method compared to the single‐level classification method at higher taxonomic levels.
  4. The general methodology outlined here serves as a proof‐of‐concept for classifying pitfall trap‐collected organisms using machine learning algorithms, and the image data extraction methodology may also be used for purposes other than machine learning. We propose that integration of machine learning in large‐scale identification pipelines will increase efficiency and lead to a greater flow of insect macroecological data, with the potential to be expanded for use with other noninsect taxa.
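
The idea of limiting classifications by a known local species pool (point 3 above) can be sketched as follows. This is a minimal illustration under stated assumptions: the function `classify_with_pool`, the score dictionary, and the placeholder species names are hypothetical, and the abstract does not describe how the authors implemented the restriction.

```python
def classify_with_pool(probabilities, local_pool=None):
    """Pick the highest-scoring species, optionally restricted to a local pool.

    probabilities: dict mapping species name -> model score (hypothetical output).
    local_pool: set of species known from the site, or None for naive prediction.
    """
    candidates = probabilities
    if local_pool is not None:
        restricted = {sp: p for sp, p in probabilities.items() if sp in local_pool}
        # Fall back to the naive prediction if nothing overlaps the pool.
        if restricted:
            candidates = restricted
    return max(candidates, key=candidates.get)


# Placeholder names and scores, not data from the study.
scores = {"species_a": 0.45, "species_b": 0.40, "species_c": 0.15}
naive = classify_with_pool(scores)
limited = classify_with_pool(scores, local_pool={"species_b", "species_c"})
```

Here the naive call returns `species_a`, while the restricted call returns `species_b` because `species_a` is not in the assumed local pool.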

4.
Despite growing concerns over the health of global invertebrate diversity, terrestrial invertebrate monitoring efforts remain poorly geographically distributed. Machine-assisted classification has been proposed as a potential solution to quickly gather large amounts of data; however, previous studies have often used unrealistic or idealized datasets to train and test their models. In this study, we describe a practical methodology for including machine learning in ecological data acquisition pipelines. Here we train and test machine learning algorithms to classify over 72,000 terrestrial invertebrate specimens from morphometric data and contextual metadata. All vouchered specimens were collected in pitfall traps by the National Ecological Observatory Network (NEON) at 45 locations across the United States from 2016 to 2019. Specimens were photographed, and two separate machine learning paradigms were used to classify them. In the first, we used a convolutional neural network (ResNet-50), and in the second, we extracted morphometric data as feature vectors using ImageJ and used traditional machine learning methods to classify specimens. Issues stemming from inconsistent taxonomic label specificity were resolved by making classifications at the lowest identified taxonomic level (LITL). Taxa with too few specimens to be included in the training dataset were classified by the model using zero-shot classification. When classifying specimens that were known and seen by our models, we reached a maximum accuracy of 72.7% using eXtreme Gradient Boosting (XGBoost) at the LITL. This nearly matched the maximum accuracy achieved by the CNN of 72.8% at the LITL. Models that were trained without contextual metadata underperformed models with contextual metadata. We also classified invertebrate taxa that were unknown to the model using zero-shot classification, reaching a maximum accuracy of 65.5% when using the ResNet-50, compared to 39.4% when using XGBoost. The general methodology outlined here represents a realistic application of machine learning as a tool for ecological studies. We found that more advanced and complex machine learning methods such as convolutional neural networks are not necessarily more accurate than traditional machine learning methods. Hierarchical and LITL classifications allow for flexible taxonomic specificity at the input and output layers. These methods also help address the ‘long tail’ problem of underrepresented taxa missed by machine learning models. Finally, we encourage researchers to consider more than just morphometric data when training their models, as we have shown that the inclusion of contextual metadata can provide significant improvements to accuracy.
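
One way to picture labelling at the lowest identified taxonomic level (LITL) is sketched below. The rank list, the record layout, and the function name are assumptions made for illustration; the abstract does not specify how the labels were stored or derived.

```python
# Ranks ordered from coarse to fine; this particular list is an assumption.
RANKS = ["order", "family", "genus", "species"]


def lowest_identified_label(record):
    """Return (rank, name) for the finest rank that carries a label, else None."""
    for rank in reversed(RANKS):
        name = record.get(rank)
        if name:
            return rank, name
    return None


# Illustrative records: one identified to genus, one only to order.
specimens = [
    {"order": "Coleoptera", "family": "Carabidae", "genus": "Pterostichus", "species": None},
    {"order": "Araneae", "family": None, "genus": None, "species": None},
]
labels = [lowest_identified_label(s) for s in specimens]
```

With this scheme a specimen identified only to order still contributes a training label, rather than being dropped for lacking a species name.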

5.

Background

The pervasive nature of plastics has raised concerns about the impact of continuous exposure to plastic additives on human health. Of particular concern is the use of phthalates in the production of flexible polyvinyl chloride (PVC) products. Di-2-ethylhexyl-phthalate (DEHP) is a commonly used phthalate ester plasticizer that imparts flexibility and elasticity to PVC products. Recent epidemiological studies have reported correlations between urinary phthalate concentrations and cardiovascular disease, including an increased risk of high blood pressure and increased coronary risk. Yet, there is little direct evidence linking phthalate exposure to adverse effects in human cells, including cardiomyocytes.

Methods and Results

The effect of DEHP on calcium handling was examined using monolayers of gCAMP3 human embryonic stem cell-derived cardiomyocytes, which contain an endogenous calcium sensor. Cardiomyocytes were exposed to DEHP (5 – 50 μg/mL), and calcium transients were recorded using a Zeiss confocal imaging system. DEHP exposure (24 – 72 hr) had a negative chronotropic and inotropic effect on cardiomyocytes, increased the minimum threshold voltage required for external pacing, and modified connexin-43 expression. Application of Wy-14,643 (100 μM), an agonist for the peroxisome proliferator-activated receptor alpha, did not replicate DEHP’s effects on calcium transient morphology or spontaneous beating rate.

Conclusions

Phthalates can affect the normal physiology of human cardiomyocytes, including DEHP-elicited perturbations in cardiac calcium handling and intercellular connectivity. Our findings call for additional studies to clarify the extent to which phthalate exposure can alter cardiac function, particularly in vulnerable patient populations who are at risk for high phthalate exposure.

6.
The quality of electrophysiological recordings varies considerably owing to technical and biological variability, and neuroscientists inevitably have to select “good” recordings for further analysis. This procedure is time-consuming and prone to selection bias. Here, we investigate replacing human decisions with a machine learning approach. We define 16 features, such as spike height and width, select the most informative ones using a wrapper method, and train a classifier to reproduce the judgement of one of our expert electrophysiologists. Generalisation performance is then assessed on unseen data, classified by the same or by another expert. We observe that the learning machine can be as consistent in its judgements as individual experts are with one another, if not more so. The best performance is achieved with a limited number of informative features, and the optimal feature set differs from one data set to another. With 80–90% correct judgements, the performance of the system is very promising within the data sets of each expert, but judgements are less reliable when it is used across sets of recordings from different experts. We conclude that the proposed approach is relevant to the selection of electrophysiological recordings, provided parameters are adjusted to different types of experiments and to individual experimenters.
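
A wrapper method scores candidate feature subsets by training and evaluating a classifier on each one. The sketch below shows one common variant, greedy forward selection, using a nearest-centroid classifier scored by leave-one-out accuracy. Both the classifier and the toy data are stand-ins chosen for brevity; the abstract does not state which classifier or search strategy the authors used.

```python
def centroid_accuracy(X, y, features):
    """Leave-one-out accuracy of a nearest-centroid classifier on `features`."""
    correct = 0
    for i in range(len(X)):
        # Build class centroids from every sample except the held-out one.
        sums, counts = {}, {}
        for j, (row, label) in enumerate(zip(X, y)):
            if j == i:
                continue
            acc = sums.setdefault(label, [0.0] * len(features))
            for k, f in enumerate(features):
                acc[k] += row[f]
            counts[label] = counts.get(label, 0) + 1

        def dist(label):
            return sum((X[i][f] - sums[label][k] / counts[label]) ** 2
                       for k, f in enumerate(features))

        if min(sums, key=dist) == y[i]:
            correct += 1
    return correct / len(X)


def forward_select(X, y):
    """Greedily add the feature that most improves accuracy; stop when none does."""
    selected, best = [], 0.0
    remaining = list(range(len(X[0])))
    while remaining:
        score, feat = max((centroid_accuracy(X, y, selected + [f]), f)
                          for f in remaining)
        if score <= best:
            break
        selected.append(feat)
        remaining.remove(feat)
        best = score
    return selected, best


# Toy data: feature 0 separates the classes, feature 1 is noise.
X = [[1.0, 5.0], [1.2, 1.0], [0.9, 3.0], [3.0, 4.8], [3.1, 1.1], [2.9, 3.2]]
y = ["good", "good", "good", "bad", "bad", "bad"]
selected, best = forward_select(X, y)
```

On the toy data only the informative feature is kept, which mirrors the abstract's observation that a limited number of informative features gives the best performance.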

7.
8.
Differences in mRNA expression levels have been observed in failing versus non-failing human hearts for several membrane channel proteins and accessory subunits. These differences may play a causal role in electrophysiological changes observed in human heart failure and atrial fibrillation, such as action potential (AP) prolongation, increased AP triangulation, decreased intracellular calcium transient (CaT) magnitude and decreased CaT triangulation. Our goal is to investigate whether the information contained in mRNA measurements can be used to predict cardiac electrophysiological remodeling in heart failure using computational modeling. Using mRNA data recently obtained from failing and non-failing human hearts, we construct failing and non-failing cell populations incorporating natural variability and up/down regulation of channel conductivities. Six biomarkers are calculated for each cell in each population, at cycle lengths between 1500 ms and 300 ms. Regression analysis is performed to determine which ion channels drive biomarker variability in failing versus non-failing cardiomyocytes. Our models suggest that reported mRNA expression changes are consistent with AP prolongation, increased AP triangulation, increased CaT duration, decreased CaT triangulation and amplitude, and increased delay between AP and CaT upstrokes in the failing population. Regression analysis reveals that changes in AP biomarkers are driven primarily by reduction in I, and changes in CaT biomarkers are driven predominantly by reduction in I and SERCA. In particular, the role of I is pacing rate dependent. Additionally, alternans developed at fast pacing rates for both failing and non-failing cardiomyocytes, but the underlying mechanisms are different in control and heart failure.  相似文献   

9.
10.
Human-induced pluripotent stem cells (hiPSCs) can differentiate into functional cardiomyocytes; however, the electrophysiological properties of hiPSC-derived cardiomyocytes have yet to be fully characterized. We performed detailed electrophysiological characterization of highly pure hiPSC-derived cardiomyocytes. Action potentials (APs) were recorded from spontaneously beating cardiomyocytes using a perforated patch method and had atrial-, nodal-, and ventricular-like properties. Ventricular-like APs were more common and had maximum diastolic potentials close to those of human cardiac myocytes, AP durations were within the range of the normal human electrocardiographic QT interval, and APs showed expected sensitivity to multiple drugs (tetrodotoxin, nifedipine, and E4031). Early afterdepolarizations (EADs) were induced with E4031 and were bradycardia dependent, and EAD peak voltage varied inversely with the EAD take-off potential. Gating properties of seven ionic currents were studied including sodium (I(Na)), L-type calcium (I(Ca)), hyperpolarization-activated pacemaker (I(f)), transient outward potassium (I(to)), inward rectifier potassium (I(K1)), and the rapidly and slowly activating components of delayed rectifier potassium (I(Kr) and I(Ks), respectively) current. The high purity and large cell numbers also enabled automated patch-clamp analysis. We conclude that these hiPSC-derived cardiomyocytes have ionic currents and channel gating properties underlying their APs and EADs that are quantitatively similar to those reported for human cardiac myocytes. These hiPSC-derived cardiomyocytes have the added advantage that they can be used in high-throughput assays, and they have the potential to impact multiple areas of cardiovascular research and therapeutic applications.  相似文献   

11.
Finding efficient analytical techniques is increasingly becoming a bottleneck in making effective use of large biological datasets. Machine learning offers a novel and powerful tool to advance classification and modeling solutions in molecular biology. However, these methods have been used less frequently with empirical population genetics data. In this study, we developed a new combined approach to data analysis, applying machine learning algorithms to microsatellite marker data from our previous studies of olive populations. Herein, 267 olive accessions of various origins, including 21 reference cultivars, 132 local ecotypes, and 37 wild olive specimens from the Iranian plateau, together with 77 of the most represented Mediterranean varieties, were investigated using a finely selected panel of 11 microsatellite markers. We organized the data into two experiments, ‘4-targeted’ and ‘16-targeted’. A strategy of assaying different machine-based analyses (i.e. data cleaning, feature selection, and machine learning classification) was devised to identify the most informative loci and the most diagnostic alleles to represent the population and the geography of each olive accession. These analyses revealed the microsatellite markers with the highest differentiating capacity and demonstrated the efficiency of our method for clustering olive accessions according to their regions of origin. A notable highlight of this study was the discovery of the best combination of markers for differentiating populations via machine learning models, which can be exploited to distinguish among other biological populations.

12.
Preterm birth has been shown to induce an altered developmental trajectory of brain structure and function. With the aid of support vector machine (SVM) classification methods, we aimed to investigate whether MRI data collected in adolescence could be used to predict whether an individual had been born preterm or at term. To this end we collected T1-weighted anatomical MRI data from 143 individuals (69 controls, mean age 14.6 y). The inclusion criteria for those born preterm were birth weight ≤ 1500 g and gestational age < 37 w. A linear SVM was trained on the grey matter segment of MR images in two different ways. First, all the individuals were used for training and classification was performed by the leave-one-out method, yielding 93% correct classification (sensitivity = 0.905, specificity = 0.942). Separately, a random half of the available data was used for training twice, and each time the other, unseen, half of the data was classified, resulting in 86% and 91% accurate classifications. Both gestational age (R = -0.24, p < 0.04) and birth weight (R = -0.51, p < 0.001) correlated with the distance to the decision boundary within the group of individuals born preterm. A statistically significant correlation was also found between IQ (R = -0.30, p < 0.001) and the distance to the decision boundary. Those born small for gestational age did not form a separate subgroup in these analyses. The high rate of correct classification by the SVM motivates further investigation. The long-term goal is to automatically and non-invasively predict the outcome of preterm-born individuals on an individual basis using as early a scan as possible.
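
The leave-one-out procedure and the sensitivity and specificity figures quoted above can be illustrated with a small sketch. A one-dimensional nearest-mean classifier stands in for the linear SVM purely to keep the example dependency-free; it is not the method used in the study, and the data values are invented for illustration.

```python
def leave_one_out(samples, labels, fit, predict):
    """Predict each sample with a model trained on all the other samples."""
    predictions = []
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        predictions.append(predict(fit(train_x, train_y), samples[i]))
    return predictions


def fit_nearest_mean(xs, ys):
    """Stand-in classifier (not an SVM): store the mean value of each class."""
    means = {}
    for label in set(ys):
        values = [x for x, y in zip(xs, ys) if y == label]
        means[label] = sum(values) / len(values)
    return means


def predict_nearest_mean(means, x):
    return min(means, key=lambda label: abs(x - means[label]))


def sensitivity_specificity(labels, predictions, positive):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    return tp / (tp + fn), tn / (tn + fp)


# Invented one-dimensional feature values, four per group.
samples = [0.2, 0.3, 0.4, 1.0, 1.1, 1.2, 1.3, 0.5]
labels = ["preterm"] * 4 + ["term"] * 4
predictions = leave_one_out(samples, labels, fit_nearest_mean, predict_nearest_mean)
sens, spec = sensitivity_specificity(labels, predictions, positive="preterm")
```

Because every sample is predicted by a model that never saw it, the resulting rates estimate performance on unseen individuals.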

13.
For current computational intelligence techniques, a major challenge is how to learn new concepts in a changing environment. Traditional learning schemes cannot adequately address this problem because they lack a dynamic data selection mechanism. In this paper, inspired by the human learning process, a novel classification algorithm based on an incremental semi-supervised support vector machine (SVM) is proposed. Through analysis of the prediction confidence of samples and of the data distribution in a changing environment, a “soft-start” approach, a data selection mechanism, and a data cleaning mechanism are designed, which together complete the construction of our incremental semi-supervised learning system. Notably, the design of the proposed algorithm effectively reduces computational complexity. In addition, a detailed analysis is carried out for the case in which new labeled samples appear during the learning process. The results show that our algorithm does not rely on a model of the sample distribution, has an extremely low rate of introducing wrong semi-labeled samples, and can effectively make use of unlabeled samples to enrich the knowledge system of the classifier and improve its accuracy. Moreover, our method has outstanding generalization performance and the ability to overcome concept drift in a changing environment.

14.
A recently developed machine learning algorithm referred to as Extreme Learning Machine (ELM) was used to classify machine control commands from time series of spike trains of ensembles of CA1 hippocampus neurons (n = 34) of a rat, which was performing a target-to-goal task in a two-dimensional space through a brain-machine interface system. The performance of ELM was analyzed in terms of training time and classification accuracy. The results showed that processes such as a class code prefix, a redundancy code suffix, and smoothing of the classifiers' outputs could improve the accuracy of classification of robot control commands for a brain-machine interface system.

15.
Machine learning of functional class from phenotype data   Total citations: 5, self-citations: 0, citations by others: 5
MOTIVATION: Mutant phenotype growth experiments are an important novel source of functional genomics data which have received little attention in bioinformatics. We applied supervised machine learning to the problem of using phenotype data to predict the functional class of Open Reading Frames (ORFs) in Saccharomyces cerevisiae. Three sources of data were used: TRansposon-Insertion Phenotypes, Localization and Expression in Saccharomyces (TRIPLES), European Functional Analysis Network (EUROFAN) and Munich Information Center for Protein Sequences (MIPS). The analysis of the data presented a number of challenges to machine learning: multi-class labels, a large number of sparsely populated classes, the need to learn a set of accurate rules (not a complete classification), and a very large number of missing values. We modified the algorithm C4.5 to deal with these problems. RESULTS: Rules were learnt which are accurate and biologically meaningful. The rules predict the function of 83 ORFs of unknown function at an estimated accuracy of ≥ 80%.

16.
There is wide interest in medical health care applications that use machine learning systems to interpret neuroimaging scans. This research proposes an intelligent, automatic, accurate, and robust technique for classifying human brain magnetic resonance images (MRI) as normal or abnormal, in order to reduce human error in identifying diseases in brain MRIs. The basic components are the fast discrete wavelet transform (DWT), principal component analysis (PCA), and the least squares support vector machine (LS-SVM). First, the fast DWT is employed to extract the salient features of the brain MRI, followed by PCA, which reduces the dimensions of the features. The reduced feature vectors also shrink memory storage consumption by 99.5%. Finally, an advanced classification technique based on LS-SVM is applied to brain MR image classification using the reduced features. To improve efficiency, the LS-SVM is used with a non-linear radial basis function (RBF) kernel. The proposed algorithm determines optimized values of the hyper-parameters of the RBF kernel and applies stratified k-fold cross-validation to enhance the generalization of the system. The method was tested on benchmark datasets of T1-weighted and T2-weighted scans from 340 patients. From the analysis of experimental results and performance comparisons, it is observed that the proposed medical decision support system outperformed all other modern classifiers and achieved a 100% accuracy rate (specificity/sensitivity 100%/100%). Furthermore, in terms of computation time, the proposed technique is significantly faster than recent well-known methods, improving efficiency by 71%, 3%, and 4% in the feature extraction, feature reduction, and classification stages, respectively. These results indicate that the proposed well-trained machine learning system has the potential to make accurate predictions about brain abnormalities in individual subjects and can therefore be used as a significant tool in clinical practice.
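
To give a feel for the feature extraction step, the sketch below applies one level of a discrete wavelet transform to a single row of values. The Haar wavelet is assumed here only because it is the simplest case; the abstract does not say which wavelet or how many decomposition levels were used, and the input row is invented.

```python
import math


def haar_dwt(signal):
    """One level of the Haar wavelet transform on an even-length sequence."""
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    scale = math.sqrt(2.0)
    # Scaled pairwise sums keep the coarse shape; scaled differences keep detail.
    approx = [(signal[i] + signal[i + 1]) / scale for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / scale for i in range(0, len(signal), 2)]
    return approx, detail


# Invented intensity values standing in for one image row.
row = [4.0, 6.0, 10.0, 12.0, 8.0, 8.0, 2.0, 0.0]
approx, detail = haar_dwt(row)
```

The approximation coefficients are half the length of the input, so repeating the step on them shrinks the representation further before a reduction method such as PCA is applied.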

17.
18.
Existing computational pipelines for quantitative analysis of high‐content microscopy data rely on traditional machine learning approaches that fail to accurately classify more than a single dataset without substantial tuning and training, requiring extensive analysis. Here, we demonstrate that the application of deep learning to biological image data can overcome the pitfalls associated with conventional machine learning classifiers. Using a deep convolutional neural network (DeepLoc) to analyze yeast cell images, we show improved performance over traditional approaches in the automated classification of protein subcellular localization. We also demonstrate the ability of DeepLoc to classify highly divergent image sets, including images of pheromone‐arrested cells with abnormal cellular morphology, as well as images generated in different genetic backgrounds and in different laboratories. We offer an open‐source implementation that enables updating DeepLoc on new microscopy datasets. This study highlights deep learning as an important tool for the expedited analysis of high‐content microscopy data.

19.
This study investigated the feasibility of using hyperspectral imaging for nondestructive measurement of color components (ΔL*, Δa* and Δb*) and classification of tea leaves during different drying periods. Hyperspectral images of tea leaves at five drying periods were acquired in the spectral region of 380–1030 nm. The three color features were measured with a colorimeter. Different preprocessing algorithms were compared, and the best one was selected according to the prediction results of partial least squares regression (PLSR) models. Competitive adaptive reweighted sampling (CARS) and the successive projections algorithm (SPA) were each used to identify the effective wavelengths. Different models (least squares-support vector machine [LS-SVM], PLSR, principal components regression [PCR] and multiple linear regression [MLR]) were established to predict the three color components. The SPA-LS-SVM model performed well, with correlation coefficients (rp) of 0.929 for ΔL*, 0.849 for Δa* and 0.917 for Δb*. An LS-SVM model was built to classify the tea leaves. The correct classification rates (CCRs) ranged from 89.29% to 100% in the calibration set and from 71.43% to 100% in the prediction set. The overall classification rates were 96.43% in the calibration set and 85.71% in the prediction set. The results showed that hyperspectral imaging could be used as an objective and nondestructive method to determine color features and classify tea leaves at different drying periods.

20.
Understanding the relationship between physiological measurements from human subjects and their demographic data is important within both the biometric and forensic domains. In this paper we explore the relationship between measurements of the human hand and a range of demographic features. We assess the ability of linear regression and machine learning classifiers to predict demographics from hand features, thereby providing evidence on both the strength of relationship and the key features underpinning this relationship. Our results show that we are able to predict sex, height, weight and foot size accurately within various data-range bin sizes, with machine learning classification algorithms out-performing linear regression in most situations. In addition, we identify the features used to provide these relationships applicable across multiple applications.
