Similar articles
 Found 20 similar articles; search took 15 ms
1.
Multibeam echosounders (MBES) are increasingly becoming the tool of choice for marine habitat mapping applications. In turn, the rapid expansion of habitat mapping studies has resulted in a need for automated classification techniques to efficiently map benthic habitats, assess confidence in model outputs, and evaluate the importance of variables driving the patterns observed. The benthic habitat characterisation process often involves the analysis of MBES bathymetry, backscatter mosaic or angular response with observation data providing ground truth. However, studies that make use of the full range of MBES outputs within a single classification process are limited. We present an approach that integrates backscatter angular response with MBES bathymetry, backscatter mosaic and their derivatives in a classification process using a Random Forests (RF) machine-learning algorithm to predict the distribution of benthic biological habitats. This approach includes a method of deriving statistical features from backscatter angular response curves created from MBES data collated within homogeneous regions of a backscatter mosaic. Using the RF algorithm we assess the relative importance of each variable in order to optimise the classification process and simplify models applied. The results showed that the inclusion of the angular response features in the classification process improved the accuracy of the final habitat maps from 88.5% to 93.6%. The RF algorithm identified bathymetry and the angular response mean as the two most important predictors. However, the highest classification rates were only obtained after incorporating additional features derived from bathymetry and the backscatter mosaic. The angular response features were found to be more important to the classification process compared to the backscatter mosaic features. 
This analysis indicates that integrating angular response information with bathymetry and the backscatter mosaic, along with their derivatives, constitutes an important improvement for studying the distribution of benthic habitats, which is necessary for effective marine spatial planning and resource management.
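
As a rough illustration of what "statistical features from backscatter angular response curves" might look like, the sketch below summarises one curve. The abstract names only the angular response mean; the standard deviation, the least-squares slope, the function name and the sample values are illustrative assumptions, not the authors' feature set.

```python
from statistics import mean, pstdev


def angular_response_features(angles_deg, backscatter_db):
    """Summarise one angular response curve (hypothetical feature set)."""
    if len(angles_deg) != len(backscatter_db) or len(angles_deg) < 2:
        raise ValueError("need at least two paired (angle, backscatter) samples")
    a_bar = mean(angles_deg)
    b_bar = mean(backscatter_db)
    # Least-squares slope of backscatter (dB) against incidence angle (degrees).
    sxx = sum((a - a_bar) ** 2 for a in angles_deg)
    if sxx == 0:
        raise ValueError("incidence angles must not all be identical")
    sxy = sum((a - a_bar) * (b - b_bar) for a, b in zip(angles_deg, backscatter_db))
    return {
        "mean": b_bar,                    # the feature named in the abstract
        "std": pstdev(backscatter_db),    # assumed extra feature
        "slope": sxy / sxx,               # assumed extra feature
    }


# Made-up curve: backscatter falls by 2 dB for every 10 degrees of incidence angle.
features = angular_response_features([10, 20, 30, 40], [-12.0, -14.0, -16.0, -18.0])
```

Features of this kind could then be supplied to a classifier alongside bathymetry and mosaic derivatives.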

2.
Spatially continuous predictions of seabed hardness are important baseline environmental information for sustainable management of Australia’s marine jurisdiction. Seabed hardness is often inferred from multibeam backscatter data with unknown accuracy and can be inferred from underwater video footage at limited locations. In this study, we classified the seabed into four classes based on two new seabed hardness classification schemes (i.e., hard90 and hard70). We developed optimal predictive models to predict seabed hardness using random forest (RF) based on the point data of hardness classes and spatially continuous multibeam data. Five feature selection (FS) methods, namely variable importance (VI), averaged variable importance (AVI), knowledge informed AVI (KIAVI), Boruta and regularized RF (RRF), were tested based on predictive accuracy. Effects of highly correlated, important and unimportant predictors on the accuracy of RF predictive models were examined. Finally, spatial predictions generated using the most accurate models were visually examined and analysed. This study confirmed that: 1) hard90 and hard70 are effective seabed hardness classification schemes; 2) seabed hardness of four classes can be predicted with a high degree of accuracy; 3) the typical approach used to pre-select predictive variables by excluding highly correlated variables needs to be re-examined; 4) the identification of the important and unimportant predictors provides useful guidelines for further improving predictive models; 5) FS methods select the most accurate predictive model(s) instead of the most parsimonious ones, and AVI and Boruta are recommended for future studies; and 6) RF is an effective modelling method with high predictive accuracy for multi-level categorical data and can be applied to ‘small p and large n’ problems in environmental sciences.
Additionally, automated computational programs for AVI need to be developed to increase its computational efficiency, and caution should be taken when applying filter FS methods in selecting predictive models.

3.
Networks of marine protected areas (MPAs) are being adopted globally to protect ecosystems and supplement fisheries management. The state of California recently implemented a coast-wide network of MPAs, a statewide seafloor mapping program, and ecological characterizations of species and ecosystems targeted for protection by the network. The main goals of this study were to use these data to evaluate how well seafloor features, as proxies for habitats, are represented and replicated across an MPA network and how well ecological surveys representatively sampled fish habitats inside MPAs and adjacent reference sites. Seafloor data were classified into broad substrate categories (rock and sediment) and finer scale geomorphic classifications standard to marine classification schemes using surface analyses (slope, ruggedness, etc.) done on the digital elevation model derived from multibeam bathymetry data. These classifications were then used to evaluate the representation and replication of seafloor structure within the MPAs and across the ecological surveys. Both the broad substrate categories and the finer scale geomorphic features were proportionately represented for many of the classes with deviations of 1-6% and 0-7%, respectively. Within MPAs, however, representation of seafloor features differed markedly from original estimates, with differences ranging up to 28%. Seafloor structure in the biological monitoring design had mismatches between sampling in the MPAs and their corresponding reference sites, and some seafloor structure classes were missed entirely. The geomorphic variables derived from multibeam bathymetry data for these analyses are known determinants of the distribution and abundance of marine species and of coastal marine biodiversity. Thus, analyses like those performed in this study can be a valuable initial method of evaluating and predicting the conservation value of MPAs across a regional network.

4.
Mass spectrometry (MS)-based metabolomics studies often require handling of both identified and unidentified metabolite data. To avoid bias in data interpretation, it is advantageous for the data analysis to include all available data. A practical challenge in exploratory metabolomics analysis is therefore how to interpret the changes related to unidentified peaks. In this paper, we address the challenge by predicting the class membership of unknown peaks by applying and comparing multiple supervised classifiers on selected lipidomics datasets. The employed classifiers include k-nearest neighbours (k-NN), support vector machines (SVM), partial least squares discriminant analysis (PLS-DA) and Naive Bayes methods, which are known to be effective and efficient in predicting the labels for unseen data. Here, the class label predictions are sought for unidentified lipid profiles coming from high-throughput global screening in an Ultra Performance Liquid Chromatography Mass Spectrometry (UPLC™/MS) experimental setup. Our investigation reveals that k-NN and SVM classifiers outperform both PLS-DA and Naive Bayes classifiers. The Naive Bayes classifier performs worst among all models, and this observation seems logical because lipids are highly co-regulated and do not respect the Naive Bayes assumption that features are conditionally independent given the class. Common label predictions from k-NN and SVM can serve as a good starting point to explore the full data, thereby facilitating exploratory studies where label information is critical for the data interpretation.
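
To make the label-prediction step concrete, here is a minimal k-NN sketch using Euclidean distance and a majority vote. The two-dimensional feature vectors and the class labels `"TG"` and `"PC"` are made-up stand-ins for lipid peak features; the abstract does not state the features, the distance metric or the value of k.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    if not 1 <= k <= len(train_x):
        raise ValueError("k must be between 1 and the number of training points")
    # Rank training points by Euclidean distance to the query.
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Hypothetical identified peaks (features) with known lipid classes (labels).
train_x = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
train_y = ["TG", "TG", "TG", "PC", "PC", "PC"]

label_a = knn_predict(train_x, train_y, (0.2, 0.1))
label_b = knn_predict(train_x, train_y, (5.5, 5.5))
```

Keeping only the predictions on which k-NN and a second classifier agree would mirror the "common label predictions" idea in the abstract.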

5.
This paper presents a machine learning system for supporting the first task of the biological literature manual curation process, called triage. We compare the performance of various classification models, by experimenting with dataset sampling factors and a set of features, as well as three different machine learning algorithms (Naive Bayes, Support Vector Machine and Logistic Model Trees). The results show that the most fitting model to handle the imbalanced datasets of the triage classification task is obtained by using domain relevant features, an under-sampling technique, and the Logistic Model Trees algorithm.
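
Random under-sampling, one common way of handling an imbalanced triage dataset, can be sketched as follows. The 1:1 target ratio, the function name and the toy labels are assumptions; the abstract mentions sampling factors without giving their values.

```python
import random
from collections import Counter, defaultdict


def undersample(records, labels, seed=0):
    """Randomly drop records so every class keeps as many as the rarest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for record, label in zip(records, labels):
        by_class[label].append(record)
    n_min = min(len(group) for group in by_class.values())
    balanced = []
    for label, group in by_class.items():
        # Sample without replacement from each class.
        balanced.extend((record, label) for record in rng.sample(group, n_min))
    rng.shuffle(balanced)
    return balanced


# Toy corpus: 10 relevant documents against 90 irrelevant ones.
doc_ids = list(range(100))
doc_labels = ["relevant"] * 10 + ["irrelevant"] * 90
balanced = undersample(doc_ids, doc_labels)
class_sizes = Counter(label for _, label in balanced)
```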

6.
Identifying fixed bed roughness scales of hydrodynamic relevance to waves and currents is challenging around coral reefs due to their highly inhomogeneous bathymetry. In order to characterize the spatial variability in reef roughness, a quantitative analysis of high-resolution sidescan sonar backscatter is performed for the identification of distinct substrates around a tropical reef and is related to echo sounder-based roughness measurements. Data were collected in the vicinity of the Kilo Nalu Observatory on the south shore of Oahu using sidescan sonar and a narrow beam echo sounder incorporated in a REMUS-100 (Remote Environmental Monitoring UnitS) autonomous underwater vehicle (AUV). With basic statistics and principal component analysis of variables derived from the backscatter data, it is possible to discriminate between areas of rough reef, bare reef, and rippled sand. Echo sounder-derived spectral analysis did not reveal dominant length scales. However, by combining the seabed classification obtained from sidescan measurements with echo sounder data, spectral root mean square (RMS) height values of approximately 3.3 cm and 7.3 cm are assigned to the bare reef and rough reef areas, respectively, for roughness with wavelengths between 0.2 and 6 m.

7.
Genetic relationships and population structure of 8 horse breeds in the Czech and Slovak Republics were investigated using classification methods for breed discrimination. To demonstrate genetic differences among these breeds, we used genetic information (genotype data of microsatellite markers) together with classification algorithms to perform a probabilistic prediction of an individual’s breed. In total, 932 unrelated animals were genotyped for 17 microsatellite markers recommended by the ISAG for parentage testing (AHT4, AHT5, ASB2, HMS3, HMS6, HMS7, HTG4, HTG10, VHL20, HTG6, HMS2, HTG7, ASB17, ASB23, CA425, HMS1, LEX3). The classification algorithms J48 (decision trees); Naive Bayes and Bayes Net (probability predictors); IB1 and IB5 (instance-based machine learning methods); and JRip (decision rules) were analysed for their classification performance and results on this genotype dataset. Selected classification methods (Naive Bayes, Bayes Net, IB1), based on machine learning and principles of artificial intelligence, appear usable for these tasks.

8.
Skin is the largest organ and the outer enclosure of the integumentary system, which protects the human body from pathogens. Skin cancer, which can be either melanoma or non-melanoma, is one of the most commonly diagnosed cancers in the world. Melanoma is far more often fatal than non-melanoma cancers, but the chances of survival are high when it is diagnosed and treated early. The main aim of this work is to investigate the performance of the Non-Subsampled Bendlet Transform (NSBT) with various classifiers for detecting melanoma from dermoscopic images. NSBT is a multiscale and multidirectional transform based on a second-order shearlet system, which classifies curvature more precisely than other directional representation systems. Here, two-phase classification is employed using k-Nearest Neighbour (kNN), Naive Bayes (NB), Decision Trees (DT) and Support Vector Machines (SVM). The first phase classifies the images of the PH2 database into normal and abnormal images, and the second phase classifies the abnormal images into benign and malignant. Experimental results show improved classification accuracy, sensitivity and specificity compared with state-of-the-art methods.

9.
Modelling approaches have the potential to significantly contribute to the spatial management of the deep-sea ecosystem in a cost-effective manner. However, we currently have little understanding of the accuracy of such models, developed using limited data of varying resolution. The aim of this study was to investigate the performance of predictive models constructed using non-simulated (real world) data of different resolution. Predicted distribution maps for three deep-sea habitats were constructed using MaxEnt modelling methods using high resolution multibeam bathymetric data and associated terrain derived variables as predictors. Model performance was evaluated using repeated 75/25 training/test data partitions using AUC and threshold-dependent assessment methods. The overall extent and distribution of each habitat, and the percentage contained within an existing MPA network were quantified and compared to results from low resolution GEBCO models. Predicted spatial extent for scleractinian coral reef and Syringammina fragilissima aggregations decreased with an increase in model resolution, whereas Pheronema carpenteri total suitable area increased. Distinct differences in predicted habitat distribution were observed for all three habitats. Estimates of habitat extent contained within the MPA network all increased when modelled at fine scale. High resolution models performed better than low resolution models according to threshold-dependent evaluation. We recommend the use of high resolution multibeam bathymetry data over low resolution bathymetry data for use in modelling approaches. We do not recommend the use of predictive models to produce absolute values of habitat extent, but likely areas of suitable habitat. Assessments of MPA network effectiveness based on calculations of percentage area protection (policy driven conservation targets) from low resolution models are likely to be fit for purpose.
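
The evaluation protocol described here (repeated 75/25 partitions scored with AUC) can be sketched with the standard library alone. This is not MaxEnt itself: the scores are made up, and treating AUC as a presence-versus-background comparison is an assumption about how it was computed.

```python
import random


def split_75_25(items, seed):
    """Shuffle and return (training, test) lists in a 75/25 ratio."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = (3 * len(shuffled)) // 4
    return shuffled[:cut], shuffled[cut:]


def auc(presence_scores, background_scores):
    """Chance that a presence point outscores a background point; ties count half."""
    wins = 0.0
    for p in presence_scores:
        for b in background_scores:
            if p > b:
                wins += 1.0
            elif p == b:
                wins += 0.5
    return wins / (len(presence_scores) * len(background_scores))


# One partition of 100 hypothetical record ids; repeating with other seeds
# gives the "repeated partitions" of the abstract.
train_ids, test_ids = split_75_25(range(100), seed=1)
perfect = auc([0.9, 0.8], [0.1, 0.2])
uninformative = auc([0.9, 0.3], [0.5])
```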

10.
Epi-macrobenthic species richness, abundance and composition are linked with type, assemblage and structural complexity of seabed habitat within coastal ecosystems. However, the evaluation of these habitats is highly hindered by limitations related to both waterborne surveys (slow acquisition, shallow water and low reactivity) and water clarity (turbid for most coastal areas). Substratum type/diversity and bathymetric features were elucidated using a supervised method applied to airborne bathymetric LiDAR waveforms over Saint-Siméon–Bonaventure's nearshore area (Gulf of Saint-Lawrence, Québec, Canada). High-resolution underwater photographs were taken at three hundred stations across an 8 km² study area. Seven models based upon state-of-the-art machine learning techniques such as Naïve Bayes, Regression Tree, Classification Tree, C 4.5, Random Forest, Support Vector Machine, and CN2 learners were tested for predicting eight epi-macrobenthic species diversity metrics as a function of the class number. The Random Forest outperformed other models with a three-discretized Simpson index applied to epi-macrobenthic communities, explaining 69% (Classification Accuracy) of its variability by mean bathymetry, time range and skewness derived from the LiDAR waveform. Corroborating marine ecological theory, areas with low Simpson epi-macrobenthic diversity responded to low water depths, high skewness and time range, whereas higher Simpson diversity relied upon deeper bottoms (correlated with stronger hydrodynamics) and low skewness and time range. The degree of species heterogeneity was therefore positively linked with the degree of the structural complexity of the benthic cover. This work underpins that fully exploited bathymetric LiDAR (not only bathymetrically derived by-products), coupled with a proficient machine learner, is able to rapidly predict habitat characteristics at a spatial resolution relevant to epi-macrobenthos diversity, ranging from clear to turbid waters.
This method might serve both to nurture marine ecological theory and to manage areas with high species heterogeneity where navigation is hazardous and water clarity opaque to passive optical sensors.
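
A "three-discretized Simpson index" could be computed along the following lines. The abstract does not say which form of the index or which cut-points were used, so the Gini-Simpson form (1 minus the sum of squared proportions) and the thresholds of 1/3 and 2/3 are assumptions made for illustration.

```python
def simpson_diversity(counts):
    """Gini-Simpson index: 1 - sum(p_i ** 2) over species proportions p_i."""
    total = sum(counts)
    if total == 0:
        raise ValueError("at least one individual is required")
    return 1.0 - sum((count / total) ** 2 for count in counts)


def discretize(value, low_cut=1 / 3, high_cut=2 / 3):
    """Map a diversity value to one of three classes (hypothetical cut-points)."""
    if value < low_cut:
        return "low"
    if value < high_cut:
        return "medium"
    return "high"


# Species counts at two made-up photo stations.
even_station = simpson_diversity([10, 10])   # two equally common species
single_station = simpson_diversity([7])      # one species only
station_classes = [discretize(even_station), discretize(single_station)]
```

The resulting class labels would be the targets that a classifier such as Random Forest predicts from the LiDAR waveform features.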

11.
G-Protein Coupled Receptors (GPCRs) are the largest family of membrane-bound receptors and play a vital role in various biological processes; their amenability to drug intervention makes them a focus of the pharmaceutical industry. Experimental methods are both time-consuming and expensive, so there is a need to develop a computational approach to classification to expedite the drug discovery process. In the present study, a domain-based classification model has been developed by employing and evaluating various machine learning approaches such as Bagging, J48, Bayes net, and Naive Bayes. Various software tools are available for predicting domains, but the results and accuracy of their output for the same input vary, so it is difficult to choose any one of them. To address this problem, a simulation model has been developed using five well-known software tools for domain prediction to explore the best predicted result with maximum accuracy. The classifier is developed for classification up to 3 levels for class A. Accuracies of 98.59% by Naïve Bayes for level I, 92.07% by J48 for level II and 82.14% by Bagging for level III have been achieved.

12.
The need for accurate, automated protein classification methods continues to increase as advances in biotechnology uncover new proteins. G-protein coupled receptors (GPCRs) are a particularly difficult superfamily of proteins to classify due to extreme diversity among its members. Previous comparisons of BLAST, k-nearest neighbor (k-NN), hidden Markov model (HMM) and support vector machine (SVM) using alignment-based features have suggested that classifiers at the complexity of SVM are needed to attain high accuracy. Here, analogous to document classification, we applied Decision Tree and Naive Bayes classifiers with chi-square feature selection on counts of n-grams (i.e. short peptide sequences of length n) to this classification task. Using the GPCR dataset and evaluation protocol from the previous study, the Naive Bayes classifier attained an accuracy of 93.0 and 92.4% in level I and level II subfamily classification respectively, while SVM has a reported accuracy of 88.4 and 86.3%. This is a 39.7 and 44.5% reduction in residual error for level I and level II subfamily classification, respectively. The Decision Tree, while inferior to SVM, outperforms HMM in both level I and level II subfamily classification. For those GPCR families whose profiles are stored in the Protein FAMilies database of alignments and HMMs (PFAM), our method performs comparably to a search against those profiles. Finally, our method can be generalized to other protein families by applying it to the superfamily of nuclear receptors with 94.5, 97.8 and 93.6% accuracy in family, level I and level II subfamily classification respectively.
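
Two pieces of this abstract lend themselves to a short sketch: counting peptide n-grams as features, and the arithmetic behind the "reduction in residual error" figures. The function names and the toy sequence are illustrative; the accuracy values are the ones quoted in the abstract.

```python
from collections import Counter


def ngram_counts(sequence, n):
    """Count every overlapping length-n substring of an amino acid sequence."""
    return Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))


def residual_error_reduction(new_accuracy_pct, old_accuracy_pct):
    """Percentage by which the error (100 - accuracy) shrinks from old to new."""
    old_error = 100.0 - old_accuracy_pct
    new_error = 100.0 - new_accuracy_pct
    return 100.0 * (old_error - new_error) / old_error


bigrams = ngram_counts("MKTMKT", 2)  # toy sequence, not a real GPCR

# Naive Bayes versus SVM accuracies as quoted in the abstract.
level_1 = round(residual_error_reduction(93.0, 88.4), 1)
level_2 = round(residual_error_reduction(92.4, 86.3), 1)
```

For level I, the error falls from 11.6% to 7.0%, and 4.6 / 11.6 is about 39.7%; for level II it falls from 13.7% to 7.6%, and 6.1 / 13.7 is about 44.5%, matching the reported reductions.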

13.
A Bayesian network classification methodology for gene expression data.   Cited by: 5 (self-citations: 0, citations by others: 5)
We present new techniques for the application of a Bayesian network learning framework to the problem of classifying gene expression data. The focus on classification permits us to develop techniques that address in several ways the complexities of learning Bayesian nets. Our classification model reduces the Bayesian network learning problem to the problem of learning multiple subnetworks, each consisting of a class label node and its set of parent genes. We argue that this classification model is more appropriate for the gene expression domain than are other structurally similar Bayesian network classification models, such as Naive Bayes and Tree Augmented Naive Bayes (TAN), because our model is consistent with prior domain experience suggesting that a relatively small number of genes, taken in different combinations, is required to predict most clinical classes of interest. Within this framework, we consider two different approaches to identifying parent sets which are supported by the gene expression observations and any other currently available evidence. One approach employs a simple greedy algorithm to search the universe of all genes; the second approach develops and applies a gene selection algorithm whose results are incorporated as a prior to enable an exhaustive search for parent sets over a restricted universe of genes. Two other significant contributions are the construction of classifiers from multiple, competing Bayesian network hypotheses and algorithmic methods for normalizing and binning gene expression data in the absence of prior expert knowledge. Our classifiers are developed under a cross validation regimen and then validated on corresponding out-of-sample test sets. The classifiers attain a classification rate in excess of 90% on out-of-sample test sets for two publicly available datasets. We present an extensive compilation of results reported in the literature for other classification methods run against these same two datasets. 
Our results are comparable to, or better than, any we have found reported for these two sets, when a train-test protocol as stringent as ours is followed.

14.
Spatial mapping of the marine environment is challenging when the properties concerned are difficult to measure except by shore-based analysis of discrete samples of material, usually from sparsely distributed sites. This is the case for many seabed sediment properties. We developed an indirect approach to mapping the organic content of coastal sediments from hydro-acoustic reflectance data. The basis was that both organic matter and acoustic reflectance are related to sediment type and grain size composition. Hence there is a collateral relationship between organic matter content and reflectance properties which can be exploited to enable high resolution mapping. We surveyed an area of seabed off the east coast of Scotland using a vessel mounted single beam echosounder with RoxAnn signal processing. Organic carbon, nitrogen and phytoplankton pigment contents were then measured in material from grab and core samples collected at intervals over a year. Relationships between the organic components and hydro-acoustic characteristics were derived by general additive models, and used to construct high resolution maps from the acoustic survey data. Our method is an advance on traditional interpolation techniques for sparse spatial data, and represents a generic approach that could be applied to other properties.

15.
Epilepsy is a neurological disorder characterized by the presence of recurring seizures. Like many other neurological disorders, epilepsy can be assessed by the electroencephalogram (EEG). The EEG signal is highly non-linear and non-stationary, and hence, it is difficult to characterize and interpret it. However, it is a well-established clinical technique with low associated costs. In this work, we propose a methodology for the automatic detection of normal, pre-ictal, and ictal conditions from recorded EEG signals. Four entropy features namely Approximate Entropy (ApEn), Sample Entropy (SampEn), Phase Entropy 1 (S1), and Phase Entropy 2 (S2) were extracted from the collected EEG signals. These features were fed to seven different classifiers: Fuzzy Sugeno Classifier (FSC), Support Vector Machine (SVM), K-Nearest Neighbour (KNN), Probabilistic Neural Network (PNN), Decision Tree (DT), Gaussian Mixture Model (GMM), and Naive Bayes Classifier (NBC). Our results show that the Fuzzy classifier was able to differentiate the three classes with a high accuracy of 98.1%. Overall, compared to previous techniques, our proposed strategy is more suitable for diagnosis of epilepsy with higher accuracy.
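
Of the four entropy features, Sample Entropy is the simplest to sketch. The version below follows the usual definition (negative log of the ratio of matching templates of length m + 1 to those of length m, under the Chebyshev distance, excluding self-matches). The defaults `m=2` and an absolute tolerance `r=0.2` are common conventions assumed here; the abstract does not give the parameters used.

```python
import math


def sample_entropy(signal, m=2, r=0.2):
    """Sample Entropy of a 1-D sequence with absolute tolerance r (assumed defaults)."""
    n = len(signal)
    if n <= m + 1:
        raise ValueError("signal is too short for the chosen m")

    def matching_pairs(length):
        # Use the same n - m starting points for both template lengths.
        templates = [signal[i:i + length] for i in range(n - m)]
        pairs = 0
        for i in range(len(templates)):
            for j in range(i + 1, len(templates)):
                distance = max(abs(a - b) for a, b in zip(templates[i], templates[j]))
                if distance <= r:
                    pairs += 1
        return pairs

    b = matching_pairs(m)
    a = matching_pairs(m + 1)
    if a == 0 or b == 0:
        return float("inf")  # no matches: entropy is undefined, reported as infinite
    return -math.log(a / b)


flat = sample_entropy([1.0] * 10)            # perfectly regular signal
alternating = sample_entropy([0.0, 1.0] * 5)  # perfectly periodic signal
```

Both toy signals are fully predictable, so every template match of length m extends to length m + 1 and the entropy is zero; an irregular EEG segment would give a larger value.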

16.
Multibeam surveys were carried out in four areas to the west of Scotland where the coral Lophelia pertusa had previously been recorded. Distinctive seabed mounds were found in one area; video images from the mounds showed coral reef formation, and grab samples recovered L. pertusa reef framework and rubble. Skeleton samples were dated to 3,800 years BP. Grab samples contained 123 species of fauna. The reef structures, termed the Mingulay Reef Complex, were identified as topographic mound-like structures from the bathymetric data and were also visible on the backscatter images. The location of the reefs coincides with Atlantic bottom waters, close to a primary productivity centre and mixing zone, in an area where currents are likely to be accelerated by rocky seafloor ridges. This study shows that multibeam echosounders are powerful tools to locate and map deep-water coral reefs irrespective of water depth. Electronic Supplementary Material Supplementary material is available for this article at and is accessible for authorized users.

17.
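Machine learning algorithms and their applications in environmental microbiology research   Cited by: 1 (self-citations: 0, citations by others: 1)
陈鹤  陶晔  毛振镀  邢鹏 Acta Microbiologica Sinica (《微生物学报》) 2022,62(12):4646-4662
Microorganisms are ubiquitous in the environment. They are key participants in biogeochemical cycles and environmental evolution, and they also play important roles in environmental monitoring, ecological remediation and protection. With the development of high-throughput technologies, large volumes of microbial data are being generated. Using machine learning to model and analyze environmental microbial big data is of great significance for both scientific research and societal applications in areas such as microbial biomarker identification, pollutant prediction and environmental quality prediction. Machine learning can be divided into two broad categories: supervised learning and unsupervised learning. In microbiome research, unsupervised learning efficiently learns the features of the input data through methods such as clustering and dimensionality reduction, and thereby integrates and groups microbial data. Supervised learning trains a model on microbial datasets that have both features and labels; when presented with data that have features but no labels, the model can infer the labels, enabling classification, identification and prediction of new data. However, complex machine learning algorithms usually focus on predictive accuracy at the expense of interpretability. Machine learning models can often be regarded as "black boxes" that predict a particular outcome; that is, little is known about how the model arrives at its predictions. To apply machine learning more widely in microbiome research and improve our ability to extract valuable microbial information, it is particularly important to understand machine learning algorithms in depth and to improve model interpretability. This paper introduces the machine learning algorithms commonly used in environmental microbiology and the steps for building machine learning models from microbiome data, including feature selection, algorithm selection, model construction and evaluation. It reviews applications of various machine learning models in environmental microbiology, explores the associations between microbiomes and their surrounding environment, discusses methods for improving model interpretability, and provides a scientific reference for future environmental monitoring and environmental health prediction.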

18.
Today's acoustic monitoring devices are capable of recording and storing tremendous amounts of data. Until recently, the classification of animal vocalizations from field recordings has been relegated to qualitative approaches. For large-scale acoustic monitoring studies, qualitative approaches are very time-consuming and suffer from the bias of subjectivity. Recent developments in supervised learning techniques can provide rapid, accurate, species-level classification of bioacoustics data. We compared the classification performances of four supervised learning techniques (random forests, support vector machines, artificial neural networks, and discriminant function analysis) for five different classification tasks using bat echolocation calls recorded by a popular frequency-division bat detector. We found that all classifiers performed similarly in terms of overall accuracy with the exception of discriminant function analysis, which had the lowest average performance metrics. Random forests had the advantage of high sensitivities, specificities, and predictive powers across the majority of classification tasks, and also provided metrics for determining the relative importance of call features in distinguishing between groups. Overall classification accuracy for each task was slightly lower than reported accuracies using calls recorded by time-expansion detectors. Myotis spp. were particularly difficult to separate; classifiers performed best when members of this genus were combined in genus-level classification and analyzed separately at the level of species. Additionally, we identified and ranked the relative contributions of all predictor features to classifier accuracy and found measurements of frequency, total call duration, and characteristic slope to be the most important contributors to classification success. 
We provide recommendations to maximize accuracy and efficiency when analyzing acoustic data, and suggest an application of automated bioacoustics monitoring to contribute to wildlife monitoring efforts.
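
The performance metrics this abstract compares can all be derived from a binary confusion matrix, as in the sketch below. Reading "predictive powers" as positive and negative predictive values is an assumption, and the counts are made up.

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard metrics from true/false positive and negative counts."""
    return {
        "sensitivity": tp / (tp + fn),   # share of true calls of a species detected
        "specificity": tn / (tn + fp),   # share of other calls correctly rejected
        "ppv": tp / (tp + fp),           # positive predictive value
        "npv": tn / (tn + fn),           # negative predictive value
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


# Hypothetical one-species-versus-rest result for 100 recorded calls.
metrics = binary_metrics(tp=40, fp=10, tn=45, fn=5)
```

For a multi-species task, the same function can be applied once per species, treating that species as the positive class.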

19.
This paper focuses on the problem of selecting relevant features extracted from human polysomnographic (PSG) signals to perform accurate sleep/wake stages classification. Extraction of various features from the electroencephalogram (EEG), the electro-oculogram (EOG) and the electromyogram (EMG) processed in the frequency and time domains was achieved using a database of 47 night sleep recordings obtained from healthy adults in laboratory settings. Multiple iterative feature selection and supervised classification methods were applied together with a systematic statistical assessment of the classification performances. Our results show that using a simple set of features such as relative EEG powers in five frequency bands yields an agreement of 71% with the whole database classification of two human experts. These performances are within the range of existing classification systems. The addition of features extracted from the EOG and EMG signals makes it possible to reach about 80% agreement with the expert classification. The most significant improvement in classification accuracy is obtained on NREM sleep stage I, a stage of transition between sleep and wakefulness.

20.
MOTIVATION: Obtaining soluble proteins in sufficient concentrations is a recurring limiting factor in various experimental studies. Solubility is an individual trait of proteins which, under a given set of experimental conditions, is determined by their amino acid sequence. Accurate theoretical prediction of solubility from sequence is instrumental for setting priorities on targets in large-scale proteomics projects. RESULTS: We present a machine-learning approach called PROSO to assess the chance of a protein to be soluble upon heterologous expression in Escherichia coli based on its amino acid composition. The classification algorithm is organized as a two-layered structure in which the output of primary support vector machine (SVM) classifiers serves as input for a secondary Naive Bayes classifier. Experimental progress information from the TargetDB database as well as previously published datasets were used as the source of training data. In comparison with previously published methods our classification algorithm possesses improved discriminatory capacity characterized by the Matthews Correlation Coefficient (MCC) of 0.434 between predicted and known solubility states and the overall prediction accuracy of 72% (75 and 68% for positive and negative class, respectively). We also provide experimental verification of our predictions using solubility measurements for 31 mutational variants of two different proteins.
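
For readers unfamiliar with the Matthews Correlation Coefficient quoted above, the following sketch computes it from a confusion matrix. The counts are illustrative and are not PROSO's results; returning 0.0 when the denominator vanishes is a common convention assumed here.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC = (tp*tn - fp*fn) / sqrt((tp+fp)(tp+fn)(tn+fp)(tn+fn))."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denominator


perfect = matthews_corrcoef(tp=50, tn=50, fp=0, fn=0)
chance = matthews_corrcoef(tp=25, tn=25, fp=25, fn=25)
inverted = matthews_corrcoef(tp=0, tn=0, fp=50, fn=50)
```

MCC ranges from -1 (every prediction wrong) through 0 (no better than chance) to +1 (every prediction right), so the reported 0.434 indicates a moderate positive association between predicted and known solubility.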


Copyright©北京勤云科技发展有限公司  京ICP备09084417号