Similar Articles
20 similar articles retrieved (search time: 31 ms)
1.
Tropical forests are significant carbon sinks and their soils' carbon storage potential is immense. However, little is known about the soil organic carbon (SOC) stocks of tropical mountain areas, whose complex soil-landscape and difficult accessibility pose a challenge to spatial analysis. The choice of methodology for spatial prediction is of high importance to improve the expected poor model results in case of low predictor-response correlations. Four aspects were considered to improve model performance in predicting SOC stocks of the organic layer of a tropical mountain forest landscape: different spatial predictor settings, predictor selection strategies, various machine learning algorithms and model tuning. Five machine learning algorithms (random forests, artificial neural networks, multivariate adaptive regression splines, boosted regression trees and support vector machines) were trained and tuned to predict SOC stocks from predictors derived from a digital elevation model and satellite image. Topographical predictors were calculated with a GIS search radius of 45 to 615 m. Finally, three predictor selection strategies were applied to the total set of 236 predictors. All machine learning algorithms, including the model tuning and predictor selection, were compared via five repetitions of a tenfold cross-validation. The boosted regression tree algorithm resulted in the overall best model. SOC stocks ranged from 0.2 to 17.7 kg m⁻², displaying huge variability, with diffuse insolation and curvatures of different scales guiding the spatial pattern. Predictor selection and model tuning improved the predictive performance of all five machine learning algorithms. The rather low number of selected predictors favours forward over backward selection procedures. Choosing predictors on the basis of their individual performance was outperformed by the two procedures that accounted for predictor interaction.
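A minimal sketch of the comparison protocol described above (five repetitions of tenfold cross-validation across several tuned learners), using scikit-learn estimators as stand-ins rather than the authors' software; the predictor matrix X and SOC-stock vector y are placeholders, and only three of the five algorithms are shown.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.svm import SVR

X, y = np.random.rand(200, 236), np.random.rand(200)  # placeholder predictors / SOC stocks

# Five repetitions of tenfold cross-validation, as in the study design.
cv = RepeatedKFold(n_splits=10, n_repeats=5, random_state=0)
models = {
    "boosted regression trees": GradientBoostingRegressor(),
    "random forest": RandomForestRegressor(),
    "support vector machine": SVR(),
}
for name, model in models.items():
    rmse = -cross_val_score(model, X, y, cv=cv,
                            scoring="neg_root_mean_squared_error")
    print(f"{name}: RMSE {rmse.mean():.3f} +/- {rmse.std():.3f}")
```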

2.
Yasui Y, Pepe M, Hsu L, Adam BL, Feng Z. Biometrics 2004, 60(1):199-206
Training data in a supervised learning problem consist of the class label and its potential predictors for a set of observations. Constructing effective classifiers from training data is the goal of supervised learning. In biomedical sciences and other scientific applications, class labels may be subject to errors. We consider a setting where there are two classes, but observations with labels corresponding to one of the classes may in fact be mislabeled. The application concerns the use of protein mass-spectrometry data to discriminate between serum samples from cancer and noncancer patients. The patients in the training set are classified on the basis of tissue biopsy. Although biopsy is 100% specific, in the sense that tissue shown to contain malignant cells is certainly cancerous, it is less than 100% sensitive. Reference gold standards that are subject to this special type of misclassification due to imperfect diagnostic certainty arise in many fields. We consider the development of a supervised learning algorithm under these conditions and refer to it as partially supervised learning. Boosting is a supervised learning algorithm geared toward high-dimensional predictor data, such as those generated in protein mass-spectrometry. We propose a modification of the boosting algorithm for partially supervised learning: the true class membership of the samples that are labeled with the error-prone class label is viewed as missing data, and an algorithm related to the EM algorithm is applied to minimize a loss function. To assess the usefulness of the proposed method, we artificially mislabeled a subset of samples and applied the original and EM-modified boosting (EM-Boost) algorithms for comparison. Notable improvements in misclassification rates are observed with EM-Boost.
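A hedged sketch of the EM idea described above, not the authors' EM-Boost: samples carrying the error-prone label (coded y == 0) are duplicated with fractional class-membership weights, which are re-estimated from the current boosted model in each pass. The function name em_boost_sketch and its settings are illustrative only.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def em_boost_sketch(X, y, n_iter=5):
    """y == 1: reliable labels (biopsy-confirmed); y == 0: error-prone labels."""
    noisy = y == 0
    # Each error-prone sample also appears once with the alternative label 1.
    X_aug = np.vstack([X, X[noisy]])
    y_aug = np.concatenate([y, np.ones(noisy.sum())])
    p1 = np.full(noisy.sum(), 0.1)  # initial guess of P(true class = 1)
    clf = GradientBoostingClassifier()
    for _ in range(n_iter):
        # M-step: weighted boosting fit with fractional membership weights.
        w = np.ones(len(y), dtype=float)
        w[noisy] = 1.0 - p1
        clf.fit(X_aug, y_aug, sample_weight=np.concatenate([w, p1]))
        # E-step: re-estimate class membership of the error-prone samples.
        p1 = clf.predict_proba(X[noisy])[:, 1]
    return clf
```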

3.
Flow cytometry is a widely used technique for the analysis of cell populations in the study and diagnosis of human diseases. It yields large amounts of high-dimensional data, the analysis of which would clearly benefit from efficient computational approaches aiming at automated diagnosis and decision support. This article presents our analysis of flow cytometry data in the framework of the DREAM6/FlowCAP2 Molecular Classification of Acute Myeloid Leukemia (AML) Challenge, 2011. In the challenge, example data was provided for a set of 179 subjects, comprising healthy donors and 23 cases of AML. The participants were asked to provide predictions with respect to the condition of 180 patients in a test set. We extracted feature vectors from the data in terms of single-marker statistics, including characteristic moments, median and interquartile range of the observed values. Subsequently, we applied Generalized Matrix Relevance Learning Vector Quantization (GMLVQ), a machine learning technique which extends standard LVQ by an adaptive distance measure. Our method achieved the best possible performance with respect to the diagnoses of the test set patients. The extraction of features from the flow cytometry data is outlined in detail, the machine learning approach is discussed and classification results are presented. In addition, we illustrate how GMLVQ can provide deeper insight into the problem by allowing the relevance of specific markers and features for the diagnosis to be inferred.
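A brief sketch of the single-marker feature extraction outlined above: per-subject summary statistics (moments, median, interquartile range) are computed for every marker. The name marker_features is illustrative, and the exact statistics used in the challenge entry may differ.

```python
import numpy as np
from scipy import stats

def marker_features(events):
    """events: per-subject flow cytometry readings, shape (n_cells, n_markers)."""
    feats = []
    for m in range(events.shape[1]):
        x = events[:, m]
        q75, q25 = np.percentile(x, [75, 25])
        feats += [x.mean(), x.std(), stats.skew(x), stats.kurtosis(x),
                  np.median(x), q75 - q25]  # moments, median, interquartile range
    return np.asarray(feats)
```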

4.
Matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF MS) is an emerging high-throughput technique that has been widely applied to the rapid identification of clinical, food-borne and aquatic microorganisms. How to further improve the resolution of MALDI-TOF MS in microbial identification is a major challenge currently facing the technique. To process large volumes of high-dimensional microbial MALDI-TOF MS data efficiently, a variety of machine learning algorithms have been applied. This article reviews the application of machine learning to microbial identification by MALDI-TOF MS. After introducing the machine learning workflow for microbial MALDI-TOF MS classification, it describes the characteristics of MALDI-TOF MS data, MALDI-TOF MS databases, data preprocessing and model performance evaluation. It then discusses the application of typical machine learning classification algorithms and ensemble learning algorithms. Simple machine learning algorithms can hardly meet the high-resolution requirements of microbial MALDI-TOF MS classification, whereas combining different machine learning algorithms or using ensemble learning can achieve better classification performance. For the preprocessing of MALDI-TOF MS data, wavelet algorithms and genetic algorithms are the most widely used; they…

5.
6.
The recent explosion in procurement and availability of high-dimensional gene- and protein-expression profile datasets for cancer diagnostics has necessitated the development of sophisticated machine learning tools with which to analyze them. A major limitation in the ability to accurately classify these high-dimensional datasets stems from the 'curse of dimensionality', which occurs in situations where the number of genes or peptides significantly exceeds the total number of patient samples. Previous attempts at dealing with this issue have mostly centered on the use of a dimensionality reduction (DR) scheme, Principal Component Analysis (PCA), to obtain a low-dimensional projection of the high-dimensional data. However, linear PCA and other linear DR methods, which rely on Euclidean distances to estimate object similarity, do not account for the inherent underlying nonlinear structure associated with most biomedical data. The motivation behind this work is to identify the appropriate DR methods for the analysis of high-dimensional gene- and protein-expression studies. Towards this end, we empirically and rigorously compare three nonlinear (Isomap, Locally Linear Embedding, Laplacian Eigenmaps) and three linear DR schemes (PCA, Linear Discriminant Analysis, Multidimensional Scaling) with the intent of determining a reduced subspace representation in which the individual object classes are more easily discriminable.
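A compact sketch of such a comparison with scikit-learn; X is a placeholder expression matrix, supervised LDA is omitted since it additionally needs class labels, and SpectralEmbedding is scikit-learn's implementation of Laplacian eigenmaps.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import MDS, Isomap, LocallyLinearEmbedding, SpectralEmbedding

X = np.random.rand(100, 2000)  # placeholder gene/protein expression profiles

reducers = {
    "PCA": PCA(n_components=2),
    "MDS": MDS(n_components=2),
    "Isomap": Isomap(n_components=2),
    "LLE": LocallyLinearEmbedding(n_components=2),
    "Laplacian eigenmaps": SpectralEmbedding(n_components=2),
}
# Low-dimensional embeddings in which class discriminability can be assessed.
embeddings = {name: r.fit_transform(X) for name, r in reducers.items()}
```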

7.
Building an accurate disease risk prediction model is an essential step in the modern quest for precision medicine. While high-dimensional genomic data provide valuable resources for investigating disease risk, their huge amount of noise and the complex relationships between predictors and outcomes have brought tremendous analytical challenges. Deep learning models are the state-of-the-art approach for many prediction tasks and a promising framework for the analysis of genomic data. However, deep learning models generally suffer from the curse of dimensionality and a lack of biological interpretability, both of which have greatly limited their applications. In this work, we have developed a deep neural network (DNN) based prediction modeling framework. We first proposed a group-wise feature importance score for feature selection, through which genes harboring genetic variants with both linear and non-linear effects are efficiently detected. We then designed an explainable transfer-learning-based DNN method, which can directly incorporate information from feature selection and accurately capture complex predictive effects. The proposed DNN framework is biologically interpretable, as it is built on the selected predictive genes. It is also computationally efficient and can be applied to genome-wide data. Through extensive simulations and real data analyses, we have demonstrated that our proposed method can not only efficiently detect predictive features but also accurately predict disease risk, compared to many existing methods.
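A hedged sketch of a group-wise feature importance score in the spirit described above, not the authors' definition: all variants mapped to one gene are permuted jointly, and the resulting drop in validation performance serves as the gene-level score. The names group_importance, groups and metric are illustrative.

```python
import numpy as np

def group_importance(model, X_val, y_val, groups, metric):
    """groups: dict mapping gene name -> column indices of its variants."""
    base = metric(y_val, model.predict(X_val))
    rng = np.random.default_rng(0)
    scores = {}
    for gene, cols in groups.items():
        Xp = X_val.copy()
        # Permute the gene's columns jointly to preserve within-gene structure.
        Xp[:, cols] = Xp[rng.permutation(len(Xp))][:, cols]
        scores[gene] = base - metric(y_val, model.predict(Xp))
    return scores
```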

8.
People often deviate from their individual Nash equilibrium strategy in game experiments based on the prisoner's dilemma (PD) game and the public goods game (PGG), whereas conditional cooperation, or conformity, is supported by the data from these experiments. In a complicated environment with no obvious "dominant" strategy, conformists who choose the average strategy of the other players in their group can avoid risk by guaranteeing that their income will be close to the group average. In this paper, we study the repeated PD game and the repeated m-person PGG, where individuals' strategies are restricted to the set of conforming strategies. We define a conforming strategy by two parameters: the initial action in the game and the influence of the other players' choices in the previous round. We are particularly interested in the tit-for-tat (TFT) strategy, a well-known conforming strategy in theoretical and empirical studies. In both the PD game and the PGG, TFT can prevent the invasion of a non-cooperative strategy if the expected number of rounds exceeds a critical value. The stability analysis of adaptive dynamics shows that conformity in general promotes the evolution of cooperation, and that a regime of cooperation can be established in an AllD population through TFT-like strategies. These results provide insight into the emergence of cooperation in social dilemma games.
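As a worked illustration of the critical-value claim above, using the standard one-shot PD payoff labels T > R > P > S (an assumption here, not the paper's notation): over m rounds, TFT against TFT earns mR, while AllD against TFT earns T + (m-1)P, so TFT resists invasion by AllD when

```latex
% TFT vs. AllD over m rounds; standard PD payoffs T > R > P > S (illustrative notation)
mR > T + (m-1)P \quad\Longleftrightarrow\quad m > \frac{T-P}{R-P}
```

For example, with T = 5, R = 3, P = 1, the critical value is (5-1)/(3-1) = 2, so TFT is stable once the expected game length exceeds two rounds.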

9.
10.
Because of high dimensionality, machine learning algorithms typically rely on feature selection techniques in order to perform effective classification in microarray gene expression data sets. However, the large number of features compared to the number of samples makes the task of feature selection computationally hard and prone to errors. This paper interprets feature selection as a task of stochastic optimization, where the goal is to select, among an exponential number of alternative gene subsets, the one expected to return the highest generalization in classification. Blocking is an experimental design strategy which produces similar experimental conditions to compare alternative stochastic configurations, in order to be confident that observed differences in accuracy are due to actual differences rather than to fluctuations and noise effects. We propose an original blocking strategy for improving feature selection which aggregates in a paired way the validation outcomes of several learning algorithms to assess a gene subset and compare it to others. This is a novelty with respect to conventional wrappers, which commonly adopt a single learning algorithm to evaluate the relevance of a given set of variables. The rationale of the approach is that, by increasing the number of experimental conditions under which we validate a feature subset, we can lessen the problems related to the scarcity of samples and consequently come up with a better selection. The paper shows that the blocking strategy significantly improves the performance of a conventional forward selection for a set of 16 publicly available cancer expression data sets. The experiments involve six different classifiers and show that improvements take place independently of the classification algorithm used after the selection step. Two further validations based on available biological annotation support the claim that blocking strategies in feature selection may improve the accuracy and the quality of the solution. The first validation is based on retrieving PubMed abstracts associated with the selected genes and matching them to regular expressions describing the biological phenomenon underlying the expression data sets. The biological validation that follows is based on the use of the Bioconductor package GOstats to perform Gene Ontology statistical analysis.
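A minimal sketch of the blocking idea, with scikit-learn learners standing in for the classifiers of the study: a candidate gene subset is validated by several learning algorithms on identical cross-validation folds, so that outcomes can be aggregated in a paired way. The name blocked_score is illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

def blocked_score(X, y, subset, seed=0):
    """Paired (blocked) validation of one gene subset across several learners."""
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)  # shared folds
    learners = [LogisticRegression(max_iter=1000), GaussianNB(),
                KNeighborsClassifier()]
    # scores[i][j]: accuracy of learner i on fold j; folds are identical across
    # learners, so rival subsets are compared under like experimental conditions.
    scores = [cross_val_score(clf, X[:, subset], y, cv=cv) for clf in learners]
    return float(np.mean(scores))
```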

11.
An increasing number of genes have been experimentally confirmed in recent years as causative genes of various human diseases. The newly available knowledge can be exploited by machine learning methods to discover additional unknown genes that are likely to be associated with diseases. In particular, positive unlabeled learning (PU learning) methods, which require only a positive training set P (confirmed disease genes) and an unlabeled set U (the unknown candidate genes) instead of a negative training set N, have been shown to be effective in uncovering new disease genes in this scenario. Using only a single source of data for prediction is susceptible to bias due to incompleteness and noise in the genomic data, and a single machine learning predictor is prone to bias caused by the inherent limitations of individual methods. In this paper, we propose an effective PU learning framework that integrates multiple biological data sources and an ensemble of powerful machine learning classifiers for disease gene identification. Our proposed method integrates data from multiple biological sources for training PU learning classifiers. A novel ensemble-based PU learning method, EPU, is then used to integrate multiple PU learning classifiers to achieve accurate and robust disease gene predictions. Our evaluation experiments across six disease groups showed that EPU achieved significantly better results than various state-of-the-art prediction methods as well as ensemble learning classifiers. By integrating multiple biological data sources for training and the outputs of an ensemble of PU learning classifiers for prediction, we are able to minimize the potential bias and errors in individual data sources and machine learning algorithms, and so achieve more accurate and robust disease gene predictions. EPU thus provides an effective framework for integrating additional biological and computational resources for better disease gene prediction.
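A hedged sketch of ensemble-style PU learning in the spirit of, though much simpler than, EPU: unlabeled genes are repeatedly subsampled as tentative negatives, a base classifier is trained against the positive set each round, and the averaged scores rank candidate disease genes. The name pu_bagging_scores is illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def pu_bagging_scores(X_pos, X_unl, n_rounds=50, seed=0):
    """Returns one score per unlabeled gene; higher = more likely disease gene."""
    rng = np.random.default_rng(seed)
    votes = np.zeros(len(X_unl))
    for _ in range(n_rounds):
        # Sample tentative negatives from the unlabeled set U.
        idx = rng.choice(len(X_unl), size=len(X_pos), replace=False)
        X = np.vstack([X_pos, X_unl[idx]])
        y = np.r_[np.ones(len(X_pos)), np.zeros(len(X_pos))]
        clf = DecisionTreeClassifier().fit(X, y)
        votes += clf.predict_proba(X_unl)[:, 1]
    return votes / n_rounds
```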

12.

Background

The problems of correlation and classification are long-standing in the fields of statistics and machine learning, and techniques have been developed to address them. We are now in the era of high-dimensional data, that is, data that can concern billions of variables. These data present new challenges. In particular, it is difficult to discover predictive variables when each variable has little marginal effect. An example concerns Genome-wide Association Study (GWAS) datasets, which involve millions of single nucleotide polymorphisms (SNPs), where some of the SNPs interact epistatically to affect disease status. Towards determining these interacting SNPs, researchers developed techniques that addressed this specific problem. However, the problem is more general, and so these techniques are applicable to other problems concerning interactions. A difficulty with many of these techniques is that they do not distinguish whether a learned interaction is actually an interaction or whether it involves several variables with strong marginal effects.

Methodology/Findings

We address this problem using information gain and Bayesian network scoring. First, we identify candidate interactions by determining whether variables together provide more information than they do separately. Then we use Bayesian network scoring to see whether a candidate interaction really is a likely model. Our strategy is called MBS-IGain. Using 100 simulated datasets and a real GWAS Alzheimer's dataset, we investigated the performance of MBS-IGain.

Conclusions/Significance

When analyzing the simulated datasets, MBS-IGain substantially outperformed nine previous methods both at locating interacting predictors and at identifying interactions exactly. When analyzing the real Alzheimer's dataset, we obtained new results as well as results that substantiated previous findings. We conclude that MBS-IGain is highly effective at finding interactions in high-dimensional datasets. This result is significant because high-dimensional data are increasingly abundant in many domains, and to learn causes and perform prediction/classification using these data, we often must first identify interactions.
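A small sketch of the information-gain screen, the first step of MBS-IGain as described above, under the assumption of a discrete 0/1/2 genotype coding; the subsequent Bayesian network scoring step is not shown.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def interaction_gain(s1, s2, y):
    """s1, s2: SNP genotype vectors coded 0/1/2; y: case-control status."""
    pair = s1 * 3 + s2  # the joint genotype encoded as a single discrete variable
    # Information the pair provides about y beyond the two SNPs taken separately.
    return (mutual_info_score(pair, y)
            - mutual_info_score(s1, y)
            - mutual_info_score(s2, y))
```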

13.
MOTIVATION: Unsupervised analysis of microarray gene expression data attempts to find biologically significant patterns within a given collection of expression measurements. For example, hierarchical clustering can be applied to expression profiles of genes across multiple experiments, identifying groups of genes that share similar expression profiles. Previous work using the support vector machine supervised learning algorithm with microarray data suggests that higher-order features, such as pairwise and tertiary correlations across multiple experiments, may provide significant benefit in learning to recognize classes of co-expressed genes. RESULTS: We describe a generalization of the hierarchical clustering algorithm that efficiently incorporates these higher-order features by using a kernel function to map the data into a high-dimensional feature space. We then evaluate the utility of the kernel hierarchical clustering algorithm using both internal and external validation. The experiments demonstrate that the kernel representation itself is insufficient to provide improved clustering performance. We conclude that mapping gene expression data into a high-dimensional feature space is only a good idea when combined with a learning algorithm, such as the support vector machine, that does not suffer from the curse of dimensionality. AVAILABILITY: Supplementary data at www.cs.columbia.edu/compbio/hiclust. Software source code available by request.
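A minimal sketch of hierarchical clustering in a kernel-induced feature space, consistent with the generalization described above: pairwise feature-space distances are derived from the kernel matrix and passed to a standard linkage routine. The polynomial kernel and average linkage are illustrative choices, and X is a placeholder.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform
from sklearn.metrics.pairwise import polynomial_kernel

X = np.random.rand(50, 100)          # placeholder expression profiles
K = polynomial_kernel(X, degree=3)   # captures higher-order feature correlations
d = np.diag(K)[:, None]
# ||phi(x) - phi(y)||^2 = K(x,x) + K(y,y) - 2 K(x,y)
D = np.sqrt(np.maximum(d + d.T - 2 * K, 0))
Z = linkage(squareform(D, checks=False), method="average")
```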

14.
IRBM 2014, 35(5):244-254
Objective

The overall goal of the study is to detect coronary artery lesions regardless of their nature, calcified or hypo-dense. To avoid explicit modelling of heterogeneous lesions, we adopted an approach based on machine learning using unsupervised or semi-supervised classifiers. The success of classifiers based on machine learning strongly depends on the appropriate choice of features differentiating between lesions and regular appearance. The specific goal of this article is to propose a novel strategy devised to select the best feature set for the classifiers used, out of a given set of candidate features.

Materials and methods

The features are calculated in image planes orthogonal to the artery centerline, and the classifier assigns to each of these cross-sections a label "healthy" or "diseased". The contribution of this article is a feature-selection strategy based on the empirical risk function, which is used as a criterion in the initial feature ranking and in the selection process itself. We have assessed this strategy in association with two classifiers based on the density-level detection approach, which seeks outliers from the distribution corresponding to the regular appearance. The method was evaluated using a total of 13,687 cross-sections extracted from 53 coronary arteries in 15 patients.

Results

Using the feature subset selected by the risk-based strategy, the balanced error rates achieved by the unsupervised and semi-supervised classifiers were 13.5% and 15.4%, respectively. These results were substantially better than the rates achieved using feature subsets selected by supervised strategies. The unsupervised and semi-supervised methods also outperformed supervised classifiers using feature subsets selected by the corresponding supervised strategies.

Discussion

Supervised methods require large data sets annotated by experts, both to select the features and to train the classifiers, and collecting these annotations is time-consuming. With these methods, lesions whose appearance differs from the training data may remain undetected. The lesion-detection problem is highly imbalanced, since healthy cross-sections are usually much more numerous than diseased ones. Training classifiers based on the density-level detection approach needs a small number of annotations or no annotations at all. The same annotations are sufficient to compute the empirical risk and to perform the selection. Therefore, our strategy associated with an unsupervised or semi-supervised classifier requires a considerably smaller number of annotations than conventional supervised selection strategies. The proposed approach is also better suited for highly imbalanced problems and can detect lesions differing from the training set.

Conclusion

The risk-based selection strategy, associated with classifiers using the density-level detection approach, outperformed other strategies and classifiers when used to detect coronary artery lesions. It is well suited for highly imbalanced problems, where the lesions are represented as low-density regions of the feature space, and it can be used in other anomaly detection problems interpretable as a binary classification problem where the empirical risk can be calculated.
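A hedged sketch of a density-level detection classifier for this task, with a one-class SVM standing in for the classifiers of the study: cross-sections falling outside the estimated support of the regular-appearance distribution are flagged as lesions. The name lesion_flags and the parameter nu are illustrative.

```python
import numpy as np
from sklearn.svm import OneClassSVM

def lesion_flags(F_train, F_test, nu=0.1):
    """F_*: cross-section feature matrices (rows = cross-sections)."""
    model = OneClassSVM(nu=nu, kernel="rbf").fit(F_train)
    # Points outside the estimated support of the 'regular appearance'
    # distribution are flagged as diseased cross-sections.
    return model.predict(F_test) == -1
```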

15.
In medical statistics, many alternative strategies are available for building a prediction model based on training data. Prediction models are routinely compared by means of their prediction performance in independent validation data. If only one data set is available for training and validation, then rival strategies can still be compared based on repeated bootstraps of the same data. Often, however, the overall performance of rival strategies is similar, and it is thus difficult to decide on one model. Here, we investigate the variability of the prediction models that results when the same modelling strategy is applied to different training sets. For each modelling strategy we estimate a confidence score based on the same repeated bootstraps. A new decomposition of the expected Brier score is obtained, as well as estimates of population-average confidence scores. The latter can be used to distinguish rival prediction models with similar prediction performances. Furthermore, at the subject level a confidence score may provide useful supplementary information for new patients who want to base a medical decision on predicted risk. The ideas are illustrated and discussed using data from cancer studies, including settings with a high-dimensional predictor space.

16.
Hundreds of 'molecular signatures' have been proposed in the literature to predict patient outcome in clinical settings from high-dimensional data, many of which eventually failed validation. Validation of such molecular research findings is thus becoming an increasingly important branch of clinical bioinformatics. Moreover, in practice well-known clinical predictors are often already available. From a statistical and bioinformatics point of view, little attention has been given to evaluating the added predictive value of a molecular signature when clinical predictors or an established index are available. This article reviews procedures that assess and validate the added predictive value of high-dimensional molecular data. It critically surveys various approaches for the construction of combined prediction models using both clinical and molecular data, for validating added predictive value based on independent data, and for assessing added predictive value using a single data set.

17.
Posture segmentation plays an essential role in human motion analysis. The state-of-the-art method extracts sufficiently high-dimensional features from 3D depth images for each 3D point and learns an efficient body part classifier. However, high-dimensional features are memory-consuming and difficult to handle on large-scale training datasets. In this paper, we propose an efficient two-stage dimension reduction scheme, termed biview learning, to encode two independent views: depth-difference features (DDF) and relative position features (RPF). Biview learning explores the complementary property of DDF and RPF, and uses two stages to learn a compact yet comprehensive low-dimensional feature space for posture segmentation. In the first stage, discriminative locality alignment (DLA) is applied to the high-dimensional DDF to learn a discriminative low-dimensional representation. In the second stage, canonical correlation analysis (CCA) is used to explore the complementary property of RPF and the dimensionality-reduced DDF. Finally, we train a support vector machine (SVM) over the output of CCA. We carefully validate the effectiveness of DLA and CCA as utilized in the two-stage scheme on our 3D human point cloud dataset. Experimental results show that the proposed biview learning scheme significantly outperforms the state-of-the-art method for human posture segmentation.
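A minimal sketch of the second stage described above, assuming the DLA-reduced depth-difference features are already available (scikit-learn has no off-the-shelf DLA): CCA couples the two views, and an SVM is trained on the concatenated canonical variates. The name biview_train is illustrative.

```python
import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.svm import SVC

def biview_train(ddf_lowdim, rpf, labels, n_comp=10):
    """ddf_lowdim: DLA-reduced DDF view; rpf: relative position feature view."""
    cca = CCA(n_components=n_comp).fit(ddf_lowdim, rpf)
    u, v = cca.transform(ddf_lowdim, rpf)        # canonical variates of both views
    clf = SVC(kernel="rbf").fit(np.hstack([u, v]), labels)
    return cca, clf
```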

18.
Neural learning algorithms generally involve a number of identical processing units, which are fully or partially connected, and involve an update function, such as a ramp, a sigmoid or a Gaussian function. Some variations also exist, where units can be heterogeneous, or where an alternative update technique is employed, such as a pulse stream generator. Associated with the connections are numerical values that must be adjusted using a learning rule, governed by parameters that are specific to the learning rule, such as momentum, a learning rate or a temperature, amongst others. Usually, neural learning algorithms involve local updates, and global interaction between units is often discouraged, except in instances where units are fully connected or involve synchronous updates. In all of these instances, concurrency within a neural algorithm cannot be fully exploited without a suitable implementation strategy. A design scheme is described for translating a neural learning algorithm from inception to implementation on a parallel machine using PVM or MPI libraries, or onto programmable logic such as FPGAs. A designer must first describe the algorithm using a specialised Neural Language, from which a Petri net (PN) model is constructed automatically for verification and for building a performance model. The PN model can be used to study issues such as synchronisation points, resource sharing and concurrency within a learning rule. Specialised constructs are provided to enable a designer to express various aspects of a learning rule, such as the number and connectivity of neural nodes, the interconnection strategies, and the information flows required by the learning algorithm. A scheduling and mapping strategy is then used to translate this PN model onto a multiprocessor template. We demonstrate our technique using Kohonen and backpropagation learning rules, implemented on a loosely coupled workstation cluster and on a dedicated parallel machine, with PVM libraries.

19.
The advent of high-throughput sequencing technology has resulted in the ability to measure millions of single-nucleotide polymorphisms (SNPs) from thousands of individuals. Although these high-dimensional data have paved the way for a better understanding of the genetic architecture of common diseases, they have also given rise to challenges in developing computational methods for learning epistatic relationships among genetic markers. We propose a new method, named cuckoo search epistasis (CSE), for identifying significant epistatic interactions in population-based association studies with a case-control design. This method combines a computationally efficient Bayesian scoring function with an evolutionary-based heuristic search algorithm, and can be efficiently applied to high-dimensional genome-wide SNP data. The experimental results on synthetic data sets show that CSE outperforms existing methods, including multifactor dimensionality reduction and Bayesian epistasis association mapping. In addition, on a real genome-wide data set related to Alzheimer's disease, CSE identified SNPs that are consistent with previously reported results, demonstrating the utility of CSE for application to genome-wide data.
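A hedged skeleton of the CSE search loop, heavily simplified (no Lévy flights and no nest abandonment): candidate SNP subsets are mutated and kept when an assumed Bayesian network scoring function bayes_score improves. All names and settings here are illustrative.

```python
import numpy as np

def cse_sketch(n_snps, bayes_score, k=2, n_nests=25, n_iter=200, seed=0):
    """Search for a k-SNP interaction maximizing a Bayesian network score."""
    rng = np.random.default_rng(seed)
    nests = [rng.choice(n_snps, size=k, replace=False) for _ in range(n_nests)]
    scores = [bayes_score(nest) for nest in nests]
    for _ in range(n_iter):
        i = rng.integers(n_nests)
        cand = nests[i].copy()
        cand[rng.integers(k)] = rng.integers(n_snps)  # replace one SNP at random
        if len(set(cand)) == k:                        # keep the SNPs distinct
            s = bayes_score(cand)
            if s > scores[i]:                          # greedy replacement of the nest
                nests[i], scores[i] = cand, s
    best = int(np.argmax(scores))
    return nests[best], scores[best]
```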

20.
Humans and animals are able to learn complex behaviors based on a massive stream of sensory information from different modalities. Early animal studies identified learning mechanisms that are based on reward and punishment, such that animals tend to avoid actions that lead to punishment, whereas rewarded actions are reinforced. However, most algorithms for reward-based learning are only applicable if the dimensionality of the state-space is sufficiently small or its structure is sufficiently simple. The question therefore arises of how the problem of learning on high-dimensional data is solved in the brain. In this article, we propose a biologically plausible, generic two-stage learning system that can be applied directly to raw high-dimensional input streams. The system is composed of a hierarchical slow feature analysis (SFA) network for preprocessing and a simple neural network on top that is trained on the basis of rewards. We demonstrate by computer simulations that this generic architecture is able to learn quite demanding reinforcement learning tasks on high-dimensional visual input streams in a time that is comparable to the time needed when an explicit, highly informative low-dimensional state-space representation is given instead of the high-dimensional visual input. The learning speed of the proposed architecture in a task similar to the Morris water maze task is comparable to that found in experimental studies with rats. This study thus supports the hypothesis that slowness learning is one important unsupervised learning principle utilized in the brain to form efficient state representations for behavioral learning.
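A compact sketch of linear slow feature analysis, the unsupervised principle underlying the architecture above (the paper uses a hierarchical SFA network; this is the plain linear case): directions are sought whose outputs vary as slowly as possible over time, subject to unit variance, via a generalized eigenproblem.

```python
import numpy as np
from scipy.linalg import eigh

def sfa(X, n_features=5):
    """X: multivariate signal, shape (time, dims); returns the slowest projections."""
    X = X - X.mean(axis=0)
    dX = np.diff(X, axis=0)          # temporal derivative of the signal
    A = dX.T @ dX / len(dX)          # second moment of the derivative (slowness)
    B = X.T @ X / len(X)             # signal covariance (unit-variance constraint)
    _, W = eigh(A, B)                # generalized eigenvectors, ascending eigenvalues
    return X @ W[:, :n_features]     # smallest eigenvalues = slowest features
```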
