Similar articles (20 results)
1.
The Buckley–James (BJ) model is a typical semiparametric accelerated failure time model; it is closely related to ordinary least squares and easy to construct. However, the traditional BJ model is built on a linearity assumption and captures only simple linear relationships, so it has difficulty with nonlinear problems. To overcome this difficulty, we develop a novel regression model for right-censored survival data within the BJ learning framework, based on random survival forests (RSF), extreme learning machines (ELM), and the L2 boosting algorithm. The proposed method, referred to as the ELM-based BJ boosting model, first employs RSF for covariate imputation, then builds a new ensemble of ELMs for regression using the L2 boosting scheme, and finally uses the output function of this ELM-based boosting ensemble to replace the linear combination of covariates in the BJ model. Because the logarithm of survival time is fitted to the covariates by the nonparametric ELM-based boosting method instead of least squares, the ELM-based BJ boosting model can capture both linear and nonlinear covariate effects. In both simulation studies and real data applications, in terms of the concordance index and the integrated Brier score, the proposed model outperforms the traditional BJ model, the two BJ boosting models proposed by Wang et al., RSF, and the Cox proportional hazards model.
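The L2 boosting scheme at the core of this model can be sketched in a few lines: each round fits a weak learner to the current squared-error residuals and adds a shrunken copy of its predictions to the running fit. The regression-stump base learner and the toy one-dimensional data below are stand-ins for illustration only (the paper's actual base learners are ELMs and its targets are log survival times):

```python
def fit_stump(x, y):
    """Fit a one-split regression stump on 1-D inputs; return a predictor."""
    best = None
    for split in sorted(set(x)):
        left = [yi for xi, yi in zip(x, y) if xi <= split]
        right = [yi for xi, yi in zip(x, y) if xi > split]
        if not left or not right:
            continue
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((yi - (ml if xi <= split else mr)) ** 2
                  for xi, yi in zip(x, y))
        if best is None or sse < best[0]:
            best = (sse, split, ml, mr)
    _, s, ml, mr = best
    return lambda xi: ml if xi <= s else mr

def l2_boost(x, y, rounds=50, nu=0.1):
    """L2 boosting: accumulate shrunken stump fits to the residuals."""
    fit = [0.0] * len(y)
    learners = []
    for _ in range(rounds):
        resid = [yi - fi for yi, fi in zip(y, fit)]
        g = fit_stump(x, resid)       # weak learner on current residuals
        learners.append(g)
        fit = [fi + nu * g(xi) for fi, xi in zip(fit, x)]
    return lambda xi: sum(nu * g(xi) for g in learners)

x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [0.1, 0.9, 2.1, 2.9, 4.2, 5.0]   # roughly y = x, invented data
model = l2_boost(x, y)
```

The shrinkage factor `nu` slows learning so that many small corrections accumulate into a smooth fit; the paper replaces the stump here with an ELM trained on the residuals.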

2.
The relationship between cervical cancer mortality and soil trace elements in 23 regions of China was investigated. A total of 25 elements (Na, K, Mg, Ca, Sr, Hg, Pb, B, Tm, Th, U, Sn, Hf, Bi, Ta, Te, Mo, Br, I, As, Cr, Cu, Fe, Zn, and Se) were considered. First, the 23 samples were split into a training set of 12 samples and a test set of 11 samples. Then, a combined strategy, genetic algorithm–partial least squares (GA–PLS), was used to pick out five important elements: Br, Ta, Pb, Cr, and As. Afterwards, the classic partial least squares (PLS) model and a least squares support vector machine (LSSVM) model were developed and compared. The results revealed that the LSSVM model significantly outperforms the PLS model, indicating that the combination of GA–PLS and LSSVM can serve as a potential tool for predicting cancer mortality from trace elements.
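A genetic algorithm for feature selection of this kind encodes each candidate element subset as a bit mask and evolves the population toward masks with higher fitness. The sketch below is a minimal, assumption-laden illustration: the data are synthetic (only feature 0 carries signal), and the fitness is a simple correlation-minus-penalty proxy rather than the PLS cross-validation error a real GA–PLS run would use:

```python
import random

random.seed(0)

# Toy data: 8 candidate "elements"; only feature 0 tracks the response.
n = 40
X = [[random.gauss(0, 1) for _ in range(8)] for _ in range(n)]
y = [row[0] * 2.0 + random.gauss(0, 0.1) for row in X]

def fitness(mask):
    """Score a subset by summed |correlation| with y, minus a per-feature
    penalty (a stand-in for the PLS model error used in real GA-PLS)."""
    if not any(mask):
        return -1.0
    score = 0.0
    for j, keep in enumerate(mask):
        if keep:
            xj = [row[j] for row in X]
            mx, my = sum(xj) / n, sum(y) / n
            cov = sum((a - mx) * (b - my) for a, b in zip(xj, y))
            sx = sum((a - mx) ** 2 for a in xj) ** 0.5
            sy = sum((b - my) ** 2 for b in y) ** 0.5
            score += abs(cov / (sx * sy))
    return score - 0.2 * sum(mask)

def evolve(pop_size=20, gens=30, p_mut=0.1):
    """Elitist GA: keep the top half, refill by crossover plus mutation."""
    pop = [[random.random() < 0.5 for _ in range(8)] for _ in range(pop_size)]
    for _ in range(gens):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]
        children = []
        for _ in range(pop_size - len(survivors)):
            a, b = random.sample(survivors, 2)
            cut = random.randrange(1, 8)            # one-point crossover
            child = a[:cut] + b[cut:]
            child = [not g if random.random() < p_mut else g for g in child]
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)

best = evolve()   # bit mask of selected features
```

Because including the informative feature always raises the fitness, the elitist selection reliably retains it while the penalty prunes uninformative features.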

3.
A study of the relationship between support vector machines and neural networks
The support vector machine is a novel machine learning method based on statistical learning theory. Owing to its excellent learning performance, it has become a research focus of the international machine learning community and has been widely used for classification and regression problems. This paper applies the structural risk function to radial basis function (RBF) network learning and discusses the relationship between the support vector regression model and the RBF network. Simulation examples show that the proposed algorithm improves the generalization performance of RBF networks.

4.
Prostate cancer is the most common non-cutaneous malignancy and the second leading cause of cancer mortality in men. The principal goal of this study was to explore the feasibility of applying boosting, coupled with trace element analysis of hair, to accurately distinguish prostate cancer patients from healthy individuals. A total of 113 subjects, 55 healthy men and 58 prostate cancer patients, were enrolled. Based on a special index of variable importance and a forward selection scheme, only nine elements (Zn, Cr, Mg, Ca, Al, P, Cd, Fe, and Mo) were picked out of 20 candidate elements for modeling. As a result, an ensemble classifier consisting of only eight decision stumps achieved an overall accuracy of 98.2%, a sensitivity of 100%, and a specificity of 96.4% on the independent test set, while all subjects in the training set were classified correctly. Integrating boosting with elemental analysis of hair thus appears to be a valuable tool for diagnosing prostate cancer in practice.
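An ensemble of eight decision stumps, as reported above, is exactly what scikit-learn's AdaBoost produces with its default depth-1 base learner and `n_estimators=8`. The hair-element dataset is not public, so the sketch below fabricates a comparable two-class sample in which one "element" concentration is shifted for the cancer class; the feature shift and sizes are assumptions for illustration:

```python
import random
from sklearn.ensemble import AdaBoostClassifier

random.seed(0)

# Simulate 9 element concentrations per subject; class 1 ("cancer")
# gets an elevated first element (a stand-in for, e.g., Zn).
def sample(label):
    base = [random.gauss(1.0, 0.2) for _ in range(9)]
    if label == 1:
        base[0] += 1.0   # hypothetical shift, chosen for clear separation
    return base

X = [sample(0) for _ in range(55)] + [sample(1) for _ in range(58)]
y = [0] * 55 + [1] * 58

# Default base estimator is a depth-1 decision stump; 8 rounds as reported.
clf = AdaBoostClassifier(n_estimators=8).fit(X, y)
acc = clf.score(X, y)   # training accuracy
```

In practice the held-out accuracy, sensitivity, and specificity quoted in the abstract would be computed on an independent test split, not on the training set as in this toy check.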

5.
Ecologists collect their data manually by visiting multiple sampling sites. Since multiple species may occur at these sites, classifying them manually can be a daunting task. Most work in the literature has focused on statistical methods for classifying a single species, with very few studies on classifying multiple species. We also note that classifying multiple species results in a multi-class imbalanced problem. This study proposes a machine learning approach to classify multiple species in population ecology. In particular, bagging classifiers (random forests (RF) and bagging classification trees (bagCART)) and boosting classifiers (boosting classification trees (bootCART), gradient boosting machines (GBM), and adaptive boosting classification trees (AdaBoost)) were evaluated on an imbalanced multiple fish species dataset. The recall and F1-score metrics were used to select the best classifier for the dataset. The bagging classifiers (RF and bagCART) performed well on the imbalanced dataset, while the boosting classifiers (bootCART, GBM, and AdaBoost) performed worse. We found that some machine learning classifiers are sensitive to imbalanced data and hence require resampling to improve their performance. After resampling, the bagging classifiers (RF and bagCART) again outperformed the boosting classifiers (bootCART, GBM, and AdaBoost). The strong performance of the bagging classifiers (RF and bagCART) suggests that they can be used for classifying multiple species in ecological studies.

6.
Foot placement strategy is an essential aspect of studying movements that involve full-body displacement. To move beyond qualitative analysis, this paper provides a foot placement classification and analysis method that can be used in sports, rehabilitation, or ergonomics. The method is based on machine learning using a weighted k-nearest neighbors algorithm. In the learning phase, an observer classifies a set of trials; the algorithm then automatically reproduces this classification on subsequent sets. The method also provides detailed analysis of foot placement strategy, such as estimating the average foot placement for each class or visualizing the variability of strategies. An example applying the method to a manual material handling task demonstrates its usefulness. During the lifting phase, foot placements were classified into four groups: front, contralateral foot behind, ipsilateral foot behind, and parallel. The accuracy of the classification, assessed with a holdout method, is about 97%. In this example, the classification method makes it possible to observe and analyze the handler's foot placement strategies with regard to the performed task.
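The core of the method, a distance-weighted k-nearest-neighbors vote, fits in a dozen lines. The 2-D "foot position" coordinates and class names below are invented stand-ins for the observer-labelled trials:

```python
import math
from collections import defaultdict

def weighted_knn(train, query, k=3):
    """Classify `query` by an inverse-distance-weighted vote of its k
    nearest labelled points; `train` is a list of ((x, y), label) pairs."""
    neighbors = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    votes = defaultdict(float)
    for point, label in neighbors:
        d = math.dist(point, query)
        votes[label] += 1.0 / (d + 1e-9)   # closer neighbours weigh more
    return max(votes, key=votes.get)

# Hypothetical observer-labelled foot placements (normalized coordinates).
train = [((0.0, 0.0), "parallel"), ((0.1, 0.1), "parallel"),
         ((1.0, 1.0), "front"), ((1.1, 0.9), "front")]
label = weighted_knn(train, (0.05, 0.05))
```

The inverse-distance weighting means a query surrounded by two close examples of one class is not outvoted by a more numerous but distant class, which matters when class sizes are uneven across trials.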

7.
Latent class regression on latent factors   总被引:1,自引:0,他引:1  
In public health, psychology, and social science research, many questions concern the relationship between a categorical outcome variable and continuous predictor variables. The focus of this paper is to develop a model for this relationship when both the categorical outcome and the predictor variables are latent (i.e., not directly observable). The model extends the latent class regression model to include regression on latent predictors. Maximum likelihood estimation is used, and two numerical methods for performing it are described: the Monte Carlo expectation–maximization algorithm, and Gaussian quadrature followed by a quasi-Newton algorithm. A simulation study examines the behavior of the model under different scenarios. A data example involving adolescent health is used for demonstration, in which latent classes of eating disorder risk are predicted by the latent factor body satisfaction.

8.
This paper presents a machine learning system for supporting the first task of the manual curation process for biological literature, called triage. We compare the performance of various classification models by experimenting with dataset sampling factors and a set of features, as well as three machine learning algorithms (naive Bayes, support vector machines, and logistic model trees). The results show that the model best suited to the imbalanced datasets of the triage classification task combines domain-relevant features, an under-sampling technique, and the logistic model trees algorithm.
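Random under-sampling, as used above to handle the imbalanced triage data, simply drops majority-class examples until the classes are balanced. A minimal, library-free sketch (the data here are placeholders):

```python
import random

def undersample(X, y, seed=0):
    """Randomly drop majority-class examples until all classes have as
    many examples as the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    m = min(len(rows) for rows in by_class.values())   # smallest class size
    Xb, yb = [], []
    for label, rows in by_class.items():
        for xi in rng.sample(rows, m):                 # keep m per class
            Xb.append(xi)
            yb.append(label)
    return Xb, yb

# Toy imbalanced data: 8 negative documents, 2 positive ones.
X = list(range(10))
y = [0] * 8 + [1] * 2
Xb, yb = undersample(X, y)
```

Under-sampling discards information from the majority class, which is why it is usually paired, as in the paper, with informative features and a classifier robust to small samples.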

9.
Genomic selection (GS) is a method for predicting breeding values of plants or animals from many molecular markers; it is commonly implemented in two stages. In plant breeding, the first stage usually computes adjusted means for genotypes, which are then used to predict genomic breeding values in the second stage. We compared two classical stage-wise approaches, which either ignore or approximate correlations among the means by a diagonal matrix, and a new method against a single-stage analysis for GS using ridge regression best linear unbiased prediction (RR-BLUP). The new stage-wise method rotates (orthogonalizes) the adjusted means from the first stage before submitting them to the second stage. This makes the errors approximately independently and identically normally distributed, a prerequisite for many procedures potentially useful for GS, such as machine learning methods (e.g., boosting) and regularized regression methods (e.g., the lasso). This is illustrated here using componentwise boosting, which minimizes squared error loss by least squares and iteratively and automatically selects the markers most predictive of genomic breeding values. Results are compared with those of RR-BLUP using fivefold cross-validation. For two unbalanced datasets, the new stage-wise approach with rotated means agreed slightly more closely with the single-stage analysis than the classical two-stage approaches based on non-rotated means, suggesting that rotation is a worthwhile pre-processing step for two-stage GS with unbalanced datasets. Moreover, the predictive accuracy of stage-wise RR-BLUP was 5.0–6.1% higher than that of componentwise boosting.
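Componentwise boosting as described here selects, at each step, the single covariate (marker) whose univariate least-squares fit to the current residuals reduces the loss most, and moves its coefficient a small step. A minimal sketch on invented data with one informative and one noise "marker" (the real procedure works on thousands of markers and rotated adjusted means):

```python
def componentwise_boost(X, y, steps=100, nu=0.1):
    """At each step, regress the residuals on the single best column
    (univariate least squares) and take a shrunken coefficient step."""
    n, p = len(X), len(X[0])
    coef = [0.0] * p
    fit = [0.0] * n
    for _ in range(steps):
        resid = [yi - fi for yi, fi in zip(y, fit)]
        best = None
        for j in range(p):
            xj = [row[j] for row in X]
            b = sum(a * r for a, r in zip(xj, resid)) / sum(a * a for a in xj)
            sse = sum((r - b * a) ** 2 for a, r in zip(xj, resid))
            if best is None or sse < best[0]:
                best = (sse, j, b)
        _, j, b = best
        coef[j] += nu * b                      # shrunken update, one column
        fit = [fi + nu * b * row[j] for fi, row in zip(fit, X)]
    return coef

# Toy data: y depends only on the first column (y = 2 * x0).
X = [[1.0, 0.5], [2.0, -0.3], [3.0, 0.2], [4.0, -0.1]]
y = [2.0, 4.0, 6.0, 8.0]
coef = componentwise_boost(X, y)
```

Because only one coefficient moves per step, covariates that are never selected keep a coefficient of exactly zero, which is the built-in variable selection the abstract refers to.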

10.
Understanding the dynamic behavior of trace elements leaching from coal mine spoil is important for predicting groundwater quality. The relationship between trace element concentrations, leaching time, and the pH of the leaching medium is studied. Column leaching tests conducted in the laboratory showed a close correlation between pH and trace element concentrations: the longer the leaching time, the higher the concentrations, and different trace elements are affected differently by the pH of the leaching medium. A numerical model for water flow and trace element transport was developed by analyzing the migration and transformation of trace elements leached from coal mine spoil. The coupled model is solved by the Eulerian–Lagrangian localized adjoint method. Numerical simulation shows that rainfall intensity determines the maximum leaching depth. At a rainfall intensity of 3.6 ml/s, the outflow concentrations indicate a breakthrough of trace elements beyond the column base, with the peak concentration at 90 cm depth, and the subsurface pollution range tends to increase with time. The model simulations agree reasonably with experimental measurements of trace element concentrations. The analysis and modeling suggest that infiltration of rainwater through mine spoil may lead to groundwater pollution, providing theoretical evidence for quantitative assessment of the effect of trace element transport on soil and water quality.

11.
Weakly supervised learning has recently emerged in classification contexts where true labels are scarce or unreliable. However, this learning setting has not yet been extensively analyzed for regression problems, which are typical in macroecology. We define a novel computational setting of structurally noisy and incomplete target labels, which arises, for example, when a multi-output regression task defines a distribution whose outputs must sum to unity. We propose an algorithmic approach to reduce noise in the target labels and improve predictions. We evaluate this setting with a case study in global vegetation modelling, building a model that predicts the distribution of vegetation cover from climatic conditions based on global remote sensing data, and compare the proposed approach to several incomplete-target baselines. The results indicate that the error in the targets can be reduced by our partial-imputation algorithm. We conclude that handling structural incompleteness in the target labels, instead of training only on complete observations, helps to better capture global associations between vegetation and climate.
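The structural sum-to-one constraint is what makes partial imputation possible: when exactly one component of a distribution-valued target is missing, the remaining probability mass determines it. The sketch below illustrates that idea only; it is not the paper's algorithm, and the handling of multiple gaps (skipping the row) is an assumption:

```python
def complete_targets(rows):
    """Fill a single missing component (None) of each distribution-valued
    target with the remaining probability mass, then renormalize."""
    out = []
    for row in rows:
        row = list(row)
        missing = [i for i, v in enumerate(row) if v is None]
        if len(missing) > 1:        # more than one gap: cannot complete
            continue
        if missing:
            known = sum(v for v in row if v is not None)
            row[missing[0]] = max(0.0, 1.0 - known)   # residual mass
        total = sum(row)
        out.append([v / total for v in row])          # enforce sum-to-one
    return out

# Toy vegetation-cover fractions (tree, shrub, bare); one entry missing.
targets = complete_targets([[0.2, None, 0.3], [0.25, 0.25, 0.5]])
```

Training on rows completed this way, rather than discarding them, is the "handling structural incompleteness" the abstract advocates.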

12.
13.
Machine learning methods without tears: a primer for ecologists
Machine learning methods, a family of statistical techniques with origins in artificial intelligence, hold great promise for advancing the understanding and prediction of ecological phenomena. These modeling techniques are flexible enough to handle complex problems with multiple interacting elements and typically outperform traditional approaches (e.g., generalized linear models), making them well suited to modeling ecological systems. Despite these advantages, a review of the literature reveals only modest use of these approaches in ecology compared to other disciplines. One potential explanation is that machine learning techniques do not fall neatly into the class of statistical modeling approaches with which most ecologists are familiar. In this paper, we introduce three machine learning approaches that can be broadly used by ecologists: classification and regression trees, artificial neural networks, and evolutionary computation. For each approach, we provide a brief background, give examples of its application in ecology, describe model development and implementation, discuss strengths and weaknesses, explore the availability of software, and provide an illustrative example. Although ecological applications of machine learning have increased, considerable skepticism remains about the role of these techniques in ecology. Our review encourages a greater understanding of machine learning approaches and promotes their future application, while providing a basis from which ecologists can make informed decisions about whether to adopt these approaches in their modeling endeavors.
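Of the three approaches the primer covers, classification trees are the easiest to demonstrate: they learn threshold rules that ecologists can read off directly. A minimal sketch with scikit-learn on an invented one-feature "habitat" dataset (feature, labels, and threshold are all illustrative assumptions):

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical data: species presence (1) vs absence (0) against a single
# environmental covariate, e.g. mean annual temperature in degrees C.
X = [[4.0], [5.0], [14.0], [16.0]]
y = [0, 0, 1, 1]

tree = DecisionTreeClassifier(max_depth=1).fit(X, y)
rules = export_text(tree)              # human-readable split rules
pred = tree.predict([[6.0], [15.0]])   # classify two new sites
```

The `export_text` dump shows the learned threshold as an explicit if/else rule, which is precisely the interpretability advantage trees have over black-box methods such as neural networks.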

14.
Geometric features of the aorta are linked to a patient's risk of rupture in the clinical decision to electively repair an ascending aortic aneurysm (AsAA). Previous approaches have focused on the relationship between intuitive geometric features (e.g., diameter and curvature) and wall stress. This work investigates the feasibility of a machine learning approach to link shape features to FEA-predicted AsAA rupture risk; it may serve as a faster surrogate for FEA, which suffers from long simulation times and numerical convergence issues. The method consists of four main steps: (1) constructing a statistical shape model (SSM) from clinical 3D CT images of AsAA patients; (2) generating a dataset of representative aneurysm shapes and obtaining FEA-predicted risk scores, defined as systolic pressure divided by rupture pressure (rupture is determined by a threshold criterion); (3) relating shape features to risk using classifiers and regressors; and (4) evaluating these relationships in cross-validation. The results show that SSM parameters are strong shape features for predicting risk scores consistent with FEA, yielding an average risk classification accuracy of 95.58% with a support vector machine and an average regression error of 0.0332 with support vector regression, while intuitive geometric features perform relatively weakly. Compared to FEA, this machine learning approach is orders of magnitude faster. In future studies, material properties and inhomogeneous thickness will be incorporated into the models and learning algorithms, which may lead to a practical system for clinical applications.
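Step (3), regressing a risk score on shape features, can be sketched with scikit-learn's support vector regression. Everything numeric below is invented: the two "SSM mode scores" per shape and the linear risk trend are placeholders for FEA-labelled training data:

```python
from sklearn.svm import SVR

# Hypothetical training set: 20 shapes described by two SSM mode scores;
# the risk score (systolic / rupture pressure) grows with the first mode.
X = [[i / 10.0, 0.5] for i in range(20)]
y = [0.2 + 0.03 * i for i in range(20)]

reg = SVR(kernel="rbf", C=10.0).fit(X, y)
risk = reg.predict([[0.5, 0.5]])[0]    # surrogate risk for a new shape
```

Once trained, evaluating the surrogate is a single kernel expansion, which is where the orders-of-magnitude speedup over a full FEA run comes from.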

15.

Background

New data mining algorithms are continually being proposed as related disciplines develop, and these algorithms differ in applicable scope and performance. Finding a suitable algorithm for a given dataset is therefore becoming important for biomedical researchers seeking to solve practical problems promptly.

Methods

In this paper, seven well-established algorithms, namely C4.5, support vector machines, AdaBoost, k-nearest neighbors, naive Bayes, random forests, and logistic regression, were selected as the research objects. The seven algorithms were applied to the 12 most popular UCI public datasets for classification, and their performances were compared through induction and analysis. To characterize the 12 datasets, we computed the sample size, the number of attributes, the number of missing values, the sample size of each class, the correlation coefficients between variables, the class entropy of the task variable, and the ratio of the sample size of the largest class to that of the smallest class.

Results

The two ensemble algorithms achieve high classification accuracy on most datasets, and random forests perform better than AdaBoost on the unbalanced multi-class dataset. Simple algorithms such as naive Bayes and logistic regression are suitable for small datasets with high correlation between the task variable and the other attributes. The k-nearest neighbors and C4.5 decision tree algorithms perform well on both binary- and multi-class datasets. The support vector machine is best suited to balanced small datasets with binary-class tasks.

Conclusions

No algorithm maintains the best performance across all datasets. The applicability of the seven data mining algorithms to datasets with different characteristics is summarized to provide a reference for biomedical researchers and beginners in different fields.
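A comparison of this kind is straightforward to reproduce with cross-validation in scikit-learn. The sketch below benchmarks three of the seven algorithms on the iris data, used here only as a stand-in for the 12 UCI datasets of the study:

```python
from sklearn.datasets import load_iris            # stand-in for UCI data
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

scores = {}
for name, clf in [("random forest", RandomForestClassifier(random_state=0)),
                  ("naive Bayes", GaussianNB()),
                  ("logistic regression", LogisticRegression(max_iter=1000))]:
    # Mean accuracy over 5 cross-validation folds.
    scores[name] = cross_val_score(clf, X, y, cv=5).mean()
```

Running the same loop over many datasets, and recording dataset characteristics alongside the fold scores, yields exactly the kind of algorithm-by-dataset applicability table the abstract describes.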

16.
Trace elements are indispensable for the effective and proper functioning of biological systems, yet recent years have revealed a conspicuous lack of knowledge about trace-element physiology. Establishing reference values is a very difficult task, requiring the consideration of, and compensation for, a number of possibly simultaneous phenomena. Peripheral blood has long been used in medical diagnosis because, among other things, it is easily accessible, and in the search for signs of mineral and trace-element deficiency or excess in disease, interest has focused mainly on blood plasma or serum. Here, the use of blood cells as a marker model is proposed. The advent of the nuclear microprobe has made it possible to determine the elemental profiles of individual cells. Techniques for blood cell separation and preparation for microprobe analysis are presented and discussed. As an example of a possible diagnostic application, a set of reference data from a control group is compared to corresponding data from patients suffering from acute myeloid leukemia.

17.
Capturing complex dependence structures between outcome variables (e.g., study endpoints) is highly relevant in contemporary biomedical data problems and medical research. Distributional copula regression provides a flexible tool for modeling the joint distribution of multiple outcome variables by disentangling the marginal response distributions and their dependence structure. In a regression setup, each parameter of the copula model, that is, the marginal distribution parameters and the copula dependence parameters, can be related to covariates via structured additive predictors. We propose a framework to fit distributional copula regression via model-based boosting, a modern estimation technique that incorporates useful features such as an intrinsic variable selection mechanism, parameter shrinkage, and the ability to fit regression models in high-dimensional settings, that is, with more covariates than observations. Model-based boosting thus not only complements existing Bayesian and maximum-likelihood estimation frameworks for this model class but also provides intrinsic mechanisms that can be helpful in many applied problems. The performance of our boosting algorithm for copula regression models with continuous margins is evaluated in simulation studies covering low- and high-dimensional settings and situations with and without dependence between the responses. Moreover, distributional copula boosting is used to jointly analyze and predict the length and weight of newborns conditional on sonographic measurements of the fetus before delivery, together with other clinical variables.
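The "disentangling" idea behind copula regression can be illustrated by sampling from a Gaussian copula: two variables are given a chosen dependence strength while each margin stays uniform, so marginal behaviour and dependence are controlled separately. The dependence value and sample size below are illustrative choices, not from the paper:

```python
import math
import random

random.seed(1)
rho = 0.7   # copula dependence parameter (illustrative value)

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Draw pairs from a Gaussian copula: correlate two standard normals,
# then push each through its own CDF so the margins become uniform(0, 1).
u, v = [], []
for _ in range(2000):
    z1 = random.gauss(0.0, 1.0)
    z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * random.gauss(0.0, 1.0)
    u.append(norm_cdf(z1))
    v.append(norm_cdf(z2))
```

In distributional copula regression, both the marginal parameters and `rho` would themselves be functions of covariates, estimated here by model-based boosting; transforming `u` and `v` through arbitrary inverse marginal CDFs would give dependent responses with any desired margins.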

18.
Variable selection and model choice in geoadditive regression models
Kneib T, Hothorn T, Tutz G (2009). Biometrics 65(2):626-634.
Model choice and variable selection are issues of major concern in practical regression analyses, arising in many biometric applications such as habitat suitability analyses, where the aim is to identify the influence of potentially many environmental conditions on certain species. We describe regression models for breeding bird communities that facilitate both model choice and variable selection via a boosting algorithm operating within a class of geoadditive regression models comprising spatial effects, nonparametric effects of continuous covariates, interaction surfaces, and varying coefficients. The major modeling components are penalized splines and their bivariate tensor product extensions. All smooth model terms are represented as the sum of a parametric component and a smooth component with one degree of freedom, to obtain a fair comparison between model terms. A generic representation of the geoadditive model allows us to devise a general boosting algorithm that automatically performs model choice and variable selection.

19.

The aim of the present work was to investigate the trace elements of Sparganii rhizoma and their correlation with its flavonoids. ICP-AES and ultraviolet–visible spectroscopy were employed to analyze the trace elements and flavonoids, respectively, with concentrations calculated from standard curves; flavonoid content was expressed as rutin equivalents. Cluster analysis was applied to evaluate the geographical features of S. rhizoma from different regions, and correlation analysis was used to relate the trace elements to the flavonoids. Fifteen trace elements were measured; K, Ca, Mg, Na, Mn, Al, Cu, and Zn were abundant in S. rhizoma. The samples from different producing regions fell into four groups, and only a weak relationship was found between trace elements and flavonoids.


20.
Data mining in bioinformatics using Weka
The Weka machine learning workbench provides a general-purpose environment for automatic classification, regression, clustering, and feature selection: common data mining problems in bioinformatics research. It contains an extensive collection of machine learning algorithms and data pre-processing methods, complemented by graphical user interfaces for data exploration and for the experimental comparison of different machine learning techniques on the same problem. Weka can process data given in the form of a single relational table. Its main objectives are to (a) assist users in extracting useful information from data and (b) enable them to easily identify a suitable algorithm for generating an accurate predictive model from it. Availability: http://www.cs.waikato.ac.nz/ml/weka.
