Similar literature
1.
Data transformations prior to analysis may be beneficial in classification tasks. In this article we investigate a set of such transformations on 2D graph-data derived from facial images and their effect on classification accuracy in a high-dimensional setting. These transformations are low-variance in the sense that each involves only a fixed small number of input features. We show that classification accuracy can be improved when penalized regression techniques are employed, as compared to a principal component analysis (PCA) pre-processing step. In our data example, classification accuracy improves from 47% to 62% when switching from PCA to penalized regression. A second goal is to visualize the resulting classifiers. We develop importance plots highlighting the influence of coordinates in the original 2D space. Features used for classification are mapped to coordinates in the original images and combined into an importance measure for each pixel. These plots assist in assessing plausibility of classifiers, interpretation of classifiers, and determination of the relative importance of different features.

2.
The use of Generalized Linear Models (GLM) in vegetation analysis has been advocated to accommodate complex species response curves. This paper investigates the potential advantages of using classification and regression trees (CART), a recursive partitioning method that is free of distributional assumptions. We used multiple logistic regression (a form of GLM) and CART to predict the distribution of three major oak species in California. We compared two types of model: polynomial logistic regression models optimized to account for non-linearity and factor interactions, and simple CART models. Each type of model was developed using learning data sets of 2085 and 410 sample cases, and assessed on test sets containing 2016 and 3691 cases respectively. The responses of the three species to environmental gradients were varied and often non-homogeneous or context dependent. We tested the methods for predictive accuracy: CART models performed significantly better than our polynomial logistic regression models in four of the six cases considered, and as well in the two remaining cases. CART also showed a superior ability to detect factor interactions. Insight gained from CART models then helped develop improved parametric models. Although the probabilistic form of logistic regression results is more adapted to test theories about species responses to environmental gradients, we found that CART models are intuitive, easy to develop and interpret, and constitute a valuable tool for modeling species distributions.

3.
Markey MK, Tourassi GD, Floyd CE. Proteomics, 2003, 3(9): 1678-1679
A classification and regression tree (CART) model was trained to classify 41 clinical specimens as disease/nondisease based on 26 variables computed from the mass-to-charge ratio (m/z) and peak heights of proteins identified by mass spectroscopy. The CART model built on all of the specimens (no cross-validation) had an error rate of 4/41 = 10%. The CART model suggests that mass spectra peaks in the 8000-10,000, 20,000-30,000, 45,000-60,000, and >125,000 m/z ranges may be valuable in distinguishing between the disease/nondisease specimens. The area under the receiver operating characteristic curve was 0.80 ± 0.07 for leave-one-out cross-validation.

4.
We evaluated the predictive power of two classification techniques, one parametric, discriminant function analysis (DFA), and the other non-parametric, classification and regression tree analysis (CART), in order to provide a non-subjective quantitative method of determining age class in Vancouver Island marmots (Marmota vancouverensis) and hoary marmots (Marmota caligata). For both techniques we used morphological measurements of known-age male and female marmots from two independent population studies to build and test predictive models of age class. Both techniques had high predictive power (69–86%) for both sexes and both species. On the cases that both methods could classify, the two performed identically, with 81% correct classification. DFA was marginally better than CART at discriminating among the older, more challenging age classes. However, in our test samples, cases with missing values in any of the discriminant variables were deleted and hence left unclassified by DFA, whereas CART used values from closely correlated variables to substitute for the missing values. Therefore, overall, CART performed better (CART 81% vs DFA 76%) because of its ability to classify incomplete cases. Correct classification rates were approximately 10% higher for hoary marmots than for Vancouver Island marmots, a result that could be attributed to different sets of morphological measurements. Zygomatic arch breadth measured in hoary marmots was the most important predictor of age class in both sexes using both classification techniques. We recommend that CART analysis be performed on data sets with incomplete records and used as a variable screening tool prior to DFA on more complete data sets.

5.
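To build a prediction model for hepatitis B virus (HBV) reactivation, a classification and regression tree (CART) feature selection method was applied to the analysis of risk factors for HBV reactivation in patients with primary liver cancer after precise radiotherapy, and HBV reactivation prediction models based on the CART and Bayes algorithms were then established. The experimental results show that the CART algorithm identified several sets of feature nodes (risk factors) with excellent classification ability. In particular, when the feature node set consisted of HBV DNA level, outer margin, total radiotherapy dose, V20 and KPS score, the classification accuracies of the CART and Bayes prediction models were 88.51% and 86.69%, respectively. The contributions to the accuracy of HBV reactivation prediction were ranked, in the order listed in the source, as KPS score, mean whole-liver dose, V20, total radiotherapy dose and V10. Including alpha-fetoprotein (AFP) as a feature increased the accuracy of HBV reactivation prediction.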

6.
Colony collapse disorder (CCD), a syndrome whose defining trait is the rapid loss of adult worker honey bees, Apis mellifera L., is thought to be responsible for a minority of the large overwintering losses experienced by U.S. beekeepers since the winter of 2006-2007. Using the same data set developed to perform a monofactorial analysis (PLoS ONE 4: e6481, 2009), we conducted a classification and regression tree (CART) analysis in an attempt to better understand the relative importance and interrelations among different risk variables in explaining CCD. Fifty-five exploratory variables were used to construct two CART models: one model with and one model without a cost of misclassifying a CCD-diagnosed colony as a non-CCD colony. The resulting model tree that allowed for misclassification had a sensitivity and specificity of 85 and 74%, respectively. Although factors measuring colony stress (e.g., adult bee physiological measures, such as fluctuating asymmetry or mass of head) were important discriminating values, six of the 19 variables having the greatest discriminatory value were pesticide levels in different hive matrices. Notably, coumaphos levels in brood (a miticide commonly used by beekeepers) had the highest discriminatory value and were highest in control (healthy) colonies. Our CART analysis provides evidence that CCD is probably the result of several factors acting in concert, making afflicted colonies more susceptible to disease. This analysis highlights several areas that warrant further attention, including the effect of sublethal pesticide exposure on pathogen prevalence and the role of variability in bee tolerance to pesticides on colony survivorship.
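As a reading aid for the sensitivity and specificity figures quoted in this abstract, the following minimal Python sketch shows how the two quantities are usually computed from true and predicted labels. The function name and the toy labels are hypothetical and are not taken from the study; 1 stands for a CCD-diagnosed colony and 0 for a control colony.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels coded 0/1.

    Sensitivity = TP / (TP + FN): share of true positives that are detected.
    Specificity = TN / (TN + FP): share of true negatives that are detected.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative case")
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical toy data: three CCD colonies and four control colonies.
truth = [1, 1, 1, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 0, 0, 1]
sens, spec = sensitivity_specificity(truth, predicted)
```

Assigning a higher cost to missing a CCD colony, as in one of the two models described above, would typically trade some specificity for higher sensitivity.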

7.
Lee BK, Lessler J, Stuart EA. PLoS ONE, 2011, 6(3): e18174
Propensity score weighting is sensitive to model misspecification and outlying weights that can unduly influence results. The authors investigated whether trimming large weights downward can improve the performance of propensity score weighting and whether the benefits of trimming differ by propensity score estimation method. In a simulation study, the authors examined the performance of weight trimming following logistic regression, classification and regression trees (CART), boosted CART, and random forests to estimate propensity score weights. Results indicate that although misspecified logistic regression propensity score models yield increased bias and standard errors, weight trimming following logistic regression can improve the accuracy and precision of final parameter estimates. In contrast, weight trimming did not improve the performance of boosted CART and random forests. The performance of boosted CART and random forests without weight trimming was similar to the best performance obtainable by weight trimmed logistic regression estimated propensity scores. While trimming may be used to optimize propensity score weights estimated using logistic regression, the optimal level of trimming is difficult to determine. These results indicate that although trimming can improve inferences in some settings, in order to consistently improve the performance of propensity score weighting, analysts should focus on the procedures leading to the generation of weights (i.e., proper specification of the propensity score model) rather than relying on ad-hoc methods such as weight trimming.
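To make "trimming large weights downward" concrete, here is a minimal Python sketch under stated assumptions: weights are inverse-probability weights for an average treatment effect, and trimming caps every weight at an empirical quantile of the weight distribution. The function names, the nearest-rank quantile rule and the example propensity scores are illustrative choices and are not taken from the study.

```python
import math


def inverse_probability_weights(propensity, treated):
    """Weights of 1/p for treated units and 1/(1 - p) for control units."""
    return [
        1.0 / p if t else 1.0 / (1.0 - p)
        for p, t in zip(propensity, treated)
    ]


def trim_weights(weights, percentile=0.99):
    """Cap each weight at the given empirical quantile (nearest-rank rule).

    Weights above the cap are set to the cap; all others are unchanged.
    """
    ordered = sorted(weights)
    rank = max(1, math.ceil(percentile * len(ordered)))
    cap = ordered[rank - 1]
    return [min(w, cap) for w in weights]


# Hypothetical propensity scores; the last treated unit has a very small
# score and therefore a very large weight.
scores = [0.5, 0.5, 0.9, 0.02]
is_treated = [True, False, False, True]
raw = inverse_probability_weights(scores, is_treated)
trimmed = trim_weights(raw, percentile=0.75)
```

The choice of percentile is exactly the tuning decision that the abstract describes as difficult to determine.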

8.
The use of penalized logistic regression for cancer classification using microarray expression data is presented. Two dimension reduction methods are each combined with penalized logistic regression so that both the classification accuracy and the computational speed are enhanced. Two other machine-learning methods, support vector machines and least-squares regression, have been chosen for comparison. It is shown that our methods achieve equal or better results. They also have the advantage that the output probability can be explicitly given and the regression coefficients are easier to interpret. Several other aspects pertinent to the application of our methods for cancer classification, such as the selection of penalty parameters and components, are also discussed.

9.
Classification and regression tree (CART) modelling was used to determine infectious hypodermal and haematopoietic necrosis virus (IHHNV) resistance and susceptibility in Penaeus stylirostris. In a previous study, eight random amplified polymorphic DNA (RAPD) markers and viral load values using real-time quantitative PCR were obtained and used as the training data set in order to create numerous regression tree models. Specifically, the genetic markers were used as categorical predictor variables and viral load values as the dependent response variable. To determine which model has the highest predictive accuracy for future samples, RAPD fingerprint data were generated from new Penaeus stylirostris IHHNV resistant and susceptible individuals and used to test the regression models. The best performing tree was a four terminal node tree with three genetic markers as significant variables. Marker-assisted breeding practices may benefit from the creation of regression tree models that apply genetic markers as predictive factors. To our knowledge this is the first study to use RAPD markers as predictors within a CART prediction model to determine viral susceptibility.

10.
Motivated by a clinical prediction problem, a simulation study was performed to compare different approaches for building risk prediction models. Robust prediction models for hospital survival in patients with acute heart failure were to be derived from three highly correlated blood parameters measured up to four times, with predictive ability having explicit priority over interpretability. Methods that relied only on the original predictors were compared with methods using an expanded predictor space including transformations and interactions. Predictors were simulated as transformations and combinations of multivariate normal variables which were fitted to the partly skewed and bimodally distributed original data in such a way that the simulated data mimicked the original covariate structure. Different penalized versions of logistic regression as well as random forests and generalized additive models were investigated using classical logistic regression as a benchmark. Their performance was assessed based on measures of predictive accuracy, model discrimination, and model calibration. Three different scenarios using different subsets of the original data with different numbers of observations and events per variable were investigated. In the investigated setting, where a risk prediction model should be based on a small set of highly correlated and interconnected predictors, Elastic Net and also Ridge logistic regression showed good performance compared to their competitors, while other methods did not lead to substantial improvements or even performed worse than standard logistic regression. Our work demonstrates how simulation studies that mimic relevant features of a specific data set can support the choice of a good modeling strategy.

11.
MOTIVATION: One important aspect of data-mining of microarray data is to discover the molecular variation among cancers. In microarray studies, the number n of samples is relatively small compared to the number p of genes per sample (usually in thousands). It is known that standard statistical methods in classification are efficient (i.e. in the present case, yield successful classifiers) particularly when n is (far) larger than p. This naturally calls for the use of a dimension reduction procedure together with the classification one. RESULTS: In this paper, the question of classification in such a high-dimensional setting is addressed. We view the classification problem as a regression one with few observations and many predictor variables. We propose a new method combining partial least squares (PLS) and Ridge penalized logistic regression. We review the existing methods based on PLS and/or penalized likelihood techniques, outline their interest in some cases and theoretically explain their sometimes poor behavior. Our procedure is compared with these other classifiers. The predictive performance of the resulting classification rule is illustrated on three data sets: Leukemia, Colon and Prostate.

12.
MOTIVATION: Logistic regression is a standard method for building prediction models for a binary outcome and has been extended for disease classification with microarray data by many authors. A feature (gene) selection step, however, must be added to penalized logistic modeling due to a large number of genes and a small number of subjects. Model selection for this two-step approach requires new statistical tools because prediction error estimation ignoring the feature selection step can be severely downward biased. Generic methods such as cross-validation and non-parametric bootstrap can be very ineffective due to the big variability in the prediction error estimate. RESULTS: We propose a parametric bootstrap model for more accurate estimation of the prediction error that is tailored to the microarray data by borrowing from the extensive research in identifying differentially expressed genes, especially the local false discovery rate. The proposed method provides guidance on the two critical issues in model selection: the number of genes to include in the model and the optimal shrinkage for the penalized logistic regression. We show that selecting more than 20 genes usually helps little in further reducing the prediction error. Application to Golub's leukemia data and our own cervical cancer data leads to highly accurate prediction models. AVAILABILITY: R library GeneLogit at http://geocities.com/jg_liao

13.
Personalized medicine aims to identify those patients who have good or poor prognosis for overall disease outcomes or therapeutic efficacy for a specific treatment. A well-established approach is to identify a set of biomarkers using statistical methods with a classification algorithm to identify patient subgroups for treatment selection. However, there are potential false positives and false negatives in classification, resulting in incorrect patient treatment assignment. In this paper, we propose a hybrid mixture model taking uncertainty in class labels into consideration, where the class labels are modeled by a Bernoulli random variable. An EM algorithm was developed to estimate the model parameters, and a parametric bootstrap method was used to test the significance of the predictive variables that were associated with subgroup memberships. Simulation experiments showed that the proposed method had, on average, higher accuracy in identifying the subpopulations than the Naïve Bayes classifier and logistic regression. A breast cancer dataset was analyzed to illustrate the proposed hybrid mixture model.

14.
Risk stratification for spontaneous bacterial peritonitis (SBP) in patients with cirrhosis and ascites helps guide care. Existing prediction models, such as the model for end-stage liver disease (MELD) score, are accurate but controversial in clinical practice. We developed and validated a practical user-friendly bedside tool for SBP risk stratification of patients with cirrhosis and ascites. Using classification and regression tree (CART) analysis, a model was developed for prediction of SBP in cirrhosis with ascites. The CART model was derived retrospectively from data collected on 676 patients admitted from January 2007 to December 2009, and was then prospectively tested in an independent group of 198 inpatients admitted between January 2010 and December 2010. The accuracy of the CART model was evaluated using the area under the receiver operating characteristic curve. The performance of the model was further validated by comparing its predictive accuracy with that of the MELD score. Furthermore, the model was used to stratify SBP risk among patients with MELD scores under 15. CART analysis identified four variables for prediction of SBP (creatinine, total bilirubin, prothrombin time and white blood cell count) and three risk groups: low (2.0%), intermediate (27.5–33.3%) and high (60.6–86.4%) risk. The accuracy of the CART model (0.881) exceeded that of MELD (0.791). Subjects in the intermediate risk and high risk groups had 22.21-fold (95% confidence interval (CI), 9.98–49.45) and 173.50-fold (95% CI, 77.68–634.33) increased risk of SBP, respectively, compared with the low risk group. Similar results were found when this risk stratification was applied to the validation cohort. Cirrhotic patients with ascites at low, intermediate, and high risk for SBP can be easily identified using the CART model, which provides clinicians with a validated, practical bedside tool for SBP risk stratification.
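The accuracy figures in this abstract (0.881 and 0.791) are areas under the receiver operating characteristic curve. As a reading aid, the following minimal Python sketch computes that area from predicted risk scores using the standard pairwise-comparison (Mann-Whitney) formulation. The function name and the toy scores are hypothetical and have no connection to the study data.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    The AUC equals the probability that a randomly chosen positive case
    receives a higher score than a randomly chosen negative case, with
    ties counted as one half.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical risk scores for two patients without and two with the outcome.
example_scores = [0.10, 0.40, 0.35, 0.80]
example_labels = [0, 0, 1, 1]
example_auc = roc_auc(example_scores, example_labels)
```

The quadratic loop is fine for illustration; for large cohorts a rank-based computation would be the usual choice.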

15.
Aim: We modelled the relationship between breeding evidence for five species of forest songbirds (ruby-crowned kinglet (Regulus calendula), Blackburnian warbler (Dendroica fusca), black-throated blue warbler (Dendroica caerulescens), bay-breasted warbler (Dendroica castanea) and Connecticut warbler (Oporornis agilis)) and a variety of macro-climate variables to examine the importance of climate as a factor determining distribution of breeding in these species and to assess the usefulness of spatial predictions generated from these models. Location: Modelling was conducted over the entire province of Ontario, Canada, an area of ≈900,000 km². Methods: Data on the distribution of breeding in the province were derived from the Breeding Bird Atlas of Ontario. We used logistic regression to model the relationship between the probability of breeding (assessed in 10 km × 10 km blocks) and estimates of a variety of climate variables at the same scale. Models were selected that had the fewest explanatory variables while at the same time having close to the best possible classification accuracy. Results: The final models for these five species had from one to six explanatory variables and an overall concordance of 70.4% to 86.3%, indicating good classification accuracy. Results from subsampling 50% of the original data ten times indicate that (1) the classification accuracy of the model for data used to generate the model is not very sensitive to the specific observations used to generate the model, (2) the classification accuracy of test data is close to the classification accuracy of the model data, and (3) the classification accuracy of the test data is not dependent on the specific observations used to generate the model. We generated a spatial prediction of the probability of occurrence of each species for Ontario using the relationships defined by the logistic regression models and using 1 km gridded estimates of the necessary climate variables. These probability maps closely matched the maps of observed evidence of breeding from the Atlas. Main conclusions: Although mechanisms controlling breeding distribution cannot be determined using this method, we can conclude that (1) macro-climate is an important factor directly and/or indirectly determining distribution of breeding in these species and (2) spatial predictions of probability of breeding are accurate enough to be useful in predicting probability of breeding in unsampled areas.

16.
The present paper demonstrates the application of CART (classification and regression trees) to the control of a mosquito vector (Culex quinquefasciatus) of bancroftian filariasis in India. A database on filariasis and the commercially available software CART (Salford Systems Inc., USA) were used in this study. Baseline entomological data related to bancroftian filariasis were used to derive prediction rules. The data were grouped into three categories: (1) mosquito abundance, (2) meteorological and (3) socio-economic details. The data were taken from a database developed for a project entitled "Database management system for the control of bancroftian filariasis", sponsored by the Ministry of Communication and Information Technology (MC&IT), Government of India, New Delhi. Predictor variables (maximum temperature, minimum temperature, rainfall, relative humidity, wind speed, house type) were ranked by CART according to their influence on the target variable (month). The approach is useful for forecasting vector (mosquito) densities in forthcoming seasons.

17.
To date, few consistent relationships between survival in rehabilitation programs and diagnostic measures recorded upon admission have been identified for harbor seal pups. Veterinary records for 718 unweaned Pacific harbor seal pups (Phoca vitulina richardii) admitted to a rehabilitation center were examined to identify clinical factors associated with preweaning survival and develop a triage tool to stratify pups according to their risk of mortality. Physical, serum chemical, and hematological variables were examined and their relationship with survival to weaning was assessed by logistic regression and classification and regression tree (CART) analysis. Survival to weaning was 85.1% and many clinical variables reflecting the pups’ age, size, growth, injuries, and blood parameters were associated with the likelihood of survival. A decision tree model, consisting of serum concentrations of phosphorus, sodium, and calcium, successfully stratified harbor seal pups into clinical subgroups according to their preweaning mortality risk. For both the derivation and validation cohorts, pups classified as “high risk” had significantly lower odds of survival, while those classified as “low risk” had significantly greater odds of survival. This simple decision tree could serve as a practical triage tool to help identify and direct care towards pups at higher risk of preweaning mortality.

18.
This article describes DP-Bind, a web server for predicting DNA-binding sites in a DNA-binding protein from its amino acid sequence. The web server implements three machine learning methods: support vector machine, kernel logistic regression and penalized logistic regression. Prediction can be performed using either the input sequence alone or an automatically generated profile of evolutionary conservation of the input sequence in the form of a PSI-BLAST position-specific scoring matrix (PSSM). PSSM-based kernel logistic regression achieves an accuracy of 77.2%, a sensitivity of 76.4% and a specificity of 76.6%. The outputs of all three individual methods are combined into a consensus prediction to help identify positions predicted with a high level of confidence. AVAILABILITY: Freely available at http://lcg.rit.albany.edu/dp-bind. SUPPLEMENTARY INFORMATION: http://lcg.rit.albany.edu/dp-bind/dpbind_supplement.html.
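The abstract does not say how the three outputs are combined, so the following Python sketch is only one plausible reading: a per-residue majority vote, with unanimous agreement flagged as the high-confidence case. The function name, the 0/1 label coding and the example predictions are all hypothetical and should not be taken as the DP-Bind implementation.

```python
def consensus_prediction(per_method_labels):
    """Combine per-residue binary predictions from several methods.

    per_method_labels: list of equal-length lists of 0/1 labels, one list
    per method (1 = predicted DNA-binding residue).
    Returns a list of (majority_label, unanimous) tuples, one per residue.
    """
    lengths = {len(labels) for labels in per_method_labels}
    if len(lengths) != 1:
        raise ValueError("all methods must label the same number of residues")
    combined = []
    for votes in zip(*per_method_labels):
        ones = sum(votes)
        majority = 1 if 2 * ones > len(votes) else 0
        unanimous = ones == 0 or ones == len(votes)
        combined.append((majority, unanimous))
    return combined


# Hypothetical outputs of three classifiers for a four-residue sequence.
svm_labels = [1, 0, 1, 0]
kernel_lr_labels = [1, 0, 0, 1]
penalized_lr_labels = [1, 0, 1, 1]
result = consensus_prediction([svm_labels, kernel_lr_labels, penalized_lr_labels])
```

With an odd number of methods the majority vote never ties; positions where `unanimous` is true correspond to the "high level of confidence" idea in this simplified reading.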

19.
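To explore high-resolution digital soil mapping methods for hilly areas at the small-watershed scale, this study developed a landscape-facies classification and combined it with Geomorphons (GM) micro-topographic feature data at different scales to form a group of categorical variables. These variables were used in high-resolution predictive mapping of soil pH, clay content and cation exchange capacity, and were combined with and compared against conventional digital elevation model (DEM) derivatives and remote sensing variables. In addition, three machine learning models (support vector machine, partial least squares regression and random forest) were compared, and the best of them was combined with residual regression kriging to build and evaluate the prediction models. The results showed that using the landscape and multi-scale micro-topographic categorical variables improved the prediction accuracy of pH, clay content and cation exchange capacity in the small-watershed hilly area by 18.8%, 8.2% and 8.7%, respectively. The landscape-facies classification map, which includes vegetation information, contributed more to the models than land-use data, and the GM micro-topographic classification map at 5 m resolution was better suited to high-accuracy predictive mapping than lower-resolution classification maps. For clay content, the composite random forest model gave the highest prediction accuracy, whereas for pH and cation exchange capacity adding residual regression kriging to the random forest model was not appropriate. Models combining the landscape and multi-scale micro-topographic categorical variables, DEM derivatives and remote sensing variables performed best, indicating that in areas of undulating terrain multiple variable sources capture more useful soil information than a single data source. As the main variables, the landscape categorical variable group composed of GM data and land-surface landscape data explained about 40% of the spatial variation of some soil properties in the small-watershed hilly area. In similar soil predictive mapping studies, multi-resolution GM and landscape classification data have potential as environmental variables for building prediction models.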
