Similar Articles
20 similar articles found.
1.
For years, tree-structured analytic methods have appealed to researchers for many reasons, among them data mining, exploratory data analysis, and the formation and testing of nonparametric and parametric models. Classification and Regression Tree (CART) analysis has offered one of the more efficient and accurate of these methods since it was first presented in a 1984 monograph by Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone (Breiman et al., 1984). Until recently, however, only a command-line interface has been available in powerful applications of the method, limiting its accessibility to users without the time and support to learn both the method and the command syntax. Now, Salford Systems, in collaboration with the authors of CART, has added a graphical user interface and several new features to the original FORTRAN source code and produced a Windows version of CART™ (v. 3.1) for the rest of us.

2.
陈磊 (Chen Lei), 刘毅慧 (Liu Yihui). 《生物信息学》 (Bioinformatics), 2011, 9(3): 229-234
Gene chip technology is an important research tool in genomics. Gene chip (microarray) data are typically high-dimensional, making dimensionality reduction a necessary step in microarray data analysis. This paper analyzes the lung cancer microarray data provided by G. J. Gordon et al. of Harvard Medical School. Feature attributes are first extracted from the microarray data using the t-test and the Wilcoxon rank-sum test. A classification tree is then grown exhaustively from the extracted features with the CART (Classification and Regression Tree) algorithm, using the Gini diversity index as the error function, and subsequently pruned to find the optimally sized tree, so as to improve its generalization to new data. Experiments show that this method achieves a classification accuracy above 96% on the lung cancer microarray data and is very stable; it also yields easily interpretable classification rules and key discriminating genes.
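A minimal sketch of the pipeline this abstract describes (univariate screening followed by a Gini-based tree with cost-complexity pruning), using scikit-learn on synthetic stand-in data; the sample sizes, the 50-gene cutoff and ccp_alpha are illustrative assumptions, not the authors' settings:

```python
import numpy as np
from scipy.stats import ttest_ind
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(180, 5000))   # 180 samples x 5000 genes (synthetic stand-in)
y = rng.integers(0, 2, size=180)   # binary tissue labels

# Step 1: univariate screening: rank genes by t-test p-value
_, p = ttest_ind(X[y == 0], X[y == 1], axis=0)
top = np.argsort(p)[:50]           # keep the 50 top-ranked genes (arbitrary cutoff)

# Step 2: grow a Gini-based CART tree, pruned via cost-complexity (ccp_alpha)
tree = DecisionTreeClassifier(criterion="gini", ccp_alpha=0.01, random_state=0)
# NOTE: screening before CV leaks information; in practice nest it inside the folds
print(cross_val_score(tree, X[:, top], y, cv=5).mean())
```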

3.
MOTIVATION: Searches for near-exact sequence matches are performed frequently in large-scale sequencing projects and in comparative genomics. The time and cost of performing these large-scale sequence-similarity searches is prohibitive using even the fastest of the extant algorithms. Faster algorithms are desired. RESULTS: We have developed an algorithm, called SST (Sequence Search Tree), that searches a database of DNA sequences for near-exact matches, in time proportional to the logarithm of the database size n. In SST, we partition each sequence into fragments of fixed length called 'windows' using multiple offsets. Each window is mapped into a vector of dimension 4^k which contains the frequency of occurrence of its component k-tuples, with k a parameter typically in the range 4-6. Then we create a tree-structured index of the windows in vector space, with tree-structured vector quantization (TSVQ). We identify the nearest neighbors of a query sequence by partitioning the query into windows and searching the tree-structured index for nearest-neighbor windows in the database. When the tree is balanced this yields O(log n) complexity for the search. This complexity was observed in our computations. SST is most effective for applications in which the target sequences show a high degree of similarity to the query sequence, such as assembling shotgun sequences or matching ESTs to genomic sequence. The algorithm is also an effective filtration method. Specifically, it can be used as a preprocessing step for other search methods to reduce the complexity of searching one large database against another. For the problem of identifying overlapping fragments in the assembly of 120,000 fragments from a 1.5 megabase genomic sequence, SST is 15 times faster than BLAST when we consider both building and searching the tree. For searching alone (i.e. after building the tree index), SST is 27 times faster than BLAST. AVAILABILITY: Request from the authors.
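A hedged sketch of SST's window embedding: each fixed-length window is mapped to its 4^k k-tuple frequency vector, and an index over those vectors answers nearest-neighbor queries. For brevity, scipy's cKDTree stands in for the paper's TSVQ index; sequences and parameters are toy values:

```python
import numpy as np
from itertools import product
from scipy.spatial import cKDTree

K, WIN, STEP = 4, 40, 20   # k-tuple size, window length, window offset
IDX = {''.join(t): i for i, t in enumerate(product('ACGT', repeat=K))}

def windows(seq, win=WIN, step=STEP):
    """Partition a sequence into fixed-length windows at multiple offsets."""
    return [seq[i:i + win] for i in range(0, len(seq) - win + 1, step)]

def embed(window):
    """Map a window to its 4**K-dimensional k-tuple frequency vector."""
    v = np.zeros(len(IDX))
    for i in range(len(window) - K + 1):
        v[IDX[window[i:i + K]]] += 1
    return v

db_seqs = ['ACGTTGCA' * 20, 'GATTACA' * 25]   # toy 'database'
vecs = np.array([embed(w) for s in db_seqs for w in windows(s)])
index = cKDTree(vecs)                         # KD-tree stands in for SST's TSVQ index
dist, nearest = index.query(embed(windows('ACGTTGCA' * 10)[0]))
```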

4.
Logistic Multiple Regression, Principal Component Regression and Classification and Regression Tree Analysis (CART), commonly used in ecological modelling using GIS, are compared with a relatively new statistical technique, Multivariate Adaptive Regression Splines (MARS), to test their accuracy, reliability, implementation within GIS and ease of use. All were applied to the same two data sets, covering a wide range of conditions common in predictive modelling, namely geographical range, scale, nature of the predictors and sampling method. We ran two series of analyses to verify whether model validation by an independent data set was required or cross-validation on a learning data set sufficed. Results show that validation by independent data sets is needed. Model accuracy was evaluated using the area under the Receiver Operating Characteristic curve (AUC). This measure was used because it summarizes performance across all possible thresholds and is independent of balance between classes. MARS and Regression Tree Analysis achieved the best prediction success, although the CART model was difficult to use for cartographic purposes due to its high model complexity.
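For reference, a minimal example of the validation step the abstract relies on: scoring a tree model by AUC on an independent held-out set. The data and tree depth are synthetic placeholders, not the study's:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

# Synthetic presence/absence data standing in for the ecological data sets
X, y = make_classification(n_samples=600, n_features=12, random_state=1)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=1)

cart = DecisionTreeClassifier(max_depth=5, random_state=1).fit(X_tr, y_tr)
# AUC summarizes performance over all thresholds, independent of class balance
print(roc_auc_score(y_val, cart.predict_proba(X_val)[:, 1]))
```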

5.
Real-time plant species recognition in an unconstrained environment is a challenging and time-consuming process. The recognition model must cope with computer vision challenges such as scale variations, illumination changes, camera viewpoint or object orientation changes, cluttered backgrounds and leaf structure (simple or compound). In this paper, bilateral convolutional neural networks (CNNs) combined with machine learning classifiers are investigated for real-time plant species recognition. The CNN models considered are MobileNet, Xception and DenseNet-121. In the bilateral CNNs (homogeneous/heterogeneous type), the models are connected using a cascade early fusion strategy. The bilateral CNN is used for feature extraction; the extracted features are then classified using different machine learning classifiers such as Linear Discriminant Analysis (LDA), multinomial Logistic Regression (MLR), Naïve Bayes (NB), k-Nearest Neighbor (k-NN), Classification and Regression Tree (CART), Random Forest (RF), Bagging Classifier (BC), Multi-Layer Perceptron (MLP) and Support Vector Machine (SVM). Experimental investigation shows that the multinomial Logistic Regression classifier performed best, irrespective of the bilateral CNN model (homogeneous: MoMoNet, XXNet, DeDeNet; heterogeneous: MoXNet, XDeNet, MoDeNet). The MoDeNet + MLR model attained state-of-the-art results (Flavia: 98.71%, Folio: 96.38%, Swedish Leaf: 99.41%, custom-created Leaf-12: 99.39%), irrespective of the dataset. The number of mispredictions per class is greatly reduced by using the MoDeNet + MLR model for real-time plant species recognition.
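A sketch of what a "MoDeNet + MLR" pipeline could look like: pooled features from MobileNet and DenseNet-121 backbones are concatenated (cascade early fusion) and fed to a multinomial logistic regression. The exact fusion point, the image size, and weights=None (to avoid downloads; the paper presumably uses pretrained backbones) are assumptions:

```python
import numpy as np
import tensorflow as tf
from sklearn.linear_model import LogisticRegression

shape = (224, 224, 3)
mo = tf.keras.applications.MobileNet(include_top=False, pooling='avg',
                                     input_shape=shape, weights=None)
de = tf.keras.applications.DenseNet121(include_top=False, pooling='avg',
                                       input_shape=shape, weights=None)

def fused_features(images):
    """Cascade early fusion: concatenate the two backbones' pooled features."""
    return np.hstack([mo.predict(images, verbose=0),
                      de.predict(images, verbose=0)])

imgs = np.random.rand(8, *shape).astype('float32')   # stand-in leaf images
labels = np.arange(8) % 2                            # stand-in species labels
mlr = LogisticRegression(max_iter=1000).fit(fused_features(imgs), labels)
```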

6.
Su X & Fan J. Biometrics, 2004, 60(1): 93-99
A method of constructing trees for correlated failure times is put forward. It adopts the backfitting idea of classification and regression trees (CART) (Breiman et al., 1984, in Classification and Regression Trees). The tree method is developed based on the maximized likelihoods associated with the gamma frailty model, and standard likelihood-related techniques are incorporated. The proposed method is assessed through simulations conducted under a variety of model configurations and illustrated using the chronic granulomatous disease (CGD) study data.

7.
Circulating tumour cells (CTC) in patients with metastatic carcinomas are associated with poor survival and can be used to guide therapy. Classification of CTC, however, remains subjective, as they are morphologically heterogeneous. We acquired digital images, using the CellSearch™ system, from the blood of 185 castration-resistant prostate cancer (CRPC) patients and 68 healthy subjects to define CTC by computer algorithms. Patient survival data were used as the training parameter for the computer to define CTC. The computer-generated CTC definition was validated on a separate CRPC dataset comprising 100 patients. The optimal definition of the computer-defined CTC (aCTC) was stricter than the manual CellSearch CTC (mCTC) definition and, as a consequence, aCTC were less frequent. The computer-generated CTC definition resulted in hazard ratios (HRs) of 2.8 for baseline and 3.9 for follow-up samples, which is comparable to the mCTC definition (baseline HR 2.9, follow-up HR 4.5). Validation resulted in HRs at baseline/follow-up of 3.9/5.4 for the computer and 4.8/5.8 for the manual definitions. In conclusion, we have defined and validated CTC by clinical outcome using a perfectly reproducing automated algorithm.

8.
Pearson's correlation coefficient (ρ) is the most commonly reported metric of the success of prediction in genomic selection (GS). However, in real breeding ρ may not be very useful for assessing the quality of the regression in the tails of the distribution, where individuals are chosen for selection. This research used 14 maize and 16 wheat data sets with different trait-environment combinations. Six different models were evaluated by means of a cross-validation scheme (50 random partitions each, with 90% of the individuals in the training set and 10% in the testing set). The predictive accuracy of these algorithms for selecting individuals belonging to the best α=10, 15, 20, 25, 30, 35, 40% of the distribution was estimated using Cohen's kappa coefficient (κ) and an ad hoc measure, which we call relative efficiency (RE), which indicates the expected genetic gain due to selection when individuals are selected based on GS exclusively. We put special emphasis on the analysis for α=15%, because it is a percentile commonly used in plant breeding programmes (for example, at CIMMYT). We also used ρ as a criterion for overall success. The algorithms used were: Bayesian LASSO (BL), Ridge Regression (RR), Reproducing Kernel Hilbert Spaces (RHKS), Random Forest Regression (RFR), and Support Vector Regression (SVR) with linear (lin) and Gaussian kernels (rbf). The performance of regression methods for selecting the best individuals was compared with that of three supervised classification algorithms: Random Forest Classification (RFC) and Support Vector Classification (SVC) with linear (lin) and Gaussian (rbf) kernels. Classification methods were evaluated using the same cross-validation scheme but with the response vector of the original training sets dichotomised using a given threshold. For α=15%, SVC-lin presented the highest κ coefficients in 13 of the 14 maize data sets, with best values ranging from 0.131 to 0.722 (statistically significant in 9 data sets) and the best RE in the same 13 data sets, with values ranging from 0.393 to 0.948 (statistically significant in 12 data sets). RR produced the best mean for both κ and RE in one data set (0.148 and 0.381, respectively). Regarding the wheat data sets, SVC-lin presented the best κ in 12 of the 16 data sets, with outcomes ranging from 0.280 to 0.580 (statistically significant in 4 data sets) and the best RE in 9 data sets, ranging from 0.484 to 0.821 (statistically significant in 5 data sets). SVC-rbf (0.235), RR (0.265) and RHKS (0.422) gave the best κ in one data set each, while RHKS and BL tied for the last one (0.234). Finally, BL presented the best RE in two data sets (0.738 and 0.750), RFR (0.636) and SVC-rbf (0.617) in one each, and RHKS in the remaining three (0.502, 0.458 and 0.586). The difference between the performance of SVC-lin and that of the rest of the models was less pronounced at higher percentiles of the distribution. The behaviour of regression and classification algorithms varied markedly when selection was done at different thresholds; that is, κ and RE for each algorithm depended strongly on the selection percentile. Based on these results, we propose classification methods as a promising alternative for GS in plant breeding.
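A small sketch of the selection-oriented metrics described here: dichotomize true and predicted values at the (1-α) quantile, score agreement with Cohen's κ, and compute one plausible reading of relative efficiency (the paper's exact RE formula may differ); data are synthetic:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(2)
y_true = rng.normal(size=300)                      # observed phenotypes (synthetic)
y_pred = y_true + rng.normal(scale=0.8, size=300)  # GS predictions (synthetic)

alpha = 0.15                                       # select the best 15%
sel_true = y_true >= np.quantile(y_true, 1 - alpha)
sel_pred = y_pred >= np.quantile(y_pred, 1 - alpha)

kappa = cohen_kappa_score(sel_true, sel_pred)
# One plausible reading of RE: gain realized by GS-based selection relative to
# the gain from selecting on the true values (an assumption, not the paper's formula)
rel_eff = y_true[sel_pred].mean() / y_true[sel_true].mean()
```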

9.
10.
An important means of preventing damage from invasive alien species is to keep potentially invasive species out of regions suitable for their survival. Using 1,864 georeferenced occurrence records of the zebra mussel, an invasive alien species in the United States, together with 34 environmental variables from Daymet, an open geographic database, as the main information sources, potential-habitat prediction models for the conterminous United States were built with four approaches: logistic regression (LR), classification and regression trees (CART), the rule-based genetic algorithm GARP, and maximum entropy (Maxent). Model accuracy was assessed using the area under the receiver operating characteristic curve (AUC), Pearson's correlation coefficient, and the Kappa statistic, and on this basis the spatial distribution of the zebra mussel and its environmental drivers were analyzed. The results show that all four niche models achieved good-to-excellent predictive accuracy on the three evaluation metrics, with Maxent showing superior performance in simulating the species' realized habitat, screening the main ecological factors, and quantitatively describing their effects on habitat suitability. Distance to water, elevation, precipitation frequency and solar radiation were the main environmental factors influencing the species' spatial distribution. The methods proposed here offer a strong reference for habitat prediction of invasive alien species in China, and the results provide some guidance for predicting and controlling Mytilopsis sallei (沙筛贝), an invasive marine species in China.

11.
MOTIVATION: mRNA expression data obtained from high-throughput DNA microarrays exhibit strong departures from homogeneity of variances. Often a complex relationship between mean expression value and variance is seen. Variance stabilization of such data is crucial for many types of statistical analyses, while regularization of variances (pooling of information) can greatly improve the overall accuracy of test statistics. RESULTS: A Classification and Regression Tree (CART) procedure is introduced for variance stabilization as well as regularization. The CART procedure adaptively clusters genes by variances. Using both local and cluster-wide information leads to improved estimation of population variances, which improves test statistics, while cluster-wide information alone allows for variance stabilization of the data. AVAILABILITY: Sufficient details of our CART procedure are given so that interested readers can program the method themselves. The algorithm is also accessible within the Java software package BAMarray™, which is freely available to non-commercial users at www.bamarray.com. CONTACT: hemant.ishwaran@gmail.com
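The published procedure's details differ, but the following sketch conveys the idea of tree-based variance pooling: a regression tree groups genes with similar variance behaviour, and each leaf's fitted value serves as a regularized variance estimate for its genes (synthetic data, assumed tree size):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
# Synthetic expression matrix: genes x arrays, with mean-dependent variance
mu = rng.uniform(2, 12, size=(5000, 1))
X = mu + rng.normal(size=(5000, 10)) * (0.2 * mu)

means = X.mean(axis=1)
logvar = np.log(X.var(axis=1, ddof=1))

# The tree adaptively groups genes with similar variances; each leaf's fitted
# value acts as a pooled (regularized) variance estimate for its member genes
tree = DecisionTreeRegressor(max_leaf_nodes=20, random_state=3)
tree.fit(means.reshape(-1, 1), logvar)
pooled_var = np.exp(tree.predict(means.reshape(-1, 1)))
```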

12.
Relative survival is the ratio of the overall survival of a group of patients to the expected survival of a demographically similar group. It is commonly used in disease registries to estimate the effect of a particular disease when the true cause of death is not reliably known. Regression models for relative survival have been described, and we extend these models to allow for clustered responses by embedding them in the class of generalized linear mixed models (GLMMs). The method is motivated and demonstrated by a data set from the HALLUCA study, an epidemiological study that investigated the provision of medical care to lung cancer patients in the region of Halle in the eastern part of Germany.

13.
Forest management practices directly influence microhabitat characteristics important to the survival of fungi. Because fungal populations perform key ecological processes, there is interest in forestry practices that minimize deleterious effects on their habitats. We investigated the effects of modified uneven-aged forest management practices, including a technique called Structural Complexity Enhancement (SCE), on fungal sporocarp diversity in northern hardwood ecosystems. SCE is designed to accelerate late-successional stand development; it was compared against two conventional selection systems (single tree and group) and unmanipulated controls. These were applied in a randomized block design to a mature, multi-aged forest in Vermont, USA. Eight years after treatment, fungal species richness was significantly greater in SCE plots than in conventional selection harvests and controls (p < 0.001). Seven forest structure variables were tested for their influence on fungal species richness using a Classification and Regression Tree. The results suggested that dead tree and downed log recruitment, as well as maintenance of high levels of aboveground biomass, under SCE had a particularly strong effect on fungal diversity. Our findings show it is possible to increase fungal diversity using forestry practices that enhance stand structural complexity and late-successional forest characteristics.

14.
Background: Brazil has consolidated a relevant position in the world market, being the largest exporter and second-largest producer of beef. Genetics, feeding system, geographic origin and climate influence the multielement profile of beef. The feasibility of combining classification algorithms with major and trace elements was evaluated as a tool for authentication of beef cuts. Methods: Animals of Angus, Nelore and Wagyu crossbreeds, raised in a vertically integrated system, were sampled at the slaughterhouse for chuck steak, rump cap and sirloin steak. Supervised learning algorithms, i.e., Classification and Regression Tree (CART), Multilayer Perceptron (MLP), Naïve Bayes (NB), Random Forest (RF) and Sequential Minimal Optimization (SMO), were used to build classification models based on the multielement profile of beef determined by neutron activation analysis. Results: Br, Co, Cs, Fe, K, Na, Rb, Se and Zn were determined in the beef samples. The classification accuracy values obtained for the beef cuts were 96% (MLP), 95% (SMO), 91% (RF), 86% (NB) and 70% (CART). Conclusion: The Multilayer Perceptron algorithm provided the best classification performance for authentication of beef cuts on the basis of major and trace element mass fractions.
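A hedged sketch of the model-comparison setup on synthetic stand-in data (nine element features, three cuts); scikit-learn's SVC substitutes for Weka's SMO, and all hyperparameters are defaults rather than the study's:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# 9 element mass fractions x 3 beef cuts, synthetic stand-in data
X, y = make_classification(n_samples=150, n_features=9, n_informative=6,
                           n_classes=3, random_state=4)
models = {
    'CART': DecisionTreeClassifier(random_state=4),
    'MLP': MLPClassifier(max_iter=2000, random_state=4),
    'NB': GaussianNB(),
    'RF': RandomForestClassifier(random_state=4),
    'SMO': SVC(),   # sklearn's SVC as a stand-in for Weka's SMO
}
for name, model in models.items():
    acc = cross_val_score(model, X, y, cv=5).mean()
    print(f'{name}: {acc:.2f}')
```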

15.
The reverse engineering of gene regulatory networks from gene expression profile data has become crucial for gaining novel biological knowledge. Large amounts of data that need to be analyzed are currently being produced due to advances in microarray technologies. Using current reverse engineering algorithms to analyze large data sets can be very computationally intensive. These emerging computational requirements can be met using parallel computing techniques. It has been shown that the Network Identification by multiple Regression (NIR) algorithm performs better than other ready-to-use reverse engineering software. However, it cannot be used with large networks with thousands of nodes, as is the case in biological networks, due to its high time and space complexity. In this work we overcome this limitation by designing and developing a parallel version of the NIR algorithm. The new implementation of the algorithm reaches very good accuracy even for large gene networks, improving our understanding of gene regulatory networks, which is crucial for a wide range of biomedical applications.
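NIR proper fits a constrained multiple regression per gene; the sketch below shows only the structure that makes it parallelizable, with each gene's (here unconstrained) least-squares subproblem dispatched to a worker process. Matrix sizes are toy values:

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def fit_gene(args):
    """Regress gene i's expression on all other genes (one NIR-style subproblem)."""
    i, E = args
    X = np.delete(E, i, axis=0).T          # experiments x (genes - 1)
    coef, *_ = np.linalg.lstsq(X, E[i], rcond=None)
    return i, coef

if __name__ == '__main__':
    E = np.random.rand(200, 50)            # 200 genes x 50 perturbation experiments
    # Each per-gene regression is independent, so the loop parallelizes trivially
    with ProcessPoolExecutor() as pool:
        edges = dict(pool.map(fit_gene, ((i, E) for i in range(len(E)))))
```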

16.
Dust source susceptibility modeling and mapping is the first step in managing the impacts of dust on environmental systems and human health. In this study, satellite products and terrestrial data were used to detect dust sources in central Iran using remote sensing and machine learning techniques. After recording 890 sites as dust sources based on field surveys and determining 14 independent variables affecting wind erosion and dust sources, dust source distribution maps were prepared using GLM (Generalized Linear Model), CTA (Classification Tree Analysis), ANN (Artificial Neural Network), MARS (Multivariate Adaptive Regression Spline), RF (Random Forest), Maxent (Maximum Entropy), and ensemble algorithms. Specifically, 70% of dust source sites were used as training data and 30% for algorithm performance evaluation with statistical methods such as partial ROC (Receiver Operating Characteristic), sensitivity, specificity, and TSS (True Skill Statistic). According to the results, after the ensemble model, RF had the highest and GLM the lowest performance in dust source detection. According to the ensemble model, precipitation with a mean weight of 0.3, followed by evaporation, temperature, and soil moisture with mean weights of 0.173, 0.16, and 0.153, respectively, were the main driving forces in dust susceptibility mapping. This model classified 40.92% of the study area as low potential, 15.37% as medium potential, 25.77% as high potential, and 17.94% as very high potential. The research findings indicate that the integration of remote sensing and prediction algorithms can be a useful means of predicting the spatial distribution of dust sources in arid and semi-arid regions.

17.
Question: How does one best choose native vegetation types and site them in reclamation of disturbed sites ranging from cropland to strip mines? Application: Worldwide, demonstrated in SE Montana. Methods: We assumed that pre-disturbance native communities are the best targets for revegetation, and that the environmental facet each occupies naturally provides its optimal habitat. Given this assumption, we used pre-strip-mine data (800 points from an 88 km² site) to demonstrate statistical methods for identifying native communities, describing them, and determining their environments. Results and conclusions: Classification and pruning analysis provided an objective method for choosing the number of target community types to be used in reclamation. The composition of eight target types, identified with these analyses, was described with a relevé table to provide a species list and target cover levels and to support the choice of species to be seeded. As a basis for siting communities, we modeled community presence as a function of topography, slope/aspect, and substrate. Logistic GLMs identified the optimal environment for each community. Classification and Regression Tree (CART) analysis identified the most probable community in each environmental facet. Topography and slope were generally the best predictors in these models. Because our analyses relate native vegetation to undisturbed environments, our results may apply best to sites with minimal substrate disturbance (i.e. better to abandoned cropland than to strip-mined sites).

18.
In this paper, the viability of using the Fuzzy-Rule-Based Regression Modeling (FRM) algorithm for tool performance and degradation detection is investigated. The FRM is developed as a multi-layered fuzzy-rule-based hybrid system with Multiple Regression Models (MRM) embedded in a fuzzy logic inference engine that employs Self-Organizing Maps (SOM) for clustering. The FRM converts a complex nonlinear problem into a simplified linear format in order to increase prediction accuracy and rate of convergence. The efficacy of the proposed FRM is tested through a case study: predicting the remaining useful life of a ball nose milling cutter during dry machining of hardened tool steel with a hardness of 52-54 HRC. A comparative study is further made between four predictive models using the same set of experimental data. It is shown that the FRM is superior to conventional MRM, Back Propagation Neural Networks (BPNN) and Radial Basis Function Networks (RBFN) in terms of prediction accuracy and learning speed.
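A loose sketch of the SOM-plus-local-regression idea underlying FRM: cluster the input space with a self-organizing map, then fit one linear model per occupied node and route queries to their winning node's model. The minisom dependency and all parameters are assumptions; the actual FRM adds a fuzzy inference layer not shown here:

```python
import numpy as np
from minisom import MiniSom                     # third-party SOM implementation
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(5)
X = rng.uniform(size=(300, 3))                  # e.g. force/vibration features
y = X @ np.array([2.0, -1.0, 0.5]) + np.sin(3 * X[:, 0]) + rng.normal(0, 0.05, 300)

som = MiniSom(3, 3, input_len=3, sigma=1.0, learning_rate=0.5, random_seed=5)
som.train_random(X, 500)
winners = np.array([som.winner(x) for x in X])  # SOM node (cluster) per sample

# One local linear model per occupied SOM node: the 'multiple regression models'
models = {}
for node in {tuple(w) for w in winners}:
    mask = (winners == node).all(axis=1)
    models[node] = LinearRegression().fit(X[mask], y[mask])

def predict(x):
    """Route a query to its winning node's local model (no fallback for unseen nodes)."""
    return models[som.winner(x)].predict(x.reshape(1, -1))[0]
```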

19.
Although plant invasions are often associated with disturbance, localized disturbances can promote invasion either by: (i) creating sites where individuals establish; or (ii) enabling an invader to colonize the entire stand. The former is expected when both establishment and survival to reproductive age require disturbed conditions, whereas the latter should occur in systems where either establishment or survival is limited to disturbed sites. We investigated the role of localized disturbance, specifically treefalls, in the invasion of the Asian Rubus phoenicolasius in a deciduous forest in Maryland, USA. We investigated the density and demography of R. phoenicolasius in treefall gaps of various sizes, but identical age, relative to non-gap areas, using Classification and Regression Tree (CART) analyses to identify the most important predictors. To explore how the demography of established individuals responds to disturbed versus undisturbed conditions, we carried out a garden experiment with three different levels of shade (5, 12 and 22% of full sun). We found vegetative and sexual reproduction, and seedling establishment, to be limited to large gaps in an old stand, but not in a stand at an earlier stage of succession. However, in the garden experiment, established plants were able to survive and grow under all shade treatments. These findings indicate that R. phoenicolasius requires disturbances such as treefalls to establish in forests, but established plants will survive canopy closure, leading to stand-wide invasion. Managers should be able to prevent invasion, however, by inspecting large gaps for new recruits every 3 years.

20.
Summary: Accurately assessing a patient's risk of a given event is essential in making informed treatment decisions. One approach is to stratify patients into two or more distinct risk groups with respect to a specific outcome using both clinical and demographic variables. Outcomes may be categorical or continuous in nature; important examples in cancer studies include level of toxicity and time to recurrence. Recursive partitioning methods are ideal for building such risk groups. Two such methods are Classification and Regression Trees (CART) and a more recent competitor known as the partitioning Deletion/Substitution/Addition (partDSA) algorithm, both of which also utilize loss functions (e.g., squared error for a continuous outcome) as the basis for building, selecting, and assessing predictors, but which differ in the manner in which regression trees are constructed. Recently, we have shown that partDSA often outperforms CART in so-called "full data" settings (e.g., uncensored outcomes). However, when confronted with censored outcome data, the loss functions used by both procedures must be modified. There have been several attempts to adapt CART for right-censored data. This article describes two such extensions for partDSA that make use of observed-data loss functions constructed using inverse probability of censoring weights. Such loss functions are consistent estimates of their uncensored counterparts provided that the corresponding censoring model is correctly specified. The relative performance of these new methods is evaluated via simulation studies and illustrated through an analysis of clinical trial data on brain cancer patients. The implementation of partDSA for uncensored and right-censored outcomes is publicly available in the R package partDSA.
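A minimal sketch of the IPCW construction the abstract refers to: estimate the censoring survival function G(t) by Kaplan-Meier (flipping the event indicator), weight uncensored observations by 1/G(T_i), and form a weighted squared-error loss. The lifelines dependency and the placeholder predictions are assumptions (the paper's partDSA implementation is in R):

```python
import numpy as np
from lifelines import KaplanMeierFitter       # assumed dependency for KM estimates

rng = np.random.default_rng(6)
t_event = rng.exponential(10, 200)            # latent event times
t_cens = rng.exponential(15, 200)             # latent censoring times
time = np.minimum(t_event, t_cens)            # observed follow-up
delta = (t_event <= t_cens).astype(int)       # 1 if the event was observed

# Kaplan-Meier estimate of the censoring survival G(t): flip the event indicator
G = KaplanMeierFitter().fit(time, 1 - delta).survival_function_at_times(time)
G = np.clip(G.to_numpy(), 1e-8, None)

# IPCW squared-error loss: uncensored subjects get weight 1/G(T_i), censored get 0
weights = np.where(delta == 1, 1.0 / G, 0.0)
pred = np.full_like(time, time[delta == 1].mean())  # placeholder model predictions
ipcw_loss = np.mean(weights * (time - pred) ** 2)
```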

