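Basic ideas of the random forest algorithm and its application in ecology: a case study of modelling the distribution of Yunnan pine (Pinus yunnanensis)
Cite this article: ZHANG Lei, WANG Linlin, ZHANG Xudong, LIU Shirong, SUN Pengsen, WANG Tongli. Basic ideas of the random forest algorithm and its application in ecology: a case study of modelling the distribution of Yunnan pine (Pinus yunnanensis)[J]. Acta Ecologica Sinica, 2014, 34(3): 650-659.
Authors: ZHANG Lei  WANG Linlin  ZHANG Xudong  LIU Shirong  SUN Pengsen  WANG Tongli
Affiliations: Key Laboratory of Forest Silviculture of the State Forestry Administration, Research Institute of Forestry, Chinese Academy of Forestry, Beijing 100091, China; College of Forestry, Beijing Forestry University, Beijing 100083, China; Key Laboratory of Forest Silviculture of the State Forestry Administration, Research Institute of Forestry, Chinese Academy of Forestry, Beijing 100091, China; Key Laboratory of Forest Ecology and Environment of State Forestry Administration, Institute of Forest Ecology, Environment and Protection, Chinese Academy of Forestry, Beijing 100091, China; Key Laboratory of Forest Ecology and Environment of State Forestry Administration, Institute of Forest Ecology, Environment and Protection, Chinese Academy of Forestry, Beijing 100091, China; Department of Forest Sciences, University of British Columbia, 3041-2424 Main Mall, Vancouver B.C. Canada V6T 1Z4
Funding: National Natural Science Foundation of China (41301056, 31290223); Fundamental Research Funds for Central Public-Interest Research Institutes (RIF2012-04); Special Research Program for Public-Welfare Forestry (201104006, 200804001); National Science and Technology Support Program of the 12th Five-Year Plan (2011BAD38B04)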
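Abstract: Ecologists are generally interested in explaining ecological relationships, describing patterns and processes, and making spatial or temporal predictions. These tasks can be accomplished by modelling the relationship between an output (the response) and a set of features (the explanatory variables). Modelling ecological data is challenging, however, because response and predictor variables may be either continuous or discrete. The ecological relationships to be explained are usually nonlinear, and the explanatory variables interact in complex ways. Missing values in both response and explanatory variables are not uncommon, and outliers also occur frequently in ecological data. In addition, ecologists usually want models that are both easy to build and easy to interpret. A variety of statistical methods are typically used to address the particular ecological problems that arise in different settings, including (multiple) logistic regression, linear models, survival models, and analysis of variance. Random forest is an effective method that can handle all of these problems. It can be used for classification, clustering, regression and survival analysis, for assessing variable importance, for detecting outliers, and for imputing missing data. Given these algorithmic strengths, this paper summarizes applications of random forest in ecology, outlines the modelling process, and illustrates its main functions and features with a case study of modelling the distribution of Yunnan pine. Introducing the general terminology, concepts and modelling ideas of random forest should help readers grasp the essentials of applying the method, and random forest can be expected to see wider application and further development in ecological research.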

Keywords: random forest  classification and regression tree  variable importance  multi-dimensional data  species distribution modelling
Received: 2013-06-03
Revised: 2013-09-22

The basic principle of random forest and its applications in ecology: a case study of Pinus yunnanensis
ZHANG Lei, WANG Linlin, ZHANG Xudong, LIU Shirong, SUN Pengsen, WANG Tongli. The basic principle of random forest and its applications in ecology: a case study of Pinus yunnanensis[J]. Acta Ecologica Sinica, 2014, 34(3): 650-659.
Authors: ZHANG Lei  WANG Linlin  ZHANG Xudong  LIU Shirong  SUN Pengsen  WANG Tongli
Affiliation: Key Laboratory of Forest Silviculture of the State Forestry Administration, Research Institute of Forestry, Chinese Academy of Forestry, Beijing 100091, China; College of Forestry, Beijing Forestry University, Beijing 100083, China; Key Laboratory of Forest Silviculture of the State Forestry Administration, Research Institute of Forestry, Chinese Academy of Forestry, Beijing 100091, China; Key Laboratory of Forest Ecology and Environment of State Forestry Administration, Institute of Forest Ecology, Environment and Protection, Chinese Academy of Forestry, Beijing 100091, China; Key Laboratory of Forest Ecology and Environment of State Forestry Administration, Institute of Forest Ecology, Environment and Protection, Chinese Academy of Forestry, Beijing 100091, China; Department of Forest Sciences, University of British Columbia, 3041-2424 Main Mall, Vancouver B.C. Canada V6T 1Z4
Abstract: Ecological data are often complex. The explanatory and response variables may be categorical or numerical. The ecological relationships to be described are often nonlinear and involve high-order interactions between explanatory variables. Missing values in both response and predictor variables are very common, and outliers almost always exist. Random forest (RF), a novel machine learning technique, is well suited to the analysis of complex ecological data. RF is an ensemble-learning approach based on regression or classification trees. Instead of building one classification tree (classifier), the RF algorithm builds multiple classifiers using randomly selected subsets of the observations and random subsets of the predictor variables. The predictions from the ensemble of trees are then averaged in the case of regression trees, or tallied using a voting system in the case of classification trees. RF efficiently supports flexible modelling strategies and can detect and exploit complex relationships among the variables. RF is highly competitive in accuracy among current algorithms and is resistant to overfitting. It also generates an internal unbiased estimate of the generalization error as forest building progresses. Potential applications of RF in ecology include classification and regression analysis, survival analysis, estimation of variable importance, and computation of data proximities. Proximities can be used for clustering, outlier detection, multi-dimensional scaling, and unsupervised classification. RF can impute missing values and maintain high accuracy even when a large proportion of the data are missing. RF can handle thousands of input variables without variable exclusion, and it runs efficiently on large databases. RF can also handle a spectrum of response types, including categorical, numeric, ratings, and survival data.
Another advantage of RF is that it requires only two user-defined parameters: the number of trees and the number of randomly selected predictor variables used to split the nodes. These two parameters should be optimized to improve predictive accuracy. In recent years, RF has been widely used by ecologists to model complex ecological relationships because it is easy to implement and easy to interpret. To understand and use RF, further information about how it is computed is useful. Here, we summarize the basic principle of RF and show how RF handles complex data by modelling the geographical distribution of Yunnan pine (Pinus yunnanensis) in China. RF is a robust and widely used technique in species distribution modelling (SDM), since it meets the basic needs of SDM: simulating species distributions and identifying their main drivers. In this work, RF showed high predictive performance in simulating the distribution of Yunnan pine, which was consistent with the multi-dimensional scaling plot showing that presences could be separated from absences. We also estimated the relative importance of the predictor variables and produced partial dependence plots for selected predictor variables for the RF predictions of the presence of Yunnan pine. The main aim of the article is to familiarize the reader with the general concepts, terminology and basic principle behind RF. We believe RF will find wider application and further development in ecology.
Keywords: random forest  classification and regression tree  variable importance  multi-dimensional scaling  species distribution modelling
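
The procedure described in the abstract (random subsets of the observations, random subsets of the predictors, majority voting, and an internal error estimate from the observations left out of each tree) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation. Each tree is reduced to a single split (a stump), so drawing the predictor subset once per tree is equivalent to drawing it at the node. The data are synthetic presence/absence records with three hypothetical predictors, and the names fit_forest, n_trees and mtry are chosen here for illustration only.

```python
import random
from collections import Counter

def majority(labels):
    """Most frequent label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]

def fit_stump(rows, labels, features):
    """One-split tree: choose the (feature, threshold) with the fewest errors."""
    best = None
    for f in features:
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            l_lab, r_lab = majority(left), majority(right)
            errors = sum(y != l_lab for y in left) + sum(y != r_lab for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, t, l_lab, r_lab)
    if best is None:  # no usable split: always predict the majority class
        maj = majority(labels)
        return (features[0], float("inf"), maj, maj)
    return best[1:]

def predict_stump(stump, row):
    f, t, l_lab, r_lab = stump
    return l_lab if row[f] <= t else r_lab

def fit_forest(rows, labels, n_trees, mtry, seed=0):
    """Grow n_trees stumps, each on a bootstrap sample and mtry random predictors."""
    rng = random.Random(seed)
    n, p = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # sample rows with replacement
        feats = rng.sample(range(p), mtry)           # random predictor subset
        stump = fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feats)
        forest.append((stump, set(idx)))             # keep the in-bag row indices
    return forest

def predict_forest(forest, row):
    """Classification by majority vote over all trees."""
    return majority([predict_stump(s, row) for s, _ in forest])

def oob_error(forest, rows, labels):
    """Out-of-bag error: each row is voted on only by trees that never saw it."""
    wrong = counted = 0
    for i, (row, y) in enumerate(zip(rows, labels)):
        votes = [predict_stump(s, row) for s, inbag in forest if i not in inbag]
        if votes:
            counted += 1
            wrong += majority(votes) != y
    return wrong / counted if counted else float("nan")

# Synthetic toy data (hypothetical predictors: temperature, precipitation, elevation).
# Presence (1) depends on temperature only; the other two predictors are noise.
data_rng = random.Random(42)
rows = [(data_rng.uniform(0, 30), data_rng.uniform(0, 2000), data_rng.uniform(0, 4000))
        for _ in range(150)]
labels = [1 if r[0] > 15 else 0 for r in rows]

forest = fit_forest(rows, labels, n_trees=101, mtry=2)
train_acc = sum(predict_forest(forest, r) == y for r, y in zip(rows, labels)) / len(rows)
print("training accuracy:", train_acc)
print("out-of-bag error:", oob_error(forest, rows, labels))
```

Here n_trees and mtry stand for the two user-defined parameters mentioned in the abstract. In practice both would be tuned, and the trees would be grown well beyond a single split.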
This article is indexed in CNKI and other databases.