The differences between the ID3, CART, and C4.5 decision trees

History: CART was proposed in 1984, ID3 in 1986, and C4.5 in 1993.

In theory: C4.5 is an optimized version of ID3. The main change is how candidate splits at a node are scored, which fixes ID3's tendency to favor attributes with many distinct values when branching. ID3 splits on information gain:
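The formula at this point did not survive in the text. As a stand-in, here is the usual textbook definition of information gain; the symbols $D$, $A$, $p_k$, and $D_v$ are notation chosen here and may differ from the author's.

```latex
H(D) = -\sum_{k=1}^{K} p_k \log_2 p_k,
\qquad
\mathrm{Gain}(D, A) = H(D) - \sum_{v \in \mathrm{Values}(A)} \frac{|D_v|}{|D|}\, H(D_v)
```

Here $p_k$ is the fraction of samples in $D$ that belong to class $k$, and $D_v$ is the subset of $D$ in which attribute $A$ takes the value $v$.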

CART generally splits on the Gini index:
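The formula is missing here as well. The standard definition of the Gini index for a node, and of the weighted Gini index of a binary split, is sketched below; the symbols $D$, $D_1$, $D_2$, and $p_k$ are notation assumed for this note.

```latex
\mathrm{Gini}(D) = 1 - \sum_{k=1}^{K} p_k^2,
\qquad
\mathrm{Gini}(D, A) = \frac{|D_1|}{|D|}\,\mathrm{Gini}(D_1) + \frac{|D_2|}{|D|}\,\mathrm{Gini}(D_2)
```

Here $p_k$ is the fraction of samples in $D$ that belong to class $k$, and $D_1$, $D_2$ are the two child subsets produced by a binary split on attribute $A$. The split with the smallest weighted value is preferred.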

C4.5 generally splits on the information gain ratio:
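The gain ratio is usually defined as the information gain divided by the "split information", that is, the entropy of the attribute's own value distribution. The sketch below computes the three criteria on a tiny made-up data set to show why the ratio counters the bias toward many-valued attributes. All names and data (`row_id`, `weather`, `labels`) are hypothetical and only serve the illustration.

```python
from collections import Counter
from math import log2


def entropy(items):
    """Shannon entropy (base 2) of a list of discrete values."""
    n = len(items)
    return -sum((c / n) * log2(c / n) for c in Counter(items).values())


def gini(labels):
    """Gini index of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def partition(values, labels):
    """Group the labels by the attribute value of each sample."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    return list(groups.values())


def information_gain(values, labels):
    """ID3 criterion: entropy before the split minus weighted entropy after."""
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in partition(values, labels))
    return entropy(labels) - remainder


def gain_ratio(values, labels):
    """C4.5 criterion: information gain divided by split information."""
    split_info = entropy(values)  # entropy of the attribute's value distribution
    if split_info == 0:
        return 0.0  # attribute has a single value, so it cannot split anything
    return information_gain(values, labels) / split_info


# Hypothetical toy data: row_id is unique per sample, weather has two values.
labels = ["yes", "yes", "no", "no", "yes", "no"]
row_id = [1, 2, 3, 4, 5, 6]
weather = ["sunny", "sunny", "rain", "rain", "sunny", "sunny"]

for name, column in [("row_id", row_id), ("weather", weather)]:
    print(name, information_gain(column, labels), gain_ratio(column, labels))
print("gini of all labels:", gini(labels))
```

On this toy data, information gain ranks the useless `row_id` column above `weather`, because every one of its branches is trivially pure. The gain ratio reverses that ranking, since `row_id` pays a large split-information penalty.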

In practice: The main difference between CART and C4.5 lies in the kind of target they can predict. CART can do regression as well as classification, while C4.5 can only do classification. A C4.5 node can have more than two children, whereas a CART tree is made up entirely of binary nodes. Building on this, CART became the basis of the "tree ensemble" random forest, and regression trees became the basis of the "tree ensemble" GBDT.

Sample data differences: ID3 can only handle categorical variables, while C4.5 and CART can handle both continuous and categorical independent variables. ID3 is sensitive to missing values, while C4.5 and CART can handle missing values in several ways. Considering sample size alone, C4.5 is the suggested choice for small samples and CART for large samples. C4.5 has to sort the data set multiple times during training, which is time-consuming, while CART is by nature a large-sample statistical method and has a larger generalization error on small samples.
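To make the handling of continuous variables and the sorting cost concrete, here is a minimal sketch of a common textbook approach: sort the samples by the feature, then try a threshold between each pair of neighboring distinct values. It scores candidates with the weighted Gini index, as CART would; C4.5 would score them with the gain ratio, but the sort-and-scan pattern is the same. Placing the threshold at the midpoint is a choice made here, and the function name and data are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini index of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_threshold(xs, ys):
    """Return (threshold, weighted_gini) for the best split of the form x <= t."""
    pairs = sorted(zip(xs, ys))  # this sort is repeated for every feature and node
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # equal values cannot be separated by a threshold
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if score < best[1]:
            best = ((lo + hi) / 2, score)  # midpoint between the two neighbors
    return best


# Hypothetical toy data: age is continuous, buys is the class label.
ages = [22, 25, 30, 35, 41, 52]
buys = ["no", "no", "no", "yes", "yes", "yes"]
threshold, score = best_threshold(ages, buys)
print(threshold, score)
```

This naive version recomputes the class counts for every candidate, which keeps the code short. A real implementation would update the counts incrementally while scanning the sorted list.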

Target (dependent) variable differences: ID3 and C4.5 can only do classification, while CART (Classification and Regression Tree) can do both classification (a 0/1 label) and regression (a value between 0 and 1). ID3 and C4.5 nodes can produce multiway splits (low, medium, high), while CART nodes are always binary (low, not low).
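The contrast between multiway and binary splits can be sketched as follows. For a categorical variable, an ID3 or C4.5 style split creates one child per distinct value, while a CART style split has to choose one of the ways to divide the values into two groups. The function names and the `income` column are hypothetical.

```python
from itertools import combinations


def multiway_split(values):
    """One child per distinct value (ID3 / C4.5 style)."""
    return [{v} for v in sorted(set(values))]


def binary_splits(values):
    """All ways to divide the distinct values into two non-empty groups (CART style)."""
    distinct = sorted(set(values))
    first, rest = distinct[0], distinct[1:]
    splits = []
    # Keep `first` on the left side so that each partition is listed only once.
    for k in range(len(rest)):  # k < len(rest) keeps the right side non-empty
        for extra in combinations(rest, k):
            left = {first, *extra}
            right = set(distinct) - left
            splits.append((left, right))
    return splits


# Hypothetical categorical column with three levels.
income = ["low", "medium", "high", "low"]
print(multiway_split(income))
for left, right in binary_splits(income):
    print(sorted(left), "vs", sorted(right))
```

With three levels there is one multiway split and three candidate binary splits, one of which is "low" against "not low". After a binary split, the "not low" child still contains two levels, so the same variable can be split again further down the tree.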

Sample feature differences: In how feature variables are used, ID3 and C4.5 use a multi-valued categorical variable only once across the levels of the tree, while CART can reuse the same feature multiple times.

Optimization differences in the tree-generation process: C4.5 corrects the accuracy of the tree through pruning, while CART directly uses all of the data to find the candidate tree structures and compares them.

Author: slade_sal. Source: Jianshu. Copyright belongs to the author. For any form of reprint, contact the author for authorization and credit the source.