[srm] Decision Trees

Single decision tree
Splitting $M-1$ times yields $M$ terminal nodes $R_m$. The response variable can be categorical with $K$ levels or continuous.
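As a concrete starting point, here is a minimal sketch of fitting a single classification tree. The use of scikit-learn, the iris data, and the parameter values are illustrative assumptions, not something these notes prescribe.

```python
# Minimal sketch: a single classification tree (scikit-learn assumed).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # response has K = 3 levels

# max_leaf_nodes = M caps the terminal nodes R_m,
# i.e. the tree makes at most M - 1 binary splits.
tree = DecisionTreeClassifier(max_leaf_nodes=5, random_state=0).fit(X, y)
print(tree.get_n_leaves())  # M, the number of terminal nodes
```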
Node impurity
Classification error rate
$$\begin{aligned} E_m &= \frac{\text{# misclassified}}{\text{# samples in } R_m} \\ &= 1 - \max_k(\hat{p}_{mk}) \end{aligned}$$
(Since the level $k$ with the largest $\hat{p}_{mk}$ is the predicted class $\hat{Y}_m$, the error rate is one minus that largest proportion.)
Gini index
$$\begin{aligned} G_m &= \sum_{k=1}^K \hat{p}_{mk} (1 - \hat{p}_{mk}) \\ &= 1 - \sum_{k=1}^K \hat{p}_{mk}^2 \end{aligned}$$
(For $K = 2$, this reduces to $G_m = 2\hat{p}_{m1}(1 - \hat{p}_{m1})$.)
Cross-entropy
$$D_m = -\sum_{k=1}^K \hat{p}_{mk} \ln(\hat{p}_{mk})$$
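The three measures are easy to compare numerically. Below is a minimal sketch (NumPy assumed; the helper name and the example counts are hypothetical) that computes all three for one node from its class counts $n_{mk}$.

```python
import numpy as np

def node_impurities(counts):
    """Classification error, Gini index, and cross-entropy of one node R_m,
    given the class counts n_mk of its K levels."""
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum()                 # p_hat_mk
    error = 1.0 - p.max()                     # E_m
    gini = 1.0 - np.sum(p ** 2)               # G_m
    p_pos = p[p > 0]                          # convention: 0 * ln(0) = 0
    entropy = -np.sum(p_pos * np.log(p_pos))  # D_m
    return error, gini, entropy

# A node holding class counts (7, 2, 1), i.e. n_m = 10 samples:
print(node_impurities([7, 2, 1]))  # (0.3, 0.46, 0.802...)
```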
Overall impurity
$$\begin{aligned} I_{\text{overall}} &= \sum_{m=1}^M \frac{n_m}{N} \cdot I(R_m) \\ E_{\text{overall}} &= \frac{\text{# misclassified}}{N} \\ D_{\text{overall}} &= -\frac{1}{N}\sum_{m=1}^M\sum_{k=1}^K n_{mk}\ln (\hat{p}_{mk}) \\ \text{Deviance} &= -2\sum^M_{m=1}\sum_{k=1}^K n_{mk}\ln (\hat{p}_{mk}) = 2N D_{\text{overall}} \\ \text{Residual mean deviance} &= \frac{\text{Deviance}}{N-M} \end{aligned}$$
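A small sketch of this deviance bookkeeping (NumPy assumed; the function name and the two-node example are hypothetical), taking an $M \times K$ array of counts $n_{mk}$:

```python
import numpy as np

def deviance_stats(counts):
    """Deviance and residual mean deviance from an M x K array of n_mk."""
    counts = np.asarray(counts, dtype=float)
    M = counts.shape[0]
    N = counts.sum()
    p = counts / counts.sum(axis=1, keepdims=True)  # p_hat_mk per node
    nz = counts > 0                                 # skip n_mk = 0 terms
    deviance = -2.0 * np.sum(counts[nz] * np.log(p[nz]))
    return deviance, deviance / (N - M)

# Two terminal nodes with class counts (8, 2) and (1, 9): N = 20, M = 2.
print(deviance_stats([[8, 2], [1, 9]]))
```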
Importance
A variable is more important the more the nodes split on it reduce impurity. (That is, for each such node: the node's impurity weighted by its sample size, minus the two children's impurities weighted by their sample sizes.)
$$\operatorname{Importance}(j)=\sum_{m\,\text{split on}\,x_j}\left[n_m\cdot I(R_m)-\left(n_{m,\text{left}}\cdot I(R_{m,\text{left}})+n_{m,\text{right}}\cdot I(R_{m,\text{right}})\right)\right]$$
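The bracketed term is just the weighted impurity decrease of one split; a minimal sketch (the function name and the numbers are hypothetical):

```python
def split_gain(n, i_node, n_left, i_left, n_right, i_right):
    """Weighted impurity decrease of one split:
    n_m * I(R_m) - (n_left * I(R_left) + n_right * I(R_right)).
    Importance(j) sums this gain over every node split on variable j."""
    return n * i_node - (n_left * i_left + n_right * i_right)

# Splitting 100 samples (Gini 0.5) into 60 (Gini 0.2) and 40 (Gini 0.3):
print(split_gain(100, 0.5, 60, 0.2, 40, 0.3))  # 26.0
```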
Tree ensembles
Importance
Defined much as for a single tree, but each variable's importance is averaged over all the trees (sketched below).
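A sketch of that averaging (scikit-learn and its iris data assumed purely for illustration; a random forest stands in for the generic ensemble):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Ensemble importance: per-tree impurity-decrease importance, averaged.
# This should agree with forest.feature_importances_ up to floating point.
avg = np.mean([t.feature_importances_ for t in forest.estimators_], axis=0)
print(avg)
```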
Bagging
Draw B bootstrap samples, i.e. sample with replacement B times (each sample of size N), fit a tree to each, and estimate with
$$\hat{f}_{\text{bag}}(x)=\frac{1}{B}\sum^B_{b=1} \hat{f}^{*b}(x)$$
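A minimal hand-rolled sketch of that estimator (NumPy and scikit-learn assumed; the synthetic data and all names are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)
N, B = len(y), 50

trees = []
for b in range(B):
    idx = rng.integers(0, N, size=N)  # bootstrap: N draws with replacement
    trees.append(DecisionTreeRegressor(random_state=0).fit(X[idx], y[idx]))

def f_bag(x):
    """Average the B bootstrap trees' predictions."""
    return np.mean([t.predict(x) for t in trees], axis=0)

print(f_bag(X[:3]))
```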
Random Forest
Much like bagging, but at each split only m of the p predictors are randomly selected as candidate split variables; the suggested choice of m is
$$m\approx\sqrt{p}$$
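In scikit-learn (assumed here for illustration) this per-split subsampling is the `max_features` argument, and `"sqrt"` implements the $m \approx \sqrt{p}$ rule:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)  # p = 30 predictors
# max_features="sqrt": each split considers only ~sqrt(30), i.e. 5 candidates.
forest = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                                random_state=0).fit(X, y)
print(forest.score(X, y))
```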
Boosting
Start with all fitted values at 0, so the residuals equal y. Given a learning rate λ, repeat B times: fit a small tree with d splits to the current residuals, then update the fitted values and residuals by λ times the tree's predictions. The final estimate is
$$\hat{f}_{\text{boosting}}(x)=\lambda\sum^B_{b=1} \hat{f}^{b}(x)$$
When d = 1 each tree makes only a single cut, so no interactions between predictors can enter the model, and the ensemble is additive in the individual predictors:
$$\hat{f}_{\text{boosting}}(x)=\sum^p_{j=1}f_{j}(x_j)$$
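A minimal sketch of the boosting loop just described (NumPy and scikit-learn assumed; the data and names are illustrative). A tree with d splits has d + 1 leaves, hence `max_leaf_nodes=d + 1`:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)

B, d, lam = 500, 2, 0.1          # B rounds, d splits per tree, learning rate
f_hat = np.zeros_like(y)         # fitted values start at 0 ...
r = y.copy()                     # ... so the residuals start at y
trees = []
for b in range(B):
    t = DecisionTreeRegressor(max_leaf_nodes=d + 1).fit(X, r)  # small tree
    trees.append(t)
    f_hat += lam * t.predict(X)  # move the fit a fraction lam toward the tree
    r = y - f_hat                # refresh residuals

def f_boost(x):
    """lam times the sum of the B trees' predictions."""
    return lam * np.sum([t.predict(x) for t in trees], axis=0)
```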