1. Important parameter: criterion
| criterion value | meaning |
| --- | --- |
| mse | mean squared error: minimizes the L2 loss, using the mean of each terminal node; equivalent to variance reduction as a split criterion |
| friedman_mse | mean squared error with Friedman's improvement score for potential splits |
| mae | mean absolute error: minimizes the L1 loss, using the median of each terminal node |
mse: In a regression tree, MSE is not only the criterion for measuring branch quality; it is also the metric we most often use to evaluate the quality of the regression itself. Note, however, that the regression tree's `score` interface returns R², not MSE.
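A quick sketch of this point: `score` agrees with `r2_score`, while MSE has to be computed separately. The diabetes dataset here is an assumed stand-in (the Boston dataset used elsewhere in these notes has been removed from recent scikit-learn versions).

```python
from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# assumed stand-in dataset
X, y = load_diabetes(return_X_y=True)
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=0)

reg = DecisionTreeRegressor(random_state=0).fit(Xtrain, ytrain)
pred = reg.predict(Xtest)

# score() returns R-squared, identical to r2_score on the predictions...
print(reg.score(Xtest, ytest), r2_score(ytest, pred))
# ...so MSE must be computed explicitly
print(mean_squared_error(ytest, pred))
```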
2. (Extension) Cross-validation: split the data into training and test sets many times and score the model on each split to assess its stability
```python
from sklearn.datasets import load_boston
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

boston = load_boston()
regressor = DecisionTreeRegressor(random_state=0)
cross_val_score(regressor, boston.data, boston.target,
                cv=10, scoring='neg_mean_squared_error')
```
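One detail worth noting: with `scoring='neg_mean_squared_error'` the returned scores are all negative, because scikit-learn's convention is that higher scores are always better. A minimal sketch (using the diabetes dataset as an assumed stand-in for Boston) of recovering the actual MSE:

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)  # assumed stand-in dataset
scores = cross_val_score(DecisionTreeRegressor(random_state=0),
                         X, y, cv=10, scoring="neg_mean_squared_error")
print(scores)          # every entry is <= 0 by convention
mse = -scores.mean()   # flip the sign to get an average MSE
print(mse)
```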
3. Plotting a one-dimensional regression curve
1. Import the libraries
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor
```
2. Create a noisy sine curve
```python
rng = np.random.RandomState(1)            # seeded random number generator
X = np.sort(5 * rng.rand(80, 1), axis=0)  # rand(80, 1): 80 rows x 1 column in [0, 1), scaled to [0, 5); features must be 2-D
y = np.sin(X).ravel()                     # labels must be 1-D
y[::5] += 3 * (0.5 - rng.rand(16))        # add noise in (-1.5, 1.5] to every 5th point
```
3. Instantiate and fit the models
```python
regr_1 = DecisionTreeRegressor(max_depth=2)
regr_2 = DecisionTreeRegressor(max_depth=5)
regr_1.fit(X, y)
regr_2.fit(X, y)
```
4. Generate a test set and feed it to the models
```python
X_test = np.arange(0.0, 5.0, 0.01)[:, np.newaxis]  # arange(start, stop, step); np.newaxis raises it to 2-D
y_1 = regr_1.predict(X_test)
y_2 = regr_2.predict(X_test)
```
5. Plot the results
```python
plt.figure()
plt.scatter(X, y, s=20, edgecolor="black", c="darkorange", label="data")
plt.plot(X_test, y_1, color="cornflowerblue", label="max_depth=2", linewidth=2)
plt.plot(X_test, y_2, color="yellowgreen", label="max_depth=5", linewidth=2)
plt.xlabel("data")
plt.ylabel("target")
plt.title("Decision Tree Regression")
plt.legend()
plt.show()
```
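The plot shows the depth-5 tree bending toward the noisy points while the depth-2 tree captures only the broad sine shape. A small self-contained sketch (re-creating the same data as above) makes the overfitting concrete by comparing training R²:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# same noisy sine data as in the steps above
rng = np.random.RandomState(1)
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel()
y[::5] += 3 * (0.5 - rng.rand(16))

regr_1 = DecisionTreeRegressor(max_depth=2).fit(X, y)
regr_5 = DecisionTreeRegressor(max_depth=5).fit(X, y)

# the deeper tree fits the training data (noise included) more closely
print(regr_1.score(X, y), regr_5.score(X, y))
```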
4. (Extension) Tuning parameters with grid search
```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

gini_thresholds = np.linspace(0, 0.5, 20)
parameters = {
    "splitter": ("best", "random"),
    "criterion": ("gini", "entropy"),      # gini/entropy apply to classification trees
    "max_depth": [*range(1, 10)],
    "min_samples_leaf": [*range(1, 50, 5)],
    "min_impurity_decrease": [*gini_thresholds],
}
clf = DecisionTreeClassifier(random_state=20)
GS = GridSearchCV(clf, parameters, cv=10)
GS.fit(Xtrain, Ytrain)   # Xtrain, Ytrain: a previously prepared training split
GS.best_params_          # best parameter combination found
GS.best_score_           # best cross-validated score
```

Note: since the grid includes "gini" and "entropy", the estimator must be `DecisionTreeClassifier`, not `DecisionTreeRegressor`; those criteria are only valid for classification.
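The snippet above assumes `Xtrain` and `Ytrain` already exist. A runnable, trimmed-down sketch of the same idea (wine dataset and a deliberately small grid are assumptions, chosen so the search finishes quickly):

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# assumed example dataset and split
X, y = load_wine(return_X_y=True)
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, y, random_state=420)

# a small grid so the exhaustive search stays cheap
parameters = {
    "criterion": ("gini", "entropy"),
    "max_depth": [*range(1, 6)],
}
GS = GridSearchCV(DecisionTreeClassifier(random_state=25), parameters, cv=5)
GS.fit(Xtrain, Ytrain)

print(GS.best_params_)  # best combination found over the grid
print(GS.best_score_)   # its mean cross-validated accuracy
```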
5. Summary: strengths and weaknesses of decision trees

| | |
| --- | --- |
| Strengths | 1. Works for both regression and classification. 2. A white-box model: for a given situation, the decision path is easy to inspect. 3. Can handle multi-label problems. |
| Weaknesses | 1. Prone to overfitting, producing overly complex trees that are hard to tune. 2. Relies on locally optimal (greedy) splits to approximate a global optimum, with no guarantee of reaching it; trees are also unstable, as small changes in the data can produce a very different tree. |