What is cross-validation? What are its common methods? Explain with code examples.

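Cross-validation is a model evaluation method used to estimate how well a machine learning model generalizes to unseen data. The basic idea is to split the dataset into a training set and a validation set, train the model on the training set, and evaluate it on the validation set. Repeating this over several different splits and averaging the results gives a more stable estimate than any single split.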
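The single train-then-validate step described above, which cross-validation repeats over different splits, could be sketched as follows. The Iris dataset, the `LogisticRegression` model, and the 80/20 split ratio are assumptions chosen only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Example data (assumption): 150 samples, 4 features, 3 classes
x, y = load_iris(return_X_y=True)

# Hold out 20% of the samples for validation; stratify keeps class ratios similar
x_train, x_val, y_train, y_val = train_test_split(
    x, y, test_size=0.2, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000)
model.fit(x_train, y_train)         # train on the training set only
score = model.score(x_val, y_val)   # accuracy on samples the model has not seen
print(score)
```

A score from one split like this depends on which samples happened to land in the validation set, which is the variability that averaging over several splits reduces.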

Common cross-validation methods include:

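  • K-fold cross-validation: split the dataset into K subsets (folds) of equal or nearly equal size. In each round, use K-1 folds as the training set and the remaining fold as the validation set. Run K rounds of training and validation so that every sample is validated exactly once, then average the scores.
  • Leave-one-out cross-validation: for a dataset of N samples, use N-1 samples as the training set and the single remaining sample as the validation set. Run N rounds of training and validation, then average the scores. This is K-fold cross-validation with K = N.
  • Repeated random subsampling (also called Monte Carlo cross-validation): randomly select a fixed proportion of the samples as the training set and use the rest as the validation set. Repeat this many times, then average the scores. Unlike K-fold, a given sample may appear in several validation sets or in none.
  • Time-series cross-validation: for time-series data, each split trains on earlier time points and validates on the time points that follow them, so the model is never trained on data from the future of its validation set.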
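To make the fold bookkeeping in the list above concrete, here is a minimal pure-Python sketch of how K-fold splits can be generated. `k_fold_indices` is a hypothetical helper written for this illustration, not a library function, and it assumes contiguous folds with no shuffling.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, val_indices) for each of k contiguous folds."""
    # The first (n_samples % k) folds get one extra sample,
    # so fold sizes differ by at most 1.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        val = list(range(start, start + size))
        # Everything outside the validation fold is used for training
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size


folds = list(k_fold_indices(n_samples=7, k=3))
for train, val in folds:
    print("train:", train, "val:", val)
# train: [3, 4, 5, 6] val: [0, 1, 2]
# train: [0, 1, 2, 5, 6] val: [3, 4]
# train: [0, 1, 2, 3, 4] val: [5, 6]
```

Calling the helper with `k` equal to the number of samples yields one validation sample per round, which is exactly the leave-one-out scheme.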

Code examples:

K-fold cross-validation:

python
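import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Example data and model (assumptions; substitute your own x, y, and model)
x, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# StratifiedKFold is a K-fold variant that keeps class proportions similar in
# every fold; use KFold for plain K-fold or for regression targets
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
scores = []
for train_index, val_index in kf.split(x, y):
    x_train, x_val = x[train_index], x[val_index]
    y_train, y_val = y[train_index], y[val_index]
    model.fit(x_train, y_train)               # retrain on this round's training folds
    scores.append(model.score(x_val, y_val))  # accuracy on the held-out fold
mean_score = np.mean(scores)
print(mean_score)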

Leave-one-out cross-validation:

python
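import numpy as np
from sklearn.model_selection import LeaveOneOut

# x, y (NumPy arrays) and model (a scikit-learn classifier) are assumed to be
# defined beforehand
loo = LeaveOneOut()
scores = []
for train_index, val_index in loo.split(x):
    # Train on all samples except one; indexing with val_index keeps x_val 2-D
    x_train, x_val = x[train_index], x[val_index]
    y_train, y_val = y[train_index], y[val_index]
    model.fit(x_train, y_train)
    scores.append(model.score(x_val, y_val))  # 1.0 if the sample is classified correctly, else 0.0
mean_score = np.mean(scores)
print(mean_score)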

Repeated random subsampling:

python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# x, y (NumPy arrays) and model (a scikit-learn estimator) are assumed to be
# defined beforehand

# 100 independent random splits, each holding out 20% of the samples
rs = ShuffleSplit(n_splits=100, test_size=0.2, random_state=0)

scores = []
for train_index, val_index in rs.split(x, y):
    x_train, x_val = x[train_index], x[val_index]
    y_train, y_val = y[train_index], y[val_index]
    model.fit(x_train, y_train)
    scores.append(model.score(x_val, y_val))
mean_score = np.mean(scores)
print(mean_score)
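
Time-series cross-validation:

The time-series method from the list has no example above, so here is a minimal sketch using scikit-learn's `TimeSeriesSplit`, which produces expanding training windows followed by a validation block. The synthetic series (a noisy linear trend), the `LinearRegression` model, and `n_splits=5` are assumptions made only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit

# Synthetic series (illustrative only): a noisy linear trend over 60 time steps
rng = np.random.default_rng(0)
t = np.arange(60)
x = t.reshape(-1, 1).astype(float)
y = 0.5 * t + rng.normal(scale=1.0, size=t.size)

model = LinearRegression()
tscv = TimeSeriesSplit(n_splits=5)
splits = list(tscv.split(x))

scores = []
for train_index, val_index in splits:
    # Every training index precedes every validation index, so the model
    # never sees data from the future of the block it is evaluated on
    assert train_index.max() < val_index.min()
    model.fit(x[train_index], y[train_index])
    scores.append(model.score(x[val_index], y[val_index]))  # R^2 on the later block

mean_score = np.mean(scores)
print(mean_score)
```

The data is deliberately not shuffled here: shuffling would mix future observations into the training set and make the score look better than it would be in real forecasting.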