StratifiedKFold vs. KFold in scikit-learn
I used this code to test KFold and StratifiedKFold:
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# 8 samples with 4 features each
X = np.array([
    [1, 2, 3, 4],
    [11, 12, 13, 14],
    [21, 22, 23, 24],
    [31, 32, 33, 34],
    [41, 42, 43, 44],
    [51, 52, 53, 54],
    [61, 62, 63, 64],
    [71, 72, 73, 74],
])
# The first four samples are class 0, the last four are class 1
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# random_state only matters when shuffle=True, so it is left at its default
# (newer scikit-learn releases reject random_state combined with shuffle=False)
sfolder = StratifiedKFold(n_splits=4, shuffle=False)
folder = KFold(n_splits=4, shuffle=False)

for train, test in sfolder.split(X, y):
    print('Train: %s | test: %s' % (train, test))
print("StratifiedKFold done")

for train, test in folder.split(X, y):
    print('Train: %s | test: %s' % (train, test))
print("KFold done")
I found that StratifiedKFold preserves the proportion of labels in each fold, but KFold does not.
Train: [1 2 3 5 6 7] | test: [0 4]
Train: [0 2 3 4 6 7] | test: [1 5]
Train: [0 1 3 4 5 7] | test: [2 6]
Train: [0 1 2 4 5 6] | test: [3 7]
StratifiedKFold done
Train: [2 3 4 5 6 7] | test: [0 1]
Train: [0 1 4 5 6 7] | test: [2 3]
Train: [0 1 2 3 6 7] | test: [4 5]
Train: [0 1 2 3 4 5] | test: [6 7]
KFold done
It looks like StratifiedKFold is better, so should KFold never be used?
When should I use KFold instead of StratifiedKFold?
Answer
I think the better question is "When should I use StratifiedKFold instead of KFold?"
First, you need to know what "KFold" and "Stratified" mean.
KFold is a cross-validator that divides the dataset into k folds.
Stratified means that each fold has approximately the same proportion of observations with a given label as the full dataset.
So StratifiedKFold is a variation of KFold that additionally preserves the class proportions in every fold.
Therefore, the answer is that we should prefer StratifiedKFold over KFold when dealing with classification tasks with imbalanced class distributions.
For example:
Suppose there is a dataset with 16 data points and an imbalanced class distribution: 12 of the data points belong to class A and the remaining 4 belong to class B, so the ratio of class B to class A is 1/3. If we use StratifiedKFold with k = 4, every training set will contain 9 data points from class A and 3 from class B, and every test set will contain 3 data points from class A and 1 from class B.
As we can see, StratifiedKFold preserves the class distribution of the dataset in every split, whereas KFold does not take it into account.
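To make the 16-point example concrete, here is a minimal sketch that counts the class members in each training and test set for both splitters. The labels "A" and "B", the placeholder feature array, and the helper `class_counts` are made up for illustration and are not part of scikit-learn. The sketch also assumes the data is sorted by label and not shuffled, which is the case where plain KFold behaves worst.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# 16 samples: 12 of class "A" followed by 4 of class "B" (sorted by label).
y = np.array(["A"] * 12 + ["B"] * 4)
X = np.arange(16).reshape(-1, 1)  # placeholder features; only the labels matter here


def class_counts(splitter, X, y):
    """Hypothetical helper: per-fold (train, test) class counts as dicts."""
    counts = []
    for train_idx, test_idx in splitter.split(X, y):
        train = {c: int(np.sum(y[train_idx] == c)) for c in ("A", "B")}
        test = {c: int(np.sum(y[test_idx] == c)) for c in ("A", "B")}
        counts.append((train, test))
    return counts


stratified = class_counts(StratifiedKFold(n_splits=4), X, y)
plain = class_counts(KFold(n_splits=4), X, y)

for name, result in (("StratifiedKFold", stratified), ("KFold", plain)):
    print(name)
    for train, test in result:
        print("  train:", train, "| test:", test)
```

With these assumptions, every StratifiedKFold split keeps the 3:1 ratio in both the training and the test set, while the unshuffled KFold test sets each contain only one class. Passing shuffle=True to KFold would mix the classes, but the per-fold proportions would still vary from fold to fold.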