Can someone explain why this code:
from sklearn.model_selection import train_test_split
import pandas as pd
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC
import numpy as np
#df = pd.read_csv('missing_data.csv',sep=',')
df = pd.DataFrame(np.array([[1, 2, 3, 4, 5, 6, 7, 8, 9, 1],
                            [4, 5, 6, 3, 4, 5, 7, 5, 4, 1],
                            [7, 8, 9, 6, 2, 3, 6, 5, 4, 1],
                            [7, 8, 9, 6, 1, 3, 2, 2, 4, 0],
                            [7, 8, 9, 6, 5, 6, 6, 5, 4, 0]]),
                  columns=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'])
X_train = df.iloc[:,:-1]
y_train = df.iloc[:,-1]
clf=SVC(kernel='linear')
kfold = StratifiedKFold(n_splits=2,random_state=42,shuffle=True)
for train_index,test_index in kfold.split(X_train,y_train):
    x_train_fold,x_test_fold = X_train[train_index],X_train[test_index]
    y_train_fold,y_test_fold = y_train[train_index],y_train[test_index]
    clf.fit(x_train_fold,y_train_fold)
raises this error:
Traceback (most recent call last):
File "test_traintest.py", line 23, in <module>
x_train_fold,x_test_fold = X_train[train_index],X_train[test_index]
File "/Users/slowat/anaconda/envs/nlp_course/lib/python3.7/site-packages/pandas/core/frame.py", line 3030, in __getitem__
indexer = self.loc._get_listlike_indexer(key, axis=1, raise_missing=True)[1]
File "/Users/slowat/anaconda/envs/nlp_course/lib/python3.7/site-packages/pandas/core/indexing.py", line 1266, in _get_listlike_indexer
self._validate_read_indexer(keyarr, indexer, axis, raise_missing=raise_missing)
File "/Users/slowat/anaconda/envs/nlp_course/lib/python3.7/site-packages/pandas/core/indexing.py", line 1308, in _validate_read_indexer
raise KeyError(f"None of [{key}] are in the [{axis_name}]")
KeyError: "None of [Int64Index([2, 3], dtype='int64')] are in the [columns]"
I saw the answer at https://stackoverflow.com/questions/55552921/none-of-index-dtype-object-are-in-the-columns, but my columns are all the same length.
Answer 1
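KFold.split() (and likewise StratifiedKFold.split()) returns the training and test indices as integer positions, so they should be used with a DataFrame like this:
X_train.iloc[train_index]
With your syntax, X_train[train_index], pandas treats them as column names, which is why the KeyError mentions [columns]. Change your code to: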
from sklearn.model_selection import train_test_split
import pandas as pd
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC
import numpy as np
#df = pd.read_csv('missing_data.csv',sep=',')
df = pd.DataFrame(np.array([[1, 2, 3, 4, 5, 6, 7, 8, 9, 1],
                            [4, 5, 6, 3, 4, 5, 7, 5, 4, 1],
                            [7, 8, 9, 6, 2, 3, 6, 5, 4, 1],
                            [7, 8, 9, 6, 1, 3, 2, 2, 4, 0],
                            [7, 8, 9, 6, 5, 6, 6, 5, 4, 0]]),
                  columns=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'])
X_train = df.iloc[:,:-1]
y_train = df.iloc[:,-1]
clf=SVC(kernel='linear')
kfold = StratifiedKFold(n_splits=2,random_state=42,shuffle=True)
for train_index,test_index in kfold.split(X_train,y_train):
    # split() yields integer positions, so select rows with .iloc
    x_train_fold,x_test_fold = X_train.iloc[train_index],X_train.iloc[test_index]
    y_train_fold,y_test_fold = y_train.iloc[train_index],y_train.iloc[test_index]
    clf.fit(x_train_fold,y_train_fold)
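Note that we use .iloc rather than .loc. This is because .iloc selects by integer position, which is what split() returns, whereas .loc selects by index label. In your case it makes no difference, because the pandas index labels happen to match the integer positions, but in other projects that may not be the case, so stick with .iloc.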
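To illustrate the difference, here is a minimal sketch using a made-up DataFrame (not the one from the question) whose index labels start at 100 instead of 0. The names `positions`, `by_position`, and `loc_failed` are only for illustration.

```python
import pandas as pd

# Hypothetical frame whose index labels are NOT 0..n-1
df = pd.DataFrame({"a": [10, 20, 30, 40]}, index=[100, 101, 102, 103])

# The kind of positional indices that split() returns
positions = [1, 3]

# .iloc selects by position, so it works whatever the labels are
by_position = df.iloc[positions]

# .loc selects by label; the labels 1 and 3 do not exist in this index
try:
    df.loc[positions]
    loc_failed = False
except KeyError:
    loc_failed = True
```

Here `by_position` holds the second and fourth rows, while the `.loc` lookup raises a KeyError because none of the requested labels are present.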
Alternatively, when you extract X_train and y_train, you can convert them to numpy arrays:
X_train = df.iloc[:,:-1].to_numpy()
y_train = df.iloc[:,-1].to_numpy()
Then your original code will work unchanged, because indexing a numpy array with an array of integers selects rows by position.
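A small sketch of that behaviour, using a reduced toy frame and a hand-written `train_index` standing in for what split() would yield (both are assumptions for illustration, not the data from the question):

```python
import numpy as np
import pandas as pd

# Reduced toy frame; the last column plays the role of the label
df = pd.DataFrame(
    np.array([[1, 2, 3, 1],
              [4, 5, 6, 1],
              [7, 8, 9, 0]]),
    columns=["a", "b", "c", "j"],
)

X = df.iloc[:, :-1].to_numpy()
y = df.iloc[:, -1].to_numpy()

# Hypothetical indices standing in for one fold from split()
train_index = np.array([0, 2])

# Plain [] indexing on numpy arrays selects by position
X_fold = X[train_index]  # rows 0 and 2 of X
y_fold = y[train_index]  # labels of rows 0 and 2
```

The trade-off is that the column names and the pandas index are lost after to_numpy(), so this route fits best when only the raw values are needed.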