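I am trying to plot a ROC curve with sklearn's roc_curve, and I am required to use roc_auc_score and predict_proba in my code.
I keep getting errors when calling roc_auc_score and roc_curve.
Sorry about some of the variable names (I translated them into English so that they are easier for you to read).
Here is my code: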
import matplotlib.pyplot as pp
import numpy as np
# rest is already imported in different jupyter cell
models = [
    GaussianNB(),
    MLPClassifier(),
    SVC(probability=True),
    KNeighborsClassifier(),
    SGDClassifier(loss='modified_huber')  # default parameter was blocking use of predict_proba()
]
y_true = []
predict_tab = [[], [], [], [], []]
for id_train, id_test in StratifiedKFold(n_splits=5).split(data.data, data.target):
    x_train = data.data[id_train]
    y_train = data.target[id_train]
    x_test = data.data[id_test]
    y_test = data.target[id_test]
    y_true.append(y_test)
    for i, model in enumerate(models):
        model.fit(x_train, y_train)
        predict_tab[i].append(model.predict_proba(x_test))
for i in range(len(models)):
    auc = roc_auc_score(y_true, predict_tab[i])
    pp.figure(figsize=(5,5))
    pp.plot([0,1],[0,1],color='black',lw=2,linestyle='--')
    pp.xlabel('1 - specificity')
    pp.ylabel('sensitivity')
    pp.title(type(models[i]).__name__ + auc)
    fpr, tpr = roc_curve(y_true, predict_tab[i])
    pp.plot(fpr, tpr)
    pp.show()
The traceback I get:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_5344\3795564845.py in <module>
26
27 for i in range(len(modele)):
---> 28 auc = roc_auc_score(y_true, predict_tab[i])
29 pp.figure(figsize=(5,5))
30 pp.plot([0,1],[0,1],color='black',lw=2,linestyle='--')
~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\sklearn\metrics\_ranking.py in roc_auc_score(y_true, y_score, average, sample_weight, max_fpr, multi_class, labels)
543
544 y_type = type_of_target(y_true)
--> 545 y_true = check_array(y_true, ensure_2d=False, dtype=None)
546 y_score = check_array(y_score, ensure_2d=False)
547
~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
798
799 if force_all_finite:
--> 800 _assert_all_finite(array, allow_nan=force_all_finite == "allow-nan")
801
802 if ensure_min_samples > 0:
~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\sklearn\utils\validation.py in _assert_all_finite(X, allow_nan, msg_dtype)
119 # for object dtype data, we only check for NaNs (GH-13254)
120 elif X.dtype == np.dtype("object") and not allow_nan:
--> 121 if _object_dtype_isnan(X).any():
122 raise ValueError("Input contains NaN")
123
AttributeError: 'bool' object has no attribute 'any'
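It seems that roc_auc_score does not want to take a list as an argument, but according to its documentation that should not be a problem.
From the roc_auc_score documentation:
y_true : array-like of shape (n_samples,) or (n_samples, n_classes). True labels or binary label indicators. The binary and multiclass cases expect labels with shape (n_samples,), while the multilabel case expects binary label indicators with shape (n_samples, n_classes).
y_score : array-like of shape (n_samples,) or (n_samples, n_classes). Target scores.
If I comment out the line with roc_auc_score, I get: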
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_5344\1346168288.py in <module>
32 pp.ylabel('sensitivity')
33 # pp.title(type(modele[i]).__name__ + auc)
---> 34 fpr, tpr = roc_curve(y_true, predict_tab[i])
35 pp.plot(fpr, tpr)
36 pp.show()
~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\sklearn\metrics\_ranking.py in roc_curve(y_true, y_score, pos_label, sample_weight, drop_intermediate)
961 """
962 fps, tps, thresholds = _binary_clf_curve(
--> 963 y_true, y_score, pos_label=pos_label, sample_weight=sample_weight
964 )
965
~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\sklearn\metrics\_ranking.py in _binary_clf_curve(y_true, y_score, pos_label, sample_weight)
729 y_type = type_of_target(y_true)
730 if not (y_type == "binary" or (y_type == "multiclass" and pos_label is not None)):
--> 731 raise ValueError("{0} format is not supported".format(y_type))
732
733 check_consistent_length(y_true, y_score, sample_weight)
ValueError: unknown format is not supported
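roc_curve does not accept the arrays either, even though its documentation describes these parameters the same way as for roc_auc_score.
Example content of predict_tab[0]: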
array([[1.00000000e+000, 2.88056881e-149],
[1.00000000e+000, 1.09890141e-053],
[1.00000000e+000, 2.38660986e-068],
[1.00000000e+000, 1.12075737e-038],
[1.00000000e+000, 1.09066234e-060],
[9.99970436e-001, 2.95643658e-005],
[1.00000000e+000, 6.86185710e-032],
[9.99997179e-001, 2.82111613e-006],
...
[7.55856347e-019, 1.00000000e+000],
[7.39915887e-014, 1.00000000e+000],
[1.42321068e-004, 9.99857679e-001],
[2.43861697e-006, 9.99997561e-001],
[3.18293067e-014, 1.00000000e+000],
[7.43778583e-012, 1.00000000e+000],
[9.10540238e-004, 9.99089460e-001],
[1.00000000e+000, 1.18037261e-028],
[2.62948282e-018, 1.00000000e+000]])
Example content of y_true:
[array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,
1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0,
1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0,
0, 0, 1])]
Thanks for your help!
Answer 1
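Your predict_tab holds five lists, one per model, and each of those lists contains one array per fold with two probabilities (one per class) for every test sample. roc_auc_score expects a single 1-D vector with the probability of the positive class, paired with the labels of the same samples. The same applies to roc_curve, which also returns three arrays (fpr, tpr, thresholds) rather than two. Passing the whole y_true list does not work because it is a list of per-fold arrays of unequal length. So, for a single fold (here fold 0), you should pass
# Your code
auc = roc_auc_score(y_true[0], predict_tab[i][0][:, 1])
# Your code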
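If one score over all folds is wanted rather than a per-fold score, the per-fold arrays can be concatenated first. The following is a minimal sketch of that idea, not the asker's actual data: the two small folds and their probabilities are made up for illustration, and the names proba, y_all, and score_all are hypothetical.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.utils.multiclass import type_of_target

# Hypothetical stand-ins for two folds of unequal size (made-up values).
y_true = [np.array([0, 0, 1, 1]), np.array([0, 1, 1])]
proba = [
    np.array([[0.9, 0.1], [0.3, 0.7], [0.35, 0.65], [0.2, 0.8]]),
    np.array([[0.7, 0.3], [0.4, 0.6], [0.1, 0.9]]),
]

# Flatten the folds into one label vector and one score vector.
y_all = np.concatenate(y_true)           # shape (7,)
score_all = np.concatenate(proba)[:, 1]  # positive-class column only

# After concatenation the labels are a plain 1-D binary target.
target_kind = type_of_target(y_all)
auc = roc_auc_score(y_all, score_all)
print(target_kind, round(auc, 3))
```

Pooling the out-of-fold scores like this yields one AUC per model. Averaging the per-fold AUC values is an equally common choice and can give a slightly different number.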
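By the way, this layout is inconvenient. I would rather create an empty list predict_tab and append one vector of scores for the test samples per model:
# Your code
predict_tab = []
# Your code
# decision_function exists for SVC and SGDClassifier, but not for GaussianNB,
# MLPClassifier, or KNeighborsClassifier; for those, append
# model.predict_proba(x_test)[:, 1] instead
predict_tab.append(model.decision_function(x_test))
# Your code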
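Putting the pieces together, here is a hedged end-to-end sketch of the loop from the question. It makes several assumptions that are not in the original post: load_breast_cancer stands in for the asker's dataset, only two of the five models are used to keep it short, the scores of all folds are pooled before plotting, and the Agg backend is selected so the script also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line inside Jupyter
import matplotlib.pyplot as pp
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

data = load_breast_cancer()  # assumption: stand-in for the asker's data
models = [GaussianNB(), KNeighborsClassifier()]

y_true = []
predict_tab = [[] for _ in models]  # per model: one score vector per fold

for id_train, id_test in StratifiedKFold(n_splits=5).split(data.data, data.target):
    x_train, y_train = data.data[id_train], data.target[id_train]
    x_test, y_test = data.data[id_test], data.target[id_test]
    y_true.append(y_test)
    for i, model in enumerate(models):
        model.fit(x_train, y_train)
        # Keep only the positive-class column of predict_proba.
        predict_tab[i].append(model.predict_proba(x_test)[:, 1])

y_all = np.concatenate(y_true)  # labels of all folds, in fold order
aucs = {}
for i, model in enumerate(models):
    scores = np.concatenate(predict_tab[i])  # same fold order as y_all
    auc = roc_auc_score(y_all, scores)
    fpr, tpr, _thresholds = roc_curve(y_all, scores)  # three return values
    name = type(model).__name__
    aucs[name] = auc
    pp.figure(figsize=(5, 5))
    pp.plot([0, 1], [0, 1], color="black", lw=2, linestyle="--")
    pp.plot(fpr, tpr)
    pp.xlabel("1 - specificity")
    pp.ylabel("sensitivity")
    pp.title(f"{name} (AUC = {auc:.3f})")  # format the float into the title
    pp.close()  # in a notebook, call pp.show() here instead
```

Two details differ from the question's code beyond the slicing: roc_curve is unpacked into three values, and the AUC is formatted into the title with an f-string, since concatenating a str with a float raises a TypeError.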