python - sklearn roc_auc_score AttributeError

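I am trying to plot a roc_curve in sklearn, but I am required to use roc_auc_score and predict_proba in my code.

I keep getting errors when using roc_auc_score and roc_curve.

Sorry about some of the variable names (I translated them into English so they are easier for you to read).

Here is my code: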

import matplotlib.pyplot as pp
import numpy as np
# rest is already imported in different jupyter cell

models = [
    GaussianNB(),
    MLPClassifier(),
    SVC(probability=True),
    KNeighborsClassifier(),
    SGDClassifier(loss='modified_huber')        #default parameter was blocking use of predict_proba()
]

y_true = []
predict_tab = [[], [], [], [], []]

for id_train, id_test in StratifiedKFold(n_splits=5).split(data.data, data.target):
    x_train = data.data[id_train]
    y_train = data.target[id_train]
    x_test = data.data[id_test]
    y_test = data.target[id_test]
    y_true.append(y_test)

    for i, model in enumerate(models):
        model.fit(x_train, y_train)
        predict_tab[i].append(model.predict_proba(x_test))


for i in range(len(models)):
    auc = roc_auc_score(y_true, predict_tab[i])
    pp.figure(figsize=(5,5))
    pp.plot([0,1],[0,1],color='black',lw=2,linestyle='--')
    pp.xlabel('1 - specificity')
    pp.ylabel('sensitivity')
    pp.title(type(models[i]).__name__ + auc)
    fpr, tpr = roc_curve(y_true, predict_tab[i])
    pp.plot(fpr, tpr)
    pp.show()

The traceback I get:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_5344\3795564845.py in <module>
     26 
     27 for i in range(len(modele)):
---> 28     auc = roc_auc_score(y_true, predict_tab[i])
     29     pp.figure(figsize=(5,5))
     30     pp.plot([0,1],[0,1],color='black',lw=2,linestyle='--')

~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\sklearn\metrics\_ranking.py in roc_auc_score(y_true, y_score, average, sample_weight, max_fpr, multi_class, labels)
    543 
    544     y_type = type_of_target(y_true)
--> 545     y_true = check_array(y_true, ensure_2d=False, dtype=None)
    546     y_score = check_array(y_score, ensure_2d=False)
    547 

~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
    798 
    799         if force_all_finite:
--> 800             _assert_all_finite(array, allow_nan=force_all_finite == "allow-nan")
    801 
    802     if ensure_min_samples > 0:

~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\sklearn\utils\validation.py in _assert_all_finite(X, allow_nan, msg_dtype)
    119     # for object dtype data, we only check for NaNs (GH-13254)
    120     elif X.dtype == np.dtype("object") and not allow_nan:
--> 121         if _object_dtype_isnan(X).any():
    122             raise ValueError("Input contains NaN")
    123 

AttributeError: 'bool' object has no attribute 'any'

It seems that roc_auc_score does not accept a list as an argument, although according to the documentation it should.

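From the documentation of roc_auc_score:

y_true : array-like of shape (n_samples,) or (n_samples, n_classes). True labels or binary label indicators. The binary and multiclass cases expect labels with shape (n_samples,), while the multilabel case expects binary label indicators with shape (n_samples, n_classes).

y_score : array-like of shape (n_samples,) or (n_samples, n_classes). Target scores.

If I comment out the line with roc_auc_score, I get: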

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_5344\1346168288.py in <module>
     32     pp.ylabel('sensitivity')
     33     # pp.title(type(modele[i]).__name__ + auc)
---> 34     fpr, tpr = roc_curve(y_true, predict_tab[i])
     35     pp.plot(fpr, tpr)
     36     pp.show()

~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\sklearn\metrics\_ranking.py in roc_curve(y_true, y_score, pos_label, sample_weight, drop_intermediate)
    961     """
    962     fps, tps, thresholds = _binary_clf_curve(
--> 963         y_true, y_score, pos_label=pos_label, sample_weight=sample_weight
    964     )
    965 

~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\sklearn\metrics\_ranking.py in _binary_clf_curve(y_true, y_score, pos_label, sample_weight)
    729     y_type = type_of_target(y_true)
    730     if not (y_type == "binary" or (y_type == "multiclass" and pos_label is not None)):
--> 731         raise ValueError("{0} format is not supported".format(y_type))
    732 
    733     check_consistent_length(y_true, y_score, sample_weight)

ValueError: unknown format is not supported

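roc_curve does not accept the arrays either, even though its documentation says the same about these parameters as the roc_auc_score documentation does.

Example contents of predict_tab[0]: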

array([[1.00000000e+000, 2.88056881e-149],
       [1.00000000e+000, 1.09890141e-053],
       [1.00000000e+000, 2.38660986e-068],
       [1.00000000e+000, 1.12075737e-038],
       [1.00000000e+000, 1.09066234e-060],
       [9.99970436e-001, 2.95643658e-005],
       [1.00000000e+000, 6.86185710e-032],
       [9.99997179e-001, 2.82111613e-006],
       ...
       [7.55856347e-019, 1.00000000e+000],
       [7.39915887e-014, 1.00000000e+000],
       [1.42321068e-004, 9.99857679e-001],
       [2.43861697e-006, 9.99997561e-001],
       [3.18293067e-014, 1.00000000e+000],
       [7.43778583e-012, 1.00000000e+000],
       [9.10540238e-004, 9.99089460e-001],
       [1.00000000e+000, 1.18037261e-028],
       [2.62948282e-018, 1.00000000e+000]])

Example contents of y_true:

[array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,
       1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1]), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
       1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0,
       1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0,
       0, 0, 1])]

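Thank you for your help!

Answer 1

Your predict_tab stores five lists, each holding two probabilities for every sample. roc_auc_score expects the probability of the positive class, so you should pass: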

# Your code
    # First fold only: y_true and predict_tab[i] each hold one array per fold
    auc = roc_auc_score(y_true[0], predict_tab[i][0][:, 1])
# Your code
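
A caveat worth sketching: in the question, y_true and each predict_tab[i] are lists with one array per fold, and the folds do not all have the same length, so the lists cannot be scored as they are. One option, assuming it is acceptable to pool the out-of-fold predictions, is to concatenate the folds into 1-D arrays first. The synthetic data from make_classification and the LogisticRegression model below are stand-ins and are not part of the original question.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Synthetic binary data standing in for the question's dataset (assumption).
X, y = make_classification(n_samples=203, n_features=10, random_state=0)

y_true, y_prob = [], []
for id_train, id_test in StratifiedKFold(n_splits=5).split(X, y):
    model = LogisticRegression(max_iter=1000).fit(X[id_train], y[id_train])
    y_true.append(y[id_test])
    # Column 1 holds the probability of the positive class (model.classes_[1]).
    y_prob.append(model.predict_proba(X[id_test])[:, 1])

# The per-fold arrays differ in length (203 is not divisible by 5),
# so flatten each list into a single 1-D array before scoring.
y_true_all = np.concatenate(y_true)
y_prob_all = np.concatenate(y_prob)

pooled_auc = roc_auc_score(y_true_all, y_prob_all)
fold_aucs = [roc_auc_score(t, p) for t, p in zip(y_true, y_prob)]
print(f"pooled AUC = {pooled_auc:.3f}, mean per-fold AUC = {np.mean(fold_aucs):.3f}")
```

Scoring each fold separately and averaging, as in fold_aucs, is an alternative when pooling predictions from differently fitted models is not desired.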

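By the way, this is rather inconvenient. I would rather create an empty list predict_tab and append the decision scores of the test samples for each model: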

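# Your code
predict_tab = []
# Your code
        # Note: GaussianNB, MLPClassifier and KNeighborsClassifier have no
        # decision_function; use predict_proba(x_test)[:, 1] for those models
        predict_tab.append(model.decision_function(x_test))
# Your code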
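
As a rough end-to-end sketch of this idea: the helper positive_scores below is hypothetical (it is not a scikit-learn function) and falls back to predict_proba when a model has no decision_function. The data is synthetic, and only two of the question's five models are used to keep the example short. Note that roc_curve returns three arrays (fpr, tpr, thresholds), not two. Pooling scores across folds is an assumption of this sketch, since each fold's model is fitted separately and its scores may not be on exactly the same scale.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC


def positive_scores(model, x):
    """Hypothetical helper: return 1-D scores for the positive class."""
    if hasattr(model, "decision_function"):
        return model.decision_function(x)   # shape (n_samples,) for binary problems
    return model.predict_proba(x)[:, 1]     # probability of model.classes_[1]


# Synthetic binary data standing in for the question's dataset (assumption).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
models = [GaussianNB(), SVC()]

y_true = []
predict_tab = [[] for _ in models]
for id_train, id_test in StratifiedKFold(n_splits=5).split(X, y):
    y_true.append(y[id_test])
    for i, model in enumerate(models):
        model.fit(X[id_train], y[id_train])
        predict_tab[i].append(positive_scores(model, X[id_test]))

y_true_all = np.concatenate(y_true)
results = {}
for i, model in enumerate(models):
    scores = np.concatenate(predict_tab[i])
    auc = roc_auc_score(y_true_all, scores)
    fpr, tpr, thresholds = roc_curve(y_true_all, scores)
    results[type(model).__name__] = (auc, fpr, tpr)
    print(f"{type(model).__name__}: AUC = {auc:.3f}")
```

To draw the curves, each stored fpr and tpr pair can be passed to pp.plot(fpr, tpr). The AUC should be formatted into the title with an f-string, because concatenating a string with a float, as in the question's pp.title call, raises a TypeError.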
