python - Why is my sklearn SVC much slower when I scale manually instead of using a pipeline?

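I want to build an SVC on the imdb dataset using a CountVectorizer. The documentation says to scale the training/test data for better results. There is example code that uses a pipeline, but it should work with manual scaling as well.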

import numpy as np
import datasets
import pandas as pd
import timeit
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# prepare dataset
training_data = pd.DataFrame(datasets.load_dataset("imdb", split="train"))
training_data = training_data.sample(frac=1).reset_index(drop=True)
test_data = pd.DataFrame(datasets.load_dataset("imdb", split="test"))
test_data = test_data.sample(frac=1).reset_index(drop=True)

vect = CountVectorizer()
data = vect.fit_transform(pd.concat([training_data["text"], test_data["text"]])).toarray()
train = data[0:25000,]
test = data[25000:50000,]

# find most frequent words in the comments
count = np.sum(train, axis=0)
ind = np.argsort(count)[::-1]

ks = [10]
#ks = [10, 50, 100, 500, 1000, 2000] # I want to compare the results for different k

# reduce features to the k most frequent tokens
# columns are already sorted by frequency desc
k_ind = ind[:max(ks)]
X = np.ascontiguousarray(train[:,k_ind])
y = training_data["label"]
test_set = np.ascontiguousarray(test[:,k_ind])

print(f"Check if X is C-contiguous: {X[:,:min(ks)].flags}")

# Test the execution time with pipeline first
for k in ks:
  clf = make_pipeline(StandardScaler(), SVC(C=1, kernel='linear', cache_size=4000))
  # only use k features
  t = timeit.timeit(stmt='clf.fit(X[:,:k],y)', number=1, globals=globals())
  print(f"Time with pipeline: {t}s")

# Test the execution with manual scaling
scaler = StandardScaler()
scaler.fit(X)
scaler.transform(X, copy=False)
scaler.fit(test_set)
scaler.transform(test_set, copy=False)

for k in ks:
  clf = SVC(C=1, kernel='linear', cache_size=4000)
  t = timeit.timeit(stmt='clf.fit(X[:,:k],y)', number=1, globals=globals())
  print(f"Time with manual scaling: {t}s")

This produces the output:

Check if X is C-contiguous:
  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False
Time with pipeline: 58.852547400000276s
Time with manual scaling: 181.0459952000001s

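As you can see, the pipeline is much faster. Why is that? I want to test the classifier for different values of k, but then the pipeline, and with it the scaler, would be called multiple times on the same training data. Does scaling already-scaled data return the same result, or does it change on every iteration? (That uncertainty is why I scale manually and then slice the scaled data.)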

Answer 1

Well, I simply forgot to save the scaled arrays...

<...>
# Test the execution with manual scaling
scaler = StandardScaler()
scaler.fit(X)
X = scaler.transform(X, copy=False)
scaler.fit(test_set)
test_set = scaler.transform(test_set, copy=False)
<...>
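
A likely reason the original `copy=False` calls had no visible effect: `CountVectorizer` produces integer counts, and `StandardScaler` has to convert integer input to floating point, so it returns a new array instead of overwriting the input. The sketch below illustrates this with a small made-up integer array (an assumption standing in for the IMDB count matrix).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy integer count matrix standing in for the CountVectorizer output (assumption).
X_int = np.array([[1, 0], [3, 2], [5, 4]], dtype=np.int64)
X_before = X_int.copy()

scaler = StandardScaler().fit(X_int)
X_scaled = scaler.transform(X_int, copy=False)

# The integer input is left untouched; the scaled values exist only in the return value.
print("input unchanged:", np.array_equal(X_int, X_before))
print("result dtype:", X_scaled.dtype)
print("column means after scaling:", X_scaled.mean(axis=0))
```

If the return value is discarded, as in the question, the SVC is trained on the raw counts, which would explain the slower fit.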

Now both approaches take the same amount of time.
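
On the second part of the question: `StandardScaler` standardizes each column independently, so scaling the full matrix once and then slicing the first k columns should match scaling only those k columns, and re-scaling data that is already standardized should leave it essentially unchanged. A minimal check on synthetic data (the array, seed, and `k` below are illustrative assumptions, not the IMDB features):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic count-like features (assumption), converted to float for scaling.
rng = np.random.default_rng(0)
X = rng.poisson(3.0, size=(200, 6)).astype(np.float64)
k = 3

# Scale everything once, then slice, versus slicing first and scaling the slice.
full_then_slice = StandardScaler().fit_transform(X)[:, :k]
slice_then_scale = StandardScaler().fit_transform(X[:, :k])
print("same result:", np.allclose(full_then_slice, slice_then_scale))

# Standardizing data that is already standardized is effectively a no-op.
rescaled = StandardScaler().fit_transform(full_then_slice)
print("idempotent:", np.allclose(rescaled, full_then_slice))
```

Under these assumptions, scaling once up front and slicing per k gives the same features as running the pipeline separately for each k.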
