I want to use CountVectorizer to build an SVC on the imdb dataset. The documentation says to scale the training/test data for better results. There is example code that uses a pipeline, but it should also work with manual scaling.
import numpy as np
import datasets
import pandas as pd
import timeit
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
# prepare dataset
training_data = pd.DataFrame(datasets.load_dataset("imdb", split="train"))
training_data = training_data.sample(frac=1).reset_index(drop=True)
test_data = pd.DataFrame(datasets.load_dataset("imdb", split="test"))
test_data = test_data.sample(frac=1).reset_index(drop=True)
vect = CountVectorizer()
data = vect.fit_transform(pd.concat([training_data["text"], test_data["text"]])).toarray()
train = data[:25000]
test = data[25000:50000]
# find the most frequent words in the reviews
count = np.sum(train, axis=0)
ind = np.argsort(count)[::-1]
ks = [10]
#ks = [10, 50, 100, 500, 1000, 2000] # I want to compare the results for different k
# reduce features to the k most frequent tokens
# columns are already sorted by frequency desc
k_ind = ind[:max(ks)]
X = np.ascontiguousarray(train[:,k_ind])
y = training_data["label"]
test_set = np.ascontiguousarray(test[:,k_ind])
print(f"Check if X is C-contiguous: {X[:,:min(ks)].flags}")
# Test the execution time with pipeline first
for k in ks:
    clf = make_pipeline(StandardScaler(), SVC(C=1, kernel='linear', cache_size=4000))
    # only use the k most frequent features
    t = timeit.timeit(stmt='clf.fit(X[:,:k],y)', number=1, globals=globals())
    print(f"Time with pipeline: {t}s")
# Test the execution with manual scaling
scaler = StandardScaler()
scaler.fit(X)
scaler.transform(X, copy=False)
scaler.fit(test_set)
scaler.transform(test_set, copy=False)
for k in ks:
    clf = SVC(C=1, kernel='linear', cache_size=4000)
    t = timeit.timeit(stmt='clf.fit(X[:,:k],y)', number=1, globals=globals())
    print(f"Time with manual scaling: {t}s")
This produces the output:
Check if X is C-contiguous:
C_CONTIGUOUS : True
F_CONTIGUOUS : False
OWNDATA : False
WRITEABLE : True
ALIGNED : True
WRITEBACKIFCOPY : False
UPDATEIFCOPY : False
Time with pipeline: 58.852547400000276s
Time with manual scaling: 181.0459952000001s
As you can see, the pipeline is much faster. Why is that? I want to test the classifier for different values of k, but then the pipeline's scaler would be fitted repeatedly on the same training data. Does scaling already-scaled data return the same result, or does it change on every iteration? (That is why I scale manually and then slice the scaled data.)
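For reference on the idempotence part of the question: standardizing data that is already standardized is a no-op up to floating-point error, because its columns already have mean 0 and standard deviation 1, so refitting the scaler recovers (approximately) mean 0 and std 1 again. A minimal, self-contained check on synthetic data (not the imdb counts):

import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
A = rng.normal(loc=5.0, scale=3.0, size=(1000, 10))
once = StandardScaler().fit_transform(A)       # columns now have mean 0, std 1
twice = StandardScaler().fit_transform(once)   # refit and rescale the scaled data
print(np.allclose(once, twice))                # True: rescaling is a no-op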
Answer 1
Well, I just forgot to save the scaled arrays...
<...>
# Test the execution with manual scaling
scaler = StandardScaler()
scaler.fit(X)
X = scaler.transform(X, copy=False)
scaler.fit(test_set)
test_set = scaler.transform(test_set, copy=False)
<...>
Now both approaches take the same amount of time.
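As an aside on why the missing assignments mattered at all: the arrays produced by CountVectorizer's toarray() are integer-typed, and StandardScaler.transform casts its input to float before scaling, so even with copy=False it allocates and returns a new array while the original X is left untouched. Keeping the returned array, as in the corrected code, is therefore required. A minimal sketch with small hypothetical stand-ins for X and test_set; it also shows the usual convention of fitting the scaler on the training data only and reusing its statistics for the test set:

import numpy as np
from sklearn.preprocessing import StandardScaler

# hypothetical stand-ins for the integer count matrices built above
X_train = np.array([[1, 2], [3, 4], [5, 6]], dtype=np.int64)
X_test = np.array([[2, 2], [4, 4]], dtype=np.int64)

scaler = StandardScaler()
scaler.fit(X_train)
out = scaler.transform(X_train, copy=False)
print(X_train)                           # unchanged: the int array was cast to a new float array
print(np.allclose(out.mean(axis=0), 0))  # True: only the returned array is scaled

# usual convention: fit on the training data only, then apply the same
# mean/std to the test set, keeping the returned arrays
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)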