Exercise
Train an SVM classifier on the MNIST dataset. Since SVM classifiers are binary classifiers, you will need to use one-versus-all to classify all 10 digits. You may want to tune the hyperparameters on a small validation set to speed up the process. What accuracy can you reach?
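Before tackling MNIST itself, here is a minimal sketch of the one-versus-all (one-vs-rest) strategy on scikit-learn's small bundled digits dataset. Note that `LinearSVC` already applies one-vs-rest internally for multiclass targets; wrapping it in `OneVsRestClassifier` just makes the strategy explicit.

```python
# One-versus-all: train one binary classifier per class (10 for digits),
# then predict the class whose classifier gives the highest decision score.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# One binary LinearSVC is fitted per digit class.
ovr_clf = OneVsRestClassifier(LinearSVC(random_state=42, max_iter=10000))
ovr_clf.fit(X_train, y_train)

print(len(ovr_clf.estimators_))  # 10 underlying binary classifiers
print(accuracy_score(y_test, ovr_clf.predict(X_test)))
```

The same idea carries over to MNIST; only the dataset and its size change.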
First, the data loading had to be sorted out: the network in mainland China cannot reach mldata.org, so sklearn.datasets raises a connection error when downloading "MNIST Original". I used to read the data through TensorFlow instead, but TensorFlow has been unusable because of a problem with my Anaconda installation.
After some digging today, it turns out the missing piece is the mnist-original.mat file: download it, place it in an mldata folder, and point fetch_mldata's data_home parameter at the local MNIST_data folder. (Note that fetch_mldata has since been removed from scikit-learn; newer versions load MNIST with fetch_openml('mnist_784') instead.)
The file has been added to the mldata folder.
from sklearn.datasets import fetch_mldata
from sklearn.svm import LinearSVC
import numpy as np

mnist = fetch_mldata('MNIST original', data_home="./MNIST_data")
X = mnist["data"]
y = mnist["target"]

X_train = X[:60000]
y_train = y[:60000]
X_test = X[60000:]
y_test = y[60000:]

print('train_length:', len(X_train), len(y_train))
print('test_length:', len(X_test), len(y_test))
First, try fitting a linear SVM to the training data as a baseline.
from sklearn.metrics import accuracy_score

np.random.seed(42)
rnd_idx = np.random.permutation(60000)
X_train = X_train[rnd_idx]
y_train = y_train[rnd_idx]

lin_clf = LinearSVC(random_state=42)
lin_clf.fit(X_train, y_train)

y_pred = lin_clf.predict(X_train)
# As noted back in the Chapter 4 exercises, we don't want to touch the
# test set yet, so evaluate on the training set: accuracy is 0.85375.
accuracy_score(y_train, y_pred)
An accuracy of 85% looks decent, but the data has not been feature-scaled yet. Let's scale the features and fit the linear SVM again: accuracy rises to 92%.
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train.astype(np.float32))
X_test_scaled = scaler.transform(X_test.astype(np.float32))

lin_clf = LinearSVC(random_state=42)
lin_clf.fit(X_train_scaled, y_train)

y_pred = lin_clf.predict(X_train_scaled)
accuracy_score(y_train, y_pred)
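As a quick sanity check of what StandardScaler actually does (on toy data, not MNIST): each feature column is shifted to zero mean and scaled to unit variance, which keeps any one feature from dominating the SVM's distance computations.

```python
# StandardScaler transforms each column to zero mean and unit variance.
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(42)
X_demo = rng.uniform(0, 255, size=(100, 5))   # toy "pixel" data

X_demo_scaled = StandardScaler().fit_transform(X_demo)
print(X_demo_scaled.mean(axis=0).round(6))    # ~0 for every column
print(X_demo_scaled.std(axis=0).round(6))     # ~1 for every column
```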
Going one step further, fit an SVM with an RBF kernel. Accuracy reaches 94.6%.
from sklearn.svm import SVC

svm_clf = SVC(decision_function_shape="ovr", gamma="auto")
svm_clf.fit(X_train_scaled[:10000], y_train[:10000])

y_pred = svm_clf.predict(X_train_scaled)
accuracy_score(y_train, y_pred)
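For reference, the RBF kernel behind SVC(kernel="rbf") is K(x, x') = exp(-gamma * ||x - x'||^2), and gamma="auto" sets gamma to 1/n_features. A minimal check on two toy points, compared against scikit-learn's pairwise helper:

```python
# Manually compute the RBF (Gaussian) kernel value between two points and
# verify it matches scikit-learn's rbf_kernel helper.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

x1 = np.array([[1.0, 2.0]])
x2 = np.array([[2.0, 0.0]])
gamma = 0.5

manual = np.exp(-gamma * np.sum((x1 - x2) ** 2))      # exp(-0.5 * 5)
sklearn_value = rbf_kernel(x1, x2, gamma=gamma)[0, 0]
print(manual, sklearn_value)   # the two values agree
```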
Now tune the hyperparameters with a randomized search with cross-validation. Since this takes a long time, use a smaller subset of only 1,000 samples.
I also tried scaling up by an order of magnitude to 10,000 samples: the time cost is considerable, and the best score was 0.863. The resulting estimator is shown below.
'''
SVC(C=6.335534109540218, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma=0.0011263118134606108,
kernel='rbf', max_iter=-1, probability=False, random_state=None,
shrinking=True, tol=0.001, verbose=False)
'''
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import reciprocal, uniform
from sklearn.svm import SVC

svm_clf = SVC(decision_function_shape="ovr", gamma="auto")
param_distributions = {"gamma": reciprocal(0.001, 0.1), "C": uniform(1, 10)}
rnd_search_cv = RandomizedSearchCV(svm_clf, param_distributions, n_iter=10, verbose=2, cv=3)
rnd_search_cv.fit(X_train_scaled[:1000], y_train[:1000])
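A note on the two search distributions above: reciprocal(0.001, 0.1) is log-uniform over [0.001, 0.1], which spreads samples evenly across orders of magnitude (a good fit for a scale parameter like gamma), while scipy's uniform(loc, scale) is flat over [loc, loc + scale], i.e. C is drawn from [1, 11], not [1, 10]. A quick sketch:

```python
# Sample from the same distributions RandomizedSearchCV draws from and
# inspect their ranges: reciprocal is log-uniform, uniform(1, 10) spans [1, 11].
import numpy as np
from scipy.stats import reciprocal, uniform

rng = np.random.RandomState(42)
gammas = reciprocal(0.001, 0.1).rvs(1000, random_state=rng)
Cs = uniform(1, 10).rvs(1000, random_state=rng)

print(gammas.min(), gammas.max())   # always within [0.001, 0.1]
print(Cs.min(), Cs.max())           # always within [1, 11]
```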
Retrain on the full training set with the best hyperparameters found. This is slow.
rnd_search_cv.best_estimator_.fit(X_train_scaled, y_train)
Also slow:
y_pred = rnd_search_cv.best_estimator_.predict(X_train_scaled)
accuracy_score(y_train, y_pred)  # => 0.99965
Slow as well; finally, evaluate on the test set:
y_pred = rnd_search_cv.best_estimator_.predict(X_test_scaled)
accuracy_score(y_test, y_pred)  # => 0.971