Once I had a problem where I had to produce the best possible solution as measured by area under the ROC curve (AUC, or AUROC). I had hundreds of features and my training set contained about 100,000 objects. It was a classification problem. I had tried gradient boosting, logistic regression and random forests, and then I decided to try SVM with LinearSVC.
Unsurprisingly, I struggled to fit a model with sklearn.svm.SVC even using a linear kernel: libsvm-based training scales poorly with the number of samples, so it simply took too long on a data set of that size.
Then I tried sklearn.svm.LinearSVC, which is implemented on top of liblinear instead of libsvm and is a much better choice for large data sets. But it doesn't have a predict_proba method: it only predicts hard class labels rather than the probability of belonging to a class.
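As an aside, if all you need is the AUC score itself, calibrated probabilities aren't strictly required: AUROC only depends on how examples are ranked, and LinearSVC does expose decision_function, whose raw scores rank examples just fine. A small self-contained sketch on toy data (make_classification here is only a stand-in for the real data set):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import roc_auc_score

# Toy data purely for illustration
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LinearSVC(penalty='l2', dual=False)
clf.fit(X_train, y_train)

# decision_function returns signed distances to the separating hyperplane;
# AUC only needs a ranking, so these uncalibrated scores are enough here
print(roc_auc_score(y_test, clf.decision_function(X_test)))

Still, if you want actual probabilities (or your metric needs them), read on.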
It took me some time to find out that in such a case I can use sklearn.calibration.CalibratedClassifierCV.
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV

scaler = StandardScaler()
X = scaler.fit_transform(X_raw)        # scale the training features
X_test = scaler.transform(X_test_raw)  # apply the same scaling to the test set

cclf = CalibratedClassifierCV(base_estimator=LinearSVC(penalty='l2', dual=False), cv=5)
cclf.fit(X, y)

# column 1 of predict_proba holds the probability of the positive class
res = cclf.predict_proba(X_test)[:, 1]
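If you also have held-out labels for X_test (call them y_test; not shown above), checking the AUC of these probabilities is a one-liner with sklearn.metrics.roc_auc_score. (Note for readers on newer scikit-learn releases: the base_estimator argument has since been renamed estimator.)

from sklearn.metrics import roc_auc_score

# y_test: held-out labels for X_test (assumed available; not shown above)
print(roc_auc_score(y_test, res))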
Or you can use CalibratedClassifierCV(base_estimator=clf, cv='prefit') if the classifier has already been fitted on the data.
If the classifier was already fitted, the data passed to CalibratedClassifierCV must be disjoint from the data used to fit the classifier.
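A minimal sketch of that prefit workflow, assuming X and y are the scaled features and labels from above (the split variable names here are mine):

from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV

# hold out a separate calibration set, disjoint from the fitting set
X_fit, X_cal, y_fit, y_cal = train_test_split(X, y, test_size=0.2, random_state=0)

clf = LinearSVC(penalty='l2', dual=False)
clf.fit(X_fit, y_fit)                            # fit on one part of the data

cclf = CalibratedClassifierCV(base_estimator=clf, cv='prefit')
cclf.fit(X_cal, y_cal)                           # calibrate on the disjoint part

res = cclf.predict_proba(X_test)[:, 1]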