How to predict_proba with LinearSVC

I once had a problem where I had to find the best solution as measured by the area under the ROC curve (AUC, or AUROC). I had hundreds of features and about 100,000 objects in my training set. It was a classification problem. I had tried gradient boosting, logistic regression, and random forests, and then I decided to try SVM and LinearSVC.

Unsurprisingly, I struggled to fit a model with sklearn.svm.SVC, even with a linear kernel, because training takes far too long on a data set of that size.
Then I tried sklearn.svm.LinearSVC, which is implemented using liblinear instead of libsvm and is a better choice for large data sets. But it doesn’t have a predict_proba method: it can only predict class labels (or return raw decision_function scores), not the probability of belonging to a class.
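
To make the limitation concrete, here is a minimal sketch on synthetic data. The make_classification data set, its sizes, and the variable names are assumptions for illustration, not the data from this post.

```python
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic stand-in data (assumption: not the data set described in this post)
X_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=0)

svc = LinearSVC(dual=False).fit(X_demo, y_demo)

has_proba = hasattr(svc, "predict_proba")   # LinearSVC has no such method
labels = svc.predict(X_demo[:5])            # hard class labels
scores = svc.decision_function(X_demo[:5])  # signed scores, not probabilities
print(has_proba, labels, scores)
```

The scores are unbounded and their sign decides the predicted label. They are enough to rank objects, but they are not probabilities.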

It took me some time to find out that in such a case I can use sklearn.calibration.CalibratedClassifierCV, which wraps the classifier and learns a mapping from its decision scores to probabilities.

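from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV

# X_raw, y: training features and labels
scaler = StandardScaler()
X = scaler.fit_transform(X_raw)
# The estimator is passed positionally: the keyword is `estimator` in recent
# scikit-learn releases and `base_estimator` in older ones
cclf = CalibratedClassifierCV(LinearSVC(penalty='l2', dual=False), cv=5)
cclf.fit(X, y)
# X_test must be transformed with the same fitted scaler (scaler.transform)
res = cclf.predict_proba(X_test)[:, 1]
# an array of probabilities of belonging to class 1 (second column, cclf.classes_[1])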
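
For anyone who wants to run the whole thing end to end, here is a self-contained sketch of the same approach. The data is synthetic (make_classification), and the split sizes and variable names are my assumptions for illustration. Only the CalibratedClassifierCV and LinearSVC usage mirrors the snippet above.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in for X_raw and y (assumption: sizes chosen only for speed)
X_raw, y = make_classification(
    n_samples=2000, n_features=50, n_informative=10,
    n_clusters_per_class=1, random_state=0,
)
X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    X_raw, y, test_size=0.25, stratify=y, random_state=0
)

# Fit the scaler on the training part only, then reuse it for the test part
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train_raw)
X_test = scaler.transform(X_test_raw)

# Estimator passed positionally so the call does not depend on the keyword name
cclf = CalibratedClassifierCV(LinearSVC(penalty="l2", dual=False), cv=5)
cclf.fit(X_train, y_train)

proba = cclf.predict_proba(X_test)  # one column per class, rows sum to 1
res = proba[:, 1]                   # probability of class cclf.classes_[1]
auc = roc_auc_score(y_test, res)
print(f"AUC on the held-out part: {auc:.3f}")
```

With cv=5 the wrapper fits five LinearSVC models on different folds, calibrates each on its held-out fold, and averages their predicted probabilities.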

Or you can use CalibratedClassifierCV(clf, cv='prefit') if the classifier clf has already been fitted.

If the classifier was already fitted, the data passed to CalibratedClassifierCV must be disjoint from the data used to fit the classifier.
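
Here is a rough sketch of the prefit variant with disjoint data. The synthetic data, the three-way split, and all variable names are assumptions for illustration, and scaling is left out for brevity. As far as I know, scikit-learn 1.6 and later deprecate cv='prefit' in favor of wrapping the fitted classifier in sklearn.frozen.FrozenEstimator, so the sketch tries that first and falls back to cv='prefit' on older releases.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic stand-in data (assumption: not the data set described in this post)
X_all, y_all = make_classification(
    n_samples=3000, n_features=20, n_informative=5,
    n_clusters_per_class=1, random_state=0,
)

# Three disjoint parts: fit the classifier, calibrate it, predict on new data
X_fit, X_rest, y_fit, y_rest = train_test_split(
    X_all, y_all, test_size=0.5, random_state=0
)
X_calib, X_new, y_calib, y_new = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0
)

clf = LinearSVC(dual=False).fit(X_fit, y_fit)  # the already fitted classifier

try:
    # Newer releases: freeze the fitted classifier so calibration never refits it
    from sklearn.frozen import FrozenEstimator
    cclf_prefit = CalibratedClassifierCV(FrozenEstimator(clf))
except ImportError:
    # Older releases: the cv="prefit" option
    cclf_prefit = CalibratedClassifierCV(clf, cv="prefit")

# Only the score-to-probability mapping is learned here, on data clf has not seen
cclf_prefit.fit(X_calib, y_calib)
res_prefit = cclf_prefit.predict_proba(X_new)[:, 1]
print(res_prefit[:5])
```

Calibrating on X_fit instead of X_calib would reuse scores the classifier produced on its own training data, which tend to be overconfident.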