Basics of XGBoost in Python
I have spent the past week porting and automating a model training script from R to Python. The R script relied heavily on Extreme Gradient Boosting, so I had an opportunity to take a deeper look at the xgboost Python package. The new Python version of the code was supposed to stay as close as possible to the R script in the methodology of training several different models. I found the documentation of the Python package a little painful to read, so here is a small wrap-up of how to get started with implementing XGBoost in Python.
This is not supposed to be a post on how the algorithm itself works; that has been well covered already, for example in “A Kaggle Master Explains Gradient Boosting” by Ben Gorman.
Installation
If you are on Linux, you can simply type into the command line
sudo apt-get install build-essential
pip install xgboost
Or if you are using Anaconda:
conda install -c anaconda py-xgboost=0.60
Installation on Windows is slightly more difficult, see this tutorial from IBM for example. [EDIT 08/2017 - xgboost is on PyPI now, and at least installation under Anaconda works out-of-the box on Windows now, too!]
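Either way, a quick way to check that the installation worked is to import the package and print its version (just a sanity check):
import xgboost as xgb
# if the import succeeds, xgboost is installed; this prints the installed version
print(xgb.__version__)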
Core XGBoost Library vs. scikit-learn API
Models can be trained in two different ways:
- Directly using the core library - this is closer to the implementation of the caret package in R
- Using the scikit-learn API - this means that the models are implemented in a way that lets the scikit package recognize them as its own models. We will see later how this can be useful.
Train with the core library
import xgboost as xgb
import pandas as pd
df = pd.DataFrame({'X': [3, 4, 6], 'y': [12, 21, 87]})
X = df.drop('y', axis=1)
y = df['y']
data_matrix = xgb.DMatrix(X, y)
params = {'booster': 'gblinear', 'objective': 'reg:linear'}
model = xgb.train(dtrain=data_matrix, params=params)
X_test = pd.DataFrame({'X': [4, 1, 7]})
test_matrix = xgb.DMatrix(X_test)
pred = model.predict(test_matrix)
With xgb.DMatrix(X, y), we convert our training data into a so-called DMatrix, which is the data structure used internally by xgboost. The test data has to be converted in the same way before calling predict(). A model is generated by simply using xgb.train() on the created DMatrix and a set of parameters.
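One thing the snippet above does not set is the number of boosting rounds; xgb.train() defaults to 10. If you want to control it explicitly, it can be passed as num_boost_round - a small sketch, reusing data_matrix and params from above:
# num_boost_round controls how many boosting iterations are run (default: 10)
model = xgb.train(dtrain=data_matrix, params=params, num_boost_round=50)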
The params dictionary is one of the reasons why you might decide to use the core library instead of the scikit API - there are slightly more options to choose from. For example, in the above code I used 'booster': 'gblinear' - this is not possible with the scikit API version, as it doesn’t let you change the booster parameter, and the default is gbtree.
[EDIT 07/2017 - The ‘booster’ parameter can now also be set from the scikit wrapper, so this argument is moot!]
The objective parameter specifies the objective function to use; options include reg:logistic and binary:logistic - see here for a full list of parameters.
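As a purely illustrative sketch (with made-up toy data, not part of the original script), the same core-API workflow with a classification objective looks like this - binary:logistic makes predict() return probabilities:
import xgboost as xgb
import pandas as pd
# made-up toy data for a binary classification example
df_clf = pd.DataFrame({'X': [1, 2, 3, 4], 'y': [0, 0, 1, 1]})
train_matrix = xgb.DMatrix(df_clf.drop('y', axis=1), df_clf['y'])
clf_params = {'booster': 'gbtree', 'objective': 'binary:logistic'}
clf = xgb.train(dtrain=train_matrix, params=clf_params)
# with binary:logistic, predict() returns the probability of the positive class
print(clf.predict(xgb.DMatrix(pd.DataFrame({'X': [2.5]}))))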
Train with the scikit-learn API
import pandas as pd
from xgboost import XGBRegressor
df = pd.DataFrame({'X': [3, 4, 6], 'y': [12, 21, 87]})
X = df.drop('y', axis=1)
y = df['y']
params = {
    'max_depth': 5,
    'n_estimators': 50,
    'objective': 'reg:linear'
}
model = XGBRegressor(**params)
model.fit(X, y)
x_test = pd.DataFrame({'X': [4, 1, 7]})
pred = model.predict(x_test)
As you can see, this version looks much cleaner and more familiar if you already use the scikit package. There is no need to convert your data frame into a DMatrix, and you can use the fit() and predict() methods as usual. There is an XGBRegressor and an XGBClassifier; I used the former to replicate the above example. If you run both code snippets locally, you will find that the results differ because of the different booster that is used. (Change the booster param in the first code snippet to ‘gbtree’ and notice the difference.)
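As mentioned in the edit above, newer xgboost versions also let you set the booster through the scikit wrapper, so the linear base learner is not exclusive to the core library anymore. A minimal sketch, assuming a version where the wrapper already accepts this parameter:
from xgboost import XGBRegressor
# on newer versions, the booster can be passed directly to the scikit wrapper
linear_model = XGBRegressor(booster='gblinear', objective='reg:linear')
linear_model.fit(X, y)  # X and y as defined in the snippet above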
For me, the biggest advantage of the scikit API was that I could automate the search for optimal parameters using RandomizedSearchCV. Let’s see how to do this:
# continuing from the above example, without the final fit and predict steps
from sklearn.model_selection import RandomizedSearchCV
search_params = {
    'max_depth': list(range(2, 11, 1)),
    'n_estimators': list(range(10, 60, 10))
}
search = RandomizedSearchCV(model, search_params, n_iter=20)
search.fit(X, y)
print(search.best_params_)
This will evaluate 20 different parameter combinations, randomly picked from the ranges provided in search_params, and perform cross validation on each of them. The best set of parameters that was found can then be accessed as seen in the last line. If you instantiate RandomizedSearchCV with the parameter refit=True (which is the default), it will also give you the estimator trained on the best set of parameters, so you can reuse it to make predictions.
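With refit=True, the retrained best model is available as best_estimator_; a short sketch, continuing from the search above:
# refit=True (the default) retrains the model on the full data set
# using the best parameter combination found during the search
best_model = search.best_estimator_
x_test = pd.DataFrame({'X': [4, 1, 7]})  # same test data as in the earlier example
print(best_model.predict(x_test))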
Conclusion
Once you get the hang of it, both methods of using XGBoost in Python are quite simple. In my case, I actually needed to use both versions, because I wanted to implement models with both tree-based and linear base learners, which is not possible with the scikit API because it doesn’t let you choose your type of booster. If this were possible, I would definitely prefer the scikit version: I wouldn’t have to bother adding yet another transformation step to turn my data into a DMatrix, and I could also automate the parameter search for the linear model easily using the available scikit tools.