There are many supervised algorithms that can be used to train a classifier. The problem is that it is hard to say in advance which one will be the best fit for your data — there is no golden algorithm that always comes out on top. To train a good predictive model you need to check many candidate algorithms and select the one that fits your data best (based on the selected metric and validation scheme). In this tutorial I will show you how to easily check many algorithms on a credit scoring task with MLJAR.
Get data!
The data I will use is from a past Kaggle competition (link for data). We will download the training dataset (cs-training.csv file), which will be used for model training, and the test dataset (cs-test.csv file), which we will use to compute predictions and submit to Kaggle.
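If you want to peek at the data locally before uploading it, a minimal pandas sketch (assuming both CSV files from the competition page are in your working directory) could look like this:

```python
import pandas as pd

# Assumed local file names, as downloaded from the Kaggle competition page
train = pd.read_csv("cs-training.csv")
test = pd.read_csv("cs-test.csv")

print(train.shape, test.shape)
print(train.columns.tolist())
```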
1 – First, let’s create a project with the Binary Classification task.
2 – Upload the training dataset (cs-training.csv file).
3 – Select the target column. In this analysis SeriousDlqin2yrs will be our target variable. We will train a model to predict this feature. Please remember to set the first column, Unnamed: 0, as Don’t use it — it is an ID column and is not needed for model training. After selecting the usage for all columns, please accept the column usage (green button at the top).
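For reference, here is roughly what this column setup corresponds to in pandas (column names taken from the competition data; the exact handling inside MLJAR may differ):

```python
# Drop the index-like column and separate the target, mirroring the column usage set in the UI
X = train.drop(columns=["Unnamed: 0", "SeriousDlqin2yrs"])
y = train["SeriousDlqin2yrs"]

print(y.value_counts(normalize=True))  # the classes are imbalanced
```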
4 – We are now ready to run a Machine Learning Experiment! Please go to experiments and click Add new experiment. We will use 10-fold CV with shuffle and stratification for model validation (the classes in the dataset are imbalanced, which is why it is good to use stratification and shuffle). There are missing values in the dataset, which will be filled with median values. We will use all available algorithms. The metric that we will optimize is Area Under Curve (AUC). The AUC ranges from 0 to 1, and higher is better. We set a training time limit of 20 minutes for each model (you can set a lower limit if you don’t have enough computational credits). When the experiment setup is done, we are ready to start, so just click the Create & Start button at the bottom and all the machine learning magic will begin.
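MLJAR handles this validation internally, but to give an intuition of what the settings mean, here is a rough scikit-learn equivalent of the validation scheme (a sketch only, reusing X and y from the snippet above; it is not MLJAR’s actual code and the model is just a placeholder):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# 10-fold CV with shuffle and stratification, median imputation of missing values, AUC metric
pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),                 # fill missing values with the median
    ("model", RandomForestClassifier(n_estimators=100, random_state=42)),
])
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X, y, cv=cv, scoring="roc_auc")
print("AUC: %.4f +/- %.4f" % (scores.mean(), scores.std()))
```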
5 – After starting the training you will be redirected to the Results page. At the beginning all models are initialized. We selected the Sport tuning mode, which means that 10 to 15 different hyper-parameter settings will be checked for each ML algorithm. (Hyper-parameters are values that control the training process of an ML algorithm.) You need to wait a while until all models are trained.
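Checking 10 to 15 hyper-parameter settings per algorithm is conceptually similar to a small randomized search. As an illustration only (not MLJAR’s internal tuning logic), a minimal sketch reusing the pipeline and cv objects from the previous snippet could be:

```python
from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV

# Sample a dozen hyper-parameter settings for the placeholder random forest
param_distributions = {
    "model__max_depth": randint(3, 15),
    "model__min_samples_leaf": randint(1, 50),
}
search = RandomizedSearchCV(
    pipeline, param_distributions, n_iter=12, cv=cv, scoring="roc_auc", random_state=42
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```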
6 – You should get results like the ones below:
You can click on each model and check its hyper-parameter values and learning curves. Below is an example result for the Extreme Gradient Boosting algorithm. From the learning curve you can see that it was trained with 250 trees. However, the model state that was saved and will be used for computing predictions has 150 trees. MLJAR detects that performance decreases on the test folds during cross validation and stores only the best performing model (with 150 trees).
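The same idea can be reproduced with the xgboost package’s early stopping: the model is grown for many rounds, but only the best-scoring iteration on held-out data is kept for predictions. A sketch with assumed parameter values (not the exact settings MLJAR used), assuming a recent xgboost version:

```python
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Hold out part of the training data to monitor the validation AUC
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

model = XGBClassifier(n_estimators=250, learning_rate=0.1,
                      eval_metric="auc", early_stopping_rounds=25)
model.fit(X_tr, y_tr, eval_set=[(X_val, y_val)])
print("best iteration:", model.best_iteration)  # predictions use the best iteration, not all 250 trees
```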
You can observe that each algorithm has a different performance. Before running the analysis it is hard to say which one will work well for your data. That’s why it is good to check as many as possible.
7 – To compute predictions on the test data, first upload the cs-test.csv file and select the column usage, in the same way as you did for the training data. Then go to Predict in the menu, select the test data and the model (with the highest score), and click Start Prediction. That’s all; now wait a while until the predictions are computed and they will appear at the bottom. Let’s download the predictions from the Ensemble model.
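Locally, the equivalent step would be scoring the test file with a fitted model, for example with the placeholder pipeline from the earlier sketch:

```python
# The test file has the same columns as the training file (its target column is empty)
X_test = test.drop(columns=["Unnamed: 0", "SeriousDlqin2yrs"])

pipeline.fit(X, y)                                     # train on the full training data
probabilities = pipeline.predict_proba(X_test)[:, 1]   # probability of the positive class
```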
8 – We will use the downloaded predictions to submit to the Kaggle competition page. Before doing this, we need to fix the header in the predictions file. The column names should be: Id, Probability.
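With pandas this header fix is a couple of lines. A sketch assuming the downloaded predictions file is named predictions.csv (a hypothetical name) and has the row id in the first column and the predicted probability in the second:

```python
import pandas as pd

submission = pd.read_csv("predictions.csv")
submission.columns = ["Id", "Probability"]       # header expected by the Kaggle scorer
submission.to_csv("submission.csv", index=False)
```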
9 – OK, we are ready to submit. Below is the score from the Kaggle system. When you compare it with the results on the Private Leaderboard, you will see that the result is very good — it is in the TOP 10! 🙂
In this analysis we used many different machine learning algorithms to train the classifier. As you can see, each algorithm has a different performance (AUC score). To find a good model you need to check many training algorithms, but usually people don’t do this because of lack of time:
- they don’t have time to write code for model training with many algorithms; very often each algorithm has a different interface and a different data format, which makes writing code for each one very time consuming — in mljar.com there is one interface for many algorithms
- they don’t have time to wait for many algorithms to train; usually models are trained sequentially. For example, if training each model takes 30 minutes, then to check 100 models you need 50 hours (2 days and 2 hours) — in mljar.com all computations are distributed in the cloud; to speed up model training we launch up to 12 machines for you, so training 100 models (30 minutes each) can be done on mljar in about 4.5 hours!
- one more thing: when you check 100 models sequentially, are you sure that you save all the results and will be able to go back to them later? In mljar all your results are saved, so you can check them at any time.
Author bio
Piotr Płoński is a founder of MLJAR – a Machine Learning in the browser service that makes ML super easy. He is also an assistant professor at the Warsaw University of Technology, where he applies machine learning methods to analyze data from high-energy physics experiments.