A contribution from Piotr Płoński, founder of MLJAR.
Predicting employee attrition with machine learning can greatly improve human resources management: it saves money and helps retain good employees, provided action is taken in time. This article presents an example of how to predict employee attrition with machine learning.
Analytic methods can improve human resources (HR) management in companies with a large number of employees. It is easy to illustrate how a company can benefit from machine learning applied to HR. Assume that training a new employee costs $1000. If we can predict which employee is going to leave next month, and offer him or her a $500 bonus program to stay for the next six months, we save $500 and keep an experienced, well-trained employee with higher morale.
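The arithmetic is trivial, but spelling it out makes the later cost discussion easier to follow. The figures below are the illustrative costs from the example above, not real data.

```python
# Back-of-the-envelope savings from the example above.
# Both figures are illustrative assumptions, not real HR data.
training_cost = 1000   # assumed cost of training a replacement employee ($)
retention_bonus = 500  # assumed cost of the retention bonus program ($)

savings_per_retained_employee = training_cost - retention_bonus
print(f"Net saving per correctly retained employee: ${savings_per_retained_employee}")
```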
In this article, I would like to present how to predict employee attrition with machine learning. For the analysis I will use a data set created by IBM data scientists, which is available here. However, I will split it into train and test samples to better explain how machine learning methods can be applied to this problem; the split data is available on my GitHub. The training set represents historical data about employees. Each sample describes an employee with attributes such as age, department, distance from home, marital status, income, and years at the company; you can check the full list of descriptors here. For each employee in the training set the attrition is known (it is a historical value). In the test data the employee descriptors are available, but the attrition is unknown, and we want to predict (compute) it with our machine learning model. (To be honest, the attrition values in the test data are available, but for the sake of explanation let's assume that they are missing.)
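The exact split used in this article is the one published on GitHub. For readers who prefer to reproduce a similar split locally, a minimal sketch with pandas and scikit-learn could look like the one below; the CSV file name, the random seed, and the test size are assumptions.

```python
# Sketch: reproduce a similar stratified train/test split of the IBM HR data.
# The file name, seed, and test size are assumptions; the split used in this
# article is the one published on GitHub.
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv("WA_Fn-UseC_-HR-Employee-Attrition.csv")

train, test = train_test_split(
    data,
    test_size=270,               # leaves roughly 1200 rows for training
    stratify=data["Attrition"],  # keep the Yes/No ratio in both parts
    random_state=42,
)
train.to_csv("HR_train.csv", index=False)
test.to_csv("HR_test.csv", index=False)
```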
There are 1200 samples (employees) in the train data. For model training we will use MLJAR, which allows building machine learning models in the browser. Details of the setup used (a rough local equivalent is sketched after the list):
- 5-fold stratified cross validation
- as training algorithms we use Extreme Gradient Boosting, LightGBM and Random Forest; additionally, an ensemble is constructed from all models
- for model evaluation we use the Area Under the ROC Curve (the higher, the better)
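MLJAR configures all of this in the browser. For readers who want a rough local equivalent, the sketch below uses scikit-learn, XGBoost and LightGBM; the one-hot encoding and all hyperparameters are assumptions and will not match MLJAR's internal settings exactly.

```python
# Sketch of a rough local equivalent of the training setup: 5-fold stratified
# cross validation, three algorithms, evaluated with Area Under the ROC Curve.
# Preprocessing and hyperparameters are assumptions, not the MLJAR defaults.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

train = pd.read_csv("HR_train.csv")
y = (train["Attrition"] == "Yes").astype(int)
# One-hot encode the categorical descriptors and drop column names to keep
# the same matrix usable by all three libraries.
X = pd.get_dummies(train.drop(columns=["Attrition"]), dtype=float).to_numpy()

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
models = {
    "Extreme Gradient Boosting": XGBClassifier(eval_metric="logloss"),
    "LightGBM": LGBMClassifier(),
    "Random Forest": RandomForestClassifier(n_estimators=500, random_state=42),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    print(f"{name}: mean AUC = {scores.mean():.3f}")
```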
After model training we can easily compare the different ML algorithms and use the model with the highest score; in our case it is the Ensemble model.
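MLJAR builds the ensemble automatically, and its exact construction is not shown here. One common approach, averaging the out-of-fold probabilities of the individual models, is sketched below as an assumption; it continues from the previous sketch (reusing `X`, `y`, `cv` and `models`).

```python
# Sketch: a simple probability-averaging ensemble over the three models.
# This scheme is an assumption for illustration; it is not necessarily how
# MLJAR builds its ensemble. Continues from the previous sketch.
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

out_of_fold = [
    cross_val_predict(model, X, y, cv=cv, method="predict_proba")[:, 1]
    for model in models.values()
]
ensemble_probability = np.mean(out_of_fold, axis=0)
print(f"Averaged-ensemble AUC: {roc_auc_score(y, ensemble_probability):.3f}")
```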
Before computing predictions, let's check the feature importance of the models. The top 10 most important features for a single model (Extreme Gradient Boosting) are presented below:
From the feature importances we get the insight that business travel frequency, hourly rate, distance from home and job involvement are the most important factors in the attrition prediction.
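The ranking above comes from the MLJAR interface. A comparable ranking can be read out of a locally trained XGBoost model; the sketch below reuses the preprocessing assumptions from the earlier sketches and is not the exact chart shown.

```python
# Sketch: rank descriptors by importance from a single XGBoost model.
# Preprocessing mirrors the earlier sketches and is an assumption.
import pandas as pd
from xgboost import XGBClassifier

train = pd.read_csv("HR_train.csv")
y = (train["Attrition"] == "Yes").astype(int)
X = pd.get_dummies(train.drop(columns=["Attrition"]), dtype=float)

model = XGBClassifier(eval_metric="logloss").fit(X, y)
importance = pd.Series(model.feature_importances_, index=X.columns)
print(importance.sort_values(ascending=False).head(10))
```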
The test samples, which were not used for training, can be used to compute the probability of attrition. Because the ground-truth labels are known for the test samples, we can compare the predicted values with the true labels. The picture below presents the top and bottom rows of the table sorted by descending prediction value.
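Locally, scoring the held-out employees and sorting them by predicted risk could look like the sketch below; it continues from the feature-importance sketch, and the test file name is an assumption.

```python
# Sketch: score the held-out employees and sort by predicted attrition risk.
# Continues from the previous sketch; the test file name is an assumption.
test = pd.read_csv("HR_test.csv")
X_test = pd.get_dummies(test.drop(columns=["Attrition"]), dtype=float)
X_test = X_test.reindex(columns=X.columns, fill_value=0.0)  # align one-hot columns

ranked = test.assign(attrition_probability=model.predict_proba(X_test)[:, 1])
ranked = ranked.sort_values("attrition_probability", ascending=False)
cols = ["EmployeeNumber", "attrition_probability", "Attrition"]
print(ranked[cols].head(10))  # employees most likely to leave
print(ranked[cols].tail(10))  # employees least likely to leave
```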
From that table, you can see that action should be taken for employees with high predicted values, because otherwise they are likely to leave the company in the near future. A natural question is how many employees from the top of the table we should take care of. There is no clear answer, because it depends on the hire vs. keep costs.
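One simple way to reason about the cutoff is an expected-value rule built from those two costs. The sketch below uses the illustrative figures from the introduction and assumes, optimistically, that the retention offer always works; it continues from the ranked predictions above.

```python
# Sketch: an expected-value rule for choosing how many employees to target.
# The costs are the illustrative figures from the introduction, and the rule
# assumes the retention offer always works - both are simplifying assumptions.
training_cost = 1000   # assumed cost of hiring and training a replacement ($)
retention_bonus = 500  # assumed cost of the retention program ($)

# Offering the bonus pays off in expectation when
#   p * training_cost > retention_bonus,  i.e.  p > retention_bonus / training_cost
threshold = retention_bonus / training_cost
to_contact = ranked[ranked["attrition_probability"] > threshold]
print(f"Offer the retention program to {len(to_contact)} employees "
      f"(predicted probability above {threshold:.2f}).")
```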
This article presented how a machine learning model can be applied to predicting employee attrition. The problem of data collection and cleaning, which is crucial for building a good machine learning model, was not considered. Once the data is collected and cleaned, the machine learning part can easily be done with MLJAR. The detailed steps of the analysis were described in this article. All models trained in this article are available at this link.
Author bio
Piotr Płoński is the founder of MLJAR, a machine-learning-in-the-browser service that makes ML super easy. He is also an assistant professor at Warsaw University of Technology, where he applies machine learning methods to analyze data from high energy physics experiments.