Predicting Churning Customers Using CRISP-DM Methodology
This project was developed to identify customer churn. The motivation was to analyze patterns, trends and predictions extracted from the data, using machine learning models capable of identifying a significant decrease in customers’ use of services and products. Its importance lies precisely in giving companies the means to detect a drop in consumption of their business, act to identify the problems that caused it, and explore new possibilities to retain customers and attract new ones with differentiated services and new products.
In the development of the project, the 6 stages of the CRISP-DM methodology were used, precisely because it is a well-consolidated methodology in the data mining field, with a completeness of topics to be developed and explored that makes it possible to build data solutions. All the developed code can be accessed here. For this project I used the Customer Churn data set from the Kaggle website, which can be found here.
Below, the work performed in each phase of the CRISP-DM methodology is presented:
Phase 1: Understanding the Business:
With the data set loaded in the Jupyter notebook, I started the project with phase 1, Understanding the Business. In this step, the type of problem to be solved is identified, in this case predicting customer churn. At this stage the company’s background is also established, to define how the project will be conducted up to its solution, along with the main objectives the project aims to achieve and the success criteria, that is, the metrics that will be used to verify whether the project reached the objectives established for its delivery.
For this project, the background was defined as the development of a case study in a Jupyter notebook written in Python. The objective of the project was to answer the questions below:
- Is it possible to achieve a hit rate above 80% in the customer’s churn forecast?
- What is the level of correlation between the variables age, year and total_pucharse?
- Can the developed solution be applied in a way that benefits the business?
Metrics such as the confusion matrix, precision, recall, F1-score and the ROC curve were used to analyze the project’s success criteria.
Phase 2: Understanding the Data
Moving on to phase 2, Understanding the Data, the work consisted of collecting the data, describing it with statistics, performing exploratory analysis and verifying data quality, so that the data makes sense for the solution the project intends to deliver. It is important to note that data is a decisive factor for the success of the project: if poor-quality data is used, we will generate a poor-quality solution. Therefore, this step is extremely important to help us understand the data we will feed into our solution. According to several authors, articles and books, this phase accounts for roughly 70% of the project effort, given its importance.
So, for this project, I studied the entire data set, identifying variables, values, data types and the number of unique values per variable; describing the data set with statistics; checking the distribution of the target variable; visualizing the correlation between the variables to identify their impact on the models’ predictions; and generating an analysis report for easy verification of the information and for sharing with stakeholders in a corporate environment.
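As a minimal sketch of that profiling step, the snippet below runs the same checks (types, unique counts, descriptive statistics, target distribution and correlation) on a tiny hypothetical sample; the column names are assumptions for illustration, not the exact Kaggle schema.

```python
import pandas as pd

# Hypothetical sample standing in for the Kaggle Customer Churn data
df = pd.DataFrame({
    "age": [34, 45, 23, 52, 40, 29],
    "years": [3, 7, 1, 10, 5, 2],
    "total_pucharse": [120.5, 340.0, 50.2, 610.9, 280.4, 95.7],
    "churn": [0, 1, 0, 1, 0, 0],
})

# Basic profiling: data types, unique counts and descriptive statistics
print(df.dtypes)
print(df.nunique())
print(df.describe())

# Distribution of the target and correlation between the numeric features
print(df["churn"].value_counts(normalize=True))
corr = df[["age", "years", "total_pucharse"]].corr()
print(corr)
```

On the real data set, `corr` is what feeds the seaborn heatmap mentioned later in the post.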
Phase 3: Preparing the Data
In phase 3, in which the data is prepared, four sub-steps were performed: data selection, data cleaning, data construction and data integration. Starting with data selection, I checked for outliers and whether all rows and columns were candidates for use in the model. Still in this first sub-step, I separated the independent variables from the dependent variable. In the data-cleaning sub-step, I checked for null values, which this data set does not contain, and performed the necessary transformations (encoding) of categorical data to prepare it for the models. Still in this step, I split the data set into training and test sets for training, testing and validating model performance.
For this project, there was no need to build new variables or to integrate data from different sources, although these steps were made explicit for learning purposes.
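The preparation steps above can be sketched as follows, again on hypothetical columns (the `plan` categorical column and the 25% split ratio are illustrative assumptions, not values from the original notebook):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical frame with one categorical column to illustrate encoding
df = pd.DataFrame({
    "age": [34, 45, 23, 52, 40, 29, 61, 38],
    "total_pucharse": [120.5, 340.0, 50.2, 610.9, 280.4, 95.7, 720.1, 210.3],
    "plan": ["basic", "premium", "basic", "premium",
             "basic", "basic", "premium", "basic"],
    "churn": [0, 1, 0, 1, 0, 0, 1, 0],
})

# Null check (this toy set, like the original, has none)
assert df.isnull().sum().sum() == 0

# Encode the categorical column and separate features from the target
X = pd.get_dummies(df.drop(columns="churn"), columns=["plan"])
y = df["churn"]

# Hold out 25% of the rows for testing, keeping the churn ratio (stratify)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)
```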
Phase 4: Modeling
In phase 4, Modeling, since the data set has a labeled target variable, which characterizes a supervised learning problem, I applied 4 models to meet the objectives and compare their performance. The models applied here were: Logistic Regression, Naive Bayes, Random Forest and Decision Tree.
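Fitting those four model families side by side can be sketched as below; synthetic data from `make_classification` stands in for the churn set, so the resulting scores are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-classification data standing in for the churn set
X, y = make_classification(n_samples=500, n_features=6, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# The four supervised models used in the project
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "naive_bayes": GaussianNB(),
    "random_forest": RandomForestClassifier(random_state=42),
    "decision_tree": DecisionTreeClassifier(random_state=42),
}

# Fit each model and record its hit rate on the held-out test set
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print(f"{name}: {scores[name]:.2f}")
```

Keeping the models in a dictionary makes it easy to evaluate all of them with the same loop in the next phase.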
Phase 5: Evaluation
In phase 5 of CRISP-DM, I evaluated the models’ performance by computing the hit and error rates of each one, generating the confusion matrix, identifying the most important variables for each model (this step can only be done after the model is fitted, which is why it appears here), and producing metrics such as precision, recall, F1-score and support, as well as the ROC curve of each model. The Decision Tree model showed a higher hit rate than the other models (92%) on the data used.
It is worth mentioning that other techniques could have been used, such as feature engineering, creation of dummy variables, a deeper exploratory analysis of the data, or tuning the model parameters for a better fit. It can also be observed that the Naive Bayes model presented a high accuracy rate (0.92, or 92%), although a higher hit rate was found in the confusion matrix of the Decision Tree model. The ROC curve is another tool widely used with binary classifiers. The dotted diagonal line on the generated graph represents a purely random classifier; a good classifier stays as far from that line as possible, toward the upper-left corner. Thus, the ROC curve found for the Naive Bayes model indicates that, among the applied models, Naive Bayes is the best model in this approach.
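The evaluation described above (confusion matrix, precision/recall/F1 report and ROC) can be sketched for one model like this; as before, synthetic data replaces the real churn set, so the numbers are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the churn data
X, y = make_classification(n_samples=500, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = GaussianNB().fit(X_train, y_train)
y_pred = clf.predict(X_test)
y_proba = clf.predict_proba(X_test)[:, 1]  # probability of the churn class

# Confusion matrix plus precision, recall, F1-score and support per class
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(classification_report(y_test, y_pred))

# Area under the ROC curve summarizes how far the classifier sits
# from the random-baseline diagonal
auc = roc_auc_score(y_test, y_proba)
print(f"ROC AUC: {auc:.2f}")
```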
From this phase, it can be concluded that the objectives were met: we achieved performance above 80% and found that the variables year, age and total_pucharse have low correlation and do not introduce a negative or biased impact on the models used.
Phase 6: Deployment
The last phase, phase 6, Deployment, is where the implementation is delivered. It can take several forms, for example creating an application for data consumption by stakeholders, or making the model available by persisting the trained model with Python’s Pickle library.
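A minimal sketch of that persistence route with Pickle is shown below; the model, the synthetic training data and the file name `churn_model.pkl` are all illustrative assumptions.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Train a stand-in model on synthetic data
X, y = make_classification(n_samples=200, n_features=4, random_state=1)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Persist the trained model to disk, then reload it as a consumer would
path = os.path.join(tempfile.mkdtemp(), "churn_model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)
with open(path, "rb") as f:
    restored = pickle.load(f)

# The restored model predicts identically to the original
assert np.array_equal(model.predict(X), restored.predict(X))
```

An application or API can then load the `.pkl` file and serve predictions without retraining.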
For this project, only the case study itself was developed. Making the model available for consumption will be addressed as a future improvement to this project.
Despite the immense value created by developing Python code to handle the data according to the purpose of each stage of the CRISP-DM methodology, the greatest value is answering the questions defined at the beginning of the project, so that they can generate value for decision making. The questions and their respective answers can be checked below:
- Is it possible to achieve a hit rate above 80% in the customer churn forecast? Following the steps of the CRISP-DM methodology made it possible to carry out an organized, incremental process for developing a data solution. Through these steps, 4 different models were trained and tested, with hit rates between 83% and 92%.
- What is the level of correlation between the variables age, year and total_pucharse? In step 2 of CRISP-DM, the correlation between the variables was computed and a heatmap was generated with the seaborn library, presenting a map of colors and values for the relationships between the variables used. From this reading, it was possible to identify a low correlation between the variables used in the model. It is important to note that this pattern tends to benefit machine learning models, since there is no skewed data.
- Can the developed solution be applied in a way that benefits the business? The generated data solution can be applied to the real world, since non-biased data were used and high hit rates were found, according to step 5 of CRISP-DM, where the performance of each generated model was evaluated. One point of attention: for this project, the objective was defined as reaching a rate above 80%, so achieving that rate reflects this project’s specific success criterion. However, each company and business has its own rates that reflect its success criteria. A rate of 80% can be great for one company and bad for another; it is always necessary to evaluate the company’s scenario as a whole.
Finally, for reflection purposes only, applying the data to different models is a good practice for evaluating different possibilities and choosing the one with the best accuracy for a forecast. There are several machine-learning libraries that can be applied in this context and expand the options for checking forecasts. In this case, the most suitable model, given its performance, was Naive Bayes.
I have a passion for working with data and for helping companies generate value from its structured analysis. Data engineering and data science, as well as other professions along the data pipeline, involve skills, concepts and solutions that are of great interest to me. Let’s keep in touch: you can find me on LinkedIn by clicking here and see my projects on GitHub by clicking here.
“Data is a highly valued asset that provides a competitive advantage in the business world.”