“There are two kinds of people in this world…”
These are words we’ve all heard before, typically from an elder relative trying to make sense of some bigger picture for you. As clichéd as the statement is, demographic information (identifiers that tell us something about the members of a population) is often split into just a few distinct buckets, allowing populations to be segmented. Gender, car ownership, job industry, being a parent or not, current employment status: these are just a few examples of the myriad data points that can be used to segment groups of people. The choices people make over the course of their lives deposit them into a number of buckets defined by the data scientist, who is interested in understanding their patterns and habits in the hopes of retaining them as customers, delivering them relevant product ads, or targeting them with the most beneficial information. As people grow and change, the buckets they fall into, and how they are categorized, will change.
Data scientists are often curious about how the various categorizations (the independent demographic features of their observed sample) relate to the dependent target variable. Do people with siblings tend to be dog owners more than cat owners? Do people with different types of Master’s degrees have different preferences for vacation destinations, or is the observed number of MFAs vs MBAs flocking to warmer climates in the winter similar? Questions like these can be answered using the Pearson Chi-squared test.
The Pearson Chi-squared test (often just “Chi-squared”) is a statistical hypothesis test used to determine if there is a significant difference between the observed and expected distributions in one or more categories of a contingency table.
Contingency tables are frequency distribution tables used to summarize the relationship between two or more categorical variables. If the variables are independent, each row (or population) may hold a different number of observations, but the proportions across the columns (or groups) should be similar from row to row. A key assumption of the Chi-squared test is that each observation (for example, each participant) contributes data to only one cell in the contingency table. For example, if we had 50 participants in a study where they each tried to solve 2 crossword puzzles, it would not be appropriate to conduct a Chi-squared test on a frequency table of solved versus unsolved attempts per puzzle, because each participant would be contributing to two cells.
If we were studying 100 participants who were randomly assigned puzzle 1 or 2, we could then use Chi-squared to determine whether there was independence between puzzle assignment and solving rate.
Other rules of thumb for the Chi-squared test are that the total number of observations in the contingency table should be greater than 20. Also, if the expected count of a cell in the contingency table is fewer than 5, the Yates correction for continuity should be used (the correction applies to 2x2 tables, where there is one degree of freedom). The Chi-squared test is used for categorical data but can be applied to a continuous variable after the data are binned.
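As a rough sketch of the binning step, the snippet below cuts a continuous column into bands with pandas and then counts each band against a binary outcome. The column names (tenure_months, churned), the bin edges, and all of the values are made up for illustration, and the sample is far too small for a real test; it only shows the mechanics.

```python
import pandas as pd

# Hypothetical data: a continuous variable and a binary outcome.
df = pd.DataFrame({
    "tenure_months": [1, 5, 12, 24, 36, 48, 60, 72],
    "churned":       [1, 1, 1, 0, 0, 1, 0, 0],
})

# Bin the continuous variable. Intervals are right-inclusive by default:
# (0, 12], (12, 48], (48, 72]. The bin edges are an arbitrary choice.
df["tenure_band"] = pd.cut(
    df["tenure_months"],
    bins=[0, 12, 48, 72],
    labels=["0-12", "13-48", "49-72"],
)

# Each row of df lands in exactly one cell of the contingency table.
table = pd.crosstab(df["tenure_band"], df["churned"])
print(table)
```

The choice of bin edges changes the table, and therefore the test result, so it is worth picking them before looking at the outcome.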
Pearson’s Chi Squared Test
The Pearson Chi-squared test calculates a test statistic and p-value that determine whether there is evidence to reject the null hypothesis that there is no difference in the observed and expected frequencies of two or more variables. If evidence is found to reject the null — the observed and expected frequencies of the variables are not similar — then there is evidence that the variables are dependent on one another.
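For reference, the test statistic is the standard Pearson form. The notation is my own shorthand rather than anything defined above: O_i is the observed count in cell i and E_i is the expected count in the same cell under the null hypothesis. The second line shows the Yates-corrected variant.

```latex
\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}
\qquad\qquad
\chi^2_{\text{Yates}} = \sum_{i} \frac{\left(\lvert O_i - E_i \rvert - 0.5\right)^2}{E_i}
```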
To run the Chi-squared test in Python, use the chi2_contingency function in the scipy.stats module. The function takes in a contingency table (observed), a flag for the Yates correction (correction), and a specification of which statistic the test should calculate (lambda_). It outputs the test statistic, the p-value, the degrees of freedom, and a table of the expected values of the distribution.
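Here is a minimal sketch of that call, using the randomly assigned puzzle study from earlier. The counts are hypothetical (I made them up so the example runs); only the function and its parameters come from SciPy.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts: 100 participants, 50 randomly assigned to each puzzle.
# Rows: puzzle 1, puzzle 2. Columns: solved, not solved.
observed = np.array([[30, 20],
                     [20, 30]])

# correction=True applies the Yates correction (only relevant for 2x2 tables);
# lambda_=None keeps the default Pearson statistic.
stat, p_value, dof, expected = chi2_contingency(
    observed, correction=True, lambda_=None
)

print(f"statistic={stat:.3f}, p-value={p_value:.4f}, dof={dof}")
print(expected)
```

With these made-up counts every expected cell is 25, because both puzzles have 50 participants and half of all participants solved their puzzle.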
Interpreting the output is relatively straightforward: if the test statistic is greater than or equal to the critical value, then we have evidence to reject the null hypothesis. For reference, the critical value for a chi-squared distribution can be calculated in Python using the scipy.stats.chi2.ppf (percent point function) method by passing in the probability 0.95 (corresponding to a 5% significance level) and the degrees of freedom. DF can be calculated using the formula:
DF = (n_rows - 1) * (n_columns - 1)
where n_rows and n_columns are the number of rows and columns of the contingency table.
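The DF formula and the critical value lookup can be sketched in a few lines. The helper name critical_value and the default alpha of 0.05 are my own choices for illustration.

```python
from scipy.stats import chi2


def critical_value(n_rows: int, n_columns: int, alpha: float = 0.05) -> float:
    """Critical value of the chi-squared distribution for an r x c table."""
    dof = (n_rows - 1) * (n_columns - 1)
    # ppf is the inverse of the CDF, so 1 - alpha = 0.95 for a 5% test.
    return float(chi2.ppf(1 - alpha, dof))


print(round(critical_value(2, 2), 2))  # 2x2 table, DF = 1
print(round(critical_value(2, 4), 2))  # 2x4 table, DF = 3
```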
The test results can also be interpreted using the p-value — if the calculated p-value is less than or equal to the selected level of significance (alpha), there is evidence to reject the null hypothesis. If p is greater than alpha, the null cannot be rejected with the selected samples.
Example: ML classification for Customer segmentation
I’ll now provide a brief example of using Chi-squared to reduce features in a data set prior to determining feature importance and fitting a model. Customer demographics can be observed in this data set from a Kaggle competition containing information about customers of a telecommunications company. Many of the identifying customer data were categorical — things like whether or not the customer had multiple phone lines or whether they had dependents. The goal of the project was to classify customers who churned — whether or not they terminated their service in the last month.
After removing a few features (due to multicollinearity) during the data cleaning phase, I still had 15 features, which was pushing towards the higher side for the K-Prototypes cluster analysis I wanted to conduct. While I plan to use PCA to reduce features, using Chi-squared in advance could potentially whittle down the list first, with the aim of my feature selection being to select features that the response was highly dependent on.
After conducting quick visual analysis of the count distribution of the variables, I was ready to use the Chi-square method to dig deeper. Two examples are as follows:
I first compared frequency distribution for the gender category against churn.
It appears that the numbers of males and females who had a Churn value of 0 (did not churn) and 1 (did churn) were approximately equal. It is not likely that we will find evidence to reject the null hypothesis, but we explore it anyway. The crosstab function in pandas allows a user to generate a contingency table from two features of a data frame. After generating the table, I can use the chi2_contingency test:
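A minimal sketch of that step follows, using a small made-up data frame in place of the Kaggle data. The column names (gender, Churn) and all of the counts are assumptions for illustration, so the numbers it prints will not match the outputs listed in this article.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical stand-in for the telecom data: 50 customers per gender.
df = pd.DataFrame({
    "gender": ["Female"] * 50 + ["Male"] * 50,
    "Churn":  [0] * 36 + [1] * 14 + [0] * 38 + [1] * 12,
})

# Rows: Churn (0, 1). Columns: gender (Female, Male).
table = pd.crosstab(df["Churn"], df["gender"])
print(table)

# The table is 2x2, so the Yates correction is applied when correction=True.
stat, p_value, dof, expected = chi2_contingency(table, correction=True)
print(f"statistic={stat:.3f}, p-value={p_value:.4f}, dof={dof}")
print(expected)
```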
Again, outputs are as follows:
- Test statistic — 0.475
- P-value — 0.4904
- Degrees of Freedom — 1
- Expected values — [[2,557.27, 2,605.73],[ 925.73, 943.27]]
The values in the expected values table are calculated using the formula:
E = n * p
where n is the sample size and p is the joint probability under independence, P(A ∩ B) = P(A) * P(B).
For females who did not churn (0), we have
E = 7,032 * ((3,483/7,032) * (5,163/7,032)) = 7,032 * 0.363662 ≈ 2,557.27
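The same arithmetic can be applied to every cell at once. This sketch uses only the marginal totals quoted in the worked example (7,032 customers, 3,483 females, 5,163 who did not churn); the male and churned totals are derived by subtraction, and the variable names are mine.

```python
import numpy as np

n = 7032
gender_totals = np.array([3483, n - 3483])  # Female, Male
churn_totals = np.array([5163, n - 5163])   # Churn = 0, Churn = 1

# E = n * P(A) * P(B), which simplifies to (row total * column total) / n.
# Rows follow churn, columns follow gender, as in the expected values output.
expected = np.outer(churn_totals, gender_totals) / n
print(np.round(expected, 2))
```

Rounded to two decimals, this reproduces the expected values table reported above.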
Calling chi2.ppf(0.95, 1), that is, a probability of 0.95 and DF = 1, we get a critical value of 3.84. The test statistic (0.475) is lower than the critical value and the p-value (0.4904) is greater than alpha (0.05), meaning we do not have evidence to reject the null hypothesis. The data are consistent with gender and churn being independent of one another.
A second feature I considered was the customers’ Payment Method. Customers could pay by one of four methods — electronic check, mailed check, bank transfer, or credit card. The countplot against churn was:
At first glance, it appears that customers who did not churn had similar frequency for all four methods while the customers who left were using electronic checks at about 3–4 times the rate as other payment methods. This warrants additional investigation:
- Test statistic — 645.43
- P-value — 1.43 * 10^-139
- DF — 3
- Expected values — [[1,132.16, 1,116.74, 1,736.42, 1,177.68], [ 409.84, 404.26, 628.58, 426.32]]
As we can see, there are large differences between the values in the generated crosstab and the expected values table. The test statistic (645.43) far exceeds the critical value for DF = 3 (about 7.81 at alpha = 0.05), and the p-value is far below alpha, providing evidence to reject the null hypothesis and conclude that payment method and churn are dependent.
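To make both comparisons explicit, here is a short sketch that applies the critical value rule to the two test statistics reported in this article. The alpha of 0.05 and the variable names are my own choices, and the p-values are recomputed from the rounded statistics, so they may differ slightly from the reported ones.

```python
from scipy.stats import chi2

ALPHA = 0.05

# (test statistic, degrees of freedom) as reported for each feature.
reported = {
    "gender": (0.475, 1),
    "payment method": (645.43, 3),
}

decisions = {}
for feature, (stat, dof) in reported.items():
    critical = chi2.ppf(1 - ALPHA, dof)  # critical value rule
    p_value = chi2.sf(stat, dof)         # upper tail probability
    decisions[feature] = bool(stat >= critical)
    verdict = "reject" if decisions[feature] else "fail to reject"
    print(f"{feature}: stat={stat}, critical={critical:.2f}, "
          f"p={p_value:.3g} -> {verdict} the null")
```

The critical value rule and the p-value rule always agree, since both are read off the same chi-squared distribution.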
The takeaway is that I will not include gender in the data I consider at later stages in my modeling process, but will be sure to include payment method because we learn something from the dependence between payment method and churn.