Analysis of Customer Segmentation with RFM

by Jocelyn Miao

Dashboard

Dashboard Screenshot

Overview

With more stores going online and tracking their data, how can companies keep track of their customers, and better yet, retain them? By using several methods of customer groupings, we can target and specialize tacticts on specific customers. I will be visualizing RFM, KMeans, and CLV models for customer segmentation in this project.

The dataset I will be using is UCI Online Retail, which “contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail.” Here is a brief description of the dataset from UCI:

Variable Name	Role	Description
`InvoiceNo`	ID	A 6-digit integral number uniquely assigned to each transaction. If this code starts with letter ‘c’, it indicates a cancellation
`StockCode`	ID	Categorical a 5-digit integral number uniquely assigned to each distinct product
`Description`	Feature	Categorical product name
`Quantity`	Feature	Integer the quantities of each product (item) per transaction
`InvoiceDate`	Feature	Date the day and time when each transaction was generated
`UnitPrice`	Feature	Continuous product price per unit (sterling)
`CustomerID`	Feature	Categorical a 5-digit integral number uniquely assigned to each customer
`Country`	Feature	Categorical the name of the country where each customer resides

Data Cleaning

To clean the dataset, columns with null values in CustomerID will be dropped, since there is no way to recover lost IDs that are needed for analysis. Duplicate rows and rows that correspond to the same transaction and refund were dropped. Rows with values 'POST', DOT, and CRUK in the StockCode column are used for postage or donations and do not contribute to sales data, so they were dropped. One thing to note is the StockCode M represents manual help, which could represent an item, refund, discount, or more (unknown), so I did not remove these rows. I removed leftover invoices starting with “C” to remove refund only transactions. Lastly, I created a column TotalSale that uses the quantity and price to describe the overall value of the transaction.

InvoiceNo	StockCode	Description	Quantity	InvoiceDate	UnitPrice	CustomerID	Country	TotalSale
536365	85123A	WHITE HANGING HEART T-LIGHT HOLDER	6	2010-12-01 08:26:00	2.55	17850	United Kingdom	15.3
536365	71053	WHITE METAL LANTERN	6	2010-12-01 08:26:00	3.39	17850	United Kingdom	20.34
536365	84406B	CREAM CUPID HEARTS COAT HANGER	8	2010-12-01 08:26:00	2.75	17850	United Kingdom	22
536365	84029G	KNITTED UNION FLAG HOT WATER BOTTLE	6	2010-12-01 08:26:00	3.39	17850	United Kingdom	20.34
536365	84029E	RED WOOLLY HOTTIE WHITE HEART.	6	2010-12-01 08:26:00	3.39	17850	United Kingdom	20.34

Exploratory Analysis

What countries are users purchasing from? Outside of the United Kingdom being the majority, the dataset has users from several countries, with many from Germany and France.

In which month did customers make the most transactions? The month of November.

RFM Grouping

Now we can move on to calculating RFM scores for customers and grouping them for different sales tactics. I will be aggregating the dataframe to make columns for Recency (time since the customer last made a purchase), Frequency (how many transactions they make), and Monetary (how much total they have spent) values. Then we can convert this to a quartile rank from 1-5 as the actual RFM Segment. Below is the head of the new dataframe:

CustomerID	Frequency	Recency	Monetary	R	F	M	RFM_Segment	Loyal
12346	1	0	77183.6	5	1	5	515	False
12347	7	365	615.714	1	5	5	155	False
12348	4	283	308.91	1	4	3	143	False
12349	1	0	1457.55	5	1	5	515	False
12350	1	0	294.4	5	1	3	513	False

I will categorize customers with a score of 333 or higher to be Loyal, and customers under that to be Not Loyal. Below is the average recency, frequency, and monetary values of each group.

Loyal	Recency	Frequency	Monetary
False	137.585	3.98186	437.308
True	81.9403	2.96269	544.037

To create a model to predict loyalty, I chose a decision tree classifier model using the sklearn library. My model found an accuracy of 98% on determining if a customer would be loyal based off the recency, frequency, and monetary values of a customer.

Decision Tree

KMeans

Another method of clustering, KMeans, can be used to classify these users. I will use the Recency, Frequency, and Monetary values previously calculated and scale it, then use the elbow test to determine the number of clusters.

Here, around 4 clusters is the elbow point. Next we can make an actual model with 4 clusters, and add it to the dataframe for each customer. The dataframe below shows the mean for each category for each group.

Cluster	Recency	Frequency	Monetary
0	27.508	1.61218	395.742
1	352.211	28.7719	793.778
2	102.5	1.5	80709.9
3	265.28	5.45686	414.016

The groups are:

0: High frequency customers: promote higher monetary sales
1: New customers: entice with limited-time discounts
2: High spending customers: reward with wholesale discounts
3: Average customers: promote with points rewards

The difference between RFM and KMeans is we have much more distinguished groups based on distances as shown in the graph, while the RFM analysis had overlaping customers in different areas. KMeans tended to group customers more on features, so different sales tactics can be targeted towards these features.

Modeling CLV

Let’s move on to finding the customer lifetime value of these customers for one year. I will be using the BetaGeoFitter and GammaGammaFitter from the lifetimes package to calculate the CLV. Customers with only one purchase will be filtered out for the models.

The Beta-Geometric/NBD model is a type of Buy Till You Die model that describes the dropout process of customers and can predict how many transactions a customer will make. I will combine this with the Gamma-Gamma Submodel can predict the profit for each customer. After fitting both models, we can use the customer_lifetime_value function under the Gamma-Gamma model to create a column for the dataframe with the CLV of each customer, and graph this with the RFM graph.

CustomerID	frequency	recency	T	monetary_value	CLV	group
12347	6	365	367	599.702	3007.45	4
12348	3	283	358	261.48	911.965	2
12352	6	260	296	244.29	1705.14	3
12356	2	303	325	269.905	774.421	1
12358	1	149	150	523.2	1300.52	2

Here are the descriptive statistics for CLV:

	CLV
count	2784
mean	2662.11
std	8683.81
min	312.874
25%	817.444
50%	1366.32
75%	2364.95
max	221407

We can see that most customers have a CLV around the mean but there are a few outliers with high scores. CLV values can also be used to create sementations for customers (as seen in the Group column, like with RFM earlier).

The BGF model can also be used to predict the how many purchases customers will make within the next 30 days (and many other predictions). Here are the top 5 for this dataset:

CustomerID	Expected Purchases
14911	8.9461
12748	7.72459
17841	7.5888
15311	6.09582
14606	6.02791

Overall, there are several ways to segment customers with different benefits and drawbacks depending on what analysis should be done on customer databases. With more data on customers, we could extend our modeling to predicting what customers will buy and when.