Unsupervised learning for market segmentation
In this project, I analyze a dataset of customers' annual spending (reported in monetary units) across diverse product categories, looking for internal structure. One goal of this project is to best describe the variation in the different types of customers that a wholesale distributor interacts with. Doing so would equip the distributor with insight into how best to structure its delivery service to meet the needs of each customer.
The dataset for this project can be found on the UCI Machine Learning Repository. For the purposes of this project, the features 'Channel' and 'Region' will be excluded from the analysis, with the focus instead on the six product categories recorded for customers.
# Import libraries necessary for this project
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import cm as cm
from IPython.display import display # Allows the use of display() for DataFrames
# Import supplementary visualizations code visuals.py
import visuals as vs
# Pretty display for notebooks
%matplotlib inline
# Load the wholesale customers dataset
try:
    data = pd.read_csv("customers.csv")
    data.drop(['Region', 'Channel'], axis = 1, inplace=True)
    print("Wholesale customers dataset has {} samples with {} features each.".format(*data.shape))
except:
    print("Dataset could not be loaded. Is the dataset missing?")
Wholesale customers dataset has 440 samples with 6 features each.
In this section, I begin exploring the data through visualizations and code to understand how each feature is related to the others. I observe a statistical description of the dataset, consider the relevance of each feature, and select a few sample data points from the dataset which will be tracked throughout the project.
# Display a description of the dataset
display(data.describe())
Fresh | Milk | Grocery | Frozen | Detergents_Paper | Delicatessen | |
---|---|---|---|---|---|---|
count | 440.000000 | 440.000000 | 440.000000 | 440.000000 | 440.000000 | 440.000000 |
mean | 12000.297727 | 5796.265909 | 7951.277273 | 3071.931818 | 2881.493182 | 1524.870455 |
std | 12647.328865 | 7380.377175 | 9503.162829 | 4854.673333 | 4767.854448 | 2820.105937 |
min | 3.000000 | 55.000000 | 3.000000 | 25.000000 | 3.000000 | 3.000000 |
25% | 3127.750000 | 1533.000000 | 2153.000000 | 742.250000 | 256.750000 | 408.250000 |
50% | 8504.000000 | 3627.000000 | 4755.500000 | 1526.000000 | 816.500000 | 965.500000 |
75% | 16933.750000 | 7190.250000 | 10655.750000 | 3554.250000 | 3922.000000 | 1820.250000 |
max | 112151.000000 | 73498.000000 | 92780.000000 | 60869.000000 | 40827.000000 | 47943.000000 |
To get a better understanding of the customers and how their data will transform through the analysis, it would be best to select a few sample data points and explore them in more detail. I cycled through different sets of samples until I obtained customers that varied significantly from one another.
# Select three indices to sample from the dataset
indices = [7,70,296]
# Create a DataFrame of the chosen samples
samples = pd.DataFrame(data.loc[indices], columns = data.keys()).reset_index(drop=True)
print("Chosen samples of wholesale customers dataset:")
display(samples)
Chosen samples of wholesale customers dataset:
Fresh | Milk | Grocery | Frozen | Detergents_Paper | Delicatessen | |
---|---|---|---|---|---|---|
0 | 7579 | 4956 | 9426 | 1669 | 3321 | 2566 |
1 | 16705 | 2037 | 3202 | 10643 | 116 | 1365 |
2 | 19087 | 1304 | 3643 | 3045 | 710 | 898 |
1.) What is most interesting about the first establishment is that it doesn't order an exceptionally low amount of anything. Whereas the other two establishments each have at least one order category with a value in the 1st quartile, this customer's orders place in their 2nd quartiles or higher across the board. Most noteworthy among them is Delicatessen, with a value in its top quartile. Other significant orders include Grocery and Detergents_Paper. What these three goods have in common is long shelf lives, indicating that this customer may be ordering a large volume of each on an irregular basis, stocking them for an extended period of time, and reselling them further down the line to its own customers. Even without knowing exactly what "monetary units" were used in creating this data, it is obvious that a single customer cannot consume such an amount of Detergents_Paper itself. In reality, this customer is purchasing mainly with the intent to build inventories. Based on this assessment, I classify this establishment as a retailer. The large order of Grocery suggests that this retailer is likely a supermarket that primarily sells preprocessed and prepackaged foods, cleaning products, and miscellaneous items for the home.
2.) The most outstanding order for the second establishment is Frozen, placed well inside the 4th quartile with an order size of 10,643. This customer also has notable orders of Fresh and Delicatessen, both of which rank in their respective 3rd quartiles. Its orders of Milk and Grocery are ordinary and typical for any restaurant. This is most likely a 2- or 3-star restaurant or pizzeria whose menu consists of items prepared mostly from frozen ingredients. It keeps a large freezer that holds the majority of its product, but also regularly uses fresh vegetables and meats in its dishes, which probably include pizza, pastas, sandwiches, and the like. Because this restaurant isn't especially interested in investing much in cleanliness and upkeep, it doesn't need to order much Detergents_Paper. Its order in this category, a mere 116, ranks at the very bottom of its 1st quartile; it doesn't get much lower than this.
3.) Fresh is the most significant order category for the last establishment, with an order solidly inside the top quartile. The customer's Frozen order also falls in its own 3rd quartile, enough to warrant consideration, albeit not to the extent of the second customer. It orders moderate amounts of the other products (Grocery, Detergents_Paper, Delicatessen), with a particularly weak showing in Milk, which falls in its own 1st quartile. My interpretation is that this customer is a restaurant as well, although one that serves more high-quality fresh food than the second one. Whereas the former prepares most of its items from frozen ingredients, the latter instead prepares its dishes from scratch with mostly fresh ingredients. If the former scores 2 or 3 stars, this one would likely score 4 or 5. Furthermore, to make sure that the dishes, silverware, kitchen, and restaurant in general are clean at all times, this restaurant orders a hefty volume of Detergents_Paper, though not enough that it could resell to customers, as the first establishment could. One would expect this level of attention to sanitation and hygiene in an upscale restaurant.
One interesting thought to consider is whether one (or more) of the six product categories is actually relevant for understanding customer purchasing. That is to say, is it possible to determine whether customers purchasing some amount of one category of products will necessarily purchase some proportional amount of another category of products? This can be tested quite easily by training a supervised regression learner on a subset of the data with one feature removed, and then scoring how well that model can predict the removed feature.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
# Set random seed to allow reproducible results
np.random.seed(42)
# Make a copy of the DataFrame
# Use the 'drop' function to drop a given feature
new_data = data.copy()
new_data.drop(['Grocery'], axis = 1, inplace=True)
# Split the data into training and testing sets using the given feature as the target
X_train, X_test, y_train, y_test = train_test_split(new_data, data['Grocery'],
test_size=0.25, random_state=42)
# Create a decision tree regressor and fit it to the training set
regressor = DecisionTreeRegressor()
regressor.fit(X_train,y_train)
# Report the score of the prediction using the testing set
score = regressor.score(X_test,y_test)
print "The decision tree regressor's r2_score is: {:.4f}".format(score)
The decision tree regressor's r2_score is: 0.6819
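The discussion below also cites scores for the other product categories; those can be obtained by simply repeating the experiment for each feature in turn. A minimal sketch of such a loop (not part of the original notebook; exact scores vary with the split and the un-tuned tree):
# Repeat the feature-relevance experiment for every product category
for target in data.columns:
    features = data.drop([target], axis=1)
    X_tr, X_te, y_tr, y_te = train_test_split(features, data[target],
                                               test_size=0.25, random_state=42)
    reg = DecisionTreeRegressor(random_state=42).fit(X_tr, y_tr)
    print("R^2 when predicting '{}': {:.4f}".format(target, reg.score(X_te, y_te)))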
I attempted to predict the Grocery feature as a target using the other features as estimators. The default decision tree regressor output an R² of 0.6819, which in my view is significant enough to warrant further consideration. Specifically, the other five features can explain roughly 70% of the variation in grocery purchases even without model tuning. Using grid search or other model optimization methods, this score could likely be boosted even further, but that was not the main purpose of this exercise. Based on this discovery, I believe that Grocery is, to no small extent, a function of the other five features. Grocery is not necessary for identifying customers' spending habits because most of its predictive power can be expressed without it. Going further, the remaining features can explain roughly 50% and 17% of the variation in Detergents_Paper and Milk respectively. I hypothesize that combining these features with Grocery into a new macro-feature would help shrink dimensionality while minimizing information loss, at least compared with combining Fresh, Frozen, and Delicatessen; attempting to predict those features produces negative R² scores, which is what gave me this insight.

To get a better understanding of the dataset, I construct a scatter matrix of each of the six product features present in the data. If the feature I attempted to predict above is relevant for identifying a specific customer, then the scatter matrix may not show any correlation between that feature and the others. Conversely, if a feature is not relevant for identifying a specific customer, the scatter matrix might show a correlation between that feature and another feature in the data.
# Produce a scatter matrix for each pair of features in the data
pd.plotting.scatter_matrix(data, alpha = 0.3, figsize = (14,8), diagonal = 'kde');
# A more colorful visualization that allows for more convenient interpretation
# Taken, with modifications, from http://datascience.stackexchange.com/questions/10459/calculation-and-visualization-of-correlation-matrix-with-pandas
fig = plt.figure()
fig.set_size_inches(10, 10)
ax1 = fig.add_subplot(111)
cmap = cm.get_cmap('jet', 30)
cax = ax1.imshow(data.corr(), interpolation="nearest", cmap=cmap)
ax1.grid(True)
plt.title('Customer Spending Feature Correlation')
labels=['Placeholder','Fresh','Milk','Grocery','Frozen','Deterg./Paper','Deli.']
ax1.set_xticklabels(labels, fontsize=12)
ax1.set_yticklabels(labels, fontsize=12)
cbar = fig.colorbar(cax, ticks= np.linspace(0.0,1.0,5))
plt.show()
# Precise numeric breakdown
display(data.corr())
Fresh | Milk | Grocery | Frozen | Detergents_Paper | Delicatessen | |
---|---|---|---|---|---|---|
Fresh | 1.000000 | 0.100510 | -0.011854 | 0.345881 | -0.101953 | 0.244690 |
Milk | 0.100510 | 1.000000 | 0.728335 | 0.123994 | 0.661816 | 0.406368 |
Grocery | -0.011854 | 0.728335 | 1.000000 | -0.040193 | 0.924641 | 0.205497 |
Frozen | 0.345881 | 0.123994 | -0.040193 | 1.000000 | -0.131525 | 0.390947 |
Detergents_Paper | -0.101953 | 0.661816 | 0.924641 | -0.131525 | 1.000000 | 0.069291 |
Delicatessen | 0.244690 | 0.406368 | 0.205497 | 0.390947 | 0.069291 | 1.000000 |
The three most strongly correlated pairs of features are Grocery and Detergents_Paper, Grocery and Milk, and Detergents_Paper and Milk. To be more precise, the correlation coefficients for these pairs are 0.924641, 0.728335, and 0.661816 respectively. This supports my previous claim that these three features are not independent and thus not individually necessary for predicting customer spending habits. A new insight from this analysis is that while these three features are highly correlated with each other, they are not particularly correlated with the other three features (Fresh, Frozen, and Delicatessen). This is especially true for Detergents_Paper and Grocery. It indicates that instead of discarding the three features outright, they should be "repackaged" into a new one that captures the information they contain. This new feature could maintain the predictive strength they possess as a group, without being redundant.

In this section, I preprocess the data to create a better representation of customers by scaling the data and by detecting and removing outliers. Preprocessing data is often a critical step in ensuring that the results obtained from an analysis are significant and meaningful.
If data is not normally distributed, especially if the mean and median differ significantly (indicating a large skew), it is most often appropriate to apply a non-linear scaling. One way to achieve this scaling is the Box-Cox transformation, which calculates the power transformation of the data that best reduces skewness. A simpler approach, which works in most cases, is applying the natural logarithm.
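For reference, the Box-Cox alternative could be applied with SciPy. This is only a sketch for comparison, assuming SciPy is available and noting that all spending values in this dataset are strictly positive, as Box-Cox requires; the project itself uses the natural logarithm, applied next.
# Sketch only: Box-Cox transformation of each feature (not used in the rest of the analysis)
from scipy import stats

boxcox_data = data.copy()
for feature in data.columns:
    # stats.boxcox returns the transformed values along with the fitted lambda
    boxcox_data[feature], fitted_lambda = stats.boxcox(data[feature])
    print("Fitted Box-Cox lambda for '{}': {:.3f}".format(feature, fitted_lambda))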
# Scale the data using the natural logarithm
log_data = np.log(data)
# Scale the sample data using the natural logarithm
log_samples = np.log(samples)
# Produce a scatter matrix for each pair of newly-transformed features
pd.plotting.scatter_matrix(log_data, alpha = 0.3, figsize = (14,8), diagonal = 'kde');
# Re-display the correlation matrix of the original (unscaled) data for reference
display(data.corr())
Fresh | Milk | Grocery | Frozen | Detergents_Paper | Delicatessen | |
---|---|---|---|---|---|---|
Fresh | 1.000000 | 0.100510 | -0.011854 | 0.345881 | -0.101953 | 0.244690 |
Milk | 0.100510 | 1.000000 | 0.728335 | 0.123994 | 0.661816 | 0.406368 |
Grocery | -0.011854 | 0.728335 | 1.000000 | -0.040193 | 0.924641 | 0.205497 |
Frozen | 0.345881 | 0.123994 | -0.040193 | 1.000000 | -0.131525 | 0.390947 |
Detergents_Paper | -0.101953 | 0.661816 | 0.924641 | -0.131525 | 1.000000 | 0.069291 |
Delicatessen | 0.244690 | 0.406368 | 0.205497 | 0.390947 | 0.069291 | 1.000000 |
After applying a natural logarithm scaling to the data, the distribution of each feature appears much more normal. Correlations between the various pairs of features are still clearly evident after the log transform.
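This impression could be quantified by comparing the per-feature skewness before and after the transform; a minimal sketch (not part of the original notebook):
# Compare the skewness of each feature before and after the log transform
skew_check = pd.DataFrame({'original skew': data.skew(), 'log skew': log_data.skew()})
display(skew_check)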
# Display the log-transformed sample data
display(log_samples)
Fresh | Milk | Grocery | Frozen | Detergents_Paper | Delicatessen | |
---|---|---|---|---|---|---|
0 | 8.933137 | 8.508354 | 9.151227 | 7.419980 | 8.108021 | 7.850104 |
1 | 9.723463 | 7.619233 | 8.071531 | 9.272658 | 4.753590 | 7.218910 |
2 | 9.856763 | 7.173192 | 8.200563 | 8.021256 | 6.565265 | 6.800170 |
Detecting outliers is an extremely important preprocessing step in any analysis, since their presence can skew results that take these data points into consideration. There are many "rules of thumb" for what constitutes an outlier in a dataset. Here, I use Tukey's method for identifying outliers: an outlier step is calculated as 1.5 times the interquartile range (IQR), and a data point with a feature value more than one outlier step below Q1 or above Q3 for that feature is considered abnormal.
# For each feature find the data points with extreme high or low values
for feature in log_data.keys():
    # Calculate Q1 (25th percentile) and Q3 (75th percentile) for the given feature
    Q1, Q3 = np.percentile(log_data[feature], 25), np.percentile(log_data[feature], 75)
    # Use the interquartile range to calculate an outlier step (1.5 times the IQR)
    step = 1.5 * (Q3 - Q1)
    # Display the outliers
    print("Data points considered outliers for the feature '{}':".format(feature))
    display(log_data[~((log_data[feature] >= Q1 - step) & (log_data[feature] <= Q3 + step))])
# Select the indices for data points to be removed
# (these five points are flagged as outliers for more than one feature)
outliers = [65,66,75,128,154]
# Remove the outliers, if any were specified
good_data = log_data.drop(log_data.index[outliers]).reset_index(drop=True)
Data points considered outliers for the feature 'Fresh':
Fresh | Milk | Grocery | Frozen | Detergents_Paper | Delicatessen | |
---|---|---|---|---|---|---|
65 | 4.442651 | 9.950323 | 10.732651 | 3.583519 | 10.095388 | 7.260523 |
66 | 2.197225 | 7.335634 | 8.911530 | 5.164786 | 8.151333 | 3.295837 |
81 | 5.389072 | 9.163249 | 9.575192 | 5.645447 | 8.964184 | 5.049856 |
95 | 1.098612 | 7.979339 | 8.740657 | 6.086775 | 5.407172 | 6.563856 |
96 | 3.135494 | 7.869402 | 9.001839 | 4.976734 | 8.262043 | 5.379897 |
128 | 4.941642 | 9.087834 | 8.248791 | 4.955827 | 6.967909 | 1.098612 |
171 | 5.298317 | 10.160530 | 9.894245 | 6.478510 | 9.079434 | 8.740337 |
193 | 5.192957 | 8.156223 | 9.917982 | 6.865891 | 8.633731 | 6.501290 |
218 | 2.890372 | 8.923191 | 9.629380 | 7.158514 | 8.475746 | 8.759669 |
304 | 5.081404 | 8.917311 | 10.117510 | 6.424869 | 9.374413 | 7.787382 |
305 | 5.493061 | 9.468001 | 9.088399 | 6.683361 | 8.271037 | 5.351858 |
338 | 1.098612 | 5.808142 | 8.856661 | 9.655090 | 2.708050 | 6.309918 |
353 | 4.762174 | 8.742574 | 9.961898 | 5.429346 | 9.069007 | 7.013016 |
355 | 5.247024 | 6.588926 | 7.606885 | 5.501258 | 5.214936 | 4.844187 |
357 | 3.610918 | 7.150701 | 10.011086 | 4.919981 | 8.816853 | 4.700480 |
412 | 4.574711 | 8.190077 | 9.425452 | 4.584967 | 7.996317 | 4.127134 |
Data points considered outliers for the feature 'Milk':
Fresh | Milk | Grocery | Frozen | Detergents_Paper | Delicatessen | |
---|---|---|---|---|---|---|
86 | 10.039983 | 11.205013 | 10.377047 | 6.894670 | 9.906981 | 6.805723 |
98 | 6.220590 | 4.718499 | 6.656727 | 6.796824 | 4.025352 | 4.882802 |
154 | 6.432940 | 4.007333 | 4.919981 | 4.317488 | 1.945910 | 2.079442 |
356 | 10.029503 | 4.897840 | 5.384495 | 8.057377 | 2.197225 | 6.306275 |
Data points considered outliers for the feature 'Grocery':
Fresh | Milk | Grocery | Frozen | Detergents_Paper | Delicatessen | |
---|---|---|---|---|---|---|
75 | 9.923192 | 7.036148 | 1.098612 | 8.390949 | 1.098612 | 6.882437 |
154 | 6.432940 | 4.007333 | 4.919981 | 4.317488 | 1.945910 | 2.079442 |
Data points considered outliers for the feature 'Frozen':
Fresh | Milk | Grocery | Frozen | Detergents_Paper | Delicatessen | |
---|---|---|---|---|---|---|
38 | 8.431853 | 9.663261 | 9.723703 | 3.496508 | 8.847360 | 6.070738 |
57 | 8.597297 | 9.203618 | 9.257892 | 3.637586 | 8.932213 | 7.156177 |
65 | 4.442651 | 9.950323 | 10.732651 | 3.583519 | 10.095388 | 7.260523 |
145 | 10.000569 | 9.034080 | 10.457143 | 3.737670 | 9.440738 | 8.396155 |
175 | 7.759187 | 8.967632 | 9.382106 | 3.951244 | 8.341887 | 7.436617 |
264 | 6.978214 | 9.177714 | 9.645041 | 4.110874 | 8.696176 | 7.142827 |
325 | 10.395650 | 9.728181 | 9.519735 | 11.016479 | 7.148346 | 8.632128 |
420 | 8.402007 | 8.569026 | 9.490015 | 3.218876 | 8.827321 | 7.239215 |
429 | 9.060331 | 7.467371 | 8.183118 | 3.850148 | 4.430817 | 7.824446 |
439 | 7.932721 | 7.437206 | 7.828038 | 4.174387 | 6.167516 | 3.951244 |
Data points considered outliers for the feature 'Detergents_Paper':
Fresh | Milk | Grocery | Frozen | Detergents_Paper | Delicatessen | |
---|---|---|---|---|---|---|
75 | 9.923192 | 7.036148 | 1.098612 | 8.390949 | 1.098612 | 6.882437 |
161 | 9.428190 | 6.291569 | 5.645447 | 6.995766 | 1.098612 | 7.711101 |
Data points considered outliers for the feature 'Delicatessen':
Fresh | Milk | Grocery | Frozen | Detergents_Paper | Delicatessen | |
---|---|---|---|---|---|---|
66 | 2.197225 | 7.335634 | 8.911530 | 5.164786 | 8.151333 | 3.295837 |
109 | 7.248504 | 9.724899 | 10.274568 | 6.511745 | 6.728629 | 1.098612 |
128 | 4.941642 | 9.087834 | 8.248791 | 4.955827 | 6.967909 | 1.098612 |
137 | 8.034955 | 8.997147 | 9.021840 | 6.493754 | 6.580639 | 3.583519 |
142 | 10.519646 | 8.875147 | 9.018332 | 8.004700 | 2.995732 | 1.098612 |
154 | 6.432940 | 4.007333 | 4.919981 | 4.317488 | 1.945910 | 2.079442 |
183 | 10.514529 | 10.690808 | 9.911952 | 10.505999 | 5.476464 | 10.777768 |
184 | 5.789960 | 6.822197 | 8.457443 | 4.304065 | 5.811141 | 2.397895 |
187 | 7.798933 | 8.987447 | 9.192075 | 8.743372 | 8.148735 | 1.098612 |
203 | 6.368187 | 6.529419 | 7.703459 | 6.150603 | 6.860664 | 2.890372 |
233 | 6.871091 | 8.513988 | 8.106515 | 6.842683 | 6.013715 | 1.945910 |
285 | 10.602965 | 6.461468 | 8.188689 | 6.948897 | 6.077642 | 2.890372 |
289 | 10.663966 | 5.655992 | 6.154858 | 7.235619 | 3.465736 | 3.091042 |
343 | 7.431892 | 8.848509 | 10.177932 | 7.283448 | 9.646593 | 3.610918 |
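The five indices removed above are exactly the points that appear in more than one of the per-feature outlier lists. A small sketch (not part of the original code) of how they could be collected automatically:
from collections import Counter

# Count how many features flag each data point as an outlier
outlier_counts = Counter()
for feature in log_data.keys():
    Q1, Q3 = np.percentile(log_data[feature], 25), np.percentile(log_data[feature], 75)
    step = 1.5 * (Q3 - Q1)
    flagged = log_data[~((log_data[feature] >= Q1 - step) & (log_data[feature] <= Q3 + step))].index
    outlier_counts.update(flagged)

# Keep the points flagged for two or more features
multi_feature_outliers = sorted(idx for idx, n in outlier_counts.items() if n > 1)
print(multi_feature_outliers)  # expected: [65, 66, 75, 128, 154]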
In this section I use principal component analysis (PCA) to draw conclusions about the underlying structure of the wholesale customer data. Since PCA calculates the dimensions which best maximize variance, I am looking for which combinations of features best describe customers.
Now that the data has been scaled to a more normal distribution and has had outliers removed, I can apply PCA to good_data to discover which dimensions of the data best maximize the variance of the features involved. In addition to finding these dimensions, PCA reports the explained variance ratio of each dimension: how much variance within the data is explained by that dimension alone. Note that a component (dimension) from PCA can be considered a new "feature" of the space, although it is a composition of the original features present in the data.
from sklearn.decomposition import PCA
# Apply PCA by fitting the good data with the same number of dimensions as features
pca = PCA()
pca.fit(good_data)
# Transform the sample log-data using the PCA fit above
pca_samples = pca.transform(log_samples)
# Generate PCA results plot
pca_results = vs.pca_results(good_data, pca)
PC 1, which is dominated by Detergents_Paper, Grocery, and Milk, represents prepackaged goods that can easily be stocked on shelves and resold to consumers. This feature contains items that are meant to be used or consumed in isolation; a fitting name for it would be "Packaged Goods." As products in Fresh and Frozen are commonly not consumed separately but instead jointly with other products, those two features share the same weight sign in this feature.

PC 2, which consists of Fresh, Frozen, and Delicatessen, represents food items that can be consumed either in isolation or in combination with one another. If one had to apply a name to this feature, it would be "Meals." For example, a turkey sandwich with fries is a meal that uses items from each of the three categories in the feature. Because Milk, Grocery, and Detergents_Paper (in the form of a napkin) are typically included in meals as well, all six original features have the same weight sign here.

The third component is weighted most heavily toward Delicatessen, together with Frozen, and carries a significant negative weight on Fresh. Perhaps this feature constitutes foods that have longer shelf lives than those captured in the second component. Even for those customers who order heavily in the "Meals" category defined previously, it is likely not the case that ALL of their orders need to be consumed immediately; some can be stocked away and used at later dates. Cured meats (Delicatessen) and frozen pastries (Frozen), for example, would likely fall into such a category. The negative weight on Fresh, which includes foods with very short shelf lives, makes sense with this interpretation as well.

In the fourth component, Frozen stands tall above all the rest in its positive weight, while Delicatessen nearly balances it on the negative side. My interpretation of this feature is that it represents cheap food that can be stocked in refrigerators for long periods of time. In a sense, the fourth feature takes the third one and specializes it further: whereas foods in Delicatessen are typically expensive and sometimes even luxurious, those in Frozen are usually processed and cheap. The first four principal components encapsulate the variation in Frozen so well that it receives nearly no weight in the fifth and sixth components, as expected.

The code below shows how the log-transformed sample data has changed after having a PCA transformation applied to it in six dimensions. Observe the numerical values for the first four dimensions of the sample points.
# Display sample log-data after having a PCA transformation applied
display(pd.DataFrame(np.round(pca_samples, 4), columns = pca_results.index.values))
Dimension 1 | Dimension 2 | Dimension 3 | Dimension 4 | Dimension 5 | Dimension 6 | |
---|---|---|---|---|---|---|
0 | -1.5672 | -0.9010 | 0.3684 | -0.2682 | -0.4571 | 0.1526 |
1 | 2.3404 | -1.6911 | 0.7155 | 0.5932 | 0.4606 | 0.4074 |
2 | 0.9637 | -0.9607 | -0.4363 | 0.1914 | -0.6969 | 0.3937 |
When using principal component analysis, one of the main goals is to reduce the dimensionality of the data — in effect, reducing the complexity of the problem. Dimensionality reduction comes at a cost: Fewer dimensions used implies less of the total variance in the data is being explained. Because of this, the cumulative explained variance ratio is important for knowing how many dimensions are necessary for the problem. Additionally, if a significant amount of variance is explained by only two or three dimensions, the reduced data can be visualized afterwards.
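Before re-fitting PCA with fewer components, the cumulative explained variance of the full six-dimensional fit above can be inspected directly; a minimal sketch (not part of the original code), reusing the pca object fit earlier:
# Cumulative explained variance of the six-dimensional PCA fit
cum_var = np.cumsum(pca.explained_variance_ratio_)
for i, v in enumerate(cum_var, start=1):
    print("First {} dimension(s) explain {:.1%} of the total variance".format(i, v))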
# Apply PCA by fitting the good data with only two dimensions
pca = PCA(n_components=2)
pca.fit(good_data)
# Transform the good data using the PCA fit above
reduced_data = pca.transform(good_data)
# Transform the sample log-data using the PCA fit above
pca_samples = pca.transform(log_samples)
# Create a DataFrame for the reduced data
reduced_data = pd.DataFrame(reduced_data, columns=['Dimension 1', 'Dimension 2'])
The code below shows how the log-transformed sample data has changed after having a PCA transformation applied to it using only two dimensions. Observe how the values for the first two dimensions remain unchanged when compared to a PCA transformation in six dimensions.
# Display sample log-data after applying PCA transformation in two dimensions
display(pd.DataFrame(np.round(pca_samples, 4), columns=['Dimension 1', 'Dimension 2']))
Dimension 1 | Dimension 2 | |
---|---|---|
0 | -1.5672 | -0.9010 |
1 | 2.3404 | -1.6911 |
2 | 0.9637 | -0.9607 |
A biplot is a scatterplot where each data point is represented by its scores along the principal components. The axes are the principal components (in this case Dimension 1 and Dimension 2). In addition, the biplot shows the projection of the original features along the components. A biplot can help to interpret the reduced dimensions of the data, and to discover relationships between the principal components and the original features.
# Create a biplot
vs.biplot(good_data, reduced_data, pca)
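For reference, a comparable biplot could also be drawn directly with matplotlib rather than through visuals.py. The sketch below is not the vs.biplot implementation; the arrow scaling factor is arbitrary and chosen only for legibility.
# Sketch: scatter the PCA scores and overlay the original feature loadings as arrows
def simple_biplot(reduced_data, pca, feature_names, arrow_scale=7.0):
    fig, ax = plt.subplots(figsize=(10, 6))
    ax.scatter(reduced_data['Dimension 1'], reduced_data['Dimension 2'],
               facecolors='b', edgecolors='b', s=50, alpha=0.4)
    # pca.components_ has shape (n_components, n_features)
    for i, name in enumerate(feature_names):
        ax.arrow(0, 0, arrow_scale * pca.components_[0, i], arrow_scale * pca.components_[1, i],
                 head_width=0.1, head_length=0.1, color='red', linewidth=1.5)
        ax.text(arrow_scale * pca.components_[0, i] * 1.1, arrow_scale * pca.components_[1, i] * 1.1,
                name, color='black', ha='center', va='center', fontsize=12)
    ax.set_xlabel("Dimension 1")
    ax.set_ylabel("Dimension 2")
    ax.set_title("PC plane with original feature projections")
    return ax

# Example usage: simple_biplot(reduced_data, pca, good_data.columns)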
With the original feature projections shown in red, it is easier to interpret the relative position of each data point in the scatterplot. For instance, a point in the upper-left corner of the figure will likely correspond to a customer that spends a lot on 'Milk', 'Grocery', and 'Detergents_Paper', but not so much on the other product categories.
Detergents_Paper, Grocery, and Milk have the strongest correlation with the first principal component. In fact, Detergents_Paper is nearly parallel with the new axis, indicating an almost perfect correlation. These results make sense in terms of the pca_results plot as well as the correlation matrix displayed earlier, which shows significant correlation between the three features. Fresh, Frozen, and Delicatessen are most strongly correlated with the second principal component, though not to the extent of the features in the first component. Fresh is clearly the most dominant feature, while Frozen and Delicatessen are roughly equal, though pointing in different directions. This observation also agrees with the pca_results plot obtained earlier, which shows Fresh with the highest weight and the other two features roughly tied for second and third.

In this section, I employ the Gaussian Mixture Model clustering algorithm to identify the various customer segments inherent in the data. I then recover specific data points from the clusters to understand their significance by transforming them back into their original dimension and scale.
A Gaussian Mixture Model (GMM) models the data as a mixture of up to n_samples - 1 clusters (Gaussian distributions in this case) and assigns each data point to one of them based on probability. After log-transforming the data, the feature distributions appear approximately normal, which satisfies the Gaussian-distribution assumption of GMM. Looking at the biplot, I can discern two rough clusters, but they are far from non-overlapping; the boundary between the two clusters is ambiguous at best, which makes a strong case for GMM. To demonstrate this, I found 31 points in the data that GMM could not definitively classify into one of the two clusters. The meaning of "definitive" here is subjective, but I define it as a point having a greater than 60% probability of belonging to one of the clusters. I marked these points as green stars in the cluster_results plot as well. This clear distinction between "strong" and "weak" cluster members is not possible with K-Means, which makes it appear as though all cluster members belong equally, obscuring the underlying structure of the data. This is why I chose GMM instead.

Depending on the problem, the number of clusters one expects to find in a dataset may already be known. When the number of clusters is not known a priori, there is no guarantee that a given number of clusters best segments the data, since it is unclear what structure exists in the data, if any. However, the "goodness" of a clustering can be quantified by calculating each data point's silhouette coefficient. The silhouette coefficient for a data point measures how similar it is to its assigned cluster, from -1 (dissimilar) to 1 (similar). Calculating the mean silhouette coefficient provides a simple scoring method for a given clustering.
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score
# Apply clustering algorithm to the reduced data
clusterer = GaussianMixture(n_components=2, n_init=10, random_state=42)
clusterer.fit(reduced_data)
# Predict the cluster for each data point
preds = clusterer.predict(reduced_data)
# Find the cluster centers
centers = clusterer.means_
# Predict the cluster for each transformed sample data point
sample_preds = clusterer.predict(pca_samples)
# Calculate the mean silhouette coefficient for the number of clusters chosen
score = silhouette_score(reduced_data,preds)
print('Silhouette score using GMM: {:.3f}'.format(score))
Silhouette score using GMM: 0.422
The table below displays the silhouette scores associated with various GMM clustering setups:
No. of Clusters | Silhouette Score |
---|---|
2 | 0.422 |
3 | 0.394 |
4 | 0.316 |
5 | 0.278 |
8 | 0.333 |
12 | 0.301 |
20 | 0.314 |
50 | 0.307 |
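The scores in this table could be generated by re-running the clustering for each candidate number of components; a minimal sketch (not part of the original code):
# Evaluate GMM clusterings with different numbers of components
for n in [2, 3, 4, 5, 8, 12, 20, 50]:
    gmm = GaussianMixture(n_components=n, n_init=10, random_state=42)
    gmm.fit(reduced_data)
    labels = gmm.predict(reduced_data)
    print("{} clusters: silhouette score = {:.3f}".format(n, silhouette_score(reduced_data, labels)))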
n_components=2 appears to be the optimal cluster-size parameter for GMM, with its silhouette score of 0.422. Furthermore, increasing the number of clusters does not have a positive or consistent effect on the score, so two clusters were chosen as the optimal number.

With the optimal number of clusters chosen for the GMM clustering algorithm using the scoring metric above, I can now visualize the results by executing the code block below. For experimentation purposes, the number of clusters for GMM can be adjusted to see various visualizations. The visualization provided here, however, corresponds to the optimal number of clusters.
# Added some functionality to display points that don't neatly fall into either cluster.
probs = pd.DataFrame(clusterer.predict_proba(reduced_data))
border_indexes = np.array(probs[(probs[0] >= 0.4) & (probs[0] <= 0.6)].index)
border_data = reduced_data.values[border_indexes]
# Display the results of the clustering from implementation
vs.cluster_results(reduced_data, preds, centers, pca_samples, border_data)
Each cluster present in the visualization above has a central point. These centers (or means) are not specifically data points from the data, but rather the averages of all the data points predicted in the respective clusters. For the problem of creating customer segments, a cluster’s center point corresponds to the average customer of that segment. Since the data is currently reduced in dimension and scaled by a logarithm, I recover the representative customer spending from these data points by applying the inverse transformations.
# Inverse transform the centers
log_centers = pca.inverse_transform(centers)
# Exponentiate the centers
true_centers = np.exp(log_centers)
# Display the true centers
segments = ['Segment {}'.format(i) for i in range(0,len(centers))]
true_centers = pd.DataFrame(np.round(true_centers), columns = data.keys())
true_centers.index = segments
display(true_centers)
Fresh | Milk | Grocery | Frozen | Detergents_Paper | Delicatessen | |
---|---|---|---|---|---|---|
Segment 0 | 8953.0 | 2114.0 | 2765.0 | 2075.0 | 353.0 | 732.0 |
Segment 1 | 3552.0 | 7837.0 | 12219.0 | 870.0 | 4696.0 | 962.0 |
The two segment centers show that Delicatessen, whose mean values are similar in both segments, is not very useful in demarcating the customer segments. Comparing Segment 0's true sales means to the feature quartiles, only its Fresh and Frozen orders fall above the respective feature medians (in their 3rd quartiles), with the remaining categories falling below. This further supports the assertion that Fresh and Frozen are the most valuable categories for customers in Segment 0. These two categories are relatively more important to restaurants and farmers' markets than to other kinds of establishments, giving additional support to the existence of this cluster.

Segment 1's mean orders of Milk, Grocery, and Detergents_Paper are so many multiples higher than those of Segment 0 that one can assume these customers are buying enough to resell directly to their own customers. They order moderate to large quantities of detergents, cleaners, groceries, and other goods, and stock them for extended periods of time; they likely do not have an urgent need for their products at any particular moment. Comparing this segment's true sales means to their respective quartiles shows that a typical customer's orders place in Q2, Q3, Q3, Q2, Q3, and Q2 for Fresh, Milk, Grocery, Frozen, Detergents_Paper, and Delicatessen respectively. This demonstrates that Milk, Grocery, and Detergents_Paper are the categories most integral to Segment 1. All of these products can be purchased in large quantities, stocked easily on shelves, and sold on to consumers, validating the retail character of this cluster.
# Display the predictions
for i, pred in enumerate(sample_preds):
    print("Sample point {} predicted to be in Cluster {}".format(i, pred))
Sample point 0 predicted to be in Cluster 1
Sample point 1 predicted to be in Cluster 0
Sample point 2 predicted to be in Cluster 0
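The difference tables below compare each sample against the two segment centers. A hypothetical reconstruction of how they might have been produced (the exact figures depend on the unrounded centers from the clustering run):
# For each sample, show the absolute gap between its purchases and each segment's center
for i in range(len(samples)):
    diff = (true_centers - samples.iloc[i]).abs().astype(int)
    print("Sample {} Difference Table (True Centers - Customer Purchases)".format(i))
    display(diff)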
Sample 0 Difference Table (True Centers - Customer Purchases)
Fresh | Milk | Grocery | Frozen | Detergents_Paper | Delicatessen | |
---|---|---|---|---|---|---|
Segment 0 | 1233 | 2904 | 6737 | 389 | 2984 | 1854 |
Segment 1 | 3263 | 1391 | 129 | 633 | 275 | 1621 |
Sample 1 Difference Table (True Centers - Customer Purchases)
Fresh | Milk | Grocery | Frozen | Detergents_Paper | Delicatessen | |
---|---|---|---|---|---|---|
Segment 0 | 7893 | 15 | 513 | 8585 | 221 | 653 |
Segment 1 | 12389 | 4310 | 6353 | 9607 | 2930 | 420 |
Sample 2 Difference Table (True Centers - Customer Purchases)
Fresh | Milk | Grocery | Frozen | Detergents_Paper | Delicatessen | |
---|---|---|---|---|---|---|
Segment 0 | 10275 | 748 | 954 | 987 | 373 | 186 |
Segment 1 | 14771 | 5043 | 5912 | 2009 | 2336 | 47 |
In this final section, I investigate ways to make use of the clustered data. First, I discuss how the different groups of customers (the customer segments) may be affected differently by a specific delivery scheme. Next, I consider how giving each customer a label (the segment it belongs to) can provide additional features for the customer data. Finally, I compare the customer segments to a hidden variable present in the data to see whether the clustering identified certain relationships.
Companies will often run A/B tests when making small changes to their products or services to determine whether the change will affect their customers positively or negatively. The wholesale distributor is considering changing its delivery service from the current 5 days a week to 3 days a week. However, the distributor will only make this change in delivery service for customers that react positively.
Customers in Segment 0 order mostly Fresh and Frozen over the other categories, while those in Segment 1 order high volumes of Detergents_Paper, Grocery, and Milk. The difference between the two segments on Delicatessen is not as apparent, so this feature should be excluded from the analysis. Going off these observations, the distributor can predict that customers in Segment 0, who order goods with short expiry dates, need frequent and consistent deliveries in order to operate their businesses. These customers would benefit from the 5-days-a-week delivery schedule and would likely react negatively to a switch to a 3-day schedule. On the other hand, customers in Segment 1 mainly order goods that do not have such tight expiration dates (with the exception of Milk). The goods they order can easily be shipped in bulk and stocked for extended periods of time. Customers in this group would likely have an indifferent-to-positive reaction to the delivery-schedule change. I float the possibility of a positive reaction because they might find it a hassle to receive shipments 5 days a week rather than 3 when the need to replenish their goods is not urgent; by receiving goods only 3 days per week, their employees can spend more time on other, more critical job tasks instead of receiving deliveries.

Additional structure is derived from originally unlabeled data when using clustering techniques. Since each customer has a customer segment it best identifies with (depending on the clustering algorithm applied), 'customer segment' can be considered an engineered feature for the data. Consider the situation where the wholesale distributor recently acquired ten new customers, and each has provided estimates for its anticipated annual spending in each product category. Knowing these estimates, the wholesale distributor could classify each new customer into a customer segment to determine the most appropriate delivery service. The details of how this could be handled are sketched below.
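A minimal sketch of that classification step, assuming the log scaling, the two-component pca fit, and the GMM clusterer from earlier are still in scope (the spending figures used here are purely hypothetical):
# Hypothetical anticipated annual spending for two new customers (values are made up)
new_customers = pd.DataFrame([[12000, 3000, 4000, 4500, 500, 900],
                              [4000, 8000, 12500, 900, 4800, 1000]],
                             columns=data.keys())

new_log = np.log(new_customers)                # apply the same natural log scaling
new_reduced = pca.transform(new_log)           # project onto the two retained PCA dimensions
new_segments = clusterer.predict(new_reduced)  # assign each new customer to a segment
print(new_segments)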
At the beginning of this project, it was discussed that the 'Channel' and 'Region' features would be excluded from the dataset so that the customer product categories were emphasized in the analysis. By reintroducing the 'Channel' feature into the dataset, an interesting structure emerges when considering the same PCA dimensionality reduction applied earlier to the original dataset.
The code block below assigns a label to each data point in the reduced space: either 'HoReCa' (Hotel/Restaurant/Cafe) or 'Retail'. In addition, the sample points are circled in the plot, which identifies their labeling.
# Display the clustering results based on 'Channel' data
vs.channel_results(reduced_data, outliers, pca_samples)
HoReCa has a larger spread than Retail, which can be seen in how many red points encroach on the green space. Retail also spills over into the HoReCa space, but not as much as the other way around. GMM also predicted that Segment 1 'wraps' around the top of Segment 0, which seemed strange to me; the true distribution does not show such a pattern. Despite these differences, HoReCa does appear to have a more well-defined center than Retail: one can visualize a circle surrounding its densest portion that includes roughly 70% of its members. In determining what makes a 'pure' Hotel/Restaurant/Cafe, I would base the measure on how close a given customer is to that cluster center. A vertical line at Dimension_1 = 1.8 would separate the 'pure' Hotels/Restaurants/Cafes on the right from the rest of the data, as seen in the modified plot. Likewise, a vertical line at Dimension_1 = -3 would divide the 'pure' Retailers on the left from the rest of the data. In my previous definition of the customer segments, I placed heavy importance on whether or not a customer is ordering enough that it could resell product to its own customers, and the above analysis using Dimension_1 is consistent with that definition. To the right of 1.8, customers are buying so little that they could not feasibly run a retail business; to the left of -3, they order so much that they have to resell it to stay in business. Everywhere in between is much more ambiguous, so I would be more restrained in making predictions about customers in this zone for which I did not have labels.