K-Prototypes clustering: for when you’re clustering dynamic, real-world data

cluster analysis 101
partial head of the data frame
value counts
  • ‘Male’ for Gender
  • ‘No’ for Has phone service
  • ‘DSL’ for Internet service
  • ‘Month-to-month’ for Contract length
  • ‘Mailed check’ for Payment method
Green denotes a match to the ‘Mode’ row
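As a rough illustration of how a mode row like the one above can be derived, here is a minimal sketch using only the standard library. The `rows` data and the `column_mode` helper are made up for this example and are not taken from the original data frame.

```python
from collections import Counter

# Toy rows standing in for a few categorical columns (illustrative values only).
rows = [
    {"Gender": "Male", "Contract length": "Month-to-month"},
    {"Gender": "Female", "Contract length": "Month-to-month"},
    {"Gender": "Male", "Contract length": "Two year"},
]


def column_mode(rows, column):
    """Return the most frequent value in one column and how often it occurs."""
    counts = Counter(row[column] for row in rows)
    value, count = counts.most_common(1)[0]
    return value, count


# One (mode, count) pair per column, i.e. a "Mode" row plus its value counts.
modes = {col: column_mode(rows, col) for col in rows[0]}
print(modes)
```

With a pandas data frame, `value_counts()` on each column gives the same information.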
  • Iterate over cluster counts from 2 through 9 (range(2, 10) excludes the upper bound), passing each one as n_clusters to KPrototypes. As you can see, I have initialized the model with the Huang approach.
  • To build the clusters, I call fit_predict on the data, specifying which of my columns are categorical.
  • I append the cost, and the number of clusters used to compute it, to the appropriate lists.
  • Finally, I plot a simple scatterplot of cost against number of clusters to see where the cost starts to flatten out.
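from kmodes.kprototypes import KPrototypes
from tqdm import tqdm
import plotly.graph_objects as go

costs = []
n_clusters = []
# Positional indices of the categorical columns in data_corr
cat_cols = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 14]

for i in tqdm(range(2, 10)):
    try:
        kproto = KPrototypes(n_clusters=i, init='Huang', verbose=2)
        clusters = kproto.fit_predict(data_corr, categorical=cat_cols)
        costs.append(kproto.cost_)
        n_clusters.append(i)
    except Exception:
        print(f"Can't cluster with {i} clusters")

# Plot cost against the number of clusters to look for the elbow
fig = go.Figure(data=go.Scatter(x=n_clusters, y=costs))
fig.show()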
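To make the cost being collected a little more concrete: in Huang's formulation, as I understand it, the distance between a point and a cluster prototype is the squared Euclidean distance over the numeric columns plus a weight, gamma, times the number of mismatching categorical values. The clustering cost is then the sum of these distances from each point to its assigned prototype. The sketch below is a plain-Python illustration of that measure, not the library's internal code. The function name and the sample values are hypothetical.

```python
def mixed_dissimilarity(point_num, point_cat, proto_num, proto_cat, gamma):
    """Squared Euclidean distance on the numeric parts plus
    gamma times the number of categorical mismatches."""
    numeric_part = sum((a - b) ** 2 for a, b in zip(point_num, proto_num))
    mismatches = sum(a != b for a, b in zip(point_cat, proto_cat))
    return numeric_part + gamma * mismatches


# Hypothetical point and prototype: two numeric and two categorical features.
d = mixed_dissimilarity(
    point_num=[1.0, 2.0],
    point_cat=["DSL", "No"],
    proto_num=[0.0, 2.0],
    proto_cat=["DSL", "Yes"],
    gamma=0.5,
)
print(d)  # 1.0 from the numeric part + 0.5 * 1 mismatch
```

A larger gamma makes categorical mismatches count for more relative to numeric distance. KPrototypes accepts a gamma argument if you want to set the weight yourself rather than rely on the library default.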
Elbow plot
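Reading the elbow off a plot is a judgment call. One rough way to back it up with numbers is to look at how much the cost drops, relative to the previous cost, each time a cluster is added, and stop once that drop becomes small. The sketch below assumes a 10% threshold, which is an arbitrary choice. The `suggest_k` helper and the example costs are invented for illustration and are not the costs from this analysis.

```python
def suggest_k(n_clusters, costs, threshold=0.10):
    """Return the last cluster count before the relative cost
    reduction from adding one more cluster falls below threshold."""
    steps = zip(n_clusters, costs, costs[1:])
    for k_prev, prev_cost, next_cost in steps:
        relative_drop = (prev_cost - next_cost) / prev_cost
        if relative_drop < threshold:
            return k_prev
    # The cost never flattened within the tested range.
    return n_clusters[-1]


# Made-up costs for illustration only.
example_k = [2, 3, 4, 5, 6]
example_costs = [1000.0, 700.0, 500.0, 470.0, 450.0]

suggested = suggest_k(example_k, example_costs)
print(suggested)
```

Here the step from 4 to 5 clusters reduces the cost by only 6%, so the heuristic suggests 4. It is a sanity check for the plot, not a replacement for looking at it.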
Clusters with K-Prototypes (categorical Modes and continuous Means)
  • Group/Cluster 1 — 1,371
  • Group 2 — 2,314
  • Group 3 — 1,528
  • Group 4 — 1,819
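A cluster summary of this kind (size, categorical mode, continuous mean) can be computed from the cluster labels along the following lines. This is a minimal standard-library sketch. The `records` and the `profile_clusters` helper are hypothetical and only show the shape of the calculation, with one categorical and one numeric feature.

```python
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical records: (cluster label, a categorical value, a numeric value).
records = [
    (0, "DSL", 20.0),
    (0, "DSL", 30.0),
    (1, "Fiber optic", 80.0),
    (1, "Fiber optic", 90.0),
    (1, "DSL", 70.0),
]


def profile_clusters(records):
    """Summarize each cluster by size, categorical mode, and numeric mean."""
    grouped = defaultdict(list)
    for label, cat_value, num_value in records:
        grouped[label].append((cat_value, num_value))

    profile = {}
    for label, members in sorted(grouped.items()):
        cats = [cat for cat, _ in members]
        nums = [num for _, num in members]
        profile[label] = {
            "size": len(members),
            "mode": Counter(cats).most_common(1)[0][0],
            "mean": mean(nums),
        }
    return profile


profile = profile_clusters(records)
print(profile)
```

With pandas, the same summary is a groupby on the cluster label followed by an aggregation per column.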
