k-Means Clustering Explorer

Segmentation tool

Explore customer or campaign segments using k-means clustering. Upload numeric data or use a demo dataset, choose how many clusters to search for, and see how different values of k change the group structure.

OVERVIEW & OBJECTIVE

k-means is an exploratory technique that partitions observations into k clusters so that points in the same cluster are close together and points in different clusters are far apart. It requires numeric features and works best when clusters are roughly spherical in the feature space.

$$\min_{\{C_j,\ \mu_j\}} \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2$$

where \(C_j\) is the set of observations assigned to cluster \(j\) and \(\mu_j\) is that cluster’s centroid (mean vector). This tool lets you adjust the number of clusters and see within-cluster variation, elbow and silhouette diagnostics, and plain-language summaries.
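The objective above is typically minimized (locally) with Lloyd's algorithm, which alternates between assigning each point to its nearest centroid and recomputing each centroid as the mean of its assigned points. Here is a minimal numpy sketch of that loop, with random restarts; it illustrates the idea and is not the tool's actual implementation:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid updates
    to (locally) minimize the within-cluster sum of squares (WCSS)."""
    rng = np.random.default_rng(seed)
    # initialize centroids as k distinct random observations
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each point to its nearest centroid (Euclidean distance)
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # recompute each centroid as the mean of its points;
        # keep the old centroid if a cluster happens to be empty
        new_centroids = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    wcss = ((X - centroids[labels]) ** 2).sum()
    return labels, centroids, wcss

# two well-separated synthetic blobs; keep the best of 5 random restarts,
# since a single run can get stuck in a poor local optimum
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
labels, centroids, wcss = min((kmeans(X, k=2, seed=s) for s in range(5)),
                              key=lambda r: r[2])
```

Because k-means only finds local optima, running several random initializations and keeping the lowest-WCSS solution (as above) is standard practice.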

Additional notes & assumptions

k-means assumes numeric, reasonably scaled features. It is sensitive to outliers and feature scales, so consider standardizing variables (mean 0, sd 1) before clustering. Results are exploratory and should be interpreted alongside business context, not as definitive “truth.”
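Standardization itself is a simple column-wise transformation. A small numpy sketch (the tool's own scaling option may differ in details, such as how it handles constant columns):

```python
import numpy as np

def standardize(X):
    """Scale each column to mean 0 and sd 1 so that no feature
    dominates the distance calculation purely by its units."""
    mu = X.mean(axis=0)
    sd = X.std(axis=0)
    sd[sd == 0] = 1.0  # guard against constant columns (zero variance)
    return (X - mu) / sd

# illustrative: revenue in dollars vs. a 1-5 satisfaction score;
# without scaling, revenue would dominate every distance
X = np.array([[120000.0, 4.2], [95000.0, 1.1], [40000.0, 4.8]])
Z = standardize(X)
```

After this transformation both columns contribute comparably to Euclidean distances, so the satisfaction score can actually influence cluster membership.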

MARKETING SCENARIOS

Use presets to load example segmentation datasets (such as customers with revenue and engagement metrics or campaigns with spend and response). You can download the scenario data to tweak it in Excel or Numbers and then re-upload.

INPUTS & SETTINGS

Load data & choose features

Upload numeric CSV data

Provide a header row and numeric columns only (up to 5,000 observations). After upload you can choose which variables to include in the clustering and which to plot on the chart.

Drop a CSV/TSV here, or browse for a file.

Use a demo segmentation dataset

Load a synthetic marketing dataset with multiple clusters (for example, low/medium/high value customers) to explore the tool without uploading your own file.


Feature selection

Choose which numeric columns are used to form clusters.

Plot axes

Pick two variables for the scatterplot (x and y).

Preprocessing & clustering

Additional info & guidance

Start with a small range of clusters (for example, 2–8) and look at the elbow and silhouette charts to decide on a reasonable value of k. Extremely large k can overfit noise and create tiny clusters that are hard to interpret.

You can change the standardization option to see how feature scaling impacts cluster assignments, which is especially important when some variables are on very different scales (for example, annual revenue vs. satisfaction scores).
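The elbow sweep described above amounts to running k-means once per candidate k and recording the total WCSS. A numpy sketch under those assumptions (a compact Lloyd's loop with random restarts, not the tool's actual code):

```python
import numpy as np

def kmeans_wcss(X, k, n_iter=50, seed=0):
    """Run Lloyd's algorithm and return the total within-cluster SS."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        labels = np.linalg.norm(X[:, None] - centroids[None], axis=2).argmin(axis=1)
        centroids = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                              else centroids[j] for j in range(k)])
    # centroids are the means of the final assignment, so this is the WCSS
    return ((X - centroids[labels]) ** 2).sum()

# three synthetic blobs; the elbow should appear near k = 3
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(c, 0.4, (40, 2)) for c in (0.0, 4.0, 8.0)])

# best of 5 restarts per k, sweeping the suggested small range
wcss_by_k = {k: min(kmeans_wcss(X, k, seed=s) for s in range(5))
             for k in range(2, 9)}
```

On data like this, the drop in WCSS from k = 2 to k = 3 is large while the drop from k = 3 to k = 4 is small: that diminishing return is the "elbow".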

VISUAL OUTPUT

Cluster map (scatterplot)

Each point is an observation colored by cluster membership, plotted using the selected x and y variables. Centroids are shown as larger markers. The underlying clustering uses all selected features (after any chosen scaling), but the axes on this chart always show the variables in their original units. This means the map is a 2D projection: clusters that overlap visually on these axes may still differ on other variables included in the model.

Elbow chart (within-cluster variation)

Plots total within-cluster sum of squares (WCSS) versus k. Look for an “elbow” where additional clusters give diminishing returns.

Advanced visualization settings
Cluster map dimensions

Subsampling can make dense plots easier to read when you have many observations. When you show only a random subset of points, the clusters and centroids are still based on the full dataset.
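The subsampling idea can be sketched in a few lines of numpy: cluster labels come from the full dataset, and only a random subset of row indices is drawn for plotting (the arrays below are stand-ins for real data and real cluster labels):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 2))       # stand-in for the full dataset
labels = (X[:, 0] > 0).astype(int)   # stand-in for full-data cluster labels

# draw 500 distinct row indices; clustering itself is untouched
show = rng.choice(len(X), size=500, replace=False)
X_plot, labels_plot = X[show], labels[show]
```

Only `X_plot` and `labels_plot` would be handed to the chart; the centroids and assignments still reflect all 5,000 observations.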

Silhouette diagnostics

Shows the average silhouette score for the current choice of k. Silhouette values range from -1 to 1: values near 1 indicate observations are much closer to their own cluster than to others (clear, well-separated segments), values near 0 indicate overlapping or ambiguous boundaries, and negative values suggest some observations fit better in a different cluster.
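The silhouette definition behind this panel can be written out directly: for each point, compare its mean distance to its own cluster (a) with its mean distance to the nearest other cluster (b). A self-contained numpy sketch of that standard formula (not the tool's internal code):

```python
import numpy as np

def silhouette(X, labels):
    """Mean silhouette over all points: s_i = (b_i - a_i) / max(a_i, b_i),
    where a_i is the mean distance to the point's own cluster (excluding
    itself) and b_i the mean distance to the nearest other cluster."""
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)  # pairwise distances
    scores = []
    for i in range(len(X)):
        own = labels == labels[i]
        a = D[i, own].sum() / max(own.sum() - 1, 1)  # exclude the point itself
        b = min(D[i, labels == c].mean() for c in set(labels.tolist())
                if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# two tight, well-separated groups -> average silhouette close to 1
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.2, (30, 2)), rng.normal(6, 0.2, (30, 2))])
labels = np.array([0] * 30 + [1] * 30)
score = silhouette(X, labels)
```

Shuffling those same labels randomly would push the score toward 0, which is exactly the "overlapping or ambiguous boundaries" reading described above.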

CLUSTER SUMMARY

Number of clusters (k):
Total within-cluster SS:
Average silhouette (k):
Largest cluster size:

APA-Style Report

After you run k-means, this panel will describe the clustering solution in a more formal statistical style, including the number of clusters, within-cluster variation, and basic diagnostics.

Managerial Interpretation

This panel translates the clustering into plain language, highlighting how many customer or campaign segments were found, how they differ on key variables, and how actionable the segments appear.

Cluster profile table

Use this table to understand how each segment differs in practical terms. The feature means describe the typical values within a cluster (for example, average spend or average engagement level). The within-cluster standard deviations show how variable each feature is inside that segment—small values mean members are similar on that metric, while large values mean the segment is more heterogeneous. The average distance to the centroid summarizes the overall tightness of the segment across all features; smaller distances indicate more internally consistent, well-defined clusters that are easier to describe and target.

Cluster | Size | Feature means | Within-cluster standard deviations | Avg distance to centroid
Run clustering to see cluster profiles here.
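The quantities in the profile table are all simple per-cluster aggregates. A numpy sketch of how they could be computed (illustrative, not the tool's actual code):

```python
import numpy as np

def cluster_profiles(X, labels, centroids):
    """Per-cluster size, feature means, within-cluster sds, and
    average Euclidean distance to the cluster centroid."""
    profiles = {}
    for j in np.unique(labels):
        pts = X[labels == j]
        profiles[int(j)] = {
            "size": len(pts),
            "means": pts.mean(axis=0),          # typical values in the segment
            "sds": pts.std(axis=0),             # heterogeneity per feature
            "avg_dist": np.linalg.norm(pts - centroids[j], axis=1).mean(),
        }
    return profiles

# toy data: two tight clusters with known centers
X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [4.8, 5.2]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([X[labels == j].mean(axis=0) for j in (0, 1)])
profiles = cluster_profiles(X, labels, centroids)
```

A smaller `avg_dist` marks a tighter, more internally consistent segment, matching the interpretation given above.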

The downloaded file includes your original numeric columns plus a cluster_id for each observation and its distance_to_centroid (how far it sits from the center of its assigned cluster based on the selected features).
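The export format can be reproduced with the standard library's csv module. In this sketch the feature column names (`revenue`, `engagement`) and the sample values are illustrative, not the tool's actual schema beyond the documented `cluster_id` and `distance_to_centroid` columns:

```python
import csv
import numpy as np

# stand-in data, labels, and centroids from a finished clustering run
X = np.array([[10.0, 3.0], [12.0, 2.5], [50.0, 9.0]])
labels = np.array([0, 0, 1])
centroids = np.array([[11.0, 2.75], [50.0, 9.0]])

# distance from each observation to the center of its assigned cluster
dist = np.linalg.norm(X - centroids[labels], axis=1)

with open("segments.csv", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["revenue", "engagement", "cluster_id", "distance_to_centroid"])
    for row, cid, d in zip(X, labels, dist):
        w.writerow([*row, int(cid), round(float(d), 4)])
```

Each output row is one observation: its original numeric values, the cluster it was assigned to, and how far it sits from that cluster's centroid.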

DIAGNOSTICS & ASSUMPTIONS


The tool will flag potential issues such as very small clusters, extreme outliers, or highly imbalanced cluster sizes. Use these diagnostics to decide whether to simplify the feature set, change k, or reconsider using k-means for this dataset.
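Checks like these reduce to simple tests on cluster sizes. A numpy sketch with illustrative thresholds (the tool's actual cutoffs may differ):

```python
import numpy as np

def diagnostics(labels, min_frac=0.05, imbalance_ratio=10):
    """Flag very small clusters and heavily imbalanced cluster sizes.
    The thresholds here are illustrative, not the tool's defaults."""
    sizes = np.bincount(labels)
    flags = []
    if (sizes < min_frac * len(labels)).any():
        flags.append("very small cluster")
    if sizes.max() / max(sizes.min(), 1) > imbalance_ratio:
        flags.append("imbalanced cluster sizes")
    return flags

# 97 observations in one cluster vs. 3 in the other -> both flags fire
labels = np.array([0] * 97 + [1] * 3)
flags = diagnostics(labels)
```

A 3-member "segment" is rarely actionable on its own; flags like these are a prompt to merge clusters, lower k, remove outliers, or reconsider the feature set.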