Pipeline Stages
Data Loader
Loads the Iris dataset (150 samples, 4 features)
Standard Scaler
Normalizes features to zero mean, unit variance
K-Means
Partitions data into 3 clusters using k-means++ initialization
Cluster Plot
2D PCA scatter plot colored by cluster assignment
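The four stages above map directly onto scikit-learn. A minimal sketch of the workflow (an illustrative reconstruction, not necessarily the editor's exact implementation):

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Data Loader: Iris (150 samples, 4 features)
X = load_iris().data

# Standard Scaler: zero mean, unit variance per feature
X_scaled = StandardScaler().fit_transform(X)

# K-Means: 3 clusters with k-means++ initialization (scikit-learn's default)
kmeans = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=42)
labels = kmeans.fit_predict(X_scaled)

# Cluster Plot: project to 2D with PCA for visualization
# (e.g. plot X_2d with matplotlib, colored by `labels`)
X_2d = PCA(n_components=2).fit_transform(X_scaled)
```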
Overview
This pipeline takes the raw Iris measurements through a complete clustering workflow: first normalizing features with StandardScaler so that each dimension contributes equally to distance calculations, then applying K-Means to discover natural groupings in the data, and finally visualizing the results in a 2D scatter plot of PCA-reduced dimensions.
Why This Pipeline Works
StandardScaler is essential before K-Means because the algorithm uses Euclidean distance — without scaling, features with larger ranges dominate the distance calculation. K-Means with k-means++ initialization converges faster and produces more stable clusters than random initialization.
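The effect of scaling is easy to demonstrate on synthetic data: when one feature spans a much larger range, it swamps the Euclidean distance until both features are standardized (the ranges below are illustrative, not from the Iris data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Feature 0 spans [0, 1]; feature 1 spans [0, 1000].
X = np.column_stack([rng.uniform(0, 1, 100), rng.uniform(0, 1000, 100)])

# Unscaled: the distance is driven almost entirely by feature 1
d_raw = np.linalg.norm(X[0] - X[1])

# After StandardScaler, both features have zero mean and unit variance,
# so each contributes comparably to the distance
X_s = StandardScaler().fit_transform(X)
d_scaled = np.linalg.norm(X_s[0] - X_s[1])
```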
Expected Output
The pipeline produces a cluster assignment for each sample, a 2D scatter plot colored by cluster, and internal validation metrics (Silhouette Score, Calinski-Harabasz Index, Davies-Bouldin Index) that quantify how well separated the clusters are.
Evaluation Metrics
Silhouette Score
measures how similar each point is to its own cluster vs. the nearest cluster (-1 to 1, higher is better)
Calinski-Harabasz Index
ratio of between-cluster to within-cluster variance (higher is better)
Davies-Bouldin Index
average similarity between each cluster and its most similar one (lower is better)
Inertia
sum of squared distances to cluster centers (lower indicates tighter clusters)
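All four metrics are available directly in scikit-learn; a minimal sketch of computing them on the scaled Iris data (inertia comes from the fitted K-Means model itself):

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import (
    silhouette_score,
    calinski_harabasz_score,
    davies_bouldin_score,
)

X = StandardScaler().fit_transform(load_iris().data)
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

sil = silhouette_score(X, km.labels_)        # -1 to 1, higher is better
ch = calinski_harabasz_score(X, km.labels_)  # higher is better
db = davies_bouldin_score(X, km.labels_)     # lower is better
inertia = km.inertia_                        # lower indicates tighter clusters
```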
Ready to try it?
Load this blueprint into the interactive pipeline editor and run it on sample data — no setup required.
Try this Blueprint