Pipeline Stages
Data Loader
Loads the Iris dataset (150 samples, 4 features)
Standard Scaler
Normalizes features to zero mean, unit variance
K-Means
Partitions data into 3 clusters using k-means++ initialization
Cluster Plot
2D PCA scatter plot colored by cluster assignment
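The four stages above map directly onto scikit-learn. A minimal sketch of the workflow (an illustrative reconstruction, not necessarily the editor's exact implementation):

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Data Loader: Iris (150 samples, 4 features)
X = load_iris().data

# Standard Scaler: zero mean, unit variance per feature
X_scaled = StandardScaler().fit_transform(X)

# K-Means: 3 clusters with k-means++ initialization (scikit-learn's default)
kmeans = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=42)
labels = kmeans.fit_predict(X_scaled)

# Cluster Plot: project to 2D with PCA for visualization
# (e.g. plot X_2d with matplotlib, colored by `labels`)
X_2d = PCA(n_components=2).fit_transform(X_scaled)
```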
Overview
This pipeline takes the raw Iris measurements through a complete clustering workflow: first normalizing features with StandardScaler so that each dimension contributes equally to distance calculations, then applying K-Means to discover natural groupings in the data, and finally visualizing the results in a 2D scatter plot of PCA-reduced dimensions.
Why This Pipeline Works
StandardScaler is essential before K-Means because the algorithm uses Euclidean distance — without scaling, features with larger ranges dominate the distance calculation. K-Means with k-means++ initialization converges faster and produces more stable clusters than random initialization.
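The effect of scaling is easy to demonstrate on synthetic data: when one feature spans a much larger range, it swamps the Euclidean distance until both features are standardized (the ranges below are illustrative, not from the Iris data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Feature 0 spans [0, 1]; feature 1 spans [0, 1000].
X = np.column_stack([rng.uniform(0, 1, 100), rng.uniform(0, 1000, 100)])

# Unscaled: the distance is driven almost entirely by feature 1
d_raw = np.linalg.norm(X[0] - X[1])

# After StandardScaler, both features have zero mean and unit variance,
# so each contributes comparably to the distance
X_s = StandardScaler().fit_transform(X)
d_scaled = np.linalg.norm(X_s[0] - X_s[1])
```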
Expected Output
The pipeline produces a cluster assignment for each sample, a 2D scatter plot colored by cluster, and internal validation metrics (Silhouette Score, Calinski-Harabasz Index, Davies-Bouldin Index) that quantify how well separated the clusters are.
Evaluation Metrics
Silhouette Score
measures how similar each point is to its own cluster vs. the nearest cluster (-1 to 1, higher is better)
Calinski-Harabasz Index
ratio of between-cluster to within-cluster variance (higher is better)
Davies-Bouldin Index
average similarity between each cluster and its most similar one (lower is better)
Inertia
sum of squared distances to cluster centers (lower indicates tighter clusters)
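All four metrics are available directly in scikit-learn; a minimal sketch of computing them on the scaled Iris data (inertia comes from the fitted K-Means model itself):

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import (
    silhouette_score,
    calinski_harabasz_score,
    davies_bouldin_score,
)

X = StandardScaler().fit_transform(load_iris().data)
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

sil = silhouette_score(X, km.labels_)        # -1 to 1, higher is better
ch = calinski_harabasz_score(X, km.labels_)  # higher is better
db = davies_bouldin_score(X, km.labels_)     # lower is better
inertia = km.inertia_                        # lower indicates tighter clusters
```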
Ready to try it?
Load this blueprint into the interactive pipeline editor and run it on sample data — no setup required.
Try this Blueprint