- Experience with the specific topic: Novice
- Professional experience: No industry experience
Knowledge of machine learning is not required, but the reader should be familiar with basic data analysis (e.g., descriptive analysis) and the programming language Python. To follow along, download the sample dataset here.
Introduction to K-means Clustering
K-means clustering is a type of unsupervised learning, which is used when you have unlabeled data (i.e., data without defined categories or groups). The goal of this algorithm is to find groups in the data, with the number of groups represented by the variable K. The algorithm works iteratively to assign each data point to one of K groups based on the features that are provided. Data points are clustered based on feature similarity. The results of the K-means clustering algorithm are:
- The centroids of the K clusters, which can be used to label new data
- Labels for the training data (each data point is assigned to a single cluster)
Rather than defining groups before looking at the data, clustering allows you to find and analyze the groups that have formed organically. The "Choosing K" section below describes how the number of groups can be determined.
Each centroid of a cluster is a collection of feature values which define the resulting groups. Examining the centroid feature weights can be used to qualitatively interpret what kind of group each cluster represents.
This introduction to the K-means clustering algorithm covers:
- Common business cases where K-means is used
- The steps involved in running the algorithm
- A Python example using delivery fleet data
The K-means clustering algorithm is used to find groups which have not been explicitly labeled in the data. This can be used to confirm business assumptions about what types of groups exist or to identify unknown groups in complex data sets. Once the algorithm has been run and the groups are defined, any new data can be easily assigned to the correct group.
This is a versatile algorithm that can be used for any type of grouping. Some examples of use cases are:
- Behavioral segmentation:
- Segment by purchase history
- Segment by activities on application, website, or platform
- Define personas based on interests
- Create profiles based on activity monitoring
- Inventory categorization:
- Group inventory by sales activity
- Group inventory by manufacturing metrics
- Sorting sensor measurements:
- Detect activity types in motion sensors
- Group images
- Separate audio
- Identify groups in health monitoring
- Detecting bots or anomalies:
- Separate valid activity groups from bots
- Group valid activity to clean up outlier detection
In addition, monitoring if a tracked data point switches between groups over time can be used to detect meaningful changes in the data.
The K-means clustering algorithm uses iterative refinement to produce a final result. The algorithm inputs are the number of clusters K and the data set. The data set is a collection of features for each data point. The algorithm starts with initial estimates for the K centroids, which can either be randomly generated or randomly selected from the data set. The algorithm then iterates between two steps:
1. Data assignment step:
Each centroid defines one of the clusters. In this step, each data point is assigned to its nearest centroid, based on the squared Euclidean distance. More formally, if c_i is a centroid in the set of centroids C, then each data point x is assigned to a cluster based on

argmin_{c_i ∈ C} dist(c_i, x)²

where dist( · ) is the standard (L2) Euclidean distance. Let S_i denote the set of data points assigned to the ith cluster centroid.
2. Centroid update step:
In this step, the centroids are recomputed. This is done by taking the mean of all data points assigned to that centroid's cluster:

c_i = (1/|S_i|) Σ_{x ∈ S_i} x
The algorithm iterates between steps one and two until a stopping criterion is met (e.g., no data points change clusters, the sum of the distances is minimized, or some maximum number of iterations is reached).
This algorithm is guaranteed to converge to a result. The result may be a local optimum (i.e., not necessarily the best possible outcome), so assessing more than one run of the algorithm with randomized starting centroids may give a better outcome.
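The assignment and update steps described above can be written out directly in NumPy. The following is a minimal sketch for illustration, not a production implementation (in practice you would use scikit-learn's KMeans):

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Minimal K-means sketch: alternate the assignment and update steps."""
    rng = np.random.default_rng(seed)
    # Initial centroids: k points randomly selected from the data set
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.full(len(X), -1)
    for _ in range(max_iters):
        # Assignment step: each point goes to its nearest centroid
        # (squared Euclidean distance)
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # stopping criterion: no data point changed clusters
        labels = new_labels
        # Update step: each centroid becomes the mean of its assigned points
        for i in range(k):
            if np.any(labels == i):
                centroids[i] = X[labels == i].mean(axis=0)
    return centroids, labels
```

Because the result can be a local optimum, a practical wrapper would call this several times with different seeds and keep the run with the smallest total distance.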
The algorithm described above finds the clusters and data set labels for a particular pre-chosen K. To find the number of clusters in the data, the user needs to run the K-means clustering algorithm for a range of K values and compare the results. In general, there is no method for determining the exact value of K, but an accurate estimate can be obtained using the following techniques.
One of the metrics that is commonly used to compare results across different values of K is the mean distance between data points and their cluster centroid. Since increasing the number of clusters will always reduce the distance to data points, increasing K will always decrease this metric, to the extreme of reaching zero when K is the same as the number of data points. Thus, this metric cannot be used as the sole target. Instead, mean distance to the centroid as a function of K is plotted and the "elbow point," where the rate of decrease sharply shifts, can be used to roughly determine K.
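This elbow analysis can be sketched with scikit-learn; the synthetic two-cluster data below is a stand-in for a real dataset:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for a real feature matrix: two well-separated groups
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(8, 1, (100, 2))])

# Mean squared distance to the assigned centroid for each K;
# KMeans.inertia_ is the sum of squared distances, so divide by n
mean_dists = []
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    mean_dists.append(km.inertia_ / len(X))

# The metric always decreases as K grows; plotting mean_dists against K
# reveals the "elbow" where the rate of decrease flattens (here, near K=2)
```

Plotting mean_dists versus K (e.g., with matplotlib) makes the elbow point visually obvious.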
A number of other techniques exist for validating K, including cross-validation, information criteria, the information theoretic jump method, the silhouette method, and the G-means algorithm. In addition, monitoring the distribution of data points across groups provides insight into how the algorithm is splitting the data for each K.
Example: Applying K-Means Clustering to Delivery Fleet Data
As an example, we'll show how the K-means algorithm works with a sample dataset of delivery fleet driver data. For the sake of simplicity, we'll only be looking at two driver features: mean distance driven per day and the mean percentage of time a driver was >5 mph over the speed limit. In general, this algorithm can be used for any number of features, so long as the number of data samples is much greater than the number of features.
Step 1: Clean and Transform Your Data
For this example, we've already cleaned and completed some simple data transformations. A sample of the data is shown below.
The chart below shows the dataset for 4,000 drivers, with the distance feature on the x-axis and speeding feature on the y-axis.
Step 2: Choose K and Run the Algorithm
Start by choosing K=2. For this example, use the Python packages scikit-learn and NumPy for computations as shown below:
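A minimal sketch of that computation follows; the four data rows are illustrative stand-ins for the real dataset (column 0 is mean distance per day, column 1 is the speeding percentage):

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for the cleaned dataset: one row per driver,
# columns = (mean distance per day, % of time >5 mph over the limit)
X = np.array([[45.0, 4.0], [52.0, 6.0], [175.0, 9.0], [185.0, 12.0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(X)

print(kmeans.cluster_centers_)  # one centroid per cluster
print(kmeans.labels_)           # cluster assignment for each driver
```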
The cluster labels for each data point are returned in the fitted model's labels_ attribute.
Step 3: Review the Results
The chart below shows the results. Visually, you can see that the K-means algorithm splits the two groups based on the distance feature. Each cluster centroid is marked with a star.
- Group 1 Centroid = (50, 5.2)
- Group 2 Centroid = (180.3, 10.5)
Using domain knowledge of the dataset, we can infer that Group 1 is urban drivers and Group 2 is rural drivers.
Step 4: Iterate Over Several Values of K
Test how the results look for K=4. To do this, all you need to change is the target number of clusters (the n_clusters argument) passed to the KMeans function.
The chart below shows the resulting clusters. We see that the algorithm has identified four distinct groups; speeding drivers have now been separated from those who follow speed limits, in addition to the rural vs. urban divide. The speeding threshold is lower for the urban driver group than for the rural drivers, likely because urban drivers spend more time in intersections and stop-and-go traffic.
Additional Notes and Alternatives
Feature engineering is the process of using domain knowledge to choose which data metrics to input as features into a machine learning algorithm. Feature engineering plays a key role in K-means clustering; using meaningful features that capture the variability of the data is essential for the algorithm to find all of the naturally occurring groups.
Categorical data (i.e., category labels such as gender, country, or browser type) needs to be encoded or otherwise transformed in a way that the algorithm can still work with.
Feature transformations, particularly to represent rates rather than measurements, can help to normalize the data. For example, in the delivery fleet example above, if total distance driven had been used rather than mean distance per day, then drivers would have been grouped by how long they had been driving for the company rather than rural vs. urban.
A number of alternative clustering algorithms exist, including DBSCAN, spectral clustering, and modeling with Gaussian mixtures. A dimensionality reduction technique, such as principal component analysis, can be used to separate groups of patterns in data. You can read more about alternatives to K-means in this post.
One possible outcome is that there are no organic clusters in the data; instead, all of the data fall along the continuous feature ranges within one single group. In this case, you may need to revisit the data features to see if different measurements need to be included or a feature transformation would better represent the variability in the data. In addition, you may want to impose categories or labels based on domain knowledge and modify your analysis approach.
For more information on K-means clustering, visit the scikit-learn site.
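The feature-engineering notes above (converting cumulative totals to rates, and encoding categorical data) can be sketched in pandas. The column names and values below are hypothetical:

```python
import pandas as pd

# Hypothetical raw driver table; columns and values are illustrative
df = pd.DataFrame({
    "total_distance": [9000.0, 1800.0, 5200.0],
    "days_employed": [60, 10, 40],
    "vehicle_type": ["van", "truck", "van"],
})

# Rate transformation: mean distance per day, so tenure does not
# dominate the clustering the way a cumulative total would
df["mean_distance_per_day"] = df["total_distance"] / df["days_employed"]

# One-hot encode the categorical column into 0/1 numeric features
# that K-means can treat like any other dimension
df = pd.get_dummies(df, columns=["vehicle_type"])
print(df.columns.tolist())
```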
This tutorial examines array and cluster data types and gives you an introduction to creating and manipulating arrays and clusters.
An array, which consists of elements and dimensions, is either a control or an indicator – it cannot contain a mixture of controls and indicators. Elements are the data or values contained in the array. A dimension is the length, height, or depth of an array. Arrays are very helpful when you are working with a collection of similar data and when you want to store a history of repetitive computations.
Array elements are ordered. Each element in an array has a corresponding index value, and you can use the array index to access a specific element in that array. In NI LabVIEW software, the array index is zero-based. This means that if a one-dimensional (1D) array contains n elements, the index range is from 0 to n – 1, where index 0 points to the first element in the array and index n – 1 points to the last element in the array.
Clusters group data elements of mixed types. An example of a cluster is the LabVIEW error cluster, which combines a Boolean value, a numeric value, and a string. A cluster is similar to a record or a struct in text-based programming languages.
Similar to arrays, a cluster is either a control or an indicator and cannot contain a mixture of controls and indicators. The difference between clusters and arrays is that a particular cluster has a fixed size, where a particular array can vary in size. Also, a cluster can contain mixed data types, but an array can contain only one data type.
Creating Array Controls and Indicators
To create an array in LabVIEW, you must place an array shell on the front panel and then place an element, such as a numeric, Boolean, or waveform control or indicator, inside the array shell.
1. Create a new VI.
2. Right-click on the front panel to display the Controls palette.
3. On the Controls palette, navigate to Modern»Array, Matrix, & Cluster and drag the Array shell onto the front panel.
4. On the Controls palette, navigate to Modern»Numeric and drag and drop a numeric indicator inside the Array shell.
5. Place your mouse over the array and drag the right side of the array to expand it and display multiple elements.
The previous steps walked you through creating a 1D array. A 2D array stores elements in a grid or matrix. Each element in a 2D array has two corresponding index values, a row index and a column index. Again, as with a 1D array, the row and column indices of a 2D array are zero-based.
To create a 2D array, you must first create a 1D array and then add a dimension to it. Return to the 1D array you created earlier.
1. On the front panel, right-click the index display and select Add Dimension from the shortcut menu.
2. Place your mouse over the array and drag the corner of the array to expand it and display multiple rows and columns.
Up to this point, the numeric elements of the arrays you have created have been dimmed zeros. A dimmed array element indicates that the element is uninitialized. To initialize an element, click inside the element and replace the dimmed 0 with a number of your choice.
You can initialize elements to whatever value you choose. They do not have to be the same values as those shown above.
Creating Array Constants
You can use array constants to store constant data or as a basis for comparison with another array.
1. On the block diagram, right-click to display the Functions palette.
2. On the Functions palette, navigate to Programming»Array and drag the Array Constant onto the block diagram.
3. On the Functions palette, navigate to Programming»Numeric and drag and drop the Numeric Constant inside the Array Constant shell.
4. Resize the array constant and initialize a few of the elements.
Auto-Indexing
If you wire an array as an input to a for loop, LabVIEW provides the option to automatically set the count terminal of the for loop to the size of the array using the Auto-Indexing feature. You can enable or disable the Auto-Indexing option by right-clicking the loop tunnel wired to the array and selecting Enable Indexing (Disable Indexing).
If you enable Auto-Indexing, each iteration of the for loop is passed the corresponding element of the array.
When you wire a value as the output of a for loop, enabling Auto-Indexing outputs an array. The array is equal in size to the number of iterations executed by the for loop and contains the output values of the for loop.
1. Create a new VI. Navigate to File»New VI.
2. Create and initialize two 1D array constants, containing six numeric elements, on the block diagram similar to the array constants shown below.
3. Create a 1D array of numeric indicators on the front panel. Change the numeric type to a 32-bit integer. Right-click on the array and select Representation»I32.
4. Create a for loop on the block diagram and place an add function inside the for loop.
5. Wire one of the array constants into the for loop and connect it to the x terminal of the add function.
6. Wire the other array constant into the for loop and connect it to the y terminal of the add function.
7. Wire the output terminal of the add function outside the for loop and connect it to the input terminal of the array of numeric indicators.
8. Your final block diagram and front panel should be similar to those shown below.
9. Go to the front panel and run the VI. Note that each element in the array of numeric indicators is populated with the sum of the corresponding elements in the two array constants.
Be aware that if you enable Auto-Indexing on more than one loop tunnel and wire the for loop count terminal, the number of iterations is equal to the smallest of the choices. For example, in the figure below, the for loop count terminal is set to run 15 iterations, Array 1 contains 10 elements, and Array 2 contains 20 elements. If you run the VI in the figure below, the for loop executes 10 times and Array Result contains 10 elements. Try this and see it for yourself.
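Although LabVIEW is graphical, the same smallest-input rule can be illustrated with a short Python analogy (this is not LabVIEW code):

```python
# Analogy only: iterating two sequences together stops at the shortest
# one, just as a for loop with Auto-Indexing enabled on two tunnels runs
# min(len(array_1), len(array_2)) iterations even if the count terminal
# is wired with a larger value.
array_1 = list(range(10))   # 10 elements
array_2 = list(range(20))   # 20 elements

result = [x + y for x, y in zip(array_1, array_2)]
print(len(result))  # the result has only 10 elements
```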
You can create a 2D array using nested for loops and Auto-Indexing as shown below. The outer for loop creates the row elements, and the inner for loop creates the column elements.
Creating Cluster Controls and Indicators
1. Create a new VI.
2. Right-click on the front panel to display the Controls palette.
3. On the Controls palette, navigate to Modern»Array, Matrix, & Cluster and drag the Cluster shell onto the front panel.
4. Resize the Cluster shell so that it is big enough to contain multiple elements.
5. On the Controls palette, navigate to Modern»Numeric and drag and drop a numeric control inside the Cluster shell.
6. On the Controls palette, navigate to Modern»String & Path and drag and drop a String Control inside the Cluster shell.
7. On the Controls palette, navigate to Modern»Boolean and drag and drop a Vertical Toggle Switch inside the Cluster shell.
8. Your cluster should now look similar to the one shown below.
You can now wire the numeric, string, and Boolean controls throughout the block diagram with one wire rather than three separate wires.
Creating Cluster Constants
Similar to array constants, you can use cluster constants to store constant data or as a basis for comparison with another cluster. Create cluster constants the same way you created array constants in the steps discussed earlier.
If you already have a cluster control or indicator and want to make a cluster constant that contains the same data types, make a copy of the cluster control or indicator on the block diagram and then right-click on the copy and select Change to Constant from the shortcut menu.
This tutorial examines four main cluster functions often used to manipulate clusters. These are the Bundle, Unbundle, Bundle By Name, and Unbundle By Name functions.
Use the Bundle function to assemble a cluster from individual elements. To wire elements into the Bundle function, use your mouse to resize the function or right-click on the function and select Add Input from the shortcut menu.
Use the Bundle By Name or the Bundle function to modify an existing cluster. You can resize the Bundle By Name function in the same manner as the Bundle function.
The Bundle By Name function is very useful when modifying existing clusters because it lets you know exactly which cluster element you are modifying. For example, consider a cluster that contains two string elements labeled “String 1” and “String 2.” If you use the Bundle function to modify the cluster, the function terminals appear in the form of pink abc’s. You do not know which terminal modifies “String 1” and which terminal modifies “String 2.”
However, if you use the Bundle By Name function to modify the cluster, the function terminals display the element label so that you know which terminal modifies “String 1” and which terminal modifies “String 2.”
Use the Unbundle function to disassemble a cluster into its individual elements. Use the Unbundle by Name function to return specific cluster elements you specify by name. You can also resize these functions for multiple elements in the same manner as the Bundle and Bundle By Name functions.
Cluster elements have a logical order unrelated to their position in the shell. The first object you place in the cluster is element 0, the second is element 1, and so on. If you delete an element, the order adjusts automatically. The cluster order determines the order in which the elements appear as terminals on the Bundle and Unbundle functions on the block diagram. You can view and modify the cluster order by right-clicking the cluster border and selecting Reorder Controls In Cluster from the shortcut menu.
The white box on each element shows its current place in the cluster order. The black box shows the element’s new place in the order. To set the order of a cluster element, enter the new order number in the Click to set to text box and click the element. The cluster order of the element changes, and the cluster order of other elements automatically adjusts. Save the changes by clicking the Confirm button on the toolbar. Revert to the original order by clicking the Cancel button.