K-Means Clustering in PHP

Unsupervised learning technique to find clusters (subsets of data with similar caracteristics) in unknown data.

Given a random set of numbers, the problem to solve here is to determine a fixed number of intervals that best describe the distribution of the initial dataset. Instead of trying to identify how many clusters exists in the dataset (also possible, but outside the scope of this article), a K-means clustering algorithm will always return a fixed (K) number of subsets.

Given the following data set:

D = [1, 3, 2, 5, 6, 2, 3, 1, 30, 36, 45, 3, 15, 19]

The idea would be to find 3 clusters with the most similar numbers:

S = [[1,1,2,2,3,3,3,7], [15,17], [30,36,45]]

For a human being, given such a small set and number of clusters, it would be relatively easy to determine the intervals that best describe the dataset, however that task becomes nearly impossibe as the dataset grows.

K-Means algorithm (http://en.wikipedia.org/wiki/K-means_algorithm) will allows us to do this task automatically.

It works by identifying the space of possible solutions and, as a first step, it creates each of the K clusters and distributes them randomly (or using an heuristic) throughout the available solution space and recalculates it’s positions on each iteration until there are no more changes and the clusters are finally found.
On each iteration, every value it the dataset is assigned to the closest cluster (which was initially positioned at random) by using a distance (or similarity) function. Once all the values have been assigned to one given cluster, the cluster position is then recalculated so that it is placed in the mean location of the values that are assigned to it.
As the algorithm iterates, values will move from one cluster to the other until there aren’t any more changes at which point the algorithm is finished. Depending on the problem, it might be worth limiting the number of iterations to look out for infinite loops.

For our current problem, because it needs to be integrated in an existing application that our company is currently in the process of improving and it is entirely written in PHP, we will choose PHP as our programming language. You can choose any language you are most comfortable with to implement it, but feel free to download the full version of the PHP code at the end of the article. The code comes with some examples and it is fairly well documented. If you ever need to use it, and email or link back is always welcomed.

<?php
 
/**
* This code was created by Jose Fonseca (josefonseca@blip.pt) 
*
* Please feel free to use it in either commercial or non-comercial applications,
* and if you enjoy using it feel free to let us know or to comment on our
* technical blog at http://code.blip.pt
*/
 
 
print_r(kmeans(array(1, 3, 2, 5, 6, 2, 3, 1, 30, 36, 45, 3, 15, 17), 3));
 
 
/**
* This function takes a array of integers and the number of clusters to create.
* It returns a multidimensional array containing the original data organized
* in clusters.
*
* @param array $data
* @param int $k
*
* @return array
*/
function kmeans($data, $k)
{
        $cPositions = assign_initial_positions($data, $k);
        $clusters = array();
 
        while(true)
        {
                $changes = kmeans_clustering($data, $cPositions, $clusters);
                if(!$changes)
                {
                        return kmeans_get_cluster_values($clusters, $data);
                }
                $cPositions = kmeans_recalculate_cpositions($cPositions, $data, $clusters);
        }
}
 
/**
*
*/
function kmeans_clustering($data, $cPositions, &$clusters)
{
        $nChanges = 0;
        foreach($data as $dataKey => $value)
        {
                $minDistance = null;
                $cluster = null;
                foreach($cPositions as $k => $position)
                {
                        $distance = distance($value, $position);
                        if(is_null($minDistance) || $minDistance > $distance)
                        {
                                $minDistance = $distance;
                                $cluster = $k;
                        }
                }
                if(!isset($clusters[$dataKey]) || $clusters[$dataKey] != $cluster)
                {
                        $nChanges++;
                }
                $clusters[$dataKey] = $cluster;
        }
 
        return $nChanges;
}
 
 
 
 
function kmeans_recalculate_cpositions($cPositions, $data, $clusters)
{
        $kValues = kmeans_get_cluster_values($clusters, $data);
        foreach($cPositions as $k => $position)
        {
                $cPositions[$k] = empty($kValues[$k]) ? 0 : kmeans_avg($kValues[$k]);
        }
        return $cPositions;
}
 
function kmeans_get_cluster_values($clusters, $data)
{
        $values = array();
        foreach($clusters as $dataKey => $cluster)
        {
                $values[$cluster][] = $data[$dataKey];
        }
        return $values;
}
 
 
function kmeans_avg($values)
{
        $n = count($values);
        $sum = array_sum($values);
        return ($n == 0) ? 0 : $sum / $n;
}
 
/**
* Calculates the distance (or similarity) between two values. The closer
* the return value is to ZERO, the more similar the two values are.
*
* @param int $v1
* @param int $v2
*
* @return int
*/
function distance($v1, $v2)
{
  return abs($v1-$v2);
}
 
/**
* Creates the initial positions for the given
* number of clusters and data.
* @param array $data
* @param int $k
*
* @return array
*/
function assign_initial_positions($data, $k)
{
        $min = min($data);
        $max = max($data);
        $int = ceil(abs($max - $min) / $k);
        while($k-- > 0)
        {
                $cPositions[$k] = $min + $int * $k;
        }
        return $cPositions;
}

Output

Array
(
    [0] => Array
        (
            [0] => 1
            [1] => 3
            [2] => 2
            [3] => 5
            [4] => 6
            [5] => 2
            [6] => 3
            [7] => 1
            [8] => 3
        )
 
    [2] => Array
        (
            [0] => 30
            [1] => 36
            [2] => 45
        )
 
    [1] => Array
        (
            [0] => 15
            [1] => 17
        )
 
)

5 Responses to “K-Means Clustering in PHP”


  1. 1 aldo

    thanks for your code

  2. 2 Didac

    You got an small problem in your code, I guess.

    In case one of the clouds does not get any value, this cloud is directly deleted.
    You could ask for 45 clusters, but if in one step you only get 30 clusters with values, your function kmeans_get_cluster_values() will make tha array to have just 30 clusters, and on following steps, you don’t search for 45 clusters anymore. What about the cluster with mean=0?

    What if you want 3 clusters and you have the dataset:
    1,1,1,1,1,1,1,2,20,20,20,20,20

    The result should be 1, 2 and 20, but as you get the initial representative values 1, 10 and 20, in the second step, 1s and the 2 go to 1, the 20s go to 20, and the three cluster array becomes a 2 cluster array, and no more changes are done.

  3. 3 José

    Hi Didac

    Thanks for your comment.

    You are right, the code I posted was removing the clusters with no values assigned to it. This can be fixed by making some changes to the function that recalculates the clusters positions:

    function kmeans_recalculate_cpositions($cPositions, $data, $clusters)
     {
     	$kValues = kmeans_get_cluster_values($clusters, $data);
    -	foreach($kValues as $k => $values)
    +	foreach($cPositions as $k => $position)
     	{
    -		$cPosition[$k] = kmeans_avg($values);
    +		$cPositions[$k] = empty($kValues[$k]) ? rand(min($data), max($data)) : kmeans_avg($kValues[$k]);
     	}
     	return $cPositions;
     }

    In this case, if a cluster is empty I just pick a random position in the data set and wait for the next iteration.

    For the data set 1,1,1,1,1,1,1,2,20,20,20,20,20 finding the clusters 1s, 2 and 20s would be the optimal solution, but with this algorithm you will, most of the time, be stuck with a less-optimal solution which would be the one with only 2 clusters, 1s-2 and 20s.

    To find the optimal solution, you would have to improve the algorithm a lot. You would need to create a function to evaluate the quality of your solution, for example assuming that a solution where all clusters have values is better than a solution where some clusters are empty, and optimize it.
    Simulated annealing, for example, is a technique to optimize such functions - “…At each step, the SA heuristic considers some neighbour s’ of the current state s, and probabilistically decides between moving the system to state s’ or staying in state s. The probabilities are chosen so that the system ultimately tends to move to states of lower energy. Typically this step is repeated until the system reaches a state that is good enough for the application, or until a given computation budget has been exhausted.”

    I hope this helps and thanks for visiting our blog,

    José

  4. 4 Didac

    By the way, even in the function function kmeans_recalculate_cpositions(), there is a variable called $cPosition.

    If you check your code, you will see that you want to recalculate the cPositions, but you do all the changes in $cPosition, and this variable is never returned or moved to $cPositions.

    I guess that the variable called $cPosition should be called $cPositions.

    At least I did that change and now the code runs properly. If doing so, the empty clusters will not really disapear, but they will have the same cPosition value they had before, in case some objects can move to this cluster.

    I tried assigning random initial cPosition values, and sometimes I get repeated values, let’s say 4,4 and 10.

    For the example with the set
    1,1,1,1,1,2,10,10,10,10
    after one iteration, the cPositions become
    1.111, 4 and 10

    Even that’s not the best example, it still gives a change to split the values in 3 clusters.

  5. 5 José

    I think you miss-read the diff in my previous comment, the function to recalculate $cPositions is returning the updated array as it should be:

    function kmeans_recalculate_cpositions($cPositions, $data, $clusters)
    {
            $kValues = kmeans_get_cluster_values($clusters, $data);
            foreach($cPositions as $k => $position)
            {
                    $cPositions[$k] = empty($kValues[$k]) ? 0 : kmeans_avg($kValues[$k]);
            }
            return $cPositions;
    }

    If you copy the updated code you should get the correct results, however for the 1,1,1,1,1,2,20,20,20,20 data set you will still only get 2 clusters most of the time.

Comments are currently closed.