K-Means#

The K-Means algorithm solves clustering problem by partitioning $n$ feature vectors into $k$ clusters minimizing some criterion. Each cluster is characterized by a representative point, called a centroid.

Operation	Computational methods	Programming Interface
Training	Lloyd’s	train(…)	train_input	train_result
Inference	Lloyd’s	infer(…)	infer_input	infer_result

Mathematical formulation#

Training#

Given the training set $X = {x_{1}, \dots, x_{n}}$ of $p$ -dimensional feature vectors and a positive integer $k$ , the problem is to find a set $C = {c_{1}, \dots, c_{k}}$ of $p$ -dimensional centroids that minimize the objective function

Φ_{X} (C) = \sum_{i = 1}^{n} d^{2} (x_{i}, C),

where $d^{2} (x_{i}, C)$ is the squared Euclidean distance from $x_{i}$ to the closest centroid in $C$ ,

d^{2} (x_{i}, C) = min_{1 \leq j \leq k} ‖ x_{i} - c_{j} ‖^{2}, 1 \leq i \leq n .

Expression $‖ \cdot ‖$ denotes $L_{2}$ norm.

Note

In the general case, $d$ may be an arbitrary distance function. Current version of the oneDAL spec defines only Euclidean distance case.

Training method: Lloyd’s#

The Lloyd’s method [Lloyd82] consists in iterative updates of centroids by applying the alternating Assignment and Update steps, where $t$ denotes a index of the current iteration, e.g., $C^{(t)} = {c_{1}^{(t)}, \dots, c_{k}^{(t)}}$ is the set of centroids at the $t$ -th iteration. The method requires the initial centroids $C^{(1)}$ to be specified at the beginning of the algorithm ( $t = 1$ ).

(1) Assignment step: Assign each feature vector $x_{i}$ to the nearest centroid. $y_{i}^{(t)}$ denotes the assigned label (cluster index) to the feature vector $x_{i}$ .

y_{i}^{(t)} = \arg min_{1 \leq j \leq k} ‖ x_{i} - c_{j}^{(t)} ‖^{2}, 1 \leq i \leq n .

Each feature vector from the training set $X$ is assigned to exactly one centroid so that $X$ is partitioned to $k$ disjoint sets (clusters)

S_{j}^{(t)} = {x_{i} \in X : y_{i}^{(t)} = j}, 1 \leq j \leq k .

(2) Update step: Recalculate centroids by averaging feature vectors assigned to each cluster.

c_{j}^{(t + 1)} = \frac{1}{| S_{j}^{(t)} |} \sum_{x \in S_{j}^{(t)}} x, 1 \leq j \leq k .

The steps (1) and (2) are performed until the following stop condition,

\sum_{j = 1}^{k} ‖ c_{j}^{(t)} - c_{j}^{(t + 1)} ‖^{2} < ε,

is satisfied or number of iterations exceeds the maximal value $T$ defined by the user.

Inference#

Given the inference set $X^{'} = {x_{1}^{'}, \dots, x_{m}^{'}}$ of $p$ -dimensional feature vectors and the set $C = {c_{1}, \dots, c_{k}}$ of centroids produced at the training stage, the problem is to predict the index $y_{j}^{'} \in {0, \dots, k - 1}$ , $1 \leq j \leq m$ , of the centroid in accordance with a method-defined rule.

Inference method: Lloyd’s#

Lloyd’s inference method computes the $y_{j}^{'}$ as an index of the centroid closest to the feature vector $x_{j}^{'}$ ,

y_{j}^{'} = \arg min_{1 \leq l \leq k} ‖ x_{j}^{'} - c_{l} ‖^{2}, 1 \leq j \leq m .

Usage example#

Training#

kmeans::model<> run_training(const table& data,
                             const table& initial_centroids) {
   const auto kmeans_desc = kmeans::descriptor<float>{}
      .set_cluster_count(10)
      .set_max_iteration_count(50)
      .set_accuracy_threshold(1e-4);

   const auto result = train(kmeans_desc, data, initial_centroids);

   print_table("labels", result.get_labels());
   print_table("centroids", result.get_model().get_centroids());
   print_value("objective", result.get_objective_function_value());

   return result.get_model();
}

Inference#

table run_inference(const kmeans::model<>& model,
                    const table& new_data) {
   const auto kmeans_desc = kmeans::descriptor<float>{}
      .set_cluster_count(model.get_cluster_count());

   const auto result = infer(kmeans_desc, model, new_data);

   print_table("labels", result.get_labels());
}

Programming Interface#

All types and functions in this section shall be declared in the oneapi::dal::kmeans namespace and be available via inclusion of the oneapi/dal/algo/kmeans.hpp header file.

Descriptor#

template <typename Float = float,
          typename Method = method::by_default,
          typename Task = task::by_default>
class descriptor {
public:
   explicit descriptor(std::int64_t cluster_count = 2);

   int64_t get_cluster_count() const;
   descriptor& set_cluster_count(int64_t);

   int64_t get_max_iteration_count() const;
   descriptor& set_max_iteration_count(int64_t);

   double get_accuracy_threshold() const;
   descriptor& set_accuracy_threshold(double);
};

template<typename Float = float, typename Method = method::by_default, typename Task = task::by_default> class descriptor#

Template Parameters:

Float – The floating-point type that the algorithm uses for intermediate computations. Can be float or double.
Method – Tag-type that specifies an implementation of algorithm. Can be method::lloyd.
Task – Tag-type that specifies the type of the problem to solve. Can be task::clustering.

Constructors

descriptor(std::int64_t cluster_count = 2)#: Creates a new instance of the class with the given cluster_count.

Properties

int64_t max_iteration_count#

The maximum number of iterations $T$ . Default value: 100.

Getter & Setter: int64_t get_max_iteration_count() const

descriptor & set_max_iteration_count(int64_t)
Invariants: max_iteration_count >= 0

int64_t cluster_count#

The number of clusters $k$ . Default value: 2.

Getter & Setter: int64_t get_cluster_count() const

descriptor & set_cluster_count(int64_t)
Invariants: cluster_count > 0

double accuracy_threshold#

The threshold $ε$ for the stop condition. Default value: 0.0.

Getter & Setter: double get_accuracy_threshold() const

descriptor & set_accuracy_threshold(double)
Invariants: accuracy_threshold >= 0.0

Method tags#

namespace method {
   struct lloyd {};
   using by_default = lloyd;
} // namespace method

struct lloyd#: Tag-type that denotes Lloyd’s computational method.

using by_default = lloyd #: Alias tag-type for Lloyd’s computational method.

Task tags#

namespace task {
   struct clustering {};
   using by_default = clustering;
} // namespace task

struct clustering#: Tag-type that parameterizes entities used for solving clustering problem.

using by_default = clustering #: Alias tag-type for the clustering task.

Model#

template <typename Task = task::by_default>
class model {
public:
   model();

   const table& get_centroids() const;

   int64_t get_cluster_count() const;
};

template<typename Task = task::by_default> class model#

Template Parameters:: Task – Tag-type that specifies type of the problem to solve. Can be task::clustering.

Constructors

model()#: Creates a new instance of the class with the default property values.

Public Methods

const table &get_centroids() const#: A $k \times p$ table with the cluster centroids. Each row of the table stores one centroid.

int64_t get_cluster_count() const#: Number of clusters $k$ in the trained model.

Training train(...)#

Input#

template <typename Task = task::by_default>
class train_input {
public:
   train_input(const table& data = table{},
               const table& initial_centroids = table{});

   const table& get_data() const;
   train_input& set_data(const table&);

   const table& get_initial_centroids() const;
   train_input& set_initial_centroids(const table&);
};

template<typename Task = task::by_default> class train_input#

Template Parameters:: Task – Tag-type that specifies type of the problem to solve. Can be task::clustering.

Constructors

train_input(const table &data = table{}, const table &initial_centroids = table{})#: Creates a new instance of the class with the given data and initial_centroids.

Properties

const table &initial_centroids#

A $k \times p$ table with the initial centroids, where each row stores one centroid.

Getter & Setter: const table & get_initial_centroids() const

train_input & set_initial_centroids(const table &)

const table &data#

An $n \times p$ table with the data to be clustered, where each row stores one feature vector.

Getter & Setter: const table & get_data() const

train_input & set_data(const table &)

Result#

template <typename Task = task::by_default>
class train_result {
public:
   train_result();

   const model<Task>& get_model() const;

   const table& get_labels() const;

   int64_t get_iteration_count() const;

   double get_objective_function_value() const;
};

template<typename Task = task::by_default> class train_result#

Template Parameters:: Task – Tag-type that specifies type of the problem to solve. Can be task::clustering.

Constructors

train_result()#: Creates a new instance of the class with the default property values.

Public Methods

const model<Task> &get_model() const#: The trained K-means model.

const table &get_labels() const#: An $n \times 1$ table with the labels $y_{i}$ assigned to the samples $x_{i}$ in the input data, $1 \leq 1 \leq n$ .

int64_t get_iteration_count() const#: The number of iterations performed by the algorithm.

double get_objective_function_value() const#: The value of the objective function $Φ_{X} (C)$ , where $C$ is model.centroids (see kmeans::model::centroids).

Operation#

template <typename Float, typename Method, typename Task>
train_result<Task> train(const descriptor<Float, Method, Task>& desc,
                         const train_input<Task>& input);

template<typename Float, typename Method, typename Task> train_result<Task> train(const descriptor<Float, Method, Task> &desc, const train_input<Task> &input)#

Runs the training operation for K-Means clustering. For more details see oneapi::dal::train.

Template Parameters:

Float – The floating-point type that the algorithm uses for intermediate computations. Can be float or double.
Method – Tag-type that specifies an implementation of algorithm. Can be method::lloyd.
Task – Tag-type that specifies type of the problem to solve. Can be task::clustering.

Parameters:

desc – Descriptor of the algorithm.
input – Input data for the training operation.

Preconditions: input.data.has_data == true

input.initial_centroids.row_count == desc.cluster_count

input.initial_centroids.column_count == input.data.column_count
Postconditions: result.labels.row_count == input.data.row_count

result.labels.column_count == 1

result.labels[i] >= 0

result.labels[i] < desc.cluster_count

result.iteration_count <= desc.max_iteration_count

result.model.centroids.row_count == desc.cluster_count

result.model.centroids.column_count == input.data.column_count

Inference infer(...)#

Input#

template <typename Task = task::by_default>
class infer_input {
public:
   infer_input(const model<Task>& m = model<Task>{},
               const table& data = table{});

   const model<Task>& get_model() const;
   infer_input& set_model(const model<Task>&);

   const table& get_data() const;
   infer_input& set_data(const table&);
};

template<typename Task = task::by_default> class infer_input#

Template Parameters:: Task – Tag-type that specifies type of the problem to solve. Can be task::clustering.

Constructors

infer_input(const model<Task> &m = model<Task>{}, const table &data = table{})#: Creates a new instance of the class with the given model and data.

Properties

const model<Task> &model#

An $n \times p$ table with the data to be assigned to the clusters, where each row stores one feature vector. Default value: model<Task>{}.

Getter & Setter: const model< Task > & get_model() const

infer_input & set_model(const model< Task > &)

const table &data#

The trained K-Means model. Default value: table{}.

Getter & Setter: const table & get_data() const

infer_input & set_data(const table &)

Result#

template <typename Task = task::by_default>
class infer_result {
public:
   infer_result();

   const table& get_labels() const;

   double get_objective_function_value() const;
};

template<typename Task = task::by_default> class infer_result#

Template Parameters:: Task – Tag-type that specifies type of the problem to solve. Can be task::clustering.

Constructors

infer_result()#: Creates a new instance of the class with the default property values.

Public Methods

const table &get_labels() const#: An $n \times 1$ table with assignments labels to feature vectors in the input data.

double get_objective_function_value() const#: The value of the objective function $Φ_{X} (C)$ , where $C$ is defined by the corresponding infer_input::model::centroids.

Operation#

template <typename Float, typename Method, typename Task>
infer_result<Task> infer(const descriptor<Float, Method, Task>& desc,
                         const infer_input<Task>& input);

template<typename Float, typename Method, typename Task> infer_result<Task> infer(const descriptor<Float, Method, Task> &desc, const infer_input<Task> &input)#

Runs the inference operation for K-Means clustering. For more details see oneapi::dal::infer.

Template Parameters:

Float – The floating-point type that the algorithm uses for intermediate computations. Can be float or double.
Method – Tag-type that specifies an implementation of algorithm. Can be method::lloyd.
Task – Tag-type that specifies type of the problem to solve. Can be task::clustering.

Parameters:

desc – Descriptor of the algorithm.
input – Input data for the inference operation.

Preconditions: input.data.has_data == true

input.model.centroids.has_data == true

input.model.centroids.row_count == desc.cluster_count

input.model.centroids.column_count == input.data.column_count
Postconditions: result.labels.row_count == input.data.row_count

result.labels.column_count == 1

result.labels[i] >= 0

result.labels[i] < desc.cluster_count

K-Means

Contents

K-Means#

Mathematical formulation#

Training#

Training method: Lloyd’s#

Inference#

Inference method: Lloyd’s#

Usage example#

Training#

Inference#

Programming Interface#

Descriptor#

Method tags#

Task tags#

Model#

Training train(...)#

Input#

Result#

Operation#

Inference infer(...)#

Input#

Result#

Operation#