K-Means¶
The K-Means algorithm solves clustering problem by partitioning \(n\) feature vectors into \(k\) clusters minimizing some criterion. Each cluster is characterized by a representative point, called a centroid.
Operation |
Computational methods |
Programming Interface |
||
Mathematical formulation¶
Training¶
Given the training set \(X = \{ x_1, \ldots, x_n \}\) of \(p\)-dimensional feature vectors and a positive integer \(k\), the problem is to find a set \(C = \{ c_1, \ldots, c_k \}\) of \(p\)-dimensional centroids that minimize the objective function
where \(d^2(x_i, C)\) is the squared Euclidean distance from \(x_i\) to the closest centroid in \(C\),
Expression \(\|\cdot\|\) denotes \(L_2\) norm.
Note
In the general case, \(d\) may be an arbitrary distance function. Current version of the oneDAL spec defines only Euclidean distance case.
Training method: Lloyd’s¶
The Lloyd’s method [Lloyd82] consists in iterative updates of centroids by applying the alternating Assignment and Update steps, where \(t\) denotes a index of the current iteration, e.g., \(C^{(t)} = \{ c_1^{(t)}, \ldots, c_k^{(t)} \}\) is the set of centroids at the \(t\)-th iteration. The method requires the initial centroids \(C^{(1)}\) to be specified at the beginning of the algorithm (\(t = 1\)).
(1) Assignment step: Assign each feature vector \(x_i\) to the nearest centroid. \(y_i^{(t)}\) denotes the assigned label (cluster index) to the feature vector \(x_i\).
Each feature vector from the training set \(X\) is assigned to exactly one centroid so that \(X\) is partitioned to \(k\) disjoint sets (clusters)
(2) Update step: Recalculate centroids by averaging feature vectors assigned to each cluster.
The steps (1) and (2) are performed until the following stop condition,
is satisfied or number of iterations exceeds the maximal value \(T\) defined by the user.
Inference¶
Given the inference set \(X' = \{ x_1', \ldots, x_m' \}\) of \(p\)-dimensional feature vectors and the set \(C = \{ c_1, \ldots, c_k \}\) of centroids produced at the training stage, the problem is to predict the index \(y_j' \in \{ 0, \ldots, k-1 \}\), \(1 \leq j \leq m\), of the centroid in accordance with a method-defined rule.
Inference method: Lloyd’s¶
Lloyd’s inference method computes the \(y_j'\) as an index of the centroid closest to the feature vector \(x_j'\),
Usage example¶
Training¶
kmeans::model<> run_training(const table& data,
const table& initial_centroids) {
const auto kmeans_desc = kmeans::descriptor<float>{}
.set_cluster_count(10)
.set_max_iteration_count(50)
.set_accuracy_threshold(1e-4);
const auto result = train(kmeans_desc, data, initial_centroids);
print_table("labels", result.get_labels());
print_table("centroids", result.get_model().get_centroids());
print_value("objective", result.get_objective_function_value());
return result.get_model();
}
Inference¶
table run_inference(const kmeans::model<>& model,
const table& new_data) {
const auto kmeans_desc = kmeans::descriptor<float>{}
.set_cluster_count(model.get_cluster_count());
const auto result = infer(kmeans_desc, model, new_data);
print_table("labels", result.get_labels());
}
Programming Interface¶
All types and functions in this section shall be declared in the
oneapi::dal::kmeans
namespace and be available via inclusion of the
oneapi/dal/algo/kmeans.hpp
header file.
Descriptor¶
template <typename Float = float,
typename Method = method::by_default,
typename Task = task::by_default>
class descriptor {
public:
explicit descriptor(std::int64_t cluster_count = 2);
int64_t get_cluster_count() const;
descriptor& set_cluster_count(int64_t);
int64_t get_max_iteration_count() const;
descriptor& set_max_iteration_count(int64_t);
double get_accuracy_threshold() const;
descriptor& set_accuracy_threshold(double);
};
-
template<typename
Float
= float, typenameMethod
= method::by_default, typenameTask
= task::by_default>
classdescriptor
¶ - Template Parameters
Float – The floating-point type that the algorithm uses for intermediate computations. Can be
float
ordouble
.Method – Tag-type that specifies an implementation of algorithm. Can be
method::lloyd
.Task – Tag-type that specifies the type of the problem to solve. Can be
task::clustering
.
Constructors
-
descriptor
(std::int64_t cluster_count = 2)¶ Creates a new instance of the class with the given
cluster_count
.
Properties
-
int64_t
cluster_count
= 2¶ The number of clusters \(k\).
- Getter & Setter
int64_t get_cluster_count() const
descriptor & set_cluster_count(int64_t)
- Invariants
cluster_count > 0
-
int64_t
max_iteration_count
= 100¶ The maximum number of iterations \(T\).
- Getter & Setter
int64_t get_max_iteration_count() const
descriptor & set_max_iteration_count(int64_t)
- Invariants
max_iteration_count >= 0
-
double
accuracy_threshold
= 0.0¶ The threshold \(\varepsilon\) for the stop condition.
- Getter & Setter
double get_accuracy_threshold() const
descriptor & set_accuracy_threshold(double)
- Invariants
accuracy_threshold >= 0.0
Model¶
template <typename Task = task::by_default>
class model {
public:
model();
const table& get_centroids() const;
int64_t get_cluster_count() const;
};
-
template<typename
Task
= task::by_default>
classmodel
¶ - Template Parameters
Task – Tag-type that specifies type of the problem to solve. Can be
task::clustering
.
Constructors
-
model
()¶ Creates a new instance of the class with the default property values.
Properties
-
const table &
centroids
= table{}¶ A \(k \times p\) table with the cluster centroids. Each row of the table stores one centroid.
- Getter & Setter
const table & get_centroids() const
-
int64_t
cluster_count
= 0¶ Number of clusters \(k\) in the trained model.
- Getter & Setter
int64_t get_cluster_count() const
- Invariants
cluster_count == centroids.row_count
Training train(...)
¶
Input¶
template <typename Task = task::by_default>
class train_input {
public:
train_input(const table& data = table{},
const table& initial_centroids = table{});
const table& get_data() const;
train_input& set_data(const table&);
const table& get_initial_centroids() const;
train_input& set_initial_centroids(const table&);
};
-
template<typename
Task
= task::by_default>
classtrain_input
¶ - Template Parameters
Task – Tag-type that specifies type of the problem to solve. Can be
task::clustering
.
Constructors
-
train_input
(const table &data = table{}, const table &initial_centroids = table{})¶ Creates a new instance of the class with the given
data
andinitial_centroids
.
Properties
-
const table &
data
¶ An \(n \times p\) table with the data to be clustered, where each row stores one feature vector.
- Getter & Setter
const table & get_data() const
train_input & set_data(const table &)
-
const table &
initial_centroids
¶ A \(k \times p\) table with the initial centroids, where each row stores one centroid.
- Getter & Setter
const table & get_initial_centroids() const
train_input & set_initial_centroids(const table &)
Result¶
template <typename Task = task::by_default>
class train_result {
public:
train_result();
const model<Task>& get_model() const;
const table& get_labels() const;
int64_t get_iteration_count() const;
double get_objective_function_value() const;
};
-
template<typename
Task
= task::by_default>
classtrain_result
¶ - Template Parameters
Task – Tag-type that specifies type of the problem to solve. Can be
task::clustering
.
Constructors
-
train_result
()¶ Creates a new instance of the class with the default property values.
Properties
-
const model<Task> &
model
= model<Task>{}¶ The trained K-means model.
- Getter & Setter
const model< Task > & get_model() const
-
const table &
labels
= table{}¶ An \(n \times 1\) table with the labels \(y_i\) assigned to the samples \(x_i\) in the input data, \(1 \leq 1 \leq n\).
- Getter & Setter
const table & get_labels() const
-
int64_t
iteration_count
= 0¶ The number of iterations performed by the algorithm.
- Getter & Setter
int64_t get_iteration_count() const
- Invariants
iteration_count >= 0
-
double
objective_function_value
¶ The value of the objective function \(\Phi_X(C)\), where \(C\) is
model.centroids
(seekmeans::model::centroids
).- Getter & Setter
double get_objective_function_value() const
- Invariants
objective_function_value >= 0.0
Operation¶
-
template<typename
Float
, typenameMethod
, typenameTask
>
train_result<Task>train
(const descriptor<Float, Method, Task> &desc, const train_input<Task> &input)¶ Runs the training operation for K-Means clustering. For more details see
oneapi::dal::train
.- Template Parameters
Float – The floating-point type that the algorithm uses for intermediate computations. Can be
float
ordouble
.Method – Tag-type that specifies an implementation of algorithm. Can be
method::lloyd
.Task – Tag-type that specifies type of the problem to solve. Can be
task::clustering
.
- Parameters
desc – Descriptor of the algorithm.
input – Input data for the training operation.
- Preconditions
- Postconditions
result.labels.row_count == input.data.row_count
result.labels.column_count == 1
result.labels[i] >= 0
result.labels[i] < desc.cluster_count
result.iteration_count <= desc.max_iteration_count
result.model.centroids.row_count == desc.cluster_count
result.model.centroids.column_count == input.data.column_count
Inference infer(...)
¶
Input¶
template <typename Task = task::by_default>
class infer_input {
public:
infer_input(const model<Task>& m = model<Task>{},
const table& data = table{});
const model<Task>& get_model() const;
infer_input& set_model(const model<Task>&);
const table& get_data() const;
infer_input& set_data(const table&);
};
-
template<typename
Task
= task::by_default>
classinfer_input
¶ - Template Parameters
Task – Tag-type that specifies type of the problem to solve. Can be
task::clustering
.
Constructors
-
infer_input
(const model<Task> &m = model<Task>{}, const table &data = table{})¶ Creates a new instance of the class with the given
model
anddata
.
Properties
-
const model<Task> &
model
= model<Task>{}¶ An \(n \times p\) table with the data to be assigned to the clusters, where each row stores one feature vector.
- Getter & Setter
const model< Task > & get_model() const
infer_input & set_model(const model< Task > &)
-
const table &
data
= table{}¶ The trained K-Means model.
- Getter & Setter
const table & get_data() const
infer_input & set_data(const table &)
Result¶
template <typename Task = task::by_default>
class infer_result {
public:
infer_result();
const table& get_labels() const;
double get_objective_function_value() const;
};
-
template<typename
Task
= task::by_default>
classinfer_result
¶ - Template Parameters
Task – Tag-type that specifies type of the problem to solve. Can be
task::clustering
.
Constructors
-
infer_result
()¶ Creates a new instance of the class with the default property values.
Properties
-
const table &
labels
= table{}¶ An \(n \times 1\) table with assignments labels to feature vectors in the input data.
- Getter & Setter
const table & get_labels() const
-
double
objective_function_value
= 0.0¶ The value of the objective function \(\Phi_X(C)\), where \(C\) is defined by the corresponding
infer_input::model::centroids
.- Getter & Setter
double get_objective_function_value() const
- Invariants
objective_function_value >= 0.0
Operation¶
-
template<typename
Float
, typenameMethod
, typenameTask
>
infer_result<Task>infer
(const descriptor<Float, Method, Task> &desc, const infer_input<Task> &input)¶ Runs the inference operation for K-Means clustering. For more details see
oneapi::dal::infer
.- Template Parameters
Float – The floating-point type that the algorithm uses for intermediate computations. Can be
float
ordouble
.Method – Tag-type that specifies an implementation of algorithm. Can be
method::lloyd
.Task – Tag-type that specifies type of the problem to solve. Can be
task::clustering
.
- Parameters
desc – Descriptor of the algorithm.
input – Input data for the inference operation.
- Preconditions
- Postconditions