k-Nearest Neighbors Classification (k-NN)¶
\(k\)-NN classification algorithm infers the class for the new feature vector by computing majority vote of the \(k\) nearest observations from the training set.
Operation |
Computational methods |
Programming Interface |
|||
Mathematical formulation¶
Training¶
Let \(X = \{ x_1, \ldots, x_n \}\) be the training set of \(p\)-dimensional feature vectors, let \(Y = \{ y_1, \ldots, y_n \}\) be the set of class labels, where \(y_i \in \{ 0, \ldots, c-1 \}\), \(1 \leq i \leq n\). Given \(X\), \(Y\) and the number of nearest neighbors \(k\), the problem is to build a model that allows distance computation between the feature vectors in training and inference sets at the inference stage.
Training method: brute-force¶
The training operation produces the model that stores all the feature vectors from the initial training set \(X\).
Training method: k-d tree¶
The training operation builds a \(k\)-\(d\) tree that partitions the training set \(X\) (for more details, see k-d Tree).
Inference¶
Let \(X' = \{ x_1', \ldots, x_m' \}\) be the inference set of \(p\)-dimensional feature vectors. Given \(X'\), the model produced at the training stage and the number of nearest neighbors \(k\), the problem is to predict the label \(y_j'\) for each \(x_j'\), \(1 \leq j \leq m\), by performing the following steps:
Identify the set \(N(x_j') \subseteq X\) of the \(k\) feature vectors in the training set that are nearest to \(x_j'\) with respect to the Euclidean distance.
Estimate the conditional probability for the \(l\)-th class as the fraction of vectors in \(N(x_j')\) whose labels \(y_j\) are equal to \(l\):
(1)¶\[P_{jl} = \frac{1}{| N(x_j') |} \Big| \big\{ x_r \in N(x_j') : y_r = l \big\} \Big|, \quad 1 \leq j \leq m, \; 0 \leq l < c.\]Predict the class that has the highest probability for the feature vector \(x_j'\):
(2)¶\[y_j' = \mathrm{arg}\max_{0 \leq l < c} P_{jl}, \quad 1 \leq j \leq m.\]
Inference method: brute-force¶
Brute-force inference method determines the set \(N(x_j')\) of the nearest feature vectors by iterating over all the pairs \((x_j', x_i)\) in the implementation defined order, \(1 \leq i \leq n\), \(1 \leq j \leq m\). The final prediction is computed according to the equations (1) and (2).
Inference method: k-d tree¶
K-d tree inference method traverses the \(k\)-\(d\) tree to find feature vectors associated with a leaf node that are closest to \(x_j'\), \(1 \leq j \leq m\). The set \(\tilde{n}(x_j')\) of the currently-known nearest \(k\)-th neighbors is progressively updated during tree traversal. The search algorithm limits exploration of the nodes for which the distance between the \(x_j'\) and respective part of the feature space is not less than the distance between \(x_j'\) and the most distant feature vector from \(\tilde{n}(x_j')\). Once tree traversal is finished, \(\tilde{n}(x_j') \equiv N(x_j')\). The final prediction is computed according to the equations (1) and (2).
Usage example¶
Training¶
knn::model<> run_training(const table& data,
const table& labels) {
const std::int64_t class_count = 10;
const std::int64_t neighbor_count = 5;
const auto knn_desc = knn::descriptor<float>{class_count, neighbor_count};
const auto result = train(knn_desc, data, labels);
return result.get_model();
}
Inference¶
table run_inference(const knn::model<>& model,
const table& new_data) {
const std::int64_t class_count = 10;
const std::int64_t neighbor_count = 5;
const auto knn_desc = knn::descriptor<float>{class_count, neighbor_count};
const auto result = infer(knn_desc, model, new_data);
print_table("labels", result.get_labels());
}
Programming Interface¶
All types and functions in this section shall be declared in the
oneapi::dal::knn
namespace and be available via inclusion of the
oneapi/dal/algo/knn.hpp
header file.
Descriptor¶
template <typename Float = float,
typename Method = method::by_default,
typename Task = task::by_default>
class descriptor {
public:
explicit descriptor(std::int64_t class_count,
std::int64_t neighbor_count);
std::int64_t get_class_count() const;
descriptor& set_class_count(std::int64_t);
std::int64_t get_neighbor_count() const;
descriptor& set_neighbor_count(std::int64_t);
};
-
template<typename
Float
= float, typenameMethod
= method::by_default, typenameTask
= task::by_default>
classdescriptor
¶ - Template Parameters
Float – The floating-point type that the algorithm uses for intermediate computations. Can be
float
ordouble
.Method – Tag-type that specifies an implementation of algorithm. Can be
method::bruteforce
ormethod::kd_tree
.Task – Tag-type that specifies type of the problem to solve. Can be
task::classification
.
Constructors
-
descriptor
(std::int64_t class_count, std::int64_t neighbor_count)¶ Creates a new instance of the class with the given
class_count
andneighbor_count
property values.
Properties
-
std::int64_t
class_count
¶ The number of classes \(c\).
- Getter & Setter
std::int64_t get_class_count() const
descriptor & set_class_count(std::int64_t)
- Invariants
class_count > 1
-
std::int64_t
neighbor_count
¶ The number of neighbors \(k\).
- Getter & Setter
std::int64_t get_neighbor_count() const
descriptor & set_neighbor_count(std::int64_t)
- Invariants
neighbor_count > 0
Model¶
template <typename Task = task::by_default>
class model {
public:
model();
};
-
template<typename
Task
= task::by_default>
classmodel
¶ - Template Parameters
Task – Tag-type that specifies type of the problem to solve. Can be
task::classification
.
Constructors
-
model
()¶ Creates a new instance of the class with the default property values.
Training train(...)
¶
Input¶
template <typename Task = task::by_default>
class train_input {
public:
train_input(const table& data = table{},
const table& labels = table{});
const table& get_data() const;
train_input& set_data(const table&);
const table& get_labels() const;
train_input& set_labels(const table&);
};
-
template<typename
Task
= task::by_default>
classtrain_input
¶ - Template Parameters
Task – Tag-type that specifies type of the problem to solve. Can be
task::classification
.
Constructors
-
train_input
(const table &data = table{}, const table &labels = table{})¶ Creates a new instance of the class with the given
data
andlabels
property values.
Properties
-
const table &
data
= table{}¶ The training set \(X\).
- Getter & Setter
const table & get_data() const
train_input & set_data(const table &)
-
const table &
labels
= table{}¶ Vector of labels \(y\) for the training set \(X\).
- Getter & Setter
const table & get_labels() const
train_input & set_labels(const table &)
Result¶
template <typename Task = task::by_default>
class train_result {
public:
train_result();
const model<Task>& get_model() const;
};
-
template<typename
Task
= task::by_default>
classtrain_result
¶ - Template Parameters
Task – Tag-type that specifies type of the problem to solve. Can be
task::classification
.
Constructors
-
train_result
()¶ Creates a new instance of the class with the default property values.
Properties
Operation¶
-
template<typename
Float
, typenameMethod
, typenameTask
>
train_result<Task>train
(const descriptor<Float, Method, Task> &desc, const train_input<Task> &input)¶ Runs the training operation for \(k\)-NN classifier. For more details see
oneapi::dal::train
.- Template Parameters
Float – The floating-point type that the algorithm uses for intermediate computations. Can be
float
ordouble
.Method – Tag-type that specifies an implementation of algorithm. Can be
method::bruteforce
ormethod::kd_tree
.Task – Tag-type that specifies type of the problem to solve. Can be
task::classification
.
- Parameters
desc – Descriptor of the algorithm.
input – Input data for the training operation.
- Preconditions
Inference infer(...)
¶
Input¶
template <typename Task = task::by_default>
class infer_input {
public:
infer_input(const model<Task>& m = model<Task>{},
const table& data = table{});
const model<Task>& get_model() const;
infer_input& set_model(const model&);
const table& get_data() const;
infer_input& set_data(const table&);
};
-
template<typename
Task
= task::by_default>
classinfer_input
¶ - Template Parameters
Task – Tag-type that specifies type of the problem to solve. Can be
task::classification
.
Constructors
-
infer_input
(const model<Task> &m = model<Task>{}, const table &data = table{})¶ Creates a new instance of the class with the given
model
anddata
property values.
Properties
-
const model<Task> &
model
= model<Task>{}¶ The trained \(k\)-NN model.
- Getter & Setter
const model< Task > & get_model() const
infer_input & set_model(const model &)
-
const table &
data
= table{}¶ The dataset for inference \(X'\).
- Getter & Setter
const table & get_data() const
infer_input & set_data(const table &)
Result¶
template <typename Task = task::by_default>
class infer_result {
public:
infer_result();
const table& get_labels() const;
};
-
template<typename
Task
= task::by_default>
classinfer_result
¶ - Template Parameters
Task – Tag-type that specifies type of the problem to solve. Can be
task::classification
.
Constructors
-
infer_result
()¶ Creates a new instance of the class with the default property values.
Properties
-
const table &
labels
= table{}¶ The predicted labels.
- Getter & Setter
const table & get_labels() const
Operation¶
-
template<typename
Float
, typenameMethod
, typenameTask
>
infer_result<Task>infer
(const descriptor<Float, Method, Task> &desc, const infer_input<Task> &input)¶ Runs the inference operation for \(k\)-NN classifier. For more details see
oneapi::dal::infer
.- Template Parameters
Float – The floating-point type that the algorithm uses for intermediate computations. Can be
float
ordouble
.Method – Tag-type that specifies an implementation of algorithm. Can be
method::bruteforce
ormethod::kd_tree
.Task – Tag-type that specifies type of the problem to solve. Can be
task::classification
.
- Parameters
desc – Descriptor of the algorithm.
input – Input data for the inference operation.
- Preconditions
input.data.has_data == true
- Postconditions