Principal Components Analysis (PCA)¶
Principal Component Analysis (PCA) is an algorithm for exploratory data analysis and dimensionality reduction. PCA transforms a set of feature vectors of possibly correlated features to a new set of uncorrelated features, called principal components. Principal components are the directions of the largest variance, that is, the directions where the data is mostly spread out.
Operation |
Computational methods |
Programming Interface |
|||
Mathematical formulation¶
Training¶
Given the training set \(X = \{ x_1, \ldots, x_n \}\) of \(p\)-dimensional feature vectors and the number of principal components \(r\), the problem is to compute \(r\) principal directions (\(p\)-dimensional eigenvectors [Lang87]) for the training set. The eigenvectors can be grouped into the \(r \times p\) matrix \(T\) that contains one eigenvector in each row.
Training method: Covariance¶
This method uses eigenvalue decomposition of the covariance matrix to compute the principal components of the datasets. The method relies on the following steps:
Computation of the covariance matrix
Computation of the eigenvectors and eigenvalues
Formation of the matrices storing the results
Covariance matrix computation shall be performed in the following way:
Compute the vector-column of sums \(s_i = \sum_{j=1}^n x_{i,j}, \quad 1 \leq i \leq p\).
Compute the cross-product \(P = X^TX - s^Ts\).
Compute the covariance matrix \(\Sigma = \frac{1}{n - 1} P\).
To compute eigenvalues \(\lambda_i\) and eigenvectors \(\upsilon_i\), the implementer can choose an arbitrary method such as [Ping14].
The final step is to sort the set of pairs \((\lambda_i, \upsilon_i)\) in the descending order by \(\lambda_i\) and form the resulting matrix \(T = (\upsilon_{i,1}, \cdots, \upsilon_{i,r}), \quad 1 \leq i \leq p\). Additionally, the means and variances of the initial dataset shall be returned.
Training method: SVD¶
This method uses singular value decomposition of the dataset to compute its principal components. The method relies on the following steps:
Computation of the singular values and singular vectors
Formation of the matrices storing the results
To compute singular values \(\lambda_i\) and singular vectors \(u_i\) and \(v_i\), the implementer can choose an arbitrary method such as [Demmel90].
The final step is to sort the set of pairs \((\lambda_i, v_i)\) in the descending order by \(\lambda_i\) and form the resulting matrix \(T = (v_{i,1}, \cdots, v_{i,r}), \quad 1 \leq i \leq p\). Additionally, the means and variances of the initial dataset shall be returned.
Sign-flip technique¶
Eigenvectors computed by some eigenvalue solvers are not uniquely defined due to sign ambiguity. To get the deterministic result, a sign-flip technique should be applied. One of the sign-flip techniques proposed in [Bro07] requires the following modification of matrix \(T\):
where \(T_i\) is \(i\)-th row, \(T_{ij}\) is the element in the \(i\)-th row and \(j\)-th column, \(\mathrm{sgn}(\cdot)\) is the signum function,
Note
The sign-flip technique described above is an example. oneDAL spec does not require implementation of this sign-flip technique. Implementer can choose an arbitrary technique that modifies the eigenvectors’ signs.
Inference¶
Given the inference set \(X' = \{ x_1', \ldots, x_m' \}\) of \(p\)-dimensional feature vectors and the \(r \times p\) matrix \(T\) produced at the training stage, the problem is to transform \(X'\) to the set \(X'' = \{ x_1'', \ldots, x_m'' \}\), where \(x_{j}''\) is an \(r\)-dimensional feature vector, \(1 \leq j \leq m\).
The feature vector \(x_{j}''\) is computed through applying linear transformation [Lang87] defined by the matrix \(T\) to the feature vector \(x_{j}'\),
Inference methods: Covariance and SVD¶
Covariance and SVD inference methods compute \(x_{j}''\) according to (1).
Usage example¶
Training¶
pca::model<> run_training(const table& data) {
const auto pca_desc = pca::descriptor<float>{}
.set_component_count(5)
.set_deterministic(true);
const auto result = train(pca_desc, data);
print_table("means", result.get_means());
print_table("variances", result.get_variances());
print_table("eigenvalues", result.get_eigenvalues());
print_table("eigenvectors", result.get_eigenvectors());
return result.get_model();
}
Inference¶
table run_inference(const pca::model<>& model,
const table& new_data) {
const auto pca_desc = pca::descriptor<float>{}
.set_component_count(model.get_component_count());
const auto result = infer(pca_desc, model, new_data);
print_table("labels", result.get_transformed_data());
}
Programming Interface¶
All types and functions in this section shall be declared in the
oneapi::dal::pca
namespace and be available via inclusion of the
oneapi/dal/algo/pca.hpp
header file.
Descriptor¶
template <typename Float = float,
typename Method = method::by_default,
typename Task = task::by_default>
class descriptor {
public:
explicit descriptor(std::int64_t component_count = 0);
int64_t get_component_count() const;
descriptor& set_component_count(int64_t);
bool get_deterministic() const;
descriptor& set_deterministic(bool);
};
-
template<typename
Float
= float, typenameMethod
= method::by_default, typenameTask
= task::by_default>
classdescriptor
¶ - Template Parameters
Float – The floating-point type that the algorithm uses for intermediate computations. Can be
float
ordouble
.Method – Tag-type that specifies an implementation of algorithm. Can be
method::cov
ormethod::svd
.Task – Tag-type that specifies type of the problem to solve. Can be
task::dim_reduction
.
Constructors
-
descriptor
(std::int64_t component_count = 0)¶ Creates a new instance of the class with the given
component_count
property value.
Properties
-
int64_t
component_count
= 0¶ The number of principal components \(r\). If it is zero, the algorithm computes the eigenvectors for all features, \(r = p\).
- Getter & Setter
int64_t get_component_count() const
descriptor & set_component_count(int64_t)
- Invariants
component_count >= 0
-
bool
deterministic
= true¶ Specifies whether the algorithm applies the Sign-flip technique. If it is true, the directions of the eigenvectors must be deterministic.
- Getter & Setter
bool get_deterministic() const
descriptor & set_deterministic(bool)
Model¶
template <typename Task = task::by_default>
class model {
public:
model();
const table& get_eigenvectors() const;
int64_t get_component_count() const;
};
-
template<typename
Task
= task::by_default>
classmodel
¶ - Template Parameters
Task – Tag-type that specifies type of the problem to solve. Can be
task::dim_reduction
.
Constructors
-
model
()¶ Creates a new instance of the class with the default property values.
Properties
-
const table &
eigenvectors
= table{}¶ An \(r \times p\) table with the eigenvectors. Each row contains one eigenvector.
- Getter & Setter
const table & get_eigenvectors() const
-
int64_t
component_count
= 0¶ The number of components \(r\) in the trained model.
- Getter & Setter
int64_t get_component_count() const
- Invariants
component_count == eigenvectors.row_count
Training train(...)
¶
Input¶
template <typename Task = task::by_default>
class train_input {
public:
train_input(const table& data = table{});
const table& get_data() const;
train_input& set_data(const table&);
};
-
template<typename
Task
= task::by_default>
classtrain_input
¶ - Template Parameters
Task – Tag-type that specifies type of the problem to solve. Can be
task::dim_reduction
.
Constructors
-
train_input
(const table &data = table{})¶ Creates a new instance of the class with the given
data
property value.
Properties
-
const table &
data
= table{}¶ An \(n \times p\) table with the training data, where each row stores one feature vector.
- Getter & Setter
const table & get_data() const
train_input & set_data(const table &)
Result¶
template <typename Task = task::by_default>
class train_result {
public:
train_result();
const model<Task>& get_model() const;
const table& get_means() const;
const table& get_variances() const;
const table& get_eigenvalues() const;
const table& get_eigenvectors() const;
};
-
template<typename
Task
= task::by_default>
classtrain_result
¶ - Template Parameters
Task – Tag-type that specifies type of the problem to solve. Can be
task::dim_reduction
.
Constructors
-
train_result
()¶ Creates a new instance of the class with the default property values.
Properties
-
const model<Task> &
model
= model<Task>{}¶ The trained PCA model.
- Getter & Setter
const model< Task > & get_model() const
-
const table &
means
= table{}¶ A \(1 \times r\) table that contains the mean values for the first \(r\) features.
- Getter & Setter
const table & get_means() const
-
const table &
variances
= table{}¶ A \(1 \times r\) table that contains the variances for the first \(r\) features.
- Getter & Setter
const table & get_variances() const
-
const table &
eigenvalues
= table{}¶ A \(1 \times r\) table that contains the eigenvalues for for the first \(r\) features.
- Getter & Setter
const table & get_eigenvalues() const
-
const table &
eigenvectors
= table{}¶ An \(r \times p\) table with the eigenvectors. Each row contains one eigenvector.
- Getter & Setter
const table & get_eigenvectors() const
- Invariants
eigenvectors == model.eigenvectors
Operation¶
-
template<typename
Float
, typenameMethod
, typenameTask
>
train_result<Task>train
(const descriptor<Float, Method, Task> &desc, const train_input<Task> &input)¶ Runs the training operation for PCA. For more details, see
oneapi::dal::train
.- Template Parameters
Float – The floating-point type that the algorithm uses for intermediate computations. Can be
float
ordouble
.Method – Tag-type that specifies an implementation of algorithm. Can be
method::cov
ormethod::svd
.Task – Tag-type that specifies type of the problem to solve. Can be
task::dim_reduction
.
- Parameters
desc – Descriptor of the algorithm.
input – Input data for the training operation.
- Preconditions
- Postconditions
result.means.row_count == 1
result.means.column_count == desc.component_count
result.variances.row_count == 1
result.variances.column_count == desc.component_count
result.variances[i] >= 0.0
result.eigenvalues.row_count == 1
result.eigenvalues.column_count == desc.component_count
result.model.eigenvectors.row_count == 1
result.model.eigenvectors.column_count == desc.component_count
Inference infer(...)
¶
Input¶
template <typename Task = task::by_default>
class infer_input {
public:
infer_input(const model<Task>& m = model<Task>{},
const table& data = table{});
const model<Task>& get_model() const;
infer_input& set_model(const model&);
const table& get_data() const;
infer_input& set_data(const table&);
};
-
template<typename
Task
= task::by_default>
classinfer_input
¶ - Template Parameters
Task – Tag-type that specifies type of the problem to solve. Can be
task::dim_reduction
.
Constructors
-
infer_input
(const model<Task> &m = model<Task>{}, const table &data = table{})¶ Creates a new instance of the class with the given
model
anddata
property values.
Properties
-
const model<Task> &
model
= model<Task>{}¶ The trained PCA model.
- Getter & Setter
const model< Task > & get_model() const
infer_input & set_model(const model &)
-
const table &
data
= table{}¶ The dataset for inference \(X'\).
- Getter & Setter
const table & get_data() const
infer_input & set_data(const table &)
Result¶
template <typename Task = task::by_default>
class infer_result {
public:
infer_result();
const table& get_transformed_data() const;
};
-
template<typename
Task
= task::by_default>
classinfer_result
¶ - Template Parameters
Task – Tag-type that specifies type of the problem to solve. Can be
task::dim_reduction
.
Constructors
-
infer_result
()¶ Creates a new instance of the class with the default property values.
Properties
-
const table &
transformed_data
= table{}¶ An \(n \times r\) table that contains data projected to the \(r\) principal components.
- Getter & Setter
const table & get_transformed_data() const
Operation¶
-
template<typename
Float
, typenameMethod
, typenameTask
>
infer_result<Task>infer
(const descriptor<Float, Method, Task> &desc, const infer_input<Task> &input)¶ Runs the inference operation for PCA. For more details see
oneapi::dal::infer
.- Template Parameters
Float – The floating-point type that the algorithm uses for intermediate computations. Can be
float
ordouble
.Method – Tag-type that specifies an implementation of algorithm. Can be
method::cov
ormethod::svd
.Task – Tag-type that specifies type of the problem to solve. Can be
task::dim_reduction
.
- Parameters
desc – Descriptor of the algorithm.
input – Input data for the inference operation.
- Preconditions
- Postconditions