Principal Components Analysis (PCA)#
Principal Component Analysis (PCA) is an algorithm for exploratory data analysis and dimensionality reduction. PCA transforms a set of feature vectors with possibly correlated features into a new set of uncorrelated features, called principal components. Principal components are the directions of largest variance, that is, the directions along which the data is most spread out.
Mathematical formulation#
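Given the training set of feature vectors and the number of principal components, the problem is to compute the principal directions (eigenvectors) of the training set.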
Training method: Covariance#
This method uses eigenvalue decomposition of the covariance matrix to compute the principal components of the dataset. The method relies on the following steps:
Computation of the covariance matrix
Computation of the eigenvectors and eigenvalues
Formation of the matrices storing the results
Covariance matrix computation shall be performed in the following way:
Compute the vector-column of sums
Compute the cross-product
Compute the covariance matrix
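The three steps above can be sketched in plain C++ as follows. This is a minimal illustration, not the oneDAL implementation: the `matrix` alias and the `covariance` function are hypothetical names, and the formulas in the comments are the usual textbook ones for the sample covariance, assumed here rather than quoted from this section.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using matrix = std::vector<std::vector<double>>;

// Sample covariance of an n x p dataset stored row by row.
// Assumed formulas (the usual textbook ones, not quoted from the spec):
//   s_j    = sum_i x_ij
//   c_jk   = sum_i x_ij * x_ik - s_j * s_k / n
//   cov_jk = c_jk / (n - 1)
matrix covariance(const matrix& x) {
    const std::size_t n = x.size();
    assert(n >= 2);  // the n - 1 divisor needs at least two observations
    const std::size_t p = x[0].size();

    // Step 1: vector of column sums.
    std::vector<double> sums(p, 0.0);
    for (const auto& row : x) {
        assert(row.size() == p);
        for (std::size_t j = 0; j < p; ++j) sums[j] += row[j];
    }

    // Step 2: cross-product matrix, corrected by the sums.
    matrix cov(p, std::vector<double>(p, 0.0));
    for (const auto& row : x)
        for (std::size_t j = 0; j < p; ++j)
            for (std::size_t k = 0; k < p; ++k) cov[j][k] += row[j] * row[k];
    for (std::size_t j = 0; j < p; ++j)
        for (std::size_t k = 0; k < p; ++k)
            cov[j][k] -= sums[j] * sums[k] / static_cast<double>(n);

    // Step 3: scale the cross-product into the covariance matrix.
    for (auto& row : cov)
        for (double& v : row) v /= static_cast<double>(n - 1);
    return cov;
}
```

For the dataset `{{1, 2}, {3, 6}, {5, 10}}`, the second feature is twice the first, so the off-diagonal entry equals twice the first variance.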
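Once the eigenvalues and eigenvectors of the covariance matrix are computed, the final step is to sort the set of (eigenvalue, eigenvector) pairs in descending order of eigenvalue and to form the matrices storing the results.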
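The sorting step can be illustrated with a short C++ sketch. It assumes the pairs are (eigenvalue, eigenvector) pairs ordered by descending eigenvalue, which is the usual PCA convention. The `eigenpair` alias and `top_components` function are hypothetical names introduced only for this example, and the eigen-solver itself is out of scope.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// An eigenvalue together with its eigenvector (hypothetical helper type).
using eigenpair = std::pair<double, std::vector<double>>;

// Sorts the pairs by eigenvalue in descending order, keeps the first r,
// and stacks the kept eigenvectors as rows of the result.
std::vector<std::vector<double>> top_components(std::vector<eigenpair> pairs,
                                                std::size_t r) {
    assert(r <= pairs.size());
    // stable_sort keeps the input order of equal eigenvalues.
    std::stable_sort(pairs.begin(), pairs.end(),
                     [](const eigenpair& a, const eigenpair& b) {
                         return a.first > b.first;
                     });
    std::vector<std::vector<double>> rows;
    rows.reserve(r);
    for (std::size_t i = 0; i < r; ++i) rows.push_back(std::move(pairs[i].second));
    return rows;
}
```

Storing one eigenvector per row matches the layout that `get_eigenvectors()` is documented to return.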
Training method: SVD#
This method uses singular value decomposition of the dataset to compute its principal components. The method relies on the following steps:
Computation of the singular values and singular vectors
Formation of the matrices storing the results
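Once the singular values and singular vectors of the dataset are computed, the final step is to sort the set of (singular value, singular vector) pairs in descending order of singular value and to form the matrices storing the results.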
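The link between the SVD method and the covariance method can be made explicit with a short derivation. It assumes a mean-centered $n \times p$ data matrix $X_c$ and the $n - 1$ normalization of the sample covariance; the symbols $X_c$, $U$, $\Sigma$, $V$, $\sigma_i$ and $\lambda_i$ are introduced here for illustration and are not taken from this section.

```latex
X_c = U \Sigma V^{\top}
\quad\Longrightarrow\quad
\frac{1}{n-1}\, X_c^{\top} X_c
  = \frac{1}{n-1}\, V \Sigma^{\top} U^{\top} U \Sigma V^{\top}
  = V \,\frac{\Sigma^{\top}\Sigma}{n-1}\, V^{\top},
\qquad
\lambda_i = \frac{\sigma_i^{2}}{n-1}.
```

Under these assumptions the right singular vectors (columns of $V$) are the eigenvectors of the covariance matrix, and sorting by singular value yields the same order as sorting by eigenvalue.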
Sign-flip technique#
Eigenvectors computed by some eigenvalue solvers are not uniquely defined due to sign ambiguity. To get a deterministic result, a sign-flip technique should be applied. One such technique is proposed in [Bro07].
The sign-flip technique described above is an example. The oneDAL spec does not require implementation of this particular technique; the implementer can choose an arbitrary technique that modifies the eigenvectors’ signs.
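As an illustration of what a sign-flip technique can look like, the sketch below uses one common convention: each eigenvector is flipped so that its largest-magnitude entry is positive. This is an assumed convention chosen for the example, not necessarily the one from [Bro07], and `sign_flip` is a hypothetical name.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Flips the sign of each row (eigenvector) so that its entry with the
// largest absolute value becomes positive. Ties go to the first such entry.
void sign_flip(std::vector<std::vector<double>>& eigenvectors) {
    for (auto& v : eigenvectors) {
        assert(!v.empty());
        std::size_t arg = 0;
        for (std::size_t j = 1; j < v.size(); ++j)
            if (std::abs(v[j]) > std::abs(v[arg])) arg = j;
        if (v[arg] < 0.0)
            for (double& e : v) e = -e;
    }
}
```

Because `v` and `-v` span the same direction, the flip changes no variance or projection magnitude; it only makes the output reproducible across solvers.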
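Given the inference set of feature vectors and the matrix of eigenvectors produced at the training stage, the problem is to transform each feature vector into a new feature vector whose dimension equals the number of principal components.
The transformed feature vector is computed by applying the linear transformation defined by the matrix of eigenvectors to the original feature vector.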
Inference methods: Covariance and SVD#
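Covariance and SVD inference methods compute the transformed feature vectors by applying the linear transformation defined by the eigenvectors of the trained model.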
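As an illustration of the inference step, the sketch below projects one feature vector onto the principal directions with a matrix-vector product. It assumes the eigenvectors are stored one per row, as documented for `get_eigenvectors()`. The `project` function and `matrix` alias are hypothetical names, and the sketch applies no centering because none is described here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using matrix = std::vector<std::vector<double>>;

// Computes y = T * x, where each row of T is one eigenvector.
// The result has one entry per principal component.
std::vector<double> project(const matrix& eigenvectors,
                            const std::vector<double>& x) {
    std::vector<double> y(eigenvectors.size(), 0.0);
    for (std::size_t i = 0; i < eigenvectors.size(); ++i) {
        assert(eigenvectors[i].size() == x.size());
        for (std::size_t j = 0; j < x.size(); ++j)
            y[i] += eigenvectors[i][j] * x[j];
    }
    return y;
}
```

With two eigenvectors of length three, a three-dimensional input is reduced to a two-dimensional output.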
Usage example#
pca::model<> run_training(const table& data) {
    // Default descriptor: float computations, default method and task.
    const auto pca_desc = pca::descriptor<float>{};
    const auto result = train(pca_desc, data);
    print_table("means", result.get_means());
    print_table("variances", result.get_variances());
    print_table("eigenvalues", result.get_eigenvalues());
    print_table("eigenvectors", result.get_eigenvectors());
    return result.get_model();
}

table run_inference(const pca::model<>& model,
                    const table& new_data) {
    const auto pca_desc = pca::descriptor<float>{};
    const auto result = infer(pca_desc, model, new_data);
    print_table("transformed data", result.get_transformed_data());
    // The function is declared to return a table, so return the projection.
    return result.get_transformed_data();
}
Programming Interface#
All types and functions in this section shall be declared in the
namespace and be available via inclusion of the
header file.
template <typename Float = float,
          typename Method = method::by_default,
          typename Task = task::by_default>
class descriptor {
public:
    explicit descriptor(std::int64_t component_count = 0);
    std::int64_t get_component_count() const;
    descriptor& set_component_count(std::int64_t);
    bool get_deterministic() const;
    descriptor& set_deterministic(bool);
};
template<typename Float = float, typename Method = method::by_default, typename Task = task::by_default>
class descriptor#
- Template Parameters:
Float – The floating-point type that the algorithm uses for intermediate computations. Can be float or double.
Method – Tag-type that specifies an implementation of the algorithm. Can be method::cov or method::svd.
Task – Tag-type that specifies the type of the problem to solve. Can be task::dim_reduction.
descriptor(std::int64_t component_count = 0)#
Creates a new instance of the class with the given component_count property value.
int64_t component_count#
The number of principal components. If it is zero, the algorithm computes the eigenvectors for all features. Default value: 0.
- Getter & Setter
int64_t get_component_count() const
descriptor & set_component_count(int64_t)
- Invariants
- component_count >= 0
bool deterministic#
Specifies whether the algorithm applies the Sign-flip technique. If it is true, the directions of the eigenvectors must be deterministic. Default value: true.
- Getter & Setter
bool get_deterministic() const
descriptor & set_deterministic(bool)
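The property semantics above (defaults, chained setters, and the `component_count >= 0` invariant) can be illustrated with a small stand-in class. `descriptor_sketch` is hypothetical and is not the oneDAL `descriptor`; in particular, throwing `std::invalid_argument` when the invariant is violated is an assumption made for this example, since this section does not say how a violation is reported.

```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical stand-in that mirrors the documented properties only.
class descriptor_sketch {
public:
    explicit descriptor_sketch(std::int64_t component_count = 0) {
        set_component_count(component_count);
    }

    std::int64_t get_component_count() const { return component_count_; }

    // Returns *this so that setters can be chained.
    descriptor_sketch& set_component_count(std::int64_t value) {
        if (value < 0)  // invariant: component_count >= 0 (assumed to throw)
            throw std::invalid_argument("component_count must be >= 0");
        component_count_ = value;
        return *this;
    }

    bool get_deterministic() const { return deterministic_; }

    descriptor_sketch& set_deterministic(bool value) {
        deterministic_ = value;
        return *this;
    }

private:
    std::int64_t component_count_ = 0;  // documented default: 0
    bool deterministic_ = true;         // documented default: true
};
```

A chained call such as `descriptor_sketch{}.set_component_count(5).set_deterministic(false)` mirrors how the documented setters, which return `descriptor&`, are meant to compose.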
template <typename Task = task::by_default>
class model {
public:
    const table& get_eigenvectors() const;
    std::int64_t get_component_count() const;
};
template<typename Task = task::by_default>
class model#
- Template Parameters:
Task – Tag-type that specifies the type of the problem to solve. Can be task::dim_reduction.
Creates a new instance of the class with the default property values.
Public Methods
const table &get_eigenvectors() const#
A table with the eigenvectors. Each row contains one eigenvector.
int64_t get_component_count() const#
The number of components in the trained model.
Training train(...)#
template <typename Task = task::by_default>
class train_input {
public:
    train_input(const table& data = table{});
    const table& get_data() const;
    train_input& set_data(const table&);
};
template<typename Task = task::by_default>
class train_input#
- Template Parameters:
Task – Tag-type that specifies the type of the problem to solve. Can be task::dim_reduction.
train_input(const table &data = table{})#
Creates a new instance of the class with the given data property value.
template <typename Task = task::by_default>
class train_result {
public:
    const model<Task>& get_model() const;
    const table& get_means() const;
    const table& get_variances() const;
    const table& get_eigenvalues() const;
    const table& get_eigenvectors() const;
};
template<typename Task = task::by_default>
class train_result#
- Template Parameters:
Task – Tag-type that specifies the type of the problem to solve. Can be task::dim_reduction.
Creates a new instance of the class with the default property values.
Public Methods
template <typename Float, typename Method, typename Task>
train_result<Task> train(const descriptor<Float, Method, Task>& desc,
const train_input<Task>& input);
template<typename Float, typename Method, typename Task>
train_result<Task> train(const descriptor<Float, Method, Task> &desc, const train_input<Task> &input)#
Runs the training operation for PCA. For more details, see oneapi::dal::train.
- Template Parameters:
Float – The floating-point type that the algorithm uses for intermediate computations. Can be float or double.
Method – Tag-type that specifies an implementation of the algorithm. Can be method::cov or method::svd.
Task – Tag-type that specifies the type of the problem to solve. Can be task::dim_reduction.
- Parameters:
desc – Descriptor of the algorithm.
input – Input data for the training operation.
- Preconditions
- Postconditions
- result.means.row_count == 1
- result.means.column_count == desc.component_count
- result.variances.row_count == 1
- result.variances.column_count == desc.component_count
- result.variances[i] >= 0.0
- result.eigenvalues.row_count == 1
- result.eigenvalues.column_count == desc.component_count
- result.model.eigenvectors.row_count == 1
- result.model.eigenvectors.column_count == desc.component_count
Inference infer(...)#
template <typename Task = task::by_default>
class infer_input {
public:
    infer_input(const model<Task>& m = model<Task>{},
                const table& data = table{});
    const model<Task>& get_model() const;
    // model is a class template, so the argument type needs <Task>.
    infer_input& set_model(const model<Task>&);
    const table& get_data() const;
    infer_input& set_data(const table&);
};
template<typename Task = task::by_default>
class infer_input#
- Template Parameters:
Task – Tag-type that specifies the type of the problem to solve. Can be task::dim_reduction.
infer_input(const model<Task> &m = model<Task>{}, const table &data = table{})#
Creates a new instance of the class with the given model and data property values.
template <typename Task = task::by_default>
class infer_result {
public:
    const table& get_transformed_data() const;
};
template<typename Task = task::by_default>
class infer_result#
- Template Parameters:
Task – Tag-type that specifies the type of the problem to solve. Can be task::dim_reduction.
Creates a new instance of the class with the default property values.
Public Methods
template <typename Float, typename Method, typename Task>
infer_result<Task> infer(const descriptor<Float, Method, Task>& desc,
const infer_input<Task>& input);
template<typename Float, typename Method, typename Task>
infer_result<Task> infer(const descriptor<Float, Method, Task> &desc, const infer_input<Task> &input)#
Runs the inference operation for PCA. For more details, see oneapi::dal::infer.
- Template Parameters:
Float – The floating-point type that the algorithm uses for intermediate computations. Can be float or double.
Method – Tag-type that specifies an implementation of the algorithm. Can be method::cov or method::svd.
Task – Tag-type that specifies the type of the problem to solve. Can be task::dim_reduction.
- Parameters:
desc – Descriptor of the algorithm.
input – Input data for the inference operation.
- Preconditions
- Postconditions