Principal Components Analysis (PCA)¶

Principal Component Analysis (PCA) is an algorithm for exploratory data analysis and dimensionality reduction. PCA transforms a set of feature vectors of possibly correlated features to a new set of uncorrelated features, called principal components. Principal components are the directions of the largest variance, that is, the directions where the data is mostly spread out.

Operation	Computational methods		Programming Interface
Training	Covariance	SVD	train(…)	train_input	train_result
Inference	Covariance	SVD	infer(…)	infer_input	infer_result

Mathematical formulation¶

Training¶

Given the training set \(X = \{ x_1, \ldots, x_n \}\) of \(p\)-dimensional feature vectors and the number of principal components \(r\), the problem is to compute \(r\) principal directions (\(p\)-dimensional eigenvectors [Lang87]) for the training set. The eigenvectors can be grouped into the \(r \times p\) matrix \(T\) that contains one eigenvector in each row.

Training method: Covariance¶

This method uses eigenvalue decomposition of the covariance matrix to compute the principal components of the datasets. The method relies on the following steps:

Computation of the covariance matrix
Computation of the eigenvectors and eigenvalues
Formation of the matrices storing the results

Covariance matrix computation shall be performed in the following way:

Compute the vector-column of sums \(s_i = \sum_{j=1}^n x_{i,j}, \quad 1 \leq i \leq p\).
Compute the cross-product \(P = X^TX - s^Ts\).
Compute the covariance matrix \(\Sigma = \frac{1}{n - 1} P\).

To compute eigenvalues \(\lambda_i\) and eigenvectors \(\upsilon_i\), the implementer can choose an arbitrary method such as [Ping14].

The final step is to sort the set of pairs \((\lambda_i, \upsilon_i)\) in the descending order by \(\lambda_i\) and form the resulting matrix \(T = (\upsilon_{i,1}, \cdots, \upsilon_{i,r}), \quad 1 \leq i \leq p\). Additionally, the means and variances of the initial dataset shall be returned.

Training method: SVD¶

This method uses singular value decomposition of the dataset to compute its principal components. The method relies on the following steps:

Computation of the singular values and singular vectors
Formation of the matrices storing the results

To compute singular values \(\lambda_i\) and singular vectors \(u_i\) and \(v_i\), the implementer can choose an arbitrary method such as [Demmel90].

The final step is to sort the set of pairs \((\lambda_i, v_i)\) in the descending order by \(\lambda_i\) and form the resulting matrix \(T = (v_{i,1}, \cdots, v_{i,r}), \quad 1 \leq i \leq p\). Additionally, the means and variances of the initial dataset shall be returned.

Sign-flip technique¶

Eigenvectors computed by some eigenvalue solvers are not uniquely defined due to sign ambiguity. To get the deterministic result, a sign-flip technique should be applied. One of the sign-flip techniques proposed in [Bro07] requires the following modification of matrix \(T\):

\[\hat{T}_i = T_i \cdot \mathrm{sgn}(\max_{1 \leq j \leq p } |{T}_{ij}|), \quad 1 \leq i \leq r,\]

where \(T_i\) is \(i\)-th row, \(T_{ij}\) is the element in the \(i\)-th row and \(j\)-th column, \(\mathrm{sgn}(\cdot)\) is the signum function,

\[\begin{split}\mathrm{sgn}(x) = \begin{cases} -1, & x < 0, \\ 0, & x = 0, \\ 1, & x > 0. \end{cases}\end{split}\]

Note

The sign-flip technique described above is an example. oneDAL spec does not require implementation of this sign-flip technique. Implementer can choose an arbitrary technique that modifies the eigenvectors’ signs.

Inference¶

Given the inference set \(X' = \{ x_1', \ldots, x_m' \}\) of \(p\)-dimensional feature vectors and the \(r \times p\) matrix \(T\) produced at the training stage, the problem is to transform \(X'\) to the set \(X'' = \{ x_1'', \ldots, x_m'' \}\), where \(x_{j}''\) is an \(r\)-dimensional feature vector, \(1 \leq j \leq m\).

The feature vector \(x_{j}''\) is computed through applying linear transformation [Lang87] defined by the matrix \(T\) to the feature vector \(x_{j}'\),

(1)¶\[x_{j}'' = T x_{j}', \quad 1 \leq j \leq m.\]

Inference methods: Covariance and SVD¶

Covariance and SVD inference methods compute \(x_{j}''\) according to (1).

Usage example¶

Training¶

pca::model<> run_training(const table& data) {
   const auto pca_desc = pca::descriptor<float>{}
      .set_component_count(5)
      .set_deterministic(true);

   const auto result = train(pca_desc, data);

   print_table("means", result.get_means());
   print_table("variances", result.get_variances());
   print_table("eigenvalues", result.get_eigenvalues());
   print_table("eigenvectors", result.get_eigenvectors());

   return result.get_model();
}

Inference¶

table run_inference(const pca::model<>& model,
                    const table& new_data) {
   const auto pca_desc = pca::descriptor<float>{}
      .set_component_count(model.get_component_count());

   const auto result = infer(pca_desc, model, new_data);

   print_table("labels", result.get_transformed_data());
}

Programming Interface¶

All types and functions in this section shall be declared in the oneapi::dal::pca namespace and be available via inclusion of the oneapi/dal/algo/pca.hpp header file.

Descriptor¶

template <typename Float = float,
          typename Method = method::by_default,
          typename Task = task::by_default>
class descriptor {
public:
   explicit descriptor(std::int64_t component_count = 0);

   int64_t get_component_count() const;
   descriptor& set_component_count(int64_t);

   bool get_deterministic() const;
   descriptor& set_deterministic(bool);
};

template<typename Float = float, typename Method = method::by_default, typename Task = task::by_default> class descriptor¶

Template Parameters

Float – The floating-point type that the algorithm uses for intermediate computations. Can be float or double.
Method – Tag-type that specifies an implementation of algorithm. Can be method::cov or method::svd.
Task – Tag-type that specifies type of the problem to solve. Can be task::dim_reduction.

Constructors

descriptor(std::int64_t component_count = 0)¶: Creates a new instance of the class with the given component_count property value.

Properties

int64_t component_count = 0¶

The number of principal components \(r\). If it is zero, the algorithm computes the eigenvectors for all features, \(r = p\).

Getter & Setter: int64_t get_component_count() const

descriptor & set_component_count(int64_t)
Invariants: component_count >= 0

bool deterministic = true¶

Specifies whether the algorithm applies the Sign-flip technique. If it is true, the directions of the eigenvectors must be deterministic.

Getter & Setter: bool get_deterministic() const

descriptor & set_deterministic(bool)

Method tags¶

namespace method {
   struct cov {};
   struct svd {};
   using by_default = cov;
} // namespace method

struct cov¶: Tag-type that denotes Covariance computational method.

struct svd¶: Tag-type that denotes SVD computational method.

using by_default = cov ¶: Alias tag-type for Covariance computational method.

Task tags¶

namespace task {
   struct dim_reduction {};
   using by_default = dim_reduction;
} // namespace task

struct dim_reduction¶: Tag-type that parameterizes entities used for solving dimensionality reduction problem.

using by_default = dim_reduction ¶: Alias tag-type for dimensionality reduction task.

Model¶

template <typename Task = task::by_default>
class model {
public:
   model();

   const table& get_eigenvectors() const;

   int64_t get_component_count() const;
};

template<typename Task = task::by_default> class model¶

Template Parameters: Task – Tag-type that specifies type of the problem to solve. Can be task::dim_reduction.

Constructors

model()¶: Creates a new instance of the class with the default property values.

Properties

const table &eigenvectors = table{}¶

An \(r \times p\) table with the eigenvectors. Each row contains one eigenvector.

Getter & Setter: const table & get_eigenvectors() const

int64_t component_count = 0¶

The number of components \(r\) in the trained model.

Getter & Setter: int64_t get_component_count() const
Invariants: component_count == eigenvectors.row_count

Training `train(...)`¶

Input¶

template <typename Task = task::by_default>
class train_input {
public:
   train_input(const table& data = table{});

   const table& get_data() const;
   train_input& set_data(const table&);
};

template<typename Task = task::by_default> class train_input¶

Template Parameters: Task – Tag-type that specifies type of the problem to solve. Can be task::dim_reduction.

Constructors

train_input(const table &data = table{})¶: Creates a new instance of the class with the given data property value.

Properties

const table &data = table{}¶

An \(n \times p\) table with the training data, where each row stores one feature vector.

Getter & Setter: const table & get_data() const

train_input & set_data(const table &)

Result¶

template <typename Task = task::by_default>
class train_result {
public:
   train_result();

   const model<Task>& get_model() const;

   const table& get_means() const;

   const table& get_variances() const;

   const table& get_eigenvalues() const;

   const table& get_eigenvectors() const;
};

template<typename Task = task::by_default> class train_result¶

Template Parameters: Task – Tag-type that specifies type of the problem to solve. Can be task::dim_reduction.

Constructors

train_result()¶: Creates a new instance of the class with the default property values.

Properties

const model<Task> &model = model<Task>{}¶

The trained PCA model.

Getter & Setter: const model< Task > & get_model() const

const table &means = table{}¶

A \(1 \times r\) table that contains the mean values for the first \(r\) features.

Getter & Setter: const table & get_means() const

const table &variances = table{}¶

A \(1 \times r\) table that contains the variances for the first \(r\) features.

Getter & Setter: const table & get_variances() const

const table &eigenvalues = table{}¶

A \(1 \times r\) table that contains the eigenvalues for for the first \(r\) features.

Getter & Setter: const table & get_eigenvalues() const

const table &eigenvectors = table{}¶

An \(r \times p\) table with the eigenvectors. Each row contains one eigenvector.

Getter & Setter: const table & get_eigenvectors() const
Invariants: eigenvectors == model.eigenvectors

Operation¶

template<typename Float, typename Method, typename Task> train_result<Task> train(const descriptor<Float, Method, Task> &desc, const train_input<Task> &input)¶

Runs the training operation for PCA. For more details, see oneapi::dal::train.

Template Parameters

Float – The floating-point type that the algorithm uses for intermediate computations. Can be float or double.
Method – Tag-type that specifies an implementation of algorithm. Can be method::cov or method::svd.
Task – Tag-type that specifies type of the problem to solve. Can be task::dim_reduction.

Parameters

desc – Descriptor of the algorithm.
input – Input data for the training operation.

Preconditions: input.data.has_data == true

input.data.column_count >= desc.component_count
Postconditions: result.means.row_count == 1

result.means.column_count == desc.component_count

result.variances.row_count == 1

result.variances.column_count == desc.component_count

result.variances[i] >= 0.0

result.eigenvalues.row_count == 1

result.eigenvalues.column_count == desc.component_count

result.model.eigenvectors.row_count == 1

result.model.eigenvectors.column_count == desc.component_count

Inference `infer(...)`¶

Input¶

template <typename Task = task::by_default>
class infer_input {
public:
   infer_input(const model<Task>& m = model<Task>{},
               const table& data = table{});

   const model<Task>& get_model() const;
   infer_input& set_model(const model&);

   const table& get_data() const;
   infer_input& set_data(const table&);
};

template<typename Task = task::by_default> class infer_input¶

Template Parameters: Task – Tag-type that specifies type of the problem to solve. Can be task::dim_reduction.

Constructors

infer_input(const model<Task> &m = model<Task>{}, const table &data = table{})¶: Creates a new instance of the class with the given model and data property values.

Properties

const model<Task> &model = model<Task>{}¶

The trained PCA model.

Getter & Setter: const model< Task > & get_model() const

infer_input & set_model(const model &)

const table &data = table{}¶

The dataset for inference \(X'\).

Getter & Setter: const table & get_data() const

infer_input & set_data(const table &)

Result¶

template <typename Task = task::by_default>
class infer_result {
public:
   infer_result();

   const table& get_transformed_data() const;
};

template<typename Task = task::by_default> class infer_result¶

Template Parameters: Task – Tag-type that specifies type of the problem to solve. Can be task::dim_reduction.

Constructors

infer_result()¶: Creates a new instance of the class with the default property values.

Properties

const table &transformed_data = table{}¶

An \(n \times r\) table that contains data projected to the \(r\) principal components.

Getter & Setter: const table & get_transformed_data() const

Operation¶

template<typename Float, typename Method, typename Task> infer_result<Task> infer(const descriptor<Float, Method, Task> &desc, const infer_input<Task> &input)¶

Runs the inference operation for PCA. For more details see oneapi::dal::infer.

Template Parameters

Float – The floating-point type that the algorithm uses for intermediate computations. Can be float or double.
Method – Tag-type that specifies an implementation of algorithm. Can be method::cov or method::svd.
Task – Tag-type that specifies type of the problem to solve. Can be task::dim_reduction.

Parameters

desc – Descriptor of the algorithm.
input – Input data for the inference operation.

Preconditions: input.data.has_data == true

input.model.eigenvectors.row_count == desc.component_count

input.model.eigenvectors.column_count == input.data.column_count
Postconditions: result.transformed_data.row_count == input.data.row_count

result.transformed_data.column_count == desc.component_count

Decomposition Appendix

oneAPI Specification 1.1-provisional-rev-1 documentation

Principal Components Analysis (PCA)¶

Mathematical formulation¶

Training¶

Training method: Covariance¶

Training method: SVD¶

Sign-flip technique¶

Inference¶

Inference methods: Covariance and SVD¶

Usage example¶

Training¶

Inference¶

Programming Interface¶

Descriptor¶

Method tags¶

Task tags¶

Model¶

Training train(...)¶

Input¶

Result¶

Operation¶

Inference infer(...)¶

Input¶

Result¶

Operation¶

Training `train(...)`¶

Inference `infer(...)`¶