Attributes#
The parameters passed to create a primitive descriptor specify the basic problem description: the operation kind, the propagation kind, the input and output tensor descriptors (e.g. strides if applicable…), as well as the engine where the primitive will be executed.
Attributes specify some extra properties of the primitive. Users must create them before use and must set required specifics using the corresponding setters. The attributes are copied during primitive descriptor creation, so users can change or destroy attributes right after that.
If not modified, attributes can stay empty, which is equivalent to the default attributes. Primitive descriptors’ constructors have empty attributes as default parameters, so, unless required, users can simply omit them.
Attributes can also contain post-ops, which are computations executed after the primitive.
Scratchpad Mode#
Some primitives might require a temporary buffer while performing their
computations. For instance, the operations that do not have enough independent
work to utilize all cores on a system might use parallelization over the
reduction dimension (the K dimension in the GEMM notation). In this case
different threads compute partial results in private temporary buffers, and then
the private results are added to produce the final result. Another example is
using matrix multiplication (GEMM) to implement convolution. Before calling
GEMM, the source activations need to be transformed using the im2col
operation. The transformation result is written to a temporary buffer that is
then used as an input for the GEMM.
In both of these examples, the temporary buffer is no longer required once the primitive computation is completed. oneDNN refers to this kind of memory buffer as a scratchpad.
Both types of implementation might need extra space for the reduction in case there are too few independent tasks. The amount of memory required by the im2col transformation is proportional to the size of the source image multiplied by the weights spatial size. The size of a buffer for reduction is proportional to the tensor size to be reduced (e.g., diff_weights in the case of backward by weights) multiplied by the number of threads in the reduction groups (the upper bound is the total number of threads).
By contrast, some other primitives might require very little extra space. For instance, one of the implementations of the dnnl::sum primitive requires temporary space only to store the pointers to the data of each input array (that is, the size of the scratchpad is n * sizeof(void *), where n is the number of summands).
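The size relationships described above can be sketched as plain arithmetic. This is a minimal illustration only: the helper names are hypothetical, the results are proportionality estimates in elements or bytes, and none of it reflects what a particular oneDNN implementation actually allocates (the real size should always be queried from the primitive descriptor).

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: estimate of the im2col buffer size (in elements),
// taken as source image size multiplied by the weights spatial size.
std::size_t im2col_elems(std::size_t ic, std::size_t ih, std::size_t iw,
                         std::size_t kh, std::size_t kw) {
    return ic * ih * iw * kh * kw;
}

// Hypothetical helper: estimate of the reduction buffer size (in elements),
// taken as the reduced tensor size multiplied by the number of threads
// in the reduction group (each thread keeps a private partial result).
std::size_t reduction_elems(std::size_t tensor_elems, std::size_t nthr_reduce) {
    assert(nthr_reduce >= 1);
    return tensor_elems * nthr_reduce;
}

// Hypothetical helper: scratchpad size (in bytes) of a sum implementation
// that only stores one data pointer per summand.
std::size_t sum_scratchpad_bytes(std::size_t n_summands) {
    return n_summands * sizeof(void *);
}
```

For a 3-channel 8x8 image and a 3x3 kernel, `im2col_elems(3, 8, 8, 3, 3)` gives 1728 elements, while `sum_scratchpad_bytes(5)` is just five pointers, which shows how far apart scratchpad requirements can be.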
oneDNN supports two modes for handling scratchpads:
-
enum class dnnl::scratchpad_mode#
Scratchpad mode.
Values:
-
enumerator library#
The library manages the scratchpad allocation. There may be multiple implementation-specific policies that can be configured via mechanisms that fall outside of the scope of this specification.
-
enumerator user#
The user manages the scratchpad allocation by querying and providing the scratchpad memory to primitives. This mode is thread-safe as long as the scratchpad buffers are not used concurrently by two primitive executions.
The scratchpad mode is controlled through the dnnl::primitive_attr::set_scratchpad_mode() primitive attribute.
If the user provides scratchpad memory to a primitive, this memory must be created using the same engine that the primitive uses.
All primitives support both scratchpad modes.
Note
Primitives are not thread-safe by default. The only way to make the primitive execution fully thread-safe is to use the dnnl::scratchpad_mode::user mode and not pass the same scratchpad memory to two primitives that are executed concurrently.
Examples#
Library Manages Scratchpad#
As mentioned above, this is the default behavior. This example only highlights how a user can query the amount of memory a primitive consumes due to its scratchpad.
// Use default attr, hence the library allocates scratchpad
dnnl::primitive::primitive_desc op_pd(params, /* other arguments */);
// Print how much memory would be held by a primitive due to scratchpad
std::cout << "primitive will use "
<< op_pd.query_s64(dnnl::query::memory_consumption_s64)
<< " bytes" << std::endl;
// In this case scratchpad is internal, hence user visible scratchpad memory
// descriptor should be empty:
auto zero_md = dnnl::memory::desc();
assert(op_pd.scratchpad_desc() == zero_md);
User Manages Scratchpad#
// Create empty (default) attributes
dnnl::primitive_attr attr;
// Default scratchpad mode is `library`:
assert(attr.get_scratchpad_mode() == dnnl::scratchpad_mode::library);
// Set scratchpad mode to `user`
attr.set_scratchpad_mode(dnnl::scratchpad_mode::user);
// Create a primitive descriptor with custom attributes
dnnl::primitive::primitive_desc op_pd(op_d, attr, engine);
// Query the scratchpad memory descriptor
dnnl::memory::desc scratchpad_md = op_pd.scratchpad_desc();
// Note that a primitive doesn't consume memory in this configuration:
assert(op_pd.query_s64(dnnl::query::memory_consumption_s64) == 0);
// Create a primitive
dnnl::primitive prim(op_pd);
// ... more code ..
// Create a scratchpad memory
// NOTE: if scratchpad is not required for a particular primitive the
// scratchpad_md.get_size() will return 0. It is fine to have
// scratchpad_ptr == nullptr in this case.
void *scratchpad_ptr = user_memory_manager::allocate(scratchpad_md.get_size());
// NOTE: engine here must match the engine of the primitive
dnnl::memory scratchpad(scratchpad_md, engine, scratchpad_ptr);
// Pass a scratchpad memory to a primitive
prim.execute(stream, { /* other arguments */,
{DNNL_ARG_SCRATCHPAD, scratchpad}});
Quantization#
Primitives may support reduced precision computations which require quantization. This process is explained in more detail in the Quantization Model section.
Quantization Attributes (scales and zero-points)#
oneDNN provides dnnl::primitive_attr::set_scales_mask() and dnnl::primitive_attr::set_zero_points_mask() for setting the quantization parameters for a given argument of a primitive.
The primitives may not support passing quantization parameters if source (and weights) tensors are not of the int8 data type. In other words, convolution operating on the single precision floating point data type may not scale and/or shift its inputs and outputs.
Broadcast semantics for quantization parameters are handled through masks that are explicitly passed to the dnnl::primitive_attr::set_scales_mask() and dnnl::primitive_attr::set_zero_points_mask() methods. For example, if the primitive destination is a \(D_0 \times ... \times D_{n-1}\) tensor and we want to have a scale per \(d_i\) dimension (where \(0 \le d_i < n\)), then \(mask = \sum \limits_{d_i} 2^{d_i}\) and the number of scales should be \(\mathtt{scales.size()} = \prod \limits_{d_i} D_{d_i}\). The mask should be set in attributes during primitive creation, and the actual scales and zero-points are passed as arguments to the primitive execution function.
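The mask formula and the resulting number of scales can be sketched in a few lines. The helper names `make_mask` and `num_scales` are hypothetical and not part of the oneDNN API; the sketch only reproduces the arithmetic stated above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: build the correspondence mask by setting bit d for
// every dimension d that gets its own scale (mask = sum over d of 2^d).
int make_mask(const std::vector<int> &scaled_dims) {
    int mask = 0;
    for (int d : scaled_dims) {
        assert(d >= 0 && d < 31); // keep the shift within the range of int
        mask |= (1 << d);
    }
    return mask;
}

// Hypothetical helper: number of scales implied by a mask, computed as the
// product of the sizes D_d of all dimensions whose bit is set.
std::int64_t num_scales(const std::vector<std::int64_t> &dims, int mask) {
    assert(dims.size() < 31);
    std::int64_t count = 1;
    for (std::size_t d = 0; d < dims.size(); ++d) {
        if (mask & (1 << d)) count *= dims[d];
    }
    return count;
}
```

For a weights tensor with dimensions (64, 32, 3, 3), a per-tensor mask of 0 needs a single scale, a mask of `1 << 0` needs 64 scales, and a mask covering dimensions 0 and 1 needs 64 * 32 = 2048 scales.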
The quantization parameters are applied in the single precision floating point data type (dnnl::memory::data_type::f32). Before it is stored, the result is converted to the destination data type with saturation if required. The rounding happens according to the current hardware setting.
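As a rough illustration of this sequence (apply the scale in f32, round, saturate, store), the sketch below converts one value to a signed 8-bit integer. The function name is hypothetical, dividing by the scale is an assumption that matches a destination-side scale, and `std::nearbyint` is used because it honors the current floating-point rounding mode. An actual implementation may differ in detail.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>

// Hypothetical helper: quantize one f32 value to s8.
// 1. apply the scale in f32 (assumed here: division by a destination scale)
// 2. round according to the current rounding mode
// 3. saturate to the range of the destination data type
std::int8_t quantize_to_s8(float value_f32, float scale) {
    assert(scale != 0.0f);
    const float scaled = value_f32 / scale;
    const float rounded = std::nearbyint(scaled);
    const float lo = std::numeric_limits<std::int8_t>::min();
    const float hi = std::numeric_limits<std::int8_t>::max();
    const float saturated = std::fmin(std::fmax(rounded, lo), hi);
    return static_cast<std::int8_t>(saturated);
}
```

Values that fall outside [-128, 127] after scaling are clamped rather than wrapped, which is what "with saturation" means here.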
When using Post-ops, the same dnnl::primitive_attr::set_scales_mask() and dnnl::primitive_attr::set_zero_points_mask() are used to pass quantization parameters to the arguments of a given post-op.
Example 1: weights quantization with per-output-channel scaling#
// weights dimensions
const int OC, IC, KH, KW;
// original f32 weights in plain format
dnnl::memory::desc wei_plain_f32_md(
{OC, IC, KH, KW}, // dims
dnnl::memory::data_type::f32, // the data originally in f32
dnnl::memory::format_tag::hwio // the plain memory format (4D, no groups)
);
// the scaling factors for quantized weights
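// A unique scale for each output channel.
std::vector<float> wei_scales = { /* OC values */ };
// the scales are wrapped in a dnnl::memory object and passed to the
// reorder at execution time (see DNNL_ARG_ATTR_SCALES)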
// int8 convolution primitive descriptor
dnnl::convolution_forward::primitive_desc conv_pd(/* see the next example */);
// query the convolution weights memory descriptor
dnnl::memory::desc wei_conv_s8_md = conv_pd.weights_desc();
// prepare the attributes for the reorder
dnnl::primitive_attr attr;
const int quantization_mask = 0
| (1 << 0); // scale per OC dimension, which is the dim #0
attr.set_scales_mask(DNNL_ARG_DST, quantization_mask);
// create reorder that would perform:
// wei_s8(oc, ic, kh, kw) <- wei_f32(oc, ic, kh, kw) / scale(oc)
// including the data format conversion.
auto wei_reorder_pd = dnnl::reorder::primitive_desc(
wei_plain_f32_md, engine, // source
wei_conv_s8_md, engine, // destination,
attr);
auto wei_reorder = dnnl::reorder(wei_reorder_pd);
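For reference, the arithmetic that such a reorder performs can be written as a plain loop. This is a sketch under stated assumptions, not the library's implementation: the function name is hypothetical, the weights are assumed to be stored densely in logical (oc, ic, kh, kw) order, and the memory format conversion is left out.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical reference: wei_s8(oc, ic, kh, kw) = wei_f32(oc, ic, kh, kw) / scale(oc)
// with rounding and saturation to the s8 range.
std::vector<std::int8_t> quantize_weights_per_oc(
        const std::vector<float> &wei_f32, const std::vector<float> &scales,
        std::size_t oc, std::size_t ic, std::size_t kh, std::size_t kw) {
    assert(scales.size() == oc); // mask = 1 << 0: one scale per output channel
    assert(wei_f32.size() == oc * ic * kh * kw);
    const std::size_t per_oc = ic * kh * kw; // elements sharing one scale
    std::vector<std::int8_t> wei_s8(wei_f32.size());
    for (std::size_t o = 0; o < oc; ++o) {
        for (std::size_t i = 0; i < per_oc; ++i) {
            const float q = std::nearbyint(wei_f32[o * per_oc + i] / scales[o]);
            const float sat = std::fmin(std::fmax(q, -128.0f), 127.0f);
            wei_s8[o * per_oc + i] = static_cast<std::int8_t>(sat);
        }
    }
    return wei_s8;
}
```

Because the mask selects dimension 0 only, every element of output channel `o` is divided by the same `scales[o]`, regardless of how the tensor is laid out in memory.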
Example 2: convolution with groups, with per-output-channel quantization#
This example is complementary to the previous example (which should ideally be the first one). Let’s say we want to create an int8 convolution with per-output channel scaling.
const float src_scale; // src_f32[:] = src_scale * src_s8[:]
const float dst_scale; // dst_f32[:] = dst_scale * dst_s8[:]
// the scaling factors for quantized weights (as declared above)
// A unique scale for each group and output-channel.
std::vector<float> wei_scales = {...};
// Src, weights, and dst memory descriptors for convolution,
// with memory format tag == any to allow a convolution implementation
// to choose the appropriate memory format
dnnl::memory::desc src_conv_s8_any_md(
{BATCH, IC, IH, IW}, // dims
dnnl::memory::data_type::s8, // the data originally in s8
dnnl::memory::format_tag::any // let the convolution choose
);
dnnl::memory::desc wei_conv_s8_any_md(
{OC, IC, KH, KW}, // dims
dnnl::memory::data_type::s8, // the data originally in s8
dnnl::memory::format_tag::any // let the convolution choose
);
dnnl::memory::desc dst_conv_s8_any_md(...); // ditto
// prepare the attributes for the convolution
dnnl::primitive_attr attr;
const int data_mask = 0; // scale and zero-point per tensor for source and destination
const int wei_mask = 0
| (1 << 0); // scale per OC dimension, which is the dim #0 on weights tensor:
// ( OC, IC, KH, KW)
// 0 1 2 3
attr.set_scales_mask(DNNL_ARG_SRC, data_mask);
attr.set_zero_points_mask(DNNL_ARG_SRC, data_mask);
attr.set_scales_mask(DNNL_ARG_WEIGHTS, wei_mask);
attr.set_scales_mask(DNNL_ARG_DST, data_mask);
attr.set_zero_points_mask(DNNL_ARG_DST, data_mask);
// create a convolution primitive descriptor
auto conv_pd = dnnl::convolution_forward::primitive_desc(
dnnl::prop_kind::forward_inference,
dnnl::algorithm::convolution_direct,
src_conv_s8_any_md, // what's important is that
wei_conv_s8_any_md, // we specified that we want
dst_conv_s8_any_md, // computations in s8
strides, padding_l, padding_r,
dnnl::padding_kind::zero,
attr); // the attributes describe the quantization flow
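To see how the three sets of scales interact, the sketch below rescales one int32 accumulator value into an s8 output. It is a derivation from the conventions in the comments above (`src_f32 = src_scale * src_s8`, `dst_f32 = dst_scale * dst_s8`, one weights scale per output channel) and assumes all zero points are 0. The function name is hypothetical and this is not a description of how any oneDNN implementation is structured.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical helper: convert an s32 accumulator (sum of src_s8 * wei_s8
// products for one output element of channel oc) into an s8 output value.
// Assumes zero points of 0 for source and destination.
std::int8_t requantize(std::int32_t acc_s32, float src_scale,
                       float wei_scale_oc, float dst_scale) {
    assert(dst_scale != 0.0f);
    // Dequantize: the real-valued result represented by the accumulator.
    const float dst_f32 = src_scale * wei_scale_oc * static_cast<float>(acc_s32);
    // Quantize to the destination: divide by dst_scale, round, saturate.
    const float q = std::nearbyint(dst_f32 / dst_scale);
    const float sat = std::fmin(std::fmax(q, -128.0f), 127.0f);
    return static_cast<std::int8_t>(sat);
}
```

With `src_scale = 0.5`, `wei_scale_oc = 0.25` and `dst_scale = 0.125`, an accumulator of 100 represents the real value 12.5 and is stored as 100 in s8.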
Implicit downconversions and floating-point math mode#
oneDNN provides dnnl::primitive_attr::set_fpmath_mode() to allow implicit downconversions from fp32 to lower accuracy datatypes during primitive execution. For some applications, this can speed up computations without a noticeable impact on accuracy.
The dnnl::primitive_attr::set_fpmath_mode() primitive attribute specifies which implicit down-conversions are allowed for that given primitive. Only down-conversions from f32 to narrower data-types (f16, bf16, or tf32) are currently allowed. Furthermore, these down-conversions are allowed only during computation, and do not affect the storage datatype (which must remain f32).
The dnnl::primitive_attr::set_fpmath_mode() primitive attribute can take 3 types of values:
the strict mode disables any down-conversion (default).
the any mode allows all conversions from f32 to a smaller floating-point datatype (f16, bf16, or tf32).
a specific datatype (f16, bf16, or tf32), which allows down-conversion from f32 only to a datatype at least as accurate as the specified one (at least the same number of exponent and mantissa bits).
The default value for this attribute shall be strict. However, it is allowed to expose global functions or environment variables to change this default value.
This attribute is ignored if a primitive computation data-type is integral.
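The "at least as accurate" rule can be made concrete by comparing exponent and mantissa widths. The sketch below is one reading of that rule, not the library's logic: the enums and function names are hypothetical stand-ins, and the bit widths used are the standard ones (f16: 5 exponent and 10 mantissa bits, bf16: 8 and 7, tf32: 8 and 10).

```cpp
#include <cassert>

// Hypothetical stand-ins for the narrower datatypes and the math modes.
enum class fp_type { f16, bf16, tf32 };
enum class math_mode { strict, any, f16, bf16, tf32 };

struct fp_bits {
    int exponent;
    int mantissa;
};

// Exponent and explicit mantissa bit counts of each narrower datatype.
fp_bits bits_of(fp_type t) {
    switch (t) {
        case fp_type::f16: return {5, 10};
        case fp_type::bf16: return {8, 7};
        case fp_type::tf32: return {8, 10};
    }
    return {0, 0}; // unreachable for valid enum values
}

// Returns true if an implicit f32 -> target down-conversion is permitted
// under the given mode, following the rule stated in the text above.
bool downconversion_allowed(math_mode mode, fp_type target) {
    if (mode == math_mode::strict) return false;
    if (mode == math_mode::any) return true;
    fp_type requested = fp_type::tf32;
    if (mode == math_mode::f16) {
        requested = fp_type::f16;
    } else if (mode == math_mode::bf16) {
        requested = fp_type::bf16;
    } else {
        assert(mode == math_mode::tf32);
    }
    const fp_bits need = bits_of(requested);
    const fp_bits have = bits_of(target);
    return have.exponent >= need.exponent && have.mantissa >= need.mantissa;
}
```

Under this reading, the bf16 mode also permits tf32 (same exponent width, more mantissa bits) but not f16 (fewer exponent bits), while the tf32 mode permits only tf32.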
API#
-
struct dnnl::primitive_attr#
Primitive attributes.
Public Functions
-
primitive_attr()#
Constructs default (empty) primitive attributes.
-
scratchpad_mode get_scratchpad_mode() const#
Returns the scratchpad mode.
-
void set_scratchpad_mode(scratchpad_mode mode)#
Sets scratchpad mode.
- Parameters
mode – Specified scratchpad mode.
-
fpmath_mode get_fpmath_mode() const#
Returns the fpmath mode.
-
void set_fpmath_mode(fpmath_mode mode)#
Sets fpmath mode.
- Parameters
mode – Specified fpmath mode.
-
int get_scales_mask(int arg) const#
Returns scaling factors correspondence mask for a given memory argument.
- Parameters
arg – Parameter argument index as passed to the primitive::execute() call.
-
void set_scales_mask(int arg, int mask)#
Sets scaling factors correspondence mask for a given memory argument.
Note
The order of dimensions does not depend on how elements are laid out in memory. For example:
for a 2D CNN activations tensor the order is always (n, c)
for a 4D CNN activations tensor the order is always (n, c, h, w)
for a 5D CNN weights tensor the order is always (g, oc, ic, kh, kw)
- Parameters
arg – Parameter argument index as passed to the primitive::execute() call.
mask – Scaling factors correspondence mask that defines the correspondence between the arg tensor dimensions and the scales vector. Setting the i-th bit indicates that a dedicated scaling factor is used for each index along that dimension. Set the mask to 0 to use a common scaling factor for the whole tensor. The scales must be passed at execution time as an argument with index DNNL_ARG_ATTR_SCALES.
-
void set_zero_points_mask(int arg, int mask)#
Sets the zero points correspondence mask for a given memory argument.
See also
dnnl::primitive_attr::set_output_scales
- Parameters
arg – Parameter argument index as passed to the primitive::execute() call.
mask – Zero point correspondence mask that defines the correspondence between the tensor dimensions and the zero_points vector. The set i-th bit indicates that a dedicated zero point is used for each index along that dimension. Set the mask to 0 to use a common zero point for the whole output tensor. The zero points must be passed at execution time as an argument with index DNNL_ARG_ATTR_ZERO_POINTS.
-
const post_ops get_post_ops() const#
Returns post-ops previously set via set_post_ops().
- Returns
Post-ops.
-
void set_post_ops(const post_ops ops)#
Sets post-ops.
Note
There is no way to check whether the post-ops would be supported by the target primitive. Any error will be reported by the respective primitive descriptor constructor.
- Parameters
ops – Post-ops object to copy post-ops from.
-
void set_rnn_data_qparams(float scale, float shift)#
Sets quantization scale and shift parameters for RNN data tensors.
For performance reasons, the low-precision configuration of the RNN primitives expects input activations to have the unsigned 8-bit integer data type. The scale and shift parameters are used to quantize floating-point data to unsigned integer and must be passed to the RNN primitive using attributes.
The quantization formula is scale * (data + shift).
Example usage:
// RNN parameters
int l = 2, t = 2, mb = 32, sic = 32, slc = 32, dic = 32, dlc = 32;
// Activations quantization parameters
float scale = 2.0f, shift = 0.5f;
primitive_attr attr;
// Set scale and shift for int8 quantization of activation
attr.set_rnn_data_qparams(scale, shift);
// Create and configure rnn op_desc
vanilla_rnn_forward::desc rnn_d(/* arguments */);
vanilla_rnn_forward::primitive_desc rnn_pd(rnn_d, attr, engine);
Note
Quantization scale and shift are common for src_layer, src_iter, dst_iter, and dst_layer.
- Parameters
scale – The value to scale the data by.
shift – The value to shift the data by.
-
void set_rnn_weights_qparams(int mask, const std::vector<float> &scales)#
Sets quantization scaling factors for RNN weights tensors. The low-precision configuration of the RNN primitives expects input weights to use the signed 8-bit integer data type. The scaling factors are used to quantize floating-point data to signed integer and must be passed to RNN primitives using attributes.
Note
The dimension order is always native and does not depend on the actual layout used. For example, five-dimensional weights always have (l, d, i, g, o) logical dimension ordering.
Note
Quantization scales are common for weights_layer and weights_iteration
- Parameters
mask – Scaling factors correspondence mask that defines the correspondence between the output tensor dimensions and the scales vector. The set i-th bit indicates that a dedicated scaling factor should be used for each index along that dimension. Set the mask to 0 to use a common scaling factor for the whole output tensor.
scales – Constant vector of output scaling factors. The following equality must hold: \(scales.size() = \prod\limits_{d \in mask} weights.dims[d].\) Violations can only be detected when the attributes are used to create a primitive descriptor.