DPC++¶
Overview¶
oneAPI Data Parallel C++ (DPC++) is the direct programming language and associated direct programming APIs of oneAPI. It provides the features needed to define data parallel functions and to launch them on devices. The language is comprised of the following components:
C++. Every DPC++ program is also a C++ program. A compliant DPC++ implementation must support the C++17 Core Language (as specified in Sections 1-19 of ISO/IEC 14882:2017) or newer. See the C++ Standard.
SYCL. DPC++ builds on the SYCL specification from The Khronos Group. The SYCL language enables the definition of data parallel functions that can be offloaded to devices and defines runtime APIs and classes that are used to orchestrate the offloaded functions.
DPC++ Language extensions. A compliant DPC++ implementation must support the specified language features. These include unified shared memory (USM), ordered queues, and reductions. Some extensions are required only when the DPC++ implementation supports a specific class of device, as summarized in the Extensions Table. An implementation supports a class of device if it can target hardware that responds “true” for a DPC++ device type query, either through explicit support built into the implementation, or by using a lower layer that can support those device classes such as the oneAPI Level Zero (Level Zero). A DPC++ implementation must pass the conformance tests for all extensions that are required (Extensions Table) for the classes of devices that the implementation can support. (See SYCL Extensions.)
This specification requires a minimum of C++17 Core Language support and DPC++ extensions. These version and feature coverage requirements will evolve over time, with specific versions of C++ and SYCL being required, some additional extensions being required, and some DPC++ extensions no longer required if covered by newer C++ or SYCL versions directly.
Feature |
Where defined |
CPU |
GPU |
FPGA |
Test 1 |
---|---|---|---|---|---|
Accessor simplifications |
Required |
Required |
Required |
NA 3 |
|
bit_cast |
Required |
Required |
Required |
NA 3 |
|
Deduction guides |
Required |
Required |
Required |
NA 3 |
|
Device specific queries |
Not required 4 |
Not required 4 |
Not required 4 |
NA 3 |
|
Extended atomics |
Required 7 |
Required 7 |
Not required 4 |
NA 3 |
|
Kernel func type attributes |
Required |
Required |
Required |
NA 3 |
|
In-order queues |
Required |
Required |
Required |
NA 3 |
|
Math array |
Not required 4 |
Not required 4 |
Not required 4 |
NA 3 |
|
Optional lambda name |
Required |
Required |
Required |
NA 3 |
|
Queue shortcuts |
Required |
Required |
Required |
NA 3 |
|
Required work-group size |
Required |
Required |
Required |
NA 3 |
|
Standard layout relaxed |
Required |
Required |
Required |
NA 3 |
|
Unified Shared Memory |
Required 2 |
Required 2 |
Required 2 |
||
DPC++ extension |
Required 8 |
Required 8 |
Required 8 |
NA 3 |
|
DPC++ extension |
Required |
Required |
Not required 4 |
NA 3 |
|
DPC++ extension |
Not required |
Not required |
Required |
||
DPC++ extension |
Required |
Required |
Required |
NA 3 |
|
DPC++ extension |
Required |
Required |
Not required 4 |
NA 3 |
|
DPC++ extension |
Not required 4 |
Not required 4 |
Not required 4 |
NA 3 |
|
DPC++ extension |
Required |
Required |
Required |
NA 3 |
|
DPC++ extension |
Required |
Required |
Required |
NA 3 |
|
DPC++ extension |
Required 5 |
Required 5 |
Not required 4 |
NA 3 |
|
DPC++ extension |
Required |
Required |
Required |
NA 3 |
|
DPC++ extension |
Not required 4 |
Not required 4 |
Not required 4 |
NA 3 |
|
DPC++ extension |
Required |
Required |
Not required |
||
DPC++ extension |
Required 6 |
Required 6 |
Not required |
- 1
Test directory within extension tests
- 2(1,2,3)
Minimum of explicit USM support
- 3(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22)
Not yet available.
- 4(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16)
Likely to be required in the future
- 5(1,2)
DPC++ requirement is for one dimensional reductions, single reduction variable support
- 6(1,2)
DPC++ requirement is for sub-group algorithms that have equivalent group algorithms
- 7(1,2)
DPC++ requirement does not include support for atomics in the generic address space
- 8(1,2,3)
DPC++ requirement is for the general property mechanism, and not specific properties within it
Detailed API and Language Descriptions¶
The SYCL 1.2.1 Specification describes the SYCL APIs and language. DPC++ extensions on top of SYCL are described in the SYCL Extensions repository. Some features defined in the SYCL 2020 Provisional Specification but not in the SYCL 1.2.1 Specification are required in DPC++, as summarized in Extensions Table, and most replace DPC++ extensions that were required in previous versions of this specification.
A brief summary of the required features from SYCL 2020 Provisional Specification (above SYCL 1.2.1 Specification) follows:
Accessor simplifications - simplification of the accessor interface, reduction of verbosity in common code, and removal of need to specify template arguments in common cases. Section 4.7.6 of the SYCL 2020 Provisional Specification.
bit_cast - inclusion of C++20 (p0476r2)
std::bit_cast
assycl::bit_cast
. Section 3.8.2 of the SYCL 2020 Provisional Specification.Deduction guides - simplifies common code patterns and reduces code length and verbosity by enabling Class Template Argument Deduction (CTAD) from modern C++. Distributed throughout the SYCL 2020 Provisional Specification.
Device specific queries - kernel property queries associated with a specific device. Section 4.12 of the SYCL 2020 Provisional Specification.
Extended atomics - alignment with C++20
std::atomic_ref
, including some tweaks for memory models in SYCL. Support for floating-point types and shorthand operators. Section 4.17.3 of the SYCL 2020 Provisional Specification. Additional atomic-related queries are defined in Table 4.19, and some changes to fences and barriers are reflected in Section 4.17.1 (both in the SYCL 2020 Provisional Specification).Kernel function type attributes - definition of kernel attributes as function type attributes that allows them to be applied to lambdas. Definition of some core attributes. Section 5.7 of the SYCL 2020 Provisional Specification.
In-order queues - defines simple in-order semantics for queues, to simplify common coding patterns. Section 4.6.5 of the SYCL 2020 Provisional Specification.
Math array - contiguous fixed-size portable container. Section 4.16.3 of the SYCL 2020 Provisional Specification.
Optional lambda name - removes requirement to manually name lambdas that define kernels. Simplifies coding and enables composability with libraries. Lambdas can still be manually named, if desired, such as when debugging or interfacing with a
sycl::program
object. Section 4.14.2 of the SYCL 2020 Provisional Specification.Queue shortcuts - defines kernel invocation functions directly on the queue classes, to simplify code patterns where dependencies and/or accessors do not need to be created within the additional command group scope. Reduces code verbosity in some common patterns. Section 4.6.5 of the SYCL 2020 Provisional Specification.
Required work-group size - defines an attribute that can be applied to kernels (including lambda definitions of kernels) which signals that the kernel will only be invoked with a specific work-group size. This is an optimization attribute that enables optimizations based on additional user-driven information. Section 5.7 of the SYCL 2020 Provisional Specification.
Standard layout relaxed - removes the requirement that data shared by a host and device(s) must be C++ standard layout types. Requires device compilers to validate layout compatibility. Section 4.14.4 of the SYCL 2020 Provisional Specification.
Unified Shared Memory (USM) - defines pointer based memory accesses and management interfaces. Provides the ability to create allocations that are visible and have consistent pointer values across both host and device(s). Different USM capability levels are defined, corresponding to different levels of device and implementation support. Section 4.8 of the SYCL 2020 Provisional Specification.
A brief summary of the extensions is as follows:
Accessor properties - compile-time accessor properties that are visible to the compiler.
CXX standard library - enable subset of the C and C++ standard libraries in device code.
Data flow pipes - enable efficient First-In, First-Out (FIFO) communication in DPC++, a mechanism commonly used when describing algorithms for spatial architectures such as FPGAs.
Enqueued barriers - simplifies dependence creation and tracking for some common programming patterns by allowing coarser grained synchronization within a queue without manual creation of fine grained dependencies.
Group algorithms - defines collective operations that operate across groups of work-items, including broadcast, reduce, and scan. Improves productivity by providing common algorithms without explicit coding, and enables optimized implementations to exist for combinations of device and runtime.
Group mask - defines a type that can represent a set of work-items from a group, and collective operations that create or operate on that type such as ballot and count.
Parallel for shortcuts - simplification of common patterns such as invoking a kernel with a scalar range.
Pinned memory property - optimization indicating that a buffer should use a specific memory resource if possible, to accelerate movement of data between host and devices in some implementations.
Reductions - provides a reduction abstraction to the ND-range form of parallel_for. Improves productivity by providing the common reduction pattern without explicit coding, and enables optimized implementations to exist for combinations of device, runtime, and reduction properties.
Restrict all arguments - defines an attribute that can be applied to kernels (including lambda definitions of kernels) which signals that there will be no memory aliasing between any pointer arguments that are passed to or captured by a kernel. This is an optimization attribute that can have large impact when the developer knows more about the kernel arguments than a compiler can infer or safely assume.
Static local memory query - query for the amount of local memory used by a compiler and unavailable for dynamic use.
Subgroups - defines a grouping of work-items within a work-group. Synchronization of work-items in a sub-group can occur independently of work-items in other sub-groups, and sub-groups expose communication operations across work-items in the group. Subgroups commonly map to SIMD hardware where it exists.
Subgroup algorithms - defines collective operations across work-items in a sub-group that are available only for sub-groups. Also enables algorithms from the more generic “group algorithms” extension as sub-group collective operations.
Open Source Implementation¶
An open source implementation is available under an LLVM license. Details on incomplete features and known issues are available in the Release Notes (and the Getting Started Guide until the release notes are available).
Testing¶
A DPC++ implementation must pass:
The extension tests for any extension implemented from the Extensions Table. Each extension in the Extensions Table lists the name of the directory that contains corresponding tests, within the extension tests tree.
Acknowledgment¶
We thank the DPC++ and oneDPL Technical Advisory Board for their valuable feedback, and the Khronos SYCL working group for their efforts defining and evolving the SYCL specification.