CLUDA is the foundation of reikna. It provides unified access to the basic features of CUDA and OpenCL, such as memory operations, compilation and so on. It can also be used by itself, if you want to write GPU API-independent programs and only need a small subset of the GPU API. The terminology is borrowed from OpenCL, since it is the more general of the two APIs.
Modules for all APIs have the same generalized interface. It is referred to here (and in references from other parts of this documentation) as reikna.cluda.api.
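For example, a program can stay GPU API-independent by requesting whichever backend is available at run time. The following is a minimal sketch assuming the reikna.cluda entry point any_api() and the Thread.create() constructor (a specific backend can be requested with cuda_api() or ocl_api() instead):

    import numpy
    from reikna.cluda import any_api

    # Pick any supported GPGPU API and create a Thread, which wraps
    # a device, a context and a command queue/stream.
    api = any_api()
    thr = api.Thread.create()

    # Basic memory operations exposed by the generalized interface.
    a = numpy.arange(16).astype(numpy.float32)
    a_dev = thr.to_device(a)           # host to device copy
    a_back = thr.from_device(a_dev)    # device to host copy
    assert (a == a_back).all()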
Each Thread contains a special allocator for arrays whose data does not have to be persistent all the time. In many cases you only want an array to keep its contents between several kernel calls. This can be achieved by manually allocating and deallocating such arrays every time, but that slows the program down, and you have to synchronize the queue because allocation commands are not serialized. It is therefore advantageous to use the temp_array() method to get such arrays. It takes a list of dependencies as an optional parameter, which gives the allocator a hint about which arrays should not use the same physical allocation.
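A minimal sketch, using the Thread created above and assuming the dependencies are passed through a keyword parameter named dependencies:

    import numpy

    # These two temporary arrays may end up sharing the same physical
    # allocation, since neither declares a dependency on the other.
    tmp1 = thr.temp_array((1024,), numpy.float32)
    tmp2 = thr.temp_array((1024,), numpy.float32)

    # This temporary array is guaranteed not to alias tmp1, because both
    # will be used in the same kernel call.
    tmp3 = thr.temp_array((1024,), numpy.float32, dependencies=[tmp1])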
The toolbox available to a kernel passed for compilation consists of two parts.
First, there are several objects available at the template rendering stage, namely numpy, reikna.cluda.dtypes (as dtypes), and reikna.helpers (as helpers).
Second, there is a set of macros attached to any kernel, depending on the API it is being compiled for (both parts of the toolbox are demonstrated in the sketches after the list below):
CUDA: If defined, specifies that the kernel is being compiled for the CUDA API.
COMPILE_FAST_MATH: If defined, specifies that the compilation for this kernel was requested with fast_math == True.
LOCAL_BARRIER: Synchronizes threads inside a block.
WITHIN_KERNEL: Modifier for a device-only function declaration.
KERNEL: Modifier for the kernel function declaration.
GLOBAL_MEM: Modifier for a global memory pointer argument.
LOCAL_MEM: Modifier for a statically allocated local memory variable.
LOCAL_MEM_DYNAMIC: Modifier for a dynamically allocated local memory variable.
LOCAL_MEM_ARG: Modifier for a local memory argument in device-only functions.
INLINE: Modifier for inline functions.
SIZE_T: The type of local/global IDs and sizes. Equal to unsigned int for CUDA and size_t for OpenCL (which can be a 32- or 64-bit unsigned integer, depending on the device).
get_local_id(dim), get_group_id(dim), get_global_id(dim), get_local_size(dim), get_num_groups(dim), get_global_size(dim): Local, group and global identifiers and sizes. For CUDA these mimic the behavior of the corresponding OpenCL functions.
VSIZE_T: The type of local/global IDs in the virtual grid. It is separate from SIZE_T because SIZE_T is intended to be equivalent to whatever type the backend uses, while VSIZE_T is a separate type that can be made larger than SIZE_T in the future if necessary.
ALIGN(int): Used to specify an explicit alignment (in bytes) for fields in structures, as in

    typedef struct {
        char ALIGN(4) a;
        int b;
    } MY_STRUCT;
VIRTUAL_SKIP_THREADS: This macro should start any kernel compiled with compile_static(). It skips all the empty threads resulting from fitting the call parameters into the backend's limitations.
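To illustrate both parts of the toolbox, here is a minimal sketch of compiling and running a kernel with Thread.compile(); it follows the usage shown in reikna's examples, and the empty_like() helper and the global_size keyword of the kernel call are assumed to be available:

    import numpy

    # dtypes and numpy are available inside ${...} Mako expressions at the
    # template rendering stage; KERNEL, GLOBAL_MEM, SIZE_T and get_global_id
    # are substituted for the target API when the rendered source is compiled.
    src = """
    KERNEL void multiply_them(
        GLOBAL_MEM ${dtypes.ctype(numpy.float32)} *dest,
        GLOBAL_MEM ${dtypes.ctype(numpy.float32)} *a,
        GLOBAL_MEM ${dtypes.ctype(numpy.float32)} *b)
    {
        const SIZE_T i = get_global_id(0);
        dest[i] = a[i] * b[i];
    }
    """

    program = thr.compile(src)
    multiply_them = program.multiply_them

    a = numpy.random.randn(256).astype(numpy.float32)
    b = numpy.random.randn(256).astype(numpy.float32)
    a_dev = thr.to_device(a)
    b_dev = thr.to_device(b)
    dest_dev = thr.empty_like(a_dev)

    multiply_them(dest_dev, a_dev, b_dev, global_size=256)
    assert numpy.allclose(thr.from_device(dest_dev), a * b)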
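Correspondingly, here is a sketch of a static kernel, assuming compile_static() takes the kernel name and the global size, and that virtual grid IDs are obtained with virtual_global_id():

    import numpy

    static_src = """
    KERNEL void add_one(GLOBAL_MEM int *data)
    {
        VIRTUAL_SKIP_THREADS;  /* drop threads added only to satisfy grid limits */
        const VSIZE_T i = virtual_global_id(0);
        data[i] = data[i] + 1;
    }
    """

    # The requested global size does not have to fit the backend's grid
    # restrictions; the virtual size machinery takes care of that.
    add_one = thr.compile_static(static_src, 'add_one', global_size=(1000,))

    data_dev = thr.to_device(numpy.arange(1000).astype(numpy.int32))
    add_one(data_dev)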
The reikna.cluda.dtypes module contains various convenience functions which operate on numpy.dtype objects.
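For example (a short sketch assuming the ctype(), complex_for() and c_constant() helpers, which respectively map a dtype to its C type name, find the complex counterpart of a real dtype, and render a value as a literal usable in kernel source):

    import numpy
    from reikna.cluda import dtypes

    print(dtypes.ctype(numpy.float32))        # 'float'
    print(dtypes.complex_for(numpy.float64))  # the complex128 dtype
    print(dtypes.c_constant(numpy.float32(1.5)))  # a C literal string, e.g. '1.5f'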