CLUDA layer

CLUDA is the foundation of reikna. It provides unified access to the basic features of CUDA and OpenCL, such as memory operations, compilation and so on. It can also be used by itself, if you want to write GPU API-independent programs and happen to need only a small subset of the GPU API. The terminology is borrowed from OpenCL, since it is the more general API.
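The example below is a minimal sketch of host-side code that uses only the CLUDA layer: it creates a Thread, transfers data, compiles a kernel written with the macros described later in this section, and runs it. It assumes PyOpenCL is installed; cuda_api() can be used instead with PyCUDA.

import numpy
import reikna.cluda as cluda

api = cluda.ocl_api()      # or cluda.cuda_api() for the CUDA backend
thr = api.Thread.create()  # wraps a device context and a queue/stream

program = thr.compile("""
KERNEL void multiply_them(
    GLOBAL_MEM float *dest,
    GLOBAL_MEM float *a,
    GLOBAL_MEM float *b)
{
    const SIZE_T i = get_global_id(0);
    dest[i] = a[i] * b[i];
}
""")

a = numpy.random.randn(256).astype(numpy.float32)
b = numpy.random.randn(256).astype(numpy.float32)
a_dev = thr.to_device(a)
b_dev = thr.to_device(b)
dest_dev = thr.empty_like(a_dev)

program.multiply_them(dest_dev, a_dev, b_dev, global_size=256)
assert numpy.allclose(dest_dev.get(), a * b)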

API module

Modules for all APIs have the same generalized interface. It is referred to here (and referenced from other parts of this documentation) as reikna.cluda.api.
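In practice this means that switching backends only changes which function you use to obtain the module; for instance (a sketch, assuming both PyCUDA and PyOpenCL are available):

import reikna.cluda as cluda

api = cluda.ocl_api()     # module backed by PyOpenCL
# api = cluda.cuda_api()  # module backed by PyCUDA
# api = cluda.any_api()   # whichever supported backend is available

thr = api.Thread.create() # the rest of the code is backend-independent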

Temporary Arrays

Each Thread contains a special allocator for arrays whose data does not have to be persistent all the time. In many cases you only want an array to keep its contents between several kernel calls. This can be achieved by manually allocating and deallocating such arrays every time, but it slows the program down, and you have to synchronize the queue because allocation commands are not serialized. It is therefore advantageous to use the temp_array() method to get such arrays. It takes an optional list of dependencies, which gives the allocator a hint about which arrays should not use the same physical allocation.
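A short sketch (the shape/dtype arguments mirror regular array creation; the dependencies keyword name is an assumption that should be checked against the version of reikna you are using):

import numpy

# 'thr' is a Thread created as in the example above
t1 = thr.temp_array((1024,), numpy.float32)

# The dependency hint tells the allocator that t2 should not share
# physical memory with t1, since both hold live data at the same time.
t2 = thr.temp_array((1024,), numpy.float32, dependencies=[t1])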

Function modules

Kernel toolbox

The toolbox available to a kernel passed for compilation consists of two parts.

First, there are several objects available at the template rendering stage, namely numpy, reikna.cluda.dtypes (as dtypes), and reikna.helpers (as helpers).

Second, there is a set of macros attached to any kernel depending on the API it is being compiled for:

CUDA

If defined, specifies that the kernel is being compiled for the CUDA API.

COMPILE_FAST_MATH

If defined, specifies that the compilation for this kernel was requested with fast_math == True.
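For example, kernel code can branch on these two macros (an illustrative sketch):

#ifdef CUDA
// CUDA-specific code path
#endif

#ifdef COMPILE_FAST_MATH
// shortcuts that are only acceptable when fast math was requested
#endif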

LOCAL_BARRIER

Synchronizes threads inside a block.

WITHIN_KERNEL

Modifier for a device-only function declaration.

KERNEL

Modifier for the kernel function declaration.

GLOBAL_MEM

Modifier for the global memory pointer argument.

LOCAL_MEM

Modifier for the statically allocated local memory variable.

LOCAL_MEM_DYNAMIC

Modifier for the dynamically allocated local memory variable.

LOCAL_MEM_ARG

Modifier for a local memory argument in device-only functions.

INLINE

Modifier for inline functions.

SIZE_T

The type of local/global IDs and sizes. Equal to unsigned int for CUDA, and size_t for OpenCL (which can be a 32- or 64-bit unsigned integer, depending on the device).

SIZE_T get_local_id(int dim)
SIZE_T get_group_id(int dim)
SIZE_T get_global_id(int dim)
SIZE_T get_local_size(int dim)
SIZE_T get_num_groups(int dim)
SIZE_T get_global_size(int dim)

Local, group and global identifiers and sizes. In the case of CUDA they mimic the behavior of the corresponding OpenCL functions.
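Putting the modifiers and ID functions above together, a device function and a kernel might look like the following sketch (it assumes the kernel is launched with a local size of 64; the kernel is illustrative and not part of the library):

WITHIN_KERNEL float square(float x)
{
    return x * x;
}

KERNEL void sum_squares(GLOBAL_MEM float *output, GLOBAL_MEM const float *input)
{
    LOCAL_MEM float buf[64];

    const SIZE_T lid = get_local_id(0);
    const SIZE_T gid = get_global_id(0);

    buf[lid] = square(input[gid]);
    LOCAL_BARRIER;

    // Naive reduction: the first thread in each work-group sums the buffer.
    if (lid == 0)
    {
        float total = 0;
        for (int i = 0; i < 64; i++)
            total += buf[i];
        output[get_group_id(0)] = total;
    }
}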

VSIZE_T

The type of local/global IDs in the virtual grid. It is defined separately from SIZE_T because the latter is intended to be equivalent to whatever type the backend uses, while VSIZE_T is an independent type that can be made larger than SIZE_T in the future if necessary.

ALIGN(int)

Used to specify an explicit alignment (in bytes) for fields in structures, as

typedef struct {
    char ALIGN(4) a;
    int b;
} MY_STRUCT;

VIRTUAL_SKIP_THREADS

This macro should be the first statement of any kernel compiled with compile_static(). It skips all the empty threads that result from fitting the call parameters into backend limitations.

VSIZE_T virtual_local_id(int dim)
VSIZE_T virtual_group_id(int dim)
VSIZE_T virtual_global_id(int dim)
VSIZE_T virtual_local_size(int dim)
VSIZE_T virtual_num_groups(int dim)
VSIZE_T virtual_global_size(int dim)
VSIZE_T virtual_global_flat_id()
VSIZE_T virtual_global_flat_size()

These functions are only available in StaticKernel objects obtained from compile_static(). Since the dimensions of the virtual grid can differ from the actual call dimensions, they have to be used instead of their regular counterparts.
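A sketch of a static kernel (it assumes the compile_static(source, name, global_size) signature and a Thread thr created as in the earlier examples):

import numpy

scale = thr.compile_static("""
KERNEL void scale(GLOBAL_MEM float *data, float coeff)
{
    VIRTUAL_SKIP_THREADS;
    const VSIZE_T i = virtual_global_id(0);
    data[i] = data[i] * coeff;
}
""", 'scale', global_size=(1000,))

data_dev = thr.to_device(numpy.arange(1000).astype(numpy.float32))
scale(data_dev, numpy.float32(2))  # grid sizes are fixed at compilation time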

Datatype tools

This module contains various convenience functions which operate on numpy.dtype objects.
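For example (a sketch showing only a few of the available functions):

import numpy
from reikna.cluda import dtypes

dtypes.ctype(numpy.float32)        # C type name for the dtype: 'float'
dtypes.complex_for(numpy.float32)  # numpy.complex64
dtypes.real_for(numpy.complex64)   # numpy.float32
dtypes.is_complex(numpy.complex64) # True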
