CLUDA is the foundation of reikna. It provides unified access to the basic features of CUDA and OpenCL, such as memory operations and compilation. It can also be used on its own, if you want to write GPU API-independent programs and only need a small subset of GPU functionality. The terminology is borrowed from OpenCL, since it is the more general API.
Contains a CLUDA module. See Tutorial: modules and snippets for details.
Creates a module from the Mako def.
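A minimal sketch of creating a module from a template string (the ${prefix} placeholder and the render_kwds keyword follow the tutorial conventions; the particular device function is just an illustration):

from reikna.cluda import Module

# A module defining a device function; CLUDA substitutes ``prefix`` so that
# several instances of the module can coexist in the same kernel.
multiplier = Module.create(
    """
    WITHIN_KERNEL float ${prefix}(float x)
    {
        return x * ${coeff};
    }
    """,
    render_kwds=dict(coeff=3))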
Raised by compile_static() if the provided local_size is too big, or a suitable one cannot be found.
Contains a CLUDA snippet. See Tutorial: modules and snippets for details.
Creates a snippet from the Mako def.
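A minimal sketch of creating a snippet from a function (the convention that the function's argument names become the snippet parameters is taken from the tutorial):

from reikna.cluda import Snippet

# When rendered, ``x`` and ``y`` are replaced by the argument strings
# supplied at the point where the snippet is used.
add = Snippet.create(lambda x, y: "${x} + ${y}")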
Returns one of the API modules supported by the system, or raises an Exception if none are available.
Returns a list of identifiers for all known (not necessarily available for the current system) APIs.
Returns the PyCUDA-based API module.
Returns the identifier of the PyCUDA-based API.
Find platforms and devices meeting certain criteria.
Returns: a dictionary with the numbers of found platforms as keys and lists of device numbers as values.
Returns an API module with the generalized interface reikna.cluda.api for the given identifier.
Returns the PyOpenCL-based API module.
Returns the identifier of the PyOpenCL-based API.
Returns a list of identifiers of supported APIs.
Returns True if the given API is supported.
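A short sketch of backend discovery using the functions above (the identifier helpers are assumed to be cuda_id() and ocl_id(), as described in their entries):

from reikna import cluda

# Prefer OpenCL if it is available, otherwise take whatever backend works.
if cluda.supports_api(cluda.ocl_id()):
    api = cluda.get_api(cluda.ocl_id())
else:
    api = cluda.any_api()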
Modules for all APIs have the same generalized interface. It is referred to here (and in references from other parts of this documentation) as reikna.cluda.api.
Low-level untyped memory allocation. Actual class depends on the API: pycuda.driver.DeviceAllocation for CUDA and pyopencl.Buffer for OpenCL.
A superclass of the corresponding API’s native array (pycuda.gpuarray.GPUArray for CUDA and pyopencl.array.Array for OpenCL), with some additional functionality.
An assembly of device parameters necessary for optimizations.
Maximum block size for kernels.
List with maximum local_size for each dimension.
List with maximum number of workgroups for each dimension.
Warp size (NVIDIA), wavefront size (AMD), or SIMD width: the number of threads that are executed simultaneously on the same computation unit (so they can be assumed to be perfectly synchronized).
Number of local (shared in CUDA) memory banks: the number of successive 32-bit words that can be accessed without bank conflicts.
Size of the local (shared in CUDA) memory per workgroup, in bytes.
Dictionary {word_size:elements}, where elements is the number of elements with size word_size in global memory that allow coalesced access.
Checks if the given numpy dtype can be used in kernels compiled using this thread.
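A quick sketch of using these parameters to adapt a program to the device (the attribute names max_work_group_size and warp_size are assumptions matching the entries above):

import numpy
from reikna import cluda

thr = cluda.any_api().Thread.create()
params = thr.device_params

# Clamp the requested work group size to the device limit and round it
# down to a multiple of the warp/wavefront size.
local_size = min(512, params.max_work_group_size)
local_size -= local_size % params.warp_size

# Fall back to single precision if doubles cannot be used in kernels.
dtype = numpy.float64 if params.supports_dtype(numpy.float64) else numpy.float32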
A vendor-specific implementation of the GPGPU API.
Platform name.
Vendor name.
Platform version.
Returns a list of device objects available in the platform.
An object containing GPU kernel.
Maximum size of the work group for the kernel.
A shortcut for calling prepare() followed by prepared_call().
Prepare the kernel for execution with given parameters.
An object with compiled GPU code.
Contains module source code.
An object containing a GPU kernel with fixed call sizes.
Contains the source code of the program.
Wraps an existing context in the CLUDA thread object.
Note
If you are using the CUDA API, keep in mind the stateful nature of CUDA calls. Briefly, this means that there is a context stack, with the current context on top of it. When create() is called, the PyCUDA context is pushed onto the stack and made current. When the thread object goes out of scope (and the thread object owns the context), the context is popped, and it is the user's responsibility to make sure the popped context is the correct one. In simple single-context programs this only means that one should avoid reference cycles involving the thread object.
Warning
Do not pass one Stream/CommandQueue object to several Thread objects.
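A sketch of wrapping an existing PyOpenCL context, under the assumption that the Thread constructor accepts a PyOpenCL context directly (and that each API module exposes its Thread class):

import pyopencl
from reikna import cluda

ctx = pyopencl.create_some_context()

# Reuse the existing context instead of letting CLUDA create one.
api = cluda.ocl_api()
thr = api.Thread(ctx)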
Module object representing the CLUDA API corresponding to this Thread.
Instance of DeviceParameters class for this thread’s device.
Instance of TemporaryManager which handles allocations of temporary arrays (see temp_array()).
Creates an Array on the GPU with the given shape, dtype and strides. Optionally, allocator is a callable returning any object castable to int that represents a physical address on the device (for instance, a Buffer).
Creates a module object from the given template.
Returns: a Program object.
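A sketch of compiling and calling a kernel. It assumes that compile() accepts a render_kwds keyword, that kernels are exposed as attributes of the resulting Program, and that the call shortcut takes global_size/local_size keywords as described for Kernel above:

import numpy
from reikna import cluda

thr = cluda.any_api().Thread.create()

# Compile a small kernel from a Mako template and run it.
program = thr.compile("""
KERNEL void multiply(GLOBAL_MEM float *dest, GLOBAL_MEM float *src)
{
    const SIZE_T i = get_global_id(0);
    dest[i] = src[i] * ${coeff};
}
""", render_kwds=dict(coeff=2))

src = thr.to_device(numpy.arange(16, dtype=numpy.float32))
dest = thr.empty_like(src)
program.multiply(dest, src, global_size=(16,))
result = dest.get()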
Creates a kernel object with fixed call sizes, which allows one to overcome some backend limitations. Global and local sizes can have any length, provided that len(global_size) >= len(local_size) and the total number of work items and work groups does not exceed the corresponding totals available for the device. In order to get IDs and sizes in such kernels, the virtual size functions have to be used (see VIRTUAL_SKIP_THREADS and others for details).
Returns: a StaticKernel object.
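A sketch of the static kernel variant (the argument order of template source, kernel name and global_size is an assumption; the virtual size functions are described in the kernel toolbox section below):

import numpy
from reikna import cluda

thr = cluda.any_api().Thread.create()

src = thr.to_device(numpy.arange(100, dtype=numpy.float32))
dest = thr.empty_like(src)

# The call size is fixed at compilation time; inside the kernel the
# virtual IDs are used, and padding threads are skipped explicitly.
scale = thr.compile_static("""
KERNEL void scale(GLOBAL_MEM float *dest, GLOBAL_MEM float *src)
{
    VIRTUAL_SKIP_THREADS;
    const VSIZE_T i = virtual_global_id(0);
    dest[i] = src[i] * 2;
}
""", 'scale', global_size=(100,))

scale(dest, src)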
Copies an array on the device.
Creates a new Thread object with its own context and queue inside. Intended for cases when you want to base your whole program on CLUDA.
Allocates an array on the GPU with the same attributes as arr.
Transfers the contents of arr to a numpy.ndarray object. The effect of the dest parameter is the same as in to_device(). If async is True, the transfer is asynchronous (the thread-wide asynchronicity setting does not apply here).
Alternatively, one can use Array.get().
Forcefully free critical resources (rendering the object unusable). In most cases you can rely on the garbage collector taking care of things. Calling this method explicitly may be necessary with the CUDA API, when you want to make sure the context got popped.
Forcefully synchronize this thread with the main program.
Creates an Array on the GPU with the given shape, dtype and strides. In order to reduce the memory footprint of the program, the temporary array manager will allow these arrays to overlap. Two arrays will not overlap if one of them was specified in dependencies for the other one. For a list of values dependencies takes, see the reference entry for TemporaryManager.
Copies an array to the device memory. If dest is specified, it is used as the destination, and the method returns None. Otherwise the destination array is created internally and returned from the method.
Returns the identifier of this API.
Each Thread contains a special allocator for arrays whose data does not have to persist all the time. In many cases you only want an array to keep its contents between several kernel calls. This can be achieved by manually allocating and deallocating such arrays every time, but it slows the program down, and you have to synchronize the queue because allocation commands are not serialized. It is therefore advantageous to use the temp_array() method to get such arrays. It takes an optional list of dependencies which gives the allocator a hint about which arrays should not use the same physical allocation.
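A short sketch of requesting temporary arrays with a dependency hint (shapes and dtypes are arbitrary):

import numpy
from reikna import cluda

thr = cluda.any_api().Thread.create()

# ``a`` and ``b`` are used in the same kernel call, so they must not share
# a physical allocation; listing ``a`` in the dependencies of ``b`` tells
# the temporary manager to keep them apart.
a = thr.temp_array((1024,), numpy.float32)
b = thr.temp_array((1024,), numpy.float32, dependencies=[a])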
Base class for a manager of temporary allocations.
Returns a temporary array.
Packs the real allocations possibly reducing total memory usage. This process can be slow.
Trivial manager — allocates a separate buffer for each allocation request.
Tries to assign several allocation requests to a single real allocation, if dependencies allow that. All virtual allocations start from the beginning of real allocations.
This module contains Module factories which are used to compensate for the lack of complex number operations in OpenCL, and the lack of C++ syntax which would allow one to write them.
Returns a Module with a function of len(in_dtypes) arguments that adds values of types in_dtypes. If out_dtype is given, it will be set as a return type for this function.
This is necessary since on some platforms the + operator for a complex and a real number works in an unexpected way (returning (a.x + b, a.y + b) instead of (a.x + b, a.y)).
Returns a Module with a function of one argument that casts values of in_dtype to out_dtype.
Returns a Module with a function of one argument that conjugates the value of type dtype (must be a complex data type).
Returns a Module with a function of two arguments that divides values of in_dtype1 and in_dtype2. If out_dtype is given, it will be set as a return type for this function.
Returns a Module with a function of one argument that exponentiates the value of type dtype (must be a real or complex data type).
Returns a Module with a function of len(in_dtypes) arguments that multiplies values of types in_dtypes. If out_dtype is given, it will be set as a return type for this function.
Returns a Module with a function of one argument that returns the 2-norm of the value of type dtype (the product with the complex conjugate if the value is complex, and the square otherwise).
Returns a Module with a function of two arguments that returns the complex-valued rho * exp(i * theta) for values rho, theta of type dtype (must be a real data type).
Returns a Module with a function of one argument that returns a complex number (cos(theta), sin(theta)) for a value theta of type dtype (must be a real data type).
Returns a Module with a function of two arguments that raises the first argument of type dtype to the power of the second argument of type exponent_dtype (an integer or real data type). If exponent_dtype or output_dtype are not given, they default to dtype. If dtype is not the same as output_dtype, the input is cast to output_dtype before exponentiation. If exponent_dtype is real, but both dtype and output_dtype are integer, a ValueError is raised.
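A sketch of rendering one of these functions into a kernel (passing the resulting Module through render_kwds follows the tutorial conventions):

import numpy
from reikna import cluda
from reikna.cluda import functions, dtypes

thr = cluda.any_api().Thread.create()

# ${mul} renders as a function multiplying a complex value by a real one.
program = thr.compile("""
KERNEL void scale(GLOBAL_MEM ${ctype} *dest, GLOBAL_MEM ${ctype} *src, float coeff)
{
    const SIZE_T i = get_global_id(0);
    dest[i] = ${mul}(src[i], coeff);
}
""", render_kwds=dict(
    ctype=dtypes.ctype(numpy.complex64),
    mul=functions.mul(numpy.complex64, numpy.float32, out_dtype=numpy.complex64)))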
The functionality available to the kernel code passed for compilation consists of two parts.
First, there are several objects available at the template rendering stage, namely numpy, reikna.cluda.dtypes (as dtypes), and reikna.helpers (as helpers).
Second, there is a set of macros attached to any kernel depending on the API it is being compiled for:
If defined, specifies that the kernel is being compiled for CUDA API.
If defined, specifies that the compilation for this kernel was requested with fast_math == True.
Synchronizes threads inside a block.
Modifier for a device-only function declaration.
Modifier for the kernel function declaration.
Modifier for the global memory pointer argument.
Modifier for the statically allocated local memory variable.
Modifier for the dynamically allocated local memory variable.
Modifier for the local memory argument in the device-only functions.
Modifier for inline functions.
The type of local/global IDs and sizes. Equal to unsigned int for CUDA, and size_t for OpenCL (which can be 32- or 64-bit unsigned integer, depending on the device).
Local, group and global identifiers and sizes. In the case of CUDA they mimic the behavior of the corresponding OpenCL functions.
The type of local/global IDs in the virtual grid. It is separate from SIZE_T because SIZE_T is intended to be equivalent to what the backend uses, while VSIZE_T is an independent type that can be made larger than SIZE_T in the future if necessary.
Used to specify an explicit alignment (in bytes) for fields in structures, as in:
typedef struct {
    char ALIGN(4) a;
    int b;
} MY_STRUCT;
This macro should start any kernel compiled with compile_static(). It skips all the empty threads resulting from fitting call parameters into backend limitations.
Only available in StaticKernel objects obtained from compile_static(). Since the dimensions of the virtual grid can differ from the actual call dimensions, these functions have to be used instead of the regular ID and size functions.
This module contains various convenience functions which operate with numpy.dtype objects.
Returns a new struct dtype with the field offsets changed to the ones a compiler would use (without being given any explicit alignment qualifiers). Ignores all existing explicit itemsizes and offsets.
Returns a C-style numerical constant. If val has a struct dtype, the generated constant will have the form { ... } and can be used as an initializer for a variable.
Returns a string corresponding to the path to a struct element in C. The path is the sequence of field names/array indices returned from flatten_dtype().
Returns a function that takes one argument and casts it to dtype.
Returns the name of the constructor for the given dtype.
Returns complex dtype corresponding to given floating point dtype.
For a built-in C type, returns a string with the name of the type.
For a struct type, returns a Module object with the typedef of a struct corresponding to the given dtype (with its name set to the module prefix); falls back to ctype() otherwise.
The structure definition includes the alignment required to produce the field offsets specified in dtype; therefore, dtype must be either a simple type, or have proper offsets and dtypes (ones that can be reproduced in C using explicit alignment attributes, but without additional padding) and the attribute isalignedstruct == True. An aligned dtype can be produced either by standard means (the aligned flag in the numpy.dtype constructor and explicit offsets and itemsizes), or created out of an arbitrary dtype with the help of align().
If ignore_alignment is True, all of the above is ignored. The C structures produced will not have any explicit alignment modifiers. As a result, the field offsets of dtype may differ from the ones chosen by the compiler.
Modules are cached, and the function returns a single module instance for equal dtypes. Therefore, inside a kernel it will be rendered with the same prefix everywhere it is used. This results in behavior characteristic of a structural type system, the same as for the basic dtype-to-ctype conversion.
Warning
As of numpy 1.8, the isalignedstruct attribute is not enough to ensure a mapping between a dtype and a C struct with only the fields that are present in the dtype. Therefore, ctype_module will make some additional checks and raise ValueError if it is not the case.
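A short sketch of producing a C struct typedef for a numpy struct dtype (field names are arbitrary):

import numpy
from reikna.cluda import dtypes

# align() fixes the offsets to the ones a C compiler would use;
# ctype_module() then returns a Module with the matching typedef.
dtype = dtypes.align(numpy.dtype([('x', numpy.int32), ('y', numpy.float32)]))
struct_ctype = dtypes.ctype_module(dtype)

# The module can be passed through render_kwds and used as a type name
# in a kernel template, e.g. GLOBAL_MEM ${struct_ctype} *data.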
Find out the data type of val.
Extracts an element from an array of struct dtype. The path is the sequence of field names/array indices returned from flatten_dtype().
Returns a list of tuples (path, dtype) for each of the basic dtypes in a (possibly nested) dtype. path is a list of field names/array indices leading to the corresponding element.
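A small sketch tying flatten_dtype(), extract_field() and c_path() together (the nested dtype is arbitrary; the exact formatting of the returned C path string is not shown here):

import numpy
from reikna.cluda import dtypes

dtype = numpy.dtype([('nested', [('x', numpy.int32)]), ('y', numpy.float32)])

# Each path is a list of field names/array indices leading to a basic dtype.
paths = [path for path, _ in dtypes.flatten_dtype(dtype)]

arr = numpy.zeros(4, dtype)
x_values = dtypes.extract_field(arr, paths[0])  # values of the 'nested'/'x' element
c_ref = dtypes.c_path(paths[0])                 # C-style path to the same element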
Returns True if dtype is complex.
Returns True if dtype is double precision floating point.
Returns True if dtype is an integer.
Returns True if dtype is real.
Wrapper for numpy.min_scalar_type which takes into account types supported by GPUs.
Function for wrapping all dtypes coming from the user. numpy uses two different classes to represent dtypes, and one of them does not have some important attributes.
Same as normalize_type(), but operates on a list of dtypes.
Returns floating point dtype corresponding to given complex dtype.
Wrapper for numpy.result_type which takes into account types supported by GPUs.
Returns a string with the constructed zero value for the given dtype.