CLUDA layer

CLUDA is the foundation of reikna. It provides unified access to the basic features of CUDA and OpenCL, such as memory operations and compilation. It can also be used by itself, if you want to write GPU API-independent programs and happen to need only a small subset of the GPU API. The terminology is borrowed from OpenCL, since it is the more general API.
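For example, the following minimal program picks whichever backend is available and creates a computation thread on it (a sketch; assumes PyCUDA or PyOpenCL is installed):

    import numpy as np
    from reikna import cluda

    # Pick any available backend (PyCUDA- or PyOpenCL-based).
    api = cluda.any_api()

    # Create a Thread with its own context and queue.
    thr = api.Thread.create()

    # Round-trip an array through device memory.
    a = np.arange(16, dtype=np.float32)
    a_dev = thr.to_device(a)
    assert np.allclose(thr.from_device(a_dev), a)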
class reikna.cluda.Module(template_src, render_kwds=None)

    Contains a CLUDA module. See Tutorial: modules and snippets for details.

    Parameters:
        - template_src (str or Mako template) – a Mako template with the module code, or a string with the template source.
        - render_kwds – a dictionary which will be used to render the template. Can contain other modules and snippets.

    classmethod create(func_or_str, render_kwds=None)

        Creates a module from a Mako def:

        - if func_or_str is a function, the def has the same signature as func_or_str (prefix will be passed as the first positional parameter), and a body equal to the string the function returns;
        - if func_or_str is a string, the def has a single positional argument prefix, and a body equal to that string.
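A minimal sketch of both forms (the scaling function and the factor value are illustrative, not part of the API):

    from reikna.cluda import Module

    # String form: the def has a single positional argument ``prefix``;
    # ``factor`` is taken from render_kwds.
    scale = Module.create(
        """
        WITHIN_KERNEL float ${prefix}(float x)
        {
            return x * ${factor};
        }
        """,
        render_kwds=dict(factor=2))

    # Function form: the def has the same signature as the lambda, with
    # ``prefix`` passed first; a parent template can then render it with
    # arguments, e.g. ``${scale_by(3)}``.
    scale_by = Module.create(
        lambda prefix, factor: """
        WITHIN_KERNEL float ${prefix}(float x)
        {
            return x * ${factor};
        }
        """)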
exception reikna.cluda.OutOfResourcesError

    Thrown by compile_static() if the provided local_size is too big, or a suitable one cannot be found.
class reikna.cluda.Snippet(template_src, render_kwds=None)

    Contains a CLUDA snippet. See Tutorial: modules and snippets for details.

    Parameters:
        - template_src (str or Mako template) – a Mako template with the snippet code, or a string with the template source.
        - render_kwds – a dictionary which will be used to render the template. Can contain other modules and snippets.

    classmethod create(func_or_str, render_kwds=None)

        Creates a snippet from a Mako def:

        - if func_or_str is a function, the def has the same signature as func_or_str, and a body equal to the string the function returns;
        - if func_or_str is a string, the def has an empty signature.
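A short sketch of both forms (the snippet bodies are illustrative):

    from reikna.cluda import Snippet

    # Function form: a parent template instantiates the def with
    # arguments, e.g. ``${add('x[i]', 'y[i]')}``.
    add = Snippet.create(lambda a, b: "${a} + ${b}")

    # String form: the def takes no arguments.
    one = Snippet.create("1")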
reikna.cluda.any_api()

    Returns one of the API modules supported by the system, or raises an Exception if none are available.

reikna.cluda.api_ids()

    Returns a list of identifiers for all known APIs (not necessarily available on the current system).

reikna.cluda.cuda_api()

    Returns the PyCUDA-based API module.

reikna.cluda.cuda_id()

    Returns the identifier of the PyCUDA-based API.

reikna.cluda.find_devices(api, include_devices=None, exclude_devices=None, include_platforms=None, exclude_platforms=None, include_duplicate_devices=True, include_pure_only=False)

    Finds platforms and devices meeting certain criteria.

    Parameters:
        - api – a CLUDA API object.
        - include_devices – a list of masks for device names; only matching devices are included in the result.
        - exclude_devices – a list of masks for device names; matching devices are excluded from the result.
        - include_platforms – a list of masks for platform names; only matching platforms are included in the result.
        - exclude_platforms – a list of masks for platform names; matching platforms are excluded from the result.
        - include_duplicate_devices – if False, only one device out of several with the same name on a platform is included.
        - include_pure_only – if True, devices with a maximum group size equal to 1 are included.

    Returns: a dictionary with the found platform numbers as keys, and lists of device numbers as values.
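For example (the mask values below are illustrative):

    from reikna import cluda

    api = cluda.ocl_api()

    # Keep only devices whose name matches 'GeForce', skip the Apple
    # platform, and collapse devices with duplicate names.
    found = cluda.find_devices(
        api,
        include_devices=['GeForce'],
        exclude_platforms=['Apple'],
        include_duplicate_devices=False)

    print(found)  # e.g. {0: [0, 1]} - platform numbers to device numbers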
reikna.cluda.get_api(api_id)

    Returns an API module with the generalized interface reikna.cluda.api for the given identifier.

reikna.cluda.ocl_api()

    Returns the PyOpenCL-based API module.

reikna.cluda.ocl_id()

    Returns the identifier of the PyOpenCL-based API.

reikna.cluda.supported_api_ids()

    Returns a list of identifiers of supported APIs.

reikna.cluda.supports_api(api_id)

    Returns True if the given API is supported.
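These introspection functions combine as follows:

    from reikna import cluda

    # All known API identifiers, whether or not their backends are installed.
    print(cluda.api_ids())

    # Only the ones usable on this system.
    for api_id in cluda.supported_api_ids():
        assert cluda.supports_api(api_id)
        api = cluda.get_api(api_id)  # same module as cuda_api()/ocl_api()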
API module

Modules for all APIs have the same generalized interface. It is referred to here (and in references from other parts of this documentation) as reikna.cluda.api.
class reikna.cluda.api.Buffer

    Low-level untyped memory allocation. The actual class depends on the API: pycuda.driver.DeviceAllocation for CUDA and pyopencl.Buffer for OpenCL.

    size
class reikna.cluda.api.Array

    A subclass of the corresponding API's native array class (pycuda.gpuarray.GPUArray for CUDA and pyopencl.array.Array for OpenCL), with some additional functionality.

    shape

    dtype
class reikna.cluda.api.DeviceParameters(device)

    An assembly of device parameters necessary for optimizations.

    max_work_group_size

        Maximum block size for kernels.

    max_work_item_sizes

        List with the maximum local_size for each dimension.

    max_num_groups

        List with the maximum number of workgroups for each dimension.

    warp_size

        Warp size (NVIDIA), wavefront size (AMD), or SIMD width: the number of threads that are executed simultaneously on the same computation unit (so you can assume that they are perfectly synchronized).

    local_mem_banks

        Number of local (shared in CUDA) memory banks: the number of successive 32-bit words you can access without getting bank conflicts.

    local_mem_size

        Size of the local (shared in CUDA) memory per workgroup, in bytes.

    min_mem_coalesce_width

        Dictionary {word_size: elements}, where elements is the number of elements with size word_size in global memory that allow coalesced access.

    supports_dtype(dtype)

        Checks if the given numpy dtype can be used in kernels compiled using this thread.
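A quick way to inspect these values (a sketch; assumes a Thread ``thr`` created as in the earlier example — the printed numbers are device-dependent examples):

    params = thr.device_params
    print(params.max_work_group_size)  # e.g. 1024
    print(params.local_mem_size)       # e.g. 49152 (bytes)
    print(params.warp_size)            # e.g. 32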
class reikna.cluda.api.Platform

    A vendor-specific implementation of the GPGPU API.

    name

        Platform name.

    vendor

        Vendor name.

    version

        Platform version.

    get_devices()

        Returns a list of device objects available on the platform.
class reikna.cluda.api.Kernel(thr, program, name, static=False)

    An object containing a GPU kernel.

    max_work_group_size

        Maximum size of the work group for the kernel.

    __call__(*args, **kwds)

        A shortcut for a successive call to prepare() and prepared_call(). In case of the OpenCL backend, returns a pyopencl.Event object.

    prepare(global_size, local_size=None, local_mem=0)

        Prepares the kernel for execution with the given parameters.

        Parameters:
            - global_size – an integer or a tuple of integers, specifying the total number of work items to run.
            - local_size – an integer or a tuple of integers, specifying the size of a single work group. Should have the same number of dimensions as global_size. If None is passed, some local_size will be picked internally.
            - local_mem – (CUDA API only) amount of dynamic local memory (in bytes).
class reikna.cluda.api.Program(thr, src, static=False, fast_math=False, compiler_options=None, constant_arrays=None)

    An object with compiled GPU code.

    source

        Contains the module source code.
class reikna.cluda.api.StaticKernel(thr, template_src, name, global_size, local_size=None, render_args=None, render_kwds=None, fast_math=False, compiler_options=None, constant_arrays=None)

    An object containing a GPU kernel with fixed call sizes.

    source

        Contains the source code of the program.
class reikna.cluda.api.Thread(cqd, async_=True, temp_alloc=None)

    Wraps an existing context in the CLUDA thread object.

    Parameters:
        - cqd – a Context, Device or Stream/CommandQueue object to base on. If a context is passed, a new stream/queue will be created internally.
        - async_ – whether to execute all operations with this thread asynchronously (you would generally want to set it to False only for profiling purposes).

    Note: if you are using the CUDA API, you must keep in mind the stateful nature of CUDA calls. Briefly, this means that there is a context stack, with the current context on top of it. When create() is called, the PyCUDA context gets pushed onto the stack and made current. When the thread object goes out of scope (and the thread object owns the context), the context is popped, and it is the user's responsibility to make sure the popped context is the correct one. In simple single-context programs this only means that one should avoid reference cycles involving the thread object.

    Warning: do not pass one Stream/CommandQueue object to several Thread objects.

    api

        Module object representing the CLUDA API corresponding to this Thread.

    device_params

        Instance of the DeviceParameters class for this thread's device.

    temp_alloc

        Instance of the TemporaryManager which handles allocations of temporary arrays (see temp_array()).
    array(shape, dtype, strides=None, offset=0, allocator=None)

        Creates an Array on the GPU with the given shape, dtype, strides and offset. Optionally, allocator is a callable returning any object castable to int and representing a physical address on the device (for instance, a Buffer).
    compile(template_src, render_args=None, render_kwds=None, fast_math=False, compiler_options=None, constant_arrays=None)

        Creates a module object from the given template.

        Parameters:
            - template_src – Mako template source to render.
            - render_args – an iterable with positional arguments to pass to the template.
            - render_kwds – a dictionary with keyword parameters to pass to the template.
            - fast_math – whether to enable fast mathematical operations during compilation.
            - compiler_options – a list of strings to be passed to the compiler as arguments.
            - constant_arrays – (CUDA only) a dictionary {name: metadata} of constant memory arrays to be declared in the compiled program. metadata can be either an array-like object (possessing shape and dtype attributes), or a pair (shape, dtype).

        Returns: a Program object.
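        Putting compile(), to_device(), empty_like() and kernel invocation together (a sketch close to the CLUDA tutorial example):

            import numpy as np
            from reikna import cluda

            api = cluda.any_api()
            thr = api.Thread.create()

            program = thr.compile("""
            KERNEL void multiply_them(
                GLOBAL_MEM float *dest,
                GLOBAL_MEM float *a,
                GLOBAL_MEM float *b)
            {
                const SIZE_T i = get_global_id(0);
                dest[i] = a[i] * b[i];
            }
            """)

            a = np.random.randn(400).astype(np.float32)
            b = np.random.randn(400).astype(np.float32)
            a_dev = thr.to_device(a)
            b_dev = thr.to_device(b)
            dest_dev = thr.empty_like(a_dev)

            # Kernels are exposed as attributes of the Program object;
            # __call__ takes the array arguments plus global_size
            # (and optionally local_size).
            program.multiply_them(dest_dev, a_dev, b_dev, global_size=400)
            assert np.allclose(thr.from_device(dest_dev), a * b)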
    compile_static(template_src, name, global_size, local_size=None, render_args=None, render_kwds=None, fast_math=False, compiler_options=None, constant_arrays=None)

        Creates a kernel object with fixed call sizes, which allows one to overcome some backend limitations. Global and local sizes can have any length, provided that len(global_size) >= len(local_size), and the total number of work items and work groups is less than the corresponding total number available on the device. In order to get IDs and sizes in such kernels, the virtual size functions have to be used (see VIRTUAL_SKIP_THREADS and others for details).

        Parameters:
            - template_src – Mako template or a template source to render.
            - name – name of the kernel function.
            - global_size – global size to be used, in row-major order.
            - local_size – local size to be used, in row-major order. If None, some suitable one will be picked.
            - local_mem – (CUDA API only) amount of dynamically allocated local memory to be used (in bytes).
            - render_args – a list of parameters to be passed as positional arguments to the template.
            - render_kwds – a dictionary with additional parameters to be used while rendering the template.
            - fast_math – whether to enable fast mathematical operations during compilation.
            - compiler_options – a list of strings to be passed to the compiler as arguments.
            - constant_arrays – (CUDA only) a dictionary {name: metadata} of constant memory arrays to be declared in the compiled program. metadata can be either an array-like object (possessing shape and dtype attributes), or a pair (shape, dtype).

        Returns: a StaticKernel object.
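        A sketch of a static kernel (the kernel body is illustrative; ``thr`` as created earlier). It uses virtual_global_id(), one of the virtual size functions from the same family as virtual_global_flat_size() described below:

            import numpy as np

            double_them = thr.compile_static("""
            KERNEL void double_them(GLOBAL_MEM float *dest, GLOBAL_MEM float *src)
            {
                VIRTUAL_SKIP_THREADS;
                const VSIZE_T i = virtual_global_id(0);
                dest[i] = src[i] * 2;
            }
            """, 'double_them', global_size=(1000,))

            src_dev = thr.to_device(np.ones(1000, dtype=np.float32))
            dest_dev = thr.empty_like(src_dev)

            # Call sizes were fixed at compilation time, so only the
            # kernel arguments are passed.
            double_them(dest_dev, src_dev)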
    copy_array(arr, dest=None, src_offset=0, dest_offset=0, size=None)

        Copies an array on the device.

        Parameters:
            - dest – the effect is the same as in to_device().
            - src_offset – offset (in items of arr.dtype) in the source array.
            - dest_offset – offset (in items of arr.dtype) in the destination array.
            - size – how many elements of arr.dtype to copy.
    classmethod create(interactive=False, device_filters=None, **thread_kwds)

        Creates a new Thread object with its own context and queue inside. Intended for cases when you want to base your whole program on CLUDA.

        Parameters:
            - interactive – ask the user to choose a platform and a device from the ones found. If there is only one platform/device available, it will be chosen automatically.
            - device_filters – keywords to filter devices (see the keywords for find_devices()).
            - thread_kwds – keywords to pass to the Thread constructor (same as in Thread).
    empty_like(arr)

        Allocates an array on the GPU with the same attributes as arr.

    from_device(arr, dest=None, async_=False)

        Transfers the contents of arr to a numpy.ndarray object. The effect of the dest parameter is the same as in to_device(). If async_ is True, the transfer is asynchronous (the thread-wide asynchronicity setting does not apply here).

        Alternatively, one can use Array.get().
    release()

        Forcefully frees critical resources (rendering the object unusable). In most cases you can rely on the garbage collector taking care of things. Calling this method explicitly may be necessary with the CUDA API when you want to make sure the context got popped.

    synchronize()

        Forcefully synchronizes this thread with the main program.

    temp_array(shape, dtype, strides=None, offset=0, dependencies=None)

        Creates an Array on the GPU with the given shape, dtype, strides and offset. In order to reduce the memory footprint of the program, the temporary array manager will allow these arrays to overlap. Two arrays will not overlap if one of them was specified in dependencies for the other one. For a list of values dependencies can take, see the reference entry for TemporaryManager.
    to_device(arr, dest=None)

        Copies an array to the device memory. If dest is specified, it is used as the destination, and the method returns None. Otherwise the destination array is created internally and returned from the method.
reikna.cluda.api.get_id()

    Returns the identifier of this API.
Temporary Arrays

Each Thread contains a special allocator for arrays whose data does not have to be persistent all the time. In many cases you only want some array to keep its contents between several kernel calls. This can be achieved by manually allocating and deallocating such arrays every time, but that slows the program down, and you have to synchronize the queue because allocation commands are not serialized. It is therefore advantageous to use the temp_array() method to get such arrays. It takes an optional list of dependencies, which gives the allocator a hint about which arrays should not use the same physical allocation, as in the sketch below.
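A minimal sketch (array shapes and dtypes are illustrative; ``thr`` is a Thread created as in the earlier examples):

    import numpy as np

    # ``a_tmp`` and ``b_tmp`` must not share memory (e.g. they are the
    # input and output of the same kernel), so one is declared a
    # dependency of the other.
    a_tmp = thr.temp_array((1024,), np.float32)
    b_tmp = thr.temp_array((1024,), np.float32, dependencies=[a_tmp])

    # An array with no dependencies may share physical memory with
    # either of the two above.
    c_tmp = thr.temp_array((1024,), np.float32)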
class reikna.cluda.tempalloc.TemporaryManager(thr, pack_on_alloc=False, pack_on_free=False)

    Base class for a manager of temporary allocations.

    Parameters:
        - thr – an instance of Thread.
        - pack_on_alloc – whether to repack allocations when a new allocation is requested.
        - pack_on_free – whether to repack allocations when an allocation is freed.

    array(shape, dtype, strides=None, offset=0, dependencies=None)

        Returns a temporary array.

        Parameters:
            - shape – shape of the array.
            - dtype – data type of the array.
            - strides – tuple of bytes to step in each dimension when traversing the array.
            - offset – the array offset (in bytes).
            - dependencies – can be an Array instance (ones containing persistent allocations will be ignored), an iterable of valid values, or an object with a __tempalloc__ attribute which is a valid value (the last two are processed recursively).

    pack()

        Packs the real allocations, possibly reducing total memory usage. This process can be slow.
class reikna.cluda.tempalloc.TrivialManager(*args, **kwds)

    Trivial manager: allocates a separate buffer for each allocation request.

class reikna.cluda.tempalloc.ZeroOffsetManager(*args, **kwds)

    Tries to assign several allocation requests to a single real allocation, if dependencies allow it. All virtual allocations start from the beginning of real allocations.
Function modules

This module contains Module factories which are used to compensate for the lack of complex number operations in OpenCL, and the lack of C++ syntax which would allow one to write them.
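A typical use is to pass such a module in render_kwds and call the generated function from the kernel (a sketch close to the CLUDA tutorial; the kernel and array sizes are illustrative):

    import numpy as np
    from reikna import cluda
    from reikna.cluda import dtypes, functions

    api = cluda.any_api()
    thr = api.Thread.create()

    # ${mul} renders as the name of a generated function that
    # multiplies a complex number by a real one.
    program = thr.compile("""
    KERNEL void mul_by_real(
        GLOBAL_MEM ${c_ctype} *dest,
        GLOBAL_MEM ${c_ctype} *a,
        GLOBAL_MEM ${r_ctype} *b)
    {
        const SIZE_T i = get_global_id(0);
        dest[i] = ${mul}(a[i], b[i]);
    }
    """, render_kwds=dict(
        c_ctype=dtypes.ctype(np.complex64),
        r_ctype=dtypes.ctype(np.float32),
        mul=functions.mul(np.complex64, np.float32, out_dtype=np.complex64)))

    a = np.ones(128, dtype=np.complex64)
    b = np.full(128, 2, dtype=np.float32)
    a_dev = thr.to_device(a)
    b_dev = thr.to_device(b)
    dest_dev = thr.empty_like(a_dev)
    program.mul_by_real(dest_dev, a_dev, b_dev, global_size=128)
    assert np.allclose(thr.from_device(dest_dev), a * b)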
reikna.cluda.functions.add(*in_dtypes, out_dtype=None)

    Returns a Module with a function of len(in_dtypes) arguments that adds values of types in_dtypes. If out_dtype is given, it will be set as the return type for this function.

    This is necessary since on some platforms the + operator for a complex and a real number works in an unexpected way (returning (a.x + b, a.y + b) instead of (a.x + b, a.y)).
reikna.cluda.functions.cast(out_dtype, in_dtype)

    Returns a Module with a function of one argument that casts values of in_dtype to out_dtype.

reikna.cluda.functions.conj(dtype)

    Returns a Module with a function of one argument that conjugates a value of type dtype (must be a complex data type).

reikna.cluda.functions.div(in_dtype1, in_dtype2, out_dtype=None)

    Returns a Module with a function of two arguments that divides values of in_dtype1 and in_dtype2. If out_dtype is given, it will be set as the return type for this function.

reikna.cluda.functions.exp(dtype)

    Returns a Module with a function of one argument that exponentiates a value of type dtype (must be a real or complex data type).

reikna.cluda.functions.mul(*in_dtypes, out_dtype=None)

    Returns a Module with a function of len(in_dtypes) arguments that multiplies values of types in_dtypes. If out_dtype is given, it will be set as the return type for this function.

reikna.cluda.functions.norm(dtype)

    Returns a Module with a function of one argument that returns the 2-norm of a value of type dtype (the product with the complex conjugate if the value is complex, the square otherwise).

reikna.cluda.functions.polar(dtype)

    Returns a Module with a function of two arguments that returns the complex-valued rho * exp(i * theta) for values rho, theta of type dtype (must be a real data type).

reikna.cluda.functions.polar_unit(dtype)

    Returns a Module with a function of one argument that returns a complex number (cos(theta), sin(theta)) for a value theta of type dtype (must be a real data type).
reikna.cluda.functions.pow(dtype, exponent_dtype=None, output_dtype=None)

    Returns a Module with a function of two arguments that raises the first argument of type dtype to the power of the second argument of type exponent_dtype (an integer or real data type). If exponent_dtype or output_dtype are not given, they default to dtype. If dtype is not the same as output_dtype, the input is cast to output_dtype before exponentiation. If exponent_dtype is real, but both dtype and output_dtype are integer, a ValueError is raised.
Kernel toolbox

The names available to a kernel passed for compilation consist of two parts. First, there are several objects available at the template rendering stage, namely numpy, reikna.cluda.dtypes (as dtypes), and reikna.helpers (as helpers). Second, there is a set of macros attached to any kernel, depending on the API it is being compiled for:
CUDA

    If defined, specifies that the kernel is being compiled for the CUDA API.

COMPILE_FAST_MATH

    If defined, specifies that the compilation for this kernel was requested with fast_math == True.

LOCAL_BARRIER

    Synchronizes threads inside a block.

WITHIN_KERNEL

    Modifier for a device-only function declaration.

KERNEL

    Modifier for a kernel function declaration.

GLOBAL_MEM

    Modifier for a global memory pointer argument.

LOCAL_MEM

    Modifier for a statically allocated local memory variable.

LOCAL_MEM_DYNAMIC

    Modifier for a dynamically allocated local memory variable.

LOCAL_MEM_ARG

    Modifier for a local memory argument in device-only functions.

CONSTANT_MEM

    Modifier for a statically allocated constant memory variable.

CONSTANT_MEM_ARG

    Modifier for a constant memory argument in device-only functions.

INLINE

    Modifier for inline functions.

SIZE_T

    The type of local/global IDs and sizes. Equal to unsigned int for CUDA, and size_t for OpenCL (which can be a 32- or 64-bit unsigned integer, depending on the device).
SIZE_T get_global_size(int dim)

    Local, group and global identifiers and sizes. In the case of CUDA, these mimic the behavior of the corresponding OpenCL functions.
VSIZE_T

    The type of local/global IDs in the virtual grid. It is separate from SIZE_T because SIZE_T is intended to be equivalent to what the backend uses, while VSIZE_T is a separate type and can be made larger than SIZE_T in the future if necessary.
ALIGN(int)

    Used to specify an explicit alignment (in bytes) for fields in structures, e.g.:

        typedef struct {
            char ALIGN(4) a;
            int b;
        } MY_STRUCT;
VIRTUAL_SKIP_THREADS

    This macro should start any kernel compiled with compile_static(). It skips all the empty threads resulting from fitting the call parameters into the backend's limitations.
VSIZE_T virtual_global_flat_size()

    Only available in StaticKernel objects obtained from compile_static(). Since the virtual grid dimensions can differ from the actual call dimensions, these virtual size functions have to be used instead of the regular ones.
Datatype tools

This module contains various convenience functions which operate with numpy.dtype objects.
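A few representative calls (the commented outputs are what one would commonly expect; the c_constant output is an example):

    import numpy as np
    from reikna.cluda import dtypes

    print(dtypes.ctype(np.float32))         # 'float'
    print(dtypes.complex_for(np.float32))   # complex64
    print(dtypes.real_for(np.complex128))   # float64
    print(dtypes.is_complex(np.complex64))  # True
    print(dtypes.c_constant(np.float32(1.5)))  # e.g. '1.5f'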
reikna.cluda.dtypes.align(dtype)

    Returns a new struct dtype with the field offsets changed to the ones a compiler would use (without being given any explicit alignment qualifiers). Ignores all existing explicit itemsizes and offsets.

reikna.cluda.dtypes.c_constant(val, dtype=None)

    Returns a C-style numerical constant. If val has a struct dtype, the generated constant will have the form { ... } and can be used as an initializer for a variable.

reikna.cluda.dtypes.c_path(path)

    Returns a string corresponding to the path to a struct element in C. The path is a sequence of field names/array indices returned from flatten_dtype().
reikna.cluda.dtypes.cast(dtype)

    Returns a function that takes one argument and casts it to dtype.

reikna.cluda.dtypes.complex_ctr(dtype)

    Returns the name of the constructor for the given dtype.

reikna.cluda.dtypes.complex_for(dtype)

    Returns the complex dtype corresponding to the given floating point dtype.

reikna.cluda.dtypes.ctype(dtype)

    For a built-in C type, returns a string with the name of the type.
reikna.cluda.dtypes.ctype_module(dtype, ignore_alignment=False)

    For a struct type, returns a Module object with the typedef of a struct corresponding to the given dtype (with its name set to the module prefix); falls back to ctype() otherwise.

    The structure definition includes the alignment required to produce the field offsets specified in dtype; therefore, dtype must be either a simple type, or have proper offsets and dtypes (ones that can be reproduced in C using explicit alignment attributes, but without additional padding) and the attribute isalignedstruct == True. An aligned dtype can be produced either by standard means (the aligned flag in the numpy.dtype constructor and explicit offsets and itemsizes), or created out of an arbitrary dtype with the help of align().

    If ignore_alignment is True, all of the above is ignored. The C structures produced will not have any explicit alignment modifiers. As a result, the field offsets of dtype may differ from the ones chosen by the compiler.

    Modules are cached, and the function returns a single module instance for equal dtype's. Therefore inside a kernel it will be rendered with the same prefix everywhere it is used. This results in behavior characteristic of a structural type system, the same as for the basic dtype-ctype conversion.

    Warning: as of numpy 1.8, the isalignedstruct attribute is not enough to ensure a mapping between a dtype and a C struct with only the fields that are present in the dtype. Therefore, ctype_module will make some additional checks and raise ValueError if that is not the case.
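A sketch of the usual workflow with struct dtypes (the field names are illustrative):

    import numpy as np
    from reikna.cluda import dtypes

    # align() adjusts the offsets to the ones the C compiler would use.
    dtype = dtypes.align(np.dtype([
        ('val1', np.int32),
        ('val2', np.float32),
    ]))

    struct = dtypes.ctype_module(dtype)

    # The module can then be passed to a kernel template through
    # render_kwds=dict(struct=struct), where ${struct} renders as the
    # name of the corresponding typedef'd structure:
    #
    #     KERNEL void process(GLOBAL_MEM ${struct} *data) { ... }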
reikna.cluda.dtypes.detect_type(val)

    Finds out the data type of val.

reikna.cluda.dtypes.extract_field(arr, path)

    Extracts an element from an array of struct dtype. The path is a sequence of field names/array indices returned from flatten_dtype().

reikna.cluda.dtypes.flatten_dtype(dtype)

    Returns a list of tuples (path, dtype) for each of the basic dtypes in a (possibly nested) dtype. path is a list of field names/array indices leading to the corresponding element.
reikna.cluda.dtypes.is_complex(dtype)

    Returns True if dtype is complex.

reikna.cluda.dtypes.is_double(dtype)

    Returns True if dtype is double precision floating point.

reikna.cluda.dtypes.is_integer(dtype)

    Returns True if dtype is an integer.

reikna.cluda.dtypes.is_real(dtype)

    Returns True if dtype is real.
reikna.cluda.dtypes.min_scalar_type(val)

    Wrapper for numpy.min_scalar_type which takes into account the types supported by GPUs.
reikna.cluda.dtypes.normalize_type(dtype)

    Function for wrapping all dtypes coming from the user. numpy uses two different classes to represent dtypes, and one of them lacks some important attributes.

reikna.cluda.dtypes.normalize_types(dtypes)

    Same as normalize_type(), but operates on a list of dtypes.
reikna.cluda.dtypes.real_for(dtype)

    Returns the floating point dtype corresponding to the given complex dtype.

reikna.cluda.dtypes.result_type(*dtypes)

    Wrapper for numpy.result_type which takes into account the types supported by GPUs.

reikna.cluda.dtypes.zero_ctr(dtype)

    Returns a string with a constructed zero value for the given dtype.