Reikna is a library containing various GPU algorithms built on top of PyCUDA and PyOpenCL. Its main design goals are the separation of computation cores from simple transformations on their inputs and outputs, and the abstraction of the rest of the code from the differences between the CUDA and OpenCL APIs.
The installation is as simple as
$ pip install reikna
This section contains a brief illustration of what reikna does. For more details see basic and advanced tutorials.
CLUDA is an abstraction layer on top of PyCUDA/PyOpenCL. Its main purpose is to separate the rest of reikna from the differences between their APIs, but it can also be used by itself for some simple tasks.
Consider the following example, which is very similar to the one on the index page of the PyCUDA documentation:
import numpy
import reikna.cluda as cluda
N = 256
api = cluda.ocl_api()
thr = api.Thread.create()
program = thr.compile("""
KERNEL void multiply_them(
    GLOBAL_MEM float *dest,
    GLOBAL_MEM float *a,
    GLOBAL_MEM float *b)
{
    const SIZE_T i = get_local_id(0);
    dest[i] = a[i] * b[i];
}
""")
multiply_them = program.multiply_them
a = numpy.random.randn(N).astype(numpy.float32)
b = numpy.random.randn(N).astype(numpy.float32)
a_dev = thr.to_device(a)
b_dev = thr.to_device(b)
dest_dev = thr.empty_like(a_dev)
multiply_them(dest_dev, a_dev, b_dev, local_size=N, global_size=N)
print((dest_dev.get() - a * b == 0).all())
If you are familiar with PyCUDA or PyOpenCL, you will easily understand all the steps we have made here. The cluda.ocl_api() call is the only place where OpenCL is mentioned; replacing it with cluda.cuda_api() is enough to make the code use CUDA. The abstraction is achieved by using a generic API module on the Python side, and special macros (KERNEL, GLOBAL_MEM, and others) on the kernel side.
The argument of the compile() method can also be a template, which is quite useful for metaprogramming, and is also used to compensate for the lack of complex number operations in CUDA and OpenCL. Let us illustrate both scenarios by making the initial example multiply complex arrays. The template engine of choice in reikna is Mako, and you are encouraged to read about it as it is quite useful. For the purpose of this example all we need to know is that ${python_expression()} is a syntax construction which renders the result of the expression.
import numpy
from numpy.linalg import norm
from reikna import cluda
from reikna.cluda import functions, dtypes
N = 256
dtype = numpy.complex64
api = cluda.ocl_api()
thr = api.Thread.create()
program = thr.compile("""
KERNEL void multiply_them(
    GLOBAL_MEM ${ctype} *dest,
    GLOBAL_MEM ${ctype} *a,
    GLOBAL_MEM ${ctype} *b)
{
    const SIZE_T i = get_local_id(0);
    dest[i] = ${mul}(a[i], b[i]);
}
""",
render_kwds=dict(
    ctype=dtypes.ctype(dtype),
    mul=functions.mul(dtype, dtype)))
multiply_them = program.multiply_them
r1 = numpy.random.randn(N).astype(numpy.float32)
r2 = numpy.random.randn(N).astype(numpy.float32)
a = r1 + 1j * r2
b = r1 - 1j * r2
a_dev = thr.to_device(a)
b_dev = thr.to_device(b)
dest_dev = thr.empty_like(a_dev)
multiply_them(dest_dev, a_dev, b_dev, local_size=N, global_size=N)
print(norm(dest_dev.get() - a * b) / norm(a * b) <= 1e-6)
Note that the CLUDA Thread is created by means of a static method and not using the constructor. The constructor is reserved for the more probable scenario, where we want to include some reikna functionality in a larger program, and we want it to use the existing context and stream/queue (see the Thread constructor). In this case all further operations with the thread will be performed using the objects provided.
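For illustration, here is a minimal sketch of that second scenario (assuming PyOpenCL is installed; the context and queue stand in for ones created elsewhere in a larger program):

import pyopencl as cl
import reikna.cluda as cluda

# The context and queue that the larger program already uses.
ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

# Wrap the existing queue in a Thread instead of creating a new context.
api = cluda.ocl_api()
thr = api.Thread(queue)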
Here we have passed two values to the template: ctype (a string with the C type name), and mul, which is a Module object containing a single multiplication function. The object is created by the function mul(), which takes the data types being multiplied and returns a module parametrized accordingly. Inside the template the variable mul is essentially the prefix for all the global C objects (functions, structures, macros etc.) from the module. If there is only one public object in the module (which is recommended), it is common practice to give it a name consisting of just the prefix, so that it can be called easily from the parent code.
For more information on modules, see Tutorial: modules and snippets; the complete list of things available in CLUDA can be found in CLUDA reference.
Now it’s time for the main part of the functionality. reikna provides GPGPU algorithms in the form of Computation-based cores and Transformation-based plug-ins. Computations contain the algorithm itself; examples are matrix multiplication, reduction, sorting and so on. Transformations are parallel operations on inputs or outputs of computations, used for scaling, typecast and other auxiliary purposes. Transformations are compiled into the main computation kernel and are therefore quite cheap in terms of performance.
As an example, we will consider the matrix multiplication.
import numpy
from numpy.linalg import norm
import reikna.cluda as cluda
from reikna.linalg import MatrixMul
api = cluda.ocl_api()
thr = api.Thread.create()
shape1 = (100, 200)
shape2 = (200, 100)
a = numpy.random.randn(*shape1).astype(numpy.float32)
b = numpy.random.randn(*shape2).astype(numpy.float32)
a_dev = thr.to_device(a)
b_dev = thr.to_device(b)
res_dev = thr.array((shape1[0], shape2[1]), dtype=numpy.float32)
dot = MatrixMul(a_dev, b_dev, out_arr=res_dev)
dotc = dot.compile(thr)
dotc(res_dev, a_dev, b_dev)
res_reference = numpy.dot(a, b)
print(norm(res_dev.get() - res_reference) / norm(res_reference) < 1e-6)
Most of the code above should already be familiar, with the exception of the creation of the MatrixMul object. The computation constructor takes two array-like objects, representing the arrays that will participate in the computation. After that the computation object has to be compiled. The compile() method requires a Thread object, which serves as a source of data about the target API and device, and provides an execution queue.
Now imagine that you want to multiply complex matrices, but the real and imaginary parts of your data are kept in separate arrays. You could create additional kernels that would join your data into arrays of complex values, but this would require additional storage and additional calls to the GPU. The transformation API allows you to connect these transformations to the core computation — matrix multiplication — effectively adding the code into the main computation kernel and changing its signature.
Let us change the previous example and connect transformations to it.
import numpy
from numpy.linalg import norm
import reikna.cluda as cluda
from reikna.core import Type
from reikna.linalg import MatrixMul
from reikna.transformations import combine_complex
api = cluda.ocl_api()
thr = api.Thread.create()
shape1 = (100, 200)
shape2 = (200, 100)
a_re = numpy.random.randn(*shape1).astype(numpy.float32)
a_im = numpy.random.randn(*shape1).astype(numpy.float32)
b_re = numpy.random.randn(*shape2).astype(numpy.float32)
b_im = numpy.random.randn(*shape2).astype(numpy.float32)
arrays = [thr.to_device(x) for x in [a_re, a_im, b_re, b_im]]
a_re_dev, a_im_dev, b_re_dev, b_im_dev = arrays
a_type = Type(numpy.complex64, shape=shape1)
b_type = Type(numpy.complex64, shape=shape2)
res_dev = thr.array((shape1[0], shape2[1]), dtype=numpy.complex64)
dot = MatrixMul(a_type, b_type, out_arr=res_dev)
combine_a = combine_complex(a_type)
combine_b = combine_complex(b_type)
dot.parameter.matrix_a.connect(
    combine_a, combine_a.output, a_re=combine_a.real, a_im=combine_a.imag)
dot.parameter.matrix_b.connect(
    combine_b, combine_b.output, b_re=combine_b.real, b_im=combine_b.imag)
dotc = dot.compile(thr)
dotc(res_dev, a_re_dev, a_im_dev, b_re_dev, b_im_dev)
res_reference = numpy.dot(a_re + 1j * a_im, b_re + 1j * b_im)
print(norm(res_dev.get() - res_reference) / norm(res_reference) < 1e-6)
We have used a pre-created transformation combine_complex() from reikna.transformations for simplicity; developing a custom transformation is also possible and described in Writing a transformation. From the documentation we know that it transforms two inputs into one output; therefore we need to attach it to one of the inputs of dot (identified by its name), and provide names for two new inputs.
The names to attach to are obtained from the documentation for the particular computation; for MatrixMul these are output, matrix_a and matrix_b.
In the current example we have attached the transformations to both inputs. Note that the computation has a new signature now, and the compiled dot object now works with split complex numbers.
Modules and snippets are important primitives in CLUDA which are used in the rest of reikna, although mostly internally. Even if you do not write modules yourself, you will most likely use operations from the functions module, or common transformations from the transformations module, which are essentially snippet and module factories (callables returning Snippet and Module objects). Therefore it helps if you know how they work under the hood.
Snippets are Mako template defs (essentially functions returning rendered text) with an associated dictionary of render keywords. Some computations which are parametrized by custom code (for example, PureParallel) require this code to be provided in the form of a snippet with a certain call signature. When a snippet is used in a template, the result is quite straightforward: its template function is called, rendering and returning its contents, just as a normal Mako def.
Let us demonstrate it with a simple example. Consider the following snippet:
from reikna.cluda import Snippet

add = Snippet("""
<%def name="add(varname)">
${varname} + ${num}
</%def>
""",
render_kwds=dict(num=1))
Now we can compile a template which uses this snippet:
program = thr.compile("""
KERNEL void test(GLOBAL_MEM int *arr)
{
    const SIZE_T idx = get_global_id(0);
    int x = arr[idx];
    arr[idx] = ${add('x')};
}
""",
render_kwds=dict(add=add))
As a result, the code that gets compiled is
KERNEL void test(GLOBAL_MEM int *arr)
{
    const SIZE_T idx = get_global_id(0);
    int x = arr[idx];
    arr[idx] = x + 1;
}
If the snippet is used without parentheses (e.g. ${add}), it is equivalent to calling it without arguments (${add()}).
The root code that gets passed to compile() can be viewed as a snippet with an empty signature.
Modules are quite similar to snippets in the sense that they are also Mako defs with an associated dictionary of render keywords. The difference lies in the way they are processed. Consider a module containing a single function:
from reikna.cluda import Module

add = Module("""
<%def name="add(prefix, arg)">
WITHIN_KERNEL int ${prefix}(int x)
{
    return x + ${num} + ${arg};
}
</%def>
""",
render_kwds=dict(num=1))
Modules contain complete C entities (functions, macros, structures) and get rendered at the root level of the source file. In order to avoid name clashes, their def gets a string as its first argument, which it has to use to prefix these entities' names. If the module contains only one entity that is supposed to be used by the parent code, it is a good idea to give it the prefix itself as its name, to simplify its usage.
Let us now create a kernel that uses this module:
program = thr.compile("""
KERNEL void test(GLOBAL_MEM int *arr)
{
    const SIZE_T idx = get_global_id(0);
    int x = arr[idx];
    arr[idx] = ${add(2)}(x);
}
""",
render_kwds=dict(add=add))
Before the compilation the render keywords are inspected, and every module object encountered is rendered into a separate piece of source code with a newly generated prefix (plus whatever arguments it is called with in the parent template); its entry in render_kwds is replaced by that prefix.
With the code above, the rendered module will produce the code
WITHIN_KERNEL int _module0_(int x)
{
    return x + 1 + 2;
}
and the add keyword in the render_kwds gets its value changed to _module0_. Then the main code is rendered and appended to the previously rendered parts, giving
WITHIN_KERNEL int _module0_(int x)
{
    return x + 1 + 2;
}

KERNEL void test(GLOBAL_MEM int *arr)
{
    const SIZE_T idx = get_global_id(0);
    int x = arr[idx];
    arr[idx] = _module0_(x);
}
which is then passed to the compiler. If your module’s template def does not take any arguments except for prefix, you can call it in the parent template just as ${add} (without empty parentheses).
Warning
Note that add in this case is not a string, it is an object that has __str__() defined. If you want to concatenate a module prefix with some other string, you have to either call str() explicitly (str(add) + "abc"), or concatenate it inside a template (${add} abc).
Modules can reference snippets in their render_kwds, which, in turn, can reference other modules. This produces a tree-like structure with the snippet made from the code passed by the user at the root. When it is rendered, it is traversed depth-first, modules are extracted from it and arranged in a flat list in the order of appearance. Their positions in render_kwds are replaced by the assigned prefixes. This flat list is then rendered, producing a single source file which is fed to the compiler.
Note that if the same module object was used without arguments in several other modules or in the kernel itself, it will only be rendered once. Therefore one can create a “root” module with the data structure declaration and then use that structure in other modules without producing type errors on compilation.
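A sketch of this pattern (the modules below are illustrative, not part of reikna):

from reikna.cluda import Module

# A "root" module declaring a structure.
pair = Module.create("""
typedef struct ${prefix}_
{
    int first;
    int second;
} ${prefix};
""")

# A module referencing the root module through render_kwds.
# Since ``pair`` is rendered only once, every user of it sees the same
# prefix, and the structure is defined a single time in the source file.
make_pair = Module.create("""
WITHIN_KERNEL ${pair} ${prefix}(int first, int second)
{
    ${pair} result;
    result.first = first;
    result.second = second;
    return result;
}
""", render_kwds=dict(pair=pair))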
The amount of boilerplate code can be somewhat reduced by using Snippet.create and Module.create constructors. For the snippet above it would look like:
add = Snippet.create(
    lambda varname: "${varname} + ${num}",
    render_kwds=dict(num=1))
Note that the lambda here serves only to provide the information about the Mako def’s signature. Therefore it should return the template code regardless of the actual arguments passed.
If the argument list is created dynamically, you can use template_def() with a normal constructor:
argnames = ['varname']
add = Snippet(
    template_def(argnames, "${varname} + ${num}"),
    render_kwds=dict(num=1))
Modules have a similar shortcut constructor. The only difference is that by default the resulting template def has one positional argument called prefix. If you provide your own signature, its first positional argument will receive the prefix value.
add = Module.create("""
WITHIN_KERNEL int ${prefix}(int x)
{
    return x + ${num};
}
""",
render_kwds=dict(num=1))
Of course, both Snippet and Module constructors can take already created Mako defs, which is convenient if you keep templates in a separate file.
Sometimes you may want to pass a module or a snippet inside a template as an attribute of a custom object. In order for CLUDA to be able to discover and process it without modifying your original object, you need to make your object comply with a discovery protocol. The protocol method takes a processing function and is expected to return a new object of the same class with the processing function applied to all the attributes that may contain a module or a snippet. By default, objects of type tuple, list, and dict are discoverable.
For example:
class MyClass:

    def __init__(self, coeff, mul_module, div_module):
        self.coeff = coeff
        self.mul = mul_module
        self.div = div_module

    def __process_modules__(self, process):
        return MyClass(self.coeff, process(self.mul), process(self.div))
Modules were introduced to help split big kernels into small reusable pieces which in a CUDA or OpenCL program would be put into different source or header files. For example, a random number generator may be assembled from a function generating random integers, a function transforming these integers into random numbers with a certain distribution, and a PureParallel computation calling these functions and saving the results to global memory. The two functions can be extracted into separate modules, so that a user can call them from a custom kernel if the intermediate results do not need to be stored.
Going further with this example, one notices that functions producing randoms with sophisticated distributions are often based on simpler distributions. For instance, the commonly used Marsaglia algorithm for generating Gamma-distributed random numbers requires several uniformly and normally distributed randoms. Normally distributed randoms, in turn, require several uniformly distributed randoms — with a range that differs from the one for the uniformly distributed randoms used by the initial Gamma distribution. Instead of copy-pasting the function or setting its parameters dynamically (which in more complicated cases may affect performance), one just specifies the dependencies between modules and lets the underlying system handle things.
The final render tree may look like:
Snippet(
    PureParallel,
    render_kwds = {
        base_rng -> Snippet(...)
        gamma -> Snippet(
            Gamma,
            render_kwds = {
                uniform -> Snippet(...)
                normal -> Snippet(
                    Normal,
                    render_kwds = {
                        uniform -> Snippet(...)
                    })
            })
    })
All reikna computation classes are derived from the Computation class and therefore share the same API and behavior. A computation object is an opaque typed function-like object containing all the information necessary to generate GPU kernels that implement some algorithm, along with necessary internal temporary and persistent memory buffers. Before use it needs to be compiled by calling compile() for a given Thread (thus using its associated device and queue). This method returns a ComputationCallable object which takes GPU arrays and scalar parameters and calls its internal kernels.
One often needs to perform some simple processing of the input or output values of a computation. This can be scaling, splitting complex values into components, padding, and so on. Some of these operations require additional memory to store intermediate results, and all of them involve the additional overhead of calling a kernel and passing values to and from device memory. Reikna provides an API to define such transformations and attach them to “core” computations, effectively compiling the transformation code into the main kernel(s), thus avoiding all these drawbacks.
Before talking about transformations themselves, we need to take a closer look at computation signatures. Every Computation object has a signature attribute containing a funcsigs.Signature object. It is the same signature object as can be extracted from any Python function using the funcsigs.signature function (or inspect.signature from the standard library for Python >= 3.3). When the computation object is compiled, the resulting callable will have this exact signature.
The base signature for any computation can be found in its documentation (and, sometimes, can depend on the arguments passed to its constructor — see, for example, PureParallel). The signature can change if a user connects transformations to some parameter via connect(); in this case the signature attribute will change accordingly.
All attached transformations form a tree with roots being the base parameters computation has right after creation, and leaves forming the user-visible signature, which the compiled ComputationCallable will have.
As an example, let us consider a pure parallel computation object with one output, two inputs and a scalar parameter, which performs the calculation out = in1 + in2 + param:
from __future__ import print_function
import numpy
from reikna import cluda
from reikna.cluda import Snippet
from reikna.core import Transformation, Type, Annotation, Parameter
from reikna.algorithms import PureParallel
import reikna.transformations as transformations
arr_t = Type(numpy.float32, shape=128)
carr_t = Type(numpy.complex64, shape=128)
comp = PureParallel(
    [
        Parameter('out', Annotation(carr_t, 'o')),
        Parameter('in1', Annotation(carr_t, 'i')),
        Parameter('in2', Annotation(carr_t, 'i')),
        Parameter('param', Annotation(numpy.float32)),
    ],
    """
    VSIZE_T idx = ${idxs[0]};
    ${out.store_idx}(
        idx, ${in1.load_idx}(idx) + ${in2.load_idx}(idx) + ${param});
    """)
The details of creating the computation itself are not important for this example; they are provided here just for the sake of completeness. The initial transformation tree of comp object looks like:
   | out   | >>
>> | in1   |
>> | in2   |
>> | param |
Here the insides of || are the base computation (the one defined by the developer), and >> denote inputs and outputs provided by the user. The computation signature is:
>>> for param in comp.signature.parameters.values():
... print(param.name + ":" + repr(param.annotation))
out:Annotation(Type(complex64, shape=(128,), strides=(8,)), role='o')
in1:Annotation(Type(complex64, shape=(128,), strides=(8,)), role='i')
in2:Annotation(Type(complex64, shape=(128,), strides=(8,)), role='i')
param:Annotation(float32)
Now let us attach a transformation to the output which will split it into its real and imaginary parts: out1 = Re(out), out2 = Im(out):
tr = transformations.split_complex(comp.parameter.out)
comp.parameter.out.connect(tr, tr.input, out1=tr.real, out2=tr.imag)
We have used the pre-created transformation here for simplicity; writing custom transformations is described in Writing a transformation.
In addition, we want in2 to be scaled before being passed to the main computation. To achieve this, we connect the scaling transformation to it:
tr = transformations.mul_param(comp.parameter.in2, numpy.float32)
comp.parameter.in2.connect(tr, tr.output, in2_prime=tr.input, param2=tr.param)
The transformation tree now looks like:
   | out   | ----> out1 >>
   |       |   \-> out2 >>
>> | in1   |
>> in2_prime ----> | in2   |
>> param2 ------/  |       |
                   | param |
As can be seen, nothing has changed from the base computation's point of view: it still gets the same inputs and outputs. But the user-supplied parameters (>>) have changed, which can also be seen in the value of the signature:
>>> for param in comp.signature.parameters.values():
... print(param.name + ":" + repr(param.annotation))
out1:Annotation(Type(float32, shape=(128,), strides=(4,)), role='o')
out2:Annotation(Type(float32, shape=(128,), strides=(4,)), role='o')
in1:Annotation(Type(complex64, shape=(128,), strides=(8,)), role='i')
in2_prime:Annotation(Type(complex64, shape=(128,), strides=(8,)), role='i')
param2:Annotation(float32)
param:Annotation(float32)
Notice that the order of the final signature is obtained by traversing the transformation tree depth-first, starting from the base parameters. For more details see the note in the documentation for connect().
The resulting computation returns the value in1 + (in2_prime * param2) + param split into real and imaginary parts. In order to run it, we have to compile it first. When compile() is called, the data types and shapes of the given arguments will be propagated to the roots and used to prepare the original computation.
api = cluda.ocl_api()
thr = api.Thread.create()
in1_t = comp.parameter.in1
in2p_t = comp.parameter.in2_prime
out1 = thr.empty_like(comp.parameter.out1)
out2 = thr.empty_like(comp.parameter.out2)
in1 = thr.to_device(numpy.ones(in1_t.shape, in1_t.dtype))
in2_prime = thr.to_device(numpy.ones(in2p_t.shape, in2p_t.dtype))
c_comp = comp.compile(thr)
c_comp(out1, out2, in1, in2_prime, 4, 3)
The transformation mechanics have some limitations, described in the reference documentation.
This tutorial goes into more detail about the internals of computations and transformations, describing how to write them.
Reikna uses Mako extensively as a templating engine for transformations and computations. For the purpose of this tutorial you only need to know a few things about its syntax, such as the ${python_expression()} construction, which renders the result of the expression, and <%def name="..."> ... </%def>, which defines a template function.
Some common transformations are already available from transformations module. But you can create a custom one if you need to. Transformations are based on the class Transformation, and are very similar to PureParallel instances, with some additional limitations.
Let us consider a (not very useful, but quite involved) example:
tr = Transformation(
    [
        Parameter('out1', Annotation(Type(numpy.float32, shape=100), 'o')),
        Parameter('out2', Annotation(Type(numpy.float32, shape=80), 'o')),
        Parameter('in1', Annotation(Type(numpy.float32, shape=100), 'i')),
        Parameter('in2', Annotation(Type(numpy.float32, shape=100), 'i')),
        Parameter('param', Annotation(Type(numpy.float32))),
    ],
    """
    VSIZE_T idx = ${idxs[0]};
    float i1 = ${in1.load_same};
    float i2 = ${in2.load_idx}(100 - idx) * ${param};

    ${out1.store_same}(i1);
    if (idx < 80)
        ${out2.store_same}(i2);
    """,
    connectors=['in1', 'out1'])
Connectors. A transformation gets activated when the main computation attempts to load some value from some index in global memory, or store one to some index. This index is passed to the transformation attached to the corresponding parameter, and used to invoke loads/stores either without changes (to perform strictly elementwise operations), or, possibly, with some changes (as the example illustrates).
If a parameter is only queried once, and only using load_same or store_same, it is called a connector, which means that it can be used to attach the transformation to a computation. Currently connectors cannot be detected automatically, so it is the user's responsibility to provide a list of them to the constructor; if no list is provided, all parameters are considered to be connectors.
Shape changing. Parameters in transformations are typed, and it is possible to change the data type or shape of the parameter the transformation is attached to. In our example out2 has length 80, so the current index is checked before storing the output to make sure there is no out-of-bounds access.
Parameter objects. The transformation example above has some things hardcoded, for example the type of the parameters (float) and their shapes (100 and 80). These can be accessed from the argument objects out1, in1, etc.; they all have the type KernelParameter. In addition, the transformation code gets an Indices object with the name idxs, which allows one to manipulate index names directly.
A computation must be derived from the Computation class. As an example, let us create a computation which calculates output = input1 + input2 * param.
Defining a class:
import numpy
from reikna.helpers import *
from reikna.core import *
from reikna.cluda import functions

class TestComputation(Computation):
Each computation class has to define the constructor, and the plan building callback.
Constructor. The computation constructor takes a list of computation parameters, which the deriving class's constructor has to create according to the arguments passed to it. You will often need Type objects, which can be extracted from arrays, scalars or other Type objects with the help of from_value() (or they can be passed straight to Annotation, which does the same thing).
def __init__(self, arr, coeff):
    assert len(arr.shape) == 1
    Computation.__init__(self, [
        Parameter('output', Annotation(arr, 'o')),
        Parameter('input1', Annotation(arr, 'i')),
        Parameter('input2', Annotation(arr, 'i')),
        Parameter('param', Annotation(coeff))])
In addition to that, the constructor can create some internal state which will be used by the plan builder.
Plan builder. The second method is called when the computation is being compiled, and has to fill and return the computation plan — a sequence of kernel calls, plus maybe some temporary or persistent internal allocations its kernels use. In addition, the plan can include calls to nested computations.
The method takes two predefined positional parameters, plus KernelArgument objects corresponding to computation parameters. The plan_factory is a callable that creates a new ComputationPlan object (in some cases you may want to recreate the plan, for example, if the workgroup size you were using turned out to be too big), and device_params is a DeviceParameters object, which is used to optimize the computation for the specific device. The method must return a filled ComputationPlan object.
For our example we only need one action, which is the execution of an elementwise kernel:
def _build_plan(self, plan_factory, device_params, output, input1, input2, param):
    plan = plan_factory()

    template = template_from("""
    <%def name='testcomp(kernel_declaration, k_output, k_input1, k_input2, k_param)'>
    ${kernel_declaration}
    {
        VIRTUAL_SKIP_THREADS;
        const VSIZE_T idx = virtual_global_id(0);
        ${k_output.ctype} result =
            ${k_input1.load_idx}(idx) +
            ${mul}(${k_input2.load_idx}(idx), ${k_param});
        ${k_output.store_idx}(idx, result);
    }
    </%def>
    """)

    plan.kernel_call(
        template.get_def('testcomp'),
        [output, input1, input2, param],
        global_size=output.shape,
        render_kwds=dict(mul=functions.mul(input2.dtype, param.dtype)))

    return plan
Every kernel call is based on a separate Mako template def. The template can be specified as a string using template_def(), or loaded from a separate file. The usual pattern in this case is to name the template file after the file where the computation class is defined (for example, testcomp.mako for testcomp.py), and store it in a module-level variable on import using template_for(), as TEMPLATE = template_for(__file__).
The template def should take the same number of positional arguments as the kernel, plus one: you can view the <%def ...> part as the actual kernel definition, but with the arguments being KernelParameter objects containing parameter metadata. The first argument will contain the string with the kernel declaration.
Also, depending on whether the corresponding argument is an output array, an input array or a scalar parameter, the object can be used as ${obj.store_idx}(index, val), ${obj.load_idx}(index) or ${obj}. This will produce the corresponding request to the global memory or kernel arguments.
If you need additional device functions, they have to be specified between <%def ...> and ${kernel_declaration}. These functions can still use the dtype and ctype object properties, although store_idx and load_idx will most likely result in a compilation error (since they are rendered as macros using the main kernel arguments).
Since the kernel call parameters (global_size and local_size) are specified at creation time, all kernel calls are rendered as CLUDA static kernels (see compile_static()) and can therefore use all the corresponding macros and functions (like virtual_global_id() in our kernel). Also, they must have VIRTUAL_SKIP_THREADS at the beginning of the kernel, which skips the remainder threads (which can be present, for example, if the global size is not a multiple of the workgroup size).
This module contains information about the library version.
A tuple with version numbers, major components first.
A string fully identifying the current build.
A string with the Git SHA identifying the revision used to create this build.
A boolean variable, equal to True if the current version is a release version.
This module contains various auxiliary functions which are used throughout the library.
Returns the minimal number of the form 2**m such that it is greater than or equal to n.
Returns the list of pairs (factor, num/factor) for all factors of num (including 1 and num), sorted by factor. If limit is set, only pairs with factor <= limit are returned.
Context manager for ignoring integer overflow in numpy operations on scalars (not ignored by default because of a bug in numpy).
Integer-valued logarithm with base 2. If n is not a power of 2, the result is rounded down.
Returns the minimum number of blocks of length block necessary to cover an array of length length.
Returns the product of elements in the iterable seq.
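For illustration, assuming the helpers behave as described above:

from reikna.helpers import bounding_power_of_2, factors, log2, min_blocks, product

bounding_power_of_2(10)  # -> 16
factors(20)              # -> [(1, 20), (2, 10), (4, 5), (5, 4), (10, 2), (20, 1)]
log2(10)                 # -> 3 (rounded down)
min_blocks(19, 8)        # -> 3
product([2, 3, 4])       # -> 24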
Returns a Mako template with the given signature.
Parameters: signature – a list of positional argument names, or a Signature object from the funcsigs module; code – a body of the template.
Returns the Mako template object created from the file which has the same name as filename and the extension .mako. Typically used in computation modules as template_for(__file__).
Creates a Mako template object from a given string. If template already has render() method, does nothing.
If seq_or_elem is a sequence, converts it to a tuple, otherwise returns a tuple with a single element seq_or_elem.
CLUDA is the foundation of reikna. It provides unified access to the basic features of CUDA and OpenCL, such as memory operations, compilation and so on. It can also be used by itself, if you want to write GPU API-independent programs and happen to only need a small subset of the GPU API. The terminology is borrowed from OpenCL, since it is the more general API.
Contains a CLUDA module. See Tutorial: modules and snippets for details.
Creates a module from the Mako def.
Thrown by compile_static() if the provided local_size is too big, or one cannot be found.
Contains a CLUDA snippet. See Tutorial: modules and snippets for details.
Creates a snippet from the Mako def.
Returns one of the API modules supported by the system, or raises an Exception if there are none.
Returns a list of identifiers for all known (not necessarily available for the current system) APIs.
Returns the PyCUDA-based API module.
Returns the identifier of the PyCUDA-based API.
Find platforms and devices meeting certain criteria.
Returns a dictionary with found platform numbers as keys, and lists of device numbers as values.
Returns an API module with the generalized interface reikna.cluda.api for the given identifier.
Returns the PyOpenCL-based API module.
Returns the identifier of the PyOpenCL-based API.
Returns a list of identifiers of supported APIs.
Returns True if given API is supported.
Modules for all APIs have the same generalized interface. It is referred to here (and referenced from other parts of this documentation) as reikna.cluda.api.
Low-level untyped memory allocation. Actual class depends on the API: pycuda.driver.DeviceAllocation for CUDA and pyopencl.Buffer for OpenCL.
A superclass of the corresponding API’s native array (pycuda.gpuarray.GPUArray for CUDA and pyopencl.array.Array for OpenCL), with some additional functionality.
An assembly of device parameters necessary for optimizations.
Maximum block size for kernels.
List with maximum local_size for each dimension.
List with maximum number of workgroups for each dimension.
Warp size (NVIDIA), wavefront size (AMD), or SIMD width: the number of threads that are executed simultaneously on the same computation unit (so you can assume that they are perfectly synchronized).
The number of local (shared in CUDA) memory banks: the number of successive 32-bit words you can access without getting bank conflicts.
Size of the local (shared in CUDA) memory per workgroup, in bytes.
Dictionary {word_size:elements}, where elements is the number of elements with size word_size in global memory that allow coalesced access.
Checks if a given numpy dtype can be used in kernels compiled using this thread.
A vendor-specific implementation of the GPGPU API.
Platform name.
Vendor name.
Platform version.
Returns a list of device objects available in the platform.
An object containing a GPU kernel.
Maximum size of the work group for the kernel.
A shortcut for successive calls to prepare() and prepared_call().
Prepare the kernel for execution with given parameters.
An object with compiled GPU code.
Contains module source code.
An object containing a GPU kernel with fixed call sizes.
Contains the source code of the program.
Wraps an existing context in the CLUDA thread object.
Note
If you are using the CUDA API, you must keep in mind the stateful nature of CUDA calls. Briefly, this means that there is a context stack, with the current context on top of it. When create() is called, the PyCUDA context gets pushed onto the stack and made current. When the Thread object goes out of scope (and it owns the context), the context is popped, and it is the user's responsibility to make sure the popped context is the correct one. In simple single-context programs this only means that one should avoid reference cycles involving the Thread object.
Warning
Do not pass one Stream/CommandQueue object to several Thread objects.
Module object representing the CLUDA API corresponding to this Thread.
Instance of DeviceParameters class for this thread’s device.
Instance of TemporaryManager which handles allocations of temporary arrays (see temp_array()).
Creates an Array on the GPU with the given shape, dtype and strides. Optionally, allocator is a callable returning any object castable to int, representing the physical address on the device (for instance, Buffer).
Creates a module object from the given template.
Returns a Program object.
Creates a kernel object with fixed call sizes, which allows one to overcome some backend limitations. Global and local sizes can have any length, provided that len(global_size) >= len(local_size), and the total number of work items and work groups is less than the corresponding total number available for the device. In order to get IDs and sizes in such kernels, virtual size functions have to be used (see VIRTUAL_SKIP_THREADS and others for details).
Returns a StaticKernel object.
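A minimal sketch of its usage (assuming an existing Thread thr and a device array arr_dev of 1000 float32 values):

kernel = thr.compile_static("""
KERNEL void double_it(GLOBAL_MEM float *arr)
{
    VIRTUAL_SKIP_THREADS;
    const VSIZE_T idx = virtual_global_id(0);
    arr[idx] = arr[idx] * 2;
}
""", 'double_it', global_size=(1000,))

kernel(arr_dev)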
Copies an array on the device.
Creates a new Thread object with its own context and queue inside. Intended for cases when you want to base your whole program on CLUDA.
Allocates an array on GPU with the same attributes as arr.
Transfers the contents of arr to a numpy.ndarray object. The effect of the dest parameter is the same as in to_device(). If async is True, the transfer is asynchronous (the thread-wide asynchronicity setting does not apply here).
Alternatively, one can use Array.get().
Forcefully free critical resources (rendering the object unusable). In most cases you can rely on the garbage collector taking care of things. Calling this method explicitly may be necessary in case of CUDA API when you want to make sure the context got popped.
Forcefully synchronize this thread with the main program.
Creates an Array on GPU with given shape, dtype and strides. In order to reduce the memory footprint of the program, the temporary array manager will allow these arrays to overlap. Two arrays will not overlap, if one of them was specified in dependencies for the other one. For a list of values dependencies takes, see the reference entry for TemporaryManager.
Copies an array to the device memory. If dest is specified, it is used as the destination, and the method returns None. Otherwise the destination array is created internally and returned from the method.
Returns the identifier of this API.
Each Thread contains a special allocator for arrays with data that does not have to be persistent all the time. In many cases you only want some array to keep its contents between several kernel calls. This can be achieved by manually allocating and deallocating such arrays every time, but it slows the program down, and you have to synchronize the queue because allocation commands are not serialized. Therefore it is advantageous to use temp_array() method to get such arrays. It takes a list of dependencies as an optional parameter which gives the allocator a hint about which arrays should not use the same physical allocation.
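A short sketch (assuming an existing Thread thr and numpy imported):

# These two arrays may end up sharing the same physical allocation...
a = thr.temp_array((1024,), numpy.float32)
b = thr.temp_array((1024,), numpy.float32)

# ...while this one is guaranteed not to overlap with ``a``.
c = thr.temp_array((1024,), numpy.float32, dependencies=[a])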
Base class for a manager of temporary allocations.
Returns a temporary array.
Packs the real allocations, possibly reducing the total memory usage. This process can be slow.
Trivial manager — allocates a separate buffer for each allocation request.
Tries to assign several allocation requests to a single real allocation, if dependencies allow that. All virtual allocations start from the beginning of real allocations.
This module contains Module factories which are used to compensate for the lack of complex number operations in OpenCL, and the lack of C++ syntax which would allow one to write them.
Returns a Module with a function of len(in_dtypes) arguments that adds values of types in_dtypes. If out_dtype is given, it will be set as a return type for this function.
This is necessary since on some platforms the + operator for a complex and a real number works in an unexpected way (returning (a.x + b, a.y + b) instead of (a.x + b, a.y)).
Returns a Module with a function of one argument that casts values of in_dtype to out_dtype.
Returns a Module with a function of one argument that conjugates the value of type dtype (must be a complex data type).
Returns a Module with a function of two arguments that divides values of in_dtype1 and in_dtype2. If out_dtype is given, it will be set as a return type for this function.
Returns a Module with a function of one argument that exponentiates the value of type dtype (must be a real or complex data type).
Returns a Module with a function of len(in_dtypes) arguments that multiplies values of types in_dtypes. If out_dtype is given, it will be set as a return type for this function.
Returns a Module with a function of one argument that returns the 2-norm of the value of type dtype (product by the complex conjugate if the value is complex, square otherwise).
Returns a Module with a function of two arguments that returns the complex-valued rho * exp(i * theta) for values rho, theta of type dtype (must be a real data type).
Returns a Module with a function of one argument that returns a complex number (cos(theta), sin(theta)) for a value theta of type dtype (must be a real data type).
Returns a Module with a function of two arguments that raises the first argument of type dtype to the power of the second argument of type exponent_dtype (an integer or real data type). If exponent_dtype or output_dtype are not given, they default to dtype. If dtype is not the same as output_dtype, the input is cast to output_dtype before exponentiation. If exponent_dtype is real, but both dtype and output_dtype are integer, a ValueError is raised.
The stuff available for the kernel passed for compilation consists of two parts.
First, there are several objects available at the template rendering stage, namely numpy, reikna.cluda.dtypes (as dtypes), and reikna.helpers (as helpers).
Second, there is a set of macros attached to any kernel depending on the API it is being compiled for:
If defined, specifies that the kernel is being compiled for CUDA API.
If defined, specifies that the compilation for this kernel was requested with fast_math == True.
Synchronizes threads inside a block.
Modifier for a device-only function declaration.
Modifier for the kernel function declaration.
Modifier for the global memory pointer argument.
Modifier for the statically allocated local memory variable.
Modifier for the dynamically allocated local memory variable.
Modifier for the local memory argument in the device-only functions.
Modifier for inline functions.
The type of local/global IDs and sizes. Equal to unsigned int for CUDA, and size_t for OpenCL (which can be 32- or 64-bit unsigned integer, depending on the device).
Local, group and global identifiers and sizes. In the case of CUDA they mimic the behavior of the corresponding OpenCL functions.
The type of local/global IDs in the virtual grid. It is separate from SIZE_T because the former is intended to be equivalent to what the backend is using, while VSIZE_T is a separate type and can be made larger than SIZE_T in the future if necessary.
Used to specify an explicit alignment (in bytes) for fields in structures, as
typedef struct {
    char ALIGN(4) a;
    int b;
} MY_STRUCT;
This macro should start any kernel compiled with compile_static(). It skips all the empty threads resulting from fitting call parameters into backend limitations.
Only available in StaticKernel objects obtained from compile_static(). Since the virtual grid dimensions can differ from the actual call dimensions, these functions have to be used.
This module contains various convenience functions which operate with numpy.dtype objects.
Returns a new struct dtype with the field offsets changed to the ones a compiler would use (without being given any explicit alignment qualifiers). Ignores all existing explicit itemsizes and offsets.
Returns a C-style numerical constant. If val has a struct dtype, the generated constant will have the form { ... } and can be used as an initializer for a variable.
Returns a string corresponding to the path to a struct element in C. The path is the sequence of field names/array indices returned from flatten_dtype().
Returns function that takes one argument and casts it to dtype.
Returns name of the constructor for the given dtype.
Returns complex dtype corresponding to given floating point dtype.
For a built-in C type, returns a string with the name of the type.
For a struct type, returns a Module object with the typedef of a struct corresponding to the given dtype (with its name set to the module prefix); falls back to ctype() otherwise.
The structure definition includes the alignment required to produce the field offsets specified in dtype; therefore, dtype must be either a simple type, or have proper offsets and dtypes (the ones that can be reproduced in C using explicit alignment attributes, but without additional padding) and the attribute isalignedstruct == True. An aligned dtype can be produced either by standard means (the aligned flag in the numpy.dtype constructor and explicit offsets and itemsizes), or created out of an arbitrary dtype with the help of align().
If ignore_alignment is True, all of the above is ignored. The C structures produced will not have any explicit alignment modifiers. As a result, the field offsets of dtype may differ from the ones chosen by the compiler.
Modules are cached, and the function returns a single module instance for equal dtypes. Therefore inside a kernel it will be rendered with the same prefix everywhere it is used. This results in behavior characteristic of a structural type system, same as for the basic dtype-ctype conversion.
Warning
As of numpy 1.8, the isalignedstruct attribute is not enough to ensure a mapping between a dtype and a C struct with only the fields that are present in the dtype. Therefore, ctype_module will make some additional checks and raise ValueError if it is not the case.
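A short sketch of the intended usage (the struct layout here is illustrative):

import numpy
from reikna.cluda import dtypes

# An arbitrary struct dtype, aligned so that it maps to a C struct.
dtype = dtypes.align(numpy.dtype([('x', numpy.int32), ('y', numpy.float32)]))
ctype = dtypes.ctype_module(dtype)

# Passed to a kernel template as render_kwds=dict(ctype=ctype),
# ${ctype} renders as the prefix naming the generated struct typedef.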
Find out the data type of val.
Extracts an element from an array of struct dtype. The path is the sequence of field names/array indices returned from flatten_dtype().
Returns a list of tuples (path, dtype) for each of the basic dtypes in a (possibly nested) dtype. path is a list of field names/array indices leading to the corresponding element.
Returns True if dtype is complex.
Returns True if dtype is double precision floating point.
Returns True if dtype is an integer.
Returns True if dtype is a real.
Wrapper for numpy.min_scalar_type which takes into account types supported by GPUs.
Function for wrapping all dtypes coming from the user. numpy uses two different classes to represent dtypes, and one of them does not have some important attributes.
Same as normalize_type(), but operates on a list of dtypes.
Returns floating point dtype corresponding to given complex dtype.
Wrapper for numpy.result_type which takes into account types supported by GPUs.
Returns the string with constructed zero value for the given dtype.
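For illustration, assuming the semantics described above, some of these helpers behave as follows:

import numpy
from reikna.cluda import dtypes

dtypes.ctype(numpy.float32)         # 'float'
dtypes.complex_for(numpy.float32)   # numpy.complex64
dtypes.real_for(numpy.complex64)    # numpy.float32
dtypes.is_complex(numpy.complex64)  # True
dtypes.is_double(numpy.float64)     # True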
Classes necessary to create computations and transformations are exposed from the core module.
Represents an array or, as a degenerate case, scalar type of a computation parameter.
A tuple of integers. Scalars are represented by an empty tuple.
A numpy.dtype instance.
Tuple of bytes to step in each dimension when traversing an array.
Casts the given value to this type.
Computation parameter annotation, in the same sense as it is used for functions in the standard library.
Computation parameter, in the same sense as it is used for functions in the standard library. In its terms, all computation parameters have kind POSITIONAL_OR_KEYWORD.
Computation signature, in the same sense as it is used for functions in the standard library.
Parameters: parameters – a list of Parameter objects.
Binds passed positional and keyword arguments to parameters in the signature and returns the resulting BoundArguments object.
A base class for computations, intended to be subclassed.
Parameters: root_parameters – a list of Parameter objects.
A Signature object representing current computation signature (taking into account connected transformations).
A named tuple of ComputationParameter objects corresponding to parameters from the current signature.
Derived classes override this method. It is called by compile() and is supposed to return a ComputationPlan object.
Updates signature and parameter attributes. Called by the methods that change the signature.
Compiles the computation with the given Thread object and returns a ComputationCallable object. If fast_math is enabled, the compilation of all kernels is performed using the compiler options for fast and imprecise mathematical functions.
Connect a transformation to the computation.
Returns this computation object (modified).
Note
The resulting parameter order is determined by traversing the graph of connections depth-first (starting from the initial computation parameters), with the additional condition: the nodes do not change their order in the same branching level (i.e. in the list of computation or transformation parameters, both of which are ordered).
For example, consider a computation with parameters (a, b, c, d). If you connect a transformation (a', c) -> a, the resulting computation will have the signature (a', b, c, d) (as opposed to (a', c, b, d) it would have for the pure depth-first traversal).
A class containing a pure parallel transformation of arrays; some restrictions apply as compared to PureParallel computations.
Encapsulates the information about index variables available for the snippet.
Returns the name of the index variable for the dimension dim.
Returns the comma-separated list of all index variable names (useful for passing the guiding indices verbatim in a load or store call).
A result of calling compile() on a computation. Represents a callable opaque GPGPU computation.
Execute the computation.
Bases: Type
Represents a typed computation parameter. Can be used as a substitute of an array for functions which are only interested in array metadata.
Bases: Type
Represents an argument suitable to pass to planned kernel or computation call.
Computation plan recorder.
Adds a nested computation call. The computation value must be a Computation object. args and kwds are values to be passed to the computation.
Adds a kernel call to the plan.
Adds a persistent GPU array to the plan, and returns the corresponding KernelArgument.
Adds a temporary GPU array to the plan, and returns the corresponding KernelArgument. Temporary arrays can share physical memory, but in such a way that their contents are guaranteed to persist between the first and the last use in a kernel during the execution of the plan.
Same as temp_array(), taking the array properties from array or array-like object arr.
Bases: Type
Represents a typed transformation parameter. Can be used as a substitute of an array for functions which are only interested in array metadata.
Provides an interface for accessing kernel arguments in a template. Depending on the parameter type, and whether it is used inside a computation or a transformation template, it can have different load/store attributes available.
Parameter name
Returns the C kernel parameter name corresponding to this parameter. It is the only method available for scalar parameters.
A module providing a macro with the signature (idx0, idx1, ...), returning the corresponding element of the array.
A module providing a macro with the signature (idx0, idx1, ..., val), saving val into the specified position.
A module providing a macro with the signature (cidx0, cidx1, ...), returning the element of the array corresponding to the new slicing of indices (e.g. an array with shape (2, 3, 4, 5, 6) sliced as slices=(2, 2, 1) is indexed as an array with shape (6, 20, 6)).
A module providing a macro with the signature (cidx0, cidx1, ..., val), saving val into the specified position corresponding to the new slicing of indices.
A module providing a macro that returns the element of the array corresponding to the indices used by the caller of the transformation.
A module providing a macro with the signature (val) that stores val using the indices used by the caller of the transformation.
General purpose algorithms.
Bases: Computation
A general class for pure parallel computations (i.e. with no interaction between threads).
Parameters: args – corresponds to the given parameters.
Creates a PureParallel instance from a Transformation object. guiding_array can be a string with a name of an array parameter from trf, or the corresponding TransformationParameter object.
Bases: Computation
Changes the order of axes in a multidimensional array. Works analogously to numpy.transpose.
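A short usage sketch (assuming an existing Thread thr):

import numpy
from reikna.algorithms import Transpose

a = numpy.random.randn(100, 200).astype(numpy.float32)
a_dev = thr.to_device(a)
res_dev = thr.array((200, 100), numpy.float32)

transpose = Transpose(a_dev).compile(thr)
transpose(res_dev, a_dev)
assert numpy.allclose(res_dev.get(), a.T)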
A predicate used in Reduce.
Bases: Computation
Reduces the array over the given axis using the given binary operation.
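A short usage sketch (assuming an existing Thread thr): summing a two-dimensional array over its second axis.

import numpy
from reikna.algorithms import Reduce, predicate_sum

arr = numpy.random.randn(100, 200).astype(numpy.float32)
arr_dev = thr.to_device(arr)
res_dev = thr.array((100,), numpy.float32)

rd = Reduce(arr_dev, predicate_sum(numpy.float32), axes=(1,)).compile(thr)
rd(res_dev, arr_dev)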
Linear algebra algorithms.
Bases: Computation
Multiplies two matrices using last two dimensions and batching over remaining dimensions. For batching to work, the products of remaining dimensions should be equal (then the multiplication will be performed piecewise), or one of them should equal 1 (then the multiplication will be batched over the remaining dimensions of the other matrix).
Bases: Computation
Calculates the entrywise matrix norm (same as numpy.linalg.norm) of an arbitrary order \(r\): \(\Vert A \Vert_r = \left( \sum_i |a_i|^r \right)^{1/r}\).
Bases: Computation
Performs the Fast Fourier Transform. The interface is similar to numpy.fft.fftn. The inverse transform is normalized so that IFFT(FFT(X)) = X.
Note
The current algorithm works most effectively with array dimensions that are powers of 2. This mostly applies to the axes over which the transform is performed, because otherwise the computation falls back to Bluestein's algorithm, which effectively halves the performance.
output and input may be the same array.
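A short usage sketch (assuming an existing Thread thr):

import numpy
from reikna.fft import FFT

a = numpy.random.randn(256).astype(numpy.complex64)
a_dev = thr.to_device(a)
res_dev = thr.empty_like(a_dev)

fft = FFT(a_dev).compile(thr)
fft(res_dev, a_dev)                  # forward transform
fft(res_dev, res_dev, inverse=True)  # inverse transform, in place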
Bases: Computation
Shift the zero-frequency component to the center of the spectrum. The interface is similar to numpy.fft.fftshift, and the output is the same for the same array shape and axes.
output and input may be the same array.
Returns the spatial grid required to calculate the order-th power of a function defined in the harmonic mode space of size modes. If add_points is 0, the grid has the minimum size required for an exact transformation back to the mode space.
Returns an eigenfunction of order \(n = \mathrm{mode}\) for the harmonic oscillator:

\[\phi_n(x) = \frac{1}{\sqrt[4]{\pi} \sqrt{2^n n!}} H_n(x) \exp(-x^2/2),\]

where \(H_n\) is the \(n\)-th order “physicists’” Hermite polynomial. The normalization is chosen so that \(\int \phi_n^2(x) dx = 1\).
Bases: Computation
Discrete transform to and from harmonic oscillator modes. With inverse=True transforms a function defined by its expansion \(C_m,\,m=0 \ldots M-1\) in the mode space with mode functions from harmonic(), to the coordinate space (\(F(x)\) on the grid \(x\) from get_spatial_grid()). With inverse=False guarantees to recover first \(M\) modes of \(F^k(x)\), where \(k\) is the order parameter.
For multiple dimensions the operation is the same, and the mode functions are products of 1D mode functions, i.e. \(\phi_{l,m,n}^{3D}(x,y,z) = \phi_l(x) \phi_m(y) \phi_n(z)\).
For the detailed description of the algorithm, see Dion & Cances, PRE 67(4) 046706 (2003)
Depending on the value of inverse, either of these two will be created.
This module is based on the paper by Salmon et al., P. Int. C. High. Perform. 16 (2011), and the source code of the Random123 library.
A counter-based random-number generator (CBRNG) is a parametrized function \(f_k(c)\), where \(k\) is the key, \(c\) is the counter, and the function \(f_k\) defines a bijection in the set of integer numbers. Applied to successive counters, the function produces a sequence of pseudo-random numbers. The key is an analogue of the seed of stateful RNGs; if the CBRNG is used to generate random numbers in parallel threads, the key is a combination of a seed and a unique thread number.
There are two types of generators available: threefry (uses a large number of simple functions) and philox (uses a smaller number of more complicated functions). The latter is generally faster on GPUs; see the paper above for detailed comparisons. These generators can be further specialized to use words=2 or words=4 counters of bitness=32-bit or bitness=64-bit words. Obviously, the period of the generator equals the cardinality of the set of possible counters. For example, if the counter consists of 4 64-bit numbers, then the period of the generator is \(2^{256}\). As for the key size, in the case of threefry the key has the same size as the counter, and for philox the key is half its size.
The CBRNG class sets one of the words of the key (except for philox-2x64, where 32 bits of the only word in the key are used); the rest are the same for all threads and are derived from the provided seed. This limits the maximum number of number-generating threads (size). philox-2x32 has a 32-bit key and therefore cannot be used in CBRNG (although it can be used separately with the help of the kernel API).
The CBRNG class itself is stateless, same as the other computations in Reikna, so you have to manage the generator state yourself. The state is created by the create_counters() method and contains size counters. This state is then passed to, and updated by, a CBRNG object.
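A short sketch of this workflow (assuming an existing Thread thr):

import numpy
from reikna.cbrng import CBRNG

randoms = thr.array((1000,), numpy.float32)

# A convenience constructor using the normal_bm sampler with default parameters.
rng = CBRNG.normal_bm(randoms, 1)

counters = thr.to_device(rng.create_counters())
rngc = rng.compile(thr)
rngc(counters, randoms)  # advances the counters and fills ``randoms``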
Bases: Computation
Counter-based pseudo-random number generator class.
A convenience constructor for the sampler sampler_name from samplers. The contents of the dictionary sampler_kwds will be passed to the sampler constructor function (with bijection being created automatically, and dtype taken from randoms_arr).
Contains a CBRNG bijection module and accompanying metadata. Supports __process_modules__ protocol.
The data type of the integer word used by the generator.
The number of words used by the key.
The number of words used by the counter.
The numpy.dtype object representing a bijection key. Contains a single array field v with key_words of word_dtype elements.
The numpy.dtype object representing a bijection counter. Contains a single array field v with counter_words of word_dtype elements.
A dictionary dtype:function_name of the functions available in module that produce a random full-range integer of type dtype from a State, advancing it. Available functions: get_raw_uint32(), get_raw_uint64().
The module containing the CBRNG function. It provides the C functions below.
Contains the value of counter_words.
Contains the type corresponding to word_dtype.
Describes the bijection key. Alias for the structure generated from key_dtype.
Describes the bijection counter, or its output. Alias for the structure generated from counter_dtype.
Extracts a counter which has not been used in random sampling.
A type of unsigned 32-bit word, corresponds to numpy.uint32.
A type of unsigned 64-bit word, corresponds to numpy.uint64.
A CBRNG based on a low number of slow rounds (multiplications).
Returns a Bijection object.
A CBRNG based on a big number of fast rounds (bit rotations).
Returns a Bijection object.
Contains a random distribution sampler module and accompanying metadata. Supports __process_modules__ protocol.
If True, every sampled random number consumes the same amount of counters.
How many random numbers one call to sample creates.
The data type of one random value produced by the sampler.
The module containing the distribution sampling function. It provides the C functions below.
Contains the value of randoms_per_call.
Describes the sampling result.
Generates random numbers from the gamma distribution

\[P(x) = x^{k-1} \frac{e^{-x/\theta}}{\theta^k \Gamma(k)},\]

where \(k\) is shape, and \(\theta\) is scale. Supported dtypes: float(32/64). Returns a Sampler object.
Generates normally distributed random numbers with the mean mean and the standard deviation std using Box-Muller transform. Supported dtypes: float(32/64), complex(64/128). Produces two random numbers per call for real types and one number for complex types. Returns a Sampler object.
Note
In case of a complex dtype, std refers to the standard deviation of the complex numbers (same as numpy.std() returns), not real and imaginary components (which will be normally distributed with the standard deviation std / sqrt(2)). Consequently, while mean is of type dtype, std must be real.
Generates uniformly distributed floating-point numbers in the interval [low, high). Supported dtypes: float(32/64). A fixed number of counters is used in each thread. Returns a Sampler object.
Generates uniformly distributed integer numbers in the interval [low, high). If high is None, the interval is [0, low). Supported dtypes: any numpy integers. If the size of the interval is a power of 2, a fixed number of counters is used in each thread. Returns a Sampler object.
Contains a key generator module and accompanying metadata. Supports __process_modules__ protocol.
A module with the key generator function.
Generates and returns a key, suitable for the bijection which was given to the constructor.
Creates a generator.
Reference function that returns the key given the thread identifier. Uses the same algorithm as the module.
This module contains a number of pre-created transformations.
Returns an addition transformation with a fixed parameter (1 output, 1 input): output = input + param.
Returns an addition transformation with a dynamic parameter (1 output, 1 input, 1 scalar): output = input + param.
Returns a transformation that broadcasts the given constant to the array output (1 output): output = val.
Returns a transformation that broadcasts the free parameter to the array output (1 output, 1 param): output = param.
Returns a transformation that joins two real inputs into complex output (1 output, 2 inputs): output = real + 1j * imag.
Returns an identity transformation (1 output, 1 input): output = input. Output array type out_arr_t may have different strides, but must have the same shape and data type.
Returns a transformation that ignores the output it is attached to.
Returns a scaling transformation with a fixed parameter (1 output, 1 input): output = input * param.
Returns a scaling transformation with a dynamic parameter (1 output, 1 input, 1 scalar): output = input * param.
Returns a transformation that calculates the order-norm (1 output, 1 input): output = abs(input) ** order.
Returns a transformation that calculates the order-norm (1 output, 1 input, 1 param): output = abs(input) ** order.
Returns a transformation that splits complex input into two real outputs (2 outputs, 1 input): real = Re(input), imag = Im(input).