- CHANGED: renamed `power_dtype` parameter to `exponent_dtype` (a more correct term) in `pow()`.
- FIXED: (PR by @ringw) exception caused by printing CUDA program object.
- FIXED: `pow()` called with (0, 0) now returns 1, as it should.
- ADDED: an example of `FFT` with a custom transformation.
- ADDED: a type check in the `FFT` constructor.
- ADDED: an explicit `output_dtype` parameter for `pow()`.
- ADDED: `Array` objects for each backend expose the attribute `thread`.
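
The `pow()` fix above relies on the convention that zero raised to the zeroth power is 1. A minimal pure-Python sketch of integer exponentiation by squaring shows why that convention falls out naturally. The function name `int_pow` is hypothetical and this is not the library's kernel code:

```python
def int_pow(base, exponent):
    """Integer power by repeated squaring (illustrative sketch only).

    The accumulator starts at 1 (the empty product), so any base raised
    to the power 0, including 0, yields 1.
    """
    if exponent < 0:
        raise ValueError("negative exponents are not handled in this sketch")
    result = 1
    while exponent > 0:
        if exponent & 1:
            # The current binary digit of the exponent is set: multiply in.
            result *= base
        base *= base
        exponent >>= 1
    return result
```

With a zero exponent the loop body never runs, so the initial value 1 is returned unchanged.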

- FIXED: (@schreon) a bug preventing the usage of `EntrywiseNorm` with custom `axes`.
- FIXED: (PR by @SyamGadde) removed syntax constructions incompatible with Python 2.6.
- FIXED: added Python 3.4 to the list of classifiers.

- ADDED: `pow()` function module in CLUDA.
- ADDED: a function `any_api()` that returns some supported GPGPU API module.
- ADDED: an example of `Reduce` with a custom data type.
- FIXED: a Py3 compatibility issue in `Reduce` introduced in `0.6.1`.
- FIXED: a bug due to the interaction between the implementation of `from_trf()` and the logic of processing nested computations.
- FIXED: a bug in `FFT` leading to undefined behavior on some OpenCL platforms.

- FIXED: `Reduce` can now pick a decreased work group size if the attached transformations are too demanding.

- CHANGED: some computations were moved to sub-packages: `PureParallel`, `Transpose` and `Reduce` to `reikna.algorithms`, `MatrixMul` and `EntrywiseNorm` to `reikna.linalg`.
- CHANGED: `scale_const` and `scale_param` were renamed to `mul_const()` and `mul_param()`, and the scalar parameter name of the latter was renamed from `coeff` to `param`.
- ADDED: two transformations for norm of an arbitrary order: `norm_const()` and `norm_param()`.
- ADDED: stub transformation `ignore()`.
- ADDED: broadcasting transformations `broadcast_const()` and `broadcast_param()`.
- ADDED: addition transformations `add_const()` and `add_param()`.
- ADDED: `EntrywiseNorm` computation.
- ADDED: support for multi-dimensional sub-arrays in `c_constant()` and `flatten_dtype()`.
- ADDED: helper functions `extract_field()` and `c_path()` to work in conjunction with `flatten_dtype()`.
- ADDED: a function module `add()`.
- FIXED: casting a coefficient in the `normal_bm()` template to a correct dtype.
- FIXED: `cast()` avoids casting if the value already has the target dtype (since `numpy.cast` does not work with struct dtypes, see issue #4148).
- FIXED: an error in transformation module rendering for scalar parameters with struct dtypes.
- FIXED: normalizing dtypes in several functions from `dtypes` to avoid errors with `numpy` dtype shortcuts.

- ADDED: `normal_bm()` now supports complex dtypes.
- FIXED: a nested `PureParallel` can now take several identical argument objects as arguments.
- FIXED: a nested computation can now take a single input/output argument (e.g. a temporary array) as separate input and output arguments.
- FIXED: a critical bug in `CBRNG` that could lead to the counter array not being updated.
- FIXED: convenience constructors of `CBRNG` can now properly handle `None` as `samplers_kwds`.

- FIXED: a possible infinite loop in `compile_static()` local size finding algorithm.

- CHANGED: `KernelParameter` is not derived from `Type` anymore (although it still retains the corresponding attributes).
- CHANGED: `Predicate` now takes a dtype’d value as `empty`, not a string.
- CHANGED: the logic of processing struct dtypes was reworked, and `adjust_alignment` was removed. Instead, one should use `align()` (which does not take a `Thread` parameter) to get a dtype with the offsets and itemsize equal to those a compiler would set. On the other hand, `ctype_module()` attempts to set the alignments such that the field offsets are the same as in the given numpy dtype (unless the `ignore_alignments` flag is set).
- ADDED: struct dtypes support in `c_constant()`.
- ADDED: `flatten_dtype()` helper function.
- ADDED: `transposed_a` and `transposed_b` keyword parameters to `MatrixMul`.
- ADDED: algorithm cascading to `Reduce`, leading to a 3-4 times increase in performance.
- ADDED: `polar_unit()` function module in CLUDA.
- ADDED: support for arrays with 0-dimensional shape as computation and transformation arguments.
- FIXED: a bug in `Reduce`, which led to incorrect results in cases when the reduction power is exactly equal to the maximum one.
- FIXED: `Transpose` now works correctly for struct dtypes.
- FIXED: `bounding_power_of_2` now correctly returns `1` instead of `2` when given `1` as an argument.
- FIXED: `compile_static()` local size finding algorithm is much less prone to failure now.
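
The `bounding_power_of_2` entry above concerns an edge case that is easy to get wrong. As a sketch, assuming the function is meant to return the smallest power of two that is not less than its argument (the entry implies this but does not state it), a stand-alone version could look as follows. The name `bounding_power_of_2_sketch` is hypothetical and this is not the library's source:

```python
def bounding_power_of_2_sketch(num):
    """Smallest power of two greater than or equal to num (num >= 1).

    Illustrative only; the semantics are an assumption based on the
    changelog entry.
    """
    if num < 1:
        raise ValueError("num must be a positive integer")
    # (num - 1).bit_length() is 0 for num == 1, so the result is 2**0 == 1.
    # A naive "shift until larger than num" loop would return 2 here.
    return 1 << (num - 1).bit_length()
```

Exact powers of two map to themselves, and every other value is rounded up to the next power.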

- CHANGED: `supports_dtype()` method moved from `Thread` to `DeviceParameters`.
- CHANGED: `fast_math` keyword parameter moved from `Thread` constructor to `compile()` and `compile_static()`. It is also `False` by default, instead of `True`. Correspondingly, `THREAD_FAST_MATH` macro was renamed to `COMPILE_FAST_MATH`.
- CHANGED: CBRNG modules are using the dtype-to-ctype support. Correspondingly, the C types for keys and counters can be obtained by calling `ctype_module()` on `key_dtype` and `counter_dtype` attributes. The module wrappers still define their types, but their names are using a different naming convention now.
- ADDED: module generator for nested dtypes (`ctype_module()`) and a function to get natural field offsets for a given API/device (`adjust_alignment`).
- ADDED: `fast_math` keyword parameter in `compile()`. In other words, now `fast_math` can be set per computation.
- ADDED: `ALIGN` macro is available in CLUDA kernels.
- ADDED: support for struct types as `Computation` arguments (for them, the `ctypes` attributes contain the corresponding module obtained with `ctype_module()`).
- ADDED: support for non-sequential axes in `Reduce`.
- FIXED: bug in the interactive `Thread` creation (reported by James Bergstra).
- FIXED: Py3-incompatibility in the interactive `Thread` creation.
- FIXED: some code paths in virtual size finding algorithm could result in a type error.
- FIXED: improved the speed of test collection by reusing `Thread` objects.
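
To illustrate what "natural field offsets" means in the entries above, here is a small sketch of the layout rule used by typical C compilers: each field starts at the next multiple of its alignment, and the total size is padded to the largest alignment. The function name `natural_layout` and the field descriptions are hypothetical. Actual GPU compilers may apply different rules, and this is not how the library computes offsets:

```python
def natural_layout(fields):
    """Compute field offsets and total size for a C-like struct.

    `fields` is a list of (name, size_in_bytes, alignment_in_bytes).
    Returns (offsets, itemsize), where offsets maps names to byte offsets.
    """
    offsets = {}
    position = 0
    max_alignment = 1
    for name, size, alignment in fields:
        # Round the current position up to the next multiple of `alignment`.
        position = -(-position // alignment) * alignment
        offsets[name] = position
        position += size
        max_alignment = max(max_alignment, alignment)
    # Pad the struct so that consecutive array elements stay aligned.
    itemsize = -(-position // max_alignment) * max_alignment
    return offsets, itemsize
```

For example, a 1-byte field followed by a 4-byte field and a 2-byte field gets three bytes of padding after the first field and two bytes of trailing padding.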

- ADDED: the first argument to the `Transformation` or `PureParallel` snippet is now a `reikna.core.Indices` object instead of a list.
- ADDED: classmethod `PureParallel.from_trf()`, which allows one to create a pure parallel computation out of a transformation.
- FIXED: improved `Computation.compile()` performance for complicated computations by precreating transformation templates.

- FIXED: bug with virtual size algorithms returning floating point global and local sizes in Py2.

- CHANGED: virtual sizes algorithms were rewritten and are now more maintainable. In addition, virtual sizes can now handle any number of dimensions of local and global size, provided the device can support the corresponding total number of work items and groups.
- CHANGED: id- and size-getting kernel functions now have return types corresponding to their equivalents in the backend API. Virtual size functions have their own independent return type.
- CHANGED: `Thread.compile_static()` and `ComputationPlan.kernel_call()` take global and local sizes in the row-major order, to correspond to the matrix indexing in load/store macros.
- FIXED: requirements for PyCUDA extras (a currently non-existent version was specified).
- FIXED: an error in gamma distribution sampler, which led to a slightly wrong shape of the resulting distribution.
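
Row-major order, mentioned above for global and local sizes, means that the last index varies fastest, as in C arrays and matrix indexing `a[row][column]`. The following sketch only illustrates that ordering convention. The function name `row_major_flat_index` is hypothetical and it says nothing about how the library maps sizes to device dimensions:

```python
def row_major_flat_index(indices, shape):
    """Flatten a multi-dimensional index using row-major ordering.

    The last index changes fastest: for shape (rows, columns) the
    element (r, c) lands at position r * columns + c.
    """
    if len(indices) != len(shape):
        raise ValueError("indices and shape must have the same length")
    flat = 0
    for index, extent in zip(indices, shape):
        if not 0 <= index < extent:
            raise IndexError("index out of range")
        # Horner-style accumulation over the dimensions, outermost first.
        flat = flat * extent + index
    return flat
```

With a shape of `(3, 4)`, moving one step along the last axis advances the flat index by 1, while moving one step along the first axis advances it by 4.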

- FIXED: package metadata.

- ADDED: the same module object, when being called without arguments from other modules/snippets, is rendered only once and returns the same prefix each time. This allows one to create structure declarations that can be used by functions in several modules.
- ADDED: reworked `cbrng` module and exposed kernel interface of bijections and samplers.
- CHANGED: slightly changed the algorithm that determines the order of computation parameters after a transformation is connected to it. Now the ordering inside the list of initial computation parameters or the list of parameters of a single transformation is preserved.
- CHANGED: kernel declaration string is now passed explicitly to a kernel template as the first parameter.
- FIXED: typo in FFT performance test.
- FIXED: bug in FFT that could result in changing the contents of the input array to one of the intermediate results.
- FIXED: missing data type normalization in `c_constant()`.
- FIXED: Py3 incompatibility in `cluda.cuda`.
- FIXED: updated some obsolete computation docstrings.

- FIXED: too strict array type check for nested computations that caused some tests to fail.
- FIXED: default values of scalar parameters are now processed correctly.
- FIXED: Mako threw name-not-found exceptions on some list comprehensions in FFT template.
- FIXED: some earlier-introduced errors in tests.
- INTERNAL: `pylint` was run and many stylistic errors were fixed.

Major core API change:

- Computations have function-like signatures with the standard `Signature` interface; no more separation of inputs/outputs/scalars.
- Generic transformations were ditched; all the transformations have static types now.
- Transformations can now change array shapes, and load/store from/to external arrays in output/input transformations.
- No flat array access in kernels; all access goes through indices. This opens the road for correct and automatic stride support (not fully implemented yet).
- Computations and accompanying classes are stateless, and their creation is more straightforward.

Other stuff:

- Bumped Python requirements to >=2.6 or >=3.2, and added a dependency on `funcsig`.
- ADDED: more tests for cluda.functions.
- ADDED: module/snippet attributes discovery protocol for custom objects.
- ADDED: strides support to array allocation functions in CLUDA.
- ADDED: modules can now take positional arguments on instantiation, same as snippets.
- CHANGED: `Elementwise` becomes `PureParallel` (as it is not always elementwise).
- FIXED: incorrect behavior of functions.norm() for non-complex arguments.
- FIXED: undefined variable in functions.exp() template (reported by Thibault North).
- FIXED: inconsistent block/grid shapes in static kernels.

- ADDED: ability to introduce new scalar arguments for nested computations (the API is quite ugly at the moment).
- FIXED: handling prefixes properly when connecting transformations to nested computations.
- FIXED: bug in dependency inference algorithm which caused it to ignore allocations in nested computations.

- ADDED: explicit `release()` (primarily for certain rare CUDA use cases).
- CHANGED: CLUDA API discovery interface (see the documentation).
- CHANGED: the part of CLUDA API that is supposed to be used by other layers was moved to the `__init__.py`.
- CHANGED: CLUDA `Context` was renamed to `Thread`, to avoid confusion with `PyCUDA`/`PyOpenCL` contexts.
- CHANGED: signature of `create()`; it can filter devices now, and supports interactive mode.
- CHANGED: `Module` with `snippet=True` is now `Snippet`.
- FIXED: added `transformation.mako` and `cbrng_ref.py` to the distribution package.
- FIXED: incorrect parameter generation in `test/cluda/cluda_vsizes/ids`.
- FIXED: skipping testcases with incompatible parameters in `test/cluda/cluda_vsizes/ids` and `sizes`.
- FIXED: setting the correct length of `max_num_groups` in case of CUDA and a device with CC < 2.
- FIXED: typo in `cluda.api_discovery`.

- ADDED: ability to use custom argument names in transformations.
- ADDED: multi-argument `mul()`.
- ADDED: counter-based random number generator `CBRNG`.
- ADDED: `reikna.elementwise.Elementwise` now supports argument dependencies.
- ADDED: Module support in CLUDA; see *Tutorial: modules and snippets* for details.
- ADDED: `template_def()`.
- CHANGED: `reikna.cluda.kernel.render_template_source` is the main renderer now.
- CHANGED: `FuncCollector` class was removed; functions are now used as common modules.
- CHANGED: all templates created with `template_for()` are now rendered with `from __future__ import division`.
- CHANGED: signature of `OperationRecorder.add_kernel` takes a renderable instead of a full template.
- CHANGED: `compile_static()` now takes a template instead of a source.
- CHANGED: `reikna.elementwise.Elementwise` now uses modules.
- FIXED: potential problem with local size finding in static kernels (first approximation for the maximum workgroup size was not that good).
- FIXED: some OpenCL compilation warnings caused by an incorrect version querying macro.
- FIXED: bug with incorrect processing of scalar global size in static kernels.
- FIXED: bug in variance estimates in CBRNG tests.
- FIXED: error in the temporary variable type in `reikna.cluda.functions.polar()` and `reikna.cluda.functions.exp()`.

- FIXED: function names for kernel `polar()`, `exp()` and `conj()`.
- FIXED: added forgotten kernel `norm()` handler.
- FIXED: bug in `Py.Test` testcase execution hook which caused every test to run twice.
- FIXED: bug in nested computation processing for computation with more than one kernel.
- FIXED: added dependencies between `MatrixMul` kernel arguments.
- FIXED: taking into account dependencies between input and output arrays as well as the ones between internal allocations (necessary for nested computations).
- ADDED: discrete harmonic transform `DHT` (calculated using Gauss-Hermite quadrature).

- Added FFT computation (slightly optimized PyFFT version + Bluestein’s algorithm for non-power-of-2 FFT sizes)
- Added Python 3 compatibility
- Added Thread-global automatic memory packing
- Added polar(), conj() and exp() functions to kernel toolbox
- Changed name because of the clash with another Tigger.
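
The FFT entry above mentions Bluestein's algorithm for sizes that are not powers of two. The idea is to rewrite a length-n DFT as a convolution with a "chirp" sequence, which can then be evaluated with power-of-2 FFTs of a padded length. The sketch below is a plain-Python illustration of that identity with hypothetical function names. It is not the library's GPU implementation and makes no attempt at efficiency:

```python
import cmath


def fft_pow2(values):
    """Recursive radix-2 FFT; len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft_pow2(values[0::2])
    odd = fft_pow2(values[1::2])
    result = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        result[k] = even[k] + twiddle
        result[k + n // 2] = even[k] - twiddle
    return result


def ifft_pow2(values):
    """Inverse of fft_pow2, using ifft(v) = conj(fft(conj(v))) / n."""
    n = len(values)
    conjugated = fft_pow2([v.conjugate() for v in values])
    return [v.conjugate() / n for v in conjugated]


def bluestein_dft(values):
    """DFT of arbitrary length n via Bluestein's chirp identity.

    Uses j*k = (j**2 + k**2 - (k - j)**2) / 2 to turn the DFT sum into
    a convolution with the conjugated chirp.
    """
    n = len(values)
    # chirp[k] = exp(-i*pi*k^2/n); k*k is reduced modulo 2n, which leaves
    # the value unchanged but keeps the angle small for large k.
    chirp = [cmath.exp(-1j * cmath.pi * ((k * k) % (2 * n)) / n)
             for k in range(n)]
    # Padded length: a power of two that is at least 2n - 1, so the
    # circular convolution does not wrap into the first n outputs.
    m = 1
    while m < 2 * n - 1:
        m *= 2
    a = [complex(values[k]) * chirp[k] for k in range(n)] + [0j] * (m - n)
    b = [0j] * m
    b[0] = chirp[0].conjugate()
    for k in range(1, n):
        # The conjugated chirp is symmetric in k, stored circularly.
        b[k] = b[m - k] = chirp[k].conjugate()
    product = [x * y for x, y in zip(fft_pow2(a), fft_pow2(b))]
    convolution = ifft_pow2(product)
    return [convolution[k] * chirp[k] for k in range(n)]
```

For input lengths such as 5 or 6 the result should agree with a direct evaluation of the DFT sum up to floating-point rounding.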

- Lots of changes in the API
- Added elementwise, reduction and transposition computations
- Extended API reference and added topical guides

- Created basic core for computations and transformations
- Added matrix multiplication computation
- Created basic documentation