CUDA Fast Math
As noted in Fastmath, for certain classes of applications that use floating point arithmetic, strict IEEE-754 conformance is not required, and relaxing it can yield performance speedups.
The CUDA target implements fastmath behavior with two differences from the behavior described there.
First, the fastmath argument to the @jit decorator is limited to the values True and False. When True, the following optimizations are enabled:

- Flushing of denormals to zero.
- Use of a fast approximation to the square root function.
- Use of a fast approximation to the division operation.
- Contraction of multiply and add operations into single fused multiply-add operations.
See the documentation for nvvmCompileProgram for more details of these optimizations.
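For illustration, a minimal sketch of a kernel compiled with fastmath=True; the kernel body, array names, and launch configuration are illustrative, not part of the documented API:

    # A minimal sketch of enabling the optimizations above, assuming a
    # CUDA-capable GPU is available.
    import math

    import numpy as np
    from numba import cuda

    @cuda.jit(fastmath=True)
    def rsqrt(x, out):
        # With fastmath=True, the square root and the division below
        # compile to fast approximate operations, denormals are flushed
        # to zero, and multiply-add sequences may be fused.
        i = cuda.grid(1)
        if i < x.size:
            out[i] = 1.0 / math.sqrt(x[i])

    x = np.linspace(1.0, 100.0, 1024, dtype=np.float32)
    out = np.zeros_like(x)
    rsqrt[4, 256](x, out)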
Secondly, calls to a subset of math module functions on float32 operands are implemented using fast approximate implementations from the libdevice library:

- math.cos(): implemented using __nv_fast_cosf.
- math.sin(): implemented using __nv_fast_sinf.
- math.tan(): implemented using __nv_fast_tanf.
- math.exp(): implemented using __nv_fast_expf.
- math.log2(): implemented using __nv_fast_log2f.
- math.log10(): implemented using __nv_fast_log10f.
- math.log(): implemented using __nv_fast_logf.
- math.pow(): implemented using __nv_fast_powf.
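As a hedged example of the substitution above, the following sketch calls math.sin() on float32 operands inside a fastmath=True kernel, so the call lowers to the fast approximate __nv_fast_sinf; array contents and the launch configuration are illustrative:

    # A minimal sketch, assuming a CUDA-capable GPU.
    import math

    import numpy as np
    from numba import cuda

    @cuda.jit(fastmath=True)
    def sin_kernel(x, out):
        i = cuda.grid(1)
        if i < x.size:
            # x[i] is a float32 operand, so the fast libdevice
            # implementation (__nv_fast_sinf) is used here.
            out[i] = math.sin(x[i])

    x = np.linspace(0, 2 * np.pi, 256, dtype=np.float32)
    out = np.zeros_like(x)
    sin_kernel[1, 256](x, out)
    # Compare out against np.sin(x) to observe the (typically small)
    # approximation error on float32 inputs.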