CUDA Fast Math
As noted in Fastmath, for certain classes of applications that use floating point arithmetic, strict IEEE-754 conformance is not required, and relaxing it can yield performance speedups.
The CUDA target implements fastmath behavior with two differences from the behavior described there.
First, the fastmath argument to the @jit decorator is limited to the values True and False. When True, the following optimizations are enabled:

- Flushing of denormals to zero.
- Use of a fast approximation to the square root function.
- Use of a fast approximation to the division operation.
- Contraction of multiply and add operations into single fused multiply-add operations.
See the documentation for nvvmCompileProgram for more details of these optimizations.
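For illustration, a minimal sketch of a kernel compiled with fastmath=True; the kernel body, array names, and launch configuration are illustrative, not part of the documented API:

    # A minimal sketch of enabling the optimizations above, assuming a
    # CUDA-capable GPU is available.
    import math

    import numpy as np
    from numba import cuda

    @cuda.jit(fastmath=True)
    def rsqrt(x, out):
        # With fastmath=True, the square root and the division below
        # compile to fast approximate operations, denormals are flushed
        # to zero, and multiply-add sequences may be fused.
        i = cuda.grid(1)
        if i < x.size:
            out[i] = 1.0 / math.sqrt(x[i])

    x = np.linspace(1.0, 100.0, 1024, dtype=np.float32)
    out = np.zeros_like(x)
    rsqrt[4, 256](x, out)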
Secondly, calls to a subset of math module functions on float32 operands are implemented using fast approximate implementations from the libdevice library:

- math.cos(): implemented using __nv_fast_cosf.
- math.sin(): implemented using __nv_fast_sinf.
- math.tan(): implemented using __nv_fast_tanf.
- math.exp(): implemented using __nv_fast_expf.
- math.log2(): implemented using __nv_fast_log2f.
- math.log10(): implemented using __nv_fast_log10f.
- math.log(): implemented using __nv_fast_logf.
- math.pow(): implemented using __nv_fast_powf.
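As a hedged example of the substitution above, the following sketch calls math.sin() on float32 operands inside a fastmath=True kernel, so the call lowers to the fast approximate __nv_fast_sinf; array contents and the launch configuration are illustrative:

    # A minimal sketch, assuming a CUDA-capable GPU.
    import math

    import numpy as np
    from numba import cuda

    @cuda.jit(fastmath=True)
    def sin_kernel(x, out):
        i = cuda.grid(1)
        if i < x.size:
            # x[i] is a float32 operand, so the fast libdevice
            # implementation (__nv_fast_sinf) is used here.
            out[i] = math.sin(x[i])

    x = np.linspace(0, 2 * np.pi, 256, dtype=np.float32)
    out = np.zeros_like(x)
    sin_kernel[1, 256](x, out)
    # Compare out against np.sin(x) to observe the (typically small)
    # approximation error on float32 inputs.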