nvrtc¶

Error Handling¶

NVRTC defines the following enumeration type and function for API call error handling.

class cuda.bindings.nvrtc.nvrtcResult(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)¶

The enumerated type nvrtcResult defines API call result codes. NVRTC API functions return nvrtcResult to indicate the call result.

NVRTC_SUCCESS = 0¶

NVRTC_ERROR_OUT_OF_MEMORY = 1¶

NVRTC_ERROR_PROGRAM_CREATION_FAILURE = 2¶

NVRTC_ERROR_INVALID_INPUT = 3¶

NVRTC_ERROR_INVALID_PROGRAM = 4¶

NVRTC_ERROR_INVALID_OPTION = 5¶

NVRTC_ERROR_COMPILATION = 6¶

NVRTC_ERROR_BUILTIN_OPERATION_FAILURE = 7¶

NVRTC_ERROR_NO_NAME_EXPRESSIONS_AFTER_COMPILATION = 8¶

NVRTC_ERROR_NO_LOWERED_NAMES_BEFORE_COMPILATION = 9¶

NVRTC_ERROR_NAME_EXPRESSION_NOT_VALID = 10¶

NVRTC_ERROR_INTERNAL_ERROR = 11¶

NVRTC_ERROR_TIME_FILE_WRITE_FAILED = 12¶

cuda.bindings.nvrtc.nvrtcGetErrorString(result: nvrtcResult)¶

nvrtcGetErrorString is a helper function that returns a string describing the given nvrtcResult code, e.g., NVRTC_SUCCESS to “NVRTC_SUCCESS”. For unrecognized enumeration values, it returns “NVRTC_ERROR unknown”.

Parameters:

result (nvrtcResult) – CUDA Runtime Compilation API result code.

Returns:

nvrtcResult.NVRTC_SUCCESS – nvrtcResult.NVRTC_SUCCESS
bytes – Message string for the given nvrtcResult code.
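
The Returns entries in this reference list an nvrtcResult first, followed by any output values, so a small helper can centralize the check. The following is a minimal sketch, not part of the library: `check_nvrtc` is a hypothetical name, and its `nvrtc` argument is assumed to be the module obtained with `from cuda.bindings import nvrtc`.

```python
def check_nvrtc(nvrtc, call_result):
    """Raise on a non-success NVRTC result; otherwise return the output values.

    `nvrtc` is assumed to be the `cuda.bindings.nvrtc` module, and
    `call_result` the tuple returned by one of its functions.
    """
    err, *values = call_result
    if err != nvrtc.nvrtcResult.NVRTC_SUCCESS:
        # nvrtcGetErrorString itself returns (nvrtcResult, bytes).
        _, message = nvrtc.nvrtcGetErrorString(err)
        raise RuntimeError(f"NVRTC call failed: {message.decode()}")
    if not values:
        return None
    return values[0] if len(values) == 1 else tuple(values)
```

With the real module, a call such as `major, minor = check_nvrtc(nvrtc, nvrtc.nvrtcVersion())` would either unpack the version or raise with the message string.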

General Information Query¶

NVRTC defines the following function for general information query.

cuda.bindings.nvrtc.nvrtcVersion()¶

nvrtcVersion sets the output parameters major and minor with the CUDA Runtime Compilation version number.

Returns:

nvrtcResult –
- NVRTC_SUCCESS
- NVRTC_ERROR_INVALID_INPUT
major (int) – CUDA Runtime Compilation major version number.
minor (int) – CUDA Runtime Compilation minor version number.

cuda.bindings.nvrtc.nvrtcGetNumSupportedArchs()¶

nvrtcGetNumSupportedArchs sets the output parameter numArchs with the number of architectures supported by NVRTC. This can then be used to pass an array to nvrtcGetSupportedArchs to get the supported architectures.

see nvrtcGetSupportedArchs

Returns:

nvrtcResult –
- NVRTC_SUCCESS
- NVRTC_ERROR_INVALID_INPUT
numArchs (int) – number of supported architectures.

cuda.bindings.nvrtc.nvrtcGetSupportedArchs()¶

nvrtcGetSupportedArchs populates the array passed via the output parameter supportedArchs with the architectures supported by NVRTC. The array is sorted in ascending order. The size of the array to be passed can be determined using nvrtcGetNumSupportedArchs.

see nvrtcGetNumSupportedArchs

Returns:

nvrtcResult –
- NVRTC_SUCCESS
- NVRTC_ERROR_INVALID_INPUT
supportedArchs (List[int]) – sorted array of supported architectures.
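
The descriptions above speak of output parameters, but per the signatures and Returns entries the Python functions take no arguments and return those values after the result code. A minimal sketch of querying both the version and the architecture list follows; `describe_nvrtc` is a hypothetical helper, and `nvrtc` is assumed to be the module imported with `from cuda.bindings import nvrtc`.

```python
def describe_nvrtc(nvrtc):
    """Return the NVRTC version and supported architectures as a dict.

    `nvrtc` is assumed to be the `cuda.bindings.nvrtc` module.
    """
    ok = nvrtc.nvrtcResult.NVRTC_SUCCESS

    # Output parameters of the C API come back as extra tuple elements.
    err, major, minor = nvrtc.nvrtcVersion()
    if err != ok:
        raise RuntimeError(f"nvrtcVersion failed: {err!r}")

    # No array is passed in; the binding returns the sorted list directly.
    err, archs = nvrtc.nvrtcGetSupportedArchs()
    if err != ok:
        raise RuntimeError(f"nvrtcGetSupportedArchs failed: {err!r}")

    return {"version": (major, minor), "archs": list(archs)}
```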

Compilation¶

NVRTC defines the following type and functions for actual compilation.

class cuda.bindings.nvrtc.nvrtcProgram(*args, **kwargs)¶

nvrtcProgram is the unit of compilation, and an opaque handle for a program.

To compile a CUDA program string, an instance of nvrtcProgram must be created first with nvrtcCreateProgram, then compiled with nvrtcCompileProgram.

getPtr()¶: Get memory address of class instance

cuda.bindings.nvrtc.nvrtcCreateProgram(char *src, char *name, int numHeaders, headers: Optional[Tuple[bytes] | List[bytes]], includeNames: Optional[Tuple[bytes] | List[bytes]])¶

nvrtcCreateProgram creates an instance of nvrtcProgram with the given input parameters, and sets the output parameter prog with it.

Parameters:

src (bytes) – CUDA program source.
name (bytes) – CUDA program name. name can be NULL; “default_program” is used when name is NULL or “”.
numHeaders (int) – Number of headers used. numHeaders must be greater than or equal to 0.
headers (List[bytes]) – Sources of the headers. headers can be NULL when numHeaders is 0.
includeNames (List[bytes]) – Name of each header by which they can be included in the CUDA program source. includeNames can be NULL when numHeaders is 0. These headers must be included with the exact names specified here.

Returns:

nvrtcResult –
prog (nvrtcProgram) – CUDA Runtime Compilation program.

See also

nvrtcDestroyProgram

cuda.bindings.nvrtc.nvrtcDestroyProgram(prog)¶

nvrtcDestroyProgram destroys the given program.

Parameters:

prog (nvrtcProgram) – CUDA Runtime Compilation program.

Returns:

NVRTC_SUCCESS
NVRTC_ERROR_INVALID_PROGRAM

Return type:

nvrtcResult

See also

nvrtcCreateProgram

cuda.bindings.nvrtc.nvrtcCompileProgram(prog, int numOptions, options: Optional[Tuple[bytes] | List[bytes]])¶

nvrtcCompileProgram compiles the given program.

It supports compile options listed in Supported Compile Options.

Parameters:

prog (nvrtcProgram) – CUDA Runtime Compilation program.
numOptions (int) – Number of compiler options passed.
options (List[bytes]) – Compiler options in the form of C string array. options can be NULL when numOptions is 0.

Returns:

NVRTC_SUCCESS
NVRTC_ERROR_OUT_OF_MEMORY
NVRTC_ERROR_INVALID_INPUT
NVRTC_ERROR_INVALID_PROGRAM
NVRTC_ERROR_INVALID_OPTION
NVRTC_ERROR_COMPILATION
NVRTC_ERROR_BUILTIN_OPERATION_FAILURE
NVRTC_ERROR_TIME_FILE_WRITE_FAILED

Return type:

nvrtcResult

cuda.bindings.nvrtc.nvrtcGetPTXSize(prog)¶

nvrtcGetPTXSize sets the value of ptxSizeRet with the size of the PTX generated by the previous compilation of prog (including the trailing NULL).

Parameters:

prog (nvrtcProgram) – CUDA Runtime Compilation program.

Returns:

nvrtcResult –
ptxSizeRet (int) – Size of the generated PTX (including the trailing NULL).

See also

nvrtcGetPTX

cuda.bindings.nvrtc.nvrtcGetPTX(prog, char *ptx)¶

nvrtcGetPTX stores the PTX generated by the previous compilation of prog in the memory pointed to by ptx.

Parameters:

prog (nvrtcProgram) – CUDA Runtime Compilation program.
ptx (bytes) – Compiled result.

Returns:

NVRTC_SUCCESS
NVRTC_ERROR_INVALID_INPUT
NVRTC_ERROR_INVALID_PROGRAM

Return type:

nvrtcResult

See also

nvrtcGetPTXSize
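
The create, compile, size query, fetch, and destroy steps described so far can be sketched as a single function. This is an illustration under stated assumptions rather than a definitive implementation: `compile_to_ptx` is a hypothetical name, `nvrtc` is assumed to be the module imported with `from cuda.bindings import nvrtc`, and the buffer is assumed to be filled in place by nvrtcGetPTX, so it is pre-allocated with the size reported by nvrtcGetPTXSize.

```python
def compile_to_ptx(nvrtc, source, name=b"kernel.cu", options=()):
    """Compile CUDA source (bytes) and return the PTX buffer (bytes).

    `nvrtc` is assumed to be the `cuda.bindings.nvrtc` module.
    """
    ok = nvrtc.nvrtcResult.NVRTC_SUCCESS

    # No extra headers: numHeaders is 0 and both header lists are empty.
    err, prog = nvrtc.nvrtcCreateProgram(source, name, 0, [], [])
    if err != ok:
        raise RuntimeError(f"nvrtcCreateProgram failed: {err!r}")
    try:
        opts = list(options)  # each option is a bytes string
        (err,) = nvrtc.nvrtcCompileProgram(prog, len(opts), opts)
        if err != ok:
            raise RuntimeError(f"nvrtcCompileProgram failed: {err!r}")

        # The reported size includes the trailing NULL.
        err, size = nvrtc.nvrtcGetPTXSize(prog)
        if err != ok:
            raise RuntimeError(f"nvrtcGetPTXSize failed: {err!r}")

        ptx = b" " * size  # pre-sized buffer handed to nvrtcGetPTX
        (err,) = nvrtc.nvrtcGetPTX(prog, ptx)
        if err != ok:
            raise RuntimeError(f"nvrtcGetPTX failed: {err!r}")
        return ptx
    finally:
        # Release the program whether or not compilation succeeded.
        nvrtc.nvrtcDestroyProgram(prog)
```

The `try`/`finally` keeps nvrtcDestroyProgram paired with nvrtcCreateProgram even when a later step raises.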

cuda.bindings.nvrtc.nvrtcGetCUBINSize(prog)¶

nvrtcGetCUBINSize sets the value of cubinSizeRet with the size of the cubin generated by the previous compilation of prog. The value of cubinSizeRet is set to 0 if the value specified to -arch is a virtual architecture instead of an actual architecture.

Parameters:

prog (nvrtcProgram) – CUDA Runtime Compilation program.

Returns:

nvrtcResult –
cubinSizeRet (int) – Size of the generated cubin.

See also

nvrtcGetCUBIN

cuda.bindings.nvrtc.nvrtcGetCUBIN(prog, char *cubin)¶

nvrtcGetCUBIN stores the cubin generated by the previous compilation of prog in the memory pointed to by cubin. No cubin is available if the value specified to -arch is a virtual architecture instead of an actual architecture.

Parameters:

prog (nvrtcProgram) – CUDA Runtime Compilation program.
cubin (bytes) – Compiled and assembled result.

Returns:

NVRTC_SUCCESS
NVRTC_ERROR_INVALID_INPUT
NVRTC_ERROR_INVALID_PROGRAM

Return type:

nvrtcResult

See also

nvrtcGetCUBINSize

cuda.bindings.nvrtc.nvrtcGetNVVMSize(prog)¶

DEPRECATION NOTICE: This function will be removed in a future release. Please use nvrtcGetLTOIRSize (and nvrtcGetLTOIR) instead.

Parameters:

prog (nvrtcProgram) – None

Returns:

nvrtcResult
nvvmSizeRet (int) – None

cuda.bindings.nvrtc.nvrtcGetNVVM(prog, char *nvvm)¶

DEPRECATION NOTICE: This function will be removed in a future release. Please use nvrtcGetLTOIR (and nvrtcGetLTOIRSize) instead.

Parameters:

prog (nvrtcProgram) – None
nvvm (bytes) – None

Return type:

nvrtcResult

cuda.bindings.nvrtc.nvrtcGetLTOIRSize(prog)¶

nvrtcGetLTOIRSize sets the value of LTOIRSizeRet with the size of the LTO IR generated by the previous compilation of prog. The value of LTOIRSizeRet is set to 0 if the program was not compiled with -dlto.

Parameters:

prog (nvrtcProgram) – CUDA Runtime Compilation program.

Returns:

nvrtcResult –
LTOIRSizeRet (int) – Size of the generated LTO IR.

See also

nvrtcGetLTOIR

cuda.bindings.nvrtc.nvrtcGetLTOIR(prog, char *LTOIR)¶

nvrtcGetLTOIR stores the LTO IR generated by the previous compilation of prog in the memory pointed to by LTOIR. No LTO IR is available if the program was compiled without -dlto.

Parameters:

prog (nvrtcProgram) – CUDA Runtime Compilation program.
LTOIR (bytes) – Compiled result.

Returns:

NVRTC_SUCCESS
NVRTC_ERROR_INVALID_INPUT
NVRTC_ERROR_INVALID_PROGRAM

Return type:

nvrtcResult

See also

nvrtcGetLTOIRSize

cuda.bindings.nvrtc.nvrtcGetOptiXIRSize(prog)¶

nvrtcGetOptiXIRSize sets the value of optixirSizeRet with the size of the OptiX IR generated by the previous compilation of prog. The value of optixirSizeRet is set to 0 if the program was compiled with options incompatible with OptiX IR generation.

Parameters:

prog (nvrtcProgram) – CUDA Runtime Compilation program.

Returns:

nvrtcResult –
optixirSizeRet (int) – Size of the generated OptiX IR.

See also

nvrtcGetOptiXIR

cuda.bindings.nvrtc.nvrtcGetOptiXIR(prog, char *optixir)¶

nvrtcGetOptiXIR stores the OptiX IR generated by the previous compilation of prog in the memory pointed to by optixir. No OptiX IR is available if the program was compiled with options incompatible with OptiX IR generation.

Parameters:

prog (nvrtcProgram) – CUDA Runtime Compilation program.
optixir (bytes) – OptiX IR compiled result.

Returns:

NVRTC_SUCCESS
NVRTC_ERROR_INVALID_INPUT
NVRTC_ERROR_INVALID_PROGRAM

Return type:

nvrtcResult

See also

nvrtcGetOptiXIRSize

cuda.bindings.nvrtc.nvrtcGetProgramLogSize(prog)¶

nvrtcGetProgramLogSize sets logSizeRet with the size of the log generated by the previous compilation of prog (including the trailing NULL).

Note that a compilation log may be generated with warnings and informative messages, even when the compilation of prog succeeds.

Parameters:

prog (nvrtcProgram) – CUDA Runtime Compilation program.

Returns:

nvrtcResult –
logSizeRet (int) – Size of the compilation log (including the trailing NULL).

See also

nvrtcGetProgramLog

cuda.bindings.nvrtc.nvrtcGetProgramLog(prog, char *log)¶

nvrtcGetProgramLog stores the log generated by the previous compilation of prog in the memory pointed to by log.

Parameters:

prog (nvrtcProgram) – CUDA Runtime Compilation program.
log (bytes) – Compilation log.

Returns:

NVRTC_SUCCESS
NVRTC_ERROR_INVALID_INPUT
NVRTC_ERROR_INVALID_PROGRAM

Return type:

nvrtcResult

See also

nvrtcGetProgramLogSize
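
Since a log may exist even after a successful compilation, it is often useful to read it unconditionally. The sketch below follows the size-then-fetch pattern described above; `read_program_log` is a hypothetical helper, `nvrtc` is assumed to be the module imported with `from cuda.bindings import nvrtc`, and the pre-sized buffer is assumed to be filled in place by nvrtcGetProgramLog.

```python
def read_program_log(nvrtc, prog):
    """Return the compilation log of `prog` as text.

    `nvrtc` is assumed to be the `cuda.bindings.nvrtc` module and `prog`
    a program that has already been passed to nvrtcCompileProgram.
    """
    ok = nvrtc.nvrtcResult.NVRTC_SUCCESS

    # The reported size includes the trailing NULL.
    err, size = nvrtc.nvrtcGetProgramLogSize(prog)
    if err != ok:
        raise RuntimeError(f"nvrtcGetProgramLogSize failed: {err!r}")

    log = b" " * size  # pre-sized buffer handed to nvrtcGetProgramLog
    (err,) = nvrtc.nvrtcGetProgramLog(prog, log)
    if err != ok:
        raise RuntimeError(f"nvrtcGetProgramLog failed: {err!r}")

    # Drop the trailing NULL and tolerate non-UTF-8 bytes in diagnostics.
    return log.rstrip(b"\x00").decode(errors="replace")
```

A caller would typically invoke this when nvrtcCompileProgram returns NVRTC_ERROR_COMPILATION, before destroying the program.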

cuda.bindings.nvrtc.nvrtcAddNameExpression(prog, char *name_expression)¶

nvrtcAddNameExpression notes the given name expression denoting the address of a __global__ function or __device__/__constant__ variable.

The identical name expression string must be provided on a subsequent call to nvrtcGetLoweredName to extract the lowered name.

Parameters:

prog (nvrtcProgram) – CUDA Runtime Compilation program.
name_expression (bytes) – constant expression denoting the address of a __global__ function or __device__/__constant__ variable.

Returns:

NVRTC_SUCCESS
NVRTC_ERROR_INVALID_PROGRAM
NVRTC_ERROR_INVALID_INPUT
NVRTC_ERROR_NO_NAME_EXPRESSIONS_AFTER_COMPILATION

Return type:

nvrtcResult

See also

nvrtcGetLoweredName

cuda.bindings.nvrtc.nvrtcGetLoweredName(prog, char *name_expression)¶

nvrtcGetLoweredName extracts the lowered (mangled) name for a __global__ function or __device__/__constant__ variable, and updates lowered_name to point to it. The memory containing the name is released when the NVRTC program is destroyed by nvrtcDestroyProgram. The identical name expression must have been previously provided to nvrtcAddNameExpression.

Parameters:

prog (nvrtcProgram) – CUDA Runtime Compilation program.
name_expression (bytes) – constant expression denoting the address of a __global__ function or __device__/__constant__ variable.

Returns:

nvrtcResult –
- NVRTC_SUCCESS
- NVRTC_ERROR_NO_LOWERED_NAMES_BEFORE_COMPILATION
- NVRTC_ERROR_NAME_EXPRESSION_NOT_VALID
lowered_name (bytes) – initialized by the function to point to a C string containing the lowered (mangled) name corresponding to the provided name expression.

See also

nvrtcAddNameExpression
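
The error codes NVRTC_ERROR_NO_NAME_EXPRESSIONS_AFTER_COMPILATION and NVRTC_ERROR_NO_LOWERED_NAMES_BEFORE_COMPILATION imply an ordering: name expressions are added before compilation and lowered names are read after it. A minimal sketch of that ordering follows. `compile_and_lower` is a hypothetical helper, `nvrtc` is assumed to be the module imported with `from cuda.bindings import nvrtc`, and a real caller would also fetch the PTX or cubin before the program is destroyed.

```python
def compile_and_lower(nvrtc, source, name_expressions, options=()):
    """Compile `source` and map each name expression to its lowered name.

    `nvrtc` is assumed to be the `cuda.bindings.nvrtc` module; `source`
    and every name expression are bytes.
    """
    ok = nvrtc.nvrtcResult.NVRTC_SUCCESS
    exprs = list(name_expressions)  # iterated twice below

    def require(err, what):
        if err != ok:
            raise RuntimeError(f"{what} failed: {err!r}")

    err, prog = nvrtc.nvrtcCreateProgram(source, b"kernel.cu", 0, [], [])
    require(err, "nvrtcCreateProgram")
    try:
        # Name expressions must be registered before compilation.
        for expr in exprs:
            (err,) = nvrtc.nvrtcAddNameExpression(prog, expr)
            require(err, "nvrtcAddNameExpression")

        opts = list(options)
        (err,) = nvrtc.nvrtcCompileProgram(prog, len(opts), opts)
        require(err, "nvrtcCompileProgram")

        # Look up each lowered name with the identical expression string.
        lowered = {}
        for expr in exprs:
            err, name = nvrtc.nvrtcGetLoweredName(prog, expr)
            require(err, "nvrtcGetLoweredName")
            lowered[expr] = name
        return lowered
    finally:
        # Names are collected before the program is destroyed.
        nvrtc.nvrtcDestroyProgram(prog)
```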

Supported Compile Options¶

NVRTC supports the compile options below. Option names with two preceding dashes (--) are long option names and option names with one preceding dash (-) are short option names. Short option names can be used instead of long option names. When a compile option takes an argument, an assignment operator (=) is used to separate the compile option argument from the compile option name, e.g., "--gpu-architecture=compute_60". Alternatively, the compile option name and the argument can be specified in separate strings without an assignment operator, e.g., "--gpu-architecture" "compute_60". Single-character short option names, such as -D, -U, and -I, do not require an assignment operator, and the compile option name and the argument can be present in the same string with or without spaces between them. For instance, "-D=<def>", "-D<def>", and "-D <def>" are all supported.
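
Because nvrtcCompileProgram takes its options as a list of bytes, it can help to assemble them in one place. The helper below is a hypothetical sketch (`build_options` and its parameters are not part of the bindings). It uses the "--name=value" form for long options and the no-separator form for the single-character -D and -I options, both of which the paragraph above describes as supported.

```python
def build_options(arch=None, defines=None, include_dirs=(), extra=()):
    """Build an NVRTC option list as bytes strings.

    arch:         e.g. "compute_60" or "sm_80"
    defines:      mapping of macro name -> value, or None for a bare -D<name>
    include_dirs: directories passed with -I
    extra:        any further option strings, passed through unchanged
    """
    opts = []
    if arch:
        opts.append(f"--gpu-architecture={arch}")
    for macro, value in (defines or {}).items():
        # "-D<name>" predefines the macro as 1; "-D<name>=<definition>" sets it.
        opts.append(f"-D{macro}" if value is None else f"-D{macro}={value}")
    for directory in include_dirs:
        opts.append(f"-I{directory}")
    opts.extend(extra)
    return [opt.encode() for opt in opts]


options = build_options(
    arch="compute_60",
    defines={"N": 4, "DEBUG": None},
    include_dirs=["/opt/inc"],
    extra=["--fmad=false"],
)
```

The resulting list and its length would then be passed as the options and numOptions arguments of nvrtcCompileProgram.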

The valid compiler options are:

Compilation targets
- --gpu-architecture=<arch> (-arch)
  
  Specify the name of the class of GPU architectures for which the input must be compiled.
  - Valid <arch>s:
    - compute_50
    - compute_52
    - compute_53
    - compute_60
    - compute_61
    - compute_62
    - compute_70
    - compute_72
    - compute_75
    - compute_80
    - compute_87
    - compute_89
    - compute_90
    - compute_90a
    - sm_50
    - sm_52
    - sm_53
    - sm_60
    - sm_61
    - sm_62
    - sm_70
    - sm_72
    - sm_75
    - sm_80
    - sm_87
    - sm_89
    - sm_90
    - sm_90a
  - Default: compute_52
Separate compilation / whole-program compilation
- --device-c (-dc)
  
  Generate relocatable code that can be linked with other relocatable device code. It is equivalent to --relocatable-device-code=true.
- --device-w (-dw)
  
  Generate non-relocatable code. It is equivalent to --relocatable-device-code=false.
- --relocatable-device-code={true|false} (-rdc)
  
  Enable (disable) the generation of relocatable device code.
  - Default: false
- --extensible-whole-program (-ewp)
  
  Do extensible whole program compilation of device code.
  - Default: false
Debugging support
- --device-debug (-G)
  
  Generate debug information. If --dopt is not specified, all optimizations are turned off.
- --generate-line-info (-lineinfo)
  
  Generate line-number information.
Code generation
- --dopt on (-dopt)
- --dopt=on
  
  Enable device code optimization. When specified along with ‘-G’, enables limited debug information generation for optimized device code (currently, only line number information). When ‘-G’ is not specified, ‘-dopt=on’ is implicit.
- --ptxas-options <options> (-Xptxas)
- --ptxas-options=<options>
  
  Specify options directly to ptxas, the PTX optimizing assembler.
- --maxrregcount=<N> (-maxrregcount)
  
  Specify the maximum number of registers that GPU functions can use. Up to a function-specific limit, a higher value will generally increase the performance of individual GPU threads that execute this function. However, because thread registers are allocated from a global register pool on each GPU, a higher value of this option will also reduce the maximum thread block size, thereby reducing the amount of thread parallelism. Hence, a good maxrregcount value is the result of a trade-off. If this option is not specified, then no maximum is assumed. A value less than the minimum number of registers required by the ABI will be raised by the compiler to the ABI minimum.
- --ftz={true|false} (-ftz)
  
  When performing single-precision floating-point operations, flush denormal values to zero or preserve denormal values. --use_fast_math implies --ftz=true.
  - Default: false
- --prec-sqrt={true|false} (-prec-sqrt)
  
  For single-precision floating-point square root, use IEEE round-to-nearest mode or use a faster approximation. --use_fast_math implies --prec-sqrt=false.
  - Default: true
- --prec-div={true|false} (-prec-div)
  
  For single-precision floating-point division and reciprocals, use IEEE round-to-nearest mode or use a faster approximation. --use_fast_math implies --prec-div=false.
  - Default: true
- --fmad={true|false} (-fmad)
  
  Enables (disables) the contraction of floating-point multiplies and adds/subtracts into floating-point multiply-add operations (FMAD, FFMA, or DFMA). --use_fast_math implies --fmad=true.
  - Default: true
- --use_fast_math (-use_fast_math)
  
  Make use of fast math operations. --use_fast_math implies --ftz=true --prec-div=false --prec-sqrt=false --fmad=true.
- --extra-device-vectorization (-extra-device-vectorization)
  
  Enables more aggressive device code vectorization in the NVVM optimizer.
- --modify-stack-limit={true|false} (-modify-stack-limit)
  
  On Linux, during compilation, use setrlimit() to increase stack size to maximum allowed. The limit is reset to the previous value at the end of compilation. Note: setrlimit() changes the value for the entire process.
  - Default: true
- --dlink-time-opt (-dlto)
  
  Generate intermediate code for later link-time optimization. It implies -rdc=true. Note: when this option is used the nvrtcGetLTOIR API should be used, as PTX or Cubin will not be generated.
- --gen-opt-lto (-gen-opt-lto)
  
  Run the optimizer passes before generating the LTO IR.
- --optix-ir (-optix-ir)
  
  Generate OptiX IR. The OptiX IR is only intended for consumption by OptiX through appropriate APIs. This feature is not supported with link-time optimization (-dlto). Note: when this option is used the nvrtcGetOptiXIR API should be used, as PTX or Cubin will not be generated.
- --jump-table-density=[0-101] (-jtd)
  
  Specify the case density percentage in switch statements, and use it as a minimal threshold to determine whether a jump table (brx.idx instruction) will be used to implement a switch statement. The percentage ranges from 0 to 101 inclusive.
  - Default: 101
- --device-stack-protector={true|false} (-device-stack-protector)
  
  Enable (disable) the generation of stack canaries in device code.
  - Default: false

Preprocessing
- --define-macro=<def> (-D)
  
  <def> can be either <name> or <name>=<definition>.
  - <name>
    
    Predefine <name> as a macro with definition 1.
  - <name>=<definition>
    
    The contents of <definition> are tokenized and preprocessed as if they appeared during translation phase three in a #define directive. In particular, the definition will be truncated by embedded new line characters.
- --undefine-macro=<def> (-U)
  
  Cancel any previous definition of <def>.
- --include-path=<dir> (-I)
  
  Add the directory <dir> to the list of directories to be searched for headers. These paths are searched after the list of headers given to nvrtcCreateProgram.
- --pre-include=<header> (-include)
  
  Preinclude <header> during preprocessing.
- --no-source-include (-no-source-include)
  
  The preprocessor by default adds the directory of each input source to the include path. This option disables this feature and only considers the paths specified explicitly.
Language Dialect
- --std={c++03|c++11|c++14|c++17|c++20} (-std={c++03|c++11|c++14|c++17|c++20})
  
  Set language dialect to C++03, C++11, C++14, C++17, or C++20.
  - Default: c++17
- --builtin-move-forward={true|false} (-builtin-move-forward)
  
  Provide builtin definitions of std::move and std::forward, when C++11 or later language dialect is selected.
  - Default: true
- --builtin-initializer-list={true|false} (-builtin-initializer-list)
  
  Provide builtin definitions of std::initializer_list class and member functions when C++11 or later language dialect is selected.
  - Default: true
Misc.
- --disable-warnings (-w)
  
  Inhibit all warning messages.
- --restrict (-restrict)
  
  Programmer assertion that all kernel pointer parameters are restrict pointers.
- --device-as-default-execution-space (-default-device)
  
  Treat entities with no execution space annotation as __device__ entities.
- --device-int128 (-device-int128)
  
  Allow the __int128 type in device code. Also causes the macro __CUDACC_RTC_INT128__ to be defined.
- --optimization-info=<kind> (-opt-info)
  
  Provide optimization reports for the specified kind of optimization. The following kind tags are supported:
  - inline : emit a remark when a function is inlined.
- --display-error-number (-err-no)
  
  Display diagnostic number for warning messages. (Default)
- --no-display-error-number (-no-err-no)
  
  Disables the display of a diagnostic number for warning messages.
- --diag-error=<error-number>,… (-diag-error)
  
  Emit error for specified diagnostic message number(s). Message numbers can be separated by comma.
- --diag-suppress=<error-number>,… (-diag-suppress)
  
  Suppress specified diagnostic message number(s). Message numbers can be separated by comma.
- --diag-warn=<error-number>,… (-diag-warn)
  
  Emit warning for specified diagnostic message number(s). Message numbers can be separated by comma.
- --brief-diagnostics={true|false} (-brief-diag)
  
  Disable or enable showing source line and column info in a diagnostic. With --brief-diagnostics=true, the source line and column info is not shown.
  - Default: false
- --time=<file-name> (-time)
  
  Generate a comma separated value table with the time taken by each compilation phase, and append it at the end of the file given as the option argument. If the file does not exist, the column headings are generated in the first row of the table. If the file name is ‘-’, the timing data is written to the compilation log.
- --split-compile=<number of threads> (-split-compile=<number of threads>)
  
  Perform compiler optimizations in parallel. Split compilation attempts to reduce compile time by enabling the compiler to run certain optimization passes concurrently. This option accepts a numerical value that specifies the maximum number of threads the compiler can use. One can also allow the compiler to use the maximum threads available on the system by setting --split-compile=0. Setting --split-compile=1 will cause this option to be ignored.
- --fdevice-syntax-only (-fdevice-syntax-only)
  
  Ends device compilation after front-end syntax checking. This option does not generate valid device code.
- --minimal (-minimal)
  
  Omit certain language features to reduce compile time for small programs. In particular, the following are omitted:
  - Texture and surface functions and associated types, e.g., cudaTextureObject_t.
  - CUDA Runtime Functions that are provided by the cudadevrt device code library, typically named with prefix “cuda”, e.g., cudaMalloc.
  - Kernel launch from device code.
  - Types and macros associated with CUDA Runtime and Driver APIs, provided by cuda/tools/cudart/driver_types.h, typically named with prefix “cuda”, e.g., cudaError_t.
- --device-stack-protector (-device-stack-protector)
  
  Enable stack canaries in device code. Stack canaries make it more difficult to exploit certain types of memory safety bugs involving stack-local variables. The compiler uses heuristics to assess the risk of such a bug in each function. Only those functions which are deemed high-risk make use of a stack canary.