PTX

The cuda::ptx namespace contains functions that map one-to-one to PTX instructions. These can be used for maximal control of the generated code, or to experiment with new hardware features before a high-level C++ API is available.
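For example (a minimal sketch, assuming CCCL 2.4.0 or newer for the get_sreg_* bindings), reading the %laneid special register is a single function call:

#include <cuda/ptx>

__global__ void lane_kernel(unsigned int* out)
{
  // Maps one-to-one to: mov.u32 %r, %laneid;
  out[threadIdx.x] = cuda::ptx::get_sreg_laneid();
}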

Versions and compatibility

The cuda/ptx header is intended to present a stable API (not ABI) within one major version of the CUDA Toolkit (CTK), on a best-effort basis. This means that:

  • All functions are marked static inline.

  • The type of a function parameter may be changed to a more generic type, as long as code written against the original signature still compiles (see the sketch after this list).

  • Good exposure of the PTX ISA is a high priority. If, at a new major version, we face a difficult choice between preserving backward compatibility and improving the PTX exposure, we will lean toward the latter more readily than in other parts of libcu++.
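A hypothetical sketch of the second bullet (the names foo and k are made up for illustration; these are not libcu++ signatures): a parameter is generalized from a concrete pointer type to a template parameter, and existing call sites still compile.

#include <cuda/std/cstdint>

// Version N of a hypothetical binding:
//   __device__ static inline void foo(cuda::std::uint64_t* addr);

// Version N+1 generalizes the parameter type; calls that passed
// cuda::std::uint64_t* continue to compile:
template <typename T>
__device__ static inline void foo(T* addr) {}

__global__ void k(cuda::std::uint64_t* p)
{
  foo(p); // resolves as before
}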

API stability is not taken to the extreme. To ensure forward compatibility, call functions as shown below, letting the arguments drive overload resolution:

// Use arguments to drive overload resolution:
cuda::ptx::mbarrier_arrive_expect_tx(cuda::ptx::sem_release, cuda::ptx::scope_cta, cuda::ptx::space_shared, &bar, 1);

// Specifying templates directly is not forward-compatible, as order and number
// of template parameters may change in a minor release:
cuda::ptx::mbarrier_arrive_expect_tx<cuda::ptx::sem_release_t>(
  cuda::ptx::sem_release, cuda::ptx::scope_cta, cuda::ptx::space_shared, &bar, 1
);

PTX ISA version and compute capability

Each binding documents the PTX ISA version and SM version under which it may be used. Example:

// mbarrier.arrive.shared::cta.b64 state, [addr]; // 1.  PTX ISA 70, SM_80
__device__ inline uint64_t mbarrier_arrive(
  cuda::ptx::sem_release_t sem,
  cuda::ptx::scope_cta_t scope,
  cuda::ptx::space_shared_t space,
  uint64_t* addr);

To check whether the current compiler exposes a recent enough PTX ISA version, use:

#if __cccl_ptx_isa >= 700
cuda::ptx::mbarrier_arrive(cuda::ptx::sem_release, cuda::ptx::scope_cta, cuda::ptx::space_shared, &bar);
#endif

Ensure that you only call the function when compiling for a recent enough compute capability (SM version), like this:

NV_IF_TARGET(NV_PROVIDES_SM_80,(
  cuda::ptx::mbarrier_arrive(cuda::ptx::sem_release, cuda::ptx::scope_cta, cuda::ptx::space_shared, &bar);
));

For more information on which compiler versions correspond to which PTX ISA versions, see the PTX ISA release notes.
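Putting the two guards together, a complete kernel might look like the following sketch (it assumes CCCL 2.5.0 or newer, since cuda::ptx::mbarrier_init is only exposed from that release; below PTX ISA 70 or SM_80 the body compiles to nothing):

#include <cuda/ptx>
#include <cuda/std/cstdint>
#include <nv/target>

__global__ void arrive_kernel()
{
  __shared__ cuda::std::uint64_t bar;
#if __cccl_ptx_isa >= 700
  NV_IF_TARGET(NV_PROVIDES_SM_80, (
    if (threadIdx.x == 0) {
      cuda::ptx::mbarrier_init(&bar, blockDim.x); // mbarrier.init.shared.b64
    }
    __syncthreads();
    cuda::ptx::mbarrier_arrive(&bar);             // mbarrier.arrive.shared.b64
  ));
#endif
}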

Instructions by section

Integer Arithmetic Instructions

| Instruction | Available in libcu++ |
|-------------|----------------------|
| sad | No |
| div | No |
| rem | No |
| abs | No |
| neg | No |
| min | No |
| max | No |
| popc | No |
| clz | No |
| bfind | No |
| fns | No |
| brev | No |
| bfe | No |
| bfi | No |
| szext | No |
| bmsk | No |
| dp4a | No |
| dp2a | No |

Extended-Precision Integer Arithmetic Instructions

| Instruction | Available in libcu++ |
|-------------|----------------------|
| add.cc | No |
| addc | No |
| sub.cc | No |
| subc | No |
| mad.cc | No |
| madc | No |

Floating-Point Instructions

| Instruction | Available in libcu++ |
|-------------|----------------------|
| testp | No |
| copysign | No |
| add | No |
| sub | No |
| mul | No |
| fma | No |
| mad | No |
| div | No |
| abs | No |
| neg | No |
| min | No |
| max | No |
| rcp | No |
| rcp.approx.ftz.f64 | No |
| sqrt | No |
| rsqrt | No |
| rsqrt.approx.ftz.f64 | No |
| sin | No |
| cos | No |
| lg2 | No |
| ex2 | No |
| tanh | No |

Half Precision Floating-Point Instructions

| Instruction | Available in libcu++ |
|-------------|----------------------|
| add | No |
| sub | No |
| mul | No |
| fma | No |
| neg | No |
| abs | No |
| min | No |
| max | No |
| tanh | No |
| ex2 | No |

Comparison and Selection Instructions

| Instruction | Available in libcu++ |
|-------------|----------------------|
| set | No |
| setp | No |
| selp | No |
| slct | No |

Half Precision Comparison Instructions

| Instruction | Available in libcu++ |
|-------------|----------------------|
| set | No |
| setp | No |

Logic and Shift Instructions

| Instruction | Available in libcu++ |
|-------------|----------------------|
| and | No |
| or | No |
| xor | No |
| not | No |
| cnot | No |
| lop3 | No |
| shf | No |
| shl | No |
| shr | No |

Data Movement and Conversion Instructions

| Instruction | Available in libcu++ |
|-------------|----------------------|
| mov | No |
| shfl | No |
| shfl.sync | No |
| prmt | No |
| ld | No |
| ld.global.nc | No |
| ldu | No |
| st | No |
| st.async | CCCL 2.3.0 / CUDA 12.4 |
| multimem.ld_reduce, multimem.st, multimem.red | No |
| prefetch, prefetchu | No |
| applypriority | No |
| discard | No |
| createpolicy | No |
| isspacep | No |
| cvta | No |
| cvt | No |
| cvt.pack | No |
| mapa | No |
| getctarank | CCCL 2.4.0 / CUDA 12.5 |

Data Movement and Conversion Instructions: Asynchronous copy

| Instruction | Available in libcu++ |
|-------------|----------------------|
| cp.async | No |
| cp.async.commit_group | No |
| cp.async.wait_group | No |
| cp.async.bulk | CCCL 2.4.0 / CUDA 12.5 |
| cp.reduce.async.bulk | CCCL 2.4.0 / CUDA 12.5 |
| cp.async.bulk.prefetch | No |
| cp.async.bulk.tensor | CCCL 2.4.0 / CUDA 12.5 |
| cp.reduce.async.bulk.tensor | CCCL 2.4.0 / CUDA 12.5 |
| cp.async.bulk.prefetch.tensor | No |
| cp.async.bulk.commit_group | CCCL 2.4.0 / CUDA 12.5 |
| cp.async.bulk.wait_group | CCCL 2.4.0 / CUDA 12.5 |
| tensormap.replace | CCCL 2.4.0 / CUDA 12.5 |

Texture Instructions

| Instruction | Available in libcu++ |
|-------------|----------------------|
| tex | No |
| tld4 | No |
| txq | No |
| istypep | No |

Surface Instructions

| Instruction | Available in libcu++ |
|-------------|----------------------|
| suld | No |
| sust | No |
| sured | No |
| suq | No |

Control Flow Instructions

| Instruction | Available in libcu++ |
|-------------|----------------------|
| {} | No |
| @ | No |
| bra | No |
| brx.idx | No |
| call | No |
| ret | No |
| exit | No |

Parallel Synchronization and Communication Instructions

| Instruction | Available in libcu++ |
|-------------|----------------------|
| bar, barrier | No |
| bar.warp.sync | No |
| barrier.cluster | CCCL 2.4.0 / CUDA 12.5 |
| membar | No |
| fence | CCCL 2.4.0 / CUDA 12.5 |
| atom | No |
| red | No |
| red.async | CCCL 2.3.0 / CUDA 12.4 |
| vote | No |
| vote.sync | No |
| match.sync | No |
| activemask | No |
| redux.sync | No |
| griddepcontrol | No |
| elect.sync | No |
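As an example of one of the available bindings above, the sketch below issues a GPU-scope fence (assuming the cuda::ptx::fence signature from CCCL 2.4.0, which takes semantics and scope tag arguments like the other bindings):

#include <cuda/ptx>
#include <nv/target>

__device__ void gpu_fence()
{
#if __cccl_ptx_isa >= 600
  NV_IF_TARGET(NV_PROVIDES_SM_70, (
    cuda::ptx::fence(cuda::ptx::sem_acq_rel, cuda::ptx::scope_gpu); // fence.acq_rel.gpu;
  ));
#endif
}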

Parallel Synchronization and Communication Instructions: mbarrier

| Instruction | Available in libcu++ |
|-------------|----------------------|
| mbarrier.init | CCCL 2.5.0 / CUDA Future |
| mbarrier.inval | No |
| mbarrier.expect_tx | No |
| mbarrier.complete_tx | No |
| mbarrier.arrive | CCCL 2.3.0 / CUDA 12.4 |
| mbarrier.arrive_drop | No |
| cp.async.mbarrier.arrive | No |
| mbarrier.test_wait | CCCL 2.3.0 / CUDA 12.4 |
| mbarrier.try_wait | CCCL 2.3.0 / CUDA 12.4 |
| mbarrier.pending_count | No |
| tensormap.cp_fenceproxy | CCCL 2.4.0 / CUDA 12.5 |
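The mbarrier.arrive and mbarrier.test_wait bindings (available since CCCL 2.3.0) combine into a simple phase wait; a minimal sketch, guarded for PTX ISA 70 / SM_80:

#include <cuda/ptx>
#include <cuda/std/cstdint>
#include <nv/target>

__device__ void arrive_and_wait(cuda::std::uint64_t* bar)
{
#if __cccl_ptx_isa >= 700
  NV_IF_TARGET(NV_PROVIDES_SM_80, (
    // Arrive and obtain the phase token:
    cuda::std::uint64_t token = cuda::ptx::mbarrier_arrive(bar); // mbarrier.arrive.shared.b64
    // Spin until the current phase completes:
    while (!cuda::ptx::mbarrier_test_wait(bar, token)) {}        // mbarrier.test_wait.shared.b64
  ));
#endif
}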

Warp Level Matrix Multiply-Accumulate Instructions

Instruction

Available in libcu++

wmma.load

No

wmma.store

No

wmma.mma

No

mma

No

ldmatrix

No

stmatrix

No

movmatrix

No

mma.sp

No

Asynchronous Warpgroup Level Matrix Multiply-Accumulate Instructions

Instruction

Available in libcu++

wgmma.mma_async

No

wgmma.mma_async.sp

No

wgmma.fence

No

wgmma.commit_group

No

wgmma.wait_group

No

Stack Manipulation Instructions

Instruction

Available in libcu++

stacksave

No

stackrestore

No

alloca

No

Video Instructions

Instruction

Available in libcu++

vadd, vsub, vabsdiff, vmin, vmax

No

vshl, vshr

No

vmad

No

vset

No

SIMD Video Instructions

Instruction

Available in libcu++

vadd2, vsub2, vavrg2, vabsdiff2, vmin2, vmax2

No

vset2

No

vadd4, vsub4, vavrg4, vabsdiff4, vmin4, vmax4

No

vset4

No

Miscellaneous Instructions

Instruction

Available in libcu++

brkpt

No

nanosleep

No

pmevent

No

trap

No

setmaxnreg

No

Special Registers

| Instruction | PTX ISA | SM Version | Available in libcu++ |
|-------------|---------|------------|----------------------|
| tid | 20 | All | CCCL 2.4.0 / CUDA 12.5 |
| ntid | 20 | All | CCCL 2.4.0 / CUDA 12.5 |
| laneid | 13 | All | CCCL 2.4.0 / CUDA 12.5 |
| warpid | 13 | All | CCCL 2.4.0 / CUDA 12.5 |
| nwarpid | 20 | 20 | CCCL 2.4.0 / CUDA 12.5 |
| ctaid | 20 | All | CCCL 2.4.0 / CUDA 12.5 |
| nctaid | 20 | All | CCCL 2.4.0 / CUDA 12.5 |
| smid | 13 | All | CCCL 2.4.0 / CUDA 12.5 |
| nsmid | 20 | 20 | CCCL 2.4.0 / CUDA 12.5 |
| gridid | 30 | 30 | CCCL 2.4.0 / CUDA 12.5 |
| is_explicit_cluster | 78 | 90 | CCCL 2.4.0 / CUDA 12.5 |
| clusterid | 78 | 90 | CCCL 2.4.0 / CUDA 12.5 |
| nclusterid | 78 | 90 | CCCL 2.4.0 / CUDA 12.5 |
| cluster_ctaid | 78 | 90 | CCCL 2.4.0 / CUDA 12.5 |
| cluster_nctaid | 78 | 90 | CCCL 2.4.0 / CUDA 12.5 |
| cluster_ctarank | 78 | 90 | CCCL 2.4.0 / CUDA 12.5 |
| cluster_nctarank | 78 | 90 | CCCL 2.4.0 / CUDA 12.5 |
| lanemask_eq | 20 | 20 | CCCL 2.4.0 / CUDA 12.5 |
| lanemask_le | 20 | 20 | CCCL 2.4.0 / CUDA 12.5 |
| lanemask_lt | 20 | 20 | CCCL 2.4.0 / CUDA 12.5 |
| lanemask_ge | 20 | 20 | CCCL 2.4.0 / CUDA 12.5 |
| lanemask_gt | 20 | 20 | CCCL 2.4.0 / CUDA 12.5 |
| clock, clock_hi | 10 | All | CCCL 2.4.0 / CUDA 12.5 |
| clock64 | 20 | 20 | CCCL 2.4.0 / CUDA 12.5 |
| pm0 | | | No |
| pm0_64 | | | No |
| envreg | | | No |
| globaltimer, globaltimer_lo, globaltimer_hi | 31 | 31 | CCCL 2.4.0 / CUDA 12.5 |
| reserved_smem_offset_begin, reserved_smem_offset_end, reserved_smem_offset_cap, reserved_smem_offset_2 | | | No |
| total_smem_size | 41 | 20 | CCCL 2.4.0 / CUDA 12.5 |
| aggr_smem_size | 81 | 90 | CCCL 2.4.0 / CUDA 12.5 |
| dynamic_smem_size | 41 | 20 | CCCL 2.4.0 / CUDA 12.5 |
| current_graph_exec | 80 | 50 | CCCL 2.4.0 / CUDA 12.5 |
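A brief sketch of the special-register bindings listed above (available since CCCL 2.4.0); each get_sreg_* call reads the corresponding % register:

#include <cuda/ptx>
#include <cuda/std/cstdint>

__global__ void sreg_kernel(cuda::std::uint32_t* out)
{
  cuda::std::uint32_t tid  = cuda::ptx::get_sreg_tid_x();  // %tid.x
  cuda::std::uint32_t ntid = cuda::ptx::get_sreg_ntid_x(); // %ntid.x
  cuda::std::uint32_t sm   = cuda::ptx::get_sreg_smid();   // %smid
  out[blockIdx.x * ntid + tid] = sm;
}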