cub::DeviceSpmv

Defined in /home/runner/work/cccl/cccl/cub/cub/device/device_spmv.cuh

struct DeviceSpmv

DeviceSpmv provides device-wide parallel operations for performing sparse-matrix * dense-vector multiplication (SpMV).

Overview

The SpMV computation performs the matrix-vector operation y = A * x + y, where:

  • A is an m * n sparse matrix whose non-zero structure is specified in compressed-storage-row (CSR) format (i.e., three arrays: values, row_offsets, and column_indices)

  • x and y are dense vectors

Usage Considerations

  • Dynamic parallelism. DeviceSpmv methods can be called within kernel code on devices in which CUDA dynamic parallelism is supported.

CSR matrix operations

template<typename ValueT>
static inline cudaError_t CsrMV(void *d_temp_storage, size_t &temp_storage_bytes, const ValueT *d_values, const int *d_row_offsets, const int *d_column_indices, const ValueT *d_vector_x, ValueT *d_vector_y, int num_rows, int num_cols, int num_nonzeros, cudaStream_t stream = 0)

This function performs the matrix-vector operation y = A*x.

Snippet

The code snippet below illustrates SpMV upon a 9x9 CSR matrix A representing a 3x3 lattice (24 non-zeros).

#include <cub/cub.cuh>   // or equivalently <cub/device/device_spmv.cuh>

// Declare, allocate, and initialize device-accessible pointers for input matrix A, input
vector x,
// and output vector y
int    num_rows = 9;
int    num_cols = 9;
int    num_nonzeros = 24;

float* d_values;  // e.g., [1, 1, 1, 1, 1, 1, 1, 1,
                  //        1, 1, 1, 1, 1, 1, 1, 1,
                  //        1, 1, 1, 1, 1, 1, 1, 1]

int*   d_column_indices; // e.g., [1, 3, 0, 2, 4, 1, 5, 0,
                         //        4, 6, 1, 3, 5, 7, 2, 4,
                         //        8, 3, 7, 4, 6, 8, 5, 7]

int*   d_row_offsets;    // e.g., [0, 2, 5, 7, 10, 14, 17, 19, 22, 24]

float* d_vector_x;       // e.g., [1, 1, 1, 1, 1, 1, 1, 1, 1]
float* d_vector_y;       // e.g., [ ,  ,  ,  ,  ,  ,  ,  ,  ]
...

// Determine temporary device storage requirements
void*    d_temp_storage = nullptr;
size_t   temp_storage_bytes = 0;
cub::DeviceSpmv::CsrMV(d_temp_storage, temp_storage_bytes, d_values,
    d_row_offsets, d_column_indices, d_vector_x, d_vector_y,
    num_rows, num_cols, num_nonzeros);

// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);

// Run SpMV
cub::DeviceSpmv::CsrMV(d_temp_storage, temp_storage_bytes, d_values,
    d_row_offsets, d_column_indices, d_vector_x, d_vector_y,
    num_rows, num_cols, num_nonzeros);

// d_vector_y <-- [2, 3, 2, 3, 4, 3, 2, 3, 2]

Template Parameters

ValueT[inferred] Matrix and vector value type (e.g., float, double, etc.)

Parameters
  • d_temp_storage[in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.

  • temp_storage_bytes[inout] Reference to size in bytes of d_temp_storage allocation

  • d_values[in] Pointer to the array of num_nonzeros values of the corresponding nonzero elements of matrix A.

  • d_row_offsets[in] Pointer to the array of m + 1 offsets demarcating the start of every row in d_column_indices and d_values (with the final entry being equal to num_nonzeros)

  • d_column_indices[in] Pointer to the array of num_nonzeros column-indices of the corresponding nonzero elements of matrix A. (Indices are zero-valued.)

  • d_vector_x[in] Pointer to the array of num_cols values corresponding to the dense input vector x

  • d_vector_y[out] Pointer to the array of num_rows values corresponding to the dense output vector y

  • num_rows[in] number of rows of matrix A.

  • num_cols[in] number of columns of matrix A.

  • num_nonzeros[in] number of nonzero elements of matrix A.

  • stream[in]

    [optional] CUDA stream to launch kernels within. Default is stream0.

template<typename ValueT>
static inline cudaError_t CsrMV(void *d_temp_storage, size_t &temp_storage_bytes, const ValueT *d_values, const int *d_row_offsets, const int *d_column_indices, const ValueT *d_vector_x, ValueT *d_vector_y, int num_rows, int num_cols, int num_nonzeros, cudaStream_t stream, bool debug_synchronous)