cub::DeviceSpmv

struct DeviceSpmv

DeviceSpmv provides device-wide parallel operations for performing sparse-matrix * dense-vector multiplication (SpMV).

Overview

The SpMV computation performs the matrix-vector operation y = A * x + y, where:

A is an m * n sparse matrix whose non-zero structure is specified in compressed-storage-row (CSR) format (i.e., three arrays: values, row_offsets, and column_indices)

x and y are dense vectors

Usage Considerations

Dynamic parallelism. DeviceSpmv methods can be called within kernel code on devices in which CUDA dynamic parallelism is supported.

CSR matrix operations

template<typename ValueT> static inline cudaError_t CsrMV(void *d_temp_storage, size_t &temp_storage_bytes, const ValueT *d_values, const int *d_row_offsets, const int *d_column_indices, const ValueT *d_vector_x, ValueT *d_vector_y, int num_rows, int num_cols, int num_nonzeros, cudaStream_t stream = 0)

This function performs the matrix-vector operation y = A*x.

Snippet

The code snippet below illustrates SpMV upon a 9x9 CSR matrix A representing a 3x3 lattice (24 non-zeros).

#include <cub/cub.cuh>   // or equivalently <cub/device/device_spmv.cuh>

// Declare, allocate, and initialize device-accessible pointers for input matrix A, input
vector x,
// and output vector y
int    num_rows = 9;
int    num_cols = 9;
int    num_nonzeros = 24;

float* d_values;  // e.g., [1, 1, 1, 1, 1, 1, 1, 1,
                  //        1, 1, 1, 1, 1, 1, 1, 1,
                  //        1, 1, 1, 1, 1, 1, 1, 1]

int*   d_column_indices; // e.g., [1, 3, 0, 2, 4, 1, 5, 0,
                         //        4, 6, 1, 3, 5, 7, 2, 4,
                         //        8, 3, 7, 4, 6, 8, 5, 7]

int*   d_row_offsets;    // e.g., [0, 2, 5, 7, 10, 14, 17, 19, 22, 24]

float* d_vector_x;       // e.g., [1, 1, 1, 1, 1, 1, 1, 1, 1]
float* d_vector_y;       // e.g., [ ,  ,  ,  ,  ,  ,  ,  ,  ]
...

// Determine temporary device storage requirements
void*    d_temp_storage = nullptr;
size_t   temp_storage_bytes = 0;
cub::DeviceSpmv::CsrMV(d_temp_storage, temp_storage_bytes, d_values,
    d_row_offsets, d_column_indices, d_vector_x, d_vector_y,
    num_rows, num_cols, num_nonzeros);

// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);

// Run SpMV
cub::DeviceSpmv::CsrMV(d_temp_storage, temp_storage_bytes, d_values,
    d_row_offsets, d_column_indices, d_vector_x, d_vector_y,
    num_rows, num_cols, num_nonzeros);

// d_vector_y <-- [2, 3, 2, 3, 4, 3, 2, 3, 2]

Template Parameters

ValueT – [inferred] Matrix and vector value type (e.g., float, double, etc.)

Parameters

d_temp_storage – [in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.
temp_storage_bytes – [inout] Reference to size in bytes of d_temp_storage allocation
d_values – [in] Pointer to the array of num_nonzeros values of the corresponding nonzero elements of matrix A.
d_row_offsets – [in] Pointer to the array of m + 1 offsets demarcating the start of every row in d_column_indices and d_values (with the final entry being equal to num_nonzeros)
d_column_indices – [in] Pointer to the array of num_nonzeros column-indices of the corresponding nonzero elements of matrix A. (Indices are zero-valued.)
d_vector_x – [in] Pointer to the array of num_cols values corresponding to the dense input vector x
d_vector_y – [out] Pointer to the array of num_rows values corresponding to the dense output vector y
num_rows – [in] number of rows of matrix A.
num_cols – [in] number of columns of matrix A.
num_nonzeros – [in] number of nonzero elements of matrix A.
stream – [in]
[optional] CUDA stream to launch kernels within. Default is stream₀.