CUDASTF offers an extensible API that allows users to implement their own data interface.

As an example, let us go through the different steps needed to implement a data interface for a very simple matrix class.

For the sake of simplicity, we only consider the CUDA stream backend here; adding support for the CUDA graph backend simply requires a few extra steps which use the CUDA graph API.

Implementation of the matrix class

We consider a very simple representation of a matrix, defined only by its dimensions m and n and by its base address, and we assume the matrix is stored contiguously.

template <typename T>
class matrix {
public:
    matrix(size_t m, size_t n, T* base) : m(m), n(n), base(base) {}
    __host__ __device__ T& operator()(size_t i, size_t j) { return base[i + j * m]; }
    __host__ __device__ const T& operator()(size_t i, size_t j) const { return base[i + j * m]; }
    size_t m, n;
    T* base;
};
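
Since elements are stored in column-major order, element (i, j) lives at offset i + j * m. As a quick, purely illustrative usage sketch (buf is a hypothetical buffer introduced here for the example):

std::vector<float> buf(3 * 4);
matrix<float> A(3, 4, buf.data());
A(1, 2) = 5.0f; // writes buf[1 + 2 * 3], i.e. buf[7]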

Defining the shape of a matrix

The first step consists in defining the shape of a matrix. The shape should be a class that holds all the parameters which are identical for every data instance, namely m and n. The base address, on the other hand, must not be part of the shape, because each data instance has its own base address.

To define what is the shape of a matrix, we need to specialize the cudastf::shape_of trait class.

template <typename T>
class cudastf::shape_of<matrix<T>> {
public:
    /**
     * @brief The default constructor.
     *
     * All `shape_of` specializations must define this constructor.
     */
    shape_of() = default;

    explicit shape_of(size_t m, size_t n) : m(m), n(n) {}

    /**
     * @brief Copies a shape.
     *
     * All `shape_of` specializations must define this constructor.
     */
    shape_of(const shape_of&) = default;

    /**
     * @brief Extracts the shape from a matrix
     *
     * @param M matrix to get the shape from
     *
     * All `shape_of` specializations must define this constructor.
     */
    shape_of(const matrix<T>& M) : shape_of<matrix<T>>(M.m, M.n) {}

    /// Mandatory method: defines the total number of elements in the shape
    size_t size() const { return m * n; }

    size_t m;
    size_t n;
};

Here we see that shape_of<matrix<T>> contains two size_t fields, m and n.

In addition, we need to define a default constructor and a copy constructor.

To implement the .shape() member of the logical_data class, we need to define a constructor which takes a const reference to a matrix.

Finally, if the ctx.parallel_for construct is needed, we must define a size_t size() const method which computes the total number of elements in a shape.
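
For instance, a shape extracted from a matrix only retains the dimensions (a small illustrative sketch, not part of the interface itself):

std::vector<double> data(4 * 5);
matrix<double> A(4, 5, data.data());
shape_of<matrix<double>> s(A); // only m and n are captured, not the base address
assert(s.size() == 20);        // 4 * 5 elements, as used by ctx.parallel_for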

Hash of a matrix

For internal needs, such as using (unordered) maps of data instances, CUDASTF needs specialized forms of the std::hash trait class.

The operator() of this class should compute a unique identifier associated with the description of the data instance. This typically means computing a hash of the matrix sizes and of the base address. Note that this hash does not depend on the actual content of the matrix.

In the code snippet below, we use the cudastf::hash_all helper, which combines the hash values of all its arguments into a single hash (a generalization of cudastf::hash_combine, which updates a hash value with another value). These helpers are available from the cudastf/utility/hash.h header.

template <typename T>
struct std::hash<matrix<T>> {
    std::size_t operator()(matrix<T> const& m) const noexcept {
        // Combine hashes from the base address and sizes
        return cudastf::hash_all(m.m, m.n, m.base);
    }
};
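
As a sanity check, two matrix objects describing the same buffer with the same dimensions hash to the same value (an illustrative sketch):

std::vector<int> buf(8 * 10);
matrix<int> A(8, 10, buf.data());
matrix<int> B(8, 10, buf.data());
assert(std::hash<matrix<int>>{}(A) == std::hash<matrix<int>>{}(B)); // same sizes and base: same hash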

Defining a data interface

We can now implement the actual data interface for the matrix class, which defines the basic operations that CUDASTF needs to perform on a matrix.

Every data interface ultimately derives from the data_interface class. To implement a data interface using APIs based on CUDA streams, matrix_stream_interface inherits from stream_data_interface_simple<matrix<T>>, which declares the pure virtual methods that need to be implemented.

template <typename T>
class matrix_stream_interface : public stream_data_interface_simple<matrix<T>> {
public:
    using base = stream_data_interface_simple<matrix<T>>;
    using typename base::shape_t;

    /// Initialize from an existing matrix
    matrix_stream_interface(matrix<T> m) : base(std::move(m)) {}

    /// Initialize from a shape of matrix
    matrix_stream_interface(shape_t s) : base(s) {}

    /// Copy the content of an instance to another instance
    ///
    /// This implementation assumes that host memory is registered (pinned) if one of the data places is the host
    void stream_data_copy(const data_place& dst_memory_node, instance_id_t dst_instance_id,
            const data_place& src_memory_node, instance_id_t src_instance_id, cudaStream_t stream) override {
        assert(src_memory_node != dst_memory_node);

        cudaMemcpyKind kind = cudaMemcpyDeviceToDevice;
        if (src_memory_node == data_place::host) {
            kind = cudaMemcpyHostToDevice;
        }

        if (dst_memory_node == data_place::host) {
            kind = cudaMemcpyDeviceToHost;
        }

        const matrix<T>& src_instance = this->instance(src_instance_id);
        const matrix<T>& dst_instance = this->instance(dst_instance_id);

        size_t sz = src_instance.m * src_instance.n * sizeof(T);

        cuda_safe_call(cudaMemcpyAsync((void*) dst_instance.base, (void*) src_instance.base, sz, kind, stream));
    }

    /// Allocate an instance on a specific data place
    ///
    /// Setting s to a negative value informs CUDASTF that the allocation
    /// failed, and that a memory reclaiming mechanism needs to be performed.
    void stream_data_allocate(backend_ctx_untyped& ctx, const data_place& memory_node, instance_id_t instance_id, ssize_t& s,
            void** extra_args, cudaStream_t stream) override {
        matrix<T>& instance = this->instance(instance_id);
        size_t sz = instance.m * instance.n * sizeof(T);

        T* base_ptr;

        if (memory_node == data_place::host) {
            // Fallback to a synchronous method as there is no asynchronous host allocation API
            cuda_safe_call(cudaStreamSynchronize(stream));
            cuda_safe_call(cudaHostAlloc(&base_ptr, sz, cudaHostAllocMapped));
        } else {
            cuda_safe_call(cudaMallocAsync(&base_ptr, sz, stream));
        }

        // By storing a positive value, we notify that the allocation was successful
        s = sz;

        instance.base = base_ptr;
    }

    /// Deallocate an instance
    void stream_data_deallocate(backend_ctx_untyped& ctx, const data_place& memory_node, instance_id_t instance_id, void* extra_args,
            cudaStream_t stream) override {
        matrix<T>& instance = this->instance(instance_id);
        if (memory_node == data_place::host) {
            // Fallback to a synchronous method as there is no asynchronous host deallocation API
            cuda_safe_call(cudaStreamSynchronize(stream));
            cuda_safe_call(cudaFreeHost(instance.base));
        } else {
            cuda_safe_call(cudaFreeAsync(instance.base, stream));
        }
    }

    /// Register the host memory associated with an instance of matrix
    ///
    /// Note that this pin_host_memory method is not mandatory, but without it, it is
    /// the responsibility of the user to only pass memory that is already
    /// registered, and the allocation method on the host must allocate
    /// registered memory too. Otherwise, copy methods need to be synchronous.
    bool pin_host_memory(instance_id_t instance_id) override {
        matrix<T>& instance = this->instance(instance_id);
        if (!instance.base) {
            return false;
        }

        cuda_safe_call(pin_memory(instance.base, instance.m * instance.n * sizeof(T)));

        return true;
    }

    /// Unregister memory pinned by pin_host_memory
    void unpin_host_memory(instance_id_t instance_id) override {
        matrix<T>& instance = this->instance(instance_id);
        unpin_memory(instance.base);
    }
};

matrix_stream_interface must meet the following requirements so that it can be used in the CUDA stream backend:

- It must provide constructors which take either a matrix, or a shape of matrix, as argument.
- It must implement the stream_data_copy, stream_data_allocate and stream_data_deallocate virtual methods, which respectively define how to copy an instance into another instance, how to allocate an instance, and how to deallocate an instance.
- It may implement the pin_host_memory and unpin_host_memory virtual methods, which respectively register and unregister the memory associated with an instance allocated on the host. These two methods are not mandatory, but without them it is the responsibility of the user to either only pass and allocate registered host buffers, or to ensure that the copy method does not require such memory pinning. Similarly, accessing an instance located in host memory from a device typically requires registered memory.

Associating a data interface with the CUDA stream backend

To make it possible to initialize a logical data from a matrix, or from the shape of a matrix, with stream_ctx::logical_data, we then need to specialize the cudastf::streamed_interface_of trait class.

The resulting class must simply define a type alias named type, which is the type of the data interface for the CUDA stream backend.

template <typename T>
class cudastf::streamed_interface_of<matrix<T>> {
public:
    using type = matrix_stream_interface<T>;
};

Once we have defined this trait class, it is for example possible to initialize a logical data from a matrix, or from a matrix shape:

std::vector<int> v(m * n, 0);
matrix M(m, n, &v[0]);

// Initialize from a matrix
auto lM = ctx.logical_data(M);

// Initialize from a shape
auto lM2 = ctx.logical_data(shape_of<matrix<int>>(m, n));
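
Note that a logical data created from a shape has no valid data instance yet, so its first use in a task should be a write-only access, which lets CUDASTF allocate an instance with the stream_data_allocate method defined above. A minimal sketch, assuming the usual write() access mode:

// First access must be a write: no valid instance of lM2 exists yet
ctx.task(lM2.write())->*[](cudaStream_t s, auto dM2) {
    // dM2 was allocated on the current device by the matrix data interface;
    // launch a kernel on stream s here to initialize its content
};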

Example of code using the matrix data interface

We can now use the matrix class in CUDASTF and access it from tasks. In this code, we first initialize a matrix on the host, then submit a task which updates its content on the current device. We finally check that the content is correct, relying on the write-back mechanism that automatically updates the reference data instance of a logical data when ctx.sync() is called.

template <typename T>
__global__ void kernel(matrix<T> M) {
    int tid_x = blockIdx.x * blockDim.x + threadIdx.x;
    int nthreads_x = gridDim.x * blockDim.x;

    int tid_y = blockIdx.y * blockDim.y + threadIdx.y;
    int nthreads_y = gridDim.y * blockDim.y;

    for (int x = tid_x; x < M.m; x += nthreads_x)
        for (int y = tid_y; y < M.n; y += nthreads_y) {
            M(x, y) += -x + 7 * y;
        }
}

int main() {
    stream_ctx ctx;

    const size_t m = 8;
    const size_t n = 10;
    std::vector<int> v(m * n);

    for (size_t j = 0; j < n; j++)
        for (size_t i = 0; i < m; i++) {
            v[i + j * m] = 17 * i + 23 * j;
        }

    matrix<int> M(m, n, &v[0]);

    auto lM = ctx.logical_data(M);

    // M(i,j) += -i + 7*j
    ctx.task(lM.rw())->*[](cudaStream_t s, auto dM) { kernel<<<dim3(8, 8), dim3(8, 8), 0, s>>>(dM); };

    ctx.sync();

    for (size_t j = 0; j < n; j++)
        for (size_t i = 0; i < m; i++) {
            assert(v[i + j * m] == (17 * i + 23 * j) + (-i + 7 * j));
        }
}