Tutorial: Building a Merkle Tree Membership Proof Application

This tutorial walks you through building a complete Merkle tree membership proof application using cuPQC-Hash. You'll learn how to generate Merkle trees, create membership proofs, and verify them using GPU-accelerated cryptographic primitives.

cuPQC-Hash supports two computation modes for Merkle trees:

single-block mode (high throughput) where one thread block computes one complete tree
multi-block mode (low latency) where multiple GPU blocks collaborate to compute a single large tree by generating subtrees that are combined

This tutorial demonstrates the multi-block mode, which uses multiple GPU blocks to compute larger trees faster.

Step 1: Project Setup

Clone the cuPQC repository to get the example files:

git clone https://github.com/NVIDIA/cuPQC.git
cd cuPQC/applications/merkle_proof_poseidon2_bb

Step 2: Include Required Headers

Start by including the necessary cuPQC SDK headers:

#include <vector>
#include <iomanip>
#include <iostream>
#include <random>

#include <hash.hpp>

using namespace cupqc;

The hash.hpp header provides access to: - Poseidon2 hash function (BabyBear field) - Merkle tree data structures and operations

Step 3: Define Hash and Merkle Tree Types

Define the hash function and Merkle tree configuration. This example uses multi-block mode for computing large trees:

using HASH   = decltype(POSEIDON2_BB_8_16() + Thread());
using MERKLE = decltype(MERKLE_FIELD_2097152() + BlockDim<256>());
using SUB_MERKLE = decltype(MERKLE_FIELD_2048() + BlockDim<256>());
using FINAL_MERKLE = decltype(MERKLE_FIELD_1024() + BlockDim<256>());

POSEIDON2_BB_8_16() chooses Poseidon2 with BabyBear field, capacity 8, and width 16.
Thread() specifies that the operator will execute independently for each thread.
MERKLE_FIELD_2097152() creates a Merkle tree supporting up to 2,097,152 leaves (main tree).
MERKLE_FIELD_2048() creates subtrees with 2,048 leaves each for parallel processing.
MERKLE_FIELD_1024() creates the final tree with 1,024 leaves to combine subtrees.
BlockDim<256>() specifies using 256 threads per block for tree operations.

Multi-Block Mode Configuration

When using multi-block mode, ensure that the subtree size multiplied by the final tree size equals the main tree size. This ensures proper tree construction where subtrees are correctly combined into the final Merkle tree.

Step 4: Create Leaf Generation Kernel

Create a CUDA kernel to hash input data into Merkle tree leaves:

__global__ void create_leaves_kernel(
    tree<MERKLE::Size, HASH, uint32_t> merkle, 
    const uint32_t* msg, 
    size_t inbuf_len)
{
    HASH hash{};
    size_t i = threadIdx.x + blockIdx.x * blockDim.x;

    if (i < MERKLE::Size) {
        MERKLE().create_leaf(merkle.nodes + i * merkle.digest_size, 
                             msg + i * inbuf_len, 
                             hash, 
                             inbuf_len);
    }
}

MERKLE().create_leaf(): Creates a leaf node in the Merkle tree by hashing the input message data and storing the result in the tree's node array
Each thread processes one leaf, hashing the input data using Poseidon2 and storing the result in the Merkle tree's node array.

Step 5: Implement Tree Generation Function

Create kernels for the multi-block tree generation approach. First, create a kernel to generate subtrees in parallel:

__global__ void generate_sub_tree_kernel(tree<MERKLE::Size, HASH, uint32_t> merkle)
{
    HASH hash{};
    SUB_MERKLE().generate_sub_tree(hash, merkle, blockIdx.x);
}

Create a kernel to generate the final tree from subtrees:

__global__ void generate_final_tree_kernel(
    tree<MERKLE::Size, HASH, uint32_t> merkle, 
    uint32_t* root)
{
    HASH hash;
    tree<FINAL_MERKLE::Size, HASH, uint32_t> merkle_final;
    const size_t left_overs = merkle.size - merkle_final.size;
    merkle_final.nodes = merkle.nodes + left_overs * merkle.digest_size;

    __syncthreads();
    FINAL_MERKLE().generate_tree(hash, merkle_final);

    if(threadIdx.x == 0) {
        for(uint32_t i = 0; i < merkle.digest_size; i++) {
            root[i] = merkle.root()[i];
        }
    }
}

Create a host function to generate the complete Merkle tree using the multi-block approach:

void generate_tree(
    const std::vector<uint32_t>& msg,
    size_t in_len,
    tree<MERKLE::Size, HASH, uint32_t> merkle,
    uint32_t* d_root)
{
    uint32_t* d_msg;
    cudaMalloc(reinterpret_cast<void**>(&d_msg), msg.size() * sizeof(uint32_t));
    cudaMemcpy(d_msg, msg.data(), msg.size() * sizeof(uint32_t), cudaMemcpyHostToDevice);

    create_leaves_kernel<<<MERKLE::Size / MERKLE::BlockDim.x, 
                           MERKLE::BlockDim>>>(merkle, d_msg, in_len);
    generate_sub_tree_kernel<<<FINAL_MERKLE::Size, MERKLE::BlockDim>>>(merkle);
    generate_final_tree_kernel<<<1, MERKLE::BlockDim>>>(merkle, d_root);

    cudaFree(d_msg);
}

SUB_MERKLE().generate_sub_tree(): Generates subtrees in parallel using multiple thread blocks
FINAL_MERKLE().generate_tree(): Generates the final tree by combining the subtrees
This multi-block approach reduces latency for computing large trees by leveraging more GPU resources simultaneously.

Step 6: Implement Proof Generation

Create a kernel to generate membership proofs:

__global__ void generate_proof_kernel(
    proof<MERKLE::Size, HASH, uint32_t> proof_obj, 
    const uint32_t* leaf, 
    const uint32_t leaf_index, 
    const tree<MERKLE::Size, HASH, uint32_t> merkle)
{
    MERKLE().generate_proof(proof_obj, leaf, leaf_index, merkle);
}

And the host function:

void generate_proof_for_leaf(
    uint32_t leaf_index,
    const tree<MERKLE::Size, HASH, uint32_t>& merkle,
    proof<MERKLE::Size, HASH, uint32_t>& proof_obj)
{
    // Get the leaf from the tree
    uint32_t* d_leaf;
    cudaMalloc(&d_leaf, merkle.digest_size * sizeof(uint32_t));
    cudaMemcpy(d_leaf, 
               merkle.nodes + leaf_index * merkle.digest_size, 
               merkle.digest_size * sizeof(uint32_t), 
               cudaMemcpyDeviceToDevice);

    // Generate the proof
    generate_proof_kernel<<<1, 1>>>(proof_obj, d_leaf, leaf_index, merkle);
    cudaDeviceSynchronize();

    cudaFree(d_leaf);
}

Step 7: Implement Proof Verification

Create a verification kernel:

__global__ void verify_proof_kernel(
    const proof<MERKLE::Size, HASH, uint32_t> proof_obj,
    const uint32_t* leaf,
    const uint32_t leaf_index,
    const uint32_t* root,
    bool* verified)
{
    HASH hash{};
    *verified = MERKLE().verify_proof(proof_obj, leaf, leaf_index, root, hash);
}

And the host function:

bool verify_proof(
    const proof<MERKLE::Size, HASH, uint32_t>& proof_obj,
    uint32_t leaf_index,
    const tree<MERKLE::Size, HASH, uint32_t>& merkle,
    const uint32_t* root)
{
    uint32_t* d_leaf;
    bool* d_verified;
    bool verified = false;

    cudaMalloc(&d_leaf, merkle.digest_size * sizeof(uint32_t));
    cudaMalloc(&d_verified, sizeof(bool));

    cudaMemcpy(d_leaf, 
               merkle.nodes + leaf_index * merkle.digest_size, 
               merkle.digest_size * sizeof(uint32_t), 
               cudaMemcpyDeviceToDevice);

    verify_proof_kernel<<<1, 1>>>(proof_obj, d_leaf, leaf_index, root, d_verified);
    cudaDeviceSynchronize();

    cudaMemcpy(&verified, d_verified, sizeof(bool), cudaMemcpyDeviceToHost);

    cudaFree(d_leaf);
    cudaFree(d_verified);

    return verified;
}

Step 8: Create the Main Application

Implement the main function:

int main(int argc, char* argv[]) {
    std::cout << "Merkle Tree Membership Proof Application\n";
    std::cout << "========================================\n\n";

    constexpr auto N = MERKLE::Size;  // 2,097,152 leaves
    constexpr size_t in_len = 64;      // 64 BabyBear field elements per leaf

    // Allocate and initialize Merkle tree
    tree<MERKLE::Size, HASH, uint32_t> merkle;
    merkle.allocate_tree();

    // Generate sample data
    std::vector<uint32_t> msg(in_len * N, 0);
    for (size_t i = 0; i < N; i++) {
        for (size_t j = 0; j < in_len; j++) {
            msg[i * in_len + j] = (i * j) % cupqc_common::BabyBearPrime;
        }
    }

    // Generate tree and root
    uint32_t* d_root = nullptr;
    uint32_t* root = new uint32_t[merkle.digest_size];
    cudaMalloc(&d_root, merkle.digest_size * sizeof(uint32_t));

    std::cout << "Generating Merkle tree with " << N << " elements...\n";
    generate_tree(msg, in_len, merkle, d_root);

    // Copy root to host
    cudaMemcpy(root, d_root, merkle.digest_size * sizeof(uint32_t), 
               cudaMemcpyDeviceToHost);

    std::cout << "Tree generated successfully!\n\n";

    // Generate proof for a random leaf
    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_int_distribution<uint32_t> dis(0, N - 1);
    uint32_t proof_leaf_index = dis(gen);

    proof<MERKLE::Size, HASH, uint32_t> proof_obj;
    proof_obj.allocate_proof();

    std::cout << "Generating proof for leaf " << proof_leaf_index << "...\n";
    generate_proof_for_leaf(proof_leaf_index, merkle, proof_obj);
    std::cout << "Proof generated!\n\n";

    // Verify the proof
    std::cout << "Verifying proof...\n";
    bool is_valid = verify_proof(proof_obj, proof_leaf_index, merkle, d_root);

    if (is_valid) {
        std::cout << "✓ Proof verification: VALID\n";
        std::cout << "  The leaf is confirmed to be a member of the tree.\n";
    } else {
        std::cout << "✗ Proof verification: INVALID\n";
    }

    // Cleanup
    delete[] root;
    cudaFree(d_root);

    return 0;
}

Step 9: Build and Run

The Makefile will build all examples in the folder. Run the Merkle tree membership proof example:

make
./merkle_proof

Customization Tips

Adjust Tree Size

Change the tree size by modifying the MERKLE type. For multi-block mode (using Generate Sub Tree), supported tree sizes range from 2¹⁶ to 2²² leaves (65,536 to 4,194,304 leaves) with subtree sizes from 2¹⁰ to 2¹² leaves (1,024 to 4,096 leaves):

// For smaller trees (2^18 = 262,144 leaves)
using MERKLE = decltype(MERKLE_FIELD_262144() + BlockDim<256>());
using SUB_MERKLE = decltype(MERKLE_FIELD_2048() + BlockDim<256>());
using FINAL_MERKLE = decltype(MERKLE_FIELD_1024() + BlockDim<256>());

// For larger trees (2^22 = 4,194,304 leaves)
using MERKLE = decltype(MERKLE_FIELD_4194304() + BlockDim<256>());
using SUB_MERKLE = decltype(MERKLE_FIELD_4096() + BlockDim<256>());
using FINAL_MERKLE = decltype(MERKLE_FIELD_1024() + BlockDim<256>());

Tree Size Configuration

Ensure that SUB_MERKLE size × FINAL_MERKLE size = MERKLE size, and that all sizes are within the supported ranges. For multi-block mode, tree sizes must be between 2¹⁶ to 2²² leaves with subtree sizes from 2¹⁰ to 2¹² leaves.

A complete list of supported Merkle tree configurations, including single-block mode (Generate Tree) which supports 2² to 2²¹ leaves, is available here: Supported Merkle Tree Configurations

Change Input Data Size

Modify the in_len constant to change the size of each leaf's input data:

constexpr size_t in_len = 128;  // 128 BabyBear field elements

Use Different Hash Functions

You can use different hash functions supported for Merkle Tree operations. Note that Poseidon2 uses uint32_t precision while SHA-2, SHA-3, and SHAKE use uint8_t precision:

// SHA-2 (uint8_t precision)
using HASH = decltype(SHA2_256() + Thread());

// SHA-3 (uint8_t precision)
using HASH = decltype(SHA3_256() + Thread());

// SHAKE (uint8_t precision)
using HASH = decltype(SHAKE128() + Thread());

// Poseidon2 (uint32_t precision) - recommended for ZK applications
using HASH = decltype(POSEIDON2_BB_8_16() + Thread());

A complete list of supported Merkle Tree hash functions is available here: Supported Merkle Tree Hash Functions

Learn More

Example Source Code - Complete Merkle tree membership proof example implementation
cuPQC-Hash User Guide - Usage guide with examples
Merkle Tree API Reference - Complete Merkle tree API documentation
Hash Functions API Reference - Complete hash functions API documentation