Tutorial: Building a Merkle Tree Membership Proof Application
This tutorial walks you through building a complete Merkle tree membership proof application using cuPQC-Hash. You'll learn how to generate Merkle trees, create membership proofs, and verify them using GPU-accelerated cryptographic primitives.
cuPQC-Hash supports two computation modes for Merkle trees:
- single-block mode (high throughput) where one thread block computes one complete tree
- multi-block mode (low latency) where multiple GPU blocks collaborate to compute a single large tree by generating subtrees that are combined
This tutorial demonstrates the multi-block mode, which uses multiple GPU blocks to compute larger trees faster.
Step 1: Project Setup
Clone the cuPQC repository to get the example files:
Step 2: Include Required Headers
Start by including the necessary cuPQC SDK headers:
#include <vector>
#include <iomanip>
#include <iostream>
#include <random>
#include <hash.hpp>
using namespace cupqc;
The hash.hpp header provides access to:
- Poseidon2 hash function (BabyBear field)
- Merkle tree data structures and operations
Step 3: Define Hash and Merkle Tree Types
Define the hash function and Merkle tree configuration. This example uses multi-block mode for computing large trees:
using HASH = decltype(POSEIDON2_BB_8_16() + Thread());
using MERKLE = decltype(MERKLE_FIELD_2097152() + BlockDim<256>());
using SUB_MERKLE = decltype(MERKLE_FIELD_2048() + BlockDim<256>());
using FINAL_MERKLE = decltype(MERKLE_FIELD_1024() + BlockDim<256>());
POSEIDON2_BB_8_16() chooses Poseidon2 with BabyBear field, capacity 8, and width 16.
Thread() specifies that the operator will execute independently for each thread.
MERKLE_FIELD_2097152() creates a Merkle tree supporting up to 2,097,152 leaves (main tree).
MERKLE_FIELD_2048() creates subtrees with 2,048 leaves each for parallel processing.
MERKLE_FIELD_1024() creates the final tree with 1,024 leaves to combine subtrees.
BlockDim<256>() specifies using 256 threads per block for tree operations.
Multi-Block Mode Configuration
When using multi-block mode, ensure that the subtree size multiplied by the final tree size equals the main tree size. This ensures proper tree construction where subtrees are correctly combined into the final Merkle tree.
Step 4: Create Leaf Generation Kernel
Create a CUDA kernel to hash input data into Merkle tree leaves:
__global__ void create_leaves_kernel(
tree<MERKLE::Size, HASH, uint32_t> merkle,
const uint32_t* msg,
size_t inbuf_len)
{
HASH hash{};
size_t i = threadIdx.x + blockIdx.x * blockDim.x;
if (i < MERKLE::Size) {
MERKLE().create_leaf(merkle.nodes + i * merkle.digest_size,
msg + i * inbuf_len,
hash,
inbuf_len);
}
}
MERKLE().create_leaf(): Creates a leaf node in the Merkle tree by hashing the input message data and storing the result in the tree's node array
Each thread processes one leaf, hashing the input data using Poseidon2 and storing the result in the Merkle tree's node array.
Step 5: Implement Tree Generation Function
Create kernels for the multi-block tree generation approach. First, create a kernel to generate subtrees in parallel:
__global__ void generate_sub_tree_kernel(tree<MERKLE::Size, HASH, uint32_t> merkle)
{
HASH hash{};
SUB_MERKLE().generate_sub_tree(hash, merkle, blockIdx.x);
}
Create a kernel to generate the final tree from subtrees:
__global__ void generate_final_tree_kernel(
tree<MERKLE::Size, HASH, uint32_t> merkle,
uint32_t* root)
{
HASH hash;
tree<FINAL_MERKLE::Size, HASH, uint32_t> merkle_final;
const size_t left_overs = merkle.size - merkle_final.size;
merkle_final.nodes = merkle.nodes + left_overs * merkle.digest_size;
__syncthreads();
FINAL_MERKLE().generate_tree(hash, merkle_final);
if(threadIdx.x == 0) {
for(uint32_t i = 0; i < merkle.digest_size; i++) {
root[i] = merkle.root()[i];
}
}
}
Create a host function to generate the complete Merkle tree using the multi-block approach:
void generate_tree(
const std::vector<uint32_t>& msg,
size_t in_len,
tree<MERKLE::Size, HASH, uint32_t> merkle,
uint32_t* d_root)
{
uint32_t* d_msg;
cudaMalloc(reinterpret_cast<void**>(&d_msg), msg.size() * sizeof(uint32_t));
cudaMemcpy(d_msg, msg.data(), msg.size() * sizeof(uint32_t), cudaMemcpyHostToDevice);
create_leaves_kernel<<<MERKLE::Size / MERKLE::BlockDim.x,
MERKLE::BlockDim>>>(merkle, d_msg, in_len);
generate_sub_tree_kernel<<<FINAL_MERKLE::Size, MERKLE::BlockDim>>>(merkle);
generate_final_tree_kernel<<<1, MERKLE::BlockDim>>>(merkle, d_root);
cudaFree(d_msg);
}
SUB_MERKLE().generate_sub_tree(): Generates subtrees in parallel using multiple thread blocks
FINAL_MERKLE().generate_tree(): Generates the final tree by combining the subtrees
This multi-block approach reduces latency for computing large trees by leveraging more GPU resources simultaneously.
Step 6: Implement Proof Generation
Create a kernel to generate membership proofs:
__global__ void generate_proof_kernel(
proof<MERKLE::Size, HASH, uint32_t> proof_obj,
const uint32_t* leaf,
const uint32_t leaf_index,
const tree<MERKLE::Size, HASH, uint32_t> merkle)
{
MERKLE().generate_proof(proof_obj, leaf, leaf_index, merkle);
}
And the host function:
void generate_proof_for_leaf(
uint32_t leaf_index,
const tree<MERKLE::Size, HASH, uint32_t>& merkle,
proof<MERKLE::Size, HASH, uint32_t>& proof_obj)
{
// Get the leaf from the tree
uint32_t* d_leaf;
cudaMalloc(&d_leaf, merkle.digest_size * sizeof(uint32_t));
cudaMemcpy(d_leaf,
merkle.nodes + leaf_index * merkle.digest_size,
merkle.digest_size * sizeof(uint32_t),
cudaMemcpyDeviceToDevice);
// Generate the proof
generate_proof_kernel<<<1, 1>>>(proof_obj, d_leaf, leaf_index, merkle);
cudaDeviceSynchronize();
cudaFree(d_leaf);
}
Step 7: Implement Proof Verification
Create a verification kernel:
__global__ void verify_proof_kernel(
const proof<MERKLE::Size, HASH, uint32_t> proof_obj,
const uint32_t* leaf,
const uint32_t leaf_index,
const uint32_t* root,
bool* verified)
{
HASH hash{};
*verified = MERKLE().verify_proof(proof_obj, leaf, leaf_index, root, hash);
}
And the host function:
bool verify_proof(
const proof<MERKLE::Size, HASH, uint32_t>& proof_obj,
uint32_t leaf_index,
const tree<MERKLE::Size, HASH, uint32_t>& merkle,
const uint32_t* root)
{
uint32_t* d_leaf;
bool* d_verified;
bool verified = false;
cudaMalloc(&d_leaf, merkle.digest_size * sizeof(uint32_t));
cudaMalloc(&d_verified, sizeof(bool));
cudaMemcpy(d_leaf,
merkle.nodes + leaf_index * merkle.digest_size,
merkle.digest_size * sizeof(uint32_t),
cudaMemcpyDeviceToDevice);
verify_proof_kernel<<<1, 1>>>(proof_obj, d_leaf, leaf_index, root, d_verified);
cudaDeviceSynchronize();
cudaMemcpy(&verified, d_verified, sizeof(bool), cudaMemcpyDeviceToHost);
cudaFree(d_leaf);
cudaFree(d_verified);
return verified;
}
Step 8: Create the Main Application
Implement the main function:
int main(int argc, char* argv[]) {
std::cout << "Merkle Tree Membership Proof Application\n";
std::cout << "========================================\n\n";
constexpr auto N = MERKLE::Size; // 2,097,152 leaves
constexpr size_t in_len = 64; // 64 BabyBear field elements per leaf
// Allocate and initialize Merkle tree
tree<MERKLE::Size, HASH, uint32_t> merkle;
merkle.allocate_tree();
// Generate sample data
std::vector<uint32_t> msg(in_len * N, 0);
for (size_t i = 0; i < N; i++) {
for (size_t j = 0; j < in_len; j++) {
msg[i * in_len + j] = (i * j) % cupqc_common::BabyBearPrime;
}
}
// Generate tree and root
uint32_t* d_root = nullptr;
uint32_t* root = new uint32_t[merkle.digest_size];
cudaMalloc(&d_root, merkle.digest_size * sizeof(uint32_t));
std::cout << "Generating Merkle tree with " << N << " elements...\n";
generate_tree(msg, in_len, merkle, d_root);
// Copy root to host
cudaMemcpy(root, d_root, merkle.digest_size * sizeof(uint32_t),
cudaMemcpyDeviceToHost);
std::cout << "Tree generated successfully!\n\n";
// Generate proof for a random leaf
std::random_device rd;
std::mt19937 gen(rd());
std::uniform_int_distribution<uint32_t> dis(0, N - 1);
uint32_t proof_leaf_index = dis(gen);
proof<MERKLE::Size, HASH, uint32_t> proof_obj;
proof_obj.allocate_proof();
std::cout << "Generating proof for leaf " << proof_leaf_index << "...\n";
generate_proof_for_leaf(proof_leaf_index, merkle, proof_obj);
std::cout << "Proof generated!\n\n";
// Verify the proof
std::cout << "Verifying proof...\n";
bool is_valid = verify_proof(proof_obj, proof_leaf_index, merkle, d_root);
if (is_valid) {
std::cout << "✓ Proof verification: VALID\n";
std::cout << " The leaf is confirmed to be a member of the tree.\n";
} else {
std::cout << "✗ Proof verification: INVALID\n";
}
// Cleanup
delete[] root;
cudaFree(d_root);
return 0;
}
Step 9: Build and Run
The Makefile will build all examples in the folder. Run the Merkle tree membership proof example:
Customization Tips
Adjust Tree Size
Change the tree size by modifying the MERKLE type. For multi-block mode (using Generate Sub Tree), supported tree sizes range from 216 to 222 leaves (65,536 to 4,194,304 leaves) with subtree sizes from 210 to 212 leaves (1,024 to 4,096 leaves):
// For smaller trees (2^18 = 262,144 leaves)
using MERKLE = decltype(MERKLE_FIELD_262144() + BlockDim<256>());
using SUB_MERKLE = decltype(MERKLE_FIELD_2048() + BlockDim<256>());
using FINAL_MERKLE = decltype(MERKLE_FIELD_1024() + BlockDim<256>());
// For larger trees (2^22 = 4,194,304 leaves)
using MERKLE = decltype(MERKLE_FIELD_4194304() + BlockDim<256>());
using SUB_MERKLE = decltype(MERKLE_FIELD_4096() + BlockDim<256>());
using FINAL_MERKLE = decltype(MERKLE_FIELD_1024() + BlockDim<256>());
Tree Size Configuration
Ensure that SUB_MERKLE size × FINAL_MERKLE size = MERKLE size, and that all sizes are within the supported ranges. For multi-block mode, tree sizes must be between 216 to 222 leaves with subtree sizes from 210 to 212 leaves.
A complete list of supported Merkle tree configurations, including single-block mode (Generate Tree) which supports 22 to 221 leaves, is available here: Supported Merkle Tree Configurations
Change Input Data Size
Modify the in_len constant to change the size of each leaf's input data:
Use Different Hash Functions
You can use different hash functions supported for Merkle Tree operations. Note that Poseidon2 uses uint32_t precision while SHA-2, SHA-3, and SHAKE use uint8_t precision:
// SHA-2 (uint8_t precision)
using HASH = decltype(SHA2_256() + Thread());
// SHA-3 (uint8_t precision)
using HASH = decltype(SHA3_256() + Thread());
// SHAKE (uint8_t precision)
using HASH = decltype(SHAKE128() + Thread());
// Poseidon2 (uint32_t precision) - recommended for ZK applications
using HASH = decltype(POSEIDON2_BB_8_16() + Thread());
A complete list of supported Merkle Tree hash functions is available here: Supported Merkle Tree Hash Functions
Learn More
- Example Source Code - Complete Merkle tree membership proof example implementation
- cuPQC-Hash User Guide - Usage guide with examples
- Merkle Tree API Reference - Complete Merkle tree API documentation
- Hash Functions API Reference - Complete hash functions API documentation