Storage and Access#

NCore V4 components (see Data Formats) can be persisted in two storage formats and accessed from local or remote storage backends. Each group of components is represented as a zarr group, which can be stored either as a directory-based zarr store or as a single-file indexed tar archive.

Indexed Tar Archive Format (.itar)#

NCore defines a custom .itar (indexed tar) container format specifically tailored to dataset and use-case characteristics in robotics and autonomous vehicle applications. The .itar format packages zarr chunks as sequential tar members in a single file and appends a compressed index at the end of a regular tar archive, combining the streaming efficiency of tar with random-access capability.

../_images/itar.svg

Comparison of a regular tar file with its 512-byte blocks (as used by, e.g., WebDataset; supports linear streaming but no random access) and NCore’s indexed tar format, which appends a compressed index enabling O(1) key lookups and direct seeks to any chunk.#

The .itar store implements the abstract zarr Store interface, so it can be used as a drop-in replacement for directory stores in all NCore APIs. Via UPath, .itar containers can also be accessed transparently from cloud storage backends (e.g., S3, GCS) without requiring a local copy.
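
The trailer-index mechanism can be sketched with only the Python standard library: write tar members, record each member’s data offset, append a compressed name → (offset, size) index plus its length as a fixed-size trailer, then resolve any key with a single seek + read. This is an illustration, not NCore’s implementation — the real .itar index is CBOR/LZMA-encoded, so JSON stands in for CBOR here:

```python
import io
import json
import lzma
import struct
import tarfile

# Build a tiny tar archive in memory with two "chunk" members.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for name, payload in [("c/0.0", b"chunk00"), ("c/0.1", b"chunk01")]:
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))

# Record each member's data offset and size, then append the compressed
# index and its byte length as a fixed-size trailer.
buf.seek(0)
index = {}
with tarfile.open(fileobj=buf, mode="r") as tar:
    for member in tar:
        index[member.name] = (member.offset_data, member.size)
blob = lzma.compress(json.dumps(index).encode())
archive = io.BytesIO(buf.getvalue() + blob + struct.pack("<Q", len(blob)))

# Random access: read the trailer, decompress the index, then one
# seek + read fetches any chunk -- no header scan required.
archive.seek(-8, io.SEEK_END)
(blob_len,) = struct.unpack("<Q", archive.read(8))
archive.seek(-8 - blob_len, io.SEEK_END)
idx = json.loads(lzma.decompress(archive.read(blob_len)))
offset, size = idx["c/0.1"]
archive.seek(offset)
assert archive.read(size) == b"chunk01"
```

Because the index is appended after the regular end-of-archive marker region, the file remains consumable by tar-aware tooling as a linear stream while indexed readers get O(1) lookups.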

Tradeoffs:

  • .itar (container file) – efficient for distribution, cloud storage, and atomic transfers; supports both sequential streaming and random access via the appended index

  • directory store – individual chunk files on disk; simpler for debugging and incremental updates

Both formats are accessed through the same SequenceComponentGroupsReader and SequenceComponentGroupsWriter APIs.

Read Performance#

The chart below compares .itar read throughput against four alternative storage formats on a synthetic dataset of 1k JPEG images (2k and 4k resolutions, ~4.5 GB on local SSD), identical across all formats, with associated per-image metadata (poses, timestamps, etc.). Throughput is measured with init cost excluded (formats pre-opened); init cost is reported separately as time-to-first-read.
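
One way this init/throughput split can be structured is shown below; `benchmark_reads`, `open_fn`, and `read_fn` are hypothetical names for illustration, not the harness actually used for the chart:

```python
import time

def benchmark_reads(open_fn, read_fn, keys):
    """Time-to-first-read covers init plus the first item;
    steady-state throughput excludes init entirely."""
    t0 = time.perf_counter()
    handle = open_fn()                  # init cost (open, parse index, ...)
    first = read_fn(handle, keys[0])    # first item
    ttfr = time.perf_counter() - t0

    t1 = time.perf_counter()
    total = len(first)
    for key in keys[1:]:                # steady-state reads
        total += len(read_fn(handle, key))
    elapsed = max(time.perf_counter() - t1, 1e-9)
    return ttfr, total / elapsed / 1e6  # (seconds, MB/s)

# Example with an in-memory stand-in for a component store:
store = {f"img/{i}": b"\x00" * 4096 for i in range(100)}
ttfr, mbps = benchmark_reads(lambda: store, lambda h, k: h[k], list(store))
```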

../_images/read_performance.png

Read throughput across five storage formats. Full bar = sequential; inner bar = random access.#

Format                 Seq (MB/s)  Rand (MB/s)  Seq latency  Rand latency  Time-to-first-read
.itar                  9847        9492         0.46 ms      0.48 ms       2.1 ms (decompress index)
tarfile (pre-parsed)   8470        7402         0.54 ms      0.62 ms       82.6 ms (scan all file headers)
tarfile (linear scan)  n/a         119          n/a          38.2 ms       per-access (no index)
WebDataset             1557        n/a          2.91 ms      n/a           3.2 ms (pipeline build)
Parquet                1552        1542         2.92 ms      2.98 ms       2929 ms (materialise table)
HDF5                   754         1027         6.01 ms      4.48 ms       1.0 ms (B-tree open)

(WebDataset supports sequential streaming only; the tarfile linear-scan scenario measures random access without an index.)

Even with tarfile’s headers pre-parsed, .itar is 16% faster for sequential reads and 28% faster for random reads. The gap comes from tarfile’s extractfile() wrapping data in three Python layers (ExFileObject → _FileInFile → BufferedReader), while .itar performs a single seek + read.

.itar’s time-to-first-read is 2.1 ms (decompressing a ~14 KB CBOR/LZMA trailer index) vs tarfile’s 82.6 ms (a sequential scan of all tar headers). The tarfile linear-scan scenario (119 MB/s) shows the 80× penalty of random access without any index.
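
The layering claim is easy to verify with the standard library: once a member’s data offset is known, a single seek + read on the raw file returns exactly the bytes that extractfile() produces through its wrapper stack. A minimal demonstration:

```python
import io
import tarfile

# Create a one-member tar archive in memory.
raw = io.BytesIO()
payload = bytes(range(256)) * 4  # 1 KiB of data
with tarfile.open(fileobj=raw, mode="w") as tar:
    info = tarfile.TarInfo("img.jpg")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))

raw.seek(0)
with tarfile.open(fileobj=raw, mode="r") as tar:
    member = tar.getmember("img.jpg")
    # Path 1: extractfile() layers ExFileObject over the archive.
    via_extract = tar.extractfile(member).read()
    # Path 2: one seek + one read against the raw file, using the
    # data offset that an .itar-style index would store up front.
    raw.seek(member.offset_data)
    via_seek = raw.read(member.size)

assert via_extract == via_seek == payload
```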

Note

All reads are from OS page cache (local SSD). On cold disk or network I/O the differences narrow.

Loading V4 Data#

V4 sequences are loaded by specifying one or more component store paths:

from ncore.data.v4 import SequenceComponentGroupsReader
from pathlib import Path

# Load sequence from multiple component stores
reader = SequenceComponentGroupsReader([
    Path("ncore4.zarr.itar"),           # default components
    Path("ncore4-calibv2.zarr.itar"),   # alternative calibration
])

# Access specific components (PosesComponent and CameraSensorComponent are
# imported from the corresponding NCore component modules; see Data Formats)
poses_readers = reader.open_component_readers(PosesComponent.Reader)
camera_readers = reader.open_component_readers(CameraSensorComponent.Reader)

Cloud and Remote Storage Access#

NCore accesses all data paths through UPath (universal_pathlib), a drop-in pathlib.Path replacement built on top of fsspec. This means component stores can be read transparently from cloud storage backends – the same SequenceComponentGroupsReader API works for local files and remote URLs alike.

Supported URL Schemes#

Any protocol that fsspec supports can be used as a component store path. Common examples:

Protocol     Example URL
S3           s3://my-bucket/sequences/seq01/ncore4.zarr.itar
GCS          gs://my-bucket/sequences/seq01/ncore4.zarr.itar
Azure Blob   az://my-container/sequences/seq01/ncore4.zarr.itar
HTTP(S)      https://example.com/data/ncore4.zarr.itar
Local        /data/sequences/seq01/ncore4.zarr.itar
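
The dispatch idea — picking a filesystem backend from the URL scheme — can be illustrated with the standard library. This is a simplification of what fsspec actually does (fsspec has its own protocol parsing, including chained protocols such as `simplecache::`); `protocol_of` is a hypothetical helper:

```python
from urllib.parse import urlsplit

def protocol_of(path: str) -> str:
    # UPath/fsspec select the filesystem implementation from the URL
    # scheme; a bare path with no scheme maps to the local filesystem.
    scheme = urlsplit(path).scheme
    return scheme or "file"

assert protocol_of("s3://my-bucket/sequences/seq01/ncore4.zarr.itar") == "s3"
assert protocol_of("/data/sequences/seq01/ncore4.zarr.itar") == "file"
```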

Required Dependencies#

nvidia-ncore ships with universal_pathlib (and its transitive dependency fsspec), which is sufficient for local paths. To access remote storage you need to install the corresponding fsspec filesystem implementation:

Protocol     Extra package   Credentials / configuration
S3           s3fs            AWS credentials (~/.aws/credentials, env vars, or IAM role)
GCS          gcsfs           GOOGLE_APPLICATION_CREDENTIALS or gcloud auth
Azure Blob   adlfs           AZURE_STORAGE_CONNECTION_STRING or az login
HTTP(S)      (built-in)      n/a

Install the extra package for the protocol you need, for example:

pip install nvidia-ncore s3fs          # for S3
pip install nvidia-ncore gcsfs         # for GCS
pip install nvidia-ncore adlfs         # for Azure Blob

Loading Remote Component Stores#

Pass remote URLs directly to SequenceComponentGroupsReader:

from ncore.data.v4 import SequenceComponentGroupsReader
from upath import UPath

reader = SequenceComponentGroupsReader([
    UPath("s3://my-bucket/sequences/seq01/ncore4.zarr.itar"),
    UPath("s3://my-bucket/sequences/seq01/ncore4-labels.zarr.itar"),
])

For S3-compatible endpoints or when a specific AWS profile is needed, pass additional keyword arguments through UPath:

# Use a named AWS profile
store_path = UPath(
    "s3://my-bucket/sequences/seq01/ncore4.zarr.itar",
    profile="my-aws-profile",
)

# Tune download performance
store_path = UPath(
    "s3://my-bucket/sequences/seq01/ncore4.zarr.itar",
    profile="my-aws-profile",
    default_block_size=50 * 1024 * 1024,   # 50 MB download chunks
    default_cache_type="readahead",          # fsspec file-descriptor caching strategy
)

# Point to an S3-compatible endpoint (e.g. MinIO)
store_path = UPath(
    "s3://my-bucket/sequences/seq01/ncore4.zarr.itar",
    client_kwargs={"endpoint_url": "https://minio.example.com"},
)

reader = SequenceComponentGroupsReader([store_path])

All keyword arguments accepted by the underlying S3FileSystem (or the respective fsspec filesystem class for other protocols) can be forwarded this way.

Performance Recommendations#

  • Use the .itar format for cloud-stored data. The indexed tar archive enables random access with a single file, avoiding the large number of small HTTP requests that directory-based zarr stores would incur.

  • Enable consolidated metadata (the default). The open_consolidated parameter on SequenceComponentGroupsReader is True by default, which pre-loads all zarr metadata in a single read. This is especially important for remote stores where each metadata lookup would otherwise be a separate round-trip.

  • Increase the block size for high-bandwidth connections. The default_block_size parameter on UPath controls how much data is fetched per request. Larger values (e.g. 50–100 MB) reduce the number of requests at the cost of higher per-request latency.

  • Consider local caching for repeated access to the same data. fsspec supports transparent caching via the simplecache or filecache protocols:

    # Cache remote files locally on first access
    store_path = UPath(
        "simplecache::s3://my-bucket/sequences/seq01/ncore4.zarr.itar",
        s3={"profile": "my-aws-profile"},
        simplecache={"cache_storage": "/tmp/ncore_cache"},
    )