Data Formats#

The current NCore data format is V4 (Component-based Format) – a modular format that separates data into independent component stores. Each component (poses, intrinsics, sensors, labels, etc.) is stored as a separate zarr component that can be independently managed, versioned, and combined. This format enables:

  • Flexible data composition from multiple sources

  • Independent component updates without reprocessing entire sequences

  • Parallel access and distributed storage optimization

  • Extensibility through custom component types

  • Fine-grained access control and data sharing

The format uses coordinate system conventions and transformations described in Specification.

For details on how component stores are serialized to disk (.zarr.itar indexed tar archives vs. directory-based .zarr stores) and how to access them from local or cloud storage, see Storage and Access.

V4: Component Store Hierarchy (Component-Based Format)#

The component-based V4 data format represents sequences as collections of component groups. V4 distributes data across modular components that can be independently managed, versioned, and combined to form virtual sequences.

Component Architecture#

Each component group is a zarr store containing a specific number of data component instances. The NCore library provides the following default component types:

The component architecture is extensible, allowing custom component types to be defined for application-specific data.

Component Group Structure#

Each component group has the following root-level structure:

ncore4[-{component_group_name}].zarr[.itar]/
│
├── {sequence_meta_data}
│   ├── sequence_id: str
│   ├── version: str (currently "v4")
│   ├── sequence_timestamp_interval_us: {start, stop}
│   ├── generic_meta_data: {...}
│   └── component_group_name: str
│
└── {component_type}/
    └── {component_instance_name}/
         ├── {component_meta_data}
         │   ├── component_name: str
         │   ├── component_instance_name: str
         │   ├── component_version: str
         │   └── generic_meta_data: {...}
         │
         └── {component_specific_data}...

Poses Component#

The poses component stores both static (time-invariant) and dynamic (time-dependent) rigid transformations between named coordinate frames:

poses/
└── {component_instance_name}/
    ├── static_poses/
    │   └── {attrs}
    │       └── ("source_frame", "target_frame"):
    │           ├── pose: [[4,4]] float32/64
    │           └── dtype: str
    │
    └── dynamic_poses/
        └── {attrs}
            └── ("source_frame", "target_frame"):
                ├── poses: [[N,4,4]] float32/64
                ├── timestamps_us: [N] uint64
                └── dtype: str

For ego-vehicle trajectories, the rig-to-world transformation is typically stored as a dynamic pose under the key ("rig", "world"). Transformations from local world to global world frames (like ECEF) are represented by a ("world", "world_global") record, if applicable.

Static poses are used for sensor extrinsic calibrations. For example, a camera-to-rig transformation would be stored under the key ("camera_front_wide_120fov", "rig").

Intrinsics Component#

Camera and lidar intrinsic model parameters:

intrinsics/
└── {component_instance_name}/
     ├── cameras/
     │   └── {camera_id}/
     │       └── {attrs}
     │           ├── camera_model_type: str
     │           ├── camera_model_parameters: {...}
     │           └── external_distortion_type: str  (optional)
    │
    └── lidars/
        └── {lidar_id}/
            └── {attrs}
                ├── lidar_model_type: str
                └── lidar_model_parameters: {...}

Model types include ftheta, opencv-pinhole, and opencv-fisheye for camera sensors, and row-offset-spinning for lidar sensors. For detailed model parameterizations and mathematical specifications, see Sensor Models.

Masks Component#

Static masks for sensors are stored per sensor instance (currently only cameras are supported):

masks/
└── {component_instance_name}/
    └── cameras/
        └── {camera_id}/
            ├── {attrs}
            │   └── mask_names: [str, ...]
            └── {mask_name} () |Sx  (encoded image, attrs: format: str)

Sensor Components#

Sensor components (cameras, lidars, radars) share a common frame-based structure:

{sensor_type}/
└── {sensor_id}/
    ├── {component_meta_data}
    │
    └── frames/
        ├── {attrs}
        │   └── frames_timestamps_us: [N, 2] uint64  (start, end per frame)
        │
        └── {frame_name}/  (keyed by end-of-frame timestamp)
            ├── {sensor_specific_data}
            └── generic_data/
                ├── {attrs: generic_meta_data}
                └── {named datasets}...

Camera Sensor Frames:

cameras/{camera_id}/frames/{frame_name}/
├── image () |Sx  (encoded image, attrs: format: str)
└── generic_data/...

Lidar Sensor Frames:

Lidar and radar data structures separate ray geometry (ray_bundle/) from multi-return properties (ray_bundle_returns/) for flexible data organization. Non-existing values are indicated via NaNs and must be consistent across all return datasets to define a coherent [R,N] valid-return mask. This mask is stored in bit-packed form as ray_bundle_returns_valid_mask_packed.

lidars/{lidar_id}/frames/{frame_name}/
├── ray_bundle/
│   ├── {attrs: n_rays: int}
│   ├── direction: [N,3] float32     (per-ray normalized ray directions in sensor coordinates)
│   ├── timestamp_us: [N] uint64     (per-ray timestamps of ray measurement time in us)
│   └── model_element: [N,2] uint16  (optional: model-element indices of each ray)
│
├── ray_bundle_returns/
│   ├── {attrs: n_returns: int}
│   ├── distance_m: [R,N] float32    (per-return measured metric distances along rays)
│   ├── intensity: [R,N] float32     (per-return measured return intensity values [0,1])
│   └── ...                          (may include additional return datasets)
│
└── ray_bundle_returns_valid_mask_packed () uint8  (bit-packed [R,N] valid mask, attrs: n_returns, n_rays)

Radar Sensor Frames:

radars/{radar_id}/frames/{frame_name}/
├── ray_bundle/
│   ├── {attrs: n_rays: int}
│   ├── direction: [N,3] float32  (per-ray normalized ray directions in sensor coordinates)
│   └── timestamp_us: [N] uint64  (per-ray timestamps of ray measurement time in us)
│
├── ray_bundle_returns/
│   ├── {attrs: n_returns: int}
│   ├── distance_m: [R,N] float32  (per-return measured metric distances along rays)
│   └── ...                        (may include radial velocities, RCS)
│
└── ray_bundle_returns_valid_mask_packed () uint8  (bit-packed [R,N] valid mask, attrs: n_returns, n_rays)

Cuboids Component#

3D cuboid track observations are stored in a structured format:

cuboids/
└── {component_instance_name}/
    └── cuboids/
        └── {attrs}
            └── cuboid_track_observations: [N] (JSON-serialized list)

Each observation is a JSON-serializable object containing:

  • track_id - Unique track identifier (str)

  • class_id - Object class label (str)

  • timestamp_us - Observation timestamp in us (int)

  • reference_frame_id - Reference frame identifier (str)

  • reference_frame_timestamp_us - Reference frame timestamp in us (int)

  • bbox3 - 3D bounding box in reference frame coordinates

  • source - Label source (e.g., AUTOLABEL, GT_SYNTHETIC)

  • source_version - Optional source version identifier (str)

Observations can be transformed between reference frames using the pose graph and support motion compensation across different sensor frames.

Component Groups#

Multiple component instances can coexist using different component instance names. This enables scenarios such as:

  • Multiple calibrations (e.g., “factory”, “online_refined”)

  • Multiple label sources (e.g., “auto_labels”, “human_verified”)

  • Different processing versions (e.g., “v1”, “v2”)

The default component group name is default. Component stores with different group names are stored in separate zarr archives following the naming pattern: ncore4-{component_group_name}.zarr[.itar].

Custom Components#

The component architecture is extensible: define a ComponentWriter / ComponentReader pair with a unique COMPONENT_NAME and version string, then register instances through SequenceComponentGroupsWriter.

To avoid name clashes with built-in or third-party components, use a reverse-domain naming convention for custom component names, e.g. com.myorg.velocity.

A minimal custom component looks like:

from ncore.data.v4 import ComponentWriter, ComponentReader

class VelocityComponent:
    COMPONENT_NAME = "com.myorg.velocity"

    class Writer(ComponentWriter):
        @staticmethod
        def get_component_name() -> str:
            return VelocityComponent.COMPONENT_NAME

        @staticmethod
        def get_component_version() -> str:
            return "v1"

        def store_velocity(self, velocity, timestamp_us):
            ...  # collect data

        def finalize(self):
            ...  # write zarr datasets to self._group

    class Reader(ComponentReader):
        @staticmethod
        def get_component_name() -> str:
            return VelocityComponent.COMPONENT_NAME

        @staticmethod
        def supports_component_version(version: str) -> bool:
            return version == "v1"

        def get_velocities(self):
            return self._group["velocities"][:], self._group["timestamps_us"][:]

Writers must ensure that all stored timestamps fall within the sequence’s sequence_timestamp_interval_us time range. Existing datasets can be extended with new components by creating a writer via SequenceComponentGroupsWriter.from_reader() and finalizing the additional stores. For a complete working example (including component versioning and backward-compatible readers), see TestDataNewComponent.