Earnings21/22 Dataset Processing Pipeline#

This configuration implements a comprehensive 8-step processing pipeline for converting Earnings21 and Earnings22 datasets to NeMo format with advanced forced alignment capabilities. The pipeline supports both full dataset processing and evaluation subsets with optional speaker segmentation.

Dataset Overview

The Earnings21 dataset is a 39-hour corpus of earnings calls containing entity-dense speech from nine different financial sectors. The Earnings22 dataset provides similar financial domain content. Both datasets include token-level transcripts with metadata, normalization candidates, and entity tags.

Processing Pipeline

The configuration performs the following 8-step data processing:

CreateInitialAudioAndManifest: Initial audio manifest creation from dataset files
FfmpegConvert: Audio format conversion (MP3 → WAV, multi-channel → mono, any sample rate → 16kHz)
CreateFullAudioManifestEarnings21: Ground truth text reconstruction from NLP token files with punctuation/capitalization preservation
SubRegex: Clean text patterns and remove unwanted characters
NeMoForcedAligner: Word-level forced alignment using NeMo ASR models with CTC heads
CreateSentenceSegmentedManifest: Intelligent sentence-level segmentation based on CTM files with punctuation-aware splitting
SpeakerSegmentedManifest: Speaker-change detection and segmentation with optional metadata mapping (optional)
KeepOnlySpecifiedFields: Filter manifest to keep only required fields

Required Arguments

output_directory: Path to the main output directory where all processed files will be stored.
dataset_root: Path to the root directory of Earnings21 or Earnings22 dataset.
dataset_type: Dataset type, should be “earnings21” or “earnings22”.
subset: Dataset subset, should be “full” or “eval10” (earnings21 only). Defaults to “full”.
forced_alignment_model: NeMo ASR model for forced alignment with CTC head. Defaults to “nvidia/parakeet-tdt_ctc-1.1b”.
preserve_punctuation: Whether to preserve punctuation in text. Defaults to true.
preserve_capitalization: Whether to preserve capitalization in text. Defaults to true.
include_speaker_info: Whether to include speaker information in segments. Defaults to true.
include_tags: Whether to include entity tags (earnings21 only). Defaults to false.
use_speaker_metadata_csv: Whether to map speaker IDs to names from speaker-metadata.csv (earnings21 only). Defaults to false.
device: Device for forced alignment (“cuda” or “cpu”). Defaults to “cuda”.
test_mode: Set to true to process only 2 files for testing. Defaults to false.

Output Format

The pipeline generates multiple intermediate manifests and a final filtered manifest:

Step 1 Output (Full audio manifest):

{
  "audio_filepath": "/path/to/dataset/media/file_id.wav",
  "duration": 1800.0,
  "text": "",
  "file_id": "original_file_id"
}

Step 2 Output (Converted audio):

{
  "audio_filepath": "/path/to/output/converted_audio/file_id.wav",
  "duration": 1800.0,
  "text": "",
  "file_id": "original_file_id"
}

Step 3 Output (Full audio with text):

{
  "audio_filepath": "/path/to/output/converted_audio/file_id.wav",
  "duration": 1800.0,
  "text": "Complete transcribed text with punctuation and capitalization.",
  "file_id": "original_file_id"
}

Step 6 Output (Sentence-level segments - Primary Output):

{
  "audio_filepath": "/path/to/output/converted_audio/file_id.wav",
  "duration": 15.2,
  "text": "This is a complete sentence with proper punctuation.",
  "file_id": "original_file_id",
  "segment_id": 0,
  "offset": 45.3,
  "end_time": 60.5,
  "alignment": [
    {"word": "This", "start": 45.3, "end": 45.6},
    {"word": "is", "start": 45.6, "end": 45.8}
  ]
}

Step 7 Output (Speaker-level segments - Optional):

{
  "audio_filepath": "/path/to/output/converted_audio/file_id.wav",
  "duration": 0,
  "text": "Speaker segment text...",
  "file_id": "original_file_id",
  "segment_id": 0,
  "start_time": null,
  "end_time": null,
  "speaker": "speaker_1"
}

Final Output (Filtered manifest):

{
  "audio_filepath": "/path/to/output/converted_audio/file_id.wav",
  "duration": 15.2,
  "offset": 45.3,
  "text": "This is a complete sentence with proper punctuation."
}

Usage Examples

Process Earnings21 full dataset:

python main.py --config-path=dataset_configs/english/earnings --config-name=config \
  dataset_type=earnings21 \
  dataset_root=/path/to/earnings21 \
  output_directory=/path/to/output

Process Earnings22 with custom model:

python main.py --config-path=dataset_configs/english/earnings --config-name=config \
  dataset_type=earnings22 \
  forced_alignment_model=nvidia/parakeet-tdt_ctc-1.1b \
  dataset_root=/path/to/earnings22 \
  output_directory=/path/to/output

Process Earnings21 Eval-10 subset:

python main.py --config-path=dataset_configs/english/earnings --config-name=config \
  dataset_type=earnings21 \
  subset=eval10 \
  dataset_root=/path/to/earnings21 \
  output_directory=/path/to/output

Key Features

Supports both Earnings21 and Earnings22 datasets
Automatic audio format conversion (MP3/WAV → 16kHz mono WAV)
Word-level forced alignment using NeMo ASR models
Sentence-level segmentation based on punctuation patterns
Optional speaker-level segmentation with metadata mapping
Entity-aware processing capabilities
Configurable text processing (punctuation/capitalization preservation)
Test mode for development and debugging

Config link: dataset_configs/english/earnings/config.yaml