Earnings21/22 Dataset Processing Pipeline

Earnings21/22 Dataset Processing Pipeline#

This configuration implements a comprehensive 8-step processing pipeline for converting Earnings21 and Earnings22 datasets to NeMo format with advanced forced alignment capabilities. The pipeline supports both full dataset processing and evaluation subsets with optional speaker segmentation.

Dataset Overview

The Earnings21 dataset is a 39-hour corpus of earnings calls containing entity-dense speech from nine different financial sectors. The Earnings22 dataset provides similar financial domain content. Both datasets include token-level transcripts with metadata, normalization candidates, and entity tags.

Processing Pipeline

The configuration performs the following 8-step data processing:

  1. CreateInitialAudioAndManifest: Initial audio manifest creation from dataset files

  2. FfmpegConvert: Audio format conversion (MP3 → WAV, multi-channel → mono, any sample rate → 16kHz)

  3. CreateFullAudioManifestEarnings21: Ground truth text reconstruction from NLP token files with punctuation/capitalization preservation

  4. SubRegex: Clean text patterns and remove unwanted characters

  5. NeMoForcedAligner: Word-level forced alignment using NeMo ASR models with CTC heads

  6. CreateSentenceSegmentedManifest: Intelligent sentence-level segmentation based on CTM files with punctuation-aware splitting

  7. SpeakerSegmentedManifest: Speaker-change detection and segmentation with optional metadata mapping (optional)

  8. KeepOnlySpecifiedFields: Filter manifest to keep only required fields

Required Arguments

  • output_directory: Path to the main output directory where all processed files will be stored.

  • dataset_root: Path to the root directory of Earnings21 or Earnings22 dataset.

  • dataset_type: Dataset type, should be “earnings21” or “earnings22”.

  • subset: Dataset subset, should be “full” or “eval10” (earnings21 only). Defaults to “full”.

  • forced_alignment_model: NeMo ASR model for forced alignment with CTC head. Defaults to “nvidia/parakeet-tdt_ctc-1.1b”.

  • preserve_punctuation: Whether to preserve punctuation in text. Defaults to true.

  • preserve_capitalization: Whether to preserve capitalization in text. Defaults to true.

  • include_speaker_info: Whether to include speaker information in segments. Defaults to true.

  • include_tags: Whether to include entity tags (earnings21 only). Defaults to false.

  • use_speaker_metadata_csv: Whether to map speaker IDs to names from speaker-metadata.csv (earnings21 only). Defaults to false.

  • device: Device for forced alignment (“cuda” or “cpu”). Defaults to “cuda”.

  • test_mode: Set to true to process only 2 files for testing. Defaults to false.

Output Format

The pipeline generates multiple intermediate manifests and a final filtered manifest:

Step 1 Output (Full audio manifest):

{
  "audio_filepath": "/path/to/dataset/media/file_id.wav",
  "duration": 1800.0,
  "text": "",
  "file_id": "original_file_id"
}

Step 2 Output (Converted audio):

{
  "audio_filepath": "/path/to/output/converted_audio/file_id.wav",
  "duration": 1800.0,
  "text": "",
  "file_id": "original_file_id"
}

Step 3 Output (Full audio with text):

{
  "audio_filepath": "/path/to/output/converted_audio/file_id.wav",
  "duration": 1800.0,
  "text": "Complete transcribed text with punctuation and capitalization.",
  "file_id": "original_file_id"
}

Step 6 Output (Sentence-level segments - Primary Output):

{
  "audio_filepath": "/path/to/output/converted_audio/file_id.wav",
  "duration": 15.2,
  "text": "This is a complete sentence with proper punctuation.",
  "file_id": "original_file_id",
  "segment_id": 0,
  "offset": 45.3,
  "end_time": 60.5,
  "alignment": [
    {"word": "This", "start": 45.3, "end": 45.6},
    {"word": "is", "start": 45.6, "end": 45.8}
  ]
}

Step 7 Output (Speaker-level segments - Optional):

{
  "audio_filepath": "/path/to/output/converted_audio/file_id.wav",
  "duration": 0,
  "text": "Speaker segment text...",
  "file_id": "original_file_id",
  "segment_id": 0,
  "start_time": null,
  "end_time": null,
  "speaker": "speaker_1"
}

Final Output (Filtered manifest):

{
  "audio_filepath": "/path/to/output/converted_audio/file_id.wav",
  "duration": 15.2,
  "offset": 45.3,
  "text": "This is a complete sentence with proper punctuation."
}

Usage Examples

Process Earnings21 full dataset:

python main.py --config-path=dataset_configs/english/earnings --config-name=config \
  dataset_type=earnings21 \
  dataset_root=/path/to/earnings21 \
  output_directory=/path/to/output

Process Earnings22 with custom model:

python main.py --config-path=dataset_configs/english/earnings --config-name=config \
  dataset_type=earnings22 \
  forced_alignment_model=nvidia/parakeet-tdt_ctc-1.1b \
  dataset_root=/path/to/earnings22 \
  output_directory=/path/to/output

Process Earnings21 Eval-10 subset:

python main.py --config-path=dataset_configs/english/earnings --config-name=config \
  dataset_type=earnings21 \
  subset=eval10 \
  dataset_root=/path/to/earnings21 \
  output_directory=/path/to/output

Key Features

  • Supports both Earnings21 and Earnings22 datasets

  • Automatic audio format conversion (MP3/WAV → 16kHz mono WAV)

  • Word-level forced alignment using NeMo ASR models

  • Sentence-level segmentation based on punctuation patterns

  • Optional speaker-level segmentation with metadata mapping

  • Entity-aware processing capabilities

  • Configurable text processing (punctuation/capitalization preservation)

  • Test mode for development and debugging

Config link: dataset_configs/english/earnings/config.yaml