# NV-Ingest: CLI Client Quick Start Guide

This notebook is a quick start guide to using the NV-Ingest CLI client with a running NV-Ingest cluster. It walks through how to:

- Explore the CLI client help utility
- Submit a single file NV-Ingest job with the CLI client
- Submit a batch NV-Ingest job with the CLI client
- View NV-Ingest job outputs

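Specify a few notional input files and output directories for testing. The CLI cells later in this notebook also read `REDIS_HOST` and `REDIS_PORT` from the environment to connect to the running NV-Ingest cluster; the cell below does not set them, so they must already be defined in the environment.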

In [None]:
import os

# sample input file and output directories
SAMPLE_PDF0 = "/workspace/nv-ingest/data/multimodal_test.pdf"
os.environ["SAMPLE_PDF0"] = SAMPLE_PDF0
SAMPLE_PDF1 = "/workspace/nv-ingest/data/functional_validation.pdf"
BATCH_FILE = "/workspace/client_examples/examples/dataset.json"
os.environ["BATCH_FILE"] = BATCH_FILE
OUTPUT_DIRECTORY_SINGLE = "/workspace/client_examples/examples/processed_docs_single"
OUTPUT_DIRECTORY_BATCH = "/workspace/client_examples/examples/processed_docs_batch"
os.environ["OUTPUT_DIRECTORY_SINGLE"] = OUTPUT_DIRECTORY_SINGLE
os.environ["OUTPUT_DIRECTORY_BATCH"] = OUTPUT_DIRECTORY_BATCH
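
Because the CLI cells rely on shell expansion of several environment variables, an unset variable silently expands to an empty string. A minimal sketch of a pre-flight check follows; the helper name `missing_env_vars` is hypothetical and not part of NV-Ingest, and the list of required names is simply the set of variables referenced by the cells in this notebook.

```python
import os


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Variables referenced by the nv-ingest-cli cells in this notebook.
required = [
    "SAMPLE_PDF0",
    "BATCH_FILE",
    "OUTPUT_DIRECTORY_SINGLE",
    "OUTPUT_DIRECTORY_BATCH",
    "REDIS_HOST",
    "REDIS_PORT",
]

missing = missing_env_vars(required)
if missing:
    print("Unset environment variables:", ", ".join(missing))
else:
    print("All required environment variables are set.")
```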

## The NV-Ingest CLI Client

This section illustrates how to use the `nv-ingest-cli` client to submit ingest jobs to a running NV-Ingest cluster.

### Help Utility

The CLI help utility will provide a description of settings and arguments that can be used to configure ingest jobs.

In [None]:
!nv-ingest-cli --help

### Submitting a Single File Job

This section demonstrates a CLI example that submits a single-file, extraction-oriented NV-Ingest job and saves the outputs locally.

In [None]:
!nv-ingest-cli \
  --doc ${SAMPLE_PDF0} \
  --task='extract:{"document_type": "pdf", "extract_method": "pdfium", "extract_text": true, "extract_images": true, "extract_tables": true, "extract_tables_method": "yolox"}' \
  --task='dedup:{"content_type": "image", "filter": true}' \
  --task='filter:{"content_type": "image", "min_size": 128, "max_aspect_ratio": 5.0, "min_aspect_ratio": 0.2, "filter": true}' \
  --client_host=${REDIS_HOST} \
  --client_port=${REDIS_PORT} \
  --output_directory=${OUTPUT_DIRECTORY_SINGLE}
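
Each `--task` value in the command above has the form `<task name>:<JSON options>`, wrapped in single quotes so the shell passes the JSON through unchanged. Hand-written JSON inside shell quotes is easy to get wrong, so the sketch below builds the same argument from a Python dict. The helper name `task_arg` is hypothetical; it only formats a string and does not call NV-Ingest.

```python
import json
import shlex


def task_arg(name, options):
    """Format one --task argument as --task='<name>:<json>' for a POSIX shell."""
    payload = json.dumps(options)  # Python True/False become JSON true/false
    return "--task=" + shlex.quote(f"{name}:{payload}")


# Reproduces the dedup task used in the command above.
print(task_arg("dedup", {"content_type": "image", "filter": True}))
# prints: --task='dedup:{"content_type": "image", "filter": true}'
```

Building the options as a dict keeps them easy to edit and guarantees the JSON is well formed before it reaches the CLI.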

The outputs will be saved locally for usage after job completion.

In [None]:
!tree $OUTPUT_DIRECTORY_SINGLE
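
`tree` shows the layout but may not be installed in every environment. As a hedged alternative, the sketch below counts output files by top-level subdirectory and file suffix using only the standard library. It makes no assumption about the NV-Ingest output schema; the helper name `summarize_outputs` is hypothetical, and the fallback to the current directory exists only so the snippet runs outside this notebook.

```python
import os
from collections import Counter
from pathlib import Path


def summarize_outputs(output_dir):
    """Count files under output_dir, keyed by (top-level subdirectory, suffix)."""
    root = Path(output_dir)
    counts = Counter()
    if not root.is_dir():
        return counts  # nothing to report if the directory does not exist
    for path in root.rglob("*"):
        if path.is_file():
            rel = path.relative_to(root)
            # Files directly under the root are grouped under ".".
            group = rel.parts[0] if len(rel.parts) > 1 else "."
            counts[(group, path.suffix)] += 1
    return counts


output_dir = os.environ.get("OUTPUT_DIRECTORY_SINGLE", ".")
for (group, suffix), n in sorted(summarize_outputs(output_dir).items()):
    print(f"{group}/*{suffix}: {n} file(s)")
```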

### Submitting a Batch Job

Alternatively, a batch job can be submitted using a JSON file that lists the documents to be ingested. This JSON file must include the following keys:

- `sampled_files` - A list of paths to files for the ingest job.
- `metadata` - Requires a `file_type_proportions` key whose value can be empty (this requirement may be deprecated in future versions).

The same ingest task configuration is applied to every file in the list.

Create a notional JSON file to demonstrate the CLI batch job configuration.

In [None]:
import json

batch_files = {"sampled_files": [SAMPLE_PDF0, SAMPLE_PDF1], "metadata": {"file_type_proportions": {}}}

with open(BATCH_FILE, "w") as f:
    json.dump(batch_files, f, indent=2)

In [None]:
!cat $BATCH_FILE
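
Before submitting, the dataset structure described above can be checked with a small validator. This is a minimal sketch based only on the two keys listed in this notebook; the helper name `validate_dataset` is hypothetical, and the CLI may enforce additional rules that are not covered here.

```python
def validate_dataset(dataset):
    """Return a list of problems found in a batch dataset dict (empty if none)."""
    problems = []
    files = dataset.get("sampled_files")
    if not isinstance(files, list) or not all(isinstance(p, str) for p in files):
        problems.append("'sampled_files' must be a list of path strings")
    metadata = dataset.get("metadata")
    if not isinstance(metadata, dict) or "file_type_proportions" not in metadata:
        problems.append("'metadata' must contain a 'file_type_proportions' key")
    return problems


# Illustrative input only; the file names are placeholders.
example = {"sampled_files": ["a.pdf", "b.pdf"], "metadata": {"file_type_proportions": {}}}
print(validate_dataset(example))  # prints: []
```

In the notebook, the same function could be applied to the result of `json.load` on the file written to `BATCH_FILE`.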

The results of this job will be stored locally in the same file hierarchy. The name of each output file maps to the corresponding file name in the dataset file.

In [None]:
!nv-ingest-cli \
  --dataset ${BATCH_FILE} \
  --task='extract:{"document_type": "pdf", "extract_method": "pdfium", "extract_text": true, "extract_images": true, "extract_tables": true, "extract_tables_method": "yolox"}' \
  --task='dedup:{"content_type": "image", "filter": true}' \
  --task='filter:{"content_type": "image", "min_size": 128, "max_aspect_ratio": 5.0, "min_aspect_ratio": 0.2, "filter": true}' \
  --client_host=${REDIS_HOST} \
  --client_port=${REDIS_PORT} \
  --output_directory=${OUTPUT_DIRECTORY_BATCH}

The outputs will be saved locally for usage after job completion.

In [None]:
!tree $OUTPUT_DIRECTORY_BATCH