nv_ingest_api.internal.schemas.extract package

Submodules

nv_ingest_api.internal.schemas.extract.extract_audio_schema module

pydantic model nv_ingest_api.internal.schemas.extract.extract_audio_schema.AudioConfigSchema

Bases: BaseModel

Configuration schema for audio extraction endpoints and options.

Parameters:
  • auth_token (Optional[str], default=None) – Authentication token required for secure services.

  • audio_endpoints (Tuple[Optional[str], Optional[str]], default=(None, None)) – A tuple containing the gRPC and HTTP services for the audio_retriever endpoint. Either the gRPC or HTTP service can be empty, but not both.

validate_endpoints(values)

Validates that at least one of the gRPC or HTTP services is provided for each endpoint.

Raises:
  • ValueError – If both gRPC and HTTP services are empty for any endpoint.

Config:
  • extra (str) – Pydantic config option to forbid extra fields.

Show JSON schema
{
   "title": "AudioConfigSchema",
   "description": "Configuration schema for audio extraction endpoints and options.\n\nParameters\n----------\nauth_token : Optional[str], default=None\n    Authentication token required for secure services.\n\naudio_endpoints : Tuple[str, str]\n    A tuple containing the gRPC and HTTP services for the audio_retriever endpoint.\n    Either the gRPC or HTTP service can be empty, but not both.\n\nMethods\n-------\nvalidate_endpoints(values)\n    Validates that at least one of the gRPC or HTTP services is provided for each endpoint.\n\nRaises\n------\nValueError\n    If both gRPC and HTTP services are empty for any endpoint.\n\nConfig\n------\nextra : str\n    Pydantic config option to forbid extra fields.",
   "type": "object",
   "properties": {
      "auth_token": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "title": "Auth Token"
      },
      "audio_endpoints": {
         "default": [
            null,
            null
         ],
         "maxItems": 2,
         "minItems": 2,
         "prefixItems": [
            {
               "anyOf": [
                  {
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ]
            },
            {
               "anyOf": [
                  {
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ]
            }
         ],
         "title": "Audio Endpoints",
         "type": "array"
      },
      "audio_infer_protocol": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "title": "Audio Infer Protocol"
      },
      "function_id": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "title": "Function Id"
      },
      "use_ssl": {
         "anyOf": [
            {
               "type": "boolean"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "title": "Use Ssl"
      },
      "ssl_cert": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "title": "Ssl Cert"
      },
      "segment_audio": {
         "anyOf": [
            {
               "type": "boolean"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "title": "Segment Audio"
      }
   },
   "additionalProperties": false
}

Config:
  • extra: str = forbid

Fields:
field audio_endpoints: Tuple[str | None, str | None] = (None, None)
field audio_infer_protocol: str | None = None
field auth_token: str | None = None
field function_id: str | None = None
field segment_audio: bool | None = None
field ssl_cert: str | None = None
field use_ssl: bool | None = None

class Config

Bases: object

extra = 'forbid'

classmethod validate_endpoints(values)

Validates the gRPC and HTTP services for all endpoints.

Parameters:

values (dict) – Dictionary containing the values of the attributes for the class.

Returns:

The validated dictionary of values.

Return type:

dict

Raises:

ValueError – If both gRPC and HTTP services are empty for any endpoint.
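
The rule described above (each endpoint pair needs at least one of its gRPC or HTTP services) can be sketched in plain Python. This is an illustration only, not the library's implementation: `is_empty` and `check_endpoint_pair` are hypothetical helper names, treating `None` and blank strings as "empty" is an assumption, and the address in the usage line is a placeholder.

```python
from typing import Optional, Tuple

EndpointPair = Tuple[Optional[str], Optional[str]]


def is_empty(service: Optional[str]) -> bool:
    # Assumption of this sketch: None and blank strings both count as "empty".
    return service is None or service.strip() == ""


def check_endpoint_pair(name: str, endpoints: EndpointPair) -> EndpointPair:
    # The tuple order follows the documentation: (gRPC service, HTTP service).
    grpc_service, http_service = endpoints
    if is_empty(grpc_service) and is_empty(http_service):
        raise ValueError(f"Both gRPC and HTTP services are empty for {name}.")
    return endpoints


# Only the gRPC service is set, so the pair is accepted.
check_endpoint_pair("audio_endpoints", ("localhost:50051", None))
```

A pair such as `(None, None)` or `(None, "  ")` would raise `ValueError` in this sketch.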

pydantic model nv_ingest_api.internal.schemas.extract.extract_audio_schema.AudioExtractorSchema

Bases: BaseModel

Configuration schema for the audio extractor settings.

Parameters:
  • max_queue_size (int, default=1) – The maximum number of items allowed in the processing queue.

  • n_workers (int, default=16) – The number of worker threads to use for processing.

  • raise_on_failure (bool, default=False) – A flag indicating whether to raise an exception on processing failure.

  • audio_extraction_config (Optional[AudioConfigSchema], default=None) – Configuration schema for the audio extraction stage.

Show JSON schema
{
   "title": "AudioExtractorSchema",
   "description": "Configuration schema for the PDF extractor settings.\n\nParameters\n----------\nmax_queue_size : int, default=1\n    The maximum number of items allowed in the processing queue.\n\nn_workers : int, default=16\n    The number of worker threads to use for processing.\n\nraise_on_failure : bool, default=False\n    A flag indicating whether to raise an exception on processing failure.\n\naudio_extraction_config: Optional[AudioConfigSchema], default=None\n    Configuration schema for the audio extraction stage.",
   "type": "object",
   "properties": {
      "max_queue_size": {
         "default": 1,
         "title": "Max Queue Size",
         "type": "integer"
      },
      "n_workers": {
         "default": 16,
         "title": "N Workers",
         "type": "integer"
      },
      "raise_on_failure": {
         "default": false,
         "title": "Raise On Failure",
         "type": "boolean"
      },
      "audio_extraction_config": {
         "anyOf": [
            {
               "$ref": "#/$defs/AudioConfigSchema"
            },
            {
               "type": "null"
            }
         ],
         "default": null
      }
   },
   "$defs": {
      "AudioConfigSchema": {
         "additionalProperties": false,
         "description": "Configuration schema for audio extraction endpoints and options.\n\nParameters\n----------\nauth_token : Optional[str], default=None\n    Authentication token required for secure services.\n\naudio_endpoints : Tuple[str, str]\n    A tuple containing the gRPC and HTTP services for the audio_retriever endpoint.\n    Either the gRPC or HTTP service can be empty, but not both.\n\nMethods\n-------\nvalidate_endpoints(values)\n    Validates that at least one of the gRPC or HTTP services is provided for each endpoint.\n\nRaises\n------\nValueError\n    If both gRPC and HTTP services are empty for any endpoint.\n\nConfig\n------\nextra : str\n    Pydantic config option to forbid extra fields.",
         "properties": {
            "auth_token": {
               "anyOf": [
                  {
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "title": "Auth Token"
            },
            "audio_endpoints": {
               "default": [
                  null,
                  null
               ],
               "maxItems": 2,
               "minItems": 2,
               "prefixItems": [
                  {
                     "anyOf": [
                        {
                           "type": "string"
                        },
                        {
                           "type": "null"
                        }
                     ]
                  },
                  {
                     "anyOf": [
                        {
                           "type": "string"
                        },
                        {
                           "type": "null"
                        }
                     ]
                  }
               ],
               "title": "Audio Endpoints",
               "type": "array"
            },
            "audio_infer_protocol": {
               "anyOf": [
                  {
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "title": "Audio Infer Protocol"
            },
            "function_id": {
               "anyOf": [
                  {
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "title": "Function Id"
            },
            "use_ssl": {
               "anyOf": [
                  {
                     "type": "boolean"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "title": "Use Ssl"
            },
            "ssl_cert": {
               "anyOf": [
                  {
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "title": "Ssl Cert"
            },
            "segment_audio": {
               "anyOf": [
                  {
                     "type": "boolean"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "title": "Segment Audio"
            }
         },
         "title": "AudioConfigSchema",
         "type": "object"
      }
   },
   "additionalProperties": false
}

Config:
  • extra: str = forbid

Fields:
field audio_extraction_config: AudioConfigSchema | None = None
field max_queue_size: int = 1
field n_workers: int = 16
field raise_on_failure: bool = False

class Config

Bases: object

extra = 'forbid'
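
Both audio models set `extra = 'forbid'`, which appears in the JSON schema as `"additionalProperties": false`. The sketch below shows what that means for a nested configuration dictionary by listing keys that are not documented fields. It uses only the standard library; `unknown_keys`, the sample values, and the keys `retries` and `sample_rate` are hypothetical and exist only to trigger the check.

```python
# Field names copied from the two "Fields" lists above.
AUDIO_EXTRACTOR_FIELDS = {
    "max_queue_size", "n_workers", "raise_on_failure", "audio_extraction_config",
}
AUDIO_CONFIG_FIELDS = {
    "auth_token", "audio_endpoints", "audio_infer_protocol", "function_id",
    "use_ssl", "ssl_cert", "segment_audio",
}


def unknown_keys(config: dict) -> list:
    # Top-level keys that are not documented AudioExtractorSchema fields.
    extras = sorted(set(config) - AUDIO_EXTRACTOR_FIELDS)
    # Nested keys that are not documented AudioConfigSchema fields.
    nested = config.get("audio_extraction_config") or {}
    extras += [
        f"audio_extraction_config.{key}"
        for key in sorted(set(nested) - AUDIO_CONFIG_FIELDS)
    ]
    return extras


config = {
    "n_workers": 4,
    "audio_extraction_config": {
        "audio_endpoints": ("localhost:50051", None),  # placeholder address
        "segment_audio": True,
        "sample_rate": 16000,  # hypothetical, not a documented field
    },
    "retries": 3,  # hypothetical, not a documented field
}

print(unknown_keys(config))
# ['retries', 'audio_extraction_config.sample_rate']
```

With `extra = 'forbid'`, a model is expected to reject such keys at validation time instead of silently ignoring them.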

nv_ingest_api.internal.schemas.extract.extract_chart_schema module

pydantic model nv_ingest_api.internal.schemas.extract.extract_chart_schema.ChartExtractorConfigSchema

Bases: BaseModel

Configuration schema for chart extraction service endpoints and options.

Parameters:
  • auth_token (Optional[str], default=None) – Authentication token required for secure services.

  • yolox_endpoints (Tuple[Optional[str], Optional[str]], default=(None, None)) – A tuple containing the gRPC and HTTP services for the yolox endpoint. Either the gRPC or HTTP service can be empty, but not both.

  • ocr_endpoints (Tuple[Optional[str], Optional[str]], default=(None, None)) – A tuple containing the gRPC and HTTP services for the ocr endpoint. Either the gRPC or HTTP service can be empty, but not both.

validate_endpoints(values)

Validates that at least one of the gRPC or HTTP services is provided for each endpoint.

Raises:
  • ValueError – If both gRPC and HTTP services are empty for any endpoint.

Config:
  • extra (str) – Pydantic config option to forbid extra fields.

Show JSON schema
{
   "title": "ChartExtractorConfigSchema",
   "description": "Configuration schema for chart extraction service endpoints and options.\n\nParameters\n----------\nauth_token : Optional[str], default=None\n    Authentication token required for secure services.\n\nyolox_endpoints : Tuple[Optional[str], Optional[str]], default=(None, None)\n    A tuple containing the gRPC and HTTP services for the yolox endpoint.\n    Either the gRPC or HTTP service can be empty, but not both.\n\nocr_endpoints : Tuple[Optional[str], Optional[str]], default=(None, None)\n    A tuple containing the gRPC and HTTP services for the ocr endpoint.\n    Either the gRPC or HTTP service can be empty, but not both.\n\nMethods\n-------\nvalidate_endpoints(values)\n    Validates that at least one of the gRPC or HTTP services is provided for each endpoint.\n\nRaises\n------\nValueError\n    If both gRPC and HTTP services are empty for any endpoint.\n\nConfig\n------\nextra : str\n    Pydantic config option to forbid extra fields.",
   "type": "object",
   "properties": {
      "auth_token": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "title": "Auth Token"
      },
      "yolox_endpoints": {
         "default": [
            null,
            null
         ],
         "maxItems": 2,
         "minItems": 2,
         "prefixItems": [
            {
               "anyOf": [
                  {
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ]
            },
            {
               "anyOf": [
                  {
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ]
            }
         ],
         "title": "Yolox Endpoints",
         "type": "array"
      },
      "yolox_infer_protocol": {
         "default": "",
         "title": "Yolox Infer Protocol",
         "type": "string"
      },
      "ocr_endpoints": {
         "default": [
            null,
            null
         ],
         "maxItems": 2,
         "minItems": 2,
         "prefixItems": [
            {
               "anyOf": [
                  {
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ]
            },
            {
               "anyOf": [
                  {
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ]
            }
         ],
         "title": "Ocr Endpoints",
         "type": "array"
      },
      "ocr_infer_protocol": {
         "default": "",
         "title": "Ocr Infer Protocol",
         "type": "string"
      },
      "nim_batch_size": {
         "default": 2,
         "title": "Nim Batch Size",
         "type": "integer"
      },
      "workers_per_progress_engine": {
         "default": 5,
         "title": "Workers Per Progress Engine",
         "type": "integer"
      }
   },
   "additionalProperties": false
}

Config:
  • extra: str = forbid

Fields:
field auth_token: str | None = None
field nim_batch_size: int = 2
field ocr_endpoints: Tuple[str | None, str | None] = (None, None)
field ocr_infer_protocol: str = ''
field workers_per_progress_engine: int = 5
field yolox_endpoints: Tuple[str | None, str | None] = (None, None)
field yolox_infer_protocol: str = ''

Validators:
validator validate_endpoints  »  all fields

Validates the gRPC and HTTP services for all endpoints.

Ensures that at least one service (either gRPC or HTTP) is provided for each endpoint in the configuration.

Parameters:

values (dict) – Dictionary containing the values of the attributes for the class.

Returns:

The validated dictionary of values.

Return type:

dict

Raises:

ValueError – If both gRPC and HTTP services are empty for any endpoint.
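
Because the chart configuration carries two endpoint pairs, the check applies to each of them. As a rough sketch (not the library's code), the function below reports which endpoint fields would trigger the `ValueError`. The name `endpoints_missing_services` and the sample address are hypothetical, and treating any falsy value as "empty" is an assumption.

```python
def endpoints_missing_services(values: dict,
                               endpoint_fields=("yolox_endpoints", "ocr_endpoints")) -> list:
    # Collect every endpoint field whose gRPC and HTTP services are both empty.
    missing = []
    for field in endpoint_fields:
        grpc_service, http_service = values.get(field, (None, None))
        if not grpc_service and not http_service:
            missing.append(field)
    return missing


values = {
    "yolox_endpoints": ("yolox:8001", None),  # placeholder gRPC address
    "ocr_endpoints": (None, None),            # both services empty
}

print(endpoints_missing_services(values))
# ['ocr_endpoints']
```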

pydantic model nv_ingest_api.internal.schemas.extract.extract_chart_schema.ChartExtractorSchema

Bases: BaseModel

Configuration schema for chart extraction processing settings.

Parameters:
  • max_queue_size (int, default=1) – The maximum number of items allowed in the processing queue.

  • n_workers (int, default=2) – The number of worker threads to use for processing.

  • raise_on_failure (bool, default=False) – A flag indicating whether to raise an exception if a failure occurs during chart extraction.

  • endpoint_config (Optional[ChartExtractorConfigSchema], default=None) – Configuration for the chart extraction stage, including yolox and ocr service endpoints.

Show JSON schema
{
   "title": "ChartExtractorSchema",
   "description": "Configuration schema for chart extraction processing settings.\n\nParameters\n----------\nmax_queue_size : int, default=1\n    The maximum number of items allowed in the processing queue.\n\nn_workers : int, default=2\n    The number of worker threads to use for processing.\n\nraise_on_failure : bool, default=False\n    A flag indicating whether to raise an exception if a failure occurs during chart extraction.\n\nextraction_config: Optional[ChartExtractorConfigSchema], default=None\n    Configuration for the chart extraction stage, including yolox and ocr service endpoints.",
   "type": "object",
   "properties": {
      "max_queue_size": {
         "default": 1,
         "title": "Max Queue Size",
         "type": "integer"
      },
      "n_workers": {
         "default": 2,
         "title": "N Workers",
         "type": "integer"
      },
      "raise_on_failure": {
         "default": false,
         "title": "Raise On Failure",
         "type": "boolean"
      },
      "endpoint_config": {
         "anyOf": [
            {
               "$ref": "#/$defs/ChartExtractorConfigSchema"
            },
            {
               "type": "null"
            }
         ],
         "default": null
      }
   },
   "$defs": {
      "ChartExtractorConfigSchema": {
         "additionalProperties": false,
         "description": "Configuration schema for chart extraction service endpoints and options.\n\nParameters\n----------\nauth_token : Optional[str], default=None\n    Authentication token required for secure services.\n\nyolox_endpoints : Tuple[Optional[str], Optional[str]], default=(None, None)\n    A tuple containing the gRPC and HTTP services for the yolox endpoint.\n    Either the gRPC or HTTP service can be empty, but not both.\n\nocr_endpoints : Tuple[Optional[str], Optional[str]], default=(None, None)\n    A tuple containing the gRPC and HTTP services for the ocr endpoint.\n    Either the gRPC or HTTP service can be empty, but not both.\n\nMethods\n-------\nvalidate_endpoints(values)\n    Validates that at least one of the gRPC or HTTP services is provided for each endpoint.\n\nRaises\n------\nValueError\n    If both gRPC and HTTP services are empty for any endpoint.\n\nConfig\n------\nextra : str\n    Pydantic config option to forbid extra fields.",
         "properties": {
            "auth_token": {
               "anyOf": [
                  {
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "title": "Auth Token"
            },
            "yolox_endpoints": {
               "default": [
                  null,
                  null
               ],
               "maxItems": 2,
               "minItems": 2,
               "prefixItems": [
                  {
                     "anyOf": [
                        {
                           "type": "string"
                        },
                        {
                           "type": "null"
                        }
                     ]
                  },
                  {
                     "anyOf": [
                        {
                           "type": "string"
                        },
                        {
                           "type": "null"
                        }
                     ]
                  }
               ],
               "title": "Yolox Endpoints",
               "type": "array"
            },
            "yolox_infer_protocol": {
               "default": "",
               "title": "Yolox Infer Protocol",
               "type": "string"
            },
            "ocr_endpoints": {
               "default": [
                  null,
                  null
               ],
               "maxItems": 2,
               "minItems": 2,
               "prefixItems": [
                  {
                     "anyOf": [
                        {
                           "type": "string"
                        },
                        {
                           "type": "null"
                        }
                     ]
                  },
                  {
                     "anyOf": [
                        {
                           "type": "string"
                        },
                        {
                           "type": "null"
                        }
                     ]
                  }
               ],
               "title": "Ocr Endpoints",
               "type": "array"
            },
            "ocr_infer_protocol": {
               "default": "",
               "title": "Ocr Infer Protocol",
               "type": "string"
            },
            "nim_batch_size": {
               "default": 2,
               "title": "Nim Batch Size",
               "type": "integer"
            },
            "workers_per_progress_engine": {
               "default": 5,
               "title": "Workers Per Progress Engine",
               "type": "integer"
            }
         },
         "title": "ChartExtractorConfigSchema",
         "type": "object"
      }
   },
   "additionalProperties": false
}

Config:
  • extra: str = forbid

Fields:
field endpoint_config: ChartExtractorConfigSchema | None = None
field max_queue_size: int = 1
field n_workers: int = 2
field raise_on_failure: bool = False

Validators:
validator check_positive  »  max_queue_size, n_workers
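
This page does not describe what check_positive enforces. Judging only by its name and the fields it covers, it presumably rejects non-positive values for max_queue_size and n_workers. The sketch below illustrates that assumed behavior in plain Python. The function signature and error message are hypothetical and are not taken from the library.

```python
def check_positive(field_name: str, value: int) -> int:
    # Assumed rule: the value must be strictly greater than zero.
    if value <= 0:
        raise ValueError(f"{field_name} must be greater than 0, got {value}.")
    return value


# The documented defaults pass the assumed check.
check_positive("max_queue_size", 1)
check_positive("n_workers", 2)
```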

nv_ingest_api.internal.schemas.extract.extract_docx_schema module

pydantic model nv_ingest_api.internal.schemas.extract.extract_docx_schema.DocxConfigSchema

Bases: BaseModel

Configuration schema for docx extraction endpoints and options.

Parameters:
  • auth_token (Optional[str], default=None) – Authentication token required for secure services.

  • yolox_endpoints (Tuple[Optional[str], Optional[str]], default=(None, None)) – A tuple containing the gRPC and HTTP services for the yolox endpoint. Either the gRPC or HTTP service can be empty, but not both.

validate_endpoints(values)

Validates that at least one of the gRPC or HTTP services is provided for each endpoint.

Raises:
  • ValueError – If both gRPC and HTTP services are empty for any endpoint.

Config:
  • extra (str) – Pydantic config option to forbid extra fields.

Show JSON schema
{
   "title": "DocxConfigSchema",
   "description": "Configuration schema for docx extraction endpoints and options.\n\nParameters\n----------\nauth_token : Optional[str], default=None\n    Authentication token required for secure services.\n\nyolox_endpoints : Tuple[str, str]\n    A tuple containing the gRPC and HTTP services for the yolox endpoint.\n    Either the gRPC or HTTP service can be empty, but not both.\n\nMethods\n-------\nvalidate_endpoints(values)\n    Validates that at least one of the gRPC or HTTP services is provided for each endpoint.\n\nRaises\n------\nValueError\n    If both gRPC and HTTP services are empty for any endpoint.\n\nConfig\n------\nextra : str\n    Pydantic config option to forbid extra fields.",
   "type": "object",
   "properties": {
      "auth_token": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "title": "Auth Token"
      },
      "yolox_endpoints": {
         "default": [
            null,
            null
         ],
         "maxItems": 2,
         "minItems": 2,
         "prefixItems": [
            {
               "anyOf": [
                  {
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ]
            },
            {
               "anyOf": [
                  {
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ]
            }
         ],
         "title": "Yolox Endpoints",
         "type": "array"
      },
      "yolox_infer_protocol": {
         "default": "",
         "title": "Yolox Infer Protocol",
         "type": "string"
      }
   },
   "additionalProperties": false
}

Config:
  • extra: str = forbid

Fields:
field auth_token: str | None = None
field yolox_endpoints: Tuple[str | None, str | None] = (None, None)
field yolox_infer_protocol: str = ''

Validators:
validator validate_endpoints  »  all fields

Validates the gRPC and HTTP services for all endpoints.

Parameters:

values (dict) – Dictionary containing the values of the attributes for the class.

Returns:

The validated dictionary of values.

Return type:

dict

Raises:

ValueError – If both gRPC and HTTP services are empty for any endpoint.

pydantic model nv_ingest_api.internal.schemas.extract.extract_docx_schema.DocxExtractorSchema

Bases: BaseModel

Configuration schema for the docx extractor settings.

Parameters:
  • max_queue_size (int, default=1) – The maximum number of items allowed in the processing queue.

  • n_workers (int, default=16) – The number of worker threads to use for processing.

  • raise_on_failure (bool, default=False) – A flag indicating whether to raise an exception on processing failure.

  • docx_extraction_config (Optional[DocxConfigSchema], default=None) – Configuration schema for the docx extraction stage.

Show JSON schema
{
   "title": "DocxExtractorSchema",
   "description": "Configuration schema for the PDF extractor settings.\n\nParameters\n----------\nmax_queue_size : int, default=1\n    The maximum number of items allowed in the processing queue.\n\nn_workers : int, default=16\n    The number of worker threads to use for processing.\n\nraise_on_failure : bool, default=False\n    A flag indicating whether to raise an exception on processing failure.\n\nimage_extraction_config: Optional[ImageConfigSchema], default=None\n    Configuration schema for the image extraction stage.",
   "type": "object",
   "properties": {
      "max_queue_size": {
         "default": 1,
         "title": "Max Queue Size",
         "type": "integer"
      },
      "n_workers": {
         "default": 16,
         "title": "N Workers",
         "type": "integer"
      },
      "raise_on_failure": {
         "default": false,
         "title": "Raise On Failure",
         "type": "boolean"
      },
      "docx_extraction_config": {
         "anyOf": [
            {
               "$ref": "#/$defs/DocxConfigSchema"
            },
            {
               "type": "null"
            }
         ],
         "default": null
      }
   },
   "$defs": {
      "DocxConfigSchema": {
         "additionalProperties": false,
         "description": "Configuration schema for docx extraction endpoints and options.\n\nParameters\n----------\nauth_token : Optional[str], default=None\n    Authentication token required for secure services.\n\nyolox_endpoints : Tuple[str, str]\n    A tuple containing the gRPC and HTTP services for the yolox endpoint.\n    Either the gRPC or HTTP service can be empty, but not both.\n\nMethods\n-------\nvalidate_endpoints(values)\n    Validates that at least one of the gRPC or HTTP services is provided for each endpoint.\n\nRaises\n------\nValueError\n    If both gRPC and HTTP services are empty for any endpoint.\n\nConfig\n------\nextra : str\n    Pydantic config option to forbid extra fields.",
         "properties": {
            "auth_token": {
               "anyOf": [
                  {
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "title": "Auth Token"
            },
            "yolox_endpoints": {
               "default": [
                  null,
                  null
               ],
               "maxItems": 2,
               "minItems": 2,
               "prefixItems": [
                  {
                     "anyOf": [
                        {
                           "type": "string"
                        },
                        {
                           "type": "null"
                        }
                     ]
                  },
                  {
                     "anyOf": [
                        {
                           "type": "string"
                        },
                        {
                           "type": "null"
                        }
                     ]
                  }
               ],
               "title": "Yolox Endpoints",
               "type": "array"
            },
            "yolox_infer_protocol": {
               "default": "",
               "title": "Yolox Infer Protocol",
               "type": "string"
            }
         },
         "title": "DocxConfigSchema",
         "type": "object"
      }
   },
   "additionalProperties": false
}

Config:
  • extra: str = forbid

Fields:
field docx_extraction_config: DocxConfigSchema | None = None
field max_queue_size: int = 1
field n_workers: int = 16
field raise_on_failure: bool = False

nv_ingest_api.internal.schemas.extract.extract_html_schema module

pydantic model nv_ingest_api.internal.schemas.extract.extract_html_schema.HtmlExtractorSchema

Bases: BaseModel

Configuration schema for the HTML extractor settings.

Parameters:
  • max_queue_size (int, default=1) – The maximum number of items allowed in the processing queue.

  • n_workers (int, default=16) – The number of worker threads to use for processing.

  • raise_on_failure (bool, default=False) – A flag indicating whether to raise an exception on processing failure.

Show JSON schema
{
   "title": "HtmlExtractorSchema",
   "description": "Configuration schema for the Html extractor settings.\n\nParameters\n----------\nmax_queue_size : int, default=1\n    The maximum number of items allowed in the processing queue.\n\nn_workers : int, default=16\n    The number of worker threads to use for processing.\n\nraise_on_failure : bool, default=False\n    A flag indicating whether to raise an exception on processing failure.",
   "type": "object",
   "properties": {
      "max_queue_size": {
         "default": 1,
         "title": "Max Queue Size",
         "type": "integer"
      },
      "n_workers": {
         "default": 16,
         "title": "N Workers",
         "type": "integer"
      },
      "raise_on_failure": {
         "default": false,
         "title": "Raise On Failure",
         "type": "boolean"
      }
   },
   "additionalProperties": false
}

Config:
  • extra: str = forbid

Fields:
field max_queue_size: int = 1
field n_workers: int = 16
field raise_on_failure: bool = False
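
The JSON schema shown for HtmlExtractorSchema can also be read programmatically. As a small illustration, the snippet below loads an abbreviated copy of that schema (the long "description" string is omitted) with the standard `json` module and collects each property's default. The variable names are arbitrary.

```python
import json

# Abbreviated copy of the HtmlExtractorSchema JSON schema from this page.
schema_text = """
{
   "title": "HtmlExtractorSchema",
   "type": "object",
   "properties": {
      "max_queue_size": {"default": 1, "title": "Max Queue Size", "type": "integer"},
      "n_workers": {"default": 16, "title": "N Workers", "type": "integer"},
      "raise_on_failure": {"default": false, "title": "Raise On Failure", "type": "boolean"}
   },
   "additionalProperties": false
}
"""

schema = json.loads(schema_text)

# Map each property name to its default, skipping properties without one.
defaults = {
    name: spec["default"]
    for name, spec in schema["properties"].items()
    if "default" in spec
}

print(defaults)
# {'max_queue_size': 1, 'n_workers': 16, 'raise_on_failure': False}
```

The result matches the defaults in the field list: a queue size of 1, 16 workers, and no exception raised on failure.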

nv_ingest_api.internal.schemas.extract.extract_image_schema module

pydantic model nv_ingest_api.internal.schemas.extract.extract_image_schema.ImageConfigSchema

Bases: BaseModel

Configuration schema for image extraction endpoints and options.

Parameters:
  • auth_token (Optional[str], default=None) – Authentication token required for secure services.

  • yolox_endpoints (Tuple[str, str]) – A tuple containing the gRPC and HTTP services for the yolox endpoint. Either the gRPC or HTTP service can be empty, but not both.

validate_endpoints(values)[source]#

Validates that at least one of the gRPC or HTTP services is provided for each endpoint.

Raises:
  • ValueError – If both gRPC and HTTP services are empty for any endpoint.

Config:
  • extra (str) – Pydantic config option to forbid extra fields.

Show JSON schema
{
   "title": "ImageConfigSchema",
   "description": "Configuration schema for image extraction endpoints and options.\n\nParameters\n----------\nauth_token : Optional[str], default=None\n    Authentication token required for secure services.\n\nyolox_endpoints : Tuple[str, str]\n    A tuple containing the gRPC and HTTP services for the yolox endpoint.\n    Either the gRPC or HTTP service can be empty, but not both.\n\nMethods\n-------\nvalidate_endpoints(values)\n    Validates that at least one of the gRPC or HTTP services is provided for each endpoint.\n\nRaises\n------\nValueError\n    If both gRPC and HTTP services are empty for any endpoint.\n\nConfig\n------\nextra : str\n    Pydantic config option to forbid extra fields.",
   "type": "object",
   "properties": {
      "auth_token": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "title": "Auth Token"
      },
      "yolox_endpoints": {
         "default": [
            null,
            null
         ],
         "maxItems": 2,
         "minItems": 2,
         "prefixItems": [
            {
               "anyOf": [
                  {
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ]
            },
            {
               "anyOf": [
                  {
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ]
            }
         ],
         "title": "Yolox Endpoints",
         "type": "array"
      },
      "yolox_infer_protocol": {
         "default": "",
         "title": "Yolox Infer Protocol",
         "type": "string"
      }
   },
   "additionalProperties": false
}

Config:
  • extra: str = forbid

Fields:
field auth_token: str | None = None#
  Validated by: validate_endpoints
field yolox_endpoints: Tuple[str | None, str | None] = (None, None)#
  Validated by: validate_endpoints
field yolox_infer_protocol: str = ''#
  Validated by: validate_endpoints

Validators:
validator validate_endpoints  »  all fields[source]#

Validates the gRPC and HTTP services for all endpoints.

Parameters:

values (dict) – Dictionary containing the values of the attributes for the class.

Returns:

The validated dictionary of values.

Return type:

dict

Raises:

ValueError – If both gRPC and HTTP services are empty for any endpoint.

pydantic model nv_ingest_api.internal.schemas.extract.extract_image_schema.ImageExtractorSchema[source]#

Bases: BaseModel

Configuration schema for the image extractor settings.

Parameters:
  • max_queue_size (int, default=1) – The maximum number of items allowed in the processing queue.

  • n_workers (int, default=16) – The number of worker threads to use for processing.

  • raise_on_failure (bool, default=False) – A flag indicating whether to raise an exception on processing failure.

  • image_extraction_config (Optional[ImageConfigSchema], default=None) – Configuration schema for the image extraction stage.

Show JSON schema
{
   "title": "ImageExtractorSchema",
   "description": "Configuration schema for the PDF extractor settings.\n\nParameters\n----------\nmax_queue_size : int, default=1\n    The maximum number of items allowed in the processing queue.\n\nn_workers : int, default=16\n    The number of worker threads to use for processing.\n\nraise_on_failure : bool, default=False\n    A flag indicating whether to raise an exception on processing failure.\n\nimage_extraction_config: Optional[ImageConfigSchema], default=None\n    Configuration schema for the image extraction stage.",
   "type": "object",
   "properties": {
      "max_queue_size": {
         "default": 1,
         "title": "Max Queue Size",
         "type": "integer"
      },
      "n_workers": {
         "default": 16,
         "title": "N Workers",
         "type": "integer"
      },
      "raise_on_failure": {
         "default": false,
         "title": "Raise On Failure",
         "type": "boolean"
      },
      "image_extraction_config": {
         "anyOf": [
            {
               "$ref": "#/$defs/ImageConfigSchema"
            },
            {
               "type": "null"
            }
         ],
         "default": null
      }
   },
   "$defs": {
      "ImageConfigSchema": {
         "additionalProperties": false,
         "description": "Configuration schema for image extraction endpoints and options.\n\nParameters\n----------\nauth_token : Optional[str], default=None\n    Authentication token required for secure services.\n\nyolox_endpoints : Tuple[str, str]\n    A tuple containing the gRPC and HTTP services for the yolox endpoint.\n    Either the gRPC or HTTP service can be empty, but not both.\n\nMethods\n-------\nvalidate_endpoints(values)\n    Validates that at least one of the gRPC or HTTP services is provided for each endpoint.\n\nRaises\n------\nValueError\n    If both gRPC and HTTP services are empty for any endpoint.\n\nConfig\n------\nextra : str\n    Pydantic config option to forbid extra fields.",
         "properties": {
            "auth_token": {
               "anyOf": [
                  {
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "title": "Auth Token"
            },
            "yolox_endpoints": {
               "default": [
                  null,
                  null
               ],
               "maxItems": 2,
               "minItems": 2,
               "prefixItems": [
                  {
                     "anyOf": [
                        {
                           "type": "string"
                        },
                        {
                           "type": "null"
                        }
                     ]
                  },
                  {
                     "anyOf": [
                        {
                           "type": "string"
                        },
                        {
                           "type": "null"
                        }
                     ]
                  }
               ],
               "title": "Yolox Endpoints",
               "type": "array"
            },
            "yolox_infer_protocol": {
               "default": "",
               "title": "Yolox Infer Protocol",
               "type": "string"
            }
         },
         "title": "ImageConfigSchema",
         "type": "object"
      }
   },
   "additionalProperties": false
}

Config:
  • extra: str = forbid

Fields:
field image_extraction_config: ImageConfigSchema | None = None#
field max_queue_size: int = 1#
field n_workers: int = 16#
field raise_on_failure: bool = False#
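
As an illustration of how the nested `image_extraction_config` sits inside the extractor settings, here is a sketch that builds a payload matching the JSON schema above and round-trips it through `json`. The service address `localhost:8001` is a placeholder assumption, and the plain dict stands in for the pydantic model so that the example runs without the package installed.

```python
import json

# Placeholder service address; substitute the values for your deployment.
config = {
    "max_queue_size": 1,
    "n_workers": 16,
    "raise_on_failure": False,
    "image_extraction_config": {
        "auth_token": None,
        # (gRPC, HTTP) pair. JSON has no tuple type, so it is a 2-item array.
        "yolox_endpoints": ["localhost:8001", None],
    },
}

payload = json.dumps(config, indent=3)
restored = json.loads(payload)
```

Only the gRPC slot is filled here, which satisfies the rule that at least one of the two services is provided.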

nv_ingest_api.internal.schemas.extract.extract_infographic_schema module#

pydantic model nv_ingest_api.internal.schemas.extract.extract_infographic_schema.InfographicExtractorConfigSchema[source]#

Bases: BaseModel

Configuration schema for infographic extraction service endpoints and options.

Parameters:
  • auth_token (Optional[str], default=None) – Authentication token required for secure services.

  • ocr_endpoints (Tuple[Optional[str], Optional[str]], default=(None, None)) – A tuple containing the gRPC and HTTP services for the ocr endpoint. Either the gRPC or HTTP service can be empty, but not both.

validate_endpoints(values)[source]#

Validates that at least one of the gRPC or HTTP services is provided for each endpoint.

Raises:
  • ValueError – If both gRPC and HTTP services are empty for any endpoint.

Config:
  • extra (str) – Pydantic config option to forbid extra fields.

Show JSON schema
{
   "title": "InfographicExtractorConfigSchema",
   "description": "Configuration schema for infographic extraction service endpoints and options.\n\nParameters\n----------\nauth_token : Optional[str], default=None\n    Authentication token required for secure services.\n\nocr_endpoints : Tuple[Optional[str], Optional[str]], default=(None, None)\n    A tuple containing the gRPC and HTTP services for the ocr endpoint.\n    Either the gRPC or HTTP service can be empty, but not both.\n\nMethods\n-------\nvalidate_endpoints(values)\n    Validates that at least one of the gRPC or HTTP services is provided for each endpoint.\n\nRaises\n------\nValueError\n    If both gRPC and HTTP services are empty for any endpoint.\n\nConfig\n------\nextra : str\n    Pydantic config option to forbid extra fields.",
   "type": "object",
   "properties": {
      "auth_token": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "title": "Auth Token"
      },
      "ocr_endpoints": {
         "default": [
            null,
            null
         ],
         "maxItems": 2,
         "minItems": 2,
         "prefixItems": [
            {
               "anyOf": [
                  {
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ]
            },
            {
               "anyOf": [
                  {
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ]
            }
         ],
         "title": "Ocr Endpoints",
         "type": "array"
      },
      "ocr_infer_protocol": {
         "default": "",
         "title": "Ocr Infer Protocol",
         "type": "string"
      },
      "nim_batch_size": {
         "default": 2,
         "title": "Nim Batch Size",
         "type": "integer"
      },
      "workers_per_progress_engine": {
         "default": 5,
         "title": "Workers Per Progress Engine",
         "type": "integer"
      }
   },
   "additionalProperties": false
}

Config:
  • extra: str = forbid

Fields:
field auth_token: str | None = None#
  Validated by: validate_endpoints
field nim_batch_size: int = 2#
  Validated by: validate_endpoints
field ocr_endpoints: Tuple[str | None, str | None] = (None, None)#
  Validated by: validate_endpoints
field ocr_infer_protocol: str = ''#
  Validated by: validate_endpoints
field workers_per_progress_engine: int = 5#
  Validated by: validate_endpoints

Validators:
validator validate_endpoints  »  all fields[source]#

Validates the gRPC and HTTP services for all endpoints.

Ensures that at least one service (either gRPC or HTTP) is provided for each endpoint in the configuration.

Parameters:

values (dict) – Dictionary containing the values of the attributes for the class.

Returns:

The validated dictionary of values.

Return type:

dict

Raises:

ValueError – If both gRPC and HTTP services are empty for any endpoint.
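
The `ocr_endpoints` entry in the JSON schema above is a fixed-length array (`minItems` and `maxItems` of 2) whose items are each a string or null. A rough standard-library check for that shape might look like the following. `is_endpoint_pair` is a hypothetical helper, not part of the package.

```python
def is_endpoint_pair(value):
    """Return True if value is a two-item sequence of strings or None."""
    if not isinstance(value, (list, tuple)) or len(value) != 2:
        return False  # the schema requires exactly two items
    return all(item is None or isinstance(item, str) for item in value)
```

This covers the structural constraint only. A pair such as `[None, None]` passes the shape check, yet the `validate_endpoints` validator described above is documented to raise `ValueError` for it, because both services are empty.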

pydantic model nv_ingest_api.internal.schemas.extract.extract_infographic_schema.InfographicExtractorSchema[source]#

Bases: BaseModel

Configuration schema for infographic extraction processing settings.

Parameters:
  • max_queue_size (int, default=1) – The maximum number of items allowed in the processing queue.

  • n_workers (int, default=2) – The number of worker threads to use for processing.

  • raise_on_failure (bool, default=False) – A flag indicating whether to raise an exception if a failure occurs during infographic extraction.

  • endpoint_config (Optional[InfographicExtractorConfigSchema], default=None) – Configuration for the infographic extraction stage, including the OCR service endpoints.

Show JSON schema
{
   "title": "InfographicExtractorSchema",
   "description": "Configuration schema for infographic extraction processing settings.\n\nParameters\n----------\nmax_queue_size : int, default=1\n    The maximum number of items allowed in the processing queue.\n\nn_workers : int, default=2\n    The number of worker threads to use for processing.\n\nraise_on_failure : bool, default=False\n    A flag indicating whether to raise an exception if a failure occurs during infographic extraction.\n\nstage_config : Optional[InfographicExtractorConfigSchema], default=None\n    Configuration for the infographic extraction stage, including yolox and ocr service endpoints.",
   "type": "object",
   "properties": {
      "max_queue_size": {
         "default": 1,
         "title": "Max Queue Size",
         "type": "integer"
      },
      "n_workers": {
         "default": 2,
         "title": "N Workers",
         "type": "integer"
      },
      "raise_on_failure": {
         "default": false,
         "title": "Raise On Failure",
         "type": "boolean"
      },
      "endpoint_config": {
         "anyOf": [
            {
               "$ref": "#/$defs/InfographicExtractorConfigSchema"
            },
            {
               "type": "null"
            }
         ],
         "default": null
      }
   },
   "$defs": {
      "InfographicExtractorConfigSchema": {
         "additionalProperties": false,
         "description": "Configuration schema for infographic extraction service endpoints and options.\n\nParameters\n----------\nauth_token : Optional[str], default=None\n    Authentication token required for secure services.\n\nocr_endpoints : Tuple[Optional[str], Optional[str]], default=(None, None)\n    A tuple containing the gRPC and HTTP services for the ocr endpoint.\n    Either the gRPC or HTTP service can be empty, but not both.\n\nMethods\n-------\nvalidate_endpoints(values)\n    Validates that at least one of the gRPC or HTTP services is provided for each endpoint.\n\nRaises\n------\nValueError\n    If both gRPC and HTTP services are empty for any endpoint.\n\nConfig\n------\nextra : str\n    Pydantic config option to forbid extra fields.",
         "properties": {
            "auth_token": {
               "anyOf": [
                  {
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "title": "Auth Token"
            },
            "ocr_endpoints": {
               "default": [
                  null,
                  null
               ],
               "maxItems": 2,
               "minItems": 2,
               "prefixItems": [
                  {
                     "anyOf": [
                        {
                           "type": "string"
                        },
                        {
                           "type": "null"
                        }
                     ]
                  },
                  {
                     "anyOf": [
                        {
                           "type": "string"
                        },
                        {
                           "type": "null"
                        }
                     ]
                  }
               ],
               "title": "Ocr Endpoints",
               "type": "array"
            },
            "ocr_infer_protocol": {
               "default": "",
               "title": "Ocr Infer Protocol",
               "type": "string"
            },
            "nim_batch_size": {
               "default": 2,
               "title": "Nim Batch Size",
               "type": "integer"
            },
            "workers_per_progress_engine": {
               "default": 5,
               "title": "Workers Per Progress Engine",
               "type": "integer"
            }
         },
         "title": "InfographicExtractorConfigSchema",
         "type": "object"
      }
   },
   "additionalProperties": false
}

Config:
  • extra: str = forbid

Fields:
field endpoint_config: InfographicExtractorConfigSchema | None = None#
field max_queue_size: int = 1#
  Validated by: check_positive
field n_workers: int = 2#
  Validated by: check_positive
field raise_on_failure: bool = False#

Validators:
validator check_positive  »  max_queue_size, n_workers[source]#
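
The validator carries no description here, but its name suggests that it rejects non-positive values for `max_queue_size` and `n_workers`. Under that assumption, and treating "positive" as strictly greater than zero, the check could be sketched as follows. The standalone function and its error message are illustrative, not the library's code.

```python
def check_positive(name, value):
    """Return value if it is greater than zero, else raise ValueError."""
    # Assumption: zero is rejected along with negative numbers.
    if value <= 0:
        raise ValueError(f"{name} must be greater than 0, got {value}.")
    return value


n_workers = check_positive("n_workers", 2)
```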

nv_ingest_api.internal.schemas.extract.extract_pdf_schema module#

pydantic model nv_ingest_api.internal.schemas.extract.extract_pdf_schema.NemoRetrieverParseConfigSchema[source]#

Bases: BaseModel

Configuration schema for NemoRetrieverParse endpoints and options.

Parameters:
  • auth_token (Optional[str], default=None) – Authentication token required for secure services.

  • nemoretriever_parse_endpoints (Tuple[Optional[str], Optional[str]], default=(None, None)) – A tuple containing the gRPC and HTTP services for the nemoretriever_parse endpoint. Either the gRPC or HTTP service can be empty, but not both.

validate_endpoints(values)[source]#

Validates that at least one of the gRPC or HTTP services is provided for each endpoint.

Raises:
  • ValueError – If both gRPC and HTTP services are empty for any endpoint.

Config:
  • extra (str) – Pydantic config option to forbid extra fields.

Show JSON schema
{
   "title": "NemoRetrieverParseConfigSchema",
   "description": "Configuration schema for NemoRetrieverParse endpoints and options.\n\nParameters\n----------\nauth_token : Optional[str], default=None\n    Authentication token required for secure services.\n\nnemoretriever_parse_endpoints : Tuple[str, str]\n    A tuple containing the gRPC and HTTP services for the nemoretriever_parse endpoint.\n    Either the gRPC or HTTP service can be empty, but not both.\n\nMethods\n-------\nvalidate_endpoints(values)\n    Validates that at least one of the gRPC or HTTP services is provided for each endpoint.\n\nRaises\n------\nValueError\n    If both gRPC and HTTP services are empty for any endpoint.\n\nConfig\n------\nextra : str\n    Pydantic config option to forbid extra fields.",
   "type": "object",
   "properties": {
      "auth_token": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "title": "Auth Token"
      },
      "yolox_endpoints": {
         "default": [
            null,
            null
         ],
         "maxItems": 2,
         "minItems": 2,
         "prefixItems": [
            {
               "anyOf": [
                  {
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ]
            },
            {
               "anyOf": [
                  {
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ]
            }
         ],
         "title": "Yolox Endpoints",
         "type": "array"
      },
      "yolox_infer_protocol": {
         "default": "",
         "title": "Yolox Infer Protocol",
         "type": "string"
      },
      "nemoretriever_parse_endpoints": {
         "default": [
            null,
            null
         ],
         "maxItems": 2,
         "minItems": 2,
         "prefixItems": [
            {
               "anyOf": [
                  {
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ]
            },
            {
               "anyOf": [
                  {
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ]
            }
         ],
         "title": "Nemoretriever Parse Endpoints",
         "type": "array"
      },
      "nemoretriever_parse_infer_protocol": {
         "default": "",
         "title": "Nemoretriever Parse Infer Protocol",
         "type": "string"
      },
      "nemoretriever_parse_model_name": {
         "default": "nvidia/nemoretriever-parse",
         "title": "Nemoretriever Parse Model Name",
         "type": "string"
      },
      "timeout": {
         "default": 300.0,
         "title": "Timeout",
         "type": "number"
      },
      "workers_per_progress_engine": {
         "default": 5,
         "title": "Workers Per Progress Engine",
         "type": "integer"
      }
   },
   "additionalProperties": false
}

Config:
  • extra: str = forbid

Fields:
field auth_token: str | None = None#
  Validated by: validate_endpoints
field nemoretriever_parse_endpoints: Tuple[str | None, str | None] = (None, None)#
  Validated by: validate_endpoints
field nemoretriever_parse_infer_protocol: str = ''#
  Validated by: validate_endpoints
field nemoretriever_parse_model_name: str = 'nvidia/nemoretriever-parse'#
  Validated by: validate_endpoints
field timeout: float = 300.0#
  Validated by: validate_endpoints
field workers_per_progress_engine: int = 5#
  Validated by: validate_endpoints
field yolox_endpoints: Tuple[str | None, str | None] = (None, None)#
  Validated by: validate_endpoints
field yolox_infer_protocol: str = ''#
  Validated by: validate_endpoints

Validators:
validator validate_endpoints  »  all fields[source]#

Validates the gRPC and HTTP services for all endpoints.

Parameters:

values (dict) – Dictionary containing the values of the attributes for the class.

Returns:

The validated dictionary of values.

Return type:

dict

Raises:

ValueError – If both gRPC and HTTP services are empty for any endpoint.
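
The documented contract (take a dict of values, return it, and raise `ValueError` when an endpoint has neither service) can be sketched in plain Python. This is an approximation under stated assumptions: the real validator's internals are not shown on this page, and the choice to check both `yolox_endpoints` and `nemoretriever_parse_endpoints` is inferred from the schema's field names.

```python
# Assumption: both endpoint fields from the schema are checked.
ENDPOINT_FIELDS = ("yolox_endpoints", "nemoretriever_parse_endpoints")


def validate_endpoints(values):
    """Return values unchanged if every endpoint field has a service set."""
    for field in ENDPOINT_FIELDS:
        grpc_service, http_service = values.get(field, (None, None))
        # None and the empty string are both falsy, so both count as empty.
        if not grpc_service and not http_service:
            raise ValueError(
                f"Both gRPC and HTTP services are empty for {field}."
            )
    return values
```

With this sketch, a dict that fills one slot of each pair is returned as is, while omitting either field raises `ValueError`, because the missing field falls back to `(None, None)`.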

pydantic model nv_ingest_api.internal.schemas.extract.extract_pdf_schema.PDFExtractorSchema[source]#

Bases: BaseModel

Configuration schema for the PDF extractor settings.

Parameters:
  • max_queue_size (int, default=1) – The maximum number of items allowed in the processing queue.

  • n_workers (int, default=16) – The number of worker threads to use for processing.

  • raise_on_failure (bool, default=False) – A flag indicating whether to raise an exception on processing failure.

  • pdfium_config (Optional[PDFiumConfigSchema], default=None) – Configuration for the PDFium service endpoints.

Show JSON schema
{
   "title": "PDFExtractorSchema",
   "description": "Configuration schema for the PDF extractor settings.\n\nParameters\n----------\nmax_queue_size : int, default=1\n    The maximum number of items allowed in the processing queue.\n\nn_workers : int, default=16\n    The number of worker threads to use for processing.\n\nraise_on_failure : bool, default=False\n    A flag indicating whether to raise an exception on processing failure.\n\npdfium_config : Optional[PDFiumConfigSchema], default=None\n    Configuration for the PDFium service endpoints.",
   "type": "object",
   "properties": {
      "max_queue_size": {
         "default": 1,
         "title": "Max Queue Size",
         "type": "integer"
      },
      "n_workers": {
         "default": 16,
         "title": "N Workers",
         "type": "integer"
      },
      "raise_on_failure": {
         "default": false,
         "title": "Raise On Failure",
         "type": "boolean"
      },
      "pdfium_config": {
         "anyOf": [
            {
               "$ref": "#/$defs/PDFiumConfigSchema"
            },
            {
               "type": "null"
            }
         ],
         "default": null
      },
      "nemoretriever_parse_config": {
         "anyOf": [
            {
               "$ref": "#/$defs/NemoRetrieverParseConfigSchema"
            },
            {
               "type": "null"
            }
         ],
         "default": null
      }
   },
   "$defs": {
      "NemoRetrieverParseConfigSchema": {
         "additionalProperties": false,
         "description": "Configuration schema for NemoRetrieverParse endpoints and options.\n\nParameters\n----------\nauth_token : Optional[str], default=None\n    Authentication token required for secure services.\n\nnemoretriever_parse_endpoints : Tuple[str, str]\n    A tuple containing the gRPC and HTTP services for the nemoretriever_parse endpoint.\n    Either the gRPC or HTTP service can be empty, but not both.\n\nMethods\n-------\nvalidate_endpoints(values)\n    Validates that at least one of the gRPC or HTTP services is provided for each endpoint.\n\nRaises\n------\nValueError\n    If both gRPC and HTTP services are empty for any endpoint.\n\nConfig\n------\nextra : str\n    Pydantic config option to forbid extra fields.",
         "properties": {
            "auth_token": {
               "anyOf": [
                  {
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "title": "Auth Token"
            },
            "yolox_endpoints": {
               "default": [
                  null,
                  null
               ],
               "maxItems": 2,
               "minItems": 2,
               "prefixItems": [
                  {
                     "anyOf": [
                        {
                           "type": "string"
                        },
                        {
                           "type": "null"
                        }
                     ]
                  },
                  {
                     "anyOf": [
                        {
                           "type": "string"
                        },
                        {
                           "type": "null"
                        }
                     ]
                  }
               ],
               "title": "Yolox Endpoints",
               "type": "array"
            },
            "yolox_infer_protocol": {
               "default": "",
               "title": "Yolox Infer Protocol",
               "type": "string"
            },
            "nemoretriever_parse_endpoints": {
               "default": [
                  null,
                  null
               ],
               "maxItems": 2,
               "minItems": 2,
               "prefixItems": [
                  {
                     "anyOf": [
                        {
                           "type": "string"
                        },
                        {
                           "type": "null"
                        }
                     ]
                  },
                  {
                     "anyOf": [
                        {
                           "type": "string"
                        },
                        {
                           "type": "null"
                        }
                     ]
                  }
               ],
               "title": "Nemoretriever Parse Endpoints",
               "type": "array"
            },
            "nemoretriever_parse_infer_protocol": {
               "default": "",
               "title": "Nemoretriever Parse Infer Protocol",
               "type": "string"
            },
            "nemoretriever_parse_model_name": {
               "default": "nvidia/nemoretriever-parse",
               "title": "Nemoretriever Parse Model Name",
               "type": "string"
            },
            "timeout": {
               "default": 300.0,
               "title": "Timeout",
               "type": "number"
            },
            "workers_per_progress_engine": {
               "default": 5,
               "title": "Workers Per Progress Engine",
               "type": "integer"
            }
         },
         "title": "NemoRetrieverParseConfigSchema",
         "type": "object"
      },
      "PDFiumConfigSchema": {
         "additionalProperties": false,
         "description": "Configuration schema for PDFium endpoints and options.\n\nParameters\n----------\nauth_token : Optional[str], default=None\n    Authentication token required for secure services.\n\nyolox_endpoints : Tuple[str, str]\n    A tuple containing the gRPC and HTTP services for the yolox endpoint.\n    Either the gRPC or HTTP service can be empty, but not both.\n\nMethods\n-------\nvalidate_endpoints(values)\n    Validates that at least one of the gRPC or HTTP services is provided for each endpoint.\n\nRaises\n------\nValueError\n    If both gRPC and HTTP services are empty for any endpoint.\n\nConfig\n------\nextra : str\n    Pydantic config option to forbid extra fields.",
         "properties": {
            "auth_token": {
               "anyOf": [
                  {
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "title": "Auth Token"
            },
            "yolox_endpoints": {
               "default": [
                  null,
                  null
               ],
               "maxItems": 2,
               "minItems": 2,
               "prefixItems": [
                  {
                     "anyOf": [
                        {
                           "type": "string"
                        },
                        {
                           "type": "null"
                        }
                     ]
                  },
                  {
                     "anyOf": [
                        {
                           "type": "string"
                        },
                        {
                           "type": "null"
                        }
                     ]
                  }
               ],
               "title": "Yolox Endpoints",
               "type": "array"
            },
            "yolox_infer_protocol": {
               "default": "",
               "title": "Yolox Infer Protocol",
               "type": "string"
            },
            "nim_batch_size": {
               "default": 4,
               "title": "Nim Batch Size",
               "type": "integer"
            },
            "workers_per_progress_engine": {
               "default": 5,
               "title": "Workers Per Progress Engine",
               "type": "integer"
            }
         },
         "title": "PDFiumConfigSchema",
         "type": "object"
      }
   },
   "additionalProperties": false
}

Config:
  • extra: str = forbid

Fields:
field max_queue_size: int = 1#
field n_workers: int = 16#
field nemoretriever_parse_config: NemoRetrieverParseConfigSchema | None = None#
field pdfium_config: PDFiumConfigSchema | None = None#
field raise_on_failure: bool = False#

pydantic model nv_ingest_api.internal.schemas.extract.extract_pdf_schema.PDFiumConfigSchema[source]#

Bases: BaseModel

Configuration schema for PDFium endpoints and options.

Parameters:
  • auth_token (Optional[str], default=None) – Authentication token required for secure services.

  • yolox_endpoints (Tuple[Optional[str], Optional[str]], default=(None, None)) – A tuple containing the gRPC and HTTP services for the yolox endpoint. Either the gRPC or HTTP service can be empty, but not both.

validate_endpoints(values)[source]#

Validates that at least one of the gRPC or HTTP services is provided for each endpoint.

Raises:
  • ValueError – If both gRPC and HTTP services are empty for any endpoint.

Config:
  • extra (str) – Pydantic config option to forbid extra fields.

Show JSON schema
{
   "title": "PDFiumConfigSchema",
   "description": "Configuration schema for PDFium endpoints and options.\n\nParameters\n----------\nauth_token : Optional[str], default=None\n    Authentication token required for secure services.\n\nyolox_endpoints : Tuple[str, str]\n    A tuple containing the gRPC and HTTP services for the yolox endpoint.\n    Either the gRPC or HTTP service can be empty, but not both.\n\nMethods\n-------\nvalidate_endpoints(values)\n    Validates that at least one of the gRPC or HTTP services is provided for each endpoint.\n\nRaises\n------\nValueError\n    If both gRPC and HTTP services are empty for any endpoint.\n\nConfig\n------\nextra : str\n    Pydantic config option to forbid extra fields.",
   "type": "object",
   "properties": {
      "auth_token": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "title": "Auth Token"
      },
      "yolox_endpoints": {
         "default": [
            null,
            null
         ],
         "maxItems": 2,
         "minItems": 2,
         "prefixItems": [
            {
               "anyOf": [
                  {
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ]
            },
            {
               "anyOf": [
                  {
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ]
            }
         ],
         "title": "Yolox Endpoints",
         "type": "array"
      },
      "yolox_infer_protocol": {
         "default": "",
         "title": "Yolox Infer Protocol",
         "type": "string"
      },
      "nim_batch_size": {
         "default": 4,
         "title": "Nim Batch Size",
         "type": "integer"
      },
      "workers_per_progress_engine": {
         "default": 5,
         "title": "Workers Per Progress Engine",
         "type": "integer"
      }
   },
   "additionalProperties": false
}

Config:
  • extra: str = forbid

Fields:
field auth_token: str | None = None#
Validated by: validate_endpoints
field nim_batch_size: int = 4#
Validated by: validate_endpoints
field workers_per_progress_engine: int = 5#
Validated by: validate_endpoints
field yolox_endpoints: Tuple[str | None, str | None] = (None, None)#
Validated by: validate_endpoints
field yolox_infer_protocol: str = ''#
Validated by: validate_endpoints

Validators:
validator validate_endpoints  »  all fields[source]#

Validates the gRPC and HTTP services for all endpoints.

Parameters:

values (dict) – Dictionary containing the values of the attributes for the class.

Returns:

The validated dictionary of values.

Return type:

dict

Raises:

ValueError – If both gRPC and HTTP services are empty for any endpoint.
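
The rule described above (at least one of the gRPC or HTTP services must be set) can be sketched in plain Python. This is an illustration only, not the library's implementation: the helper name `check_endpoint_pair`, the error message, and the choice to treat whitespace-only strings as empty are all assumptions.

```python
from typing import Optional, Tuple

EndpointPair = Tuple[Optional[str], Optional[str]]


def check_endpoint_pair(name: str, endpoints: EndpointPair) -> EndpointPair:
    """Hypothetical helper: require at least one of (gRPC, HTTP) to be set."""
    grpc, http = endpoints
    # Assumption: None, "" and whitespace-only strings all count as empty.
    grpc = grpc.strip() if grpc and grpc.strip() else None
    http = http.strip() if http and http.strip() else None
    if grpc is None and http is None:
        raise ValueError(f"Both gRPC and HTTP services cannot be empty for {name}.")
    return grpc, http
```

A pair such as `("yolox:8001", None)` passes through unchanged, while `(None, None)` raises `ValueError`. The host name used here is a placeholder.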

nv_ingest_api.internal.schemas.extract.extract_pptx_schema module#

pydantic model nv_ingest_api.internal.schemas.extract.extract_pptx_schema.PPTXConfigSchema[source]#

Bases: BaseModel

Configuration schema for PPTX extraction endpoints and options.

Parameters:
  • auth_token (Optional[str], default=None) – Authentication token required for secure services.

  • yolox_endpoints (Tuple[Optional[str], Optional[str]], default=(None, None)) – A tuple containing the gRPC and HTTP services for the yolox endpoint. Either the gRPC or HTTP service can be empty, but not both.

validate_endpoints(values)[source]#

Validates that at least one of the gRPC or HTTP services is provided for each endpoint.

Raises:
  • ValueError – If both gRPC and HTTP services are empty for any endpoint.

Config:
  • extra (str) – Pydantic config option to forbid extra fields.

Show JSON schema
{
   "title": "PPTXConfigSchema",
   "description": "Configuration schema for docx extraction endpoints and options.\n\nParameters\n----------\nauth_token : Optional[str], default=None\n    Authentication token required for secure services.\n\nyolox_endpoints : Tuple[str, str]\n    A tuple containing the gRPC and HTTP services for the yolox endpoint.\n    Either the gRPC or HTTP service can be empty, but not both.\n\nMethods\n-------\nvalidate_endpoints(values)\n    Validates that at least one of the gRPC or HTTP services is provided for each endpoint.\n\nRaises\n------\nValueError\n    If both gRPC and HTTP services are empty for any endpoint.\n\nConfig\n------\nextra : str\n    Pydantic config option to forbid extra fields.",
   "type": "object",
   "properties": {
      "auth_token": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "title": "Auth Token"
      },
      "yolox_endpoints": {
         "default": [
            null,
            null
         ],
         "maxItems": 2,
         "minItems": 2,
         "prefixItems": [
            {
               "anyOf": [
                  {
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ]
            },
            {
               "anyOf": [
                  {
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ]
            }
         ],
         "title": "Yolox Endpoints",
         "type": "array"
      },
      "yolox_infer_protocol": {
         "default": "",
         "title": "Yolox Infer Protocol",
         "type": "string"
      }
   },
   "additionalProperties": false
}

Config:
  • extra: str = forbid

Fields:
field auth_token: str | None = None#
Validated by: validate_endpoints
field yolox_endpoints: Tuple[str | None, str | None] = (None, None)#
Validated by: validate_endpoints
field yolox_infer_protocol: str = ''#
Validated by: validate_endpoints

Validators:
validator validate_endpoints  »  all fields[source]#

Validates the gRPC and HTTP services for all endpoints.

Parameters:

values (dict) – Dictionary containing the values of the attributes for the class.

Returns:

The validated dictionary of values.

Return type:

dict

Raises:

ValueError – If both gRPC and HTTP services are empty for any endpoint.
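
The `extra = forbid` setting (shown as `"additionalProperties": false` in the JSON schema) means unknown keys are rejected rather than silently ignored. The following stdlib-only sketch mimics that behaviour for the three fields listed for PPTXConfigSchema. The helper name `reject_unknown_keys` and the error wording are assumptions and do not reproduce Pydantic's actual error type.

```python
# Field names taken from the PPTXConfigSchema JSON schema above.
PPTX_CONFIG_FIELDS = {"auth_token", "yolox_endpoints", "yolox_infer_protocol"}


def reject_unknown_keys(payload: dict, allowed: set) -> dict:
    """Hypothetical helper: fail if the payload has keys outside `allowed`."""
    unknown = sorted(set(payload) - allowed)
    if unknown:
        # Pydantic would raise its own validation error here; ValueError
        # is used only to keep this sketch dependency-free.
        raise ValueError(f"Extra fields not permitted: {unknown}")
    return payload
```

A misspelled key such as `yolox_endpoint` (singular) would therefore fail fast instead of leaving `yolox_endpoints` at its default.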

pydantic model nv_ingest_api.internal.schemas.extract.extract_pptx_schema.PPTXExtractorSchema[source]#

Bases: BaseModel

Configuration schema for the PPTX extractor settings.

Parameters:
  • max_queue_size (int, default=1) – The maximum number of items allowed in the processing queue.

  • n_workers (int, default=16) – The number of worker threads to use for processing.

  • raise_on_failure (bool, default=False) – A flag indicating whether to raise an exception on processing failure.

  • pptx_extraction_config (Optional[PPTXConfigSchema], default=None) – Configuration schema for the PPTX extraction stage.

Show JSON schema
{
   "title": "PPTXExtractorSchema",
   "description": "Configuration schema for the PDF extractor settings.\n\nParameters\n----------\nmax_queue_size : int, default=1\n    The maximum number of items allowed in the processing queue.\n\nn_workers : int, default=16\n    The number of worker threads to use for processing.\n\nraise_on_failure : bool, default=False\n    A flag indicating whether to raise an exception on processing failure.\n\nimage_extraction_config: Optional[ImageConfigSchema], default=None\n    Configuration schema for the image extraction stage.",
   "type": "object",
   "properties": {
      "max_queue_size": {
         "default": 1,
         "title": "Max Queue Size",
         "type": "integer"
      },
      "n_workers": {
         "default": 16,
         "title": "N Workers",
         "type": "integer"
      },
      "raise_on_failure": {
         "default": false,
         "title": "Raise On Failure",
         "type": "boolean"
      },
      "pptx_extraction_config": {
         "anyOf": [
            {
               "$ref": "#/$defs/PPTXConfigSchema"
            },
            {
               "type": "null"
            }
         ],
         "default": null
      }
   },
   "$defs": {
      "PPTXConfigSchema": {
         "additionalProperties": false,
         "description": "Configuration schema for docx extraction endpoints and options.\n\nParameters\n----------\nauth_token : Optional[str], default=None\n    Authentication token required for secure services.\n\nyolox_endpoints : Tuple[str, str]\n    A tuple containing the gRPC and HTTP services for the yolox endpoint.\n    Either the gRPC or HTTP service can be empty, but not both.\n\nMethods\n-------\nvalidate_endpoints(values)\n    Validates that at least one of the gRPC or HTTP services is provided for each endpoint.\n\nRaises\n------\nValueError\n    If both gRPC and HTTP services are empty for any endpoint.\n\nConfig\n------\nextra : str\n    Pydantic config option to forbid extra fields.",
         "properties": {
            "auth_token": {
               "anyOf": [
                  {
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "title": "Auth Token"
            },
            "yolox_endpoints": {
               "default": [
                  null,
                  null
               ],
               "maxItems": 2,
               "minItems": 2,
               "prefixItems": [
                  {
                     "anyOf": [
                        {
                           "type": "string"
                        },
                        {
                           "type": "null"
                        }
                     ]
                  },
                  {
                     "anyOf": [
                        {
                           "type": "string"
                        },
                        {
                           "type": "null"
                        }
                     ]
                  }
               ],
               "title": "Yolox Endpoints",
               "type": "array"
            },
            "yolox_infer_protocol": {
               "default": "",
               "title": "Yolox Infer Protocol",
               "type": "string"
            }
         },
         "title": "PPTXConfigSchema",
         "type": "object"
      }
   },
   "additionalProperties": false
}

Config:
  • extra: str = forbid

Fields:
field max_queue_size: int = 1#
field n_workers: int = 16#
field pptx_extraction_config: PPTXConfigSchema | None = None#
field raise_on_failure: bool = False#
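
To show how the defaults listed above combine with a partial configuration, here is a minimal stdlib-only sketch that parses a JSON payload and fills in the documented defaults. The helper name `with_defaults`, the example endpoint URL, and the shallow-merge strategy are assumptions for illustration; the real model applies its defaults through Pydantic.

```python
import json

# Defaults copied from the PPTXExtractorSchema field list above.
PPTX_EXTRACTOR_DEFAULTS = {
    "max_queue_size": 1,
    "n_workers": 16,
    "raise_on_failure": False,
    "pptx_extraction_config": None,
}


def with_defaults(raw_json: str) -> dict:
    """Hypothetical helper: overlay supplied top-level keys on the defaults."""
    supplied = json.loads(raw_json)
    # Shallow merge only: a nested pptx_extraction_config is taken as given.
    return {**PPTX_EXTRACTOR_DEFAULTS, **supplied}


# The URL below is a placeholder, not a documented service address.
settings = with_defaults(
    '{"n_workers": 4, "pptx_extraction_config": '
    '{"yolox_endpoints": [null, "http://yolox:8000/v1/infer"]}}'
)
```

Here `settings["n_workers"]` is overridden while `max_queue_size` and `raise_on_failure` keep their documented defaults.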

nv_ingest_api.internal.schemas.extract.extract_table_schema module#

pydantic model nv_ingest_api.internal.schemas.extract.extract_table_schema.TableExtractorConfigSchema[source]#

Bases: BaseModel

Configuration schema for the table extraction stage settings.

Parameters:
  • auth_token (Optional[str], default=None) – Authentication token required for secure services.

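  • ocr_endpoints (Tuple[Optional[str], Optional[str]], default=(None, None)) – A tuple containing the gRPC and HTTP services for the ocr endpoint. Either the gRPC or HTTP service can be empty, but not both.

  • yolox_endpoints (Tuple[Optional[str], Optional[str]], default=(None, None)) – A tuple containing the gRPC and HTTP services for the yolox endpoint.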

validate_endpoints(values)[source]#

Validates that at least one of the gRPC or HTTP services is provided for the yolox endpoint.

Raises:
  • ValueError – If both gRPC and HTTP services are empty for the yolox endpoint.

Config:
  • extra (str) – Pydantic config option to forbid extra fields.

Show JSON schema
{
   "title": "TableExtractorConfigSchema",
   "description": "Configuration schema for the table extraction stage settings.\n\nParameters\n----------\nauth_token : Optional[str], default=None\n    Authentication token required for secure services.\n\nocr_endpoints : Tuple[Optional[str], Optional[str]], default=(None, None)\n    A tuple containing the gRPC and HTTP services for the ocr endpoint.\n    Either the gRPC or HTTP service can be empty, but not both.\n\nMethods\n-------\nvalidate_endpoints(values)\n    Validates that at least one of the gRPC or HTTP services is provided for the yolox endpoint.\n\nRaises\n------\nValueError\n    If both gRPC and HTTP services are empty for the yolox endpoint.\n\nConfig\n------\nextra : str\n    Pydantic config option to forbid extra fields.",
   "type": "object",
   "properties": {
      "auth_token": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "title": "Auth Token"
      },
      "yolox_endpoints": {
         "default": [
            null,
            null
         ],
         "maxItems": 2,
         "minItems": 2,
         "prefixItems": [
            {
               "anyOf": [
                  {
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ]
            },
            {
               "anyOf": [
                  {
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ]
            }
         ],
         "title": "Yolox Endpoints",
         "type": "array"
      },
      "yolox_infer_protocol": {
         "default": "",
         "title": "Yolox Infer Protocol",
         "type": "string"
      },
      "ocr_endpoints": {
         "default": [
            null,
            null
         ],
         "maxItems": 2,
         "minItems": 2,
         "prefixItems": [
            {
               "anyOf": [
                  {
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ]
            },
            {
               "anyOf": [
                  {
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ]
            }
         ],
         "title": "Ocr Endpoints",
         "type": "array"
      },
      "ocr_infer_protocol": {
         "default": "",
         "title": "Ocr Infer Protocol",
         "type": "string"
      },
      "nim_batch_size": {
         "default": 2,
         "title": "Nim Batch Size",
         "type": "integer"
      },
      "workers_per_progress_engine": {
         "default": 5,
         "title": "Workers Per Progress Engine",
         "type": "integer"
      }
   },
   "additionalProperties": false
}

Config:
  • extra: str = forbid

Fields:
field auth_token: str | None = None#
Validated by: validate_endpoints
field nim_batch_size: int = 2#
Validated by: validate_endpoints
field ocr_endpoints: Tuple[str | None, str | None] = (None, None)#
Validated by: validate_endpoints
field ocr_infer_protocol: str = ''#
Validated by: validate_endpoints
field workers_per_progress_engine: int = 5#
Validated by: validate_endpoints
field yolox_endpoints: Tuple[str | None, str | None] = (None, None)#
Validated by: validate_endpoints
field yolox_infer_protocol: str = ''#
Validated by: validate_endpoints

Validators:
validator validate_endpoints  »  all fields[source]#

Validates the gRPC and HTTP services for the yolox endpoint.

Parameters:

values (dict) – Dictionary containing the values of the attributes for the class.

Returns:

The validated dictionary of values.

Return type:

dict

Raises:

ValueError – If both gRPC and HTTP services are empty for the yolox endpoint.
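
In the JSON schema, both `ocr_endpoints` and `yolox_endpoints` are fixed-length arrays (`minItems` and `maxItems` of 2) whose items are each a string or null. A stdlib-only sketch of that shape check follows. The function name `is_endpoint_pair` is an assumption, and the check covers only the structural constraint, not the "not both empty" rule applied by the validator.

```python
def is_endpoint_pair(value: object) -> bool:
    """Hypothetical check: exactly two items, each a string or None."""
    if not isinstance(value, (list, tuple)) or len(value) != 2:
        return False
    return all(item is None or isinstance(item, str) for item in value)
```

Note that the schema default `[null, null]` satisfies this structural check even though the field documentation says the two services should not both be empty.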

pydantic model nv_ingest_api.internal.schemas.extract.extract_table_schema.TableExtractorSchema[source]#

Bases: BaseModel

Configuration schema for the table extraction processing settings.

Parameters:
  • max_queue_size (int, default=1) – The maximum number of items allowed in the processing queue.

  • n_workers (int, default=2) – The number of worker threads to use for processing.

  • raise_on_failure (bool, default=False) – A flag indicating whether to raise an exception if a failure occurs during table extraction.

  • endpoint_config (Optional[TableExtractorConfigSchema], default=None) – Configuration for the table extraction stage, including yolox and ocr service endpoints.

Show JSON schema
{
   "title": "TableExtractorSchema",
   "description": "Configuration schema for the table extraction processing settings.\n\nParameters\n----------\nmax_queue_size : int, default=1\n    The maximum number of items allowed in the processing queue.\n\nn_workers : int, default=2\n    The number of worker threads to use for processing.\n\nraise_on_failure : bool, default=False\n    A flag indicating whether to raise an exception if a failure occurs during table extraction.\n\nstage_config : Optional[TableExtractorConfigSchema], default=None\n    Configuration for the table extraction stage, including yolox service endpoints.",
   "type": "object",
   "properties": {
      "max_queue_size": {
         "default": 1,
         "title": "Max Queue Size",
         "type": "integer"
      },
      "n_workers": {
         "default": 2,
         "title": "N Workers",
         "type": "integer"
      },
      "raise_on_failure": {
         "default": false,
         "title": "Raise On Failure",
         "type": "boolean"
      },
      "endpoint_config": {
         "anyOf": [
            {
               "$ref": "#/$defs/TableExtractorConfigSchema"
            },
            {
               "type": "null"
            }
         ],
         "default": null
      }
   },
   "$defs": {
      "TableExtractorConfigSchema": {
         "additionalProperties": false,
         "description": "Configuration schema for the table extraction stage settings.\n\nParameters\n----------\nauth_token : Optional[str], default=None\n    Authentication token required for secure services.\n\nocr_endpoints : Tuple[Optional[str], Optional[str]], default=(None, None)\n    A tuple containing the gRPC and HTTP services for the ocr endpoint.\n    Either the gRPC or HTTP service can be empty, but not both.\n\nMethods\n-------\nvalidate_endpoints(values)\n    Validates that at least one of the gRPC or HTTP services is provided for the yolox endpoint.\n\nRaises\n------\nValueError\n    If both gRPC and HTTP services are empty for the yolox endpoint.\n\nConfig\n------\nextra : str\n    Pydantic config option to forbid extra fields.",
         "properties": {
            "auth_token": {
               "anyOf": [
                  {
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "title": "Auth Token"
            },
            "yolox_endpoints": {
               "default": [
                  null,
                  null
               ],
               "maxItems": 2,
               "minItems": 2,
               "prefixItems": [
                  {
                     "anyOf": [
                        {
                           "type": "string"
                        },
                        {
                           "type": "null"
                        }
                     ]
                  },
                  {
                     "anyOf": [
                        {
                           "type": "string"
                        },
                        {
                           "type": "null"
                        }
                     ]
                  }
               ],
               "title": "Yolox Endpoints",
               "type": "array"
            },
            "yolox_infer_protocol": {
               "default": "",
               "title": "Yolox Infer Protocol",
               "type": "string"
            },
            "ocr_endpoints": {
               "default": [
                  null,
                  null
               ],
               "maxItems": 2,
               "minItems": 2,
               "prefixItems": [
                  {
                     "anyOf": [
                        {
                           "type": "string"
                        },
                        {
                           "type": "null"
                        }
                     ]
                  },
                  {
                     "anyOf": [
                        {
                           "type": "string"
                        },
                        {
                           "type": "null"
                        }
                     ]
                  }
               ],
               "title": "Ocr Endpoints",
               "type": "array"
            },
            "ocr_infer_protocol": {
               "default": "",
               "title": "Ocr Infer Protocol",
               "type": "string"
            },
            "nim_batch_size": {
               "default": 2,
               "title": "Nim Batch Size",
               "type": "integer"
            },
            "workers_per_progress_engine": {
               "default": 5,
               "title": "Workers Per Progress Engine",
               "type": "integer"
            }
         },
         "title": "TableExtractorConfigSchema",
         "type": "object"
      }
   },
   "additionalProperties": false
}

Config:
  • extra: str = forbid

Fields:
field endpoint_config: TableExtractorConfigSchema | None = None#
field max_queue_size: int = 1#
Validated by: check_positive
field n_workers: int = 2#
Validated by: check_positive
field raise_on_failure: bool = False#

Validators:
validator check_positive  »  max_queue_size, n_workers[source]#
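
The validator's body is not shown here, but its name suggests that `max_queue_size` and `n_workers` must be greater than zero. A minimal sketch under that assumption follows. The function name `check_positive_sketch`, the exact condition (`<= 0`), and the message are hypothetical.

```python
def check_positive_sketch(field_name: str, value: int) -> int:
    """Hypothetical stand-in for check_positive: reject zero and negatives."""
    # Assumption: "positive" means strictly greater than zero.
    if value <= 0:
        raise ValueError(f"{field_name} must be greater than 0, got {value}")
    return value
```

Under this reading, the documented defaults (`max_queue_size=1`, `n_workers=2`) pass, and a value such as `n_workers=0` is rejected.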

Module contents#