nv_ingest_api.internal.schemas.extract package#
Submodules#
nv_ingest_api.internal.schemas.extract.extract_audio_schema module#
- pydantic model nv_ingest_api.internal.schemas.extract.extract_audio_schema.AudioConfigSchema[source]#
Bases:
BaseModel
Configuration schema for audio extraction endpoints and options.
- Parameters:
auth_token (Optional[str], default=None) – Authentication token required for secure services.
audio_endpoints (Tuple[Optional[str], Optional[str]], default=(None, None)) – A tuple containing the gRPC and HTTP services for the audio_retriever endpoint. Either the gRPC or HTTP service can be empty, but not both.
- validate_endpoints(values)[source]#
Validates that at least one of the gRPC or HTTP services is provided for each endpoint.
- Raises:
ValueError – If both gRPC and HTTP services are empty for any endpoint.
- Config:
extra (str) – Pydantic config option to forbid extra fields.
Show JSON schema
{ "title": "AudioConfigSchema", "description": "Configuration schema for audio extraction endpoints and options.\n\nParameters\n----------\nauth_token : Optional[str], default=None\n Authentication token required for secure services.\n\naudio_endpoints : Tuple[str, str]\n A tuple containing the gRPC and HTTP services for the audio_retriever endpoint.\n Either the gRPC or HTTP service can be empty, but not both.\n\nMethods\n-------\nvalidate_endpoints(values)\n Validates that at least one of the gRPC or HTTP services is provided for each endpoint.\n\nRaises\n------\nValueError\n If both gRPC and HTTP services are empty for any endpoint.\n\nConfig\n------\nextra : str\n Pydantic config option to forbid extra fields.", "type": "object", "properties": { "auth_token": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "title": "Auth Token" }, "audio_endpoints": { "default": [ null, null ], "maxItems": 2, "minItems": 2, "prefixItems": [ { "anyOf": [ { "type": "string" }, { "type": "null" } ] }, { "anyOf": [ { "type": "string" }, { "type": "null" } ] } ], "title": "Audio Endpoints", "type": "array" }, "audio_infer_protocol": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "title": "Audio Infer Protocol" }, "function_id": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "title": "Function Id" }, "use_ssl": { "anyOf": [ { "type": "boolean" }, { "type": "null" } ], "default": null, "title": "Use Ssl" }, "ssl_cert": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "title": "Ssl Cert" }, "segment_audio": { "anyOf": [ { "type": "boolean" }, { "type": "null" } ], "default": null, "title": "Segment Audio" } }, "additionalProperties": false }
- Config:
extra: str = forbid
- Fields:
- field audio_endpoints: Tuple[str | None, str | None] = (None, None)#
- field audio_infer_protocol: str | None = None#
- field auth_token: str | None = None#
- field function_id: str | None = None#
- field segment_audio: bool | None = None#
- field ssl_cert: str | None = None#
- field use_ssl: bool | None = None#
- classmethod validate_endpoints(values)[source]#
Validates the gRPC and HTTP services for all endpoints.
- Parameters:
values (dict) – Dictionary containing the values of the attributes for the class.
- Returns:
The validated dictionary of values.
- Return type:
dict
- Raises:
ValueError – If both gRPC and HTTP services are empty for any endpoint.
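The rule described above (at least one of the gRPC or HTTP services must be non-empty) can be illustrated with a small standard-library sketch. This is not the library's implementation: `check_endpoint_pair` and `_clean` are hypothetical helpers, the addresses in the usage note are made up, and treating blank strings as empty is an assumption of this sketch.

```python
from typing import Optional, Tuple

EndpointPair = Tuple[Optional[str], Optional[str]]


def _clean(service: Optional[str]) -> Optional[str]:
    """Normalize a service entry: None and blank strings both count as empty."""
    if not isinstance(service, str):
        return None
    stripped = service.strip()
    return stripped or None


def check_endpoint_pair(name: str, pair: EndpointPair) -> EndpointPair:
    """Hypothetical stand-in for the documented rule on a (gRPC, HTTP) pair."""
    grpc_service, http_service = (_clean(s) for s in pair)
    if grpc_service is None and http_service is None:
        # Mirrors the documented failure mode: both services empty.
        raise ValueError(f"Both gRPC and HTTP services are empty for {name}.")
    return (grpc_service, http_service)
```

For example, `check_endpoint_pair("audio_endpoints", ("audio:50051", None))` passes, while `(None, "")` raises `ValueError`.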
- pydantic model nv_ingest_api.internal.schemas.extract.extract_audio_schema.AudioExtractorSchema[source]#
Bases:
BaseModel
Configuration schema for the audio extractor settings.
- Parameters:
max_queue_size (int, default=1) – The maximum number of items allowed in the processing queue.
n_workers (int, default=16) – The number of worker threads to use for processing.
raise_on_failure (bool, default=False) – A flag indicating whether to raise an exception on processing failure.
audio_extraction_config (Optional[AudioConfigSchema], default=None) – Configuration schema for the audio extraction stage.
Show JSON schema
{ "title": "AudioExtractorSchema", "description": "Configuration schema for the PDF extractor settings.\n\nParameters\n----------\nmax_queue_size : int, default=1\n The maximum number of items allowed in the processing queue.\n\nn_workers : int, default=16\n The number of worker threads to use for processing.\n\nraise_on_failure : bool, default=False\n A flag indicating whether to raise an exception on processing failure.\n\naudio_extraction_config: Optional[AudioConfigSchema], default=None\n Configuration schema for the audio extraction stage.", "type": "object", "properties": { "max_queue_size": { "default": 1, "title": "Max Queue Size", "type": "integer" }, "n_workers": { "default": 16, "title": "N Workers", "type": "integer" }, "raise_on_failure": { "default": false, "title": "Raise On Failure", "type": "boolean" }, "audio_extraction_config": { "anyOf": [ { "$ref": "#/$defs/AudioConfigSchema" }, { "type": "null" } ], "default": null } }, "$defs": { "AudioConfigSchema": { "additionalProperties": false, "description": "Configuration schema for audio extraction endpoints and options.\n\nParameters\n----------\nauth_token : Optional[str], default=None\n Authentication token required for secure services.\n\naudio_endpoints : Tuple[str, str]\n A tuple containing the gRPC and HTTP services for the audio_retriever endpoint.\n Either the gRPC or HTTP service can be empty, but not both.\n\nMethods\n-------\nvalidate_endpoints(values)\n Validates that at least one of the gRPC or HTTP services is provided for each endpoint.\n\nRaises\n------\nValueError\n If both gRPC and HTTP services are empty for any endpoint.\n\nConfig\n------\nextra : str\n Pydantic config option to forbid extra fields.", "properties": { "auth_token": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "title": "Auth Token" }, "audio_endpoints": { "default": [ null, null ], "maxItems": 2, "minItems": 2, "prefixItems": [ { "anyOf": [ { "type": "string" }, { "type": "null" } ] }, { 
"anyOf": [ { "type": "string" }, { "type": "null" } ] } ], "title": "Audio Endpoints", "type": "array" }, "audio_infer_protocol": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "title": "Audio Infer Protocol" }, "function_id": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "title": "Function Id" }, "use_ssl": { "anyOf": [ { "type": "boolean" }, { "type": "null" } ], "default": null, "title": "Use Ssl" }, "ssl_cert": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "title": "Ssl Cert" }, "segment_audio": { "anyOf": [ { "type": "boolean" }, { "type": "null" } ], "default": null, "title": "Segment Audio" } }, "title": "AudioConfigSchema", "type": "object" } }, "additionalProperties": false }
- Config:
extra: str = forbid
- Fields:
- field audio_extraction_config: AudioConfigSchema | None = None#
- field max_queue_size: int = 1#
- field n_workers: int = 16#
- field raise_on_failure: bool = False#
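To make the nesting concrete, here is a sketch of a plain dictionary shaped like the JSON schema above, with the `AudioConfigSchema` fields nested under `audio_extraction_config`. The endpoint address and the `"grpc"` protocol value are illustrative assumptions, not values taken from the library. Note that the two-element tuple is serialized as a two-item JSON array, which matches `minItems`/`maxItems` of 2 in the schema.

```python
import json

# Hypothetical configuration payload; key names follow the documented fields.
audio_extractor_config = {
    "max_queue_size": 1,
    "n_workers": 16,
    "raise_on_failure": False,
    "audio_extraction_config": {
        "auth_token": None,
        # (gRPC, HTTP): only the gRPC service is set here (made-up address).
        "audio_endpoints": ["audio:50051", None],
        "audio_infer_protocol": "grpc",  # assumed value for illustration
        "function_id": None,
        "use_ssl": None,
        "ssl_cert": None,
        "segment_audio": None,
    },
}

# Round-trip through JSON to show the payload is serializable as written.
payload = json.dumps(audio_extractor_config)
restored = json.loads(payload)
```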
nv_ingest_api.internal.schemas.extract.extract_chart_schema module#
- pydantic model nv_ingest_api.internal.schemas.extract.extract_chart_schema.ChartExtractorConfigSchema[source]#
Bases:
BaseModel
Configuration schema for chart extraction service endpoints and options.
- Parameters:
auth_token (Optional[str], default=None) – Authentication token required for secure services.
yolox_endpoints (Tuple[Optional[str], Optional[str]], default=(None, None)) – A tuple containing the gRPC and HTTP services for the yolox endpoint. Either the gRPC or HTTP service can be empty, but not both.
ocr_endpoints (Tuple[Optional[str], Optional[str]], default=(None, None)) – A tuple containing the gRPC and HTTP services for the ocr endpoint. Either the gRPC or HTTP service can be empty, but not both.
- validate_endpoints(values)[source]#
Validates that at least one of the gRPC or HTTP services is provided for each endpoint.
- Raises:
ValueError – If both gRPC and HTTP services are empty for any endpoint.
- Config:
extra (str) – Pydantic config option to forbid extra fields.
Show JSON schema
{ "title": "ChartExtractorConfigSchema", "description": "Configuration schema for chart extraction service endpoints and options.\n\nParameters\n----------\nauth_token : Optional[str], default=None\n Authentication token required for secure services.\n\nyolox_endpoints : Tuple[Optional[str], Optional[str]], default=(None, None)\n A tuple containing the gRPC and HTTP services for the yolox endpoint.\n Either the gRPC or HTTP service can be empty, but not both.\n\nocr_endpoints : Tuple[Optional[str], Optional[str]], default=(None, None)\n A tuple containing the gRPC and HTTP services for the ocr endpoint.\n Either the gRPC or HTTP service can be empty, but not both.\n\nMethods\n-------\nvalidate_endpoints(values)\n Validates that at least one of the gRPC or HTTP services is provided for each endpoint.\n\nRaises\n------\nValueError\n If both gRPC and HTTP services are empty for any endpoint.\n\nConfig\n------\nextra : str\n Pydantic config option to forbid extra fields.", "type": "object", "properties": { "auth_token": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "title": "Auth Token" }, "yolox_endpoints": { "default": [ null, null ], "maxItems": 2, "minItems": 2, "prefixItems": [ { "anyOf": [ { "type": "string" }, { "type": "null" } ] }, { "anyOf": [ { "type": "string" }, { "type": "null" } ] } ], "title": "Yolox Endpoints", "type": "array" }, "yolox_infer_protocol": { "default": "", "title": "Yolox Infer Protocol", "type": "string" }, "ocr_endpoints": { "default": [ null, null ], "maxItems": 2, "minItems": 2, "prefixItems": [ { "anyOf": [ { "type": "string" }, { "type": "null" } ] }, { "anyOf": [ { "type": "string" }, { "type": "null" } ] } ], "title": "Ocr Endpoints", "type": "array" }, "ocr_infer_protocol": { "default": "", "title": "Ocr Infer Protocol", "type": "string" }, "nim_batch_size": { "default": 2, "title": "Nim Batch Size", "type": "integer" }, "workers_per_progress_engine": { "default": 5, "title": "Workers Per Progress 
Engine", "type": "integer" } }, "additionalProperties": false }
- Config:
extra: str = forbid
- Fields:
- Validators:
validate_endpoints
»all fields
- field auth_token: str | None = None#
- Validated by:
validate_endpoints
- field nim_batch_size: int = 2#
- Validated by:
validate_endpoints
- field ocr_endpoints: Tuple[str | None, str | None] = (None, None)#
- Validated by:
validate_endpoints
- field ocr_infer_protocol: str = ''#
- Validated by:
validate_endpoints
- field workers_per_progress_engine: int = 5#
- Validated by:
validate_endpoints
- field yolox_endpoints: Tuple[str | None, str | None] = (None, None)#
- Validated by:
validate_endpoints
- field yolox_infer_protocol: str = ''#
- Validated by:
validate_endpoints
- validator validate_endpoints » all fields[source]#
Validates the gRPC and HTTP services for all endpoints.
Ensures that at least one service (either gRPC or HTTP) is provided for each endpoint in the configuration.
- Parameters:
values (dict) – Dictionary containing the values of the attributes for the class.
- Returns:
The validated dictionary of values.
- Return type:
dict
- Raises:
ValueError – If both gRPC and HTTP services are empty for any endpoint.
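Because this validator receives the whole `values` dictionary and checks every endpoint field, the per-endpoint rule can be sketched as a loop over the field names. The function below is a hypothetical standard-library stand-in, not the library's code; the field names come from the schema above, while the addresses in the usage note are invented.

```python
from typing import Any, Dict

# Endpoint fields documented for ChartExtractorConfigSchema.
ENDPOINT_FIELDS = ("yolox_endpoints", "ocr_endpoints")


def validate_endpoint_values(values: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical sketch: require a gRPC or HTTP service for each endpoint."""
    for field in ENDPOINT_FIELDS:
        # A missing field falls back to the documented default of (None, None).
        grpc_service, http_service = values.get(field, (None, None))
        if not grpc_service and not http_service:
            raise ValueError(f"Both gRPC and HTTP services are empty for {field}.")
    # Like the documented validator, return the dictionary of values.
    return values
```

With `{"yolox_endpoints": ("yolox:8001", None), "ocr_endpoints": (None, "http://ocr:8000")}` the function returns the dictionary unchanged; dropping `ocr_endpoints` makes it raise `ValueError`.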
- pydantic model nv_ingest_api.internal.schemas.extract.extract_chart_schema.ChartExtractorSchema[source]#
Bases:
BaseModel
Configuration schema for chart extraction processing settings.
- Parameters:
max_queue_size (int, default=1) – The maximum number of items allowed in the processing queue.
n_workers (int, default=2) – The number of worker threads to use for processing.
raise_on_failure (bool, default=False) – A flag indicating whether to raise an exception if a failure occurs during chart extraction.
endpoint_config (Optional[ChartExtractorConfigSchema], default=None) – Configuration for the chart extraction stage, including yolox and ocr service endpoints.
Show JSON schema
{ "title": "ChartExtractorSchema", "description": "Configuration schema for chart extraction processing settings.\n\nParameters\n----------\nmax_queue_size : int, default=1\n The maximum number of items allowed in the processing queue.\n\nn_workers : int, default=2\n The number of worker threads to use for processing.\n\nraise_on_failure : bool, default=False\n A flag indicating whether to raise an exception if a failure occurs during chart extraction.\n\nextraction_config: Optional[ChartExtractorConfigSchema], default=None\n Configuration for the chart extraction stage, including yolox and ocr service endpoints.", "type": "object", "properties": { "max_queue_size": { "default": 1, "title": "Max Queue Size", "type": "integer" }, "n_workers": { "default": 2, "title": "N Workers", "type": "integer" }, "raise_on_failure": { "default": false, "title": "Raise On Failure", "type": "boolean" }, "endpoint_config": { "anyOf": [ { "$ref": "#/$defs/ChartExtractorConfigSchema" }, { "type": "null" } ], "default": null } }, "$defs": { "ChartExtractorConfigSchema": { "additionalProperties": false, "description": "Configuration schema for chart extraction service endpoints and options.\n\nParameters\n----------\nauth_token : Optional[str], default=None\n Authentication token required for secure services.\n\nyolox_endpoints : Tuple[Optional[str], Optional[str]], default=(None, None)\n A tuple containing the gRPC and HTTP services for the yolox endpoint.\n Either the gRPC or HTTP service can be empty, but not both.\n\nocr_endpoints : Tuple[Optional[str], Optional[str]], default=(None, None)\n A tuple containing the gRPC and HTTP services for the ocr endpoint.\n Either the gRPC or HTTP service can be empty, but not both.\n\nMethods\n-------\nvalidate_endpoints(values)\n Validates that at least one of the gRPC or HTTP services is provided for each endpoint.\n\nRaises\n------\nValueError\n If both gRPC and HTTP services are empty for any endpoint.\n\nConfig\n------\nextra : str\n 
Pydantic config option to forbid extra fields.", "properties": { "auth_token": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "title": "Auth Token" }, "yolox_endpoints": { "default": [ null, null ], "maxItems": 2, "minItems": 2, "prefixItems": [ { "anyOf": [ { "type": "string" }, { "type": "null" } ] }, { "anyOf": [ { "type": "string" }, { "type": "null" } ] } ], "title": "Yolox Endpoints", "type": "array" }, "yolox_infer_protocol": { "default": "", "title": "Yolox Infer Protocol", "type": "string" }, "ocr_endpoints": { "default": [ null, null ], "maxItems": 2, "minItems": 2, "prefixItems": [ { "anyOf": [ { "type": "string" }, { "type": "null" } ] }, { "anyOf": [ { "type": "string" }, { "type": "null" } ] } ], "title": "Ocr Endpoints", "type": "array" }, "ocr_infer_protocol": { "default": "", "title": "Ocr Infer Protocol", "type": "string" }, "nim_batch_size": { "default": 2, "title": "Nim Batch Size", "type": "integer" }, "workers_per_progress_engine": { "default": 5, "title": "Workers Per Progress Engine", "type": "integer" } }, "title": "ChartExtractorConfigSchema", "type": "object" } }, "additionalProperties": false }
- Config:
extra: str = forbid
- Fields:
- Validators:
check_positive
»max_queue_size, n_workers
- field endpoint_config: ChartExtractorConfigSchema | None = None#
- field max_queue_size: int = 1#
- Validated by:
check_positive
- field n_workers: int = 2#
- Validated by:
check_positive
- field raise_on_failure: bool = False#
- validator check_positive » max_queue_size, n_workers[source]#
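The page does not describe what `check_positive` does beyond its name and the two fields it applies to. Assuming it rejects values that are not strictly greater than zero, the check might look roughly like the sketch below; `require_positive` is a hypothetical helper and the exact behavior and error message of the real validator may differ.

```python
def require_positive(field_name: str, value: int) -> int:
    """Hypothetical stand-in: accept only integers strictly greater than zero."""
    if value <= 0:
        raise ValueError(f"{field_name} must be greater than 0, got {value}")
    return value


# Apply the check to the two fields the validator is attached to,
# using the documented defaults.
settings = {"max_queue_size": 1, "n_workers": 2}
checked = {name: require_positive(name, value) for name, value in settings.items()}
```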
nv_ingest_api.internal.schemas.extract.extract_docx_schema module#
- pydantic model nv_ingest_api.internal.schemas.extract.extract_docx_schema.DocxConfigSchema[source]#
Bases:
BaseModel
Configuration schema for docx extraction endpoints and options.
- Parameters:
auth_token (Optional[str], default=None) – Authentication token required for secure services.
yolox_endpoints (Tuple[Optional[str], Optional[str]], default=(None, None)) – A tuple containing the gRPC and HTTP services for the yolox endpoint. Either the gRPC or HTTP service can be empty, but not both.
- validate_endpoints(values)[source]#
Validates that at least one of the gRPC or HTTP services is provided for each endpoint.
- Raises:
ValueError – If both gRPC and HTTP services are empty for any endpoint.
- Config:
extra (str) – Pydantic config option to forbid extra fields.
Show JSON schema
{ "title": "DocxConfigSchema", "description": "Configuration schema for docx extraction endpoints and options.\n\nParameters\n----------\nauth_token : Optional[str], default=None\n Authentication token required for secure services.\n\nyolox_endpoints : Tuple[str, str]\n A tuple containing the gRPC and HTTP services for the yolox endpoint.\n Either the gRPC or HTTP service can be empty, but not both.\n\nMethods\n-------\nvalidate_endpoints(values)\n Validates that at least one of the gRPC or HTTP services is provided for each endpoint.\n\nRaises\n------\nValueError\n If both gRPC and HTTP services are empty for any endpoint.\n\nConfig\n------\nextra : str\n Pydantic config option to forbid extra fields.", "type": "object", "properties": { "auth_token": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "title": "Auth Token" }, "yolox_endpoints": { "default": [ null, null ], "maxItems": 2, "minItems": 2, "prefixItems": [ { "anyOf": [ { "type": "string" }, { "type": "null" } ] }, { "anyOf": [ { "type": "string" }, { "type": "null" } ] } ], "title": "Yolox Endpoints", "type": "array" }, "yolox_infer_protocol": { "default": "", "title": "Yolox Infer Protocol", "type": "string" } }, "additionalProperties": false }
- Config:
extra: str = forbid
- Fields:
- Validators:
validate_endpoints
»all fields
- field auth_token: str | None = None#
- Validated by:
validate_endpoints
- field yolox_endpoints: Tuple[str | None, str | None] = (None, None)#
- Validated by:
validate_endpoints
- field yolox_infer_protocol: str = ''#
- Validated by:
validate_endpoints
- validator validate_endpoints » all fields[source]#
Validates the gRPC and HTTP services for all endpoints.
- Parameters:
values (dict) – Dictionary containing the values of the attributes for the class.
- Returns:
The validated dictionary of values.
- Return type:
dict
- Raises:
ValueError – If both gRPC and HTTP services are empty for any endpoint.
- pydantic model nv_ingest_api.internal.schemas.extract.extract_docx_schema.DocxExtractorSchema[source]#
Bases:
BaseModel
Configuration schema for the docx extractor settings.
- Parameters:
max_queue_size (int, default=1) – The maximum number of items allowed in the processing queue.
n_workers (int, default=16) – The number of worker threads to use for processing.
raise_on_failure (bool, default=False) – A flag indicating whether to raise an exception on processing failure.
docx_extraction_config (Optional[DocxConfigSchema], default=None) – Configuration schema for the docx extraction stage.
Show JSON schema
{ "title": "DocxExtractorSchema", "description": "Configuration schema for the PDF extractor settings.\n\nParameters\n----------\nmax_queue_size : int, default=1\n The maximum number of items allowed in the processing queue.\n\nn_workers : int, default=16\n The number of worker threads to use for processing.\n\nraise_on_failure : bool, default=False\n A flag indicating whether to raise an exception on processing failure.\n\nimage_extraction_config: Optional[ImageConfigSchema], default=None\n Configuration schema for the image extraction stage.", "type": "object", "properties": { "max_queue_size": { "default": 1, "title": "Max Queue Size", "type": "integer" }, "n_workers": { "default": 16, "title": "N Workers", "type": "integer" }, "raise_on_failure": { "default": false, "title": "Raise On Failure", "type": "boolean" }, "docx_extraction_config": { "anyOf": [ { "$ref": "#/$defs/DocxConfigSchema" }, { "type": "null" } ], "default": null } }, "$defs": { "DocxConfigSchema": { "additionalProperties": false, "description": "Configuration schema for docx extraction endpoints and options.\n\nParameters\n----------\nauth_token : Optional[str], default=None\n Authentication token required for secure services.\n\nyolox_endpoints : Tuple[str, str]\n A tuple containing the gRPC and HTTP services for the yolox endpoint.\n Either the gRPC or HTTP service can be empty, but not both.\n\nMethods\n-------\nvalidate_endpoints(values)\n Validates that at least one of the gRPC or HTTP services is provided for each endpoint.\n\nRaises\n------\nValueError\n If both gRPC and HTTP services are empty for any endpoint.\n\nConfig\n------\nextra : str\n Pydantic config option to forbid extra fields.", "properties": { "auth_token": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "title": "Auth Token" }, "yolox_endpoints": { "default": [ null, null ], "maxItems": 2, "minItems": 2, "prefixItems": [ { "anyOf": [ { "type": "string" }, { "type": "null" } ] }, { "anyOf": [ { 
"type": "string" }, { "type": "null" } ] } ], "title": "Yolox Endpoints", "type": "array" }, "yolox_infer_protocol": { "default": "", "title": "Yolox Infer Protocol", "type": "string" } }, "title": "DocxConfigSchema", "type": "object" } }, "additionalProperties": false }
- Config:
extra: str = forbid
- Fields:
- field docx_extraction_config: DocxConfigSchema | None = None#
- field max_queue_size: int = 1#
- field n_workers: int = 16#
- field raise_on_failure: bool = False#
nv_ingest_api.internal.schemas.extract.extract_html_schema module#
- pydantic model nv_ingest_api.internal.schemas.extract.extract_html_schema.HtmlExtractorSchema[source]#
Bases:
BaseModel
Configuration schema for the Html extractor settings.
- Parameters:
max_queue_size (int, default=1) – The maximum number of items allowed in the processing queue.
n_workers (int, default=16) – The number of worker threads to use for processing.
raise_on_failure (bool, default=False) – A flag indicating whether to raise an exception on processing failure.
Show JSON schema
{ "title": "HtmlExtractorSchema", "description": "Configuration schema for the Html extractor settings.\n\nParameters\n----------\nmax_queue_size : int, default=1\n The maximum number of items allowed in the processing queue.\n\nn_workers : int, default=16\n The number of worker threads to use for processing.\n\nraise_on_failure : bool, default=False\n A flag indicating whether to raise an exception on processing failure.", "type": "object", "properties": { "max_queue_size": { "default": 1, "title": "Max Queue Size", "type": "integer" }, "n_workers": { "default": 16, "title": "N Workers", "type": "integer" }, "raise_on_failure": { "default": false, "title": "Raise On Failure", "type": "boolean" } }, "additionalProperties": false }
- Config:
extra: str = forbid
- Fields:
- field max_queue_size: int = 1#
- field n_workers: int = 16#
- field raise_on_failure: bool = False#
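The JSON schema carries both the field defaults and the `additionalProperties: false` flag that corresponds to `extra = forbid`. As a sketch of how that information can be read back with the standard library only, the snippet below parses a trimmed copy of the schema above (the `description` entry is omitted for brevity). `defaults_from_schema` and `reject_unknown_keys` are hypothetical helpers, not part of `nv_ingest_api`.

```python
import json

# Trimmed copy of the HtmlExtractorSchema JSON schema shown above.
SCHEMA_JSON = """
{
  "title": "HtmlExtractorSchema",
  "type": "object",
  "properties": {
    "max_queue_size": {"default": 1, "title": "Max Queue Size", "type": "integer"},
    "n_workers": {"default": 16, "title": "N Workers", "type": "integer"},
    "raise_on_failure": {"default": false, "title": "Raise On Failure", "type": "boolean"}
  },
  "additionalProperties": false
}
"""


def defaults_from_schema(schema: dict) -> dict:
    """Collect the declared default of every property that has one."""
    return {
        name: spec["default"]
        for name, spec in schema["properties"].items()
        if "default" in spec
    }


def reject_unknown_keys(schema: dict, config: dict) -> None:
    """Raise if config has keys the schema does not declare (extra = forbid)."""
    if schema.get("additionalProperties") is False:
        unknown = sorted(set(config) - set(schema["properties"]))
        if unknown:
            raise ValueError(f"Unexpected fields: {unknown}")


schema = json.loads(SCHEMA_JSON)
defaults = defaults_from_schema(schema)
```

Here `defaults` holds the three documented default values, and passing a dictionary with an undeclared key to `reject_unknown_keys` raises `ValueError`.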
nv_ingest_api.internal.schemas.extract.extract_image_schema module#
- pydantic model nv_ingest_api.internal.schemas.extract.extract_image_schema.ImageConfigSchema[source]#
Bases:
BaseModel
Configuration schema for image extraction endpoints and options.
- Parameters:
auth_token (Optional[str], default=None) – Authentication token required for secure services.
yolox_endpoints (Tuple[Optional[str], Optional[str]], default=(None, None)) – A tuple containing the gRPC and HTTP services for the yolox endpoint. Either the gRPC or HTTP service can be empty, but not both.
- validate_endpoints(values)[source]#
Validates that at least one of the gRPC or HTTP services is provided for each endpoint.
- Raises:
ValueError – If both gRPC and HTTP services are empty for any endpoint.
- Config:
extra (str) – Pydantic config option to forbid extra fields.
Show JSON schema
{ "title": "ImageConfigSchema", "description": "Configuration schema for image extraction endpoints and options.\n\nParameters\n----------\nauth_token : Optional[str], default=None\n Authentication token required for secure services.\n\nyolox_endpoints : Tuple[str, str]\n A tuple containing the gRPC and HTTP services for the yolox endpoint.\n Either the gRPC or HTTP service can be empty, but not both.\n\nMethods\n-------\nvalidate_endpoints(values)\n Validates that at least one of the gRPC or HTTP services is provided for each endpoint.\n\nRaises\n------\nValueError\n If both gRPC and HTTP services are empty for any endpoint.\n\nConfig\n------\nextra : str\n Pydantic config option to forbid extra fields.", "type": "object", "properties": { "auth_token": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "title": "Auth Token" }, "yolox_endpoints": { "default": [ null, null ], "maxItems": 2, "minItems": 2, "prefixItems": [ { "anyOf": [ { "type": "string" }, { "type": "null" } ] }, { "anyOf": [ { "type": "string" }, { "type": "null" } ] } ], "title": "Yolox Endpoints", "type": "array" }, "yolox_infer_protocol": { "default": "", "title": "Yolox Infer Protocol", "type": "string" } }, "additionalProperties": false }
- Config:
extra: str = forbid
- Fields:
- Validators:
validate_endpoints
»all fields
- field auth_token: str | None = None#
- Validated by:
validate_endpoints
- field yolox_endpoints: Tuple[str | None, str | None] = (None, None)#
- Validated by:
validate_endpoints
- field yolox_infer_protocol: str = ''#
- Validated by:
validate_endpoints
- validator validate_endpoints » all fields[source]#
Validates the gRPC and HTTP services for all endpoints.
- Parameters:
values (dict) – Dictionary containing the values of the attributes for the class.
- Returns:
The validated dictionary of values.
- Return type:
dict
- Raises:
ValueError – If both gRPC and HTTP services are empty for any endpoint.
- pydantic model nv_ingest_api.internal.schemas.extract.extract_image_schema.ImageExtractorSchema[source]#
Bases:
BaseModel
Configuration schema for the image extractor settings.
- Parameters:
max_queue_size (int, default=1) – The maximum number of items allowed in the processing queue.
n_workers (int, default=16) – The number of worker threads to use for processing.
raise_on_failure (bool, default=False) – A flag indicating whether to raise an exception on processing failure.
image_extraction_config (Optional[ImageConfigSchema], default=None) – Configuration schema for the image extraction stage.
Show JSON schema
{ "title": "ImageExtractorSchema", "description": "Configuration schema for the PDF extractor settings.\n\nParameters\n----------\nmax_queue_size : int, default=1\n The maximum number of items allowed in the processing queue.\n\nn_workers : int, default=16\n The number of worker threads to use for processing.\n\nraise_on_failure : bool, default=False\n A flag indicating whether to raise an exception on processing failure.\n\nimage_extraction_config: Optional[ImageConfigSchema], default=None\n Configuration schema for the image extraction stage.", "type": "object", "properties": { "max_queue_size": { "default": 1, "title": "Max Queue Size", "type": "integer" }, "n_workers": { "default": 16, "title": "N Workers", "type": "integer" }, "raise_on_failure": { "default": false, "title": "Raise On Failure", "type": "boolean" }, "image_extraction_config": { "anyOf": [ { "$ref": "#/$defs/ImageConfigSchema" }, { "type": "null" } ], "default": null } }, "$defs": { "ImageConfigSchema": { "additionalProperties": false, "description": "Configuration schema for image extraction endpoints and options.\n\nParameters\n----------\nauth_token : Optional[str], default=None\n Authentication token required for secure services.\n\nyolox_endpoints : Tuple[str, str]\n A tuple containing the gRPC and HTTP services for the yolox endpoint.\n Either the gRPC or HTTP service can be empty, but not both.\n\nMethods\n-------\nvalidate_endpoints(values)\n Validates that at least one of the gRPC or HTTP services is provided for each endpoint.\n\nRaises\n------\nValueError\n If both gRPC and HTTP services are empty for any endpoint.\n\nConfig\n------\nextra : str\n Pydantic config option to forbid extra fields.", "properties": { "auth_token": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "title": "Auth Token" }, "yolox_endpoints": { "default": [ null, null ], "maxItems": 2, "minItems": 2, "prefixItems": [ { "anyOf": [ { "type": "string" }, { "type": "null" } ] }, { "anyOf": 
[ { "type": "string" }, { "type": "null" } ] } ], "title": "Yolox Endpoints", "type": "array" }, "yolox_infer_protocol": { "default": "", "title": "Yolox Infer Protocol", "type": "string" } }, "title": "ImageConfigSchema", "type": "object" } }, "additionalProperties": false }
- Config:
extra: str = forbid
- Fields:
- field image_extraction_config: ImageConfigSchema | None = None#
- field max_queue_size: int = 1#
- field n_workers: int = 16#
- field raise_on_failure: bool = False#
nv_ingest_api.internal.schemas.extract.extract_infographic_schema module#
- pydantic model nv_ingest_api.internal.schemas.extract.extract_infographic_schema.InfographicExtractorConfigSchema[source]#
Bases:
BaseModel
Configuration schema for infographic extraction service endpoints and options.
- Parameters:
auth_token (Optional[str], default=None) – Authentication token required for secure services.
ocr_endpoints (Tuple[Optional[str], Optional[str]], default=(None, None)) – A tuple containing the gRPC and HTTP services for the ocr endpoint. Either the gRPC or HTTP service can be empty, but not both.
- validate_endpoints(values)[source]#
Validates that at least one of the gRPC or HTTP services is provided for each endpoint.
- Raises:
ValueError – If both gRPC and HTTP services are empty for any endpoint.
- Config:
extra (str) – Pydantic config option to forbid extra fields.
Show JSON schema
{ "title": "InfographicExtractorConfigSchema", "description": "Configuration schema for infographic extraction service endpoints and options.\n\nParameters\n----------\nauth_token : Optional[str], default=None\n Authentication token required for secure services.\n\nocr_endpoints : Tuple[Optional[str], Optional[str]], default=(None, None)\n A tuple containing the gRPC and HTTP services for the ocr endpoint.\n Either the gRPC or HTTP service can be empty, but not both.\n\nMethods\n-------\nvalidate_endpoints(values)\n Validates that at least one of the gRPC or HTTP services is provided for each endpoint.\n\nRaises\n------\nValueError\n If both gRPC and HTTP services are empty for any endpoint.\n\nConfig\n------\nextra : str\n Pydantic config option to forbid extra fields.", "type": "object", "properties": { "auth_token": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "title": "Auth Token" }, "ocr_endpoints": { "default": [ null, null ], "maxItems": 2, "minItems": 2, "prefixItems": [ { "anyOf": [ { "type": "string" }, { "type": "null" } ] }, { "anyOf": [ { "type": "string" }, { "type": "null" } ] } ], "title": "Ocr Endpoints", "type": "array" }, "ocr_infer_protocol": { "default": "", "title": "Ocr Infer Protocol", "type": "string" }, "nim_batch_size": { "default": 2, "title": "Nim Batch Size", "type": "integer" }, "workers_per_progress_engine": { "default": 5, "title": "Workers Per Progress Engine", "type": "integer" } }, "additionalProperties": false }
- Config:
extra: str = forbid
- Fields:
- Validators:
validate_endpoints
»all fields
- field auth_token: str | None = None#
- Validated by:
validate_endpoints
- field nim_batch_size: int = 2#
- Validated by:
validate_endpoints
- field ocr_endpoints: Tuple[str | None, str | None] = (None, None)#
- Validated by:
validate_endpoints
- field ocr_infer_protocol: str = ''#
- Validated by:
validate_endpoints
- field workers_per_progress_engine: int = 5#
- Validated by:
validate_endpoints
- validator validate_endpoints » all fields[source]#
Validates the gRPC and HTTP services for all endpoints.
Ensures that at least one service (either gRPC or HTTP) is provided for each endpoint in the configuration.
- Parameters:
values (dict) – Dictionary containing the values of the attributes for the class.
- Returns:
The validated dictionary of values.
- Return type:
dict
- Raises:
ValueError – If both gRPC and HTTP services are empty for any endpoint.
- pydantic model nv_ingest_api.internal.schemas.extract.extract_infographic_schema.InfographicExtractorSchema[source]#
Bases:
BaseModel
Configuration schema for infographic extraction processing settings.
- Parameters:
max_queue_size (int, default=1) – The maximum number of items allowed in the processing queue.
n_workers (int, default=2) – The number of worker threads to use for processing.
raise_on_failure (bool, default=False) – A flag indicating whether to raise an exception if a failure occurs during infographic extraction.
endpoint_config (Optional[InfographicExtractorConfigSchema], default=None) – Configuration for the infographic extraction stage, including the ocr service endpoints.
Show JSON schema
{ "title": "InfographicExtractorSchema", "description": "Configuration schema for infographic extraction processing settings.\n\nParameters\n----------\nmax_queue_size : int, default=1\n The maximum number of items allowed in the processing queue.\n\nn_workers : int, default=2\n The number of worker threads to use for processing.\n\nraise_on_failure : bool, default=False\n A flag indicating whether to raise an exception if a failure occurs during infographic extraction.\n\nstage_config : Optional[InfographicExtractorConfigSchema], default=None\n Configuration for the infographic extraction stage, including yolox and ocr service endpoints.", "type": "object", "properties": { "max_queue_size": { "default": 1, "title": "Max Queue Size", "type": "integer" }, "n_workers": { "default": 2, "title": "N Workers", "type": "integer" }, "raise_on_failure": { "default": false, "title": "Raise On Failure", "type": "boolean" }, "endpoint_config": { "anyOf": [ { "$ref": "#/$defs/InfographicExtractorConfigSchema" }, { "type": "null" } ], "default": null } }, "$defs": { "InfographicExtractorConfigSchema": { "additionalProperties": false, "description": "Configuration schema for infographic extraction service endpoints and options.\n\nParameters\n----------\nauth_token : Optional[str], default=None\n Authentication token required for secure services.\n\nocr_endpoints : Tuple[Optional[str], Optional[str]], default=(None, None)\n A tuple containing the gRPC and HTTP services for the ocr endpoint.\n Either the gRPC or HTTP service can be empty, but not both.\n\nMethods\n-------\nvalidate_endpoints(values)\n Validates that at least one of the gRPC or HTTP services is provided for each endpoint.\n\nRaises\n------\nValueError\n If both gRPC and HTTP services are empty for any endpoint.\n\nConfig\n------\nextra : str\n Pydantic config option to forbid extra fields.", "properties": { "auth_token": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "title": "Auth Token" 
}, "ocr_endpoints": { "default": [ null, null ], "maxItems": 2, "minItems": 2, "prefixItems": [ { "anyOf": [ { "type": "string" }, { "type": "null" } ] }, { "anyOf": [ { "type": "string" }, { "type": "null" } ] } ], "title": "Ocr Endpoints", "type": "array" }, "ocr_infer_protocol": { "default": "", "title": "Ocr Infer Protocol", "type": "string" }, "nim_batch_size": { "default": 2, "title": "Nim Batch Size", "type": "integer" }, "workers_per_progress_engine": { "default": 5, "title": "Workers Per Progress Engine", "type": "integer" } }, "title": "InfographicExtractorConfigSchema", "type": "object" } }, "additionalProperties": false }
- Config:
extra: str = forbid
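- Fields:
endpoint_config (InfographicExtractorConfigSchema | None)
max_queue_size (int)
n_workers (int)
raise_on_failure (bool)
- Validators:
check_positive » max_queue_size, n_workers
- field endpoint_config: InfographicExtractorConfigSchema | None = None#
- field max_queue_size: int = 1#
- Validated by:
check_positive
- field n_workers: int = 2#
- Validated by:
check_positive
- field raise_on_failure: bool = False#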
- validator check_positive » max_queue_size, n_workers[source]#
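The page does not document what `check_positive` enforces beyond its name and the two fields it applies to. The sketch below assumes the rule is "strictly greater than zero"; the function body and error message are illustrative and are not taken from the library.

```python
def check_positive(value: int, field_name: str) -> int:
    """Return the value unchanged if it is positive, otherwise raise ValueError."""
    # Assumed rule: zero and negative values are rejected.
    if value <= 0:
        raise ValueError(f"{field_name} must be greater than 0, got {value}")
    return value


# The documented defaults pass the assumed check.
settings = {"max_queue_size": 1, "n_workers": 2}
validated = {name: check_positive(value, name) for name, value in settings.items()}

# A zero worker count is rejected under the assumed rule.
try:
    check_positive(0, "n_workers")
    zero_rejected = False
except ValueError:
    zero_rejected = True
```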
nv_ingest_api.internal.schemas.extract.extract_pdf_schema module#
- pydantic model nv_ingest_api.internal.schemas.extract.extract_pdf_schema.NemoRetrieverParseConfigSchema[source]#
Bases:
BaseModel
Configuration schema for NemoRetrieverParse endpoints and options.
- Parameters:
auth_token (Optional[str], default=None) – Authentication token required for secure services.
nemoretriever_parse_endpoints (Tuple[Optional[str], Optional[str]], default=(None, None)) – A tuple containing the gRPC and HTTP services for the nemoretriever_parse endpoint. Either the gRPC or HTTP service can be empty, but not both.
- validate_endpoints(values)[source]#
Validates that at least one of the gRPC or HTTP services is provided for each endpoint.
- Raises:
ValueError – If both gRPC and HTTP services are empty for any endpoint.
- Config:
extra (str) – Pydantic config option to forbid extra fields.
Show JSON schema
{ "title": "NemoRetrieverParseConfigSchema", "description": "Configuration schema for NemoRetrieverParse endpoints and options.\n\nParameters\n----------\nauth_token : Optional[str], default=None\n Authentication token required for secure services.\n\nnemoretriever_parse_endpoints : Tuple[str, str]\n A tuple containing the gRPC and HTTP services for the nemoretriever_parse endpoint.\n Either the gRPC or HTTP service can be empty, but not both.\n\nMethods\n-------\nvalidate_endpoints(values)\n Validates that at least one of the gRPC or HTTP services is provided for each endpoint.\n\nRaises\n------\nValueError\n If both gRPC and HTTP services are empty for any endpoint.\n\nConfig\n------\nextra : str\n Pydantic config option to forbid extra fields.", "type": "object", "properties": { "auth_token": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "title": "Auth Token" }, "yolox_endpoints": { "default": [ null, null ], "maxItems": 2, "minItems": 2, "prefixItems": [ { "anyOf": [ { "type": "string" }, { "type": "null" } ] }, { "anyOf": [ { "type": "string" }, { "type": "null" } ] } ], "title": "Yolox Endpoints", "type": "array" }, "yolox_infer_protocol": { "default": "", "title": "Yolox Infer Protocol", "type": "string" }, "nemoretriever_parse_endpoints": { "default": [ null, null ], "maxItems": 2, "minItems": 2, "prefixItems": [ { "anyOf": [ { "type": "string" }, { "type": "null" } ] }, { "anyOf": [ { "type": "string" }, { "type": "null" } ] } ], "title": "Nemoretriever Parse Endpoints", "type": "array" }, "nemoretriever_parse_infer_protocol": { "default": "", "title": "Nemoretriever Parse Infer Protocol", "type": "string" }, "nemoretriever_parse_model_name": { "default": "nvidia/nemoretriever-parse", "title": "Nemoretriever Parse Model Name", "type": "string" }, "timeout": { "default": 300.0, "title": "Timeout", "type": "number" }, "workers_per_progress_engine": { "default": 5, "title": "Workers Per Progress Engine", "type": "integer" } }, 
"additionalProperties": false }
- Config:
extra: str = forbid
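- Fields:
auth_token (str | None)
nemoretriever_parse_endpoints (Tuple[str | None, str | None])
nemoretriever_parse_infer_protocol (str)
nemoretriever_parse_model_name (str)
timeout (float)
workers_per_progress_engine (int)
yolox_endpoints (Tuple[str | None, str | None])
yolox_infer_protocol (str)
- Validators:
validate_endpoints » all fields
- field auth_token: str | None = None#
- Validated by:
validate_endpoints
- field nemoretriever_parse_endpoints: Tuple[str | None, str | None] = (None, None)#
- Validated by:
validate_endpoints
- field nemoretriever_parse_infer_protocol: str = ''#
- Validated by:
validate_endpoints
- field nemoretriever_parse_model_name: str = 'nvidia/nemoretriever-parse'#
- Validated by:
validate_endpoints
- field timeout: float = 300.0#
- Validated by:
validate_endpoints
- field workers_per_progress_engine: int = 5#
- Validated by:
validate_endpoints
- field yolox_endpoints: Tuple[str | None, str | None] = (None, None)#
- Validated by:
validate_endpoints
- field yolox_infer_protocol: str = ''#
- Validated by:
validate_endpoints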
- validator validate_endpoints » all fields[source]#
Validates the gRPC and HTTP services for all endpoints.
- Parameters:
values (dict) – Dictionary containing the values of the attributes for the class.
- Returns:
The validated dictionary of values.
- Return type:
dict
- Raises:
ValueError – If both gRPC and HTTP services are empty for any endpoint.
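Because the model sets `extra = forbid` (shown as `"additionalProperties": false` in the JSON schema), a misspelled key in a configuration payload is an error rather than being silently ignored. The sketch below reproduces that idea with the standard library only. The field names come from the schema above, while `find_extra_keys` and the sample payloads are hypothetical.

```python
# Field names copied from the NemoRetrieverParseConfigSchema JSON schema.
ALLOWED_FIELDS = frozenset(
    {
        "auth_token",
        "yolox_endpoints",
        "yolox_infer_protocol",
        "nemoretriever_parse_endpoints",
        "nemoretriever_parse_infer_protocol",
        "nemoretriever_parse_model_name",
        "timeout",
        "workers_per_progress_engine",
    }
)


def find_extra_keys(payload: dict) -> list:
    """Return the sorted keys that a forbid-extra model would not accept."""
    return sorted(set(payload) - ALLOWED_FIELDS)


# Hypothetical payloads: the second one misspells "timeout".
good = {"nemoretriever_parse_endpoints": [None, "http://parse"], "timeout": 60.0}
bad = {"nemoretriever_parse_endpoints": [None, "http://parse"], "time_out": 60.0}

extra_in_good = find_extra_keys(good)
extra_in_bad = find_extra_keys(bad)
```

Running a check like this before constructing the model can give a clearer message about which key was unexpected.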
- pydantic model nv_ingest_api.internal.schemas.extract.extract_pdf_schema.PDFExtractorSchema[source]#
Bases:
BaseModel
Configuration schema for the PDF extractor settings.
- Parameters:
max_queue_size (int, default=1) – The maximum number of items allowed in the processing queue.
n_workers (int, default=16) – The number of worker threads to use for processing.
raise_on_failure (bool, default=False) – A flag indicating whether to raise an exception on processing failure.
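pdfium_config (Optional[PDFiumConfigSchema], default=None) – Configuration for the PDFium service endpoints.
nemoretriever_parse_config (Optional[NemoRetrieverParseConfigSchema], default=None) – Configuration for the NemoRetrieverParse service endpoints.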
Show JSON schema
{ "title": "PDFExtractorSchema", "description": "Configuration schema for the PDF extractor settings.\n\nParameters\n----------\nmax_queue_size : int, default=1\n The maximum number of items allowed in the processing queue.\n\nn_workers : int, default=16\n The number of worker threads to use for processing.\n\nraise_on_failure : bool, default=False\n A flag indicating whether to raise an exception on processing failure.\n\npdfium_config : Optional[PDFiumConfigSchema], default=None\n Configuration for the PDFium service endpoints.", "type": "object", "properties": { "max_queue_size": { "default": 1, "title": "Max Queue Size", "type": "integer" }, "n_workers": { "default": 16, "title": "N Workers", "type": "integer" }, "raise_on_failure": { "default": false, "title": "Raise On Failure", "type": "boolean" }, "pdfium_config": { "anyOf": [ { "$ref": "#/$defs/PDFiumConfigSchema" }, { "type": "null" } ], "default": null }, "nemoretriever_parse_config": { "anyOf": [ { "$ref": "#/$defs/NemoRetrieverParseConfigSchema" }, { "type": "null" } ], "default": null } }, "$defs": { "NemoRetrieverParseConfigSchema": { "additionalProperties": false, "description": "Configuration schema for NemoRetrieverParse endpoints and options.\n\nParameters\n----------\nauth_token : Optional[str], default=None\n Authentication token required for secure services.\n\nnemoretriever_parse_endpoints : Tuple[str, str]\n A tuple containing the gRPC and HTTP services for the nemoretriever_parse endpoint.\n Either the gRPC or HTTP service can be empty, but not both.\n\nMethods\n-------\nvalidate_endpoints(values)\n Validates that at least one of the gRPC or HTTP services is provided for each endpoint.\n\nRaises\n------\nValueError\n If both gRPC and HTTP services are empty for any endpoint.\n\nConfig\n------\nextra : str\n Pydantic config option to forbid extra fields.", "properties": { "auth_token": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "title": "Auth Token" }, 
"yolox_endpoints": { "default": [ null, null ], "maxItems": 2, "minItems": 2, "prefixItems": [ { "anyOf": [ { "type": "string" }, { "type": "null" } ] }, { "anyOf": [ { "type": "string" }, { "type": "null" } ] } ], "title": "Yolox Endpoints", "type": "array" }, "yolox_infer_protocol": { "default": "", "title": "Yolox Infer Protocol", "type": "string" }, "nemoretriever_parse_endpoints": { "default": [ null, null ], "maxItems": 2, "minItems": 2, "prefixItems": [ { "anyOf": [ { "type": "string" }, { "type": "null" } ] }, { "anyOf": [ { "type": "string" }, { "type": "null" } ] } ], "title": "Nemoretriever Parse Endpoints", "type": "array" }, "nemoretriever_parse_infer_protocol": { "default": "", "title": "Nemoretriever Parse Infer Protocol", "type": "string" }, "nemoretriever_parse_model_name": { "default": "nvidia/nemoretriever-parse", "title": "Nemoretriever Parse Model Name", "type": "string" }, "timeout": { "default": 300.0, "title": "Timeout", "type": "number" }, "workers_per_progress_engine": { "default": 5, "title": "Workers Per Progress Engine", "type": "integer" } }, "title": "NemoRetrieverParseConfigSchema", "type": "object" }, "PDFiumConfigSchema": { "additionalProperties": false, "description": "Configuration schema for PDFium endpoints and options.\n\nParameters\n----------\nauth_token : Optional[str], default=None\n Authentication token required for secure services.\n\nyolox_endpoints : Tuple[str, str]\n A tuple containing the gRPC and HTTP services for the yolox endpoint.\n Either the gRPC or HTTP service can be empty, but not both.\n\nMethods\n-------\nvalidate_endpoints(values)\n Validates that at least one of the gRPC or HTTP services is provided for each endpoint.\n\nRaises\n------\nValueError\n If both gRPC and HTTP services are empty for any endpoint.\n\nConfig\n------\nextra : str\n Pydantic config option to forbid extra fields.", "properties": { "auth_token": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "title": "Auth 
Token" }, "yolox_endpoints": { "default": [ null, null ], "maxItems": 2, "minItems": 2, "prefixItems": [ { "anyOf": [ { "type": "string" }, { "type": "null" } ] }, { "anyOf": [ { "type": "string" }, { "type": "null" } ] } ], "title": "Yolox Endpoints", "type": "array" }, "yolox_infer_protocol": { "default": "", "title": "Yolox Infer Protocol", "type": "string" }, "nim_batch_size": { "default": 4, "title": "Nim Batch Size", "type": "integer" }, "workers_per_progress_engine": { "default": 5, "title": "Workers Per Progress Engine", "type": "integer" } }, "title": "PDFiumConfigSchema", "type": "object" } }, "additionalProperties": false }
- Config:
extra: str = forbid
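- Fields:
max_queue_size (int)
n_workers (int)
nemoretriever_parse_config (NemoRetrieverParseConfigSchema | None)
pdfium_config (PDFiumConfigSchema | None)
raise_on_failure (bool)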
- field max_queue_size: int = 1#
- field n_workers: int = 16#
- field nemoretriever_parse_config: NemoRetrieverParseConfigSchema | None = None#
- field pdfium_config: PDFiumConfigSchema | None = None#
- field raise_on_failure: bool = False#
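To show how the documented defaults combine with a partial configuration, the sketch below merges a JSON payload with the default values listed for PDFExtractorSchema and its nested PDFiumConfigSchema. It only mimics default filling with plain dictionaries and performs no validation; `with_defaults` and the endpoint URL are hypothetical.

```python
import json

# Defaults copied from the field listings on this page.
PDF_EXTRACTOR_DEFAULTS = {
    "max_queue_size": 1,
    "n_workers": 16,
    "raise_on_failure": False,
    "pdfium_config": None,
    "nemoretriever_parse_config": None,
}
PDFIUM_DEFAULTS = {
    "auth_token": None,
    "yolox_endpoints": [None, None],
    "yolox_infer_protocol": "",
    "nim_batch_size": 4,
    "workers_per_progress_engine": 5,
}


def with_defaults(payload: dict) -> dict:
    """Fill in missing top-level and nested pdfium_config values."""
    merged = {**PDF_EXTRACTOR_DEFAULTS, **payload}
    if merged["pdfium_config"] is not None:
        merged["pdfium_config"] = {**PDFIUM_DEFAULTS, **merged["pdfium_config"]}
    return merged


# Hypothetical payload that overrides n_workers and sets only the HTTP yolox service.
raw = json.loads(
    '{"n_workers": 4, "pdfium_config": {"yolox_endpoints": [null, "http://yolox:8000/v1/infer"]}}'
)
resolved = with_defaults(raw)
```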
- pydantic model nv_ingest_api.internal.schemas.extract.extract_pdf_schema.PDFiumConfigSchema[source]#
Bases:
BaseModel
Configuration schema for PDFium endpoints and options.
- Parameters:
auth_token (Optional[str], default=None) – Authentication token required for secure services.
yolox_endpoints (Tuple[Optional[str], Optional[str]], default=(None, None)) – A tuple containing the gRPC and HTTP services for the yolox endpoint. Either the gRPC or HTTP service can be empty, but not both.
- validate_endpoints(values)[source]#
Validates that at least one of the gRPC or HTTP services is provided for each endpoint.
- Raises:
ValueError – If both gRPC and HTTP services are empty for any endpoint.
- Config:
extra (str) – Pydantic config option to forbid extra fields.
Show JSON schema
{ "title": "PDFiumConfigSchema", "description": "Configuration schema for PDFium endpoints and options.\n\nParameters\n----------\nauth_token : Optional[str], default=None\n Authentication token required for secure services.\n\nyolox_endpoints : Tuple[str, str]\n A tuple containing the gRPC and HTTP services for the yolox endpoint.\n Either the gRPC or HTTP service can be empty, but not both.\n\nMethods\n-------\nvalidate_endpoints(values)\n Validates that at least one of the gRPC or HTTP services is provided for each endpoint.\n\nRaises\n------\nValueError\n If both gRPC and HTTP services are empty for any endpoint.\n\nConfig\n------\nextra : str\n Pydantic config option to forbid extra fields.", "type": "object", "properties": { "auth_token": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "title": "Auth Token" }, "yolox_endpoints": { "default": [ null, null ], "maxItems": 2, "minItems": 2, "prefixItems": [ { "anyOf": [ { "type": "string" }, { "type": "null" } ] }, { "anyOf": [ { "type": "string" }, { "type": "null" } ] } ], "title": "Yolox Endpoints", "type": "array" }, "yolox_infer_protocol": { "default": "", "title": "Yolox Infer Protocol", "type": "string" }, "nim_batch_size": { "default": 4, "title": "Nim Batch Size", "type": "integer" }, "workers_per_progress_engine": { "default": 5, "title": "Workers Per Progress Engine", "type": "integer" } }, "additionalProperties": false }
- Config:
extra: str = forbid
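- Fields:
auth_token (str | None)
nim_batch_size (int)
workers_per_progress_engine (int)
yolox_endpoints (Tuple[str | None, str | None])
yolox_infer_protocol (str)
- Validators:
validate_endpoints » all fields
- field auth_token: str | None = None#
- Validated by:
validate_endpoints
- field nim_batch_size: int = 4#
- Validated by:
validate_endpoints
- field workers_per_progress_engine: int = 5#
- Validated by:
validate_endpoints
- field yolox_endpoints: Tuple[str | None, str | None] = (None, None)#
- Validated by:
validate_endpoints
- field yolox_infer_protocol: str = ''#
- Validated by:
validate_endpoints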
- validator validate_endpoints » all fields[source]#
Validates the gRPC and HTTP services for all endpoints.
- Parameters:
values (dict) – Dictionary containing the values of the attributes for the class.
- Returns:
The validated dictionary of values.
- Return type:
dict
- Raises:
ValueError – If both gRPC and HTTP services are empty for any endpoint.
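The JSON schema for `yolox_endpoints` constrains the value to an array of exactly two items (`minItems` and `maxItems` are both 2), each either a string or null. A minimal shape check for that constraint might look like the following. The helper name `is_valid_endpoint_array` is hypothetical, and the check covers only the shape, not the "at least one service" rule that validate_endpoints describes.

```python
def is_valid_endpoint_array(value) -> bool:
    """Check the two-item, string-or-null shape described by the JSON schema."""
    if not isinstance(value, (list, tuple)) or len(value) != 2:
        return False
    return all(item is None or isinstance(item, str) for item in value)


# Shapes that match the schema, including the documented default.
default_shape_ok = is_valid_endpoint_array([None, None])
grpc_only_ok = is_valid_endpoint_array(("yolox:8001", None))

# Shapes that do not match: wrong length, wrong item type.
too_long = is_valid_endpoint_array(["a", "b", "c"])
wrong_type = is_valid_endpoint_array([8001, None])
```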
nv_ingest_api.internal.schemas.extract.extract_pptx_schema module#
- pydantic model nv_ingest_api.internal.schemas.extract.extract_pptx_schema.PPTXConfigSchema[source]#
Bases:
BaseModel
Configuration schema for PPTX extraction endpoints and options.
- Parameters:
auth_token (Optional[str], default=None) – Authentication token required for secure services.
yolox_endpoints (Tuple[Optional[str], Optional[str]], default=(None, None)) – A tuple containing the gRPC and HTTP services for the yolox endpoint. Either the gRPC or HTTP service can be empty, but not both.
- validate_endpoints(values)[source]#
Validates that at least one of the gRPC or HTTP services is provided for each endpoint.
- Raises:
ValueError – If both gRPC and HTTP services are empty for any endpoint.
- Config:
extra (str) – Pydantic config option to forbid extra fields.
Show JSON schema
{ "title": "PPTXConfigSchema", "description": "Configuration schema for docx extraction endpoints and options.\n\nParameters\n----------\nauth_token : Optional[str], default=None\n Authentication token required for secure services.\n\nyolox_endpoints : Tuple[str, str]\n A tuple containing the gRPC and HTTP services for the yolox endpoint.\n Either the gRPC or HTTP service can be empty, but not both.\n\nMethods\n-------\nvalidate_endpoints(values)\n Validates that at least one of the gRPC or HTTP services is provided for each endpoint.\n\nRaises\n------\nValueError\n If both gRPC and HTTP services are empty for any endpoint.\n\nConfig\n------\nextra : str\n Pydantic config option to forbid extra fields.", "type": "object", "properties": { "auth_token": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "title": "Auth Token" }, "yolox_endpoints": { "default": [ null, null ], "maxItems": 2, "minItems": 2, "prefixItems": [ { "anyOf": [ { "type": "string" }, { "type": "null" } ] }, { "anyOf": [ { "type": "string" }, { "type": "null" } ] } ], "title": "Yolox Endpoints", "type": "array" }, "yolox_infer_protocol": { "default": "", "title": "Yolox Infer Protocol", "type": "string" } }, "additionalProperties": false }
- Config:
extra: str = forbid
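- Fields:
auth_token (str | None)
yolox_endpoints (Tuple[str | None, str | None])
yolox_infer_protocol (str)
- Validators:
validate_endpoints » all fields
- field auth_token: str | None = None#
- Validated by:
validate_endpoints
- field yolox_endpoints: Tuple[str | None, str | None] = (None, None)#
- Validated by:
validate_endpoints
- field yolox_infer_protocol: str = ''#
- Validated by:
validate_endpoints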
- validator validate_endpoints » all fields[source]#
Validates the gRPC and HTTP services for all endpoints.
- Parameters:
values (dict) – Dictionary containing the values of the attributes for the class.
- Returns:
The validated dictionary of values.
- Return type:
dict
- Raises:
ValueError – If both gRPC and HTTP services are empty for any endpoint.
- pydantic model nv_ingest_api.internal.schemas.extract.extract_pptx_schema.PPTXExtractorSchema[source]#
Bases:
BaseModel
Configuration schema for the PPTX extractor settings.
- Parameters:
max_queue_size (int, default=1) – The maximum number of items allowed in the processing queue.
n_workers (int, default=16) – The number of worker threads to use for processing.
raise_on_failure (bool, default=False) – A flag indicating whether to raise an exception on processing failure.
pptx_extraction_config (Optional[PPTXConfigSchema], default=None) – Configuration schema for the PPTX extraction stage.
Show JSON schema
{ "title": "PPTXExtractorSchema", "description": "Configuration schema for the PDF extractor settings.\n\nParameters\n----------\nmax_queue_size : int, default=1\n The maximum number of items allowed in the processing queue.\n\nn_workers : int, default=16\n The number of worker threads to use for processing.\n\nraise_on_failure : bool, default=False\n A flag indicating whether to raise an exception on processing failure.\n\nimage_extraction_config: Optional[ImageConfigSchema], default=None\n Configuration schema for the image extraction stage.", "type": "object", "properties": { "max_queue_size": { "default": 1, "title": "Max Queue Size", "type": "integer" }, "n_workers": { "default": 16, "title": "N Workers", "type": "integer" }, "raise_on_failure": { "default": false, "title": "Raise On Failure", "type": "boolean" }, "pptx_extraction_config": { "anyOf": [ { "$ref": "#/$defs/PPTXConfigSchema" }, { "type": "null" } ], "default": null } }, "$defs": { "PPTXConfigSchema": { "additionalProperties": false, "description": "Configuration schema for docx extraction endpoints and options.\n\nParameters\n----------\nauth_token : Optional[str], default=None\n Authentication token required for secure services.\n\nyolox_endpoints : Tuple[str, str]\n A tuple containing the gRPC and HTTP services for the yolox endpoint.\n Either the gRPC or HTTP service can be empty, but not both.\n\nMethods\n-------\nvalidate_endpoints(values)\n Validates that at least one of the gRPC or HTTP services is provided for each endpoint.\n\nRaises\n------\nValueError\n If both gRPC and HTTP services are empty for any endpoint.\n\nConfig\n------\nextra : str\n Pydantic config option to forbid extra fields.", "properties": { "auth_token": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "title": "Auth Token" }, "yolox_endpoints": { "default": [ null, null ], "maxItems": 2, "minItems": 2, "prefixItems": [ { "anyOf": [ { "type": "string" }, { "type": "null" } ] }, { "anyOf": [ { 
"type": "string" }, { "type": "null" } ] } ], "title": "Yolox Endpoints", "type": "array" }, "yolox_infer_protocol": { "default": "", "title": "Yolox Infer Protocol", "type": "string" } }, "title": "PPTXConfigSchema", "type": "object" } }, "additionalProperties": false }
- Config:
extra: str = forbid
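- Fields:
max_queue_size (int)
n_workers (int)
pptx_extraction_config (PPTXConfigSchema | None)
raise_on_failure (bool)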
- field max_queue_size: int = 1#
- field n_workers: int = 16#
- field pptx_extraction_config: PPTXConfigSchema | None = None#
- field raise_on_failure: bool = False#
nv_ingest_api.internal.schemas.extract.extract_table_schema module#
- pydantic model nv_ingest_api.internal.schemas.extract.extract_table_schema.TableExtractorConfigSchema[source]#
Bases:
BaseModel
Configuration schema for the table extraction stage settings.
- Parameters:
auth_token (Optional[str], default=None) – Authentication token required for secure services.
ocr_endpoints (Tuple[Optional[str], Optional[str]], default=(None, None)) – A tuple containing the gRPC and HTTP services for the ocr endpoint. Either the gRPC or HTTP service can be empty, but not both.
- validate_endpoints(values)[source]#
Validates that at least one of the gRPC or HTTP services is provided for the yolox endpoint.
- Raises:
ValueError – If both gRPC and HTTP services are empty for the yolox endpoint.
- Config:
extra (str) – Pydantic config option to forbid extra fields.
Show JSON schema
{ "title": "TableExtractorConfigSchema", "description": "Configuration schema for the table extraction stage settings.\n\nParameters\n----------\nauth_token : Optional[str], default=None\n Authentication token required for secure services.\n\nocr_endpoints : Tuple[Optional[str], Optional[str]], default=(None, None)\n A tuple containing the gRPC and HTTP services for the ocr endpoint.\n Either the gRPC or HTTP service can be empty, but not both.\n\nMethods\n-------\nvalidate_endpoints(values)\n Validates that at least one of the gRPC or HTTP services is provided for the yolox endpoint.\n\nRaises\n------\nValueError\n If both gRPC and HTTP services are empty for the yolox endpoint.\n\nConfig\n------\nextra : str\n Pydantic config option to forbid extra fields.", "type": "object", "properties": { "auth_token": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "title": "Auth Token" }, "yolox_endpoints": { "default": [ null, null ], "maxItems": 2, "minItems": 2, "prefixItems": [ { "anyOf": [ { "type": "string" }, { "type": "null" } ] }, { "anyOf": [ { "type": "string" }, { "type": "null" } ] } ], "title": "Yolox Endpoints", "type": "array" }, "yolox_infer_protocol": { "default": "", "title": "Yolox Infer Protocol", "type": "string" }, "ocr_endpoints": { "default": [ null, null ], "maxItems": 2, "minItems": 2, "prefixItems": [ { "anyOf": [ { "type": "string" }, { "type": "null" } ] }, { "anyOf": [ { "type": "string" }, { "type": "null" } ] } ], "title": "Ocr Endpoints", "type": "array" }, "ocr_infer_protocol": { "default": "", "title": "Ocr Infer Protocol", "type": "string" }, "nim_batch_size": { "default": 2, "title": "Nim Batch Size", "type": "integer" }, "workers_per_progress_engine": { "default": 5, "title": "Workers Per Progress Engine", "type": "integer" } }, "additionalProperties": false }
- Config:
extra: str = forbid
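- Fields:
auth_token (str | None)
nim_batch_size (int)
ocr_endpoints (Tuple[str | None, str | None])
ocr_infer_protocol (str)
workers_per_progress_engine (int)
yolox_endpoints (Tuple[str | None, str | None])
yolox_infer_protocol (str)
- Validators:
validate_endpoints » all fields
- field auth_token: str | None = None#
- Validated by:
validate_endpoints
- field nim_batch_size: int = 2#
- Validated by:
validate_endpoints
- field ocr_endpoints: Tuple[str | None, str | None] = (None, None)#
- Validated by:
validate_endpoints
- field ocr_infer_protocol: str = ''#
- Validated by:
validate_endpoints
- field workers_per_progress_engine: int = 5#
- Validated by:
validate_endpoints
- field yolox_endpoints: Tuple[str | None, str | None] = (None, None)#
- Validated by:
validate_endpoints
- field yolox_infer_protocol: str = ''#
- Validated by:
validate_endpoints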
- validator validate_endpoints » all fields[source]#
Validates the gRPC and HTTP services for the yolox endpoint.
- Parameters:
values (dict) – Dictionary containing the values of the attributes for the class.
- Returns:
The validated dictionary of values.
- Return type:
dict
- Raises:
ValueError – If both gRPC and HTTP services are empty for the yolox endpoint.
- pydantic model nv_ingest_api.internal.schemas.extract.extract_table_schema.TableExtractorSchema[source]#
Bases:
BaseModel
Configuration schema for the table extraction processing settings.
- Parameters:
max_queue_size (int, default=1) – The maximum number of items allowed in the processing queue.
n_workers (int, default=2) – The number of worker threads to use for processing.
raise_on_failure (bool, default=False) – A flag indicating whether to raise an exception if a failure occurs during table extraction.
endpoint_config (Optional[TableExtractorConfigSchema], default=None) – Configuration for the table extraction stage, including the yolox and ocr service endpoints.
Show JSON schema
{ "title": "TableExtractorSchema", "description": "Configuration schema for the table extraction processing settings.\n\nParameters\n----------\nmax_queue_size : int, default=1\n The maximum number of items allowed in the processing queue.\n\nn_workers : int, default=2\n The number of worker threads to use for processing.\n\nraise_on_failure : bool, default=False\n A flag indicating whether to raise an exception if a failure occurs during table extraction.\n\nstage_config : Optional[TableExtractorConfigSchema], default=None\n Configuration for the table extraction stage, including yolox service endpoints.", "type": "object", "properties": { "max_queue_size": { "default": 1, "title": "Max Queue Size", "type": "integer" }, "n_workers": { "default": 2, "title": "N Workers", "type": "integer" }, "raise_on_failure": { "default": false, "title": "Raise On Failure", "type": "boolean" }, "endpoint_config": { "anyOf": [ { "$ref": "#/$defs/TableExtractorConfigSchema" }, { "type": "null" } ], "default": null } }, "$defs": { "TableExtractorConfigSchema": { "additionalProperties": false, "description": "Configuration schema for the table extraction stage settings.\n\nParameters\n----------\nauth_token : Optional[str], default=None\n Authentication token required for secure services.\n\nocr_endpoints : Tuple[Optional[str], Optional[str]], default=(None, None)\n A tuple containing the gRPC and HTTP services for the ocr endpoint.\n Either the gRPC or HTTP service can be empty, but not both.\n\nMethods\n-------\nvalidate_endpoints(values)\n Validates that at least one of the gRPC or HTTP services is provided for the yolox endpoint.\n\nRaises\n------\nValueError\n If both gRPC and HTTP services are empty for the yolox endpoint.\n\nConfig\n------\nextra : str\n Pydantic config option to forbid extra fields.", "properties": { "auth_token": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "title": "Auth Token" }, "yolox_endpoints": { "default": [ null, null ], 
"maxItems": 2, "minItems": 2, "prefixItems": [ { "anyOf": [ { "type": "string" }, { "type": "null" } ] }, { "anyOf": [ { "type": "string" }, { "type": "null" } ] } ], "title": "Yolox Endpoints", "type": "array" }, "yolox_infer_protocol": { "default": "", "title": "Yolox Infer Protocol", "type": "string" }, "ocr_endpoints": { "default": [ null, null ], "maxItems": 2, "minItems": 2, "prefixItems": [ { "anyOf": [ { "type": "string" }, { "type": "null" } ] }, { "anyOf": [ { "type": "string" }, { "type": "null" } ] } ], "title": "Ocr Endpoints", "type": "array" }, "ocr_infer_protocol": { "default": "", "title": "Ocr Infer Protocol", "type": "string" }, "nim_batch_size": { "default": 2, "title": "Nim Batch Size", "type": "integer" }, "workers_per_progress_engine": { "default": 5, "title": "Workers Per Progress Engine", "type": "integer" } }, "title": "TableExtractorConfigSchema", "type": "object" } }, "additionalProperties": false }
- Config:
extra: str = forbid
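- Fields:
endpoint_config (TableExtractorConfigSchema | None)
max_queue_size (int)
n_workers (int)
raise_on_failure (bool)
- Validators:
check_positive » max_queue_size, n_workers
- field endpoint_config: TableExtractorConfigSchema | None = None#
- field max_queue_size: int = 1#
- Validated by:
check_positive
- field n_workers: int = 2#
- Validated by:
check_positive
- field raise_on_failure: bool = False#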
- validator check_positive » max_queue_size, n_workers[source]#