nv_ingest_api.internal.schemas.meta package
Submodules
nv_ingest_api.internal.schemas.meta.base_model_noext module
nv_ingest_api.internal.schemas.meta.ingest_job_schema module
- pydantic model nv_ingest_api.internal.schemas.meta.ingest_job_schema.IngestJobSchema
Bases: BaseModelNoExt
JSON schema:
{ "title": "IngestJobSchema", "type": "object", "properties": { "job_payload": { "$ref": "#/$defs/JobPayloadSchema" }, "job_id": { "anyOf": [ { "type": "string" }, { "type": "integer" } ], "title": "Job Id" }, "tasks": { "items": { "$ref": "#/$defs/IngestTaskSchema" }, "title": "Tasks", "type": "array" }, "tracing_options": { "anyOf": [ { "$ref": "#/$defs/TracingOptionsSchema" }, { "type": "null" } ], "default": null } }, "$defs": { "ContentTypeEnum": { "description": "Enum for representing various content types.\n\nNote: Content type declares the broad category of the content, such as text, image, audio, etc.\nThis is not equivalent to the Document type, which is a specific file format.\n\nAttributes\n----------\nAUDIO : str\n Represents audio content.\nEMBEDDING : str\n Represents embedding content.\nIMAGE : str\n Represents image content.\nINFO_MSG : str\n Represents an informational message.\nPAGE_IMAGE : str\n Represents a full-page image rendered from a document.\nSTRUCTURED : str\n Represents structured content.\nTEXT : str\n Represents text content.\nUNSTRUCTURED : str\n Represents unstructured content.\nVIDEO : str\n Represents video content.", "enum": [ "audio", "chart", "embedding", "image", "infographic", "info_message", "none", "page_image", "structured", "table", "text", "unknown", "video" ], "title": "ContentTypeEnum", "type": "string" }, "DocumentTypeEnum": { "description": "Enum for representing various document file types.\n\nNote: Document type refers to the specific file format of the content, such as PDF, DOCX, etc.\nThis is not equivalent to the Content type, which is a broad category of the content.\n\nAttributes\n----------\nBMP: str\n BMP image format.\nDOCX: str\n Microsoft Word document format.\nHTML: str\n HTML document.\nJPEG: str\n JPEG image format.\nPDF: str\n PDF document format.\nPNG: str\n PNG image format.\nPPTX: str\n PowerPoint presentation format.\nSVG: str\n SVG image format.\nTIFF: str\n TIFF image format.\nTXT: str\n Plain 
text file.\nMP3: str\n MP3 audio format.\nWAV: str\n WAV audio format.", "enum": [ "bmp", "docx", "html", "jpeg", "pdf", "png", "pptx", "svg", "tiff", "text", "text", "mp3", "wav", "unknown" ], "title": "DocumentTypeEnum", "type": "string" }, "IngestTaskAudioExtraction": { "additionalProperties": false, "properties": { "auth_token": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "title": "Auth Token" }, "grpc_endpoint": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "title": "Grpc Endpoint" }, "http_endpoint": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "title": "Http Endpoint" }, "infer_protocol": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "title": "Infer Protocol" }, "function_id": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "title": "Function Id" }, "use_ssl": { "anyOf": [ { "type": "boolean" }, { "type": "null" } ], "default": null, "title": "Use Ssl" }, "ssl_cert": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "title": "Ssl Cert" }, "segment_audio": { "anyOf": [ { "type": "boolean" }, { "type": "null" } ], "default": null, "title": "Segment Audio" } }, "title": "IngestTaskAudioExtraction", "type": "object" }, "IngestTaskCaptionSchema": { "additionalProperties": false, "properties": { "api_key": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "title": "Api Key" }, "endpoint_url": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "title": "Endpoint Url" }, "prompt": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "title": "Prompt" }, "model_name": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "title": "Model Name" } }, "title": "IngestTaskCaptionSchema", "type": "object" }, "IngestTaskChartExtraction": { "additionalProperties": false, "properties": { "params": { "additionalProperties": 
true, "title": "Params", "type": "object" } }, "title": "IngestTaskChartExtraction", "type": "object" }, "IngestTaskDedupParams": { "additionalProperties": false, "properties": { "filter": { "default": false, "title": "Filter", "type": "boolean" } }, "title": "IngestTaskDedupParams", "type": "object" }, "IngestTaskDedupSchema": { "additionalProperties": false, "properties": { "content_type": { "$ref": "#/$defs/ContentTypeEnum", "default": "image" }, "params": { "$ref": "#/$defs/IngestTaskDedupParams", "default": { "filter": false } } }, "title": "IngestTaskDedupSchema", "type": "object" }, "IngestTaskEmbedSchema": { "additionalProperties": false, "properties": { "endpoint_url": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "title": "Endpoint Url" }, "model_name": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "title": "Model Name" }, "api_key": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "title": "Api Key" }, "filter_errors": { "default": false, "title": "Filter Errors", "type": "boolean" }, "text_elements_modality": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "title": "Text Elements Modality" }, "image_elements_modality": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "title": "Image Elements Modality" }, "structured_elements_modality": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "title": "Structured Elements Modality" }, "audio_elements_modality": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "title": "Audio Elements Modality" } }, "title": "IngestTaskEmbedSchema", "type": "object" }, "IngestTaskExtractSchema": { "additionalProperties": false, "properties": { "document_type": { "$ref": "#/$defs/DocumentTypeEnum" }, "method": { "title": "Method", "type": "string" }, "params": { "additionalProperties": true, "title": "Params", "type": "object" } }, "required": [ 
"document_type", "method" ], "title": "IngestTaskExtractSchema", "type": "object" }, "IngestTaskFilterParamsSchema": { "additionalProperties": false, "properties": { "min_size": { "default": 128, "title": "Min Size", "type": "integer" }, "max_aspect_ratio": { "anyOf": [ { "type": "number" }, { "type": "integer" } ], "default": 5.0, "title": "Max Aspect Ratio" }, "min_aspect_ratio": { "anyOf": [ { "type": "number" }, { "type": "integer" } ], "default": 0.2, "title": "Min Aspect Ratio" }, "filter": { "default": false, "title": "Filter", "type": "boolean" } }, "title": "IngestTaskFilterParamsSchema", "type": "object" }, "IngestTaskFilterSchema": { "additionalProperties": false, "properties": { "content_type": { "$ref": "#/$defs/ContentTypeEnum", "default": "image" }, "params": { "$ref": "#/$defs/IngestTaskFilterParamsSchema", "default": { "min_size": 128, "max_aspect_ratio": 5.0, "min_aspect_ratio": 0.2, "filter": false } } }, "title": "IngestTaskFilterSchema", "type": "object" }, "IngestTaskInfographicExtraction": { "additionalProperties": false, "properties": { "params": { "additionalProperties": true, "title": "Params", "type": "object" } }, "title": "IngestTaskInfographicExtraction", "type": "object" }, "IngestTaskSchema": { "additionalProperties": false, "properties": { "type": { "$ref": "#/$defs/TaskTypeEnum" }, "task_properties": { "anyOf": [ { "$ref": "#/$defs/IngestTaskSplitSchema" }, { "$ref": "#/$defs/IngestTaskExtractSchema" }, { "$ref": "#/$defs/IngestTaskStoreEmbedSchema" }, { "$ref": "#/$defs/IngestTaskStoreSchema" }, { "$ref": "#/$defs/IngestTaskEmbedSchema" }, { "$ref": "#/$defs/IngestTaskCaptionSchema" }, { "$ref": "#/$defs/IngestTaskDedupSchema" }, { "$ref": "#/$defs/IngestTaskFilterSchema" }, { "$ref": "#/$defs/IngestTaskVdbUploadSchema" }, { "$ref": "#/$defs/IngestTaskAudioExtraction" }, { "$ref": "#/$defs/IngestTaskTableExtraction" }, { "$ref": "#/$defs/IngestTaskChartExtraction" }, { "$ref": "#/$defs/IngestTaskInfographicExtraction" }, { "$ref": 
"#/$defs/IngestTaskUDFSchema" } ], "title": "Task Properties" }, "raise_on_failure": { "default": false, "title": "Raise On Failure", "type": "boolean" } }, "required": [ "type", "task_properties" ], "title": "IngestTaskSchema", "type": "object" }, "IngestTaskSplitSchema": { "additionalProperties": false, "properties": { "tokenizer": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "title": "Tokenizer" }, "chunk_size": { "default": 1024, "exclusiveMinimum": 0, "title": "Chunk Size", "type": "integer" }, "chunk_overlap": { "default": 150, "minimum": 0, "title": "Chunk Overlap", "type": "integer" }, "params": { "additionalProperties": true, "title": "Params", "type": "object" } }, "title": "IngestTaskSplitSchema", "type": "object" }, "IngestTaskStoreEmbedSchema": { "additionalProperties": false, "properties": { "params": { "additionalProperties": true, "title": "Params", "type": "object" } }, "title": "IngestTaskStoreEmbedSchema", "type": "object" }, "IngestTaskStoreSchema": { "additionalProperties": false, "properties": { "structured": { "default": true, "title": "Structured", "type": "boolean" }, "images": { "default": false, "title": "Images", "type": "boolean" }, "method": { "title": "Method", "type": "string" }, "params": { "additionalProperties": true, "title": "Params", "type": "object" } }, "required": [ "method" ], "title": "IngestTaskStoreSchema", "type": "object" }, "IngestTaskTableExtraction": { "additionalProperties": false, "properties": { "params": { "additionalProperties": true, "title": "Params", "type": "object" } }, "title": "IngestTaskTableExtraction", "type": "object" }, "IngestTaskUDFSchema": { "additionalProperties": false, "properties": { "udf_function": { "title": "Udf Function", "type": "string" }, "udf_function_name": { "title": "Udf Function Name", "type": "string" }, "phase": { "anyOf": [ { "maximum": 5, "minimum": 1, "type": "integer" }, { "type": "null" } ], "default": null, "title": "Phase" }, "run_before": { 
"default": false, "description": "Execute UDF before the target stage", "title": "Run Before", "type": "boolean" }, "run_after": { "default": false, "description": "Execute UDF after the target stage", "title": "Run After", "type": "boolean" }, "target_stage": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "description": "Name of the stage to target (e.g., 'image_dedup', 'text_extract')", "title": "Target Stage" } }, "required": [ "udf_function", "udf_function_name" ], "title": "IngestTaskUDFSchema", "type": "object" }, "IngestTaskVdbUploadSchema": { "additionalProperties": false, "properties": { "bulk_ingest": { "default": false, "title": "Bulk Ingest", "type": "boolean" }, "bulk_ingest_path": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "title": "Bulk Ingest Path" }, "params": { "anyOf": [ { "additionalProperties": true, "type": "object" }, { "type": "null" } ], "default": null, "title": "Params" }, "filter_errors": { "default": true, "title": "Filter Errors", "type": "boolean" } }, "title": "IngestTaskVdbUploadSchema", "type": "object" }, "JobPayloadSchema": { "additionalProperties": false, "properties": { "content": { "items": { "anyOf": [ { "type": "string" }, { "format": "binary", "type": "string" } ] }, "title": "Content", "type": "array" }, "source_name": { "items": { "type": "string" }, "title": "Source Name", "type": "array" }, "source_id": { "items": { "anyOf": [ { "type": "string" }, { "type": "integer" } ] }, "title": "Source Id", "type": "array" }, "document_type": { "items": { "type": "string" }, "title": "Document Type", "type": "array" } }, "required": [ "content", "source_name", "source_id", "document_type" ], "title": "JobPayloadSchema", "type": "object" }, "TaskTypeEnum": { "description": "Enum for representing various task types.\n\nAttributes\n----------\nCAPTION : str\n Represents a caption task.\nDEDUP : str\n Represents a deduplication task.\nEMBED : str\n Represents an embedding 
task.\nEXTRACT : str\n Represents an extraction task.\nFILTER : str\n Represents a filtering task.\nSPLIT : str\n Represents a splitting task.\nSTORE : str\n Represents a storing task.\nSTORE_EMBEDDING : str\n Represents a task for storing embeddings.\nVDB_UPLOAD : str\n Represents a task for uploading to a vector database.\nAUDIO_DATA_EXTRACT : str\n Represents a task for extracting audio data.\nTABLE_DATA_EXTRACT : str\n Represents a task for extracting table data.\nCHART_DATA_EXTRACT : str\n Represents a task for extracting chart data.\nINFOGRAPHIC_DATA_EXTRACT : str\n Represents a task for extracting infographic data.\nUDF : str\n Represents a user-defined function task.", "enum": [ "audio_data_extract", "caption", "chart_data_extract", "dedup", "embed", "extract", "filter", "infographic_data_extract", "split", "store_embedding", "store", "table_data_extract", "udf", "vdb_upload" ], "title": "TaskTypeEnum", "type": "string" }, "TracingOptionsSchema": { "additionalProperties": false, "properties": { "trace": { "default": false, "title": "Trace", "type": "boolean" }, "ts_send": { "title": "Ts Send", "type": "integer" }, "trace_id": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "title": "Trace Id" } }, "required": [ "ts_send" ], "title": "TracingOptionsSchema", "type": "object" } }, "additionalProperties": false, "required": [ "job_payload", "job_id", "tasks" ] }
- Config:
extra: str = forbid
- Fields:
- field job_id: str | int [Required]
- field job_payload: JobPayloadSchema [Required]
- field tasks: List[IngestTaskSchema] [Required]
- field tracing_options: TracingOptionsSchema | None = None
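As a rough illustration of the shape this schema accepts, the sketch below builds a minimal job as a plain dictionary and checks it against the top-level `required` list and the `additionalProperties: false` rule from the JSON schema above. The helper `check_top_level` and every sample value (job ID, file name, the `"example_method"` string) are hypothetical; actual validation is performed by the pydantic model, not by this function.

```python
import json

# Key sets copied from the IngestJobSchema JSON schema above.
REQUIRED_KEYS = {"job_payload", "job_id", "tasks"}
ALLOWED_KEYS = REQUIRED_KEYS | {"tracing_options"}


def check_top_level(job: dict) -> list:
    """Return a list of problems; an empty list means the top-level shape is fine."""
    problems = []
    for key in sorted(REQUIRED_KEYS - set(job)):
        problems.append(f"missing required key: {key}")
    for key in sorted(set(job) - ALLOWED_KEYS):
        problems.append(f"unexpected key: {key}")
    return problems


# Hypothetical sample job; every value here is illustrative only.
job = {
    "job_id": "job-001",
    "job_payload": {
        "content": ["<base64-encoded document>"],
        "source_name": ["example.pdf"],
        "source_id": ["example.pdf"],
        "document_type": ["pdf"],
    },
    "tasks": [
        {
            "type": "extract",
            "task_properties": {"document_type": "pdf", "method": "example_method"},
        }
    ],
}

problems = check_top_level(job)
print(json.dumps(job, indent=2) if not problems else problems)
```

Because `tracing_options` defaults to `None`, it is omitted here; adding any key outside the four listed fields would be reported as unexpected, mirroring `extra = forbid`.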
- pydantic model nv_ingest_api.internal.schemas.meta.ingest_job_schema.IngestTaskAudioExtraction
Bases: BaseModelNoExt
JSON schema:
{ "title": "IngestTaskAudioExtraction", "type": "object", "properties": { "auth_token": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "title": "Auth Token" }, "grpc_endpoint": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "title": "Grpc Endpoint" }, "http_endpoint": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "title": "Http Endpoint" }, "infer_protocol": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "title": "Infer Protocol" }, "function_id": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "title": "Function Id" }, "use_ssl": { "anyOf": [ { "type": "boolean" }, { "type": "null" } ], "default": null, "title": "Use Ssl" }, "ssl_cert": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "title": "Ssl Cert" }, "segment_audio": { "anyOf": [ { "type": "boolean" }, { "type": "null" } ], "default": null, "title": "Segment Audio" } }, "additionalProperties": false }
- Config:
extra: str = forbid
- Fields:
- field auth_token: str | None = None
- field function_id: str | None = None
- field grpc_endpoint: str | None = None
- field http_endpoint: str | None = None
- field infer_protocol: str | None = None
- field segment_audio: bool | None = None
- field ssl_cert: str | None = None
- field use_ssl: bool | None = None
- pydantic model nv_ingest_api.internal.schemas.meta.ingest_job_schema.IngestTaskCaptionSchema
Bases: BaseModelNoExt
JSON schema:
{ "title": "IngestTaskCaptionSchema", "type": "object", "properties": { "api_key": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "title": "Api Key" }, "endpoint_url": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "title": "Endpoint Url" }, "prompt": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "title": "Prompt" }, "model_name": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "title": "Model Name" } }, "additionalProperties": false }
- Config:
extra: str = forbid
- Fields:
- field api_key: str | None = None
- field endpoint_url: str | None = None
- field model_name: str | None = None
- field prompt: str | None = None
- pydantic model nv_ingest_api.internal.schemas.meta.ingest_job_schema.IngestTaskChartExtraction
Bases: BaseModelNoExt
JSON schema:
{ "title": "IngestTaskChartExtraction", "type": "object", "properties": { "params": { "additionalProperties": true, "title": "Params", "type": "object" } }, "additionalProperties": false }
- Config:
extra: str = forbid
- Fields:
- field params: dict [Optional]
- pydantic model nv_ingest_api.internal.schemas.meta.ingest_job_schema.IngestTaskDedupParams
Bases: BaseModelNoExt
JSON schema:
{ "title": "IngestTaskDedupParams", "type": "object", "properties": { "filter": { "default": false, "title": "Filter", "type": "boolean" } }, "additionalProperties": false }
- Config:
extra: str = forbid
- Fields:
- field filter: bool = False
- pydantic model nv_ingest_api.internal.schemas.meta.ingest_job_schema.IngestTaskDedupSchema
Bases: BaseModelNoExt
JSON schema:
{ "title": "IngestTaskDedupSchema", "type": "object", "properties": { "content_type": { "$ref": "#/$defs/ContentTypeEnum", "default": "image" }, "params": { "$ref": "#/$defs/IngestTaskDedupParams", "default": { "filter": false } } }, "$defs": { "ContentTypeEnum": { "description": "Enum for representing various content types.\n\nNote: Content type declares the broad category of the content, such as text, image, audio, etc.\nThis is not equivalent to the Document type, which is a specific file format.\n\nAttributes\n----------\nAUDIO : str\n Represents audio content.\nEMBEDDING : str\n Represents embedding content.\nIMAGE : str\n Represents image content.\nINFO_MSG : str\n Represents an informational message.\nPAGE_IMAGE : str\n Represents a full-page image rendered from a document.\nSTRUCTURED : str\n Represents structured content.\nTEXT : str\n Represents text content.\nUNSTRUCTURED : str\n Represents unstructured content.\nVIDEO : str\n Represents video content.", "enum": [ "audio", "chart", "embedding", "image", "infographic", "info_message", "none", "page_image", "structured", "table", "text", "unknown", "video" ], "title": "ContentTypeEnum", "type": "string" }, "IngestTaskDedupParams": { "additionalProperties": false, "properties": { "filter": { "default": false, "title": "Filter", "type": "boolean" } }, "title": "IngestTaskDedupParams", "type": "object" } }, "additionalProperties": false }
- Config:
extra: str = forbid
- Fields:
- field content_type: ContentTypeEnum = ContentTypeEnum.IMAGE
- field params: IngestTaskDedupParams = IngestTaskDedupParams(filter=False)
- pydantic model nv_ingest_api.internal.schemas.meta.ingest_job_schema.IngestTaskEmbedSchema
Bases: BaseModelNoExt
JSON schema:
{ "title": "IngestTaskEmbedSchema", "type": "object", "properties": { "endpoint_url": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "title": "Endpoint Url" }, "model_name": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "title": "Model Name" }, "api_key": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "title": "Api Key" }, "filter_errors": { "default": false, "title": "Filter Errors", "type": "boolean" }, "text_elements_modality": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "title": "Text Elements Modality" }, "image_elements_modality": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "title": "Image Elements Modality" }, "structured_elements_modality": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "title": "Structured Elements Modality" }, "audio_elements_modality": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "title": "Audio Elements Modality" } }, "additionalProperties": false }
- Config:
extra: str = forbid
- Fields:
- field api_key: str | None = None
- field audio_elements_modality: str | None = None
- field endpoint_url: str | None = None
- field filter_errors: bool = False
- field image_elements_modality: str | None = None
- field model_name: str | None = None
- field structured_elements_modality: str | None = None
- field text_elements_modality: str | None = None
- pydantic model nv_ingest_api.internal.schemas.meta.ingest_job_schema.IngestTaskExtractSchema
Bases: BaseModelNoExt
JSON schema:
{ "title": "IngestTaskExtractSchema", "type": "object", "properties": { "document_type": { "$ref": "#/$defs/DocumentTypeEnum" }, "method": { "title": "Method", "type": "string" }, "params": { "additionalProperties": true, "title": "Params", "type": "object" } }, "$defs": { "DocumentTypeEnum": { "description": "Enum for representing various document file types.\n\nNote: Document type refers to the specific file format of the content, such as PDF, DOCX, etc.\nThis is not equivalent to the Content type, which is a broad category of the content.\n\nAttributes\n----------\nBMP: str\n BMP image format.\nDOCX: str\n Microsoft Word document format.\nHTML: str\n HTML document.\nJPEG: str\n JPEG image format.\nPDF: str\n PDF document format.\nPNG: str\n PNG image format.\nPPTX: str\n PowerPoint presentation format.\nSVG: str\n SVG image format.\nTIFF: str\n TIFF image format.\nTXT: str\n Plain text file.\nMP3: str\n MP3 audio format.\nWAV: str\n WAV audio format.", "enum": [ "bmp", "docx", "html", "jpeg", "pdf", "png", "pptx", "svg", "tiff", "text", "text", "mp3", "wav", "unknown" ], "title": "DocumentTypeEnum", "type": "string" } }, "additionalProperties": false, "required": [ "document_type", "method" ] }
- Config:
extra: str = forbid
- Fields:
- field document_type: DocumentTypeEnum [Required]
- Validated by: case_insensitive_document_type
- field method: str [Required]
- field params: dict [Optional]
- Validators:
- validator case_insensitive_document_type » document_type
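The validator's name suggests that `document_type` is normalized before the enum lookup, so that values such as `"PDF"` and `"pdf"` are treated alike. Its body is not shown on this page, so the following is only a sketch of that idea. The cut-down `DocType` enum and the `normalize_document_type` helper are hypothetical and are not part of `nv_ingest_api`.

```python
from enum import Enum


class DocType(Enum):
    """Hypothetical subset of the DocumentTypeEnum values listed in the schema."""
    DOCX = "docx"
    PDF = "pdf"
    PNG = "png"


def normalize_document_type(value):
    """Lower-case string input before the enum lookup; pass enum members through."""
    if isinstance(value, DocType):
        return value
    if isinstance(value, str):
        value = value.lower()
    return DocType(value)  # raises ValueError for values outside the enum


print(normalize_document_type("PDF"))
print(normalize_document_type("Docx"))
```

Normalizing before the lookup keeps the enum values themselves lower-case, which matches the `enum` list in the JSON schema.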
- pydantic model nv_ingest_api.internal.schemas.meta.ingest_job_schema.IngestTaskFilterParamsSchema
Bases: BaseModelNoExt
JSON schema:
{ "title": "IngestTaskFilterParamsSchema", "type": "object", "properties": { "min_size": { "default": 128, "title": "Min Size", "type": "integer" }, "max_aspect_ratio": { "anyOf": [ { "type": "number" }, { "type": "integer" } ], "default": 5.0, "title": "Max Aspect Ratio" }, "min_aspect_ratio": { "anyOf": [ { "type": "number" }, { "type": "integer" } ], "default": 0.2, "title": "Min Aspect Ratio" }, "filter": { "default": false, "title": "Filter", "type": "boolean" } }, "additionalProperties": false }
- Config:
extra: str = forbid
- Fields:
- field filter: bool = False
- field max_aspect_ratio: float | int = 5.0
- field min_aspect_ratio: float | int = 0.2
- field min_size: int = 128
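To make the defaults and the `extra = forbid` setting concrete, here is a small standard-library stand-in that resolves a partial set of overrides against the documented defaults and rejects unknown keys. `FilterParams` and `build_filter_params` are hypothetical names used only for this sketch; the real class is the pydantic model above, and its error type and messages will differ.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class FilterParams:
    """Stand-in mirroring the documented fields and defaults (not the real model)."""
    min_size: int = 128
    max_aspect_ratio: float = 5.0
    min_aspect_ratio: float = 0.2
    filter: bool = False


def build_filter_params(overrides: dict) -> FilterParams:
    """Apply overrides on top of the defaults, rejecting unknown keys."""
    known = {f.name for f in fields(FilterParams)}
    unknown = sorted(set(overrides) - known)
    if unknown:
        raise ValueError(f"unexpected keys: {unknown}")
    return FilterParams(**overrides)


print(build_filter_params({}))
print(build_filter_params({"min_size": 256, "filter": True}))
```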
- pydantic model nv_ingest_api.internal.schemas.meta.ingest_job_schema.IngestTaskFilterSchema
Bases: BaseModelNoExt
JSON schema:
{ "title": "IngestTaskFilterSchema", "type": "object", "properties": { "content_type": { "$ref": "#/$defs/ContentTypeEnum", "default": "image" }, "params": { "$ref": "#/$defs/IngestTaskFilterParamsSchema", "default": { "min_size": 128, "max_aspect_ratio": 5.0, "min_aspect_ratio": 0.2, "filter": false } } }, "$defs": { "ContentTypeEnum": { "description": "Enum for representing various content types.\n\nNote: Content type declares the broad category of the content, such as text, image, audio, etc.\nThis is not equivalent to the Document type, which is a specific file format.\n\nAttributes\n----------\nAUDIO : str\n Represents audio content.\nEMBEDDING : str\n Represents embedding content.\nIMAGE : str\n Represents image content.\nINFO_MSG : str\n Represents an informational message.\nPAGE_IMAGE : str\n Represents a full-page image rendered from a document.\nSTRUCTURED : str\n Represents structured content.\nTEXT : str\n Represents text content.\nUNSTRUCTURED : str\n Represents unstructured content.\nVIDEO : str\n Represents video content.", "enum": [ "audio", "chart", "embedding", "image", "infographic", "info_message", "none", "page_image", "structured", "table", "text", "unknown", "video" ], "title": "ContentTypeEnum", "type": "string" }, "IngestTaskFilterParamsSchema": { "additionalProperties": false, "properties": { "min_size": { "default": 128, "title": "Min Size", "type": "integer" }, "max_aspect_ratio": { "anyOf": [ { "type": "number" }, { "type": "integer" } ], "default": 5.0, "title": "Max Aspect Ratio" }, "min_aspect_ratio": { "anyOf": [ { "type": "number" }, { "type": "integer" } ], "default": 0.2, "title": "Min Aspect Ratio" }, "filter": { "default": false, "title": "Filter", "type": "boolean" } }, "title": "IngestTaskFilterParamsSchema", "type": "object" } }, "additionalProperties": false }
- Config:
extra: str = forbid
- Fields:
- field content_type: ContentTypeEnum = ContentTypeEnum.IMAGE
- field params: IngestTaskFilterParamsSchema = IngestTaskFilterParamsSchema(min_size=128, max_aspect_ratio=5.0, min_aspect_ratio=0.2, filter=False)
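A filter task reaches the pipeline as one entry in a job's `tasks` array, pairing a `type` string with `task_properties`. The sketch below assembles such an entry as a plain dictionary. The `make_task` helper is hypothetical and checks only the task type string against the `TaskTypeEnum` values from the JSON schema; the override values are illustrative.

```python
import json

# Task type strings copied from the TaskTypeEnum values in the JSON schema.
TASK_TYPES = {
    "audio_data_extract", "caption", "chart_data_extract", "dedup", "embed",
    "extract", "filter", "infographic_data_extract", "split",
    "store_embedding", "store", "table_data_extract", "udf", "vdb_upload",
}


def make_task(task_type: str, task_properties: dict, raise_on_failure: bool = False) -> dict:
    """Assemble one task entry; only the task type string is checked here."""
    if task_type not in TASK_TYPES:
        raise ValueError(f"unknown task type: {task_type!r}")
    return {
        "type": task_type,
        "task_properties": task_properties,
        "raise_on_failure": raise_on_failure,
    }


# Illustrative overrides; omitted params keep their documented defaults.
filter_task = make_task(
    "filter",
    {"content_type": "image", "params": {"min_size": 256, "filter": True}},
)
print(json.dumps(filter_task, indent=2))
```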
- pydantic model nv_ingest_api.internal.schemas.meta.ingest_job_schema.IngestTaskInfographicExtraction
Bases: BaseModelNoExt
JSON schema:
{ "title": "IngestTaskInfographicExtraction", "type": "object", "properties": { "params": { "additionalProperties": true, "title": "Params", "type": "object" } }, "additionalProperties": false }
- Config:
extra: str = forbid
- Fields:
- field params: dict [Optional]
- pydantic model nv_ingest_api.internal.schemas.meta.ingest_job_schema.IngestTaskSchema
Bases: BaseModelNoExt
JSON schema:
{ "title": "IngestTaskSchema", "type": "object", "properties": { "type": { "$ref": "#/$defs/TaskTypeEnum" }, "task_properties": { "anyOf": [ { "$ref": "#/$defs/IngestTaskSplitSchema" }, { "$ref": "#/$defs/IngestTaskExtractSchema" }, { "$ref": "#/$defs/IngestTaskStoreEmbedSchema" }, { "$ref": "#/$defs/IngestTaskStoreSchema" }, { "$ref": "#/$defs/IngestTaskEmbedSchema" }, { "$ref": "#/$defs/IngestTaskCaptionSchema" }, { "$ref": "#/$defs/IngestTaskDedupSchema" }, { "$ref": "#/$defs/IngestTaskFilterSchema" }, { "$ref": "#/$defs/IngestTaskVdbUploadSchema" }, { "$ref": "#/$defs/IngestTaskAudioExtraction" }, { "$ref": "#/$defs/IngestTaskTableExtraction" }, { "$ref": "#/$defs/IngestTaskChartExtraction" }, { "$ref": "#/$defs/IngestTaskInfographicExtraction" }, { "$ref": "#/$defs/IngestTaskUDFSchema" } ], "title": "Task Properties" }, "raise_on_failure": { "default": false, "title": "Raise On Failure", "type": "boolean" } }, "$defs": { "ContentTypeEnum": { "description": "Enum for representing various content types.\n\nNote: Content type declares the broad category of the content, such as text, image, audio, etc.\nThis is not equivalent to the Document type, which is a specific file format.\n\nAttributes\n----------\nAUDIO : str\n Represents audio content.\nEMBEDDING : str\n Represents embedding content.\nIMAGE : str\n Represents image content.\nINFO_MSG : str\n Represents an informational message.\nPAGE_IMAGE : str\n Represents a full-page image rendered from a document.\nSTRUCTURED : str\n Represents structured content.\nTEXT : str\n Represents text content.\nUNSTRUCTURED : str\n Represents unstructured content.\nVIDEO : str\n Represents video content.", "enum": [ "audio", "chart", "embedding", "image", "infographic", "info_message", "none", "page_image", "structured", "table", "text", "unknown", "video" ], "title": "ContentTypeEnum", "type": "string" }, "DocumentTypeEnum": { "description": "Enum for representing various document file types.\n\nNote: Document type refers 
to the specific file format of the content, such as PDF, DOCX, etc.\nThis is not equivalent to the Content type, which is a broad category of the content.\n\nAttributes\n----------\nBMP: str\n BMP image format.\nDOCX: str\n Microsoft Word document format.\nHTML: str\n HTML document.\nJPEG: str\n JPEG image format.\nPDF: str\n PDF document format.\nPNG: str\n PNG image format.\nPPTX: str\n PowerPoint presentation format.\nSVG: str\n SVG image format.\nTIFF: str\n TIFF image format.\nTXT: str\n Plain text file.\nMP3: str\n MP3 audio format.\nWAV: str\n WAV audio format.", "enum": [ "bmp", "docx", "html", "jpeg", "pdf", "png", "pptx", "svg", "tiff", "text", "text", "mp3", "wav", "unknown" ], "title": "DocumentTypeEnum", "type": "string" }, "IngestTaskAudioExtraction": { "additionalProperties": false, "properties": { "auth_token": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "title": "Auth Token" }, "grpc_endpoint": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "title": "Grpc Endpoint" }, "http_endpoint": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "title": "Http Endpoint" }, "infer_protocol": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "title": "Infer Protocol" }, "function_id": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "title": "Function Id" }, "use_ssl": { "anyOf": [ { "type": "boolean" }, { "type": "null" } ], "default": null, "title": "Use Ssl" }, "ssl_cert": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "title": "Ssl Cert" }, "segment_audio": { "anyOf": [ { "type": "boolean" }, { "type": "null" } ], "default": null, "title": "Segment Audio" } }, "title": "IngestTaskAudioExtraction", "type": "object" }, "IngestTaskCaptionSchema": { "additionalProperties": false, "properties": { "api_key": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "title": "Api Key" }, 
"endpoint_url": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "title": "Endpoint Url" }, "prompt": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "title": "Prompt" }, "model_name": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "title": "Model Name" } }, "title": "IngestTaskCaptionSchema", "type": "object" }, "IngestTaskChartExtraction": { "additionalProperties": false, "properties": { "params": { "additionalProperties": true, "title": "Params", "type": "object" } }, "title": "IngestTaskChartExtraction", "type": "object" }, "IngestTaskDedupParams": { "additionalProperties": false, "properties": { "filter": { "default": false, "title": "Filter", "type": "boolean" } }, "title": "IngestTaskDedupParams", "type": "object" }, "IngestTaskDedupSchema": { "additionalProperties": false, "properties": { "content_type": { "$ref": "#/$defs/ContentTypeEnum", "default": "image" }, "params": { "$ref": "#/$defs/IngestTaskDedupParams", "default": { "filter": false } } }, "title": "IngestTaskDedupSchema", "type": "object" }, "IngestTaskEmbedSchema": { "additionalProperties": false, "properties": { "endpoint_url": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "title": "Endpoint Url" }, "model_name": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "title": "Model Name" }, "api_key": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "title": "Api Key" }, "filter_errors": { "default": false, "title": "Filter Errors", "type": "boolean" }, "text_elements_modality": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "title": "Text Elements Modality" }, "image_elements_modality": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "title": "Image Elements Modality" }, "structured_elements_modality": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "title": 
"Structured Elements Modality" }, "audio_elements_modality": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "title": "Audio Elements Modality" } }, "title": "IngestTaskEmbedSchema", "type": "object" }, "IngestTaskExtractSchema": { "additionalProperties": false, "properties": { "document_type": { "$ref": "#/$defs/DocumentTypeEnum" }, "method": { "title": "Method", "type": "string" }, "params": { "additionalProperties": true, "title": "Params", "type": "object" } }, "required": [ "document_type", "method" ], "title": "IngestTaskExtractSchema", "type": "object" }, "IngestTaskFilterParamsSchema": { "additionalProperties": false, "properties": { "min_size": { "default": 128, "title": "Min Size", "type": "integer" }, "max_aspect_ratio": { "anyOf": [ { "type": "number" }, { "type": "integer" } ], "default": 5.0, "title": "Max Aspect Ratio" }, "min_aspect_ratio": { "anyOf": [ { "type": "number" }, { "type": "integer" } ], "default": 0.2, "title": "Min Aspect Ratio" }, "filter": { "default": false, "title": "Filter", "type": "boolean" } }, "title": "IngestTaskFilterParamsSchema", "type": "object" }, "IngestTaskFilterSchema": { "additionalProperties": false, "properties": { "content_type": { "$ref": "#/$defs/ContentTypeEnum", "default": "image" }, "params": { "$ref": "#/$defs/IngestTaskFilterParamsSchema", "default": { "min_size": 128, "max_aspect_ratio": 5.0, "min_aspect_ratio": 0.2, "filter": false } } }, "title": "IngestTaskFilterSchema", "type": "object" }, "IngestTaskInfographicExtraction": { "additionalProperties": false, "properties": { "params": { "additionalProperties": true, "title": "Params", "type": "object" } }, "title": "IngestTaskInfographicExtraction", "type": "object" }, "IngestTaskSplitSchema": { "additionalProperties": false, "properties": { "tokenizer": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "title": "Tokenizer" }, "chunk_size": { "default": 1024, "exclusiveMinimum": 0, "title": "Chunk Size", 
"type": "integer" }, "chunk_overlap": { "default": 150, "minimum": 0, "title": "Chunk Overlap", "type": "integer" }, "params": { "additionalProperties": true, "title": "Params", "type": "object" } }, "title": "IngestTaskSplitSchema", "type": "object" }, "IngestTaskStoreEmbedSchema": { "additionalProperties": false, "properties": { "params": { "additionalProperties": true, "title": "Params", "type": "object" } }, "title": "IngestTaskStoreEmbedSchema", "type": "object" }, "IngestTaskStoreSchema": { "additionalProperties": false, "properties": { "structured": { "default": true, "title": "Structured", "type": "boolean" }, "images": { "default": false, "title": "Images", "type": "boolean" }, "method": { "title": "Method", "type": "string" }, "params": { "additionalProperties": true, "title": "Params", "type": "object" } }, "required": [ "method" ], "title": "IngestTaskStoreSchema", "type": "object" }, "IngestTaskTableExtraction": { "additionalProperties": false, "properties": { "params": { "additionalProperties": true, "title": "Params", "type": "object" } }, "title": "IngestTaskTableExtraction", "type": "object" }, "IngestTaskUDFSchema": { "additionalProperties": false, "properties": { "udf_function": { "title": "Udf Function", "type": "string" }, "udf_function_name": { "title": "Udf Function Name", "type": "string" }, "phase": { "anyOf": [ { "maximum": 5, "minimum": 1, "type": "integer" }, { "type": "null" } ], "default": null, "title": "Phase" }, "run_before": { "default": false, "description": "Execute UDF before the target stage", "title": "Run Before", "type": "boolean" }, "run_after": { "default": false, "description": "Execute UDF after the target stage", "title": "Run After", "type": "boolean" }, "target_stage": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "description": "Name of the stage to target (e.g., 'image_dedup', 'text_extract')", "title": "Target Stage" } }, "required": [ "udf_function", "udf_function_name" ], "title": 
"IngestTaskUDFSchema", "type": "object" }, "IngestTaskVdbUploadSchema": { "additionalProperties": false, "properties": { "bulk_ingest": { "default": false, "title": "Bulk Ingest", "type": "boolean" }, "bulk_ingest_path": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "title": "Bulk Ingest Path" }, "params": { "anyOf": [ { "additionalProperties": true, "type": "object" }, { "type": "null" } ], "default": null, "title": "Params" }, "filter_errors": { "default": true, "title": "Filter Errors", "type": "boolean" } }, "title": "IngestTaskVdbUploadSchema", "type": "object" }, "TaskTypeEnum": { "description": "Enum for representing various task types.\n\nAttributes\n----------\nCAPTION : str\n Represents a caption task.\nDEDUP : str\n Represents a deduplication task.\nEMBED : str\n Represents an embedding task.\nEXTRACT : str\n Represents an extraction task.\nFILTER : str\n Represents a filtering task.\nSPLIT : str\n Represents a splitting task.\nSTORE : str\n Represents a storing task.\nSTORE_EMBEDDING : str\n Represents a task for storing embeddings.\nVDB_UPLOAD : str\n Represents a task for uploading to a vector database.\nAUDIO_DATA_EXTRACT : str\n Represents a task for extracting audio data.\nTABLE_DATA_EXTRACT : str\n Represents a task for extracting table data.\nCHART_DATA_EXTRACT : str\n Represents a task for extracting chart data.\nINFOGRAPHIC_DATA_EXTRACT : str\n Represents a task for extracting infographic data.\nUDF : str\n Represents a user-defined function task.", "enum": [ "audio_data_extract", "caption", "chart_data_extract", "dedup", "embed", "extract", "filter", "infographic_data_extract", "split", "store_embedding", "store", "table_data_extract", "udf", "vdb_upload" ], "title": "TaskTypeEnum", "type": "string" } }, "additionalProperties": false, "required": [ "type", "task_properties" ] }
- Config:
extra: str = forbid
- Fields:
- Validators:
check_task_properties_type » all fields
- field raise_on_failure: bool = False#
- Validated by:
check_task_properties_type
- field task_properties: IngestTaskSplitSchema | IngestTaskExtractSchema | IngestTaskStoreEmbedSchema | IngestTaskStoreSchema | IngestTaskEmbedSchema | IngestTaskCaptionSchema | IngestTaskDedupSchema | IngestTaskFilterSchema | IngestTaskVdbUploadSchema | IngestTaskAudioExtraction | IngestTaskTableExtraction | IngestTaskChartExtraction | IngestTaskInfographicExtraction | IngestTaskUDFSchema [Required]#
- Validated by:
check_task_properties_type
- field type: TaskTypeEnum [Required]#
- Validated by:
check_task_properties_type
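A task entry, as described by the fields above, pairs a `type` value from TaskTypeEnum with a `task_properties` object. The sketch below illustrates that shape using only the standard library. It does not import nv_ingest_api, the helper `check_task_shape` is hypothetical, and it covers only the required keys, the forbidden extra keys, and the enum values listed in the JSON schema; it does not reproduce what `check_task_properties_type` does.

```python
# Task type values copied from the TaskTypeEnum "enum" list in the JSON schema.
TASK_TYPES = {
    "audio_data_extract", "caption", "chart_data_extract", "dedup", "embed",
    "extract", "filter", "infographic_data_extract", "split",
    "store_embedding", "store", "table_data_extract", "udf", "vdb_upload",
}

# Field names documented for the task model.
ALLOWED_KEYS = {"type", "task_properties", "raise_on_failure"}


def check_task_shape(task: dict) -> list:
    """Hypothetical helper: return a list of problems found in a task dict."""
    problems = []
    for key in ("type", "task_properties"):  # the schema's "required" keys
        if key not in task:
            problems.append(f"missing required key: {key}")
    for key in sorted(set(task) - ALLOWED_KEYS):  # extra = forbid
        problems.append(f"unexpected key: {key}")
    if "type" in task and task["type"] not in TASK_TYPES:
        problems.append(f"unknown task type: {task['type']}")
    return problems


split_task = {
    "type": "split",
    "task_properties": {"chunk_size": 512, "chunk_overlap": 50},
}
print(check_task_shape(split_task))
```

An empty list means the dict has the documented outer shape; the contents of `task_properties` still have to satisfy the schema that corresponds to the chosen `type`.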
- pydantic model nv_ingest_api.internal.schemas.meta.ingest_job_schema.IngestTaskSplitSchema[source]#
Bases:
BaseModelNoExt
Show JSON schema
{ "title": "IngestTaskSplitSchema", "type": "object", "properties": { "tokenizer": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "title": "Tokenizer" }, "chunk_size": { "default": 1024, "exclusiveMinimum": 0, "title": "Chunk Size", "type": "integer" }, "chunk_overlap": { "default": 150, "minimum": 0, "title": "Chunk Overlap", "type": "integer" }, "params": { "additionalProperties": true, "title": "Params", "type": "object" } }, "additionalProperties": false }
- Config:
extra: str = forbid
- Fields:
- Validators:
check_chunk_overlap » chunk_overlap
- field chunk_overlap: Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0)])] = 150#
- Constraints:
ge = 0
- Validated by:
check_chunk_overlap
- field chunk_size: Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Gt(gt=0)])] = 1024#
- Constraints:
gt = 0
- field params: dict [Optional]#
- field tokenizer: str | None = None#
- validator check_chunk_overlap » chunk_overlap[source]#
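To make the relationship between `chunk_size` and `chunk_overlap` concrete, here is a minimal sliding-window sketch. It is not the splitter used by the library: the function `chunk_tokens` is hypothetical, it works on a plain list rather than real tokenizer output, and the rule that the overlap must be smaller than the chunk size is an assumption of this sketch (this page does not show what `check_chunk_overlap` enforces).

```python
def chunk_tokens(tokens, chunk_size=1024, chunk_overlap=150):
    """Split a token list into windows of chunk_size sharing chunk_overlap tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be > 0")       # schema: gt = 0
    if chunk_overlap < 0:
        raise ValueError("chunk_overlap must be >= 0")   # schema: ge = 0
    if chunk_overlap >= chunk_size:
        # Assumption of this sketch: otherwise the window would never advance.
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each new window moves forward
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the input
    return chunks


chunks = chunk_tokens(list(range(10)), chunk_size=4, chunk_overlap=1)
print(chunks)
```

With the documented defaults (1024 and 150), each window in this sketch would start 874 tokens after the previous one.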
- pydantic model nv_ingest_api.internal.schemas.meta.ingest_job_schema.IngestTaskStoreEmbedSchema[source]#
Bases:
BaseModelNoExt
Show JSON schema
{ "title": "IngestTaskStoreEmbedSchema", "type": "object", "properties": { "params": { "additionalProperties": true, "title": "Params", "type": "object" } }, "additionalProperties": false }
- Config:
extra: str = forbid
- Fields:
- field params: dict [Optional]#
- pydantic model nv_ingest_api.internal.schemas.meta.ingest_job_schema.IngestTaskStoreSchema[source]#
Bases:
BaseModelNoExt
Show JSON schema
{ "title": "IngestTaskStoreSchema", "type": "object", "properties": { "structured": { "default": true, "title": "Structured", "type": "boolean" }, "images": { "default": false, "title": "Images", "type": "boolean" }, "method": { "title": "Method", "type": "string" }, "params": { "additionalProperties": true, "title": "Params", "type": "object" } }, "additionalProperties": false, "required": [ "method" ] }
- Config:
extra: str = forbid
- Fields:
- field images: bool = False#
- field method: str [Required]#
- field params: dict [Optional]#
- field structured: bool = True#
- pydantic model nv_ingest_api.internal.schemas.meta.ingest_job_schema.IngestTaskTableExtraction[source]#
Bases:
BaseModelNoExt
Show JSON schema
{ "title": "IngestTaskTableExtraction", "type": "object", "properties": { "params": { "additionalProperties": true, "title": "Params", "type": "object" } }, "additionalProperties": false }
- Config:
extra: str = forbid
- Fields:
- field params: dict [Optional]#
- pydantic model nv_ingest_api.internal.schemas.meta.ingest_job_schema.IngestTaskUDFSchema[source]#
Bases:
BaseModelNoExt
Show JSON schema
{ "title": "IngestTaskUDFSchema", "type": "object", "properties": { "udf_function": { "title": "Udf Function", "type": "string" }, "udf_function_name": { "title": "Udf Function Name", "type": "string" }, "phase": { "anyOf": [ { "maximum": 5, "minimum": 1, "type": "integer" }, { "type": "null" } ], "default": null, "title": "Phase" }, "run_before": { "default": false, "description": "Execute UDF before the target stage", "title": "Run Before", "type": "boolean" }, "run_after": { "default": false, "description": "Execute UDF after the target stage", "title": "Run After", "type": "boolean" }, "target_stage": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "description": "Name of the stage to target (e.g., 'image_dedup', 'text_extract')", "title": "Target Stage" } }, "additionalProperties": false, "required": [ "udf_function", "udf_function_name" ] }
- Config:
extra: str = forbid
- Fields:
- Validators:
validate_stage_targeting » all fields
- field phase: int | None = None#
- Constraints:
ge = 1
le = 5
- Validated by:
validate_stage_targeting
- field run_after: bool = False#
Execute UDF after the target stage
- Validated by:
validate_stage_targeting
- field run_before: bool = False#
Execute UDF before the target stage
- Validated by:
validate_stage_targeting
- field target_stage: str | None = None#
Name of the stage to target (e.g., 'image_dedup', 'text_extract')
- Validated by:
validate_stage_targeting
- field udf_function: str [Required]#
- Validated by:
validate_stage_targeting
- field udf_function_name: str [Required]#
- Validated by:
validate_stage_targeting
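The following sketch shows what a UDF task's properties might look like as a plain dict, together with a check of the constraints that are visible in the JSON schema (two required strings and a `phase` between 1 and 5). The helper `check_udf_task` is hypothetical, and the string placed in `udf_function` is only a placeholder: this page does not say whether that field holds source code or a reference to it. The cross-field rules applied by `validate_stage_targeting` are not reproduced here.

```python
def check_udf_task(props: dict) -> list:
    """Hypothetical helper: check the per-field constraints from the JSON schema."""
    problems = []
    for key in ("udf_function", "udf_function_name"):  # schema "required" keys
        if not isinstance(props.get(key), str):
            problems.append(f"{key} must be a string")
    phase = props.get("phase")
    if phase is not None:
        # bool is a subclass of int in Python, so exclude it explicitly.
        is_int = isinstance(phase, int) and not isinstance(phase, bool)
        if not (is_int and 1 <= phase <= 5):  # schema: ge = 1, le = 5
            problems.append("phase must be an integer from 1 to 5, or None")
    return problems


udf_task = {
    # Placeholder content; the expected format of this string is an assumption.
    "udf_function": "def tag_pages(control_message):\n    return control_message\n",
    "udf_function_name": "tag_pages",
    "target_stage": "text_extract",  # example stage name from the field description
    "run_after": True,
}
print(check_udf_task(udf_task))
```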
- pydantic model nv_ingest_api.internal.schemas.meta.ingest_job_schema.IngestTaskVdbUploadSchema[source]#
Bases:
BaseModelNoExt
Show JSON schema
{ "title": "IngestTaskVdbUploadSchema", "type": "object", "properties": { "bulk_ingest": { "default": false, "title": "Bulk Ingest", "type": "boolean" }, "bulk_ingest_path": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "title": "Bulk Ingest Path" }, "params": { "anyOf": [ { "additionalProperties": true, "type": "object" }, { "type": "null" } ], "default": null, "title": "Params" }, "filter_errors": { "default": true, "title": "Filter Errors", "type": "boolean" } }, "additionalProperties": false }
- Config:
extra: str = forbid
- Fields:
- field bulk_ingest: bool = False#
- field bulk_ingest_path: str | None = None#
- field filter_errors: bool = True#
- field params: dict | None = None#
- pydantic model nv_ingest_api.internal.schemas.meta.ingest_job_schema.JobPayloadSchema[source]#
Bases:
BaseModelNoExt
Show JSON schema
{ "title": "JobPayloadSchema", "type": "object", "properties": { "content": { "items": { "anyOf": [ { "type": "string" }, { "format": "binary", "type": "string" } ] }, "title": "Content", "type": "array" }, "source_name": { "items": { "type": "string" }, "title": "Source Name", "type": "array" }, "source_id": { "items": { "anyOf": [ { "type": "string" }, { "type": "integer" } ] }, "title": "Source Id", "type": "array" }, "document_type": { "items": { "type": "string" }, "title": "Document Type", "type": "array" } }, "additionalProperties": false, "required": [ "content", "source_name", "source_id", "document_type" ] }
- Config:
extra: str = forbid
- Fields:
- field content: List[str | bytes] [Required]#
- field document_type: List[str] [Required]#
- field source_id: List[str | int] [Required]#
- field source_name: List[str] [Required]#
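All four payload fields are lists, so a single document is wrapped in one-element lists. The sketch below builds such a payload with the standard library. The helper `build_job_payload` is hypothetical, and three of its choices are assumptions rather than documented behavior: encoding the bytes as base64 text, reusing the file name as `source_id`, and treating the lists as index-aligned (entry i of each list describing the same document).

```python
import base64


def build_job_payload(name: str, raw: bytes, document_type: str) -> dict:
    """Hypothetical helper: wrap one document in the four required lists."""
    return {
        # Assumption: binary content is sent as base64-encoded text.
        "content": [base64.b64encode(raw).decode("ascii")],
        "source_name": [name],
        # Assumption: the file name doubles as the source identifier.
        "source_id": [name],
        "document_type": [document_type],
    }


payload = build_job_payload("report.pdf", b"%PDF-1.7 ...", "pdf")
print(sorted(payload))
```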
- pydantic model nv_ingest_api.internal.schemas.meta.ingest_job_schema.TracingOptionsSchema[source]#
Bases:
BaseModelNoExt
Show JSON schema
{ "title": "TracingOptionsSchema", "type": "object", "properties": { "trace": { "default": false, "title": "Trace", "type": "boolean" }, "ts_send": { "title": "Ts Send", "type": "integer" }, "trace_id": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "title": "Trace Id" } }, "additionalProperties": false, "required": [ "ts_send" ] }
- Config:
extra: str = forbid
- Fields:
- field trace: bool = False#
- field trace_id: str | None = None#
- field ts_send: int [Required]#
- nv_ingest_api.internal.schemas.meta.ingest_job_schema.validate_ingest_job(job_data: Dict[str, Any]) → IngestJobSchema
Validates a dictionary representing an ingest_job using the IngestJobSchema.
Parameters: - job_data: Dictionary representing an ingest job.
Returns: - IngestJobSchema: The validated ingest job.
Raises: - ValidationError: If the input data does not conform to the IngestJobSchema.
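As an illustration of the input this function expects, the sketch below assembles a job dictionary from the pieces documented on this page: a payload, a job id, a task list, and tracing options. The concrete values are made up, the use of nanoseconds for `ts_send` is an assumption (the schema only says it is an integer), and the wrapper `try_validate` is hypothetical; it calls `validate_ingest_job` only when nv_ingest_api is installed.

```python
import json
import time

job_data = {
    "job_payload": {
        "content": ["aGVsbG8gd29ybGQ="],  # base64 of b"hello world" (assumed encoding)
        "source_name": ["hello.txt"],
        "source_id": ["hello.txt"],
        "document_type": ["text"],
    },
    "job_id": "job-0001",  # the schema accepts a string or an integer
    "tasks": [
        {
            "type": "split",
            "task_properties": {"chunk_size": 512, "chunk_overlap": 50},
        }
    ],
    "tracing_options": {
        "trace": False,
        "ts_send": time.time_ns(),  # assumption: send timestamp in nanoseconds
    },
}


def try_validate(data: dict):
    """Hypothetical wrapper: validate when nv_ingest_api is available, else return None."""
    try:
        from nv_ingest_api.internal.schemas.meta.ingest_job_schema import (
            validate_ingest_job,
        )
    except ImportError:
        return None
    return validate_ingest_job(data)  # raises ValidationError on bad input


# The dict is plain JSON-compatible data, so it survives a round trip unchanged.
round_tripped = json.loads(json.dumps(job_data))
```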
nv_ingest_api.internal.schemas.meta.metadata_schema module#
- pydantic model nv_ingest_api.internal.schemas.meta.metadata_schema.AudioMetadataSchema[source]#
Bases:
BaseModelNoExt
The schema for extracted audio content.
Show JSON schema
{ "title": "AudioMetadataSchema", "description": "The schema for extracted audio content.", "type": "object", "properties": { "audio_transcript": { "default": "", "title": "Audio Transcript", "type": "string" }, "audio_type": { "default": "", "title": "Audio Type", "type": "string" }, "custom_content": { "anyOf": [ { "additionalProperties": true, "type": "object" }, { "type": "null" } ], "default": null, "title": "Custom Content" } }, "additionalProperties": false }
- Config:
extra: str = forbid
- Fields:
- field audio_transcript: str = ''#
A transcript of the audio content.
- field audio_type: str = ''#
The type or format of the audio, such as mp3, wav.
- field custom_content: Dict[str, Any] | None = None#
- pydantic model nv_ingest_api.internal.schemas.meta.metadata_schema.ChartMetadataSchema[source]#
Bases:
BaseModelNoExt
The schema for extracted chart content.
Show JSON schema
{ "title": "ChartMetadataSchema", "description": "The schema for extracted chart content.", "type": "object", "properties": { "caption": { "default": "", "title": "Caption", "type": "string" }, "table_format": { "$ref": "#/$defs/TableFormatEnum" }, "table_content": { "default": "", "title": "Table Content", "type": "string" }, "table_content_format": { "anyOf": [ { "$ref": "#/$defs/TableFormatEnum" }, { "type": "string" } ], "default": "", "title": "Table Content Format" }, "table_location": { "default": [ 0, 0, 0, 0 ], "items": {}, "title": "Table Location", "type": "array" }, "table_location_max_dimensions": { "default": [ 0, 0 ], "items": {}, "title": "Table Location Max Dimensions", "type": "array" }, "uploaded_image_uri": { "default": "", "title": "Uploaded Image Uri", "type": "string" }, "custom_content": { "anyOf": [ { "additionalProperties": true, "type": "object" }, { "type": "null" } ], "default": null, "title": "Custom Content" } }, "$defs": { "TableFormatEnum": { "description": "Enum for representing table formats.\n\nAttributes\n----------\nHTML : str\n Represents HTML table format.\nIMAGE : str\n Represents image table format.\nLATEX : str\n Represents LaTeX table format.\nMARKDOWN : str\n Represents Markdown table format.\nPSEUDO_MARKDOWN : str\n Represents pseudo Markdown table format.\nSIMPLE : str\n Represents simple table format.", "enum": [ "html", "image", "latex", "markdown", "pseudo_markdown", "simple" ], "title": "TableFormatEnum", "type": "string" } }, "additionalProperties": false, "required": [ "table_format" ] }
- Config:
extra: str = forbid
- Fields:
- field caption: str = ''#
The caption for the chart.
- field custom_content: Dict[str, Any] | None = None#
- field table_content: str = ''#
Extracted text content, formatted according to chart_metadata.table_format.
- field table_content_format: TableFormatEnum | str = ''#
- field table_format: TableFormatEnum [Required]#
The format of the table. One of the TableFormatEnum values: html, image, latex, markdown, pseudo_markdown, or simple (cells separated by spaces).
- field table_location: tuple = (0, 0, 0, 0)#
The bounding box of the chart, in the format (x1,y1,x2,y2).
- field table_location_max_dimensions: tuple = (0, 0)#
The maximum dimensions of the bounding box of the chart, in the format (x_max,y_max).
- field uploaded_image_uri: str = ''#
A mirror of source_metadata.source_location.
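The two location fields can be combined to express a bounding box relative to its containing area. The sketch below assumes that `table_location_max_dimensions` gives the extent of the page or image that `table_location` is measured against, and it treats the default `(0, 0)` as "not recorded"; both are interpretations, not documented behavior. The function `normalize_bbox` is hypothetical.

```python
def normalize_bbox(location, max_dimensions):
    """Scale an (x1, y1, x2, y2) box into the 0..1 range, or return None."""
    x1, y1, x2, y2 = location
    x_max, y_max = max_dimensions
    if x_max <= 0 or y_max <= 0:
        # Treated here as "dimensions not recorded" (the schema default is (0, 0)).
        return None
    return (x1 / x_max, y1 / y_max, x2 / x_max, y2 / y_max)


relative = normalize_bbox((100, 200, 300, 400), (1000, 800))
print(relative)
```

The same arithmetic would apply to `image_location` and `image_location_max_dimensions` on ImageMetadataSchema, which use the same formats.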
- pydantic model nv_ingest_api.internal.schemas.meta.metadata_schema.ContentHierarchySchema[source]#
Bases:
BaseModelNoExt
Schema for the extracted content hierarchy.
Show JSON schema
{ "title": "ContentHierarchySchema", "description": "Schema for the extracted content hierarchy.", "type": "object", "properties": { "page_count": { "default": -1, "title": "Page Count", "type": "integer" }, "page": { "default": -1, "title": "Page", "type": "integer" }, "block": { "default": -1, "title": "Block", "type": "integer" }, "line": { "default": -1, "title": "Line", "type": "integer" }, "span": { "default": -1, "title": "Span", "type": "integer" }, "nearby_objects": { "$ref": "#/$defs/NearbyObjectsSchema", "default": { "text": { "bbox": [], "content": [], "type": [] }, "images": { "bbox": [], "content": [], "type": [] }, "structured": { "bbox": [], "content": [], "type": [] } } } }, "$defs": { "NearbyObjectsSchema": { "additionalProperties": false, "description": "Schema to hold types of related extracted objects.", "properties": { "text": { "$ref": "#/$defs/NearbyObjectsSubSchema", "default": { "content": [], "bbox": [], "type": [] } }, "images": { "$ref": "#/$defs/NearbyObjectsSubSchema", "default": { "content": [], "bbox": [], "type": [] } }, "structured": { "$ref": "#/$defs/NearbyObjectsSubSchema", "default": { "content": [], "bbox": [], "type": [] } } }, "title": "NearbyObjectsSchema", "type": "object" }, "NearbyObjectsSubSchema": { "additionalProperties": false, "description": "Schema to hold related extracted object.", "properties": { "content": { "items": { "type": "string" }, "title": "Content", "type": "array" }, "bbox": { "items": { "items": {}, "type": "array" }, "title": "Bbox", "type": "array" }, "type": { "items": { "type": "string" }, "title": "Type", "type": "array" } }, "title": "NearbyObjectsSubSchema", "type": "object" } }, "additionalProperties": false }
- Config:
extra: str = forbid
- Fields:
- field block: int = -1#
- field line: int = -1#
- field nearby_objects: NearbyObjectsSchema = NearbyObjectsSchema(text=NearbyObjectsSubSchema(content=[], bbox=[], type=[]), images=NearbyObjectsSubSchema(content=[], bbox=[], type=[]), structured=NearbyObjectsSubSchema(content=[], bbox=[], type=[]))#
- field page: int = -1#
- field page_count: int = -1#
- field span: int = -1#
- pydantic model nv_ingest_api.internal.schemas.meta.metadata_schema.ContentMetadataSchema[source]#
Bases:
BaseModelNoExt
Data extracted from a source; generally Text or Image.
Show JSON schema
{ "title": "ContentMetadataSchema", "description": "Data extracted from a source; generally Text or Image.", "type": "object", "properties": { "type": { "$ref": "#/$defs/ContentTypeEnum" }, "description": { "default": "", "title": "Description", "type": "string" }, "page_number": { "default": -1, "title": "Page Number", "type": "integer" }, "hierarchy": { "$ref": "#/$defs/ContentHierarchySchema", "default": { "page_count": -1, "page": -1, "block": -1, "line": -1, "span": -1, "nearby_objects": { "images": { "bbox": [], "content": [], "type": [] }, "structured": { "bbox": [], "content": [], "type": [] }, "text": { "bbox": [], "content": [], "type": [] } } } }, "subtype": { "anyOf": [ { "$ref": "#/$defs/ContentTypeEnum" }, { "type": "string" } ], "default": "", "title": "Subtype" }, "start_time": { "default": -1, "title": "Start Time", "type": "integer" }, "end_time": { "default": -1, "title": "End Time", "type": "integer" }, "custom_content": { "anyOf": [ { "additionalProperties": true, "type": "object" }, { "type": "null" } ], "default": null, "title": "Custom Content" } }, "$defs": { "ContentHierarchySchema": { "additionalProperties": false, "description": "Schema for the extracted content hierarchy.", "properties": { "page_count": { "default": -1, "title": "Page Count", "type": "integer" }, "page": { "default": -1, "title": "Page", "type": "integer" }, "block": { "default": -1, "title": "Block", "type": "integer" }, "line": { "default": -1, "title": "Line", "type": "integer" }, "span": { "default": -1, "title": "Span", "type": "integer" }, "nearby_objects": { "$ref": "#/$defs/NearbyObjectsSchema", "default": { "text": { "bbox": [], "content": [], "type": [] }, "images": { "bbox": [], "content": [], "type": [] }, "structured": { "bbox": [], "content": [], "type": [] } } } }, "title": "ContentHierarchySchema", "type": "object" }, "ContentTypeEnum": { "description": "Enum for representing various content types.\n\nNote: Content type declares the broad category of the 
content, such as text, image, audio, etc.\nThis is not equivalent to the Document type, which is a specific file format.\n\nAttributes\n----------\nAUDIO : str\n Represents audio content.\nEMBEDDING : str\n Represents embedding content.\nIMAGE : str\n Represents image content.\nINFO_MSG : str\n Represents an informational message.\nPAGE_IMAGE : str\n Represents a full-page image rendered from a document.\nSTRUCTURED : str\n Represents structured content.\nTEXT : str\n Represents text content.\nUNSTRUCTURED : str\n Represents unstructured content.\nVIDEO : str\n Represents video content.", "enum": [ "audio", "chart", "embedding", "image", "infographic", "info_message", "none", "page_image", "structured", "table", "text", "unknown", "video" ], "title": "ContentTypeEnum", "type": "string" }, "NearbyObjectsSchema": { "additionalProperties": false, "description": "Schema to hold types of related extracted objects.", "properties": { "text": { "$ref": "#/$defs/NearbyObjectsSubSchema", "default": { "content": [], "bbox": [], "type": [] } }, "images": { "$ref": "#/$defs/NearbyObjectsSubSchema", "default": { "content": [], "bbox": [], "type": [] } }, "structured": { "$ref": "#/$defs/NearbyObjectsSubSchema", "default": { "content": [], "bbox": [], "type": [] } } }, "title": "NearbyObjectsSchema", "type": "object" }, "NearbyObjectsSubSchema": { "additionalProperties": false, "description": "Schema to hold related extracted object.", "properties": { "content": { "items": { "type": "string" }, "title": "Content", "type": "array" }, "bbox": { "items": { "items": {}, "type": "array" }, "title": "Bbox", "type": "array" }, "type": { "items": { "type": "string" }, "title": "Type", "type": "array" } }, "title": "NearbyObjectsSubSchema", "type": "object" } }, "additionalProperties": false, "required": [ "type" ] }
- Config:
extra: str = forbid
- Fields:
- field custom_content: Dict[str, Any] | None = None#
- field description: str = ''#
A text description of the content object.
- field end_time: int = -1#
The timestamp of the end of a piece of audio content.
- field hierarchy: ContentHierarchySchema = ContentHierarchySchema(page_count=-1, page=-1, block=-1, line=-1, span=-1, nearby_objects=NearbyObjectsSchema(text=NearbyObjectsSubSchema(content=[], bbox=[], type=[]), images=NearbyObjectsSubSchema(content=[], bbox=[], type=[]), structured=NearbyObjectsSubSchema(content=[], bbox=[], type=[])))#
The location or order of the content within the source.
- field page_number: int = -1#
The page number of the content in the source.
- field start_time: int = -1#
The timestamp of the start of a piece of audio content.
- field subtype: ContentTypeEnum | str = ''#
The type of the content for structured data types, such as table or chart.
- field type: ContentTypeEnum [Required]#
The type of the content, such as text, image, structured, table, or chart.
- pydantic model nv_ingest_api.internal.schemas.meta.metadata_schema.ErrorMetadataSchema[source]#
Bases:
BaseModelNoExt
Show JSON schema
{ "title": "ErrorMetadataSchema", "type": "object", "properties": { "task": { "$ref": "#/$defs/TaskTypeEnum" }, "status": { "$ref": "#/$defs/StatusEnum" }, "source_id": { "default": "", "title": "Source Id", "type": "string" }, "error_msg": { "title": "Error Msg", "type": "string" }, "custom_content": { "anyOf": [ { "additionalProperties": true, "type": "object" }, { "type": "null" } ], "default": null, "title": "Custom Content" } }, "$defs": { "StatusEnum": { "description": "Enum for representing status messages.\n\nAttributes\n----------\nERROR : str\n Represents an error status.\nSUCCESS : str\n Represents a success status.", "enum": [ "error", "success" ], "title": "StatusEnum", "type": "string" }, "TaskTypeEnum": { "description": "Enum for representing various task types.\n\nAttributes\n----------\nCAPTION : str\n Represents a caption task.\nDEDUP : str\n Represents a deduplication task.\nEMBED : str\n Represents an embedding task.\nEXTRACT : str\n Represents an extraction task.\nFILTER : str\n Represents a filtering task.\nSPLIT : str\n Represents a splitting task.\nSTORE : str\n Represents a storing task.\nSTORE_EMBEDDING : str\n Represents a task for storing embeddings.\nVDB_UPLOAD : str\n Represents a task for uploading to a vector database.\nAUDIO_DATA_EXTRACT : str\n Represents a task for extracting audio data.\nTABLE_DATA_EXTRACT : str\n Represents a task for extracting table data.\nCHART_DATA_EXTRACT : str\n Represents a task for extracting chart data.\nINFOGRAPHIC_DATA_EXTRACT : str\n Represents a task for extracting infographic data.\nUDF : str\n Represents a user-defined function task.", "enum": [ "audio_data_extract", "caption", "chart_data_extract", "dedup", "embed", "extract", "filter", "infographic_data_extract", "split", "store_embedding", "store", "table_data_extract", "udf", "vdb_upload" ], "title": "TaskTypeEnum", "type": "string" } }, "additionalProperties": false, "required": [ "task", "status", "error_msg" ] }
- Config:
extra: str = forbid
- Fields:
- field custom_content: Dict[str, Any] | None = None#
- field error_msg: str [Required]#
- field source_id: str = ''#
- field status: StatusEnum [Required]#
- field task: TaskTypeEnum [Required]#
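A record matching these fields could be assembled from a caught exception as sketched below. The helper `error_metadata` is hypothetical, and the "ExceptionName: message" layout of `error_msg` is only one possible convention; the schema requires a string and nothing more.

```python
def error_metadata(task: str, source_id: str, exc: Exception) -> dict:
    """Hypothetical helper: describe a failure using the ErrorMetadataSchema fields."""
    return {
        "task": task,          # must be one of the TaskTypeEnum values
        "status": "error",     # StatusEnum allows "error" or "success"
        "source_id": source_id,
        "error_msg": f"{type(exc).__name__}: {exc}",
    }


try:
    int("not a number")
except ValueError as exc:
    record = error_metadata("extract", "doc-001", exc)

print(record["error_msg"])
```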
- pydantic model nv_ingest_api.internal.schemas.meta.metadata_schema.ImageMetadataSchema[source]#
Bases:
BaseModelNoExt
The schema for the extracted image content.
Show JSON schema
{ "title": "ImageMetadataSchema", "description": "The schema for the extracted image content.", "type": "object", "properties": { "image_type": { "anyOf": [ { "$ref": "#/$defs/DocumentTypeEnum" }, { "type": "string" } ], "title": "Image Type" }, "structured_image_type": { "$ref": "#/$defs/ContentTypeEnum", "default": "none" }, "caption": { "default": "", "title": "Caption", "type": "string" }, "text": { "default": "", "title": "Text", "type": "string" }, "image_location": { "default": [ 0, 0, 0, 0 ], "items": {}, "title": "Image Location", "type": "array" }, "image_location_max_dimensions": { "default": [ 0, 0 ], "items": {}, "title": "Image Location Max Dimensions", "type": "array" }, "uploaded_image_url": { "default": "", "title": "Uploaded Image Url", "type": "string" }, "width": { "default": 0, "title": "Width", "type": "integer" }, "height": { "default": 0, "title": "Height", "type": "integer" }, "custom_content": { "anyOf": [ { "additionalProperties": true, "type": "object" }, { "type": "null" } ], "default": null, "title": "Custom Content" } }, "$defs": { "ContentTypeEnum": { "description": "Enum for representing various content types.\n\nNote: Content type declares the broad category of the content, such as text, image, audio, etc.\nThis is not equivalent to the Document type, which is a specific file format.\n\nAttributes\n----------\nAUDIO : str\n Represents audio content.\nEMBEDDING : str\n Represents embedding content.\nIMAGE : str\n Represents image content.\nINFO_MSG : str\n Represents an informational message.\nPAGE_IMAGE : str\n Represents a full-page image rendered from a document.\nSTRUCTURED : str\n Represents structured content.\nTEXT : str\n Represents text content.\nUNSTRUCTURED : str\n Represents unstructured content.\nVIDEO : str\n Represents video content.", "enum": [ "audio", "chart", "embedding", "image", "infographic", "info_message", "none", "page_image", "structured", "table", "text", "unknown", "video" ], "title": "ContentTypeEnum", 
"type": "string" }, "DocumentTypeEnum": { "description": "Enum for representing various document file types.\n\nNote: Document type refers to the specific file format of the content, such as PDF, DOCX, etc.\nThis is not equivalent to the Content type, which is a broad category of the content.\n\nAttributes\n----------\nBMP: str\n BMP image format.\nDOCX: str\n Microsoft Word document format.\nHTML: str\n HTML document.\nJPEG: str\n JPEG image format.\nPDF: str\n PDF document format.\nPNG: str\n PNG image format.\nPPTX: str\n PowerPoint presentation format.\nSVG: str\n SVG image format.\nTIFF: str\n TIFF image format.\nTXT: str\n Plain text file.\nMP3: str\n MP3 audio format.\nWAV: str\n WAV audio format.", "enum": [ "bmp", "docx", "html", "jpeg", "pdf", "png", "pptx", "svg", "tiff", "text", "text", "mp3", "wav", "unknown" ], "title": "DocumentTypeEnum", "type": "string" } }, "additionalProperties": false, "required": [ "image_type" ] }
- Config:
extra: str = forbid
- Fields:
- Validators:
- field caption: str = ''#
A caption or subheading associated with the image.
- field custom_content: Dict[str, Any] | None = None#
- field height: int = 0#
The height of the image.
- Validated by:
- field image_location: tuple = (0, 0, 0, 0)#
The bounding box of the image, in the format (x1,y1,x2,y2).
- field image_location_max_dimensions: tuple = (0, 0)#
The maximum dimensions of the bounding box of the image, in the format (x_max,y_max).
- field image_type: DocumentTypeEnum | str [Required]#
The type of the image, such as structured, natural, hybrid, and others.
- Validated by:
validate_image_type
- field structured_image_type: ContentTypeEnum = ContentTypeEnum.NONE#
The type of the content for structured data types, such as bar chart, pie chart, and others.
- field text: str = ''#
Extracted text from a structured chart.
- field uploaded_image_url: str = ''#
A mirror of source_metadata.source_location.
- field width: int = 0#
The width of the image.
- Validated by:
- validator validate_image_type » image_type[source]#
- pydantic model nv_ingest_api.internal.schemas.meta.metadata_schema.InfoMessageMetadataSchema[source]#
Bases:
BaseModelNoExt
Show JSON schema
{ "title": "InfoMessageMetadataSchema", "type": "object", "properties": { "task": { "$ref": "#/$defs/TaskTypeEnum" }, "status": { "$ref": "#/$defs/StatusEnum" }, "message": { "title": "Message", "type": "string" }, "filter": { "title": "Filter", "type": "boolean" }, "custom_content": { "anyOf": [ { "additionalProperties": true, "type": "object" }, { "type": "null" } ], "default": null, "title": "Custom Content" } }, "$defs": { "StatusEnum": { "description": "Enum for representing status messages.\n\nAttributes\n----------\nERROR : str\n Represents an error status.\nSUCCESS : str\n Represents a success status.", "enum": [ "error", "success" ], "title": "StatusEnum", "type": "string" }, "TaskTypeEnum": { "description": "Enum for representing various task types.\n\nAttributes\n----------\nCAPTION : str\n Represents a caption task.\nDEDUP : str\n Represents a deduplication task.\nEMBED : str\n Represents an embedding task.\nEXTRACT : str\n Represents an extraction task.\nFILTER : str\n Represents a filtering task.\nSPLIT : str\n Represents a splitting task.\nSTORE : str\n Represents a storing task.\nSTORE_EMBEDDING : str\n Represents a task for storing embeddings.\nVDB_UPLOAD : str\n Represents a task for uploading to a vector database.\nAUDIO_DATA_EXTRACT : str\n Represents a task for extracting audio data.\nTABLE_DATA_EXTRACT : str\n Represents a task for extracting table data.\nCHART_DATA_EXTRACT : str\n Represents a task for extracting chart data.\nINFOGRAPHIC_DATA_EXTRACT : str\n Represents a task for extracting infographic data.\nUDF : str\n Represents a user-defined function task.", "enum": [ "audio_data_extract", "caption", "chart_data_extract", "dedup", "embed", "extract", "filter", "infographic_data_extract", "split", "store_embedding", "store", "table_data_extract", "udf", "vdb_upload" ], "title": "TaskTypeEnum", "type": "string" } }, "additionalProperties": false, "required": [ "task", "status", "message", "filter" ] }
- Config:
extra: str = forbid
- Fields:
- field custom_content: Dict[str, Any] | None = None#
- field filter: bool [Required]#
- field message: str [Required]#
- field status: StatusEnum [Required]#
- field task: TaskTypeEnum [Required]#
- pydantic model nv_ingest_api.internal.schemas.meta.metadata_schema.MetadataSchema[source]#
Bases:
BaseModelNoExt
The primary container schema for extraction results.
Show JSON schema
{ "title": "MetadataSchema", "description": "The primary container schema for extraction results.", "type": "object", "properties": { "content": { "default": "", "title": "Content", "type": "string" }, "content_url": { "default": "", "title": "Content Url", "type": "string" }, "embedding": { "anyOf": [ { "items": { "type": "number" }, "type": "array" }, { "type": "null" } ], "default": null, "title": "Embedding" }, "source_metadata": { "anyOf": [ { "$ref": "#/$defs/SourceMetadataSchema" }, { "type": "null" } ], "default": null }, "content_metadata": { "anyOf": [ { "$ref": "#/$defs/ContentMetadataSchema" }, { "type": "null" } ], "default": null }, "audio_metadata": { "anyOf": [ { "$ref": "#/$defs/AudioMetadataSchema" }, { "type": "null" } ], "default": null }, "text_metadata": { "anyOf": [ { "$ref": "#/$defs/TextMetadataSchema" }, { "type": "null" } ], "default": null }, "image_metadata": { "anyOf": [ { "$ref": "#/$defs/ImageMetadataSchema" }, { "type": "null" } ], "default": null }, "table_metadata": { "anyOf": [ { "$ref": "#/$defs/TableMetadataSchema" }, { "type": "null" } ], "default": null }, "chart_metadata": { "anyOf": [ { "$ref": "#/$defs/ChartMetadataSchema" }, { "type": "null" } ], "default": null }, "error_metadata": { "anyOf": [ { "$ref": "#/$defs/ErrorMetadataSchema" }, { "type": "null" } ], "default": null }, "info_message_metadata": { "anyOf": [ { "$ref": "#/$defs/InfoMessageMetadataSchema" }, { "type": "null" } ], "default": null }, "debug_metadata": { "anyOf": [ { "additionalProperties": true, "type": "object" }, { "type": "null" } ], "default": null, "title": "Debug Metadata" }, "raise_on_failure": { "default": false, "title": "Raise On Failure", "type": "boolean" }, "custom_content": { "anyOf": [ { "additionalProperties": true, "type": "object" }, { "type": "null" } ], "default": null, "title": "Custom Content" } }, "$defs": { "AccessLevelEnum": { "description": "Note\n----\nThis is for future use, and currently has no functional use case.\n\nEnum 
for representing different access levels.\n\nAttributes\n----------\nLEVEL_1 : int\n Represents access level 1.\nLEVEL_2 : int\n Represents access level 2.\nLEVEL_3 : int\n Represents access level 3.", "enum": [ -1, 1, 2, 3 ], "title": "AccessLevelEnum", "type": "integer" }, "AudioMetadataSchema": { "additionalProperties": false, "description": "The schema for extracted audio content.", "properties": { "audio_transcript": { "default": "", "title": "Audio Transcript", "type": "string" }, "audio_type": { "default": "", "title": "Audio Type", "type": "string" }, "custom_content": { "anyOf": [ { "additionalProperties": true, "type": "object" }, { "type": "null" } ], "default": null, "title": "Custom Content" } }, "title": "AudioMetadataSchema", "type": "object" }, "ChartMetadataSchema": { "additionalProperties": false, "description": "The schema for extracted chart content.", "properties": { "caption": { "default": "", "title": "Caption", "type": "string" }, "table_format": { "$ref": "#/$defs/TableFormatEnum" }, "table_content": { "default": "", "title": "Table Content", "type": "string" }, "table_content_format": { "anyOf": [ { "$ref": "#/$defs/TableFormatEnum" }, { "type": "string" } ], "default": "", "title": "Table Content Format" }, "table_location": { "default": [ 0, 0, 0, 0 ], "items": {}, "title": "Table Location", "type": "array" }, "table_location_max_dimensions": { "default": [ 0, 0 ], "items": {}, "title": "Table Location Max Dimensions", "type": "array" }, "uploaded_image_uri": { "default": "", "title": "Uploaded Image Uri", "type": "string" }, "custom_content": { "anyOf": [ { "additionalProperties": true, "type": "object" }, { "type": "null" } ], "default": null, "title": "Custom Content" } }, "required": [ "table_format" ], "title": "ChartMetadataSchema", "type": "object" }, "ContentHierarchySchema": { "additionalProperties": false, "description": "Schema for the extracted content hierarchy.", "properties": { "page_count": { "default": -1, "title": "Page 
Count", "type": "integer" }, "page": { "default": -1, "title": "Page", "type": "integer" }, "block": { "default": -1, "title": "Block", "type": "integer" }, "line": { "default": -1, "title": "Line", "type": "integer" }, "span": { "default": -1, "title": "Span", "type": "integer" }, "nearby_objects": { "$ref": "#/$defs/NearbyObjectsSchema", "default": { "text": { "bbox": [], "content": [], "type": [] }, "images": { "bbox": [], "content": [], "type": [] }, "structured": { "bbox": [], "content": [], "type": [] } } } }, "title": "ContentHierarchySchema", "type": "object" }, "ContentMetadataSchema": { "additionalProperties": false, "description": "Data extracted from a source; generally Text or Image.", "properties": { "type": { "$ref": "#/$defs/ContentTypeEnum" }, "description": { "default": "", "title": "Description", "type": "string" }, "page_number": { "default": -1, "title": "Page Number", "type": "integer" }, "hierarchy": { "$ref": "#/$defs/ContentHierarchySchema", "default": { "page_count": -1, "page": -1, "block": -1, "line": -1, "span": -1, "nearby_objects": { "images": { "bbox": [], "content": [], "type": [] }, "structured": { "bbox": [], "content": [], "type": [] }, "text": { "bbox": [], "content": [], "type": [] } } } }, "subtype": { "anyOf": [ { "$ref": "#/$defs/ContentTypeEnum" }, { "type": "string" } ], "default": "", "title": "Subtype" }, "start_time": { "default": -1, "title": "Start Time", "type": "integer" }, "end_time": { "default": -1, "title": "End Time", "type": "integer" }, "custom_content": { "anyOf": [ { "additionalProperties": true, "type": "object" }, { "type": "null" } ], "default": null, "title": "Custom Content" } }, "required": [ "type" ], "title": "ContentMetadataSchema", "type": "object" }, "ContentTypeEnum": { "description": "Enum for representing various content types.\n\nNote: Content type declares the broad category of the content, such as text, image, audio, etc.\nThis is not equivalent to the Document type, which is a specific 
file format.\n\nAttributes\n----------\nAUDIO : str\n Represents audio content.\nEMBEDDING : str\n Represents embedding content.\nIMAGE : str\n Represents image content.\nINFO_MSG : str\n Represents an informational message.\nPAGE_IMAGE : str\n Represents a full-page image rendered from a document.\nSTRUCTURED : str\n Represents structured content.\nTEXT : str\n Represents text content.\nUNSTRUCTURED : str\n Represents unstructured content.\nVIDEO : str\n Represents video content.", "enum": [ "audio", "chart", "embedding", "image", "infographic", "info_message", "none", "page_image", "structured", "table", "text", "unknown", "video" ], "title": "ContentTypeEnum", "type": "string" }, "DocumentTypeEnum": { "description": "Enum for representing various document file types.\n\nNote: Document type refers to the specific file format of the content, such as PDF, DOCX, etc.\nThis is not equivalent to the Content type, which is a broad category of the content.\n\nAttributes\n----------\nBMP: str\n BMP image format.\nDOCX: str\n Microsoft Word document format.\nHTML: str\n HTML document.\nJPEG: str\n JPEG image format.\nPDF: str\n PDF document format.\nPNG: str\n PNG image format.\nPPTX: str\n PowerPoint presentation format.\nSVG: str\n SVG image format.\nTIFF: str\n TIFF image format.\nTXT: str\n Plain text file.\nMP3: str\n MP3 audio format.\nWAV: str\n WAV audio format.", "enum": [ "bmp", "docx", "html", "jpeg", "pdf", "png", "pptx", "svg", "tiff", "text", "text", "mp3", "wav", "unknown" ], "title": "DocumentTypeEnum", "type": "string" }, "ErrorMetadataSchema": { "additionalProperties": false, "properties": { "task": { "$ref": "#/$defs/TaskTypeEnum" }, "status": { "$ref": "#/$defs/StatusEnum" }, "source_id": { "default": "", "title": "Source Id", "type": "string" }, "error_msg": { "title": "Error Msg", "type": "string" }, "custom_content": { "anyOf": [ { "additionalProperties": true, "type": "object" }, { "type": "null" } ], "default": null, "title": "Custom Content" } }, 
"required": [ "task", "status", "error_msg" ], "title": "ErrorMetadataSchema", "type": "object" }, "ImageMetadataSchema": { "additionalProperties": false, "description": "The schema for the extracted image content.", "properties": { "image_type": { "anyOf": [ { "$ref": "#/$defs/DocumentTypeEnum" }, { "type": "string" } ], "title": "Image Type" }, "structured_image_type": { "$ref": "#/$defs/ContentTypeEnum", "default": "none" }, "caption": { "default": "", "title": "Caption", "type": "string" }, "text": { "default": "", "title": "Text", "type": "string" }, "image_location": { "default": [ 0, 0, 0, 0 ], "items": {}, "title": "Image Location", "type": "array" }, "image_location_max_dimensions": { "default": [ 0, 0 ], "items": {}, "title": "Image Location Max Dimensions", "type": "array" }, "uploaded_image_url": { "default": "", "title": "Uploaded Image Url", "type": "string" }, "width": { "default": 0, "title": "Width", "type": "integer" }, "height": { "default": 0, "title": "Height", "type": "integer" }, "custom_content": { "anyOf": [ { "additionalProperties": true, "type": "object" }, { "type": "null" } ], "default": null, "title": "Custom Content" } }, "required": [ "image_type" ], "title": "ImageMetadataSchema", "type": "object" }, "InfoMessageMetadataSchema": { "additionalProperties": false, "properties": { "task": { "$ref": "#/$defs/TaskTypeEnum" }, "status": { "$ref": "#/$defs/StatusEnum" }, "message": { "title": "Message", "type": "string" }, "filter": { "title": "Filter", "type": "boolean" }, "custom_content": { "anyOf": [ { "additionalProperties": true, "type": "object" }, { "type": "null" } ], "default": null, "title": "Custom Content" } }, "required": [ "task", "status", "message", "filter" ], "title": "InfoMessageMetadataSchema", "type": "object" }, "LanguageEnum": { "description": "Enum for representing various language codes.\n\nAttributes\n----------\nAF : str\n Afrikaans language code.\nAR : str\n Arabic language code.\nBG : str\n Bulgarian language 
code.\nBN : str\n Bengali language code.\nCA : str\n Catalan language code.\nCS : str\n Czech language code.\nCY : str\n Welsh language code.\nDA : str\n Danish language code.\nDE : str\n German language code.\nEL : str\n Greek language code.\nEN : str\n English language code.\nES : str\n Spanish language code.\nET : str\n Estonian language code.\nFA : str\n Persian language code.\nFI : str\n Finnish language code.\nFR : str\n French language code.\nGU : str\n Gujarati language code.\nHE : str\n Hebrew language code.\nHI : str\n Hindi language code.\nHR : str\n Croatian language code.\nHU : str\n Hungarian language code.\nID : str\n Indonesian language code.\nIT : str\n Italian language code.\nJA : str\n Japanese language code.\nKN : str\n Kannada language code.\nKO : str\n Korean language code.\nLT : str\n Lithuanian language code.\nLV : str\n Latvian language code.\nMK : str\n Macedonian language code.\nML : str\n Malayalam language code.\nMR : str\n Marathi language code.\nNE : str\n Nepali language code.\nNL : str\n Dutch language code.\nNO : str\n Norwegian language code.\nPA : str\n Punjabi language code.\nPL : str\n Polish language code.\nPT : str\n Portuguese language code.\nRO : str\n Romanian language code.\nRU : str\n Russian language code.\nSK : str\n Slovak language code.\nSL : str\n Slovenian language code.\nSO : str\n Somali language code.\nSQ : str\n Albanian language code.\nSV : str\n Swedish language code.\nSW : str\n Swahili language code.\nTA : str\n Tamil language code.\nTE : str\n Telugu language code.\nTH : str\n Thai language code.\nTL : str\n Tagalog language code.\nTR : str\n Turkish language code.\nUK : str\n Ukrainian language code.\nUR : str\n Urdu language code.\nVI : str\n Vietnamese language code.\nZH_CN : str\n Chinese (Simplified) language code.\nZH_TW : str\n Chinese (Traditional) language code.\nUNKNOWN : str\n Represents an unknown language.", "enum": [ "af", "ar", "bg", "bn", "ca", "cs", "cy", "da", "de", "el", "en", "es", 
"et", "fa", "fi", "fr", "gu", "he", "hi", "hr", "hu", "id", "it", "ja", "kn", "ko", "lt", "lv", "mk", "ml", "mr", "ne", "nl", "no", "pa", "pl", "pt", "ro", "ru", "sk", "sl", "so", "sq", "sv", "sw", "ta", "te", "th", "tl", "tr", "uk", "ur", "vi", "zh-cn", "zh-tw", "unknown" ], "title": "LanguageEnum", "type": "string" }, "NearbyObjectsSchema": { "additionalProperties": false, "description": "Schema to hold types of related extracted objects.", "properties": { "text": { "$ref": "#/$defs/NearbyObjectsSubSchema", "default": { "content": [], "bbox": [], "type": [] } }, "images": { "$ref": "#/$defs/NearbyObjectsSubSchema", "default": { "content": [], "bbox": [], "type": [] } }, "structured": { "$ref": "#/$defs/NearbyObjectsSubSchema", "default": { "content": [], "bbox": [], "type": [] } } }, "title": "NearbyObjectsSchema", "type": "object" }, "NearbyObjectsSubSchema": { "additionalProperties": false, "description": "Schema to hold related extracted object.", "properties": { "content": { "items": { "type": "string" }, "title": "Content", "type": "array" }, "bbox": { "items": { "items": {}, "type": "array" }, "title": "Bbox", "type": "array" }, "type": { "items": { "type": "string" }, "title": "Type", "type": "array" } }, "title": "NearbyObjectsSubSchema", "type": "object" }, "SourceMetadataSchema": { "additionalProperties": false, "description": "Schema for the knowledge base file from which content\nand metadata is extracted.", "properties": { "source_name": { "title": "Source Name", "type": "string" }, "source_id": { "title": "Source Id", "type": "string" }, "source_location": { "default": "", "title": "Source Location", "type": "string" }, "source_type": { "anyOf": [ { "$ref": "#/$defs/DocumentTypeEnum" }, { "type": "string" } ], "title": "Source Type" }, "collection_id": { "default": "", "title": "Collection Id", "type": "string" }, "date_created": { "default": "2025-09-17T20:05:30.782143", "title": "Date Created", "type": "string" }, "last_modified": { "default": 
"2025-09-17T20:05:30.782152", "title": "Last Modified", "type": "string" }, "summary": { "default": "", "title": "Summary", "type": "string" }, "partition_id": { "default": -1, "title": "Partition Id", "type": "integer" }, "access_level": { "anyOf": [ { "$ref": "#/$defs/AccessLevelEnum" }, { "type": "integer" } ], "default": -1, "title": "Access Level" }, "custom_content": { "anyOf": [ { "additionalProperties": true, "type": "object" }, { "type": "null" } ], "default": null, "title": "Custom Content" } }, "required": [ "source_name", "source_id", "source_type" ], "title": "SourceMetadataSchema", "type": "object" }, "StatusEnum": { "description": "Enum for representing status messages.\n\nAttributes\n----------\nERROR : str\n Represents an error status.\nSUCCESS : str\n Represents a success status.", "enum": [ "error", "success" ], "title": "StatusEnum", "type": "string" }, "TableFormatEnum": { "description": "Enum for representing table formats.\n\nAttributes\n----------\nHTML : str\n Represents HTML table format.\nIMAGE : str\n Represents image table format.\nLATEX : str\n Represents LaTeX table format.\nMARKDOWN : str\n Represents Markdown table format.\nPSEUDO_MARKDOWN : str\n Represents pseudo Markdown table format.\nSIMPLE : str\n Represents simple table format.", "enum": [ "html", "image", "latex", "markdown", "pseudo_markdown", "simple" ], "title": "TableFormatEnum", "type": "string" }, "TableMetadataSchema": { "additionalProperties": false, "description": "The schema for the extracted table content.", "properties": { "caption": { "default": "", "title": "Caption", "type": "string" }, "table_format": { "$ref": "#/$defs/TableFormatEnum" }, "table_content": { "default": "", "title": "Table Content", "type": "string" }, "table_content_format": { "anyOf": [ { "$ref": "#/$defs/TableFormatEnum" }, { "type": "string" } ], "default": "", "title": "Table Content Format" }, "table_location": { "default": [ 0, 0, 0, 0 ], "items": {}, "title": "Table Location", "type": 
"array" }, "table_location_max_dimensions": { "default": [ 0, 0 ], "items": {}, "title": "Table Location Max Dimensions", "type": "array" }, "uploaded_image_uri": { "default": "", "title": "Uploaded Image Uri", "type": "string" }, "custom_content": { "anyOf": [ { "additionalProperties": true, "type": "object" }, { "type": "null" } ], "default": null, "title": "Custom Content" } }, "required": [ "table_format" ], "title": "TableMetadataSchema", "type": "object" }, "TaskTypeEnum": { "description": "Enum for representing various task types.\n\nAttributes\n----------\nCAPTION : str\n Represents a caption task.\nDEDUP : str\n Represents a deduplication task.\nEMBED : str\n Represents an embedding task.\nEXTRACT : str\n Represents an extraction task.\nFILTER : str\n Represents a filtering task.\nSPLIT : str\n Represents a splitting task.\nSTORE : str\n Represents a storing task.\nSTORE_EMBEDDING : str\n Represents a task for storing embeddings.\nVDB_UPLOAD : str\n Represents a task for uploading to a vector database.\nAUDIO_DATA_EXTRACT : str\n Represents a task for extracting audio data.\nTABLE_DATA_EXTRACT : str\n Represents a task for extracting table data.\nCHART_DATA_EXTRACT : str\n Represents a task for extracting chart data.\nINFOGRAPHIC_DATA_EXTRACT : str\n Represents a task for extracting infographic data.\nUDF : str\n Represents a user-defined function task.", "enum": [ "audio_data_extract", "caption", "chart_data_extract", "dedup", "embed", "extract", "filter", "infographic_data_extract", "split", "store_embedding", "store", "table_data_extract", "udf", "vdb_upload" ], "title": "TaskTypeEnum", "type": "string" }, "TextMetadataSchema": { "additionalProperties": false, "description": "The schema for the extracted text content.", "properties": { "text_type": { "$ref": "#/$defs/TextTypeEnum" }, "summary": { "default": "", "title": "Summary", "type": "string" }, "keywords": { "anyOf": [ { "type": "string" }, { "items": { "type": "string" }, "type": "array" }, { 
"additionalProperties": true, "type": "object" } ], "default": "", "title": "Keywords" }, "language": { "$ref": "#/$defs/LanguageEnum", "default": "en" }, "text_location": { "default": [ 0, 0, 0, 0 ], "items": {}, "title": "Text Location", "type": "array" }, "text_location_max_dimensions": { "default": [ 0, 0 ], "items": {}, "title": "Text Location Max Dimensions", "type": "array" }, "custom_content": { "anyOf": [ { "additionalProperties": true, "type": "object" }, { "type": "null" } ], "default": null, "title": "Custom Content" } }, "required": [ "text_type" ], "title": "TextMetadataSchema", "type": "object" }, "TextTypeEnum": { "description": "Enum for representing different types of text segments.\n\nAttributes\n----------\nBLOCK : str\n Represents a text block.\nBODY : str\n Represents body text.\nDOCUMENT : str\n Represents an entire document.\nHEADER : str\n Represents a header text.\nLINE : str\n Represents a single line of text.\nNEARBY_BLOCK : str\n Represents a block of text in close proximity to another.\nOTHER : str\n Represents other unspecified text type.\nPAGE : str\n Represents a page of text.\nSPAN : str\n Represents an inline text span.", "enum": [ "block", "body", "document", "header", "line", "nearby_block", "other", "page", "span" ], "title": "TextTypeEnum", "type": "string" } }, "additionalProperties": false }
- Config:
extra: str = forbid
- Fields:
audio_metadata (nv_ingest_api.internal.schemas.meta.metadata_schema.AudioMetadataSchema | None)
chart_metadata (nv_ingest_api.internal.schemas.meta.metadata_schema.ChartMetadataSchema | None)
content_metadata (nv_ingest_api.internal.schemas.meta.metadata_schema.ContentMetadataSchema | None)
error_metadata (nv_ingest_api.internal.schemas.meta.metadata_schema.ErrorMetadataSchema | None)
image_metadata (nv_ingest_api.internal.schemas.meta.metadata_schema.ImageMetadataSchema | None)
source_metadata (nv_ingest_api.internal.schemas.meta.metadata_schema.SourceMetadataSchema | None)
table_metadata (nv_ingest_api.internal.schemas.meta.metadata_schema.TableMetadataSchema | None)
text_metadata (nv_ingest_api.internal.schemas.meta.metadata_schema.TextMetadataSchema | None)
- Validators:
check_metadata_type » all fields
- field audio_metadata: AudioMetadataSchema | None = None#
Specific metadata for audio content. Automatically set to None if content_metadata.type is not AUDIO.
- Validated by: check_metadata_type
- field chart_metadata: ChartMetadataSchema | None = None#
Specific metadata for chart content. Automatically set to None if content_metadata.type is not STRUCTURED.
- Validated by: check_metadata_type
- field content: str = ''#
The actual textual content extracted from the source.
- Validated by: check_metadata_type
- field content_metadata: ContentMetadataSchema | None = None#
General metadata about the extracted content itself.
- Validated by: check_metadata_type
- field content_url: str = ''#
A URL that points to the location of the content, if applicable.
- Validated by: check_metadata_type
- field custom_content: Dict[str, Any] | None = None#
- Validated by: check_metadata_type
- field debug_metadata: Dict[str, Any] | None = None#
A dictionary for storing any arbitrary debug information.
- Validated by: check_metadata_type
- field embedding: List[float] | None = None#
An optional numerical vector representation (embedding) of the content.
- Validated by: check_metadata_type
- field error_metadata: ErrorMetadataSchema | None = None#
Metadata that describes any errors encountered during processing.
- Validated by: check_metadata_type
- field image_metadata: ImageMetadataSchema | None = None#
Specific metadata for image content. Automatically set to None if content_metadata.type is not IMAGE.
- Validated by: check_metadata_type
- field info_message_metadata: InfoMessageMetadataSchema | None = None#
Informational messages related to the processing.
- Validated by: check_metadata_type
- field raise_on_failure: bool = False#
If True, indicates that processing should halt on failure.
- Validated by: check_metadata_type
- field source_metadata: SourceMetadataSchema | None = None#
Metadata about the original source of the content.
- Validated by: check_metadata_type
- field table_metadata: TableMetadataSchema | None = None#
Specific metadata for tabular content. Automatically set to None if content_metadata.type is not STRUCTURED.
- Validated by: check_metadata_type
- field text_metadata: TextMetadataSchema | None = None#
Specific metadata for text content. Automatically set to None if content_metadata.type is not TEXT.
- Validated by: check_metadata_type
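The type-gating rule stated in the field descriptions above (a type-specific metadata block is set to None when content_metadata.type does not match) can be sketched in plain Python. This is an illustration only, not the library's check_metadata_type validator: the helper name gate_type_specific_metadata and the REQUIRED_CONTENT_TYPE mapping are hypothetical and are derived solely from the field descriptions on this page.

```python
from typing import Any, Dict

# Assumption: this mapping is read off the field descriptions above,
# not taken from the library source.
REQUIRED_CONTENT_TYPE = {
    "audio_metadata": "audio",
    "image_metadata": "image",
    "text_metadata": "text",
    "table_metadata": "structured",
    "chart_metadata": "structured",
}


def gate_type_specific_metadata(metadata: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy in which mismatched type-specific blocks are None.

    Hypothetical helper that mimics the documented behaviour.
    """
    gated = dict(metadata)
    content_type = (gated.get("content_metadata") or {}).get("type")
    for field, required_type in REQUIRED_CONTENT_TYPE.items():
        if content_type != required_type:
            gated[field] = None
    return gated


record = {
    "content": "Quarterly revenue grew.",
    "content_metadata": {"type": "text"},
    "text_metadata": {"text_type": "body"},
    "image_metadata": {"image_type": "png"},
}
gated = gate_type_specific_metadata(record)
print(gated["text_metadata"])   # kept, because the content type is "text"
print(gated["image_metadata"])  # None, because the content type is not "image"
```

The sketch copies the input rather than mutating it, so the original record stays available for comparison.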
- pydantic model nv_ingest_api.internal.schemas.meta.metadata_schema.NearbyObjectsSchema[source]#
Bases:
BaseModelNoExt
Schema to hold types of related extracted objects.
Show JSON schema
{ "title": "NearbyObjectsSchema", "description": "Schema to hold types of related extracted objects.", "type": "object", "properties": { "text": { "$ref": "#/$defs/NearbyObjectsSubSchema", "default": { "content": [], "bbox": [], "type": [] } }, "images": { "$ref": "#/$defs/NearbyObjectsSubSchema", "default": { "content": [], "bbox": [], "type": [] } }, "structured": { "$ref": "#/$defs/NearbyObjectsSubSchema", "default": { "content": [], "bbox": [], "type": [] } } }, "$defs": { "NearbyObjectsSubSchema": { "additionalProperties": false, "description": "Schema to hold related extracted object.", "properties": { "content": { "items": { "type": "string" }, "title": "Content", "type": "array" }, "bbox": { "items": { "items": {}, "type": "array" }, "title": "Bbox", "type": "array" }, "type": { "items": { "type": "string" }, "title": "Type", "type": "array" } }, "title": "NearbyObjectsSubSchema", "type": "object" } }, "additionalProperties": false }
- Config:
extra: str = forbid
- Fields:
- field images: NearbyObjectsSubSchema = NearbyObjectsSubSchema(content=[], bbox=[], type=[])#
- field structured: NearbyObjectsSubSchema = NearbyObjectsSubSchema(content=[], bbox=[], type=[])#
- field text: NearbyObjectsSubSchema = NearbyObjectsSubSchema(content=[], bbox=[], type=[])#
- pydantic model nv_ingest_api.internal.schemas.meta.metadata_schema.NearbyObjectsSubSchema[source]#
Bases:
BaseModelNoExt
Schema to hold related extracted object.
Show JSON schema
{ "title": "NearbyObjectsSubSchema", "description": "Schema to hold related extracted object.", "type": "object", "properties": { "content": { "items": { "type": "string" }, "title": "Content", "type": "array" }, "bbox": { "items": { "items": {}, "type": "array" }, "title": "Bbox", "type": "array" }, "type": { "items": { "type": "string" }, "title": "Type", "type": "array" } }, "additionalProperties": false }
- Config:
extra: str = forbid
- Fields:
- field bbox: List[tuple] [Optional]#
- field content: List[str] [Optional]#
- field type: List[str] [Optional]#
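As a rough illustration of the two nearby-object schemas above, the following sketch builds the default payload shown in the JSON schema and appends one entry. It uses plain dictionaries rather than the pydantic models. Treating the content, bbox, and type lists as index-aligned is an assumption, and the helper names and the "caption" type value are hypothetical.

```python
from typing import Any, Dict, List, Tuple

NearbyObjects = Dict[str, Dict[str, List[Any]]]


def empty_nearby_objects() -> NearbyObjects:
    """Build the default payload from the JSON schema: three empty groups."""
    return {
        group: {"content": [], "bbox": [], "type": []}
        for group in ("text", "images", "structured")
    }


def add_nearby(
    nearby: NearbyObjects,
    group: str,
    content: str,
    bbox: Tuple[float, ...],
    obj_type: str,
) -> None:
    """Append one object, keeping the three lists the same length.

    Assumption: entry i of each list describes the same object.
    """
    entry = nearby[group]
    entry["content"].append(content)
    entry["bbox"].append(bbox)
    entry["type"].append(obj_type)


nearby = empty_nearby_objects()
add_nearby(nearby, "text", "Figure 1: Revenue by quarter", (72, 540, 300, 560), "caption")
print(nearby["text"])
```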
- pydantic model nv_ingest_api.internal.schemas.meta.metadata_schema.SourceMetadataSchema[source]#
Bases:
BaseModelNoExt
Schema for the knowledge base file from which content and metadata are extracted.
Show JSON schema
{ "title": "SourceMetadataSchema", "description": "Schema for the knowledge base file from which content\nand metadata is extracted.", "type": "object", "properties": { "source_name": { "title": "Source Name", "type": "string" }, "source_id": { "title": "Source Id", "type": "string" }, "source_location": { "default": "", "title": "Source Location", "type": "string" }, "source_type": { "anyOf": [ { "$ref": "#/$defs/DocumentTypeEnum" }, { "type": "string" } ], "title": "Source Type" }, "collection_id": { "default": "", "title": "Collection Id", "type": "string" }, "date_created": { "default": "2025-09-17T20:05:30.782143", "title": "Date Created", "type": "string" }, "last_modified": { "default": "2025-09-17T20:05:30.782152", "title": "Last Modified", "type": "string" }, "summary": { "default": "", "title": "Summary", "type": "string" }, "partition_id": { "default": -1, "title": "Partition Id", "type": "integer" }, "access_level": { "anyOf": [ { "$ref": "#/$defs/AccessLevelEnum" }, { "type": "integer" } ], "default": -1, "title": "Access Level" }, "custom_content": { "anyOf": [ { "additionalProperties": true, "type": "object" }, { "type": "null" } ], "default": null, "title": "Custom Content" } }, "$defs": { "AccessLevelEnum": { "description": "Note\n----\nThis is for future use, and currently has no functional use case.\n\nEnum for representing different access levels.\n\nAttributes\n----------\nLEVEL_1 : int\n Represents access level 1.\nLEVEL_2 : int\n Represents access level 2.\nLEVEL_3 : int\n Represents access level 3.", "enum": [ -1, 1, 2, 3 ], "title": "AccessLevelEnum", "type": "integer" }, "DocumentTypeEnum": { "description": "Enum for representing various document file types.\n\nNote: Document type refers to the specific file format of the content, such as PDF, DOCX, etc.\nThis is not equivalent to the Content type, which is a broad category of the content.\n\nAttributes\n----------\nBMP: str\n BMP image format.\nDOCX: str\n Microsoft Word document 
format.\nHTML: str\n HTML document.\nJPEG: str\n JPEG image format.\nPDF: str\n PDF document format.\nPNG: str\n PNG image format.\nPPTX: str\n PowerPoint presentation format.\nSVG: str\n SVG image format.\nTIFF: str\n TIFF image format.\nTXT: str\n Plain text file.\nMP3: str\n MP3 audio format.\nWAV: str\n WAV audio format.", "enum": [ "bmp", "docx", "html", "jpeg", "pdf", "png", "pptx", "svg", "tiff", "text", "text", "mp3", "wav", "unknown" ], "title": "DocumentTypeEnum", "type": "string" } }, "additionalProperties": false, "required": [ "source_name", "source_id", "source_type" ] }
- Config:
extra: str = forbid
- Fields:
- Validators:
validate_fields » last_modified, date_created
- field access_level: AccessLevelEnum | int = AccessLevelEnum.UNKNOWN#
The role-based access control for the source.
- field collection_id: str = ''#
The ID of the collection in which the source is contained.
- field custom_content: Dict[str, Any] | None = None#
- field date_created: str = '2025-09-17T20:05:30.782143'#
The date the source was created.
- Validated by: validate_fields
- field last_modified: str = '2025-09-17T20:05:30.782152'#
The date the source was last modified.
- Validated by: validate_fields
- field partition_id: int = -1#
The offset of this data fragment within a larger set of fragments.
- field source_id: str [Required]#
The ID of the source file.
- field source_location: str = ''#
The URL, URI, or pointer to the storage location of the source file.
- field source_name: str [Required]#
The name of the source file.
- field source_type: DocumentTypeEnum | str [Required]#
The type of the source file, such as pdf, docx, pptx, or txt.
- field summary: str = ''#
A summary of the source.
- validator validate_fields » last_modified, date_created[source]#
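A minimal sketch of a SourceMetadataSchema payload, using plain dictionaries and the standard library only. The required field names come from the JSON schema above. This page does not say what validate_fields checks, so the ISO 8601 test below is an assumption based on the format of the default values; the helper names and the sample file name are hypothetical.

```python
from datetime import datetime
from typing import Any, Dict, List

# Required fields as listed in the JSON schema above.
REQUIRED_SOURCE_FIELDS = ("source_name", "source_id", "source_type")


def missing_required_fields(payload: Dict[str, Any]) -> List[str]:
    """List the required fields that are absent from the payload."""
    return [name for name in REQUIRED_SOURCE_FIELDS if name not in payload]


def is_iso8601(value: str) -> bool:
    """Check whether a string parses as an ISO 8601 timestamp.

    Assumption: the date fields are expected in this format.
    """
    try:
        datetime.fromisoformat(value)
    except ValueError:
        return False
    return True


source = {
    "source_name": "report.pdf",
    "source_id": "report.pdf",
    "source_type": "pdf",
    "date_created": "2025-09-17T20:05:30.782143",
}
print(missing_required_fields(source))
print(is_iso8601(source["date_created"]))
print(is_iso8601("17/09/2025"))
```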
- pydantic model nv_ingest_api.internal.schemas.meta.metadata_schema.TableMetadataSchema[source]#
Bases:
BaseModelNoExt
The schema for the extracted table content.
Show JSON schema
{ "title": "TableMetadataSchema", "description": "The schema for the extracted table content.", "type": "object", "properties": { "caption": { "default": "", "title": "Caption", "type": "string" }, "table_format": { "$ref": "#/$defs/TableFormatEnum" }, "table_content": { "default": "", "title": "Table Content", "type": "string" }, "table_content_format": { "anyOf": [ { "$ref": "#/$defs/TableFormatEnum" }, { "type": "string" } ], "default": "", "title": "Table Content Format" }, "table_location": { "default": [ 0, 0, 0, 0 ], "items": {}, "title": "Table Location", "type": "array" }, "table_location_max_dimensions": { "default": [ 0, 0 ], "items": {}, "title": "Table Location Max Dimensions", "type": "array" }, "uploaded_image_uri": { "default": "", "title": "Uploaded Image Uri", "type": "string" }, "custom_content": { "anyOf": [ { "additionalProperties": true, "type": "object" }, { "type": "null" } ], "default": null, "title": "Custom Content" } }, "$defs": { "TableFormatEnum": { "description": "Enum for representing table formats.\n\nAttributes\n----------\nHTML : str\n Represents HTML table format.\nIMAGE : str\n Represents image table format.\nLATEX : str\n Represents LaTeX table format.\nMARKDOWN : str\n Represents Markdown table format.\nPSEUDO_MARKDOWN : str\n Represents pseudo Markdown table format.\nSIMPLE : str\n Represents simple table format.", "enum": [ "html", "image", "latex", "markdown", "pseudo_markdown", "simple" ], "title": "TableFormatEnum", "type": "string" } }, "additionalProperties": false, "required": [ "table_format" ] }
- Config:
extra: str = forbid
- Fields:
- field caption: str = ''#
The caption for the table.
- field custom_content: Dict[str, Any] | None = None#
- field table_content: str = ''#
Extracted text content, formatted according to table_metadata.table_format.
- field table_content_format: TableFormatEnum | str = ''#
- field table_format: TableFormatEnum [Required]#
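The format of the table: either structured (a dataframe or lists of rows and columns) or serialized as markdown, html, latex, or simple (cells separated by spaces). The accepted TableFormatEnum values are html, image, latex, markdown, pseudo_markdown, and simple.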
- field table_location: tuple = (0, 0, 0, 0)#
The bounding box of the table, in the format (x1,y1,x2,y2).
- field table_location_max_dimensions: tuple = (0, 0)#
The maximum dimensions of the bounding box of the table, in the format (x_max,y_max).
- field uploaded_image_uri: str = ''#
A mirror of source_metadata.source_location.
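The relationship between table_location and table_location_max_dimensions can be illustrated with a small worked example that rescales a bounding box to the range 0 to 1. This is a sketch under the assumption that both fields use the same units (for example, pixels of the rendered page); the helper name normalize_bbox and the sample values are hypothetical.

```python
from typing import Sequence, Tuple


def normalize_bbox(
    location: Sequence[float], max_dimensions: Sequence[float]
) -> Tuple[float, float, float, float]:
    """Scale (x1, y1, x2, y2) by (x_max, y_max) so each value lies in [0, 1]."""
    x1, y1, x2, y2 = location
    x_max, y_max = max_dimensions
    if x_max <= 0 or y_max <= 0:
        # The schema defaults are (0, 0), which cannot be used as a divisor.
        raise ValueError("max dimensions must be positive")
    return (x1 / x_max, y1 / y_max, x2 / x_max, y2 / y_max)


table_metadata = {
    "table_format": "markdown",
    "table_content": "| a | b |\n|---|---|\n| 1 | 2 |",
    "table_location": (100, 200, 500, 400),
    "table_location_max_dimensions": (1000, 800),
}
normalized = normalize_bbox(
    table_metadata["table_location"],
    table_metadata["table_location_max_dimensions"],
)
print(normalized)  # (0.1, 0.25, 0.5, 0.5)
```

The guard matters because the documented defaults for the max dimensions are (0, 0), so an unset field would otherwise cause a division by zero.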
- pydantic model nv_ingest_api.internal.schemas.meta.metadata_schema.TextMetadataSchema[source]#
Bases:
BaseModelNoExt
The schema for the extracted text content.
Show JSON schema
{ "title": "TextMetadataSchema", "description": "The schema for the extracted text content.", "type": "object", "properties": { "text_type": { "$ref": "#/$defs/TextTypeEnum" }, "summary": { "default": "", "title": "Summary", "type": "string" }, "keywords": { "anyOf": [ { "type": "string" }, { "items": { "type": "string" }, "type": "array" }, { "additionalProperties": true, "type": "object" } ], "default": "", "title": "Keywords" }, "language": { "$ref": "#/$defs/LanguageEnum", "default": "en" }, "text_location": { "default": [ 0, 0, 0, 0 ], "items": {}, "title": "Text Location", "type": "array" }, "text_location_max_dimensions": { "default": [ 0, 0 ], "items": {}, "title": "Text Location Max Dimensions", "type": "array" }, "custom_content": { "anyOf": [ { "additionalProperties": true, "type": "object" }, { "type": "null" } ], "default": null, "title": "Custom Content" } }, "$defs": { "LanguageEnum": { "description": "Enum for representing various language codes.\n\nAttributes\n----------\nAF : str\n Afrikaans language code.\nAR : str\n Arabic language code.\nBG : str\n Bulgarian language code.\nBN : str\n Bengali language code.\nCA : str\n Catalan language code.\nCS : str\n Czech language code.\nCY : str\n Welsh language code.\nDA : str\n Danish language code.\nDE : str\n German language code.\nEL : str\n Greek language code.\nEN : str\n English language code.\nES : str\n Spanish language code.\nET : str\n Estonian language code.\nFA : str\n Persian language code.\nFI : str\n Finnish language code.\nFR : str\n French language code.\nGU : str\n Gujarati language code.\nHE : str\n Hebrew language code.\nHI : str\n Hindi language code.\nHR : str\n Croatian language code.\nHU : str\n Hungarian language code.\nID : str\n Indonesian language code.\nIT : str\n Italian language code.\nJA : str\n Japanese language code.\nKN : str\n Kannada language code.\nKO : str\n Korean language code.\nLT : str\n Lithuanian language code.\nLV : str\n Latvian language code.\nMK : str\n 
Macedonian language code.\nML : str\n Malayalam language code.\nMR : str\n Marathi language code.\nNE : str\n Nepali language code.\nNL : str\n Dutch language code.\nNO : str\n Norwegian language code.\nPA : str\n Punjabi language code.\nPL : str\n Polish language code.\nPT : str\n Portuguese language code.\nRO : str\n Romanian language code.\nRU : str\n Russian language code.\nSK : str\n Slovak language code.\nSL : str\n Slovenian language code.\nSO : str\n Somali language code.\nSQ : str\n Albanian language code.\nSV : str\n Swedish language code.\nSW : str\n Swahili language code.\nTA : str\n Tamil language code.\nTE : str\n Telugu language code.\nTH : str\n Thai language code.\nTL : str\n Tagalog language code.\nTR : str\n Turkish language code.\nUK : str\n Ukrainian language code.\nUR : str\n Urdu language code.\nVI : str\n Vietnamese language code.\nZH_CN : str\n Chinese (Simplified) language code.\nZH_TW : str\n Chinese (Traditional) language code.\nUNKNOWN : str\n Represents an unknown language.", "enum": [ "af", "ar", "bg", "bn", "ca", "cs", "cy", "da", "de", "el", "en", "es", "et", "fa", "fi", "fr", "gu", "he", "hi", "hr", "hu", "id", "it", "ja", "kn", "ko", "lt", "lv", "mk", "ml", "mr", "ne", "nl", "no", "pa", "pl", "pt", "ro", "ru", "sk", "sl", "so", "sq", "sv", "sw", "ta", "te", "th", "tl", "tr", "uk", "ur", "vi", "zh-cn", "zh-tw", "unknown" ], "title": "LanguageEnum", "type": "string" }, "TextTypeEnum": { "description": "Enum for representing different types of text segments.\n\nAttributes\n----------\nBLOCK : str\n Represents a text block.\nBODY : str\n Represents body text.\nDOCUMENT : str\n Represents an entire document.\nHEADER : str\n Represents a header text.\nLINE : str\n Represents a single line of text.\nNEARBY_BLOCK : str\n Represents a block of text in close proximity to another.\nOTHER : str\n Represents other unspecified text type.\nPAGE : str\n Represents a page of text.\nSPAN : str\n Represents an inline text span.", "enum": [ "block", 
"body", "document", "header", "line", "nearby_block", "other", "page", "span" ], "title": "TextTypeEnum", "type": "string" } }, "additionalProperties": false, "required": [ "text_type" ] }
- Config:
extra: str = forbid
- Fields:
- field custom_content: Dict[str, Any] | None = None#
- field keywords: str | List[str] | Dict = ''#
Keywords, named entities, or other phrases.
- field language: LanguageEnum = 'en'#
The language of the content.
- field summary: str = ''#
An abbreviated summary of the content.
- field text_location: tuple = (0, 0, 0, 0)#
The bounding box of the text, in the format (x1,y1,x2,y2).
- field text_location_max_dimensions: tuple = (0, 0)#
The maximum dimensions of the bounding box of the text, in the format (x_max,y_max).
- field text_type: TextTypeEnum [Required]#
The type of the text, such as header or body.
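The relationship between text_location and text_location_max_dimensions can be illustrated with a small sketch. It assumes both fields use the same coordinate units and that callers want the box as fractions of the maximum dimensions; the helper name normalize_text_location is hypothetical and not part of nv_ingest_api.

```python
from typing import Optional, Tuple

BBox = Tuple[float, float, float, float]


def normalize_text_location(
    text_location: Tuple[float, float, float, float],
    max_dimensions: Tuple[float, float],
) -> Optional[BBox]:
    """Scale an (x1, y1, x2, y2) box by (x_max, y_max).

    Hypothetical helper: assumes both tuples share the same units.
    Returns None when the maximum dimensions are unset, which covers
    the documented default of (0, 0).
    """
    x1, y1, x2, y2 = text_location
    x_max, y_max = max_dimensions
    if x_max <= 0 or y_max <= 0:
        # Defaults or invalid dimensions: nothing meaningful to scale by.
        return None
    return (x1 / x_max, y1 / y_max, x2 / x_max, y2 / y_max)


box = normalize_text_location((50, 100, 150, 200), (200, 400))
print(box)
```

Returning None for the default (0, 0) dimensions avoids a division by zero when the fields were never populated.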
- nv_ingest_api.internal.schemas.meta.metadata_schema.validate_metadata(metadata: Dict[str, Any])
Validates the given metadata dictionary against the MetadataSchema.
Parameters:
metadata – A dictionary representing the metadata to be validated.
Returns:
An instance of MetadataSchema if validation is successful.
Raises:
ValidationError – If the metadata does not conform to the schema.
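Because validate_metadata either returns a validated instance or raises, a caller processing many records may want to collect failures instead of stopping at the first one. The following is a minimal sketch of that pattern, not part of nv_ingest_api: partition_valid and the stand-in validator are hypothetical names. It catches ValueError, which pydantic's ValidationError subclasses.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple


def partition_valid(
    records: Iterable[Dict[str, Any]],
    validator: Callable[[Dict[str, Any]], Any],
) -> Tuple[List[Any], List[Tuple[Dict[str, Any], str]]]:
    """Split records into validated results and (record, error) pairs.

    Hypothetical helper: `validator` is any callable that returns a
    validated object or raises ValueError (pydantic's ValidationError
    is a ValueError subclass).
    """
    valid: List[Any] = []
    rejected: List[Tuple[Dict[str, Any], str]] = []
    for record in records:
        try:
            valid.append(validator(record))
        except ValueError as exc:
            rejected.append((record, str(exc)))
    return valid, rejected


def require_text_type(metadata: Dict[str, Any]) -> Dict[str, Any]:
    """Stand-in validator used only to make this sketch runnable."""
    if "text_type" not in metadata:
        raise ValueError("text_type is required")
    return metadata


good, bad = partition_valid(
    [{"text_type": "body"}, {"summary": "no type"}],
    require_text_type,
)
print(len(good), len(bad))
```

In real use, validate_metadata would be passed as the validator in place of the stand-in.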
nv_ingest_api.internal.schemas.meta.udf module#
- pydantic model nv_ingest_api.internal.schemas.meta.udf.UDFStageSchema[source]#
Bases:
BaseModel
Schema for UDF stage configuration.
The UDF function string should be provided in the task config. If no UDF function is provided and ignore_empty_udf is True, the message is returned unchanged. If ignore_empty_udf is False, an error is raised when no UDF function is provided.
Show JSON schema
{ "title": "UDFStageSchema", "description": "Schema for UDF stage configuration.\n\nThe UDF function string should be provided in the task config. If no UDF function\nis provided and ignore_empty_udf is True, the message is returned unchanged.\nIf ignore_empty_udf is False, an error is raised when no UDF function is provided.", "type": "object", "properties": { "ignore_empty_udf": { "default": false, "description": "If True, ignore UDF tasks without udf_function and return message unchanged. If False, raise error.", "title": "Ignore Empty Udf", "type": "boolean" } }, "additionalProperties": false }
- Config:
extra: str = forbid
- Fields:
- field ignore_empty_udf: bool = False#
If True, ignore UDF tasks without udf_function and return message unchanged. If False, raise error.
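The effect of ignore_empty_udf described above can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: UDFStageConfig is a hypothetical dataclass stand-in for UDFStageSchema, handle_udf_task and run_udf are hypothetical names, and ValueError is an assumed error type, since the documentation only says that an error is raised.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass(frozen=True)
class UDFStageConfig:
    """Hypothetical stand-in for UDFStageSchema (not the pydantic model)."""

    ignore_empty_udf: bool = False


def handle_udf_task(
    message: Any,
    task_config: Dict[str, Any],
    stage_config: UDFStageConfig,
    run_udf: Callable[[str, Any], Any],
) -> Any:
    """Apply the UDF named in task_config, honoring ignore_empty_udf."""
    udf_source = task_config.get("udf_function")
    if not udf_source:
        if stage_config.ignore_empty_udf:
            # No UDF supplied and the stage tolerates it: pass through.
            return message
        # Error type is an assumption; the docs only say an error is raised.
        raise ValueError("UDF task has no udf_function")
    # How the UDF string is executed is outside this sketch.
    return run_udf(udf_source, message)


msg = {"id": 1}
lenient = UDFStageConfig(ignore_empty_udf=True)
unchanged = handle_udf_task(msg, {}, lenient, lambda src, m: m)
print(unchanged is msg)
```

With the default configuration (ignore_empty_udf=False), the same call raises instead of returning the message.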