nv_ingest_api.internal.extract.html package#

Submodules#

nv_ingest_api.internal.extract.html.html_extractor module#

nv_ingest_api.internal.extract.html.html_extractor.extract_markdown_from_html_internal(
df_extraction_ledger: DataFrame,
task_config: Dict[str, Any],
extraction_config: HtmlExtractorSchema,
execution_trace_log: Dict[str, Any] | None = None,
) → Tuple[DataFrame, Dict | None]

Processes a pandas DataFrame containing HTML file content, extracting the HTML of each document as text and converting it to Markdown.

Parameters:
  • df_extraction_ledger (pd.DataFrame) – The input DataFrame containing HTML files as raw text. Expected columns include ‘source_id’ and ‘content’.

  • task_config (Union[Dict[str, Any], BaseModel]) – Configuration instructions for the document processing task. This can be provided as a dictionary or a Pydantic model.

  • extraction_config (HtmlExtractorSchema) – A configuration object for document extraction that guides the extraction process.

  • execution_trace_log (Optional[Dict[str, Any]], default=None) – An optional dictionary containing trace information for debugging or logging.

Returns:

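A tuple whose first element is a DataFrame with the original HTML content converted to Markdown. This DataFrame contains the columns “document_type”, “metadata”, and “uuid”. The second element is an optional dictionary, which may be None.

Return type:

Tuple[pd.DataFrame, Optional[Dict]]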

Raises:

Exception – If an error occurs during the document extraction process, the exception is logged and re-raised.
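
To make the HTML-to-Markdown step described above more concrete, here is a minimal sketch using only the Python standard library. It is not the implementation used by `extract_markdown_from_html_internal`. The names `MarkdownSketchParser`, `html_to_markdown`, and the `markdown` output key are hypothetical, and plain dictionaries stand in for DataFrame rows. Only the `source_id` and `content` field names come from the documentation above.

```python
from html.parser import HTMLParser


class MarkdownSketchParser(HTMLParser):
    """Hypothetical, deliberately tiny converter: headings, paragraphs, bold, italics."""

    HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}
    INLINE = {"strong": "**", "b": "**", "em": "*", "i": "*"}

    def __init__(self):
        super().__init__()
        self.parts = []  # Markdown fragments collected in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self.parts.append(self.HEADINGS[tag])
        elif tag in self.INLINE:
            self.parts.append(self.INLINE[tag])

    def handle_endtag(self, tag):
        if tag in self.HEADINGS or tag == "p":
            # Block-level elements end with a blank line in Markdown.
            self.parts.append("\n\n")
        elif tag in self.INLINE:
            self.parts.append(self.INLINE[tag])

    def handle_data(self, data):
        self.parts.append(data)


def html_to_markdown(html: str) -> str:
    """Convert a small subset of HTML to Markdown (illustrative only)."""
    parser = MarkdownSketchParser()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return "".join(parser.parts).strip()


# Rows mirror the expected input columns 'source_id' and 'content'.
rows = [
    {"source_id": "doc-1", "content": "<h1>Title</h1><p>Some <b>bold</b> text.</p>"},
]

# 'markdown' is an assumed key for this sketch, not a documented output column.
converted = [
    {"source_id": row["source_id"], "markdown": html_to_markdown(row["content"])}
    for row in rows
]

print(converted[0]["markdown"])
```

The real function operates on a DataFrame and returns the “document_type”, “metadata”, and “uuid” columns described above. This sketch only illustrates the per-document conversion idea.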

Module contents#