Checking the Schema of the Parquet File

NVTabular expects that all input parquet files have the same schema, which includes column types and the nullable (not null) option. If you encounter the error

RuntimeError: Schemas are inconsistent, try using to_parquet(..., schema="infer"),
or pass an explicit pyarrow schema. Such as to_parquet(..., schema={"column1": pa.string()})

when you load the dataset as shown below, one of your parquet files might have a different schema:

ds = nvt.Dataset(PATH, engine="parquet", part_size="1000MB")

The easiest way to fix this is to load your dataset with dask_cudf and save it again using the parquet format ( dask_cudf.read_parquet("INPUT_FOLDER").to_parquet("OUTPUT_FOLDER")), so that the parquet file is standardized and the _metadata file is generated.

If you want to identify which parquet files contain columns with different schemas, you can run one of these scripts:

These scripts check for schema consistency and generate only the _metadata file instead of converting all the parquet files. If the schema is inconsistent across all files, the script will raise an exception. For additional information, see this issue.

Setting the Row Group Size for the Parquet Files

You can use most Data Frame frameworks to set the row group size (number of rows) for your parquet files. In the following Pandas and cuDF examples, the row_group_size is the number of rows that will be stored in each row group (internal structure within the parquet file):

pandas_df.to_parquet("/file/path", engine="pyarrow", row_group_size=10000)
cudf_df.to_parquet("/file/path", engine="pyarrow", row_group_size=10000)

The row group memory size of the parquet files should be smaller than the part_size that you set for the NVTabular dataset such as nvt.Dataset(TRAIN_DIR, engine="parquet", part_size="1000MB"). To determine how much memory a row group will hold, you can slice your dataframe to a specific number of rows and use the following function to get the memory usage in bytes. You can then set the row_group_size (number of rows) accordingly when you save the parquet file. A row group memory size that is close to 128MB is recommended.

def _memory_usage(df):
    """this function is a workaround for obtaining memory usage lists
    in cudf0.16. This can be deleted and replaced with `df.memory_usage(deep= True, index=True).sum()`
    when using cudf 0.17, which has been fixed as noted on"""
    size = 0
    for col in df._data.columns:
        if cudf.utils.dtypes.is_list_dtype(col.dtype):
            for child in col.base_children:
                size += child.__sizeof__()
            size += col._memory_usage(deep=True)
    size += df.index.memory_usage(deep=True)
    return size