
[QST] Can I ignore the is_ragged property of the categorical features when exporting the Workflow? #386

Open
@Azilyss

Description

**Can I ignore the is_ragged property of the categorical features when exporting the Workflow?**

Setup:
nvtabular version: 23.6.0
merlin-systems version: 23.6.0

The NVTabular workflow is defined as follows:

input_features = ["item_id-list"]
max_len = 20
cat_features = (
    ColumnSelector(input_features)
    >> nvt.ops.Categorify()
    >> nvt.ops.AddMetadata(tags=[Tags.CATEGORICAL])
)
seq_feats_list = (
    cat_features["item_id-list"]
    >> nvt.ops.ListSlice(-max_len, pad=True, pad_value=0)
    >> nvt.ops.Rename(postfix="_seq")
    >> nvt.ops.AddMetadata(tags=[Tags.LIST])
)
features = seq_feats_list >> nvt.ops.AddMetadata(tags=[Tags.ITEM, Tags.ID])
workflow = nvt.Workflow(features)

The dataset typically contains item sequences of varying length, and the workflow slices and pads them to the specified max_len.
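
For reference, here is a plain-Python sketch of what the ListSlice step is expected to do to each row: keep the last max_len items and pad shorter rows with pad_value (padding at the end is an assumption here, not taken from the NVTabular source).

# Plain-Python sketch of the intended per-row behavior of
# ListSlice(-max_len, pad=True, pad_value=0); end-padding is an assumption.
def slice_and_pad(row, max_len=20, pad_value=0):
    kept = row[-max_len:]                              # keep at most the last max_len items
    return kept + [pad_value] * (max_len - len(kept))  # pad up to max_len

assert slice_and_pad([28, 12, 44], max_len=5) == [28, 12, 44, 0, 0]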

The workflow is exported as follows:

transform_workflow_op = workflow.input_schema.column_names >> TransformWorkflow(workflow)
ensemble = Ensemble(transform_workflow_op, workflow.input_schema)
ens_config, node_configs = ensemble.export(preprocessing_path)

When the workflow is exported through the Ensemble module, the generated Triton config for the NVTabular model declares two tensors for each ragged feature, "feature_name___values" and "feature_name___offsets", for both the inputs and the outputs.
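
For context, a small sketch (plain NumPy, not Merlin code) of what that values/offsets pair encodes for a ragged list column: the values tensor is the concatenation of all rows, and the offsets tensor marks the row boundaries.

import numpy as np

# Values/offsets encoding of a ragged list column (names follow the config above).
rows = [[28, 12, 44], [12, 28, 73], [24, 35, 6, 12]]

values = np.concatenate([np.asarray(r) for r in rows])  # [28 12 44 12 28 73 24 35  6 12]
offsets = np.cumsum([0] + [len(r) for r in rows])       # [ 0  3  6 10]

# Row i is recovered as values[offsets[i]:offsets[i + 1]].
assert list(values[offsets[2]:offsets[3]]) == rows[2]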

Is there a way to avoid creating these extra tensors and keep the inputs as they are?
Any workaround is appreciated.
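
To see the generated inputs and outputs concretely, the exported configs can be printed (a sketch; it assumes export writes a standard Triton model repository under the given path, with one config.pbtxt per model):

from pathlib import Path

# Print every config.pbtxt that the export step wrote under preprocessing_path.
for cfg in sorted(Path(preprocessing_path).glob("*/config.pbtxt")):
    print(cfg)
    print(cfg.read_text())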

Code to reproduce
  import dask.dataframe as dd
  import nvtabular as nvt
  import pandas as pd
  from merlin.schema import Tags
  from merlin.systems.dag import Ensemble
  from merlin.systems.dag.ops.workflow import TransformWorkflow
  from nvtabular import ColumnSelector

  tmp_path = "tmp"

  d = {
      "item_id-list": [
          [28, 12, 44],
          [12, 28, 73],
          [24, 35, 6, 12],
          [74, 28, 9, 12, 44],
          [101, 102, 103, 104, 105],
      ],
  }

  df = pd.DataFrame(data=d)
  ddf = dd.from_pandas(df, npartitions=1)
  train_set = nvt.Dataset(ddf)

  input_features = ["item_id-list"]
  max_len = 20
  cat_features = (
          ColumnSelector(input_features)
          >> nvt.ops.Categorify()
          >> nvt.ops.AddMetadata(tags=[Tags.CATEGORICAL])
  )
  seq_feats_list = (
          cat_features["item_id-list"]
          >> nvt.ops.ListSlice(-max_len, pad=True, pad_value=0)
          >> nvt.ops.Rename(postfix="_seq")
          >> nvt.ops.AddMetadata(tags=[Tags.LIST])
  )
  features = seq_feats_list >> nvt.ops.AddMetadata(tags=[Tags.ITEM, Tags.ID])
  workflow = nvt.Workflow(features)

  workflow.fit(train_set)

  transform_workflow_op = workflow.input_schema.column_names >> TransformWorkflow(workflow)

  ensemble = Ensemble(transform_workflow_op, workflow.input_schema)
  ens_config, node_configs = ensemble.export(tmp_path)

  print(ens_config)
