PipeOpDecode #835

mb706 · 2024-10-03T19:35:39Z

Inverts one-hot encoding: creates a factor column that indicates which of the input numeric columns has the maximal value.
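As a rough illustration of the core idea (not the eventual PipeOp implementation), the row-wise argmax can be sketched in base R with `max.col()`. The data frame and its column names are made up for this example, and breaking ties by taking the first column is an assumption; the issue does not say how ties should be handled.

```r
# Hypothetical input: three numeric indicator columns belonging to one factor.
dt <- data.frame(a = c(1, 0, 0), b = c(0, 1, 0), c = c(0, 0, 1))

# max.col() returns, for each row, the index of the column holding the maximum.
# ties.method = "first" is an assumption made here to keep the result deterministic.
idx <- max.col(as.matrix(dt), ties.method = "first")

# Map the column index back to a level; the levels are the input column names.
decoded <- factor(colnames(dt)[idx], levels = colnames(dt))
```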

Should have argument 'treatment_encoding' (init: FALSE): if TRUE, it includes an additional level for rows where all cols are 0 (becoming the inverse of PipeOpEncode with method = "treatment").
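A minimal sketch of what decoding with the extra reference level could look like. The helper `decode_treatment()` and the label `"ref"` are hypothetical; the issue does not specify how the additional level should be named.

```r
# Hypothetical helper: decode a numeric indicator matrix where an all-zero row
# stands for the reference level that treatment encoding dropped.
decode_treatment <- function(m, ref_level = "ref") {
  idx <- max.col(m, ties.method = "first")  # row-wise argmax
  lvl <- colnames(m)[idx]
  lvl[rowSums(m != 0) == 0] <- ref_level    # all columns 0: reference level
  factor(lvl, levels = c(ref_level, colnames(m)))
}

# Treatment-encoded factor with levels (ref, b, c): only b and c have columns.
m <- cbind(b = c(0, 1, 0), c = c(0, 0, 1))
decoded <- decode_treatment(m)
```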

Should also have an argument group_pattern, a regular expression. The group_pattern is applied to all column names and the first regex capture group is extracted. Columns that yield different values here are decoded separately from each other, into one output column per distinct value. The levels that are created then correspond to gsub(group_pattern, "", colnames()). Initialized as "^([^.]*)\\.".

The point here is that we may have columns x.a, x.b, x.c, y.a, y.b. The "^([^.]*)\\."-match matches "x" for the first three cols, creating a factor column x with levels a, b, and c. It then matches "y" for the last two cols, creating a factor column y with levels a and b. Should the user e.g. have columns x_a, x_b, ..., then this would need to be changed to "^([^_]*)_". Should the user not want any groups, and instead get levels x.a, x.b, ..., y.b in a single result column, the pattern would be "".
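To make the grouping concrete, here is a small base-R sketch of how the first capture group and the resulting levels could be extracted. The helper `extract_groups()` is hypothetical and only covers non-empty patterns.

```r
# Hypothetical helper: return the first capture group per column name,
# or NA where the pattern does not match.
extract_groups <- function(cols, group_pattern) {
  matches <- regmatches(cols, regexec(group_pattern, cols))
  vapply(matches, function(z) if (length(z) >= 2) z[[2]] else NA_character_,
         character(1))
}

cols <- c("x.a", "x.b", "x.c", "y.a", "y.b")
group_pattern <- "^([^.]*)\\."

groups <- extract_groups(cols, group_pattern)  # one group name per column
lvls <- sub(group_pattern, "", cols)           # level names with the prefix removed
by_group <- split(lvls, groups)                # levels per output column

# The same idea for underscore-separated names.
groups_us <- extract_groups(c("x_a", "x_b"), "^([^_]*)_")
```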

If the pattern is not "", we ignore all columns that do not match the group_pattern; I am assuming that this is what a user wants basically all of the time, even though it unfortunately undermines the affect_columns argument somewhat.
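The column filtering described here could be sketched like this. `select_decodable()` is a hypothetical helper, and the column `id` is made up to stand for a column that should be left alone.

```r
# Hypothetical helper: keep only columns matching group_pattern,
# unless the pattern is "", in which case every column is used.
select_decodable <- function(cols, group_pattern) {
  if (identical(group_pattern, "")) {
    return(cols)
  }
  cols[grepl(group_pattern, cols)]
}

cols <- c("x.a", "x.b", "id", "y.a")
kept <- select_decodable(cols, "^([^.]*)\\.")  # "id" has no "." and is dropped
kept_all <- select_decodable(cols, "")         # no grouping: all columns kept
```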


mb706 · 2024-10-08T10:17:51Z

suggestion for state:

named list, named by columns that are being created, with content for each such column:
- named character vector, named by the input columns, with each entry containing the resulting factor level.
- for treatment_encoding, maybe also include an entry with empty name, containing the label of the reference level.

also maybe the content of the treatment_encoding flag for prediction, since changing the hyperparameter after training is not allowed to have an effect.

probably good idea to use PipeOpTaskPreprocSimple.

mb706 · 2024-10-08T10:19:31Z

in the x.a, x.b, x.c, y.a, y.b example, the state would be

list(
  colmaps = list(
    x = c(x.a = "a", x.b = "b", x.c = "c"),
    y = c(y.a = "a", y.b = "b")
  ),
  treatment_encoding = FALSE
)
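A sketch of how such a state might be applied at prediction time, assuming plain argmax decoding without treatment encoding. `newdata` is made-up example data, and this is base R only, not the mlr3pipelines API.

```r
state <- list(
  colmaps = list(
    x = c(x.a = "a", x.b = "b", x.c = "c"),
    y = c(y.a = "a", y.b = "b")
  ),
  treatment_encoding = FALSE
)

# Hypothetical data to decode.
newdata <- data.frame(
  x.a = c(0, 1), x.b = c(1, 0), x.c = c(0, 0),
  y.a = c(1, 0), y.b = c(0, 1)
)

# For each output column, take the row-wise argmax over its input columns
# and translate the winning column into the level stored in the colmap.
decoded <- as.data.frame(lapply(state$colmaps, function(colmap) {
  m <- as.matrix(newdata[, names(colmap), drop = FALSE])
  idx <- max.col(m, ties.method = "first")
  factor(unname(colmap)[idx], levels = unname(colmap))
}))
```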

mb706 · 2024-10-08T10:22:19Z

maybe treatment_encoding flag is not necessary, since we can see this from the fact that there are entries in the col-maps with empty name.
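For illustration, detecting treatment encoding from the col-map alone could look roughly like this. The col-map below is a made-up example in which "a" is the reference level, stored under an empty name.

```r
# Hypothetical col-map: the unnamed entry carries the reference level label.
colmap <- c("a", x.b = "b", x.c = "c")

is_ref <- names(colmap) == ""            # TRUE for the empty-name entry
uses_treatment <- any(is_ref)            # flag recovered from the col-map itself
ref_label <- unname(colmap[is_ref])      # label of the reference level
input_cols <- names(colmap)[!is_ref]     # actual input columns to read
```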

mb706 added the "Type: New PipeOp" label (Issue suggests a new PipeOp) Oct 3, 2024

mb706 assigned advieser Oct 3, 2024
