Inverts one-hot encoding: creates a factor column that indicates which of the input numeric columns has the maximal value.
Should have an argument 'treatment_encoding' (init: FALSE): if TRUE, the result includes an additional level for rows where all cols are 0 (making this the inverse of PipeOpEncode with method == "treatment").
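A minimal base-R sketch of the decoding step for a single group of indicator columns, to make the two modes concrete. The helper name `decode_group`, the `ref_level` argument and its default label, and the tie-breaking rule (first column wins) are assumptions made for illustration; none of them is specified above.

```r
# Hypothetical helper, not part of mlr3pipelines: decode one group of
# numeric indicator columns (given as a matrix) into a factor.
decode_group <- function(mat, treatment_encoding = FALSE, ref_level = "ref") {
  lvls <- colnames(mat)
  # Column index of the row-wise maximum; ties go to the first column (assumption).
  idx <- max.col(mat, ties.method = "first")
  out <- lvls[idx]
  if (treatment_encoding) {
    # Rows in which every indicator is 0 are assigned the extra reference level.
    all_zero <- rowSums(mat != 0) == 0
    out[all_zero] <- ref_level
    lvls <- c(ref_level, lvls)
  }
  factor(out, levels = lvls)
}

m <- cbind(a = c(1, 0, 0), b = c(0, 1, 0))
decode_group(m)                             # third row falls back to "a"
decode_group(m, treatment_encoding = TRUE)  # third row becomes "ref"
```

Without `treatment_encoding`, an all-zero row cannot be told apart from a tie, so it simply gets whichever level the tie-breaking rule picks.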
Should also have an argument group_pattern, a regular expression. The group_pattern is applied to all col names and the first regex group is extracted. Columns for which this extracted value differs are treated separately from each other, each group producing its own factor column. The levels that are created then correspond to gsub(group_pattern, "", colnames()). Initialized as "^([^.]*)\\.".
The point here is that we may have columns x.a, x.b, x.c, y.a, y.b. The "^([^.]*)\\."-match matches "x" for the first three cols and creates a factor col with levels a, b, and c. It then matches "y" for the last two cols, creating a second factor col with levels a and b. Should the user e.g. have columns x_a, x_b, ..., then this would need to be changed to "^([^_]*)_". Should the user not want any groups, and instead get levels x.a, x.b, ..., y.b in a single result column, the pattern would be "".
If the pattern is not "", we ignore all columns that do not match the group_pattern; I am assuming that this is what a user wants basically all of the time, even though it unfortunately undermines the affect_columns argument somewhat.
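The grouping logic described above could be sketched in base R roughly as follows. The helper name `group_columns` and its return shape are assumptions for illustration, and the sketch covers only the non-empty pattern case; the "" case (one result column, no grouping) is left out.

```r
# Hypothetical helper: split column names into groups via group_pattern.
# Returns a named list (one entry per resulting factor column); each entry is
# a named character vector mapping input column name -> factor level.
group_columns <- function(cn, group_pattern = "^([^.]*)\\.") {
  # Columns that do not match the pattern are ignored.
  cn <- cn[grepl(group_pattern, cn)]
  # regexec() yields the full match followed by capture groups;
  # element 2 is the first regex group, i.e. the group name.
  m <- regmatches(cn, regexec(group_pattern, cn))
  groups <- vapply(m, function(x) x[2], character(1))
  # Level names are what remains after stripping the matched prefix.
  lvls <- gsub(group_pattern, "", cn)
  split(stats::setNames(lvls, cn), groups)
}

group_columns(c("x.a", "x.b", "x.c", "y.a", "y.b", "unrelated"))
group_columns(c("x_a", "x_b"), group_pattern = "^([^_]*)_")
```

In the first call, "unrelated" is dropped because it contains no ".", which is the interaction with affect_columns mentioned above: a column can be selected by affect_columns and still be skipped by the pattern.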
named list, named by columns that are being created, with content for each such column:
named character, named by the name of input columns, containing in each entry the name of the resulting factor level.
for treatment_encoding, maybe also include an entry with empty name, containing the label of the reference factor.
also maybe the content of the treatment_encoding flag for prediction, since changing the hyperparameter after training is not allowed to have an effect.
probably a good idea to use PipeOpTaskPreprocSimple.
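One way the state outlined above might look, together with a prediction step that relies only on that state. This is a sketch under assumptions: the field names `treatment_encoding` and `colmaps`, the reference label "ref", the function name `predict_decode`, and first-column tie-breaking are all hypothetical, and the sketch uses a plain data.frame rather than the PipeOpTaskPreprocSimple machinery.

```r
# Hypothetical state layout: the flag is stored at training time, and each
# created column gets a named character vector (input column -> level).
# The entry with an empty name holds the label of the reference level.
state <- list(
  treatment_encoding = TRUE,
  colmaps = list(
    x = c("ref", x.a = "a", x.b = "b"),
    y = c("ref", y.a = "a", y.b = "b")
  )
)

# Hypothetical prediction step: uses only `state`, so changing the
# hyperparameter after training cannot alter the result.
predict_decode <- function(dt, state) {
  out <- lapply(state$colmaps, function(map) {
    ref <- unname(map[names(map) == ""])   # reference label (may be empty)
    map <- map[names(map) != ""]           # input column -> level
    mat <- as.matrix(dt[, names(map), drop = FALSE])
    res <- unname(map[max.col(mat, ties.method = "first")])
    lvls <- unname(map)
    if (state$treatment_encoding) {
      res[rowSums(mat != 0) == 0] <- ref   # all-zero rows get the reference level
      lvls <- c(ref, lvls)
    }
    factor(res, levels = lvls)
  })
  as.data.frame(out)
}

dt <- data.frame(x.a = c(1, 0, 0), x.b = c(0, 1, 0),
                 y.a = c(0, 1, 0), y.b = c(1, 0, 0))
predict_decode(dt, state)
```

Keeping the flag inside the state, rather than reading the current hyperparameter at prediction time, is what makes a post-training change a no-op.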