Add new transformer MatchCategories to preprocessing module#458
Add new transformer MatchCategories to preprocessing module#458solegalli merged 17 commits intofeature-engine:mainfrom
MatchCategories to preprocessing module#458Conversation
|
Of the top of my head, the .copy() is intentional and not to over-write the users dataframe. I am away for 2 weeks, so will have a look when I am back. Apologies. |
solegalli
left a comment
There was a problem hiding this comment.
Thanks a lot for the contribution.
I think the new class looks good. I am not completely sure that I would like to modify the base categorical class much, that would have a knock on effect on all encoders. Could we move the methods and params to the new transformer instead?
If there are bits of the base class that should be avoided, maybe we can think of breaking the methods down more and see how to make it fit.
Also, CategoryEncoder is a bit misleading, this transformer does not change the values of the variables, just the type. So I would call it MatchCategories or something on those line to keep the spirit of the preprocessing module and not to induce users to think that they will have the values of their categories changed into numbers (aka encoding).
Let me know your thoughts.
| self, | ||
| variables: Union[None, int, str, List[Union[str, int]]] = None, | ||
| ignore_format: bool = False, | ||
| allow_missing: bool = False |
There was a problem hiding this comment.
If we add a parameter to this class, this parameter will be available to all encoders. Can we not add the param to the class that will use it only?
The least parameters in a class the better (more user friendly) so I would not want to add unneeded param to all transformers.
There was a problem hiding this comment.
Issue is, most of what the transformers do happens in the base class. If it were to be done outside of the base class, if would involve duplicating a lot of code.
There was a problem hiding this comment.
Can we not make this new class inherit the BaseCategorical and modify the functionality in the new class instead of in the base class? and if not, maybe make a child class of BaseCategorical and then modify that one instead?
BaseCategorical is used by all categorical transformers. So we can't really add a parameter that is only going to be used by one class. We need to find an alternative way.
There was a problem hiding this comment.
Issue is, since the part that differs is right in the middle of the base class' methods, adding a new subclass would involve duplicating lots of the code. Not sure if that's Ok for you.
| self._check_nas_in_result(X) | ||
| return X | ||
|
|
||
| def _check_nas_in_result(self, X): |
There was a problem hiding this comment.
this method is only used in the new class, correct? can we not place it there instead?
There was a problem hiding this comment.
this method needs to be moved to MatchVariables
| """ | ||
|
|
||
| from .match_columns import MatchVariables | ||
| from .category_encoder import CategoryEncoder |
There was a problem hiding this comment.
Regarding your question of relative vs absolute imports, originally I started with absolute, but then the codebase grew a lot, i divided scripts into folders, and in order to affect the users as little as possible and not make the import commands extra long, I adopted this style. There are pros and cons, but because of legacy we went with it.
|
|
||
| {ignore_format} | ||
|
|
||
| {errors} |
There was a problem hiding this comment.
do we need this param in this class?
There was a problem hiding this comment.
Yes, it also has to decide whether to output missing values for unseen categories or to fail when it encounters them.
| X = self._check_transform_input_and_state(X) | ||
|
|
||
| for feature, levels in self.encoder_dict_.items(): | ||
| X = X.assign(**{feature: pd.Categorical(X[feature], levels)}) |
There was a problem hiding this comment.
I was once told that this ** think is not a nice way to code. Personally I am not sure that is true, so I would like to ask you for more info on why you selected this implementation?
There was a problem hiding this comment.
In this particular case, there's little difference, but if the check_X function were changed so as not to copy the data, this would avoid modifying the input in place. See #457
|
FYI this PR: david-cortes#2 restructures the base encoder so that you can inherit in this new class without adding the extra parameters. Hope this helps Cheers! |
|
Updated, but I'm again not very sure if this is how you intended the changes to be. |
solegalli
left a comment
There was a problem hiding this comment.
Thank you for the changes.
We are almost there.
I introduced some changes in the base_encoder, so that you can implement the check for missing data only in the class you are creating.
So please, do not add more parameters or methods to the base_encoder. Whatever MatchColumns needs, needs to be just in that class.
Do you think you could do a few more changes to accommodate that?
See my comments throughout the code.
Thank you!
|
|
||
| {ignore_format} | ||
|
|
||
| allow_missing: bool |
There was a problem hiding this comment.
this parameter should only be added to the MatchColumns class
|
|
||
| # check if dataset contains na | ||
| _check_contains_na(X, self.variables_) | ||
| if not self.allow_missing: |
There was a problem hiding this comment.
I created the _fit method and removed the check_na from this method, so you can use it in MatchVariables.
In short, this logic that you are using here, you need to move it to MatchVariables.
| self._check_nas_in_result(X) | ||
| return X | ||
|
|
||
| def _check_nas_in_result(self, X): |
There was a problem hiding this comment.
this method needs to be moved to MatchVariables
| errors: str = "raise", | ||
| ) -> None: | ||
| check_parameter_errors(errors, ["ignore", "raise"]) | ||
| super().__init__(variables, ignore_format, allow_missing=errors != "raise") |
There was a problem hiding this comment.
the logic for missing values needs to be moved here.
Also, we normally call this parameter "missing" and takes values "raise" and "ignore". So for consistency, could we use the same names?
There was a problem hiding this comment.
In the other categorical encoders (such as CountFrequency), it it called "errors".
| y is not needed in this encoder. You can pass y or None. | ||
| """ | ||
| X = check_X(X) | ||
| self._check_or_select_variables(X) |
There was a problem hiding this comment.
after this method, you need to add the logic if self.missing=="raise": then do something
386e3e5 to
64e9c56
Compare
|
Redid the PR based on the current master branch - please take a look to see if this is how the relationships between classes were meant to be. |
|
Actually, from a look at the other commits that you had pushed for this PR, I'm now noticing that it differed a bit from the current master branch and have some confusion over which changes should be kept and from where did changes come (master branch, my old PR, your commits to my PR). Were methods like |
|
Did you had a chance to have the look at the changes I made in the PR I made to your repo? Because there you would have a better view of what was "new". I modified the base_encoder so that you could accommodate the allow_missing functionality. For this reason, I created the _fit() method, that would be used by all encoders, and left the _check_categorical_variables() without the check for na, so that you could use it in this class, without re-writing all the code again. Most of my changes have disappeared now :_( I suggest you take a look at the PR you merged to your repo, hopefully you will see what was meant to be retained. If not, maybe the best would be to checkout the branch from which I made the PR to your repo and start from there. Hope this helps |
64e9c56 to
7e76846
Compare
|
Redone based on the earlier branch, but this time removing the additional parameter in the base class. Although, I am still not sure that |
|
It's looking good. We need to add the documentation now. We need to call this class from:
and within the api and user_guide folders we need to add an rst file, in the first one just to retrieve the docstrings, and in the second one with a demo of the class. Would that be ok? You could look at the ones from MatchColumns as inspiration. |
|
Added docs and an example guide. |
|
|
||
| This transformer is useful when creating custom transformers for categorical columns, | ||
| as well as when passing categorical columns to modeling packages that support them | ||
| natively but which leaving their encoding to the user. |
There was a problem hiding this comment.
Could you mention which package(s) support categorical variables without encoding them to numbers?
|
Thank you so much for the code update. I made some minor changes here: david-cortes#3 Would you have a look and merge? Also, after merging, would you mind adding the names of the packages that support categorical variables without encoding here: docs/user_guide/preprocessing/MatchCategories.rst at around line 171? I left a comment. Thanks a lot! |
MatchCategoriesMatchCategories to preprocessing module
|
Added examples. |
Fixes #418
This PR introduces a transformer
CategoryEncoderwhich simply encodes columns as categorical dtype remembering the encodings / levels that were used for each variable.As per the issue comments, I put it under
preprocessing, but it is based on the base categorical encoder that the transformers inencodinguse.There's a few things that I was not sure about:
from feature_engine... import ...). I am not sure if that was intentional or not. This PR uses relative imports for internal objects.df[column], creating an extra copy of the data beforehand which is inefficient (opened an issue about it: Potentially unneeded data copying #457). For this PR instead I useddf.assign, but it's still incurring an extra copy due to calling the input-checking methods.