Skip to content

Add new transformer MatchCategories to preprocessing module#458

Merged
solegalli merged 17 commits intofeature-engine:mainfrom
david-cortes:categ_encoder
Jul 5, 2022
Merged

Add new transformer MatchCategories to preprocessing module#458
solegalli merged 17 commits intofeature-engine:mainfrom
david-cortes:categ_encoder

Conversation

@david-cortes
Copy link
Contributor

Fixes #418

This PR introduces a transformer CategoryEncoder which simply encodes columns as categorical dtype remembering the encodings / levels that were used for each variable.

As per the issue comments, I put it under preprocessing, but it is based on the base categorical encoder that the transformers in encoding use.

There's a few things that I was not sure about:

  • I see that some modules use absolute imports (from feature_engine... import ...). I am not sure if that was intentional or not. This PR uses relative imports for internal objects.
  • I see some transformers modify the data directly by using df[column], creating an extra copy of the data beforehand which is inefficient (opened an issue about it: Potentially unneeded data copying #457). For this PR instead I used df.assign, but it's still incurring an extra copy due to calling the input-checking methods.

@solegalli
Copy link
Collaborator

Hi @david-cortes

Of the top of my head, the .copy() is intentional and not to over-write the users dataframe.

I am away for 2 weeks, so will have a look when I am back. Apologies.

Copy link
Collaborator

@solegalli solegalli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @david-cortes

Thanks a lot for the contribution.

I think the new class looks good. I am not completely sure that I would like to modify the base categorical class much, that would have a knock on effect on all encoders. Could we move the methods and params to the new transformer instead?

If there are bits of the base class that should be avoided, maybe we can think of breaking the methods down more and see how to make it fit.

Also, CategoryEncoder is a bit misleading, this transformer does not change the values of the variables, just the type. So I would call it MatchCategories or something on those line to keep the spirit of the preprocessing module and not to induce users to think that they will have the values of their categories changed into numbers (aka encoding).

Let me know your thoughts.

self,
variables: Union[None, int, str, List[Union[str, int]]] = None,
ignore_format: bool = False,
allow_missing: bool = False
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we add a parameter to this class, this parameter will be available to all encoders. Can we not add the param to the class that will use it only?

The least parameters in a class the better (more user friendly) so I would not want to add unneeded param to all transformers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue is, most of what the transformers do happens in the base class. If it were to be done outside of the base class, if would involve duplicating a lot of code.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we not make this new class inherit the BaseCategorical and modify the functionality in the new class instead of in the base class? and if not, maybe make a child class of BaseCategorical and then modify that one instead?

BaseCategorical is used by all categorical transformers. So we can't really add a parameter that is only going to be used by one class. We need to find an alternative way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue is, since the part that differs is right in the middle of the base class' methods, adding a new subclass would involve duplicating lots of the code. Not sure if that's Ok for you.

self._check_nas_in_result(X)
return X

def _check_nas_in_result(self, X):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this method is only used in the new class, correct? can we not place it there instead?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this method needs to be moved to MatchVariables

"""

from .match_columns import MatchVariables
from .category_encoder import CategoryEncoder
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding your question of relative vs absolute imports, originally I started with absolute, but then the codebase grew a lot, i divided scripts into folders, and in order to affect the users as little as possible and not make the import commands extra long, I adopted this style. There are pros and cons, but because of legacy we went with it.


{ignore_format}

{errors}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need this param in this class?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it also has to decide whether to output missing values for unseen categories or to fail when it encounters them.

X = self._check_transform_input_and_state(X)

for feature, levels in self.encoder_dict_.items():
X = X.assign(**{feature: pd.Categorical(X[feature], levels)})
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was once told that this ** think is not a nice way to code. Personally I am not sure that is true, so I would like to ask you for more info on why you selected this implementation?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this particular case, there's little difference, but if the check_X function were changed so as not to copy the data, this would avoid modifying the input in place. See #457

@solegalli
Copy link
Collaborator

Hi @david-cortes

FYI this PR: david-cortes#2

restructures the base encoder so that you can inherit in this new class without adding the extra parameters.

Hope this helps

Cheers!

@david-cortes david-cortes changed the title Add simple CategoryEncoder Add simple MatchCategories Jun 26, 2022
@david-cortes
Copy link
Contributor Author

Updated, but I'm again not very sure if this is how you intended the changes to be.

Copy link
Collaborator

@solegalli solegalli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @david-cortes

Thank you for the changes.

We are almost there.

I introduced some changes in the base_encoder, so that you can implement the check for missing data only in the class you are creating.

So please, do not add more parameters or methods to the base_encoder. Whatever MatchColumns needs, needs to be just in that class.

Do you think you could do a few more changes to accommodate that?

See my comments throughout the code.

Thank you!


{ignore_format}

allow_missing: bool
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this parameter should only be added to the MatchColumns class


# check if dataset contains na
_check_contains_na(X, self.variables_)
if not self.allow_missing:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I created the _fit method and removed the check_na from this method, so you can use it in MatchVariables.

In short, this logic that you are using here, you need to move it to MatchVariables.

self._check_nas_in_result(X)
return X

def _check_nas_in_result(self, X):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this method needs to be moved to MatchVariables

errors: str = "raise",
) -> None:
check_parameter_errors(errors, ["ignore", "raise"])
super().__init__(variables, ignore_format, allow_missing=errors != "raise")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the logic for missing values needs to be moved here.

Also, we normally call this parameter "missing" and takes values "raise" and "ignore". So for consistency, could we use the same names?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the other categorical encoders (such as CountFrequency), it it called "errors".

y is not needed in this encoder. You can pass y or None.
"""
X = check_X(X)
self._check_or_select_variables(X)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

after this method, you need to add the logic if self.missing=="raise": then do something

@david-cortes
Copy link
Contributor Author

Redid the PR based on the current master branch - please take a look to see if this is how the relationships between classes were meant to be.

@david-cortes
Copy link
Contributor Author

david-cortes commented Jun 26, 2022

Actually, from a look at the other commits that you had pushed for this PR, I'm now noticing that it differed a bit from the current master branch and have some confusion over which changes should be kept and from where did changes come (master branch, my old PR, your commits to my PR).

Were methods like _fit intended to be used for this PR, or were they meant to be removed from an earlier PR?

@solegalli
Copy link
Collaborator

Hi @david-cortes

Did you had a chance to have the look at the changes I made in the PR I made to your repo? Because there you would have a better view of what was "new".

I modified the base_encoder so that you could accommodate the allow_missing functionality. For this reason, I created the _fit() method, that would be used by all encoders, and left the _check_categorical_variables() without the check for na, so that you could use it in this class, without re-writing all the code again.

Most of my changes have disappeared now :_(

I suggest you take a look at the PR you merged to your repo, hopefully you will see what was meant to be retained. If not, maybe the best would be to checkout the branch from which I made the PR to your repo and start from there.

Hope this helps
Cheers

@david-cortes
Copy link
Contributor Author

Redone based on the earlier branch, but this time removing the additional parameter in the base class. Although, I am still not sure that _check_nas_in_result is better left as a duplicated piece of code in MatchCategories instead of a method in the base class.

@solegalli
Copy link
Collaborator

Hi @david-cortes

It's looking good. We need to add the documentation now. We need to call this class from:

  • docs/index
  • docs/api/index
  • docs/user_guide/index

and within the api and user_guide folders we need to add an rst file, in the first one just to retrieve the docstrings, and in the second one with a demo of the class.

Would that be ok? You could look at the ones from MatchColumns as inspiration.

@david-cortes
Copy link
Contributor Author

Added docs and an example guide.


This transformer is useful when creating custom transformers for categorical columns,
as well as when passing categorical columns to modeling packages that support them
natively but which leaving their encoding to the user.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you mention which package(s) support categorical variables without encoding them to numbers?

@solegalli
Copy link
Collaborator

solegalli commented Jul 5, 2022

Hi @david-cortes

Thank you so much for the code update. I made some minor changes here: david-cortes#3

Would you have a look and merge?

Also, after merging, would you mind adding the names of the packages that support categorical variables without encoding here: docs/user_guide/preprocessing/MatchCategories.rst at around line 171? I left a comment.

Thanks a lot!

@solegalli solegalli changed the title Add simple MatchCategories Add new transformer MatchCategories to preprocessing module Jul 5, 2022
@david-cortes
Copy link
Contributor Author

Added examples.

@solegalli solegalli merged commit 60a7045 into feature-engine:main Jul 5, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Would be useful to have a simple categorical levels encoder

2 participants