Add new transformer `MatchCategories` to preprocessing module by david-cortes · Pull Request #458 · feature-engine/feature_engine

david-cortes · 2022-05-12T20:20:46Z

Fixes #418

This PR introduces a transformer CategoryEncoder which simply encodes columns as categorical dtype remembering the encodings / levels that were used for each variable.

As per the issue comments, I put it under preprocessing, but it is based on the base categorical encoder that the transformers in encoding use.

There's a few things that I was not sure about:

I see that some modules use absolute imports (from feature_engine... import ...). I am not sure if that was intentional or not. This PR uses relative imports for internal objects.
I see some transformers modify the data directly by using df[column], creating an extra copy of the data beforehand which is inefficient (opened an issue about it: Potentially unneeded data copying #457). For this PR instead I used df.assign, but it's still incurring an extra copy due to calling the input-checking methods.

solegalli · 2022-05-13T06:22:13Z

Hi @david-cortes

Of the top of my head, the .copy() is intentional and not to over-write the users dataframe.

I am away for 2 weeks, so will have a look when I am back. Apologies.

solegalli

Hi @david-cortes

Thanks a lot for the contribution.

I think the new class looks good. I am not completely sure that I would like to modify the base categorical class much, that would have a knock on effect on all encoders. Could we move the methods and params to the new transformer instead?

If there are bits of the base class that should be avoided, maybe we can think of breaking the methods down more and see how to make it fit.

Also, CategoryEncoder is a bit misleading, this transformer does not change the values of the variables, just the type. So I would call it MatchCategories or something on those line to keep the spirit of the preprocessing module and not to induce users to think that they will have the values of their categories changed into numbers (aka encoding).

Let me know your thoughts.

solegalli · 2022-05-31T08:14:22Z

feature_engine/encoding/base_encoder.py

        self,
        variables: Union[None, int, str, List[Union[str, int]]] = None,
        ignore_format: bool = False,
+        allow_missing: bool = False


If we add a parameter to this class, this parameter will be available to all encoders. Can we not add the param to the class that will use it only?

The least parameters in a class the better (more user friendly) so I would not want to add unneeded param to all transformers.

Issue is, most of what the transformers do happens in the base class. If it were to be done outside of the base class, if would involve duplicating a lot of code.

Can we not make this new class inherit the BaseCategorical and modify the functionality in the new class instead of in the base class? and if not, maybe make a child class of BaseCategorical and then modify that one instead?

BaseCategorical is used by all categorical transformers. So we can't really add a parameter that is only going to be used by one class. We need to find an alternative way.

Issue is, since the part that differs is right in the middle of the base class' methods, adding a new subclass would involve duplicating lots of the code. Not sure if that's Ok for you.

solegalli · 2022-05-31T08:14:26Z

feature_engine/encoding/base_encoder.py

+        self._check_nas_in_result(X)
+        return X
+
+    def _check_nas_in_result(self, X):


this method is only used in the new class, correct? can we not place it there instead?

this method needs to be moved to MatchVariables

solegalli · 2022-05-31T08:14:42Z

feature_engine/preprocessing/__init__.py

 """

 from .match_columns import MatchVariables
+from .category_encoder import CategoryEncoder


Regarding your question of relative vs absolute imports, originally I started with absolute, but then the codebase grew a lot, i divided scripts into folders, and in order to affect the users as little as possible and not make the import commands extra long, I adopted this style. There are pros and cons, but because of legacy we went with it.

solegalli · 2022-05-31T08:15:03Z

feature_engine/preprocessing/category_encoder.py

+
+    {ignore_format}
+
+    {errors}


do we need this param in this class?

Yes, it also has to decide whether to output missing values for unseen categories or to fail when it encounters them.

solegalli · 2022-05-31T08:15:40Z

feature_engine/preprocessing/category_encoder.py

+        X = self._check_transform_input_and_state(X)
+
+        for feature, levels in self.encoder_dict_.items():
+            X = X.assign(**{feature: pd.Categorical(X[feature], levels)})


I was once told that this ** think is not a nice way to code. Personally I am not sure that is true, so I would like to ask you for more info on why you selected this implementation?

In this particular case, there's little difference, but if the check_X function were changed so as not to copy the data, this would avoid modifying the input in place. See #457

solegalli · 2022-06-22T09:59:59Z

Hi @david-cortes

FYI this PR: david-cortes#2

restructures the base encoder so that you can inherit in this new class without adding the extra parameters.

Hope this helps

Cheers!

david-cortes · 2022-06-26T17:40:22Z

Updated, but I'm again not very sure if this is how you intended the changes to be.

solegalli

Hi @david-cortes

Thank you for the changes.

We are almost there.

I introduced some changes in the base_encoder, so that you can implement the check for missing data only in the class you are creating.

So please, do not add more parameters or methods to the base_encoder. Whatever MatchColumns needs, needs to be just in that class.

Do you think you could do a few more changes to accommodate that?

See my comments throughout the code.

Thank you!

solegalli · 2022-06-26T18:45:57Z

feature_engine/encoding/base_encoder.py


    {ignore_format}
+
+    allow_missing: bool


this parameter should only be added to the MatchColumns class

solegalli · 2022-06-26T18:46:56Z

feature_engine/encoding/base_encoder.py


        # check if dataset contains na
-        _check_contains_na(X, self.variables_)
+        if not self.allow_missing:


I created the _fit method and removed the check_na from this method, so you can use it in MatchVariables.

In short, this logic that you are using here, you need to move it to MatchVariables.

solegalli · 2022-06-26T18:47:17Z

feature_engine/encoding/base_encoder.py

+        self._check_nas_in_result(X)
+        return X
+
+    def _check_nas_in_result(self, X):


this method needs to be moved to MatchVariables

solegalli · 2022-06-26T18:48:49Z

feature_engine/preprocessing/match_categories.py

+        errors: str = "raise",
+    ) -> None:
+        check_parameter_errors(errors, ["ignore", "raise"])
+        super().__init__(variables, ignore_format, allow_missing=errors != "raise")


the logic for missing values needs to be moved here.

Also, we normally call this parameter "missing" and takes values "raise" and "ignore". So for consistency, could we use the same names?

In the other categorical encoders (such as CountFrequency), it it called "errors".

solegalli · 2022-06-26T18:49:36Z

feature_engine/preprocessing/match_categories.py

+            y is not needed in this encoder. You can pass y or None.
+        """
+        X = check_X(X)
+        self._check_or_select_variables(X)


after this method, you need to add the logic if self.missing=="raise": then do something

david-cortes · 2022-06-26T20:10:59Z

Redid the PR based on the current master branch - please take a look to see if this is how the relationships between classes were meant to be.

david-cortes · 2022-06-26T20:15:06Z

Actually, from a look at the other commits that you had pushed for this PR, I'm now noticing that it differed a bit from the current master branch and have some confusion over which changes should be kept and from where did changes come (master branch, my old PR, your commits to my PR).

Were methods like _fit intended to be used for this PR, or were they meant to be removed from an earlier PR?

solegalli · 2022-06-27T08:45:35Z

Hi @david-cortes

Did you had a chance to have the look at the changes I made in the PR I made to your repo? Because there you would have a better view of what was "new".

I modified the base_encoder so that you could accommodate the allow_missing functionality. For this reason, I created the _fit() method, that would be used by all encoders, and left the _check_categorical_variables() without the check for na, so that you could use it in this class, without re-writing all the code again.

Most of my changes have disappeared now :_(

I suggest you take a look at the PR you merged to your repo, hopefully you will see what was meant to be retained. If not, maybe the best would be to checkout the branch from which I made the PR to your repo and start from there.

Hope this helps
Cheers

david-cortes · 2022-06-27T18:49:40Z

Redone based on the earlier branch, but this time removing the additional parameter in the base class. Although, I am still not sure that _check_nas_in_result is better left as a duplicated piece of code in MatchCategories instead of a method in the base class.

solegalli · 2022-06-29T07:38:14Z

Hi @david-cortes

It's looking good. We need to add the documentation now. We need to call this class from:

docs/index
docs/api/index
docs/user_guide/index

and within the api and user_guide folders we need to add an rst file, in the first one just to retrieve the docstrings, and in the second one with a demo of the class.

Would that be ok? You could look at the ones from MatchColumns as inspiration.

david-cortes · 2022-06-29T19:40:23Z

Added docs and an example guide.

solegalli · 2022-07-05T13:49:30Z

docs/user_guide/preprocessing/MatchCategories.rst

+
+This transformer is useful when creating custom transformers for categorical columns,
+as well as when passing categorical columns to modeling packages that support them
+natively but which leaving their encoding to the user.


Could you mention which package(s) support categorical variables without encoding them to numbers?

solegalli · 2022-07-05T13:50:26Z

Hi @david-cortes

Thank you so much for the code update. I made some minor changes here: david-cortes#3

Would you have a look and merge?

Also, after merging, would you mind adding the names of the packages that support categorical variables without encoding here: docs/user_guide/preprocessing/MatchCategories.rst at around line 171? I left a comment.

Thanks a lot!

david-cortes · 2022-07-05T17:17:25Z

Added examples.

add simple categorical levels encoder

d346573

solegalli reviewed May 31, 2022

View reviewed changes

david-cortes and others added 5 commits June 22, 2022 11:10

add simple categorical levels encoder

6db05f5

revert base encoder to original representation

8017b8d

sorts imports init

bdd307e

reorganizes base encoder and encoders

cd66f27

reorganises transformer based on baseencoder

f4669de

david-cortes added 3 commits June 26, 2022 19:04

solve merge conflict

1c77925

Merge branch 'main' into categ_encoder

1ab08ac

redo for new branch

386e3e5

david-cortes changed the title ~~Add simple CategoryEncoder~~ Add simple MatchCategories Jun 26, 2022

solegalli reviewed Jun 26, 2022

View reviewed changes

david-cortes force-pushed the categ_encoder branch from 386e3e5 to 64e9c56 Compare June 26, 2022 20:10

rebase on branch from PR

7e76846

david-cortes force-pushed the categ_encoder branch from 64e9c56 to 7e76846 Compare June 27, 2022 18:48

add docs for MatchCategories

d9bbab6

david-cortes and others added 5 commits June 29, 2022 21:45

add transformer to readme

dbe6193

reword class function

487f2d6

reword user guide

8e9c4a7

replace errors by missing_values, expand tests

6d27c0b

add transformer to general checks file, remove unused import

ed923f8

solegalli reviewed Jul 5, 2022

View reviewed changes

solegalli changed the title ~~Add simple MatchCategories~~ Add new transformer MatchCategories to preprocessing module Jul 5, 2022

add example packages in docs

bbcf689

solegalli merged commit 60a7045 into feature-engine:main Jul 5, 2022

Conversation

david-cortes commented May 12, 2022

Uh oh!

solegalli commented May 13, 2022

Uh oh!

solegalli left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

solegalli commented Jun 22, 2022

Uh oh!

david-cortes commented Jun 26, 2022

Uh oh!

solegalli left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

david-cortes commented Jun 26, 2022

Uh oh!

david-cortes commented Jun 26, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

solegalli commented Jun 27, 2022

Uh oh!

david-cortes commented Jun 27, 2022

Uh oh!

solegalli commented Jun 29, 2022

Uh oh!

david-cortes commented Jun 29, 2022

Uh oh!

Choose a reason for hiding this comment

Uh oh!

solegalli commented Jul 5, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

david-cortes commented Jul 5, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

david-cortes commented Jun 26, 2022 •

edited

Loading

solegalli commented Jul 5, 2022 •

edited

Loading