add option to encode unseen categories with zeros for `CountFrequencyEncoder` by david-cortes · Pull Request #456 · feature-engine/feature_engine

david-cortes · 2022-05-12T17:40:11Z

Fixes #417

This PR adds an option for `CountFrequencyEncoder` to output an encoding of zero when `transform` is called on new data containing categories that were not present in the training data. This makes sense for this particular class, since unseen categories have a count of zero by definition.
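
The idea can be illustrated with plain pandas. This is a minimal sketch rather than the code in this PR: `encode_counts` is a hypothetical helper and the data is made up.

```python
import pandas as pd


def encode_counts(train: pd.Series, new: pd.Series) -> pd.Series:
    """Replace each category in `new` by its count in `train` (hypothetical helper)."""
    counts = train.value_counts()
    # Categories absent from `train` map to NaN; their count is zero by definition.
    return new.map(counts).fillna(0).astype(int)


train = pd.Series(["a", "a", "b"])
new = pd.Series(["a", "b", "c"])  # "c" never appeared in the training data
encoded = encode_counts(train, new)
```

Note that this sketch would also turn a pre-existing missing value in `new` into 0, which a real transformer would likely want to treat separately.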

david-cortes · 2022-05-12T18:04:04Z

The failing tests are not from the module that was modified and do not look to be related to this PR. Probably some incompatibility with newer scikit-learn versions.

solegalli · 2022-05-13T06:23:20Z

Will have a look when I am back from hols, apologies for the delay

solegalli

Hi @david-cortes

Thanks a lot for the changes! And apologies for the delay. I am just back from holidays and had to fix those failing tests as they were blocking all PRs.

On that front, could you please rebase this PR onto the latest version of main?

I am thinking of expanding the options to encode unseen data to more encoders:
- CountFrequencyEncoder: by 0 (this PR)
- OrdinalEncoder: by -1 (issue #428)
- TargetEncoder: by the mean of the target over all observations (no issue yet)
- WoE and PR: we need to think about these.

So instead of calling the option "zeros", could we call it "encode"?
I added more details throughout the code.

Please let me know your thoughts and if you could implement this. Thank you!

solegalli · 2022-05-31T07:55:28Z

feature_engine/encoding/base_encoder.py

-                .columns[X[self.encoder_dict_.keys()].isnull().any()]
-                .tolist()
-            )
+        if self.errors != "zeros":


I am thinking of expanding the allowed values for this parameter to "raise", "ignore" and "encode". The first two will perform the same functionality across transformers. The third one would perform different functionality in different transformers. In this transformer, it will replace the NaN by 0.

In the OrdinalEncoder it would replace by -1, in the target mean encoder by the target prior, and so on. Since this is a base class, we would need to add that versatility here.

Changed it for this particular transformer.
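
One way to picture the versatility asked for above is a base class whose subclasses each declare their own fill value. This is only a sketch under stated assumptions: the class names, the `unseen_fill` attribute and the list-based `transform` are all hypothetical and are not feature_engine's API; only the fill values 0 and -1 come from the comment.

```python
import math
from typing import Dict, List


class SketchEncoderBase:
    """Hypothetical base class; each subclass sets its own fill value."""

    # Value written for unseen categories when errors="encode".
    unseen_fill: float = math.nan

    def __init__(self, mapping: Dict[str, float], errors: str = "ignore") -> None:
        self.mapping = mapping
        self.errors = errors

    def transform(self, values: List[str]) -> List[float]:
        out = []
        for value in values:
            if value in self.mapping:
                out.append(self.mapping[value])
            elif self.errors == "raise":
                raise ValueError(f"Unseen category: {value!r}")
            elif self.errors == "encode":
                out.append(self.unseen_fill)
            else:  # "ignore": leave a missing value behind
                out.append(math.nan)
        return out


class SketchCountEncoder(SketchEncoderBase):
    unseen_fill = 0  # an unseen category has a count of zero


class SketchOrdinalEncoder(SketchEncoderBase):
    unseen_fill = -1  # sentinel outside the learned ordinals
```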

solegalli · 2022-05-31T07:56:00Z

feature_engine/encoding/base_encoder.py

-                    "During the encoding, NaN values were introduced in the feature(s) "
-                    f"{nan_columns_str}."
+                msg = (
+                    "During the encoding, NaN values were introduced " +


Can we avoid using + at the end of the line, so the code stays consistent across our codebase?

Good idea to unify the text in a variable, thank you!
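
For reference, Python joins adjacent string literals inside parentheses, so the message can be built without a trailing `+`. A small sketch, with a made-up value for `nan_columns_str`:

```python
nan_columns_str = "var_A, var_B"  # made-up column names for the example

# Adjacent string literals inside parentheses are joined by the parser,
# so no "+" is needed at the end of the line.
msg = (
    "During the encoding, NaN values were introduced in the feature(s) "
    f"{nan_columns_str}."
)
```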

solegalli · 2022-05-31T07:56:55Z

feature_engine/encoding/base_encoder.py


    {errors}
+
+    allow_zero_enc : bool


I would prefer not to add an extra parameter. I don't think we really need it, do we?

solegalli · 2022-05-31T07:57:33Z

feature_engine/encoding/base_encoder.py

        errors: str = "ignore",
+        allow_zero_enc: bool = False
    ) -> None:
-        if errors not in ["raise", "ignore"]:


can we just expand this to if errors not in ["raise", "ignore", "encode"] instead of using an extra parameter?

solegalli · 2022-05-31T07:59:51Z

feature_engine/encoding/count_frequency.py

                n_obs = float(len(X))
-                self.encoder_dict_[var] = (X[var].value_counts() / n_obs).to_dict()
+                self.encoder_dict_[var] = (
+                    X[var].value_counts() / n_obs


Here we could do value_counts(normalize=True) instead of the division. Minor change and not related to this PR though. But since we are here :p
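
A quick check of the equivalence, as a sketch on made-up data:

```python
import pandas as pd

X = pd.DataFrame({"var": ["a", "a", "b", "c"]})  # made-up data

# Original approach: counts divided by the number of rows.
n_obs = float(len(X))
by_division = (X["var"].value_counts() / n_obs).to_dict()

# Suggested approach: let pandas normalize the counts.
by_normalize = X["var"].value_counts(normalize=True).to_dict()
```

The two agree only when the column has no missing values: `value_counts` drops NaN by default, so `normalize=True` divides by the number of non-missing rows, whereas `len(X)` counts every row.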

feature_engine/encoding/count_frequency.py

david-cortes · 2022-06-04T15:40:09Z

Redone with the new base classes for categoricals.

solegalli · 2022-06-12T07:37:48Z

feature_engine/encoding/base_encoder.py

        variables: Union[None, int, str, List[Union[str, int]]] = None,
        ignore_format: bool = False,
        errors: str = "ignore",
+        supports_errors_encode: bool = False


hi @david-cortes

We can't really add parameters unless they are strictly necessary, because that makes classes less user-friendly.

I am trying to think what would be the best way forward, given that at the moment all classes take "raise" and "ignore", and we want the CountFrequencyEncoder to take the extra value.

I think probably the best way is to take this functionality out of the base class and into a function that takes the allowed strings as inputs.

Unless you have a different idea?

Yes, that also would do.
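
A rough sketch of what such a function could look like. The name `check_parameter_errors`, its signature and the message wording are assumptions for illustration, not the function that was eventually added to the codebase.

```python
from typing import List


def check_parameter_errors(errors: str, accepted_values: List[str]) -> None:
    """Raise if `errors` is not one of `accepted_values` (hypothetical helper)."""
    if errors not in accepted_values:
        raise ValueError(
            f"errors takes only values {', '.join(accepted_values)}. "
            f"Got {errors} instead."
        )


# Most encoders would keep the two original options ...
check_parameter_errors("ignore", ["raise", "ignore"])
# ... while a count/frequency encoder would also accept "encode".
check_parameter_errors("encode", ["raise", "ignore", "encode"])
```

Each encoder passes its own list from `__init__`, so the base class needs no extra parameter.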

solegalli · 2022-06-12T07:38:37Z

feature_engine/encoding/_docstrings.py

+        error. If 'ignore', then unseen categories will be set as NaN and a warning will
+        be raised instead. If 'encode', then unseen categories will be encoded according
+        to the default strategy from the transformer, provided that it supports it.
+    """.rstrip()


At the moment, most classes accept only "raise" and "ignore". So I would not change the original docstring.

solegalli · 2022-06-12T07:39:54Z

feature_engine/encoding/_docstrings.py

+        to the default strategy from the transformer, provided that it supports it.
+    """.rstrip()
+
+_errors_docstring_with_encode = _errors_docstring + """


At the moment, only the count frequency encoder has the extra "encode" functionality. And when we expand this to other classes, each one will do something different. So we would probably have to write independent docstrings for each one of them. I am not sure that having a centralized copy is suitable in this case. I would just re-word it in the class.

But not all of the classes would end up supporting the "encode" option.

solegalli · 2022-06-12T07:42:52Z

tests/test_encoding/test_count_frequency_encoder.py

-    with pytest.raises(ValueError):
-        encoder = CountFrequencyEncoder()
-        encoder.fit(df_enc_na)
+    for errors in ["raise", "ignore", "encode"]:


why do we need this change? The NA check comes before the unseen categories check.

It should still fail when passing "encode" if the fit input has NAs.

solegalli · 2022-06-12T07:44:30Z

tests/test_encoding/test_count_frequency_encoder.py

-        encoder = CountFrequencyEncoder()
-        encoder.fit(df_enc)
-        encoder.transform(df_enc_na)
+    for errors in ["raise", "encode"]:


Interesting: here you skipped "ignore", so I guess "ignore" would ignore both the already existing NaN and the newly introduced ones... It feels like a code smell in our original source code.
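
The distinction between the two kinds of NaN can be made explicit with plain pandas. A minimal sketch on made-up data, not taken from the encoder's source:

```python
import numpy as np
import pandas as pd

mapping = {"a": 2, "b": 1}  # made-up encoding learned during fit
col = pd.Series(["a", np.nan, "c"])  # one existing NaN, one unseen category

encoded = col.map(mapping)

# NaN that was already in the input versus NaN created by the mapping.
already_missing = col.isna()
introduced = encoded.isna() & ~already_missing
```

A check that only looks at `encoded.isna()` cannot tell these two cases apart, which is the ambiguity raised above.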

solegalli

Hi @david-cortes

Thank you very much for the code changes.

The code looks good, and I think it could be ready to go.

Your tests made me think that there might be something weird with how we handle existing NaN and newly introduced NaN. I would like to hear what your thinking was when you created the tests.

Other than that, my main concern is around adding a new parameter to the base class. I would prefer we did not do that.

I suggested replacing the functionality in the base class with an external function. That would mean slightly changing the code in all other encoding classes. I would like to hear your thoughts on this. Maybe you have a better idea?

thanks a lot!

david-cortes · 2022-06-12T14:44:01Z

I am not sure what would be the right way to handle explicit missing values that are not from unseen categories.

I think ideally whether to handle them should be configured through a different option than this one, and ideally one of the options for how to handle them would be to encode them into a separate category "missing" or similar (which would also require controlling whether to mix up the "encode" values into it depending on the transformer). However pandas functions and methods don't lend themselves well towards handling missing values and that would require a lot more modifications.

solegalli · 2022-06-22T09:03:24Z

Hi @david-cortes

Sorry for the delay. Please see this PR: david-cortes#1

I mostly reorganized the code you wrote.

The conflict is probably through the main branches not being in sync. If you can't resolve it let me know.

Thank you so much for the great contribution!

david-cortes · 2022-06-22T17:30:36Z

Updated, but I'm not sure if the merge conflict solving I made here preserved the changes that you wanted.

solegalli

Hi @david-cortes

Yes, this is looking great.

Unfortunately, the code style changed. We've now got an indentation pattern that we do not normally use in the codebase.

Could you please have a look and change it back to what it looked like in my PR?

Thank you!

feature_engine/base_transformers.py

solegalli · 2022-06-22T17:54:02Z

feature_engine/encoding/base_encoder.py

-    _ignore_format_docstring,
-    _variables_docstring,
-)
+from feature_engine.dataframe_checks import (_check_contains_na,


can we revert to the original indentation please? so it is coherent with the rest of the codebase?

solegalli · 2022-06-22T17:54:37Z

feature_engine/encoding/base_encoder.py

        variables: Union[None, int, str, List[Union[str, int]]] = None,
        ignore_format: bool = False,
-        errors: str = "ignore",
+        errors: str = "ignore"


can we please keep the comma after the parameter?

solegalli · 2022-06-22T17:54:54Z

feature_engine/encoding/count_frequency.py

-from feature_engine.encoding.base_encoder import (
-    CategoricalInitExpandedMixin,
-    CategoricalMethodsMixin,
+from feature_engine.encoding._docstrings import (_errors_docstring,


indentation

solegalli · 2022-06-22T17:55:41Z

tests/test_encoding/test_count_frequency_encoder.py

+
 import pandas as pd
 import pytest
+from numpy import nan


can we use isort here?

david-cortes · 2022-06-22T17:58:34Z

@solegalli What linter are you using for indentation?

solegalli · 2022-06-22T20:05:37Z

black

david-cortes · 2022-06-22T20:11:38Z

Reformatted with black and isort.

solegalli · 2022-06-23T06:15:47Z

Hi @david-cortes

Not sure all the files were reformatted. Also, can we remove the base transformer from this PR?

Thank you!

david-cortes · 2022-06-23T21:21:32Z

If I run either black or isort on this package, it will reformat pretty much all the files, not just the ones connected with this PR.

david-cortes · 2022-06-23T21:22:38Z

The base class CategoricalInitExpandedMixin is used by other transformers too.

solegalli · 2022-06-24T08:39:28Z

You can either run black and isort on a specific file, e.g. black feature_engine\encoding\__init__.py

or you can run isort on the entire codebase and then commit only the relevant files.

I would suggest the first option.

Also, this is the file that needs to be removed from this PR: feature_engine/base_transformers.py

If you have a look at the files that changed you should be able to find it.

david-cortes · 2022-06-24T16:54:11Z

Updated.

solegalli · 2022-06-26T13:02:52Z

Thank you! merging as we speak.

solegalli reviewed May 31, 2022

redo zero-encoding with new base transformers

6e10090

david-cortes force-pushed the count_zero_categ branch from 49a3710 to 6e10090 Compare June 4, 2022 15:39

david-cortes added 3 commits June 4, 2022 17:47

linter

94203ae

linter

05b81cb

use pandas built-in normalization

7b95f75

solegalli reviewed Jun 12, 2022

david-cortes and others added 14 commits June 12, 2022 16:44

fix duplicated lines

c6ad8d1

split base categorical class according to whether it supports 'encode…

182da73

…' errors or not

fix docs

d7ac950

missing comma

0d834f0

redo zero-encoding with new base transformers

5468f1a

linter

8e25b70

linter

dde3072

use pandas built-in normalization

5821e0c

fix duplicated lines

bd34740

split base categorical class according to whether it supports 'encode…

e6281a5

…' errors or not

fix docs

02366bd

missing comma

b0f04b8

remove unnecesary docstring from common docstrings file

e2bfe52

reverts base encoder to original representation

69a29b9

solegalli added 4 commits June 22, 2022 09:56

adds function to check parameter errors

4ed5579

reorganizes docstrings for param errors

9cdb751

reorganize init checks

b44d242

reorganizes tests

5bf7c3e

solegalli changed the title ~~Add option to output zeros from CountFrequencyEncoder~~ add option to encode unseen categories with zeros for CountFrequencyEncoder Jun 22, 2022

solegalli added 3 commits June 22, 2022 10:50

add test for frequency

d3354cd

fix codestyle

15ccce4

sorts imports

aa3efe7

solve merge conflicts

cc830c0

solegalli reviewed Jun 22, 2022

formatting

6bd3b1c

formatting

7ad5f95

solegalli added 4 commits June 26, 2022 14:41

reorganize imports and parameters

916291d

add Note on missing values for unseen categories

61363d2

reorganize imports

5cb23d7

remove white space

69171a0

solegalli merged commit 0c59a69 into feature-engine:main Jun 26, 2022
