Skip to content

add option to encode unseen categories with zeros for CountFrequencyEncoder#456

Merged
solegalli merged 32 commits intofeature-engine:mainfrom
david-cortes:count_zero_categ
Jun 26, 2022
Merged

add option to encode unseen categories with zeros for CountFrequencyEncoder#456
solegalli merged 32 commits intofeature-engine:mainfrom
david-cortes:count_zero_categ

Conversation

@david-cortes
Copy link
Contributor

Fixes #417

This PR adds an option for CountFrequencyEncoder to output an encoding result of zero when calling transform on new data that has categories that were not present in the training data, which makes sense for this particular class since unseen categories would have a count of zero by definition.

@david-cortes
Copy link
Contributor Author

The failing tests are not from the module that was modified and do not look to be related to this PR. Probably some incompatibility with newer scikit-learn versions.

@solegalli
Copy link
Collaborator

Will have a look when I am back from hols, apologies for the delay

Copy link
Collaborator

@solegalli solegalli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @david-cortes

Thanks a lot for the changes! And apologies for the delay. I am just back from holidays and had to fix those failing tests as they were blocking all PRs.

On that front, could you please rebase the lastest version of main to this PR?

I am thinking of expanding the options to encode unseen data to more encoders:
CountFrequencyEncoder by 0 (this PR)
OrdinalEncoder by -1 (this issue #428)
TargetEncoder by mean of target for all observations (no issue yet)
WoE and PR we need to think.

So instead of calling it "zeroes" could we call the parameter "encode"?
I add more details throughout the code.

Please let me know your thoughts and if you could implement this. Thank you!

.columns[X[self.encoder_dict_.keys()].isnull().any()]
.tolist()
)
if self.errors != "zeros":
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am thinking of expanding the allowed parameters for this parameter into "raise", "ignore" and "encode". The first 2 will perform the same funtionality across transformers. The third one would perform different functionality in different transformers. In this transformer, it will replace the nan by 0.

In the ordinalEncoder it would replace by -1. In the target mean by the target prior, and so on. Since this is a base class, we would need to add that versatility here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed it for this particular transformer.

"During the encoding, NaN values were introduced in the feature(s) "
f"{nan_columns_str}."
msg = (
"During the encoding, NaN values were introduced " +
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we avoid using + at the end of line? so we have consistent code across our codebase?

Good idea to unify the text in a variable, thank you!


{errors}

allow_zero_enc : bool
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer not to add an extra parameter. I don't think we really need it, do we?

errors: str = "ignore",
allow_zero_enc: bool = False
) -> None:
if errors not in ["raise", "ignore"]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we just expand this to if errors not in ["raise", "ignore", "encode"] instead of using an extra parameter?

n_obs = float(len(X))
self.encoder_dict_[var] = (X[var].value_counts() / n_obs).to_dict()
self.encoder_dict_[var] = (
X[var].value_counts() / n_obs
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here we could do value_counts(normalize=True) instead of the division. Minor change and not related to this PR thought. But since we are here :p

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed.

@david-cortes
Copy link
Contributor Author

Redone with the new base classes for categoricals.

variables: Union[None, int, str, List[Union[str, int]]] = None,
ignore_format: bool = False,
errors: str = "ignore",
supports_errors_encode: bool = False
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hi @david-cortes

We can't really add parameters unless they are strictly necessary, because that turns classes less user friendly.

I am trying to think what would be the best way forward, given that at the moment, all classes take "raise" and "ignore" and we want the Count frequency to take the extra parameter.

I think, probably the best way is to take this functionality out of the base class and into a function, that takes the allowed strings as inputs.

Unless you have a different idea?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that also would do.

error. If 'ignore', then unseen categories will be set as NaN and a warning will
be raised instead. If 'encode', then unseen categories will be encoded according
to the default strategy from the transformer, provided that it supports it.
""".rstrip()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the moment, most classes accept only "raise" and "ignore". So I would not change the original docstring.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

to the default strategy from the transformer, provided that it supports it.
""".rstrip()

_errors_docstring_with_encode = _errors_docstring + """
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

at the moment, only the count frequency has the "encode" extra functionality. And when we expand this to other classes, each one will do something different. So we would probably have to right independent docstrings for each one of them. I am not sure that having a centralized copy is suitable in this case. I would just re-word it in the class.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But not all of the clases would end up supporting the "encode" option.

with pytest.raises(ValueError):
encoder = CountFrequencyEncoder()
encoder.fit(df_enc_na)
for errors in ["raise", "ignore", "encode"]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need this change? The NA check comes before the unseen categories check.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should still fail when passing "encode" if the fit input has NAs.

encoder = CountFrequencyEncoder()
encoder.fit(df_enc)
encoder.transform(df_enc_na)
for errors in ["raise", "encode"]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

interesting, here you skipped the "ignore", so I guess, "ignore" would ignore both already existing nan and the newly introduced.... it feels like a code smell in our original source code...

Copy link
Collaborator

@solegalli solegalli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @david-cortes

Thank you very much for the code changes.

The code looks good, and I think it could be ready to go.

Your tests made me think that there might be something weird with how we handle existing NAN and newly introduced NaN, I would like to hear your thoughts when creating the tests?

Other than that, my main concern is around adding a new parameter to the base class. I would prefer we did not do that.

I suggested replacing the functionality in the main class by an external function. That would mean changing slightly the code in all other encoding classes. I would like to hear your thoughts on this. Maybe you have a better idea?

thanks a lot!

@david-cortes
Copy link
Contributor Author

david-cortes commented Jun 12, 2022

I am not sure what would be the right way to handle explicit missing values that are not from unseen categories.

I think ideally whether to handle them should be configured through a different option than this one, and ideally one of the options for how to handle them would be to encode them into a separate category "missing" or similar (which would also require controlling whether to mix up the "encode" values into it depending on the transformer). However pandas functions and methods don't lend themselves well towards handling missing values and that would require a lot more modifications.

@solegalli solegalli changed the title Add option to output zeros from CountFrequencyEncoder add option to encode unseen categories with zeros for CountFrequencyEncoder Jun 22, 2022
@solegalli
Copy link
Collaborator

Hi @david-cortes

Sorry for the delay. Please see this PR: david-cortes#1

I mostly reorganized the code you wrote.

The conflict is probably through the main branches not being in sync. If you can't resolve it let me know.

Thank you so much for the great contribution!

@david-cortes
Copy link
Contributor Author

Updated, but I'm not sure if the merge conflict solving I made here preserved the changes that you wanted.

Copy link
Collaborator

@solegalli solegalli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @david-cortes

Yes, this is looking great.

Unfortunately, the codestyle change. We've got an indentation pattern that we do not normally use in the codebase.

Could you please have a look and change it back to what it looked like in my PR?

Thank you!

_ignore_format_docstring,
_variables_docstring,
)
from feature_engine.dataframe_checks import (_check_contains_na,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we revert to the original indentation please? so it is coherent with the rest of the codebase?

variables: Union[None, int, str, List[Union[str, int]]] = None,
ignore_format: bool = False,
errors: str = "ignore",
errors: str = "ignore"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we please keep the comma after the parameter?

from feature_engine.encoding.base_encoder import (
CategoricalInitExpandedMixin,
CategoricalMethodsMixin,
from feature_engine.encoding._docstrings import (_errors_docstring,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indentation


import pandas as pd
import pytest
from numpy import nan
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we use isort here?

@david-cortes
Copy link
Contributor Author

@solegalli What linter are you using for indentation?

@solegalli
Copy link
Collaborator

black

@david-cortes
Copy link
Contributor Author

Reformatted with black and isort.

@solegalli
Copy link
Collaborator

Hi @david-cortes

Not sure all the files were reformatted. Also, can we remove the base transformer from this PR?

Thank you!

@david-cortes
Copy link
Contributor Author

If I run either black or isort on this package, it will reformat pretty much all the files, not just the ones connected with this PR.

@david-cortes
Copy link
Contributor Author

The base class CategoricalInitExpandedMixin is used by other transformers too.

@solegalli
Copy link
Collaborator

you can either black an isort a specific file eg black feature_engine\encoding\__init_.py

or you can isort the entire codebase and then commit only the relevant files.

I would suggest the first option.

Also, this is the file that needs to be removed from this PR: feature_engine/base_transformers.py

If you have a look at the files that changed you should be able to find it.

@david-cortes
Copy link
Contributor Author

Updated.

@solegalli
Copy link
Collaborator

Thank you! merging as we speak.

@solegalli solegalli merged commit 0c59a69 into feature-engine:main Jun 26, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

CountFrequencyEncoder could output zeros for new categories

2 participants