add option to encode unseen categories with zeros for CountFrequencyEncoder#456
Conversation
|
The failing tests are not from the module that was modified and do not look to be related to this PR. Probably some incompatibility with newer scikit-learn versions. |
|
Will have a look when I am back from hols, apologies for the delay |
solegalli
left a comment
There was a problem hiding this comment.
Thanks a lot for the changes! And apologies for the delay. I am just back from holidays and had to fix those failing tests as they were blocking all PRs.
On that front, could you please rebase the lastest version of main to this PR?
I am thinking of expanding the options to encode unseen data to more encoders:
CountFrequencyEncoder by 0 (this PR)
OrdinalEncoder by -1 (this issue #428)
TargetEncoder by mean of target for all observations (no issue yet)
WoE and PR we need to think.
So instead of calling it "zeroes" could we call the parameter "encode"?
I add more details throughout the code.
Please let me know your thoughts and if you could implement this. Thank you!
| .columns[X[self.encoder_dict_.keys()].isnull().any()] | ||
| .tolist() | ||
| ) | ||
| if self.errors != "zeros": |
There was a problem hiding this comment.
I am thinking of expanding the allowed parameters for this parameter into "raise", "ignore" and "encode". The first 2 will perform the same funtionality across transformers. The third one would perform different functionality in different transformers. In this transformer, it will replace the nan by 0.
In the ordinalEncoder it would replace by -1. In the target mean by the target prior, and so on. Since this is a base class, we would need to add that versatility here.
There was a problem hiding this comment.
Changed it for this particular transformer.
| "During the encoding, NaN values were introduced in the feature(s) " | ||
| f"{nan_columns_str}." | ||
| msg = ( | ||
| "During the encoding, NaN values were introduced " + |
There was a problem hiding this comment.
can we avoid using + at the end of line? so we have consistent code across our codebase?
Good idea to unify the text in a variable, thank you!
|
|
||
| {errors} | ||
|
|
||
| allow_zero_enc : bool |
There was a problem hiding this comment.
I would prefer not to add an extra parameter. I don't think we really need it, do we?
| errors: str = "ignore", | ||
| allow_zero_enc: bool = False | ||
| ) -> None: | ||
| if errors not in ["raise", "ignore"]: |
There was a problem hiding this comment.
can we just expand this to if errors not in ["raise", "ignore", "encode"] instead of using an extra parameter?
| n_obs = float(len(X)) | ||
| self.encoder_dict_[var] = (X[var].value_counts() / n_obs).to_dict() | ||
| self.encoder_dict_[var] = ( | ||
| X[var].value_counts() / n_obs |
There was a problem hiding this comment.
here we could do value_counts(normalize=True) instead of the division. Minor change and not related to this PR thought. But since we are here :p
49a3710 to
6e10090
Compare
|
Redone with the new base classes for categoricals. |
| variables: Union[None, int, str, List[Union[str, int]]] = None, | ||
| ignore_format: bool = False, | ||
| errors: str = "ignore", | ||
| supports_errors_encode: bool = False |
There was a problem hiding this comment.
We can't really add parameters unless they are strictly necessary, because that turns classes less user friendly.
I am trying to think what would be the best way forward, given that at the moment, all classes take "raise" and "ignore" and we want the Count frequency to take the extra parameter.
I think, probably the best way is to take this functionality out of the base class and into a function, that takes the allowed strings as inputs.
Unless you have a different idea?
There was a problem hiding this comment.
Yes, that also would do.
| error. If 'ignore', then unseen categories will be set as NaN and a warning will | ||
| be raised instead. If 'encode', then unseen categories will be encoded according | ||
| to the default strategy from the transformer, provided that it supports it. | ||
| """.rstrip() |
There was a problem hiding this comment.
At the moment, most classes accept only "raise" and "ignore". So I would not change the original docstring.
| to the default strategy from the transformer, provided that it supports it. | ||
| """.rstrip() | ||
|
|
||
| _errors_docstring_with_encode = _errors_docstring + """ |
There was a problem hiding this comment.
at the moment, only the count frequency has the "encode" extra functionality. And when we expand this to other classes, each one will do something different. So we would probably have to right independent docstrings for each one of them. I am not sure that having a centralized copy is suitable in this case. I would just re-word it in the class.
There was a problem hiding this comment.
But not all of the clases would end up supporting the "encode" option.
| with pytest.raises(ValueError): | ||
| encoder = CountFrequencyEncoder() | ||
| encoder.fit(df_enc_na) | ||
| for errors in ["raise", "ignore", "encode"]: |
There was a problem hiding this comment.
why do we need this change? The NA check comes before the unseen categories check.
There was a problem hiding this comment.
It should still fail when passing "encode" if the fit input has NAs.
| encoder = CountFrequencyEncoder() | ||
| encoder.fit(df_enc) | ||
| encoder.transform(df_enc_na) | ||
| for errors in ["raise", "encode"]: |
There was a problem hiding this comment.
interesting, here you skipped the "ignore", so I guess, "ignore" would ignore both already existing nan and the newly introduced.... it feels like a code smell in our original source code...
solegalli
left a comment
There was a problem hiding this comment.
Thank you very much for the code changes.
The code looks good, and I think it could be ready to go.
Your tests made me think that there might be something weird with how we handle existing NAN and newly introduced NaN, I would like to hear your thoughts when creating the tests?
Other than that, my main concern is around adding a new parameter to the base class. I would prefer we did not do that.
I suggested replacing the functionality in the main class by an external function. That would mean changing slightly the code in all other encoding classes. I would like to hear your thoughts on this. Maybe you have a better idea?
thanks a lot!
|
I am not sure what would be the right way to handle explicit missing values that are not from unseen categories. I think ideally whether to handle them should be configured through a different option than this one, and ideally one of the options for how to handle them would be to encode them into a separate category "missing" or similar (which would also require controlling whether to mix up the "encode" values into it depending on the transformer). However pandas functions and methods don't lend themselves well towards handling missing values and that would require a lot more modifications. |
CountFrequencyEncoderCountFrequencyEncoder
|
Sorry for the delay. Please see this PR: david-cortes#1 I mostly reorganized the code you wrote. The conflict is probably through the main branches not being in sync. If you can't resolve it let me know. Thank you so much for the great contribution! |
|
Updated, but I'm not sure if the merge conflict solving I made here preserved the changes that you wanted. |
solegalli
left a comment
There was a problem hiding this comment.
Yes, this is looking great.
Unfortunately, the codestyle change. We've got an indentation pattern that we do not normally use in the codebase.
Could you please have a look and change it back to what it looked like in my PR?
Thank you!
| _ignore_format_docstring, | ||
| _variables_docstring, | ||
| ) | ||
| from feature_engine.dataframe_checks import (_check_contains_na, |
There was a problem hiding this comment.
can we revert to the original indentation please? so it is coherent with the rest of the codebase?
| variables: Union[None, int, str, List[Union[str, int]]] = None, | ||
| ignore_format: bool = False, | ||
| errors: str = "ignore", | ||
| errors: str = "ignore" |
There was a problem hiding this comment.
can we please keep the comma after the parameter?
| from feature_engine.encoding.base_encoder import ( | ||
| CategoricalInitExpandedMixin, | ||
| CategoricalMethodsMixin, | ||
| from feature_engine.encoding._docstrings import (_errors_docstring, |
|
|
||
| import pandas as pd | ||
| import pytest | ||
| from numpy import nan |
|
@solegalli What linter are you using for indentation? |
|
black |
|
Reformatted with black and isort. |
|
Not sure all the files were reformatted. Also, can we remove the base transformer from this PR? Thank you! |
|
If I run either |
|
The base class |
|
you can either black an isort a specific file eg or you can isort the entire codebase and then commit only the relevant files. I would suggest the first option. Also, this is the file that needs to be removed from this PR: If you have a look at the files that changed you should be able to find it. |
|
Updated. |
|
Thank you! merging as we speak. |
Fixes #417
This PR adds an option for
CountFrequencyEncoderto output an encoding result of zero when callingtransformon new data that has categories that were not present in the training data, which makes sense for this particular class since unseen categories would have a count of zero by definition.