Skip to content

add arcsin transformer#444

Merged
solegalli merged 55 commits intofeature-engine:mainfrom
tomtom-95:arcsin_transformation
Jun 1, 2022
Merged

add arcsin transformer#444
solegalli merged 55 commits intofeature-engine:mainfrom
tomtom-95:arcsin_transformation

Conversation

@tomtom-95
Copy link
Contributor

@tomtom-95 tomtom-95 commented May 7, 2022

closes #434

Copy link
Collaborator

@solegalli solegalli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @tomtom-95

Thank you so much for such a complete contribution! This is looking really good.

I am not sure you were done with the code, but I went ahead and added a few comments and suggestions here and there.

This file runs a battery of tests on all transformers from the module. Many tests come from the scikit-learn library through scikit-learn's check_estimator function. And we have our own version for the tests that are common for all our transformers. I pointed those out in the code you committed. So in short, if we add your transformer to the lists in there, multiple tests will be performed over it (I am yet to write some docs for developers on this, apologies)

The tests that we designed for our transformers can be found here in case you want to have a look. I can help you troubleshoot if needed.

I would focus the tests more on the functionality of the class, we can hard code the values, or we can obtain the expected values by applying np.arcsin (to save some work) which is in essence used under the hood.

Thank you so much! and let me know if you need anything.

train_t['LotArea'].hist(bins=50)


.. image:: ../../images/lotareareciprocal.png
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this image probably needs updating with the output of the arcsintransformer?

you need to commit a copy of the image that you can probably obtain from executing this code in a jupyter notebook.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to execute the code in a Jupiter notebook but does not find the file 'houseprice.csv'. What am I missing?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, at some point we need to migrate all our demos to datasets that are online and we can pick up directly from sklearn.

The house prices dataset is in kaggle, so you need to download it from here and rename it.

Having said this, I would prefer that we used the sklearn California Housing dataset which you can find and load as described here.

You can find more details about the :class:`ArcsinTransformer()` here:


- `Jupyter notebook <https://nbviewer.org/github/feature-engine/feature-engine-examples/blob/main/transformation/ReciprocalTransformer.ipynb>`_
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can commit a notebook to an example to this folder:
https://github.com/feature-engine/feature-engine-examples/tree/main/transformation

that would be the best.

Alternatively, we just remove the more details section completely. That would also be ok :)

transformer.transform(df_neg)


def test_non_fitted_error(df_vartypes):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

transformer = ArcsinTransformer(variables=None)

# test fails here because of the df_vartypes data that are passed raise ValueError
X = transformer.fit_transform(df_vartypes)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is failing.

.. code:: python

# set up the variable transformer
tf = vt.ArcsinTransformer(variables = ['LotArea', 'GrLivArea'])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are these variables between -1 and 1? I don't remember.

For sure, the California housing dataset from sklearn has a decimal variable we can use for the demo, which also gets a nice gaussian distribution after the arcsin, so it is kind of the perfect example 📦

I normally run the code I write in these examples in a jupyter notebook, and then copy and paste the output here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

# Load dataset

cal_housing = fetch_california_housing()
X = pd.DataFrame(cal_housing.data, columns=cal_housing.feature_names)
# divide by 50 to obtain values between -1 and +1
X['Latitude'] = X['Latitude'] / 50
y = cal_housing.target

I loaded the California housing dataset but I did not find a decimal variable, so I just take one of the column and divide all his value by 50 to make sure to have values between -1 and +1

Copy link
Collaborator

@solegalli solegalli May 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, that is not ideal, we'd like to make demos as "real" as possible. Would you be able to check if the breast cancer dataset has some suitable variables?

Alternatively, the kdd99 dataset, which variable description is here has some % variables (see bottom table), which if not in decimal scale already, we could divide by 100 and that would make more sense.

Worse case scenario, the iris dataset's variables are in cm, if we divide by 10, they would be decimal and it is still a valid measure.

The idea of the demo is to show that the variable originally was skewed, and after the transformation it is less skewed. So we would need to play around with the datasets, to see which one fits best. Thank you so much!!!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok I modified the demo with a variable from the breast cancer dataset



.. image:: ../../images/lotareareciprocal.png

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we could also demo the get_feature_names_out method that comes from the parent class

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok I add the method to the demo

@tomtom-95 tomtom-95 marked this pull request as draft May 7, 2022 19:14
Copy link
Collaborator

@solegalli solegalli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @tomtom-95

Thank you very much for the code changes!

I went ahead and did some checks:

Unfortunately, because this transformer can only work with variables in the -1 and 1 value range, there are a bunch of tests that we can't run. Both from scikit-learn and our own tests.

So to ignore scikit-learn's tests, we need to add in the _more_tags():

def _more_tags(self):
tags_dict = _return_tags()
tags_dict["variables"] = "numerical"
# ======= this tests fail because the transformers throw an error
# when the values are less than -1 or greater than 1. Nothing to do with the test itself but
# mostly with the data created and used in the test
msg = (
"transformers raise errors when data is outside [-1, 1] range, thus this"
"check fails"
)
tags_dict["_xfail_checks"]["check_estimators_dtypes"] = msg
tags_dict["_xfail_checks"]["check_estimators_fit_returns_self"] = msg
tags_dict["_xfail_checks"]["check_pipeline_consistency"] = msg
tags_dict["_xfail_checks"]["check_estimators_overwrite_params"] = msg
tags_dict["_xfail_checks"]["check_estimators_pickle"] = msg
tags_dict["_xfail_checks"]["check_transformer_general"] = msg
return tags_dict

All of these tests:

check_methods_subset_invariance
check_fit2d_1sample
check_fit2d_1feature
check_dict_unchanged
check_dont_overwrite_parameters
check_fit_check_is_fitted
check_n_features_in

To skip our own tests in this file:

_estimators = [
BoxCoxTransformer(),
LogTransformer(),
LogCpTransformer(),
PowerTransformer(),
ReciprocalTransformer(),
YeoJohnsonTransformer(),
ArcsinTransformer(),
]
@pytest.mark.parametrize("estimator", _estimators)
def test_check_estimator_from_sklearn(estimator):
return check_estimator(estimator)
@pytest.mark.parametrize("estimator", _estimators[3:])
def test_check_estimator_from_feature_engine(estimator):
return check_feature_engine_estimator(estimator)

We need to re-order the ArcsinTransformer right under the LogCpTransformer, and then in line 31 we need to change the range to [4:]

Also, I went ahead and checked the links relevant to the arcsin transformation, and the correct transformation is np.arcsin(np.sqrt(X[self.variables_])). You can find more info in this link and in this paper.. Also a demo here.

This transformation was designed for binomial distributions, and by including the root, it is now only possible in the space [0,1], so proportions.

Would you mind updating the code to accommodate this transformation?

Regarding tests: since we can't run our generic tests, it might be a good idea to actually keep the tests that you already drafted here (apologies, this was my mistake):

#def test_automatically_find_variables(df_vartypes):
# # test case 1: automatically select variables
# transformer = ArcsinTransformer(variables=None)
#
# # test fails here because of the df_vartypes data that are passed raise ValueError
# X = transformer.fit_transform(df_vartypes)
#
# # expected output
# transf_df = df_vartypes.copy()
# transf_df["Age"] = [0.05, 0.047619, 0.0526316, 0.0555556]
# transf_df["Marks"] = [1.0000, 0.200, 0.34637 , 0]
#
# # test inverse_transform
# Xit = transformer.inverse_transform(X)
#
# # convert numbers to original format.
# Xit["Age"] = Xit["Age"].round().astype("int64")
# Xit["Marks"] = Xit["Marks"].round(1)
#def test_fit_raises_error_if_na_in_df(df_na):
# # test case 2: when dataset contains na, fit method
# with pytest.raises(ValueError):
# transformer = ArcsinTransformer()
# transformer.fit(df_na)
#def test_transform_raises_error_if_na_in_df(df_vartypes, df_na):
# # test case 3: when dataset contains na, transform method
# with pytest.raises(ValueError):
# transformer = ArcsinTransformer()
## transformer.fit(df_vartypes)
# transformer.transform(df_na[["Name", "City", "Age", "Marks", "dob"]])

Finally, for the demo the breast cancer dataset has many variables between 0 and 1 that have a skewed distribution which is fixed by the arcsin transformation. So I would apply the transformer to all of these variables:

vars_ = [
 'mean compactness',
 'mean concavity',
 'mean concave points',
 'mean fractal dimension',
 'smoothness error',
 'compactness error',
 'concavity error',
 'concave points error',
 'symmetry error',
 'fractal dimension error',
 'worst symmetry',
 'worst fractal dimension']

And make histograms for all the variables before and after the transformation with this command:

X[vars_].hist(figsize=(20,20))
plt.show()

Would you be happy to make those changes? Thanks a lot!

@tomtom-95
Copy link
Contributor Author

All done (I think). Some problem: both ArcsinTransformer.rst file in api_doc and user_guide do not pass black test (I do not know why). The other problem is when I build the documentation, there are a lot of warnings about file in .ipynb_checkpoints folder: what is this folder?

@solegalli
Copy link
Collaborator

solegalli commented May 11, 2022

Hi @tomtom-95

very well done. I am impressed!!

These are the errors:

stylechecks run-test: commands[0] | flake8 feature_engine tests
tests/test_transformation/test_arcsin_transformer.py:1:1: F401 'pandas as pd' imported but unused
tests/test_transformation/test_arcsin_transformer.py:7:1: E302 expected 2 blank lines, found 1
ERROR: InvocationError for command /home/circleci/project/.tox/unit_tests/bin/flake8 feature_engine tests (exited with code 1

If you execute flake8 tests in your command line, you should be able to see them.

In short, you need to remove an import and add a blank line :)

The .ipynb_checkpoints folder is where jupyter stores the last execution state of your notebooks. That should not matter. not sure why that is raising an error with the docs. But the tests have passed here, so I think its fine.

@solegalli
Copy link
Collaborator

Hi @tomtom-95

Thank you for the changes to the code base and apologies for the delay.

I made a PR to your repo, which you can find here with minor cosmetic tweaks to the user guide and docstrings and reorganizing the lines of code in the tests.

I also had to fix the failing tests (unrelated to this PR) and then rebase those changes into this PR, so hopefully, if you merge the PR in your repo, we would now have here the final version of the transformer with passing tests.

I added a link with more info on how to update your fork and rebase main to your branch, in case you want to see more clearly the changes I made without the files corresponding to the unrelated changes. If you don't want to or don't have time, just merging would update the code here. I do recommend doing the exercise of rebasing though, it is a great git learning experience, lol. I struggled quite a bit with that in the past.

If you have questions let me know.

Thanks for contributing to the project!

rebase main, change wording in user guide, reformat tests
@tomtom-95
Copy link
Contributor Author

Hi @solegalli

Sorry for the delay, I merged your PR on my main branch. Now I see that all the checks have passed here (hooray!!!).
I think we are really done but if there's more please tell me, I am really learning a lot.

@solegalli solegalli changed the title Arcsin transformation added and documentation updated add arcsin transformer Jun 1, 2022
@solegalli solegalli merged commit bdc047c into feature-engine:main Jun 1, 2022
@solegalli
Copy link
Collaborator

it's done!

thank you for you amazing contribution @tomtom-95

@tomtom-95 tomtom-95 deleted the arcsin_transformation branch June 1, 2022 18:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[NEW TRANSFORMER] arcsin transformation for decimal data

3 participants