add arcsin transformer by tomtom-95 · Pull Request #444 · feature-engine/feature_engine

tomtom-95 · 2022-05-07T16:31:50Z

closes #434

solegalli

Thank you so much for such a complete contribution! This is looking really good.

I am not sure you were done with the code, but I went ahead and added a few comments and suggestions here and there.

This file runs a battery of tests on all transformers from the module. Many tests come from the scikit-learn library through scikit-learn's check_estimator function. And we have our own version for the tests that are common for all our transformers. I pointed those out in the code you committed. So in short, if we add your transformer to the lists in there, multiple tests will be performed over it (I am yet to write some docs for developers on this, apologies)

The tests that we designed for our transformers can be found here in case you want to have a look. I can help you troubleshoot if needed.

I would focus the tests more on the functionality of the class, we can hard code the values, or we can obtain the expected values by applying np.arcsin (to save some work) which is in essence used under the hood.

Thank you so much! and let me know if you need anything.

docs/user_guide/transformation/ArcsinTransformer.rst

solegalli · 2022-05-07T17:13:29Z

docs/user_guide/transformation/ArcsinTransformer.rst

+	train_t['LotArea'].hist(bins=50)
+
+
+.. image:: ../../images/lotareareciprocal.png


this image probably needs updating with the output of the arcsintransformer?

you need to commit a copy of the image that you can probably obtain from executing this code in a jupyter notebook.

I tried to execute the code in a Jupiter notebook but does not find the file 'houseprice.csv'. What am I missing?

yes, at some point we need to migrate all our demos to datasets that are online and we can pick up directly from sklearn.

The house prices dataset is in kaggle, so you need to download it from here and rename it.

Having said this, I would prefer that we used the sklearn California Housing dataset which you can find and load as described here.

solegalli · 2022-05-07T17:15:43Z

docs/user_guide/transformation/ArcsinTransformer.rst

+You can find more details about the :class:`ArcsinTransformer()` here:
+
+
+- `Jupyter notebook <https://nbviewer.org/github/feature-engine/feature-engine-examples/blob/main/transformation/ReciprocalTransformer.ipynb>`_


we can commit a notebook to an example to this folder:
https://github.com/feature-engine/feature-engine-examples/tree/main/transformation

that would be the best.

Alternatively, we just remove the more details section completely. That would also be ok :)

feature_engine/transformation/arcsin.py

tests/test_transformation/test_arcsin_transformer.py

solegalli · 2022-05-07T17:31:49Z

tests/test_transformation/test_arcsin_transformer.py

+        transformer.transform(df_neg)
+
+
+def test_non_fitted_error(df_vartypes):


we have a common test for this: https://github.com/feature-engine/feature_engine/blob/main/tests/estimator_checks/non_fitted_error_checks.py

solegalli · 2022-05-07T17:34:07Z

tests/test_transformation/test_arcsin_transformer.py

+    transformer = ArcsinTransformer(variables=None)
+
+    # test fails here because of the df_vartypes data that are passed raise ValueError 
+    X = transformer.fit_transform(df_vartypes)


this is failing.

solegalli · 2022-05-07T17:36:03Z

docs/user_guide/transformation/ArcsinTransformer.rst

+.. code:: python
+
+	# set up the variable transformer
+	tf = vt.ArcsinTransformer(variables = ['LotArea', 'GrLivArea'])


are these variables between -1 and 1? I don't remember.

For sure, the California housing dataset from sklearn has a decimal variable we can use for the demo, which also gets a nice gaussian distribution after the arcsin, so it is kind of the perfect example 📦

I normally run the code I write in these examples in a jupyter notebook, and then copy and paste the output here.

# Load dataset cal_housing = fetch_california_housing() X = pd.DataFrame(cal_housing.data, columns=cal_housing.feature_names) # divide by 50 to obtain values between -1 and +1 X['Latitude'] = X['Latitude'] / 50 y = cal_housing.target

I loaded the California housing dataset but I did not find a decimal variable, so I just take one of the column and divide all his value by 50 to make sure to have values between -1 and +1

hmm, that is not ideal, we'd like to make demos as "real" as possible. Would you be able to check if the breast cancer dataset has some suitable variables?

Alternatively, the kdd99 dataset, which variable description is here has some % variables (see bottom table), which if not in decimal scale already, we could divide by 100 and that would make more sense.

Worse case scenario, the iris dataset's variables are in cm, if we divide by 10, they would be decimal and it is still a valid measure.

The idea of the demo is to show that the variable originally was skewed, and after the transformation it is less skewed. So we would need to play around with the datasets, to see which one fits best. Thank you so much!!!

ok I modified the demo with a variable from the breast cancer dataset

solegalli · 2022-05-07T17:37:04Z

docs/user_guide/transformation/ArcsinTransformer.rst

+
+
+.. image:: ../../images/lotareareciprocal.png
+


maybe we could also demo the get_feature_names_out method that comes from the parent class

ok I add the method to the demo

solegalli

Hi @tomtom-95

Thank you very much for the code changes!

I went ahead and did some checks:

Unfortunately, because this transformer can only work with variables in the -1 and 1 value range, there are a bunch of tests that we can't run. Both from scikit-learn and our own tests.

So to ignore scikit-learn's tests, we need to add in the _more_tags():

feature_engine/feature_engine/transformation/arcsin.py

Lines 154 to 171 in cbd7ccc

    
           def _more_tags(self): 
        
               tags_dict = _return_tags() 
        
               tags_dict["variables"] = "numerical" 
        
               # =======  this tests fail because the transformers throw an error 
        
               # when the values are less than -1 or greater than 1. Nothing to do with the test itself but 
        
               # mostly with the data created and used in the test 
        
               msg = ( 
        
                   "transformers raise errors when data is outside [-1, 1] range, thus this" 
        
                   "check fails" 
        
               ) 
        
               tags_dict["_xfail_checks"]["check_estimators_dtypes"] = msg 
        
               tags_dict["_xfail_checks"]["check_estimators_fit_returns_self"] = msg 
        
               tags_dict["_xfail_checks"]["check_pipeline_consistency"] = msg 
        
               tags_dict["_xfail_checks"]["check_estimators_overwrite_params"] = msg 
        
               tags_dict["_xfail_checks"]["check_estimators_pickle"] = msg 
        
               tags_dict["_xfail_checks"]["check_transformer_general"] = msg 
        
               return tags_dict

All of these tests:

check_methods_subset_invariance
check_fit2d_1sample
check_fit2d_1feature
check_dict_unchanged
check_dont_overwrite_parameters
check_fit_check_is_fitted
check_n_features_in

To skip our own tests in this file:

feature_engine/tests/test_transformation/test_check_estimator_transformers.py

Lines 15 to 33 in cbd7ccc

    
           _estimators = [ 
        
               BoxCoxTransformer(), 
        
               LogTransformer(), 
        
               LogCpTransformer(), 
        
               PowerTransformer(), 
        
               ReciprocalTransformer(), 
        
               YeoJohnsonTransformer(), 
        
               ArcsinTransformer(), 
        
           ] 
        
           @pytest.mark.parametrize("estimator", _estimators) 
        
           def test_check_estimator_from_sklearn(estimator): 
        
               return check_estimator(estimator) 
        
           @pytest.mark.parametrize("estimator", _estimators[3:]) 
        
           def test_check_estimator_from_feature_engine(estimator): 
        
               return check_feature_engine_estimator(estimator)

We need to re-order the ArcsinTransformer right under the LogCpTransformer, and then in line 31 we need to change the range to [4:]

Also, I went ahead and checked the links relevant to the arcsin transformation, and the correct transformation is np.arcsin(np.sqrt(X[self.variables_])). You can find more info in this link and in this paper.. Also a demo here.

This transformation was designed for binomial distributions, and by including the root, it is now only possible in the space [0,1], so proportions.

Would you mind updating the code to accommodate this transformation?

Regarding tests: since we can't run our generic tests, it might be a good idea to actually keep the tests that you already drafted here (apologies, this was my mistake):

feature_engine/tests/test_transformation/test_arcsin_transformer.py

Lines 8 to 39 in cbd7ccc

    
           #def test_automatically_find_variables(df_vartypes): 
        
           #    # test case 1: automatically select variables 
        
           #    transformer = ArcsinTransformer(variables=None) 
        
           # 
        
           #    # test fails here because of the df_vartypes data that are passed raise ValueError  
        
           #    X = transformer.fit_transform(df_vartypes) 
        
           # 
        
           #    # expected output 
        
           #    transf_df = df_vartypes.copy() 
        
           #    transf_df["Age"] = [0.05, 0.047619, 0.0526316, 0.0555556] 
        
           #    transf_df["Marks"] = [1.0000, 0.200, 0.34637 , 0] 
        
           # 
        
           #    # test inverse_transform 
        
           #    Xit = transformer.inverse_transform(X) 
        
           # 
        
           #    # convert numbers to original format. 
        
           #    Xit["Age"] = Xit["Age"].round().astype("int64") 
        
           #    Xit["Marks"] = Xit["Marks"].round(1) 
        
           #def test_fit_raises_error_if_na_in_df(df_na): 
        
           #    # test case 2: when dataset contains na, fit method 
        
           #    with pytest.raises(ValueError): 
        
           #        transformer = ArcsinTransformer() 
        
           #        transformer.fit(df_na) 
        
           #def test_transform_raises_error_if_na_in_df(df_vartypes, df_na): 
        
           #    # test case 3: when dataset contains na, transform method 
        
           #    with pytest.raises(ValueError): 
        
           #        transformer = ArcsinTransformer() 
        
           ##        transformer.fit(df_vartypes) 
        
           #        transformer.transform(df_na[["Name", "City", "Age", "Marks", "dob"]])

Finally, for the demo the breast cancer dataset has many variables between 0 and 1 that have a skewed distribution which is fixed by the arcsin transformation. So I would apply the transformer to all of these variables:

vars_ = [
 'mean compactness',
 'mean concavity',
 'mean concave points',
 'mean fractal dimension',
 'smoothness error',
 'compactness error',
 'concavity error',
 'concave points error',
 'symmetry error',
 'fractal dimension error',
 'worst symmetry',
 'worst fractal dimension']

And make histograms for all the variables before and after the transformation with this command:

X[vars_].hist(figsize=(20,20))
plt.show()

Would you be happy to make those changes? Thanks a lot!

…ormer

…mitation

tomtom-95 · 2022-05-11T09:28:30Z

All done (I think). Some problem: both ArcsinTransformer.rst file in api_doc and user_guide do not pass black test (I do not know why). The other problem is when I build the documentation, there are a lot of warnings about file in .ipynb_checkpoints folder: what is this folder?

solegalli · 2022-05-11T09:54:18Z

Hi @tomtom-95

very well done. I am impressed!!

These are the errors:

stylechecks run-test: commands[0] | flake8 feature_engine tests
tests/test_transformation/test_arcsin_transformer.py:1:1: F401 'pandas as pd' imported but unused
tests/test_transformation/test_arcsin_transformer.py:7:1: E302 expected 2 blank lines, found 1
ERROR: InvocationError for command /home/circleci/project/.tox/unit_tests/bin/flake8 feature_engine tests (exited with code 1

If you execute flake8 tests in your command line, you should be able to see them.

In short, you need to remove an import and add a blank line :)

The .ipynb_checkpoints folder is where jupyter stores the last execution state of your notebooks. That should not matter. not sure why that is raising an error with the docs. But the tests have passed here, so I think its fine.

…ormer

…mitation

solegalli · 2022-05-31T07:36:11Z

Hi @tomtom-95

Thank you for the changes to the code base and apologies for the delay.

I made a PR to your repo, which you can find here with minor cosmetic tweaks to the user guide and docstrings and reorganizing the lines of code in the tests.

I also had to fix the failing tests (unrelated to this PR) and then rebase those changes into this PR, so hopefully, if you merge the PR in your repo, we would now have here the final version of the transformer with passing tests.

I added a link with more info on how to update your fork and rebase main to your branch, in case you want to see more clearly the changes I made without the files corresponding to the unrelated changes. If you don't want to or don't have time, just merging would update the code here. I do recommend doing the exercise of rebasing though, it is a great git learning experience, lol. I struggled quite a bit with that in the past.

If you have questions let me know.

Thanks for contributing to the project!

rebase main, change wording in user guide, reformat tests

tomtom-95 · 2022-06-01T17:35:43Z

Hi @solegalli

Sorry for the delay, I merged your PR on my main branch. Now I see that all the checks have passed here (hooray!!!).
I think we are really done but if there's more please tell me, I am really learning a lot.

solegalli · 2022-06-01T17:54:42Z

it's done!

thank you for you amazing contribution @tomtom-95

tomtom-95 added 7 commits April 19, 2022 10:00

adds arcsin transformer and updates documentation

c098fda

adds test for arcsin_transformer.py

138e86b

test arcsin transformer

fe0ee22

add ArcsinTranform documentation

afcf088

correction of little error on doc

5bdddb0

modifies index to add ArcsinTransformer

4158740

style correction

d33470c

solegalli reviewed May 7, 2022

View reviewed changes

tomtom-95 marked this pull request as draft May 7, 2022 19:14

Tommaso Pellegrino added 5 commits May 8, 2022 11:58

little modification

a5b2be4

corrects string formatting

603632b

modifies ArcsinTransformer.rst and adds images

112e7d6

adds images

1b719e0

modifies tests on arcsin

1f80ae5

solegalli mentioned this pull request May 10, 2022

[NEW TRANSFORMER] arcsin transformation for decimal data #434

Closed

Tommaso Pellegrino added 3 commits May 10, 2022 09:36

deletes test that can be run from another file

b38ee84

modifies dataset for arcsin demo

f11aa71

adds get_feature_names_out to the demo

cbd7ccc

solegalli reviewed May 11, 2022

View reviewed changes

Tommaso Pellegrino added 7 commits May 11, 2022 09:33

excludes tests on arcsin we can't run, includes sqrt in arcsin transf…

8d71de7

…ormer

reintegrates useful tests, deletes test we can't run for the range li…

106b1cd

…mitation

modifies demo, adds variables

9400f52

modifies range specification

979d099

modifies file with black/isort

e754c59

reformats file

93c03e8

reformats to pass black test (still doesn't)

ff720ae

Tommaso Pellegrino added 2 commits May 11, 2022 12:21

delete additional information part

0c5478a

change to pass flake8 test

0249e0b

tomtom-95 and others added 23 commits May 31, 2022 08:34

style correction

a737573

little modification

30e0390

corrects string formatting

82955a8

modifies ArcsinTransformer.rst and adds images

8780d40

adds images

0a489d0

modifies tests on arcsin

abcad9a

deletes test that can be run from another file

b6e1d16

modifies dataset for arcsin demo

8d4cab3

adds get_feature_names_out to the demo

7eb19ae

excludes tests on arcsin we can't run, includes sqrt in arcsin transf…

3d421c1

…ormer

reintegrates useful tests, deletes test we can't run for the range li…

11ce217

…mitation

modifies demo, adds variables

9f527ee

modifies range specification

daefac9

modifies file with black/isort

792606e

reformats file

8261d76

reformats to pass black test (still doesn't)

402e7ef

delete additional information part

a245c8b

change to pass flake8 test

ebf5bf1

adds test and modifies a little the docs

65d9354

expand user guide

3d828b5

change wording and exp implentation in class

dcddde1

reorganise tests, fix bugs in tests

2ae052c

Merge branch 'arcsin_transformation' into arcsin

8e0ddbe

Merge pull request #1 from feature-engine/arcsin

e52c99f

rebase main, change wording in user guide, reformat tests

solegalli changed the title ~~Arcsin transformation added and documentation updated~~ add arcsin transformer Jun 1, 2022

solegalli merged commit bdc047c into feature-engine:main Jun 1, 2022

tomtom-95 deleted the arcsin_transformation branch June 1, 2022 18:09

		train_t['LotArea'].hist(bins=50)


		.. image:: ../../images/lotareareciprocal.png

		You can find more details about the :class:`ArcsinTransformer()` here:


		- `Jupyter notebook <https://nbviewer.org/github/feature-engine/feature-engine-examples/blob/main/transformation/ReciprocalTransformer.ipynb>`_

		transformer.transform(df_neg)


		def test_non_fitted_error(df_vartypes):

	def _more_tags(self):
	tags_dict = _return_tags()
	tags_dict["variables"] = "numerical"
	# ======= this tests fail because the transformers throw an error
	# when the values are less than -1 or greater than 1. Nothing to do with the test itself but
	# mostly with the data created and used in the test
	msg = (
	"transformers raise errors when data is outside [-1, 1] range, thus this"
	"check fails"
	)
	tags_dict["_xfail_checks"]["check_estimators_dtypes"] = msg
	tags_dict["_xfail_checks"]["check_estimators_fit_returns_self"] = msg
	tags_dict["_xfail_checks"]["check_pipeline_consistency"] = msg
	tags_dict["_xfail_checks"]["check_estimators_overwrite_params"] = msg
	tags_dict["_xfail_checks"]["check_estimators_pickle"] = msg
	tags_dict["_xfail_checks"]["check_transformer_general"] = msg

	return tags_dict

	_estimators = [
	BoxCoxTransformer(),
	LogTransformer(),
	LogCpTransformer(),
	PowerTransformer(),
	ReciprocalTransformer(),
	YeoJohnsonTransformer(),
	ArcsinTransformer(),
	]


	@pytest.mark.parametrize("estimator", _estimators)
	def test_check_estimator_from_sklearn(estimator):
	return check_estimator(estimator)


	@pytest.mark.parametrize("estimator", _estimators[3:])
	def test_check_estimator_from_feature_engine(estimator):
	return check_feature_engine_estimator(estimator)

	#def test_automatically_find_variables(df_vartypes):
	# # test case 1: automatically select variables
	# transformer = ArcsinTransformer(variables=None)
	#
	# # test fails here because of the df_vartypes data that are passed raise ValueError
	# X = transformer.fit_transform(df_vartypes)
	#
	# # expected output
	# transf_df = df_vartypes.copy()
	# transf_df["Age"] = [0.05, 0.047619, 0.0526316, 0.0555556]
	# transf_df["Marks"] = [1.0000, 0.200, 0.34637 , 0]
	#
	# # test inverse_transform
	# Xit = transformer.inverse_transform(X)
	#
	# # convert numbers to original format.
	# Xit["Age"] = Xit["Age"].round().astype("int64")
	# Xit["Marks"] = Xit["Marks"].round(1)

	#def test_fit_raises_error_if_na_in_df(df_na):
	# # test case 2: when dataset contains na, fit method
	# with pytest.raises(ValueError):
	# transformer = ArcsinTransformer()
	# transformer.fit(df_na)


	#def test_transform_raises_error_if_na_in_df(df_vartypes, df_na):
	# # test case 3: when dataset contains na, transform method
	# with pytest.raises(ValueError):
	# transformer = ArcsinTransformer()
	## transformer.fit(df_vartypes)
	# transformer.transform(df_na[["Name", "City", "Age", "Marks", "dob"]])

Conversation

tomtom-95 commented May 7, 2022 • edited by solegalli Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

solegalli left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

solegalli May 10, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

solegalli left a comment

Choose a reason for hiding this comment

Uh oh!

tomtom-95 commented May 11, 2022

Uh oh!

solegalli commented May 11, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

solegalli commented May 31, 2022

Uh oh!

tomtom-95 commented Jun 1, 2022

Uh oh!

solegalli commented Jun 1, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

tomtom-95 commented May 7, 2022 •

edited by solegalli

Loading

solegalli left a comment •

edited

Loading

solegalli May 10, 2022 •

edited

Loading

solegalli commented May 11, 2022 •

edited

Loading