Conversation
There was a problem hiding this comment.
Hi @tomtom-95
Thank you so much for such a complete contribution! This is looking really good.
I am not sure you were done with the code, but I went ahead and added a few comments and suggestions here and there.
This file runs a battery of tests on all transformers from the module. Many tests come from the scikit-learn library through scikit-learn's check_estimator function. And we have our own version for the tests that are common for all our transformers. I pointed those out in the code you committed. So in short, if we add your transformer to the lists in there, multiple tests will be performed over it (I am yet to write some docs for developers on this, apologies)
The tests that we designed for our transformers can be found here in case you want to have a look. I can help you troubleshoot if needed.
I would focus the tests more on the functionality of the class, we can hard code the values, or we can obtain the expected values by applying np.arcsin (to save some work) which is in essence used under the hood.
Thank you so much! and let me know if you need anything.
| train_t['LotArea'].hist(bins=50) | ||
|
|
||
|
|
||
| .. image:: ../../images/lotareareciprocal.png |
There was a problem hiding this comment.
this image probably needs updating with the output of the arcsintransformer?
you need to commit a copy of the image that you can probably obtain from executing this code in a jupyter notebook.
There was a problem hiding this comment.
I tried to execute the code in a Jupiter notebook but does not find the file 'houseprice.csv'. What am I missing?
There was a problem hiding this comment.
yes, at some point we need to migrate all our demos to datasets that are online and we can pick up directly from sklearn.
The house prices dataset is in kaggle, so you need to download it from here and rename it.
Having said this, I would prefer that we used the sklearn California Housing dataset which you can find and load as described here.
| You can find more details about the :class:`ArcsinTransformer()` here: | ||
|
|
||
|
|
||
| - `Jupyter notebook <https://nbviewer.org/github/feature-engine/feature-engine-examples/blob/main/transformation/ReciprocalTransformer.ipynb>`_ |
There was a problem hiding this comment.
we can commit a notebook to an example to this folder:
https://github.com/feature-engine/feature-engine-examples/tree/main/transformation
that would be the best.
Alternatively, we just remove the more details section completely. That would also be ok :)
| transformer.transform(df_neg) | ||
|
|
||
|
|
||
| def test_non_fitted_error(df_vartypes): |
There was a problem hiding this comment.
we have a common test for this: https://github.com/feature-engine/feature_engine/blob/main/tests/estimator_checks/non_fitted_error_checks.py
| transformer = ArcsinTransformer(variables=None) | ||
|
|
||
| # test fails here because of the df_vartypes data that are passed raise ValueError | ||
| X = transformer.fit_transform(df_vartypes) |
| .. code:: python | ||
|
|
||
| # set up the variable transformer | ||
| tf = vt.ArcsinTransformer(variables = ['LotArea', 'GrLivArea']) |
There was a problem hiding this comment.
are these variables between -1 and 1? I don't remember.
For sure, the California housing dataset from sklearn has a decimal variable we can use for the demo, which also gets a nice gaussian distribution after the arcsin, so it is kind of the perfect example 📦
I normally run the code I write in these examples in a jupyter notebook, and then copy and paste the output here.
There was a problem hiding this comment.
# Load dataset
cal_housing = fetch_california_housing()
X = pd.DataFrame(cal_housing.data, columns=cal_housing.feature_names)
# divide by 50 to obtain values between -1 and +1
X['Latitude'] = X['Latitude'] / 50
y = cal_housing.targetI loaded the California housing dataset but I did not find a decimal variable, so I just take one of the column and divide all his value by 50 to make sure to have values between -1 and +1
There was a problem hiding this comment.
hmm, that is not ideal, we'd like to make demos as "real" as possible. Would you be able to check if the breast cancer dataset has some suitable variables?
Alternatively, the kdd99 dataset, which variable description is here has some % variables (see bottom table), which if not in decimal scale already, we could divide by 100 and that would make more sense.
Worse case scenario, the iris dataset's variables are in cm, if we divide by 10, they would be decimal and it is still a valid measure.
The idea of the demo is to show that the variable originally was skewed, and after the transformation it is less skewed. So we would need to play around with the datasets, to see which one fits best. Thank you so much!!!
There was a problem hiding this comment.
ok I modified the demo with a variable from the breast cancer dataset
|
|
||
|
|
||
| .. image:: ../../images/lotareareciprocal.png | ||
|
|
There was a problem hiding this comment.
maybe we could also demo the get_feature_names_out method that comes from the parent class
There was a problem hiding this comment.
ok I add the method to the demo
solegalli
left a comment
There was a problem hiding this comment.
Hi @tomtom-95
Thank you very much for the code changes!
I went ahead and did some checks:
Unfortunately, because this transformer can only work with variables in the -1 and 1 value range, there are a bunch of tests that we can't run. Both from scikit-learn and our own tests.
So to ignore scikit-learn's tests, we need to add in the _more_tags():
feature_engine/feature_engine/transformation/arcsin.py
Lines 154 to 171 in cbd7ccc
All of these tests:
check_methods_subset_invariance
check_fit2d_1sample
check_fit2d_1feature
check_dict_unchanged
check_dont_overwrite_parameters
check_fit_check_is_fitted
check_n_features_in
To skip our own tests in this file:
We need to re-order the ArcsinTransformer right under the LogCpTransformer, and then in line 31 we need to change the range to [4:]
Also, I went ahead and checked the links relevant to the arcsin transformation, and the correct transformation is np.arcsin(np.sqrt(X[self.variables_])). You can find more info in this link and in this paper.. Also a demo here.
This transformation was designed for binomial distributions, and by including the root, it is now only possible in the space [0,1], so proportions.
Would you mind updating the code to accommodate this transformation?
Regarding tests: since we can't run our generic tests, it might be a good idea to actually keep the tests that you already drafted here (apologies, this was my mistake):
Finally, for the demo the breast cancer dataset has many variables between 0 and 1 that have a skewed distribution which is fixed by the arcsin transformation. So I would apply the transformer to all of these variables:
vars_ = [
'mean compactness',
'mean concavity',
'mean concave points',
'mean fractal dimension',
'smoothness error',
'compactness error',
'concavity error',
'concave points error',
'symmetry error',
'fractal dimension error',
'worst symmetry',
'worst fractal dimension']
And make histograms for all the variables before and after the transformation with this command:
X[vars_].hist(figsize=(20,20))
plt.show()
Would you be happy to make those changes? Thanks a lot!
|
All done (I think). Some problem: both ArcsinTransformer.rst file in api_doc and user_guide do not pass black test (I do not know why). The other problem is when I build the documentation, there are a lot of warnings about file in .ipynb_checkpoints folder: what is this folder? |
|
Hi @tomtom-95 very well done. I am impressed!! These are the errors: If you execute In short, you need to remove an import and add a blank line :) The .ipynb_checkpoints folder is where jupyter stores the last execution state of your notebooks. That should not matter. not sure why that is raising an error with the docs. But the tests have passed here, so I think its fine. |
|
Hi @tomtom-95 Thank you for the changes to the code base and apologies for the delay. I made a PR to your repo, which you can find here with minor cosmetic tweaks to the user guide and docstrings and reorganizing the lines of code in the tests. I also had to fix the failing tests (unrelated to this PR) and then rebase those changes into this PR, so hopefully, if you merge the PR in your repo, we would now have here the final version of the transformer with passing tests. I added a link with more info on how to update your fork and rebase main to your branch, in case you want to see more clearly the changes I made without the files corresponding to the unrelated changes. If you don't want to or don't have time, just merging would update the code here. I do recommend doing the exercise of rebasing though, it is a great git learning experience, lol. I struggled quite a bit with that in the past. If you have questions let me know. Thanks for contributing to the project! |
rebase main, change wording in user guide, reformat tests
|
Hi @solegalli Sorry for the delay, I merged your PR on my main branch. Now I see that all the checks have passed here (hooray!!!). |
|
it's done! thank you for you amazing contribution @tomtom-95 |
closes #434