Add features types for the metadata to extract and test multiprocessing#118
SaulLu merged 7 commits into bigscience-workshop:master from
Conversation
changjonathanc
left a comment
What's the purpose of adding value type? Adding type-checking to the tests, I guess?
BTW, is it more memory efficient? I can't tell from the doc.
features (Optional[datasets.Features], default None) – Use a specific Features to store the cache file instead of the automatically generated one.
The purpose is to be able to use the multiprocessing feature (num_proc): without specifying the features type, we can end up with errors when the program reconciles the datasets computed in the different processes.
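A minimal sketch of the idea (not the PR's actual code): declare the schema explicitly and pass it to `map` so that every worker process writes an identical Arrow schema, leaving nothing to reconcile when the shards are concatenated. The column names here are hypothetical, and the import is guarded so the snippet degrades gracefully if `datasets` is not installed.

```python
import importlib.util

# Hypothetical metadata columns and their dtypes.
COLUMN_TYPES = {"text": "string", "url": "string"}

def build_features():
    """Return a datasets.Features object, or None if `datasets` is absent."""
    if importlib.util.find_spec("datasets") is None:
        return None
    from datasets import Features, Value
    # Same explicit schema for every worker process.
    return Features({name: Value(dtype) for name, dtype in COLUMN_TYPES.items()})

# Usage (assuming `dataset` and `extract_metadata` exist):
#   dataset = dataset.map(extract_metadata, num_proc=4, features=build_features())
```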
Unfortunately, I don't know either 🙂
LGTM, thanks @SaulLu!
bsmetadata/utils.py (Outdated)
@@ -0,0 +1,20 @@
import importlib.util
I don't fully understand why we need all of this logic. Wouldn't it make more sense to just directly add datasets to the project requirements and replace
if _datasets_available:
    from datasets import Value
with from datasets import Value?
I just thought that if we wanted to use this part without datasets it wouldn't block (because in the end datasets is not the only lib available). But if it confuses you, I can remove it. 😊
datasets is already in the requirements.txt
@timoschick, as for me, this feature is not nearly as important as the rest, so I removed it in my last commit 🙂
* master: (141 commits)
  build: bump nltk to 3.6.7 for security and performance (bigscience-workshop#130)
  build: bump nltk to 3.6.7 for security and performance (#5)
  Add fp16, multi-GPU training script (toy dataset) (bigscience-workshop#123)
  create dataset with html, timestamp, url, datasource, generation length and website description metadata and tittles, footers and headers from HTML (bigscience-workshop#119)
  remove `#SBATCH --gres=gpu:0 ` from `03_create_dataset.slurm` (bigscience-workshop#121)
  Add joint training slurm script (bigscience-workshop#111)
  Add features types for the metadata to extract and test multiprocessing (bigscience-workshop#118)
  feat: add a feature to choose where to extract metadata (bigscience-workshop#116)
  Use dateutil to parse date (bigscience-workshop#117)
  feat: change how the entity extraction process use ids (bigscience-workshop#115)
  add `path_or_url_flair_ner_model` in order to execute the entity extraction on a partition without internet (bigscience-workshop#106)
  delete old submodule
  delete ds_store
  style check
  style & quality imports
  handle IndexError for `wikipedia_desc_utils` (bigscience-workshop#102)
  handle the comment specific type not recognized by pyarrow (bigscience-workshop#83)
  quality check
  Change torch version + make it optional (bigscience-workshop#82)
  ...

# Conflicts:
#	bsmetadata/metadata_utils.py
This PR adds variables that store the feature types of the metadata extracted by each processor, and the tests are modified to exercise multiprocessing.