
Add features types for the metadata to extract and test multiprocessing #118

Merged
SaulLu merged 7 commits into bigscience-workshop:master from SaulLu:improve-tests on Dec 23, 2021

Conversation

@SaulLu (Collaborator) commented Dec 23, 2021

This PR adds variables that store the features type of the metadata extracted by each processor, and the tests are modified to exercise multiprocessing.

@SaulLu SaulLu marked this pull request as draft December 23, 2021 12:31
@changjonathanc (Collaborator) left a comment:

What's the purpose of adding value type? Adding type-checking to the tests, I guess?

BTW, is it more memory efficient? I can't tell from the doc.

features (Optional[datasets.Features], default None) – Use a specific Features to store the cache file instead of the automatically generated one.

@SaulLu SaulLu requested review from chkla and manandey December 23, 2021 12:57
@SaulLu (Collaborator, Author) commented Dec 23, 2021

@cccntu

What's the purpose of adding value type? Adding type-checking to the tests, I guess?

The purpose is to use the multiprocessing feature (num_proc): without specifying the features type, the schemas inferred on different processes can disagree, and we can end up with errors when the program reconciles the datasets computed on those processes ☺️
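The failure mode SaulLu describes can be illustrated with a toy sketch (plain Python, not the datasets API; `html_title` and `infer_type` are hypothetical names for illustration): when each worker process infers the schema of its own shard, a shard whose column is all null gets a different inferred type than a shard with real values, so the per-process results cannot be reconciled. Declaring the type up front, as this PR does with features types, removes the ambiguity.

```python
def infer_type(column):
    """Naive per-shard type inference: the first non-null value wins."""
    non_null = [v for v in column if v is not None]
    return type(non_null[0]).__name__ if non_null else "null"

# Two shards processed by different workers (num_proc=2, conceptually).
shard_a = {"html_title": ["Foo", "Bar"]}
shard_b = {"html_title": [None, None]}  # this metadata is absent in shard b

inferred_a = infer_type(shard_a["html_title"])  # "str"
inferred_b = infer_type(shard_b["html_title"])  # "null"

# The independently inferred schemas disagree, so merging the shards fails.
assert inferred_a != inferred_b

# Declaring the schema up front makes every worker produce the same schema
# regardless of the data it happens to see.
declared_schema = {"html_title": "str"}
```

With a declared schema, reconciliation is a no-op: every worker's output already agrees.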

BTW, is it more memory efficient? I can't tell from the doc.

Unfortunately, I don't know either 🙂

@SaulLu SaulLu marked this pull request as ready for review December 23, 2021 13:46
@manandey (Member) commented:

LGTM, thanks @SaulLu!

@timoschick (Contributor) left a comment:

LGTM

@@ -0,0 +1,20 @@
import importlib.util
@timoschick (Contributor) commented on the diff:

I don't fully understand why we need all of this logic. Wouldn't it make more sense to just directly add datasets to the project requirements and replace

if _datasets_available:
    from datasets import Value

with from datasets import Value?

@SaulLu (Collaborator, Author) replied:

I just thought that if we wanted to use this part without datasets, it wouldn't block (since, in the end, datasets is not the only lib available). But if it confuses you, I can remove it. 😊

@SaulLu (Collaborator, Author) replied:

datasets is already in the requirements.txt

@SaulLu (Collaborator, Author) replied:

@timoschick, since this feature is not nearly as important as the rest, I removed it in my last commit 🙂

@SaulLu SaulLu merged commit 1292b4a into bigscience-workshop:master Dec 23, 2021
tianjianjiang added a commit to tianjianjiang/bigscience-metadata that referenced this pull request Jan 21, 2022
* master: (141 commits)
  build: bump nltk to 3.6.7 for security and performance (bigscience-workshop#130)
  build: bump nltk to 3.6.7 for security and performance (#5)
  Add fp16, multi-GPU training script (toy dataset) (bigscience-workshop#123)
  create dataset with html, timestamp, url, datasource, generation length and website description metadata and tittles, footers and headers from HTML (bigscience-workshop#119)
  remove `#SBATCH --gres=gpu:0 ` from `03_create_dataset.slurm` (bigscience-workshop#121)
  Add joint training slurm script (bigscience-workshop#111)
  Add features types for the metadata to extract and test multiprocessing (bigscience-workshop#118)
  feat: add a feature to choose where to extract metadata (bigscience-workshop#116)
  Use dateutil to parse date (bigscience-workshop#117)
  feat: change how the entity extraction process use ids (bigscience-workshop#115)
  add `path_or_url_flair_ner_model` in order to execute the entity extraction on a partition without internet (bigscience-workshop#106)
  delete old submodule
  delete ds_store
  style check
  style & quality
  imports
  handle IndexError for `wikipedia_desc_utils` (bigscience-workshop#102)
  handle the comment specific type not recognized by pyarrow (bigscience-workshop#83)
  quality check
  Change torch version + make it optional (bigscience-workshop#82)
  ...

# Conflicts:
#	bsmetadata/metadata_utils.py


5 participants