
Add features types for the metadata to extract and test multiprocessing #118

Merged
SaulLu merged 7 commits into bigscience-workshop:master from SaulLu:improve-tests on Dec 23, 2021

Conversation

@SaulLu (Collaborator) commented Dec 23, 2021

This PR adds variables that store the features type of the metadata extracted by each processor, and the tests are modified to exercise multiprocessing.

@SaulLu SaulLu marked this pull request as draft December 23, 2021 12:31
@changjonathanc (Collaborator) left a comment:

What's the purpose of adding value type? Adding type-checking to the tests, I guess?

BTW, is it more memory efficient? I can't tell from the doc.

features (Optional[datasets.Features], default None) – Use a specific Features to store the cache file instead of the automatically generated one.

@SaulLu SaulLu requested review from chkla and manandey December 23, 2021 12:57
@SaulLu (Collaborator, Author) commented Dec 23, 2021

@cccntu

What's the purpose of adding value type? Adding type-checking to the tests, I guess?

The purpose is to use the multiprocessing feature (num_proc): without specifying the features type, the schemas inferred on different processes can disagree, and we can end up with errors when the program reconciles the datasets computed on those processes ☺️
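The failure mode SaulLu describes can be illustrated with a toy sketch (plain Python, not the datasets API; `html_title` and `infer_type` are hypothetical names for illustration): when each worker process infers the schema of its own shard, a shard whose column is all null gets a different inferred type than a shard with real values, so the per-process results cannot be reconciled. Declaring the type up front, as this PR does with features types, removes the ambiguity.

```python
def infer_type(column):
    """Naive per-shard type inference: the first non-null value wins."""
    non_null = [v for v in column if v is not None]
    return type(non_null[0]).__name__ if non_null else "null"

# Two shards processed by different workers (num_proc=2, conceptually).
shard_a = {"html_title": ["Foo", "Bar"]}
shard_b = {"html_title": [None, None]}  # this metadata is absent in shard b

inferred_a = infer_type(shard_a["html_title"])  # "str"
inferred_b = infer_type(shard_b["html_title"])  # "null"

# The independently inferred schemas disagree, so merging the shards fails.
assert inferred_a != inferred_b

# Declaring the schema up front makes every worker produce the same schema
# regardless of the data it happens to see.
declared_schema = {"html_title": "str"}
```

With a declared schema, reconciliation is a no-op: every worker's output already agrees.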

BTW, is it more memory efficient? I can't tell from the doc.

Unfortunately, I don't know either 🙂

@SaulLu SaulLu marked this pull request as ready for review December 23, 2021 13:46
@manandey (Member) commented:

LGTM, thanks @SaulLu!

@timoschick (Contributor) left a comment:

LGTM

@@ -0,0 +1,20 @@
import importlib.util
@timoschick (Contributor) commented on the diff:

I don't fully understand why we need all of this logic. Wouldn't it make more sense to just directly add datasets to the project requirements and replace

if _datasets_available:
    from datasets import Value

with from datasets import Value?

@SaulLu (Collaborator, Author) replied:

I just thought that if we wanted to use this part without datasets, it wouldn't block (since, in the end, datasets is not the only lib available). But if it confuses you, I can remove it. 😊

@SaulLu (Collaborator, Author) replied:

datasets is already in the requirements.txt

@SaulLu (Collaborator, Author) replied:

@timoschick, since this feature is not nearly as important as the rest, I removed it in my last commit 🙂

@SaulLu SaulLu merged commit 1292b4a into bigscience-workshop:master Dec 23, 2021
tianjianjiang added a commit to tianjianjiang/bigscience-metadata that referenced this pull request Jan 21, 2022
* master: (141 commits)
  build: bump nltk to 3.6.7 for security and performance (bigscience-workshop#130)
  build: bump nltk to 3.6.7 for security and performance (#5)
  Add fp16, multi-GPU training script (toy dataset) (bigscience-workshop#123)
  create dataset with html, timestamp, url, datasource, generation length and website description metadata and tittles, footers and headers from HTML (bigscience-workshop#119)
  remove `#SBATCH --gres=gpu:0 ` from `03_create_dataset.slurm` (bigscience-workshop#121)
  Add joint training slurm script (bigscience-workshop#111)
  Add features types for the metadata to extract and test multiprocessing (bigscience-workshop#118)
  feat: add a feature to choose where to extract metadata (bigscience-workshop#116)
  Use dateutil to parse date (bigscience-workshop#117)
  feat: change how the entity extraction process use ids (bigscience-workshop#115)
  add `path_or_url_flair_ner_model` in order to execute the entity extraction on a partition without internet (bigscience-workshop#106)
  delete old submodule
  delete ds_store
  style check
  style & quality
  imports
  handle IndexError for `wikipedia_desc_utils` (bigscience-workshop#102)
  handle the comment specific type not recognized by pyarrow (bigscience-workshop#83)
  quality check
  Change torch version + make it optional (bigscience-workshop#82)
  ...

# Conflicts:
#	bsmetadata/metadata_utils.py


5 participants