Handle IndexError in WikipediaDescUtils for wikitext #102
SaulLu merged 1 commit into bigscience-workshop:master
@SaulLu it could be that the key used for fetching paragraphs is not valid, which would result in a KeyError (which we've already handled). I haven't come across empty paragraphs yet. Can you share a few cases where you're getting these?
@shanyas10, thank you for your answer! I encountered this problem on the 1000-example toy dataset, specifically during entity extraction. I put the stack trace of the error in the issue. Currently, I don't have an easy way to know which examples caused this problem (I use
@SaulLu, I checked this, and it did seem to happen in a few cases while fetching entity descriptions. Thanks for the fix!
* master: (141 commits)
  build: bump nltk to 3.6.7 for security and performance (bigscience-workshop#130)
  build: bump nltk to 3.6.7 for security and performance (#5)
  Add fp16, multi-GPU training script (toy dataset) (bigscience-workshop#123)
  create dataset with html, timestamp, url, datasource, generation length and website description metadata and tittles, footers and headers from HTML (bigscience-workshop#119)
  remove `#SBATCH --gres=gpu:0 ` from `03_create_dataset.slurm` (bigscience-workshop#121)
  Add joint training slurm script (bigscience-workshop#111)
  Add features types for the metadata to extract and test multiprocessing (bigscience-workshop#118)
  feat: add a feature to choose where to extract metadata (bigscience-workshop#116)
  Use dateutil to parse date (bigscience-workshop#117)
  feat: change how the entity extraction process use ids (bigscience-workshop#115)
  add `path_or_url_flair_ner_model` in order to execute the entity extraction on a partition without internet (bigscience-workshop#106)
  delete old submodule
  delete ds_store
  style check style & quality imports
  handle IndexError for `wikipedia_desc_utils` (bigscience-workshop#102)
  handle the comment specific type not recognized by pyarrow (bigscience-workshop#83)
  quality check
  Change torch version + make it optional (bigscience-workshop#82)
  ...

# Conflicts:
#	bsmetadata/metadata_utils.py
Fix #101
During my last test, I noticed that the list of paragraphs can be empty (can you confirm that this could happen, @manandey and @shanyas10?)
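
For readers following the discussion, here is a minimal sketch of the kind of guard being described. It is not the actual `WikipediaDescUtils` code: the function name `first_paragraph` and the `paragraphs_by_key` mapping are hypothetical, chosen only to illustrate the two failure modes mentioned in this thread (an invalid key raising `KeyError`, and an empty list of paragraphs raising `IndexError`).

```python
from typing import Dict, List, Optional


def first_paragraph(paragraphs_by_key: Dict[str, List[str]], key: str) -> Optional[str]:
    """Return the first paragraph stored under `key`, or None if there is none.

    Hypothetical helper: the real lookup in the repository may be shaped differently.
    """
    try:
        return paragraphs_by_key[key][0]
    except KeyError:
        # The key used for fetching paragraphs is not valid.
        return None
    except IndexError:
        # The key is valid, but its list of paragraphs is empty.
        return None
```

With this shape, a caller can treat a `None` result as "no description available" and skip the entity instead of crashing partway through a dataset.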