Create dataset with HTML, timestamp, URL, datasource, generation length and website description metadata, and titles, footers and headers from HTML #119
Conversation
```python
additional_columns = {}
if self.tags_sub_tree_to_isolate:
    try:
        root = fromstring(html_str)
    except ValueError:
        # this is not valid HTML (it probably begins with <?xml version="1.0")
        logger.warning("This example wasn't parsed: invalid HTML")
        return "", [], {}
    for tag in self.tags_sub_tree_to_isolate:
        find = etree.XPath(f"//{tag}")
        matches = find(root)
        html_extracted = [
            etree.tostring(match, encoding="UTF-8", pretty_print=False).decode("UTF-8") for match in matches
        ]
        additional_columns[tag] = html_extracted
```
This addition allows us to keep the head and footer sections of the HTML file in additional columns alongside the text we parse.
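For context, here is a minimal, self-contained sketch of this kind of sub-tree isolation with lxml; the standalone function, its arguments, and the sample HTML are illustrative, not the actual preprocessor API:

```python
from lxml import etree
from lxml.html import fromstring


def isolate_sub_trees(html_str, tags_to_isolate):
    """Return a mapping tag -> list of serialized sub-trees found in the document."""
    additional_columns = {}
    root = fromstring(html_str)
    for tag in tags_to_isolate:
        matches = etree.XPath(f"//{tag}")(root)
        additional_columns[tag] = [
            etree.tostring(match, encoding="UTF-8", pretty_print=False).decode("UTF-8")
            for match in matches
        ]
    return additional_columns


html = "<html><head><title>Example</title></head><body><p>text</p><footer>Footer text</footer></body></html>"
print(isolate_sub_trees(html, ["head", "footer"]))
# e.g. {'head': ['<head><title>Example</title></head>'], 'footer': ['<footer>Footer text</footer>']}
```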
```python
try:
    root = fromstring(html_str)
except ValueError:
    # this is not valid HTML (it probably begins with <?xml version="1.0")
    logger.warning("This example wasn't parsed: invalid HTML")
    return "", [], {}
find = etree.XPath(f"//{self.start_parsing_at_tag}")
matches = find(root)
if len(matches) == 0:
    logger.warning(
```
These additions avoid two types of errors encountered:
- a file that the `lxml` library cannot parse as HTML
- the case where the HTML does not contain a "body" tag
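For illustration, a hedged sketch of how both guards fit together; the function name, the fallback return values, and the second warning message (truncated in the diff above) are assumptions:

```python
import logging

from lxml import etree
from lxml.html import fromstring

logger = logging.getLogger(__name__)


def parse_starting_at(html_str, start_parsing_at_tag="body"):
    # Guard 1: lxml cannot parse the string at all, e.g. a unicode string that
    # starts with an <?xml version="1.0" encoding="..."?> declaration.
    try:
        root = fromstring(html_str)
    except ValueError:
        logger.warning("This example wasn't parsed: invalid HTML")
        return "", [], {}

    # Guard 2: the document has no tag to start parsing at (e.g. no <body>).
    matches = etree.XPath(f"//{start_parsing_at_tag}")(root)
    if len(matches) == 0:
        logger.warning("No <%s> tag found, skipping this example", start_parsing_at_tag)
        return "", [], {}

    text = etree.tostring(matches[0], encoding="UTF-8").decode("UTF-8")
    return text, matches, {}
```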
```python
# some tags have a value of type `cython_function_or_method` which is not supported by pyarrow
"value": str(metadata.value.tag) if type(metadata.value.tag) == type(Comment) else metadata.value.tag,
```
This addition avoids another type of error when working with the `datasets` library, which (through pyarrow) does not accept Cython function objects.
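For context, a short sketch of why the cast is needed: in lxml, the `.tag` of a comment node is the `Comment` factory itself (a `cython_function_or_method`) rather than a string, and pyarrow cannot store that object; the sample HTML is illustrative:

```python
from lxml.etree import Comment
from lxml.html import fromstring

root = fromstring("<html><body><!-- a comment --><p>text</p></body></html>")

for el in root.iter():
    tag = el.tag
    # Comment nodes expose the Comment factory function as their tag; convert it to a
    # string so the value can be stored in a pyarrow-backed `datasets` column.
    value = str(tag) if type(tag) == type(Comment) else tag
    print(value)
```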
```python
        return examples


class ErrorWrapperPreprocessor:
```
This class is useful for mass extraction: if one example raises an error, it does not prevent the processing of the other examples, and it lets us know the type of error raised on that particular example (which we would not see otherwise).
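A minimal sketch of what such a wrapper could look like; the class name comes from the diff, but the constructor arguments, the per-example batching, and the `error_type` column used here are assumptions rather than the actual implementation:

```python
import logging

logger = logging.getLogger(__name__)


class ErrorWrapperPreprocessor:
    """Run a preprocessor example by example, recording errors instead of aborting the batch."""

    def __init__(self, preprocessor, output_keys):
        self.preprocessor = preprocessor
        # mapping of output column -> default value to use when an example fails
        self.output_keys = output_keys

    def preprocess(self, examples):
        input_keys = list(examples.keys())
        num_examples = len(examples[input_keys[0]])
        error_types = []
        for idx in range(num_examples):
            # build a single-example batch so that one failure cannot poison the others
            single = {key: [examples[key][idx]] for key in input_keys}
            try:
                processed = self.preprocessor.preprocess(single)
                for key in self.output_keys:
                    examples.setdefault(key, [None] * num_examples)[idx] = processed[key][0]
                error_types.append("")
            except Exception as exc:
                # keep going, but remember which error this particular example raised
                logger.warning("Example %d failed with %s: %s", idx, type(exc).__name__, exc)
                for key, default in self.output_keys.items():
                    examples.setdefault(key, [None] * num_examples)[idx] = default
                error_types.append(type(exc).__name__)
        examples["error_type"] = error_types
        return examples
```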
```
@@ -0,0 +1,38 @@
1. Download `wiki_en_dump.db`
```
This file contains instructions for downloading everything we need on jz
```
* master: (141 commits)
  build: bump nltk to 3.6.7 for security and performance (bigscience-workshop#130)
  build: bump nltk to 3.6.7 for security and performance (#5)
  Add fp16, multi-GPU training script (toy dataset) (bigscience-workshop#123)
  create dataset with html, timestamp, url, datasource, generation length and website description metadata and tittles, footers and headers from HTML (bigscience-workshop#119)
  remove `#SBATCH --gres=gpu:0 ` from `03_create_dataset.slurm` (bigscience-workshop#121)
  Add joint training slurm script (bigscience-workshop#111)
  Add features types for the metadata to extract and test multiprocessing (bigscience-workshop#118)
  feat: add a feature to choose where to extract metadata (bigscience-workshop#116)
  Use dateutil to parse date (bigscience-workshop#117)
  feat: change how the entity extraction process use ids (bigscience-workshop#115)
  add `path_or_url_flair_ner_model` in order to execute the entity extraction on a partition without internet (bigscience-workshop#106)
  delete old submodule
  delete ds_store
  style check
  style & quality
  imports
  handle IndexError for `wikipedia_desc_utils` (bigscience-workshop#102)
  handle the comment specific type not recognized by pyarrow (bigscience-workshop#83)
  quality check
  Change torch version + make it optional (bigscience-workshop#82)
  ...

# Conflicts:
#	bsmetadata/metadata_utils.py
```
This PR includes:
What is missing in this PR: