Create dataset with HTML, timestamp, URL, datasource, generation length and website description metadata, and titles, footers and headers from HTML #119
Conversation
```python
additional_columns = {}
if self.tags_sub_tree_to_isolate:
    try:
        root = fromstring(html_str)
    except ValueError:
        # this is not valid HTML (it probably begins with <?xml version="1.0")
        logger.warning("This example wasn't parsed: invalid HTML")
        return "", [], {}
    for tag in self.tags_sub_tree_to_isolate:
        find = etree.XPath(f"//{tag}")
        matches = find(root)
        html_extracted = [
            etree.tostring(match, encoding="UTF-8", pretty_print=False).decode("UTF-8") for match in matches
        ]
        additional_columns[tag] = html_extracted
```
This addition allows us to keep the head and footer sections of the HTML file in additional columns alongside the text we parse.
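For context, here is a minimal, self-contained sketch of this kind of sub-tree isolation with lxml; the standalone function, its arguments, and the sample HTML are illustrative, not the actual preprocessor API:

```python
from lxml import etree
from lxml.html import fromstring


def isolate_sub_trees(html_str, tags_to_isolate):
    """Return a mapping tag -> list of serialized sub-trees found in the document."""
    additional_columns = {}
    root = fromstring(html_str)
    for tag in tags_to_isolate:
        matches = etree.XPath(f"//{tag}")(root)
        additional_columns[tag] = [
            etree.tostring(match, encoding="UTF-8", pretty_print=False).decode("UTF-8")
            for match in matches
        ]
    return additional_columns


html = "<html><head><title>Example</title></head><body><p>text</p><footer>Footer text</footer></body></html>"
print(isolate_sub_trees(html, ["head", "footer"]))
# e.g. {'head': ['<head><title>Example</title></head>'], 'footer': ['<footer>Footer text</footer>']}
```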
```python
try:
    root = fromstring(html_str)
except ValueError:
    # this is not valid HTML (it probably begins with <?xml version="1.0")
    logger.warning("This example wasn't parsed: invalid HTML")
    return "", [], {}
find = etree.XPath(f"//{self.start_parsing_at_tag}")
matches = find(root)
if len(matches) == 0:
    logger.warning(
```
These additions avoid two types of errors encountered:
- a file that the `lxml` library cannot parse as HTML
- the case where the HTML does not contain a "body" tag
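For illustration, a hedged sketch of how both guards fit together; the function name, the fallback return values, and the second warning message (truncated in the diff above) are assumptions:

```python
import logging

from lxml import etree
from lxml.html import fromstring

logger = logging.getLogger(__name__)


def parse_starting_at(html_str, start_parsing_at_tag="body"):
    # Guard 1: lxml cannot parse the string at all, e.g. a unicode string that
    # starts with an <?xml version="1.0" encoding="..."?> declaration.
    try:
        root = fromstring(html_str)
    except ValueError:
        logger.warning("This example wasn't parsed: invalid HTML")
        return "", [], {}

    # Guard 2: the document has no tag to start parsing at (e.g. no <body>).
    matches = etree.XPath(f"//{start_parsing_at_tag}")(root)
    if len(matches) == 0:
        logger.warning("No <%s> tag found, skipping this example", start_parsing_at_tag)
        return "", [], {}

    text = etree.tostring(matches[0], encoding="UTF-8").decode("UTF-8")
    return text, matches, {}
```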
```python
# some tags have a value of type `cython_function_or_method` which is not supported by pyarrow
"value": str(metadata.value.tag) if type(metadata.value.tag) == type(Comment) else metadata.value.tag,
```
This addition avoids another type of error when working with the `datasets` library, which (through pyarrow) does not accept Cython function objects.
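For context, a short sketch of why the cast is needed: in lxml, the `.tag` of a comment node is the `Comment` factory itself (a `cython_function_or_method`) rather than a string, and pyarrow cannot store that object; the sample HTML is illustrative:

```python
from lxml.etree import Comment
from lxml.html import fromstring

root = fromstring("<html><body><!-- a comment --><p>text</p></body></html>")

for el in root.iter():
    tag = el.tag
    # Comment nodes expose the Comment factory function as their tag; convert it to a
    # string so the value can be stored in a pyarrow-backed `datasets` column.
    value = str(tag) if type(tag) == type(Comment) else tag
    print(value)
```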
```python
        return examples


class ErrorWrapperPreprocessor:
```
This class is useful for mass extraction: if one example raises an error, it does not prevent the processing of the other examples, and it lets us know the type of error raised on that particular example (which we would not see otherwise).
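A minimal sketch of what such a wrapper could look like; the class name comes from the diff, but the constructor arguments, the per-example batching, and the `error_type` column used here are assumptions rather than the actual implementation:

```python
import logging

logger = logging.getLogger(__name__)


class ErrorWrapperPreprocessor:
    """Run a preprocessor example by example, recording errors instead of aborting the batch."""

    def __init__(self, preprocessor, output_keys):
        self.preprocessor = preprocessor
        # mapping of output column -> default value to use when an example fails
        self.output_keys = output_keys

    def preprocess(self, examples):
        input_keys = list(examples.keys())
        num_examples = len(examples[input_keys[0]])
        error_types = []
        for idx in range(num_examples):
            # build a single-example batch so that one failure cannot poison the others
            single = {key: [examples[key][idx]] for key in input_keys}
            try:
                processed = self.preprocessor.preprocess(single)
                for key in self.output_keys:
                    examples.setdefault(key, [None] * num_examples)[idx] = processed[key][0]
                error_types.append("")
            except Exception as exc:
                # keep going, but remember which error this particular example raised
                logger.warning("Example %d failed with %s: %s", idx, type(exc).__name__, exc)
                for key, default in self.output_keys.items():
                    examples.setdefault(key, [None] * num_examples)[idx] = default
                error_types.append(type(exc).__name__)
        examples["error_type"] = error_types
        return examples
```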
```
@@ -0,0 +1,38 @@
1. Download `wiki_en_dump.db`
```
This file contains instructions for downloading everything we need on jz
```
* master: (141 commits)
  build: bump nltk to 3.6.7 for security and performance (bigscience-workshop#130)
  build: bump nltk to 3.6.7 for security and performance (#5)
  Add fp16, multi-GPU training script (toy dataset) (bigscience-workshop#123)
  create dataset with html, timestamp, url, datasource, generation length and website description metadata and tittles, footers and headers from HTML (bigscience-workshop#119)
  remove `#SBATCH --gres=gpu:0 ` from `03_create_dataset.slurm` (bigscience-workshop#121)
  Add joint training slurm script (bigscience-workshop#111)
  Add features types for the metadata to extract and test multiprocessing (bigscience-workshop#118)
  feat: add a feature to choose where to extract metadata (bigscience-workshop#116)
  Use dateutil to parse date (bigscience-workshop#117)
  feat: change how the entity extraction process use ids (bigscience-workshop#115)
  add `path_or_url_flair_ner_model` in order to execute the entity extraction on a partition without internet (bigscience-workshop#106)
  delete old submodule
  delete ds_store
  style check
  style & quality
  imports
  handle IndexError for `wikipedia_desc_utils` (bigscience-workshop#102)
  handle the comment specific type not recognized by pyarrow (bigscience-workshop#83)
  quality check
  Change torch version + make it optional (bigscience-workshop#82)
  ...

# Conflicts:
#	bsmetadata/metadata_utils.py
```
This PR includes:
What is missing in this PR: