Adding WebsiteMetadataProcessor to preprocessing_utils #49
shanyas10 merged 33 commits into bigscience-workshop:master from
Conversation
timoschick left a comment
Hi @shanyas10, thanks for taking the time to write this. This is exactly what I'd expect a MetadataProcessor for websites to look like 👍
I've added a couple of comments on how I think the code might further be improved.
bsmetadata/preprocessing_utils.py (Outdated)

```python
"""Metadata preprocessor for adding website description based on URLs."""

website_description_cache = {}
org_list = ["com", "co", "org", "go", "in"]
```
What's the reason for using this specific set of top-level domains?
bsmetadata/preprocessing_utils.py (Outdated)

```python
website_description = self._extract_website_desc_from_url(urls[0])

if website_description:
    metadata.append({"key": "timestamp", "type": "global", "value": website_description})
```
This should probably be something like `"key": "website_description"`.
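A minimal sketch of the corrected line, following the comment above (reusing the variables from the diff):

```python
# Use a key that actually describes this metadata entry.
metadata.append({"key": "website_description", "type": "global", "value": website_description})
```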
bsmetadata/preprocessing_utils.py (Outdated)

```python
def _extract_website_desc_from_url(self, url: str) -> Optional:
    domain = url.split("/")[2]  # e.g. http://www.californialandcan.org/Plumas -> www.californialandcan.org
```
This would fail for URLs that don't start with http://. I guess all URLs in C4 start with http://, but it would probably be good to be on the safe side here. Also, you may want to consider using some library like urllib (see https://docs.python.org/3/library/urllib.parse.html) for splitting URLs into components as this will take care of all unexpected edge cases for you.
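As a rough illustration of that suggestion (the helper name `extract_domain` is made up for this sketch, not from the PR), `urlparse` can recover the host even when the scheme is missing:

```python
from urllib.parse import urlparse


def extract_domain(url: str) -> str:
    """Return the network location (host) of a URL, tolerating a missing scheme."""
    if "://" not in url:
        # urlparse puts scheme-less URLs entirely into .path, so add a scheme first.
        url = "http://" + url
    return urlparse(url).netloc


print(extract_domain("http://www.californialandcan.org/Plumas"))  # www.californialandcan.org
print(extract_domain("californialandcan.org/Plumas"))             # californialandcan.org
```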
Pardon me for the intrusion.
I have taken some simple steps in #24 along the lines of what @timoschick suggested.
Later I will find a way to put some suggested code snippets here.
bsmetadata/preprocessing_utils.py (Outdated)

```python
        return self.website_description_cache[keyword]

    def extract_wiki_desc(self, keyword: str) -> Optional:
```
As JeanZ has no access to the internet, I think we need to first download all wiki descriptions as a preprocessing step.
download_wiki_dump.sh -> will clone the repo that contains the dump_db file for fetching Wikipedia data.

Points for discussion:

Edit: Changed the code to use the nltk tokenizer for sentence splitting.
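Putting those pieces together, a rough sketch of what the offline lookup might look like (the database path `enwiki.db` is illustrative; the wikipedia2vec `DumpDB` file and nltk's punkt model would both need to be prepared in advance, since the cluster has no internet access):

```python
import nltk
from wikipedia2vec.dump_db import DumpDB

# Built ahead of time from a Wikipedia dump; no network access needed afterwards.
# nltk's punkt tokenizer data must likewise be downloaded beforehand.
dump_db = DumpDB("enwiki.db")


def first_sentence(title: str):
    """Return the first sentence of the article's first paragraph, or None if missing."""
    try:
        text = dump_db.get_paragraphs(title)[0].text
    except (KeyError, IndexError):
        return None
    return nltk.sent_tokenize(text)[0]
```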
timoschick left a comment
Thanks a lot, @shanyas10 - looks good to me 👍
```python
def fetch_wikipedia_description_for_title(self, title: str) -> Optional:
    try:
        text = self.wiki_dump_db.get_paragraphs(title)[0].text
        text = nltk.sent_tokenize(text)[0]  # Picking the first sentence
```
Would it maybe make sense to remove information in brackets? For example, the first sentence in the Wikipedia article on Wikipedia itself is "Wikipedia (/ˌwɪkɪˈpiːdiə/ (listen) wik-ih-PEE-dee-ə or /ˌwɪki-/ (listen) wik-ee-) is a free content, multilingual online encyclopedia written and maintained by a community of volunteers through a model of open collaboration, using a wiki-based editing system." At least in this case, I would think that the part in brackets, (/ˌwɪkɪˈpiːdiə/ (listen) wik-ih-PEE-dee-ə or /ˌwɪki-/ (listen) wik-ee-), is completely useless for the model but will probably require many tokens. Any thoughts on that?
Sounds like a great idea!
If it saves you some time @shanyas10, here is a little code snippet that will remove the text in parentheses:

```python
import re

text = re.sub(r"\((?:[^)(]|\([^)(]*\))*\)", "", text)
```

On @timoschick's example, it will output: Wikipedia is a free content, multilingual online encyclopedia written and maintained by a community of volunteers through a model of open collaboration, using a wiki-based editing system.
Makes a lot of sense, @timoschick.
Thank you for the snippet, @SaulLu. I've made the changes accordingly.
SaulLu left a comment
Thank you so much for all your hard work @shanyas10! I especially appreciate all the efforts you made to make this script run without internet access 🎊 !
Thanks also for the tests!
```
wikipedia2vec==1.0.5
nltk==3.6.5
```
Really a tiny detail, but I think we can make these dependencies optional by adding them to the setup.py in extras_require 🙂 :
```python
setup(
    name="bsmetadata",
    python_requires=">=3.7.11, <3.10",
    version="0.1.0",
    url="https://github.com/bigscience-workshop/metadata.git",
    author="Multiple Authors",
    author_email="xxx",
    description="Codebase for including metadata (e.g., URLs, timestamps, HTML tags) during language model pretraining.",
    packages=find_packages(),
    install_requires=install_requires,
    extras_require={
        "website_description_preprocessing": ["wikipedia2vec==1.0.5", "nltk==3.6.5"],
    },
)
```
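With an entry like this, the optional dependencies can then be installed with `pip install .[website_description_preprocessing]`, while a plain `pip install .` leaves them out.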
Thank you for pointing this out @SaulLu :) I've made the required changes.
Since the changes post-approval are minor, I'm merging the PR :)
@timoschick, raising this PR to check whether this is what's expected. Also, while annotating data myself, I had taken the top 100 keywords since they were the most likely to have a Wikipedia description. Since we aren't doing that here, chances are that some examples might not contain a description.
Tests are yet to be written, so the PR is not mergeable yet.