Adding WebsiteMetadataProcessor to preprocessing_utils #49
shanyas10 merged 33 commits into bigscience-workshop:master from
Conversation
timoschick left a comment
Hi @shanyas10, thanks for taking the time to write this. This is exactly what I'd expect a MetadataProcessor for websites to look like 👍
I've added a couple of comments on how I think the code might further be improved.
bsmetadata/preprocessing_utils.py (Outdated)

```python
"""Metadata preprocessor for adding website description based on URLs."""

website_description_cache = {}
org_list = ["com", "co", "org", "go", "in"]
```
What's the reason for using this specific set of top-level domains?
bsmetadata/preprocessing_utils.py (Outdated)

```python
website_description = self._extract_website_desc_from_url(urls[0])

if website_description:
    metadata.append({"key": "timestamp", "type": "global", "value": website_description})
```
This should probably be something like `"key": "website_description"`.
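A minimal sketch of the corrected line, following the comment above (reusing the variables from the diff):

```python
# Use a key that actually describes this metadata entry.
metadata.append({"key": "website_description", "type": "global", "value": website_description})
```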
bsmetadata/preprocessing_utils.py (Outdated)

```python
def _extract_website_desc_from_url(self, url: str) -> Optional:
    domain = url.split("/")[2]  # e.g. http://www.californialandcan.org/Plumas -> www.californialandcan.org
```
This would fail for URLs that don't start with http://. I guess all URLs in C4 start with http://, but it would probably be good to be on the safe side here. Also, you may want to consider using some library like urllib (see https://docs.python.org/3/library/urllib.parse.html) for splitting URLs into components as this will take care of all unexpected edge cases for you.
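As a rough illustration of that suggestion (the helper name `extract_domain` is made up for this sketch, not from the PR), `urlparse` can recover the host even when the scheme is missing:

```python
from urllib.parse import urlparse


def extract_domain(url: str) -> str:
    """Return the network location (host) of a URL, tolerating a missing scheme."""
    if "://" not in url:
        # urlparse puts scheme-less URLs entirely into .path, so add a scheme first.
        url = "http://" + url
    return urlparse(url).netloc


print(extract_domain("http://www.californialandcan.org/Plumas"))  # www.californialandcan.org
print(extract_domain("californialandcan.org/Plumas"))             # californialandcan.org
```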
Pardon me for the intrusion.
I have taken some simple steps in #24 along the lines of what @timoschick suggested.
Later I will find a way to put some suggested code snippets here.
bsmetadata/preprocessing_utils.py (Outdated)

```python
        return self.website_description_cache[keyword]

    def extract_wiki_desc(self, keyword: str) -> Optional:
```
As JeanZ has no access to the internet, I think we need to first download all wiki descriptions as a preprocessing step.
download_wiki_dump.sh -> will clone the repo that contains the dump_db file for fetching Wikipedia data.

Points for discussion:

Edit: Changed the code to use the nltk tokenizer for sentence splitting.
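Putting those pieces together, a rough sketch of what the offline lookup might look like (the database path `enwiki.db` is illustrative; the wikipedia2vec `DumpDB` file and nltk's punkt model would both need to be prepared in advance, since the cluster has no internet access):

```python
import nltk
from wikipedia2vec.dump_db import DumpDB

# Built ahead of time from a Wikipedia dump; no network access needed afterwards.
# nltk's punkt tokenizer data must likewise be downloaded beforehand.
dump_db = DumpDB("enwiki.db")


def first_sentence(title: str):
    """Return the first sentence of the article's first paragraph, or None if missing."""
    try:
        text = dump_db.get_paragraphs(title)[0].text
    except (KeyError, IndexError):
        return None
    return nltk.sent_tokenize(text)[0]
```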
timoschick left a comment
Thanks a lot, @shanyas10 - looks good to me 👍
```python
def fetch_wikipedia_description_for_title(self, title: str) -> Optional:
    try:
        text = self.wiki_dump_db.get_paragraphs(title)[0].text
        text = nltk.sent_tokenize(text)[0]  # Picking the first sentence
```
Would it maybe make sense to remove information in brackets? For example, the first sentence in the Wikipedia article on Wikipedia itself is "Wikipedia (/ˌwɪkɪˈpiːdiə/ (listen) wik-ih-PEE-dee-ə or /ˌwɪki-/ (listen) wik-ee-) is a free content, multilingual online encyclopedia written and maintained by a community of volunteers through a model of open collaboration, using a wiki-based editing system." At least in this case, I would think that the part in brackets, (/ˌwɪkɪˈpiːdiə/ (listen) wik-ih-PEE-dee-ə or /ˌwɪki-/ (listen) wik-ee-), is completely useless for the model but will probably require many tokens. Any thoughts on that?
Sounds like a great idea!
If it saves you some time @shanyas10, here is a little code snippet that will remove the text in parentheses:

```python
import re

text = re.sub(r"\((?:[^)(]|\([^)(]*\))*\)", "", text)
```

On @timoschick's example, it will output: Wikipedia is a free content, multilingual online encyclopedia written and maintained by a community of volunteers through a model of open collaboration, using a wiki-based editing system.
Makes a lot of sense, @timoschick.
Thank you for the snippet, @SaulLu. I've made the changes accordingly.
SaulLu left a comment
Thank you so much for all your hard work @shanyas10! I especially appreciate all the efforts you made to make this script run without internet access 🎊 !
Thanks also for the tests!
```
wikipedia2vec==1.0.5
nltk==3.6.5
```
Really a tiny detail, but I think we can make these dependencies optional by adding them to the setup.py in extras_require 🙂 :
```python
setup(
    name="bsmetadata",
    python_requires=">=3.7.11, <3.10",
    version="0.1.0",
    url="https://github.com/bigscience-workshop/metadata.git",
    author="Multiple Authors",
    author_email="xxx",
    description="Codebase for including metadata (e.g., URLs, timestamps, HTML tags) during language model pretraining.",
    packages=find_packages(),
    install_requires=install_requires,
    extras_require={
        "website_description_preprocessing": ["wikipedia2vec==1.0.5", "nltk==3.6.5"],
    },
)
```
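With an entry like this, the optional dependencies can then be installed with `pip install .[website_description_preprocessing]`, while a plain `pip install .` leaves them out.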
Thank you for pointing this out @SaulLu :) I've made the required changes.
Since the changes post-approval are minor, I'm merging the PR :)
@timoschick, raising this PR to check whether this is what's expected. Also, while annotating data myself, I had taken the top 100 keywords since they were the most likely to have a Wikipedia description. Since we aren't doing that here, chances are that some examples might not contain a description.
Tests are yet to be written, so the PR is not mergeable yet.