Merged
50 changes: 24 additions & 26 deletions notebooks/10 Data_acquisition.ipynb
@@ -451,7 +451,7 @@
"\n",
"To scrape web pages, you firstly need to download them. This can be done using the `requests` library that was explained above. \n",
"\n",
- "The code below scrapes data from a website which was developed specifically for developers who want to practice their web scraping skills, [toscrape.com](toscrape.com). It is a safe web scraping sandbox. The web page [http://books.toscrape.com/](http://books.toscrape.com/) displays a fictional bookstore. "
+ "The code below scrapes data from a website which was developed specifically for developers who want to practice their web scraping skills, [toscrape.com](https://toscrape.com). It is a safe web scraping sandbox. The web page [books.toscrape.com](https://books.toscrape.com/) displays a fictional bookstore. "
]
},
{
@@ -463,7 +463,7 @@
"\n",
"import requests\n",
"\n",
- "url = 'http://books.toscrape.com/'\n",
+ "url = 'https://books.toscrape.com/'\n",
"\n",
"response = requests.get( url )\n",
"\n",
@@ -516,23 +516,23 @@
"```\n",
" <article class=\"product_pod\">\n",
" \n",
- " <div class=\"image_container\">\n",
- " <a href=\"catalogue/libertarianism-for-beginners_982/index.html\">\n",
- " <img alt=\"Libertarianism for Beginners\" class=\"thumbnail\" src=\"media/cache/0b/bc/0bbcd0a6f4bcd81ccb1049a52736406e.jpg\"/>\n",
- " </a>\n",
- " </div>\n",
- "\n",
- " <h3>\n",
- " <a href=\"catalogue/libertarianism-for-beginners_982/index.html\" title=\"Libertarianism for Beginners\">\n",
- " Libertarianism for Beginners\n",
- " </a>\n",
- " </h3>\n",
- " <div class=\"product_price\">\n",
- " <p class=\"price_color\">\n",
- " £51.33\n",
- " </p>\n",
- " \n",
- " </article>\n",
+ " <div class=\"image_container\">\n",
+ " <a href=\"catalogue/libertarianism-for-beginners_982/index.html\">\n",
+ " <img alt=\"Libertarianism for Beginners\" class=\"thumbnail\" src=\"media/cache/0b/bc/0bbcd0a6f4bcd81ccb1049a52736406e.jpg\"/>\n",
+ " </a>\n",
+ " </div>\n",
+ "\n",
+ " <h3>\n",
+ " <a href=\"catalogue/libertarianism-for-beginners_982/index.html\" title=\"Libertarianism for Beginners\">\n",
+ " Libertarianism for Beginners\n",
+ " </a>\n",
+ " </h3>\n",
+ " <div class=\"product_price\">\n",
+ " <p class=\"price_color\">\n",
+ " £51.33\n",
+ " </p>\n",
+ " \n",
+ "</article>\n",
"```\n",
"\n",
"The title of the book can be found in an `h3` element. The price is given in a `<p>` element, with the class `price_color`. This `<p>` element is contained within a `<div>` with the class `product_price`. 'Scraping' the page really means that we need to extract the values we need from these HTML elements. \n",
@@ -625,9 +625,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "\n",
- "\n",
- "## Advanced scraping: Scrapy\n",
+ "### Advanced scraping: Scrapy\n",
"\n",
"As you can see, web scraping can easily become rather difficult. You need to inspect the structure of the HTML source quite carefully, and you often need to work with fairly complicated code to extract only the values that you need. This tutorial has only touched the surface of web scraping. To get specific data from webpages or APIs, you will often need to dig deeply into the data that you get. \n",
"\n",
@@ -640,9 +638,9 @@
"source": [
"### Exercise 10.5.\n",
"\n",
- "This tutorial has explained how you can extract data about the titles and the prices of all the books that are shown on the web page [http://books.toscrape.com/](http://books.toscrape.com/). \n",
+ "This tutorial has explained how you can extract data about the titles and the prices of all the books that are shown on the web page <https://books.toscrape.com/>.\n",
"\n",
- "Can you write code to extract the URLs of all the book covers on this page? These URLs can be found in the `src` attribute of the `<img>` elements within the `<article>` about each book. Note that the `<img>` element specifies a relative path. To change the relative path into an absolute path, you need to concatenate the base url ([http://books.toscrape.com/](http://books.toscrape.com/) and the relative path to the image. "
+ "Can you write code to extract the URLs of all the book covers on this page? These URLs can be found in the `src` attribute of the `<img>` elements within the `<article>` about each book. Note that the `<img>` element specifies a relative path. To change the relative path into an absolute path, you need to concatenate the base url (<https://books.toscrape.com/>) and the relative path to the image. "
]
},
{
@@ -658,7 +656,7 @@
"source": [
"### Exercise 10.6. \n",
"\n",
- "On the web page [http://books.toscrape.com/](http://books.toscrape.com/), the menu on the lefthand side contains a list of all the subject categories of the books. \n",
+ "On the web page <https://books.toscrape.com/>, the menu on the lefthand side contains a list of all the subject categories of the books. \n",
"\n",
"Try to write some code which can extract all the terms in this list. This list is in an element named `div`, and this `<div>` has a `class` attribute with the value `side_categories`. The categories themselves are all encoded within an `<a>` element. "
]
@@ -876,7 +874,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
- "version": "3.7.9"
+ "version": "3.9.7"
}
},
"nbformat": 4,
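The notebook text in the hunks above says the title sits in an `h3` element and the price in a `<p>` with the class `price_color`. As a minimal sketch of that extraction, the block below parses a tidied copy of the quoted `<article>` snippet. It is an illustration under assumptions, not the notebook's own code: it uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra packages, the class name `BookParser` is hypothetical, and a closing `</div>` has been added to the snippet.

```python
from html.parser import HTMLParser

# Tidied copy of the <article> snippet quoted in the notebook
# (the closing </div> is added here; the image container is omitted).
SNIPPET = """
<article class="product_pod">
  <h3>
    <a href="catalogue/libertarianism-for-beginners_982/index.html" title="Libertarianism for Beginners">
      Libertarianism for Beginners
    </a>
  </h3>
  <div class="product_price">
    <p class="price_color">
      £51.33
    </p>
  </div>
</article>
"""


class BookParser(HTMLParser):
    """Hypothetical helper: collects (title, price) pairs from the snippet."""

    def __init__(self):
        super().__init__()
        self.books = []         # finished (title, price) tuples
        self._in_h3 = False     # True while inside an <h3>
        self._in_price = False  # True while inside <p class="price_color">
        self._title = None      # title of the book currently being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h3":
            self._in_h3 = True
        elif tag == "a" and self._in_h3:
            # The link inside the <h3> carries the title in its title attribute.
            self._title = attrs.get("title")
        elif tag == "p" and "price_color" in (attrs.get("class") or "").split():
            self._in_price = True

    def handle_endtag(self, tag):
        if tag == "h3":
            self._in_h3 = False
        elif tag == "p":
            self._in_price = False

    def handle_data(self, data):
        # Text inside the price paragraph completes one book record.
        if self._in_price and data.strip():
            self.books.append((self._title, data.strip()))


parser = BookParser()
parser.feed(SNIPPET)
books = parser.books
print(books)
```

With BeautifulSoup, which the notebook itself imports, the same two lookups would probably be much shorter; the event-based parser is used here only to keep the sketch dependency-free.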
12 changes: 6 additions & 6 deletions notebooks/Solutions/10 Data_acquisition.ipynb
@@ -276,9 +276,9 @@
"source": [
"### Exercise 10.5.\n",
"\n",
- "This tutorial has explained how you can extract data about the titles and the prices of all the books that are shown on the web page [http://books.toscrape.com/](http://books.toscrape.com/). \n",
+ "This tutorial has explained how you can extract data about the titles and the prices of all the books that are shown on the web page <https://books.toscrape.com/>.\n",
"\n",
- "Can you write code to extract the URLs of all the book covers on this page? These URLs can be found in the `src` attribute of the `<img>` elements within the `<article>` about each book. Note that the `<img>` element specifies a relative path. To change the relative path into an absolute path, you need to concatenate the base url ([http://books.toscrape.com/](http://books.toscrape.com/) and the relative path to the image. "
+ "Can you write code to extract the URLs of all the book covers on this page? These URLs can be found in the `src` attribute of the `<img>` elements within the `<article>` about each book. Note that the `<img>` element specifies a relative path. To change the relative path into an absolute path, you need to concatenate the base url (<https://books.toscrape.com/>) and the relative path to the image. "
]
},
{
@@ -290,7 +290,7 @@
"from bs4 import BeautifulSoup\n",
"import requests\n",
"\n",
- "url = 'http://books.toscrape.com/'\n",
+ "url = 'https://books.toscrape.com/'\n",
"response = requests.get( url )\n",
"\n",
"\n",
@@ -316,7 +316,7 @@
"source": [
"### Exercise 10.6. \n",
"\n",
- "On the web page [http://books.toscrape.com/](http://books.toscrape.com/), the menu on the lefthand side contains a list of all the subject categories of the books. \n",
+ "On the web page <https://books.toscrape.com/>, the menu on the lefthand side contains a list of all the subject categories of the books. \n",
"\n",
"Try to write some code which can extract all the terms in this list. This list is in an element named `div`, and this `<div>` has a `class` attribute with the value `side_categories`. The categories themselves are all encoded within an `<a>` element. "
]
@@ -330,7 +330,7 @@
"from bs4 import BeautifulSoup\n",
"import requests\n",
"\n",
- "url = 'http://books.toscrape.com/'\n",
+ "url = 'https://books.toscrape.com/'\n",
"response = requests.get( url )\n",
"\n",
"\n",
@@ -636,7 +636,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
- "version": "3.7.9"
+ "version": "3.9.7"
}
},
"nbformat": 4,
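Exercise 10.5 asks for the relative `src` path to be concatenated with the base URL, which this change switches to `https://books.toscrape.com/`. A minimal sketch of that step follows, using `urllib.parse.urljoin` from the standard library as one possible alternative to plain string concatenation. The image path is the one quoted in the notebook's HTML snippet; the second call uses a made-up nested page URL and file name (`catalogue/page-2.html`, `x.jpg`) purely for illustration.

```python
from urllib.parse import urljoin

# Base URL as updated in this change, and the relative path from the
# <img> element quoted in the notebook's HTML snippet.
base_url = "https://books.toscrape.com/"
relative_src = "media/cache/0b/bc/0bbcd0a6f4bcd81ccb1049a52736406e.jpg"

# urljoin resolves the relative path against the base URL.
cover_url = urljoin(base_url, relative_src)
print(cover_url)

# Illustrative assumption: a page one directory down whose image path
# starts with "../". urljoin resolves the parent-directory step as well,
# which simple string concatenation would not.
nested_url = urljoin(
    "https://books.toscrape.com/catalogue/page-2.html",  # hypothetical page
    "../media/cache/x.jpg",                              # hypothetical path
)
print(nested_url)
```

For the front page used in the exercise, concatenation and `urljoin` give the same result; the difference only shows once the page being scraped is not at the site root.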