diff --git a/notebooks/10 Data_acquisition.ipynb b/notebooks/10 Data_acquisition.ipynb index 1c8cc85..3dac4ab 100644 --- a/notebooks/10 Data_acquisition.ipynb +++ b/notebooks/10 Data_acquisition.ipynb @@ -451,7 +451,7 @@ "\n", "To scrape web pages, you firstly need to download them. This can be done using the `requests` library that was explained above. \n", "\n", - "The code below scrapes data from a website which was developed specifically for developers who want to practice their web scraping skills, [toscrape.com](toscrape.com). It is a safe web scraping sandbox. The web page [http://books.toscrape.com/](http://books.toscrape.com/) displays a fictional bookstore. " + "The code below scrapes data from a website which was developed specifically for developers who want to practice their web scraping skills, [toscrape.com](https://toscrape.com). It is a safe web scraping sandbox. The web page [books.toscrape.com](https://books.toscrape.com/) displays a fictional bookstore. " ] }, { @@ -463,7 +463,7 @@ "\n", "import requests\n", "\n", - "url = 'http://books.toscrape.com/'\n", + "url = 'https://books.toscrape.com/'\n", "\n", "response = requests.get( url )\n", "\n", @@ -516,23 +516,23 @@ "```\n", "
\n", " \n", - "
\n", - " \n", - " \"Libertarianism\n", - " \n", - "
\n", - "\n", - "

\n", - " \n", - " Libertarianism for Beginners\n", - " \n", - "

\n", - "
\n", - "

\n", - " £51.33\n", - "

\n", - " \n", - "
\n", + "
\n", + " \n", + " \"Libertarianism\n", + " \n", + "
\n", + "\n", + "

\n", + " \n", + " Libertarianism for Beginners\n", + " \n", + "

\n", + "
\n", + "

\n", + " £51.33\n", + "

\n", + " \n", + "\n", "```\n", "\n", "The title of the book can be found in an `h3` element. The price is given in a `<p>` element, with the class `price_color`. This `<p>` element is contained within a `<div>` with the class `product_price`. 'Scraping' the page really means that we need to extract the values we need from these HTML elements. \n", @@ -625,9 +625,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "\n", - "\n", - "## Advanced scraping: Scrapy\n", + "### Advanced scraping: Scrapy\n", "\n", "As you can see, web scraping can easily become rather difficult. You need to inspect the structure of the HTML source quite carefully, and you often need to work with fairly complicated code to extract only the values that you need. This tutorial has only touched the surface of web scraping. To get specific data from webpages or APIs, you will often need to dig deeply into the data that you get. \n", "\n", @@ -640,9 +638,9 @@ "source": [ "### Exercise 10.5.\n", "\n", - "This tutorial has explained how you can extract data about the titles and the prices of all the books that are shown on the web page [http://books.toscrape.com/](http://books.toscrape.com/). \n", + "This tutorial has explained how you can extract data about the titles and the prices of all the books that are shown on the web page <https://books.toscrape.com/>.\n", "\n", - "Can you write code to extract the URLs of all the book covers on this page? These URLs can be found in the `src` attribute of the `<img>` elements within the `
` about each book. Note that the `<img>` element specifies a relative path. To change the relative path into an absolute path, you need to concatenate the base url ([http://books.toscrape.com/](http://books.toscrape.com/) and the relative path to the image. " + "Can you write code to extract the URLs of all the book covers on this page? These URLs can be found in the `src` attribute of the `<img>` elements within the `
` about each book. Note that the `<img>` element specifies a relative path. To change the relative path into an absolute path, you need to concatenate the base url (<https://books.toscrape.com/>) and the relative path to the image. " ] }, { @@ -658,7 +656,7 @@ "source": [ "### Exercise 10.6. \n", "\n", - "On the web page [http://books.toscrape.com/](http://books.toscrape.com/), the menu on the lefthand side contains a list of all the subject categories of the books. \n", + "On the web page <https://books.toscrape.com/>, the menu on the lefthand side contains a list of all the subject categories of the books. \n", "\n", "Try to write some code which can extract all the terms in this list. This list is in an element named `div`, and this `<div>` has a `class` attribute with the value `side_categories`. The categories themselves are all encoded within an `<a>` element. " ] @@ -876,7 +874,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.7.9" + "version": "3.9.7" } }, "nbformat": 4, diff --git a/notebooks/Solutions/10 Data_acquisition.ipynb b/notebooks/Solutions/10 Data_acquisition.ipynb index f281b92..99e43ed 100644 --- a/notebooks/Solutions/10 Data_acquisition.ipynb +++ b/notebooks/Solutions/10 Data_acquisition.ipynb @@ -276,9 +276,9 @@ "source": [ "### Exercise 10.5.\n", "\n", - "This tutorial has explained how you can extract data about the titles and the prices of all the books that are shown on the web page [http://books.toscrape.com/](http://books.toscrape.com/). \n", + "This tutorial has explained how you can extract data about the titles and the prices of all the books that are shown on the web page <https://books.toscrape.com/>.\n", "\n", - "Can you write code to extract the URLs of all the book covers on this page? These URLs can be found in the `src` attribute of the `<img>` elements within the `
` about each book. Note that the `<img>` element specifies a relative path. To change the relative path into an absolute path, you need to concatenate the base url ([http://books.toscrape.com/](http://books.toscrape.com/) and the relative path to the image. " + "Can you write code to extract the URLs of all the book covers on this page? These URLs can be found in the `src` attribute of the `<img>` elements within the `
` about each book. Note that the `<img>` element specifies a relative path. To change the relative path into an absolute path, you need to concatenate the base url (<https://books.toscrape.com/>) and the relative path to the image. " ] }, { @@ -290,7 +290,7 @@ "from bs4 import BeautifulSoup\n", "import requests\n", "\n", - "url = 'http://books.toscrape.com/'\n", + "url = 'https://books.toscrape.com/'\n", "response = requests.get( url )\n", "\n", "\n", @@ -316,7 +316,7 @@ "source": [ "### Exercise 10.6. \n", "\n", - "On the web page [http://books.toscrape.com/](http://books.toscrape.com/), the menu on the lefthand side contains a list of all the subject categories of the books. \n", + "On the web page <https://books.toscrape.com/>, the menu on the lefthand side contains a list of all the subject categories of the books. \n", "\n", "Try to write some code which can extract all the terms in this list. This list is in an element named `div`, and this `