### Web Scraping with `requests` and `BeautifulSoup`
#### Introduction
Web scraping is a technique for extracting data from websites. In Python, this can be efficiently achieved using the `requests` library to download webpage content and `BeautifulSoup` from `bs4` to parse this content. This lesson will guide you through scraping a webpage to extract structured data like class titles and start dates from The Multiverse School's classes page.
#### 1. Understanding `requests`
- **Purpose**: To send HTTP requests with a simple API. Fetching the webpage is the first step in web scraping.
- **Common Uses**:
- Downloading HTML content of websites.
- Making API calls.
- **Key Commands**:
- `requests.get(url)`: Sends a GET request to the specified URL.
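A minimal sketch of that first step. The URL here (`https://example.com`) is a placeholder for illustration, not The Multiverse School's page, and assumes you have network access:

```python
import requests

# Placeholder URL used purely for illustration
url = "https://example.com"

# Send a GET request; a timeout avoids hanging indefinitely
response = requests.get(url, timeout=10)
response.raise_for_status()  # Raises HTTPError for 4xx/5xx responses

print(response.status_code)   # HTTP status, e.g. 200
print(response.text[:80])     # First 80 characters of the raw HTML
```

`response.text` holds the page's HTML as a string, which is exactly what you will hand to `BeautifulSoup` in the next step.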
#### 2. Exploring `BeautifulSoup`
- **Purpose**: To parse HTML and XML documents. It creates parse trees that can be used to extract data easily.
- **Common Uses**:
- Parsing HTML.
- Extracting data from HTML tags.
- **Key Functions**:
- `BeautifulSoup(markup, 'html.parser')`: Parses markup into a BeautifulSoup object.
- `soup.find_all(tag_name)`: Finds all instances of a tag.
- `soup.get_text()`: Extracts all text from a tag and its descendants.
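Here is a small self-contained sketch of those functions. The HTML snippet is made up to mimic the structure we will scrape later (class names and dates are invented):

```python
from bs4 import BeautifulSoup

# A made-up HTML snippet mimicking the page structure we target later
html = """
<article class="thumb"><h2>Jan 5 | Intro to Python</h2></article>
<article class="thumb"><h2>Feb 2 | Web Scraping</h2></article>
"""

# Parse the markup into a BeautifulSoup object
soup = BeautifulSoup(html, "html.parser")

# find_all returns a list of every matching tag
for article in soup.find_all("article", class_="thumb"):
    # get_text pulls the text out of the h2; strip=True trims whitespace
    print(article.find("h2").get_text(strip=True))
```

Note that `class_` (with a trailing underscore) is used to filter by CSS class, since `class` is a reserved word in Python.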
#### Installing Libraries
Before we begin, ensure that you have the `requests` and `bs4` libraries installed:
```bash
pip install requests beautifulsoup4
```
#### Example: Scraping Class Titles and Start Dates
Let's write a script to extract the titles and start dates of classes from the provided HTML of The Multiverse School's classes page.
```python
import requests
from bs4 import BeautifulSoup
# URL of the page to scrape
url = "http://www.themultiverse.school/classes"
# Sending HTTP request
response = requests.get(url)
response.raise_for_status() # Ensure we notice bad responses
# Parsing the HTML content of the page
soup = BeautifulSoup(response.text, 'html.parser')
# Initialize a list to hold our extracted class information
class_info = []
# Extract data from each article tag, assuming each class is in an 'article' tag
for article in soup.find_all('article', class_='thumb'):
    title_tag = article.find('h2')
    if title_tag:
        # Extract the class title and dates; the h2 text is "dates | title"
        date_text, class_title = title_tag.get_text(strip=True, separator='|').split('|', 1)
        class_info.append({'title': class_title.strip(), 'start_dates': date_text.strip()})
# Display extracted information
for info in class_info:
    print(f"Class Title: {info['title']}, Start Dates: {info['start_dates']}")
```
#### Explanation
1. We use `requests.get()` to fetch the webpage.
2. `BeautifulSoup` parses the HTML content.
3. We then look for each `article` tag with a class of `thumb`, as this seems to contain each class's information.
4. Inside each `article`, we locate the `h2` tag, which holds the title and dates, then extract the text.
#### Exercise
Modify the script to also extract and print the URL for each class, assuming it can be found within an `a` tag with the class `image`. This gives you practice navigating HTML structures and using BeautifulSoup's `find` method.
This task reinforces how to combine Python tools to automate the collection of web data, useful for a wide range of applications like competitive analysis, event monitoring, or academic research.