### Web Scraping with `requests` and `BeautifulSoup`
#### Introduction
Web scraping is a technique for extracting data from websites. In Python, this can be efficiently achieved using the `requests` library to download webpage content and `BeautifulSoup` from `bs4` to parse this content. This lesson will guide you through scraping a webpage to extract structured data like class titles and start dates from The Multiverse School's classes page.
#### 1. Understanding `requests`
- **Purpose**: To send HTTP requests with a simple API. Fetching the webpage is the first step in web scraping.
- **Common Uses**:
- Downloading HTML content of websites.
- Making API calls.
- **Key Commands**:
- `requests.get(url)`: Sends a GET request to the specified URL.
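A minimal sketch of that first step. The URL here (`https://example.com`) is a placeholder for illustration, not The Multiverse School's page, and assumes you have network access:

```python
import requests

# Placeholder URL used purely for illustration
url = "https://example.com"

# Send a GET request; a timeout avoids hanging indefinitely
response = requests.get(url, timeout=10)
response.raise_for_status()  # Raises HTTPError for 4xx/5xx responses

print(response.status_code)   # HTTP status, e.g. 200
print(response.text[:80])     # First 80 characters of the raw HTML
```

`response.text` holds the page's HTML as a string, which is exactly what you will hand to `BeautifulSoup` in the next step.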
#### 2. Exploring `BeautifulSoup`
- **Purpose**: To parse HTML and XML documents. It creates parse trees that can be used to extract data easily.
- **Common Uses**:
- Parsing HTML.
- Extracting data from HTML tags.
- **Key Functions**:
- `BeautifulSoup(markup, 'html.parser')`: Parses markup into a BeautifulSoup object.
- `soup.find_all(tag_name)`: Finds all instances of a tag.
- `soup.get_text()`: Extracts all text from a tag and its descendants.
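Here is a small self-contained sketch of those functions. The HTML snippet is made up to mimic the structure we will scrape later (class names and dates are invented):

```python
from bs4 import BeautifulSoup

# A made-up HTML snippet mimicking the page structure we target later
html = """
<article class="thumb"><h2>Jan 5 | Intro to Python</h2></article>
<article class="thumb"><h2>Feb 2 | Web Scraping</h2></article>
"""

# Parse the markup into a BeautifulSoup object
soup = BeautifulSoup(html, "html.parser")

# find_all returns a list of every matching tag
for article in soup.find_all("article", class_="thumb"):
    # get_text pulls the text out of the h2; strip=True trims whitespace
    print(article.find("h2").get_text(strip=True))
```

Note that `class_` (with a trailing underscore) is used to filter by CSS class, since `class` is a reserved word in Python.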
#### Installing Libraries
Before we begin, ensure that you have the `requests` and `bs4` libraries installed:
```bash
pip install requests beautifulsoup4
```
#### Example: Scraping Class Titles and Start Dates
Let's write a script to extract the titles and start dates of classes from the provided HTML of The Multiverse School's classes page.
```python
import requests
from bs4 import BeautifulSoup
# URL of the page to scrape
url = "http://www.themultiverse.school/classes"
# Sending HTTP request
response = requests.get(url)
response.raise_for_status() # Ensure we notice bad responses
# Parsing the HTML content of the page
soup = BeautifulSoup(response.text, 'html.parser')
# Initialize a list to hold our extracted class information
class_info = []
# Extract data from each article tag, assuming each class is in an 'article' tag
for article in soup.find_all('article', class_='thumb'):
    title_tag = article.find('h2')
    if title_tag:
        # Extract the class title and dates; the h2 text is "dates | title"
        date_text, class_title = title_tag.get_text(strip=True, separator='|').split('|', 1)
        class_info.append({'title': class_title.strip(), 'start_dates': date_text.strip()})
# Display extracted information
for info in class_info:
    print(f"Class Title: {info['title']}, Start Dates: {info['start_dates']}")
```
#### Explanation
1. We use `requests.get()` to fetch the webpage.
2. `BeautifulSoup` parses the HTML content.
3. We then look for each `article` tag with a class of `thumb`, as this seems to contain each class's information.
4. Inside each `article`, we locate the `h2` tag, which holds the title and dates, then extract the text.
#### Exercise
Modify the script to also extract and print the URL for each class, assuming it can be found within an `a` tag with the class `image`. This gives you practice navigating HTML structures and using BeautifulSoup's `find` method.
This task reinforces how to combine Python tools to automate the collection of web data, useful for a wide range of applications like competitive analysis, event monitoring, or academic research.