What is web scraping?
Before you watch the Python web scraping tutorial below, let’s look at what web scraping is.
Web scraping is the process of extracting data from a webpage. Let’s say you would like to get a list of all the email addresses on a webpage. You could find them manually, copy them one by one, and paste them into a spreadsheet. But what if there are hundreds of them scattered across different parts of the page? That would be very time-consuming. Instead, you can write a short script that grabs every email address on the page and puts them all in a spreadsheet.
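To make the email example concrete, here is a minimal sketch using only Python's standard library. The sample page text, the function names, and the simplified email pattern are all assumptions made for illustration; a real page would be downloaded first, and real-world email matching has more edge cases than this pattern covers.

```python
import csv
import io
import re

# Simplified pattern: good enough for a demo, not a complete email validator.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(page_text: str) -> list[str]:
    """Return unique email addresses from page_text, in order of first appearance."""
    seen = set()
    emails = []
    for match in EMAIL_PATTERN.findall(page_text):
        email = match.lower()  # treat Sales@... and sales@... as the same address
        if email not in seen:
            seen.add(email)
            emails.append(email)
    return emails


def emails_to_csv(emails: list[str]) -> str:
    """Write one email per row under an 'email' header and return the CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["email"])
    writer.writerows([email] for email in emails)
    return buffer.getvalue()


# Hypothetical page content, standing in for a downloaded webpage.
sample_page = """
<p>Contact sales at sales@example.com or support at Support@example.com.</p>
<footer>Press: press@example.org | Sales: sales@example.com</footer>
"""

emails = extract_emails(sample_page)
print(emails_to_csv(emails))
```

The CSV text can be written to a file and opened in any spreadsheet program. Duplicates are dropped so that an address repeated in the footer only appears once.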
How is web scraping done?
To illustrate the web scraping process using Python, I am going to use the https://books.toscrape.com website as a playground and scrape the title, rating, and author of each book.
First, the web scraper receives the target URL to load. The scraper then loads all the HTML for that page. Next, you can either extract all the data on the webpage or extract only specific data (the title, rating, and the author). After extracting the required information, you can export it to a spreadsheet using either Python's built-in csv module or the pandas library.
To scrape a website, we use a Python library called BeautifulSoup.
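The extract-and-export steps can be sketched roughly as follows. This is an illustration under stated assumptions, not the tutorial's actual code: the HTML is a hand-written sample modeled on a book listing page (class names such as product_pod and star-rating should be checked against the live page), it carries only a title and a rating, and the function names are hypothetical. It assumes the beautifulsoup4 package is installed.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hand-written sample (an assumption); in a real run this would be the
# HTML downloaded from the target URL.
SAMPLE_HTML = """
<html><body>
  <article class="product_pod">
    <p class="star-rating Three"></p>
    <h3><a href="book-1.html" title="A Light in the Attic">A Light in the ...</a></h3>
  </article>
  <article class="product_pod">
    <p class="star-rating One"></p>
    <h3><a href="book-2.html" title="Tipping the Velvet">Tipping the Velvet</a></h3>
  </article>
</body></html>
"""


def extract_books(html: str) -> list[dict]:
    """Extract only the specific fields we care about from each book block."""
    soup = BeautifulSoup(html, "html.parser")
    books = []
    for article in soup.select("article.product_pod"):
        link = article.select_one("h3 a")
        rating_tag = article.select_one("p.star-rating")
        # In this sample the rating is the extra CSS class, e.g. "star-rating Three".
        rating = [c for c in rating_tag["class"] if c != "star-rating"][0]
        books.append({"title": link["title"], "rating": rating})
    return books


def books_to_csv(books: list[dict]) -> str:
    """Export the extracted rows as CSV text using the standard csv module."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["title", "rating"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(books)
    return buffer.getvalue()


# Option 1: extract all the text on the page.
all_text = BeautifulSoup(SAMPLE_HTML, "html.parser").get_text(" ", strip=True)

# Option 2: extract specific fields, then export them.
books = extract_books(SAMPLE_HTML)
print(books_to_csv(books))
```

Reading the title attribute instead of the link text matters in this sample because the visible text is truncated ("A Light in the ...") while the attribute holds the full title.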
What is BeautifulSoup?
Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work. (Source: Beautiful Soup documentation)
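As a small illustration of what navigating, searching, and modifying the parse tree can look like, here is a sketch on a made-up HTML fragment. The fragment and variable names are assumptions; the calls used (tag-name navigation, find, attribute access, and assigning to .string) are standard Beautiful Soup features, and the example assumes beautifulsoup4 is installed.

```python
from bs4 import BeautifulSoup

# Made-up fragment for demonstration purposes only.
html = (
    "<div id='shelf'>"
    "<h3><a href='/book-1' title='Full Title'>Short ...</a></h3>"
    "<p class='price'>10.00</p>"
    "</div>"
)
soup = BeautifulSoup(html, "html.parser")

# Navigating: walk down the tree by tag name, then back up via .parent.
link = soup.div.h3.a
parent_name = link.parent.name

# Searching: find the first <p> tag whose class is "price".
price = soup.find("p", class_="price")
price_text = price.get_text()

# Modifying: replace the truncated link text with the full title,
# and add a second CSS class to the price tag.
link.string = link["title"]
price["class"] = ["price", "checked"]

print(soup.prettify())
```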
Web scraping tutorial using BeautifulSoup
For this project, you need to install the following libraries from your terminal:
pip install requests
pip install beautifulsoup4
pip install pandas
I use the requests library to retrieve the content of the webpage I want to scrape.
I use BeautifulSoup to extract the information I want from that content.
I use the pandas library to export the scraped information to a CSV file.
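A minimal sketch of how the three libraries might hand off to each other is shown below. The function names, the h3 a[title] selector, and the books.csv filename are assumptions for illustration, not the tutorial's actual code, and the selector should be verified against the live page before relying on it.

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

URL = "https://books.toscrape.com"


def fetch_html(url: str) -> str:
    """requests: download the page and return its HTML as text."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # raise an error on 4xx/5xx responses
    return response.text


def parse_titles(html: str) -> list[dict]:
    """BeautifulSoup: collect the title attribute of every link inside an <h3>."""
    soup = BeautifulSoup(html, "html.parser")
    return [{"title": a["title"]} for a in soup.select("h3 a[title]")]


def save_rows(rows: list[dict], path: str) -> pd.DataFrame:
    """pandas: put the rows in a DataFrame and write them to a CSV file."""
    frame = pd.DataFrame(rows)
    frame.to_csv(path, index=False)  # index=False omits the row-number column
    return frame
```

Calling save_rows(parse_titles(fetch_html(URL)), "books.csv") would run the three steps in order. Keeping them as separate functions makes it easy to test the parsing step on saved HTML without hitting the network.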