https://youtu.be/zMWYC2ai91w
https://gist.github.com/daniyalb/42106c74afe20f2c442b796dba727d28
Web Scraping workshop 2023-02-22
Daniyal Bokhari on February 22, 2023
CSC108 level of knowledge required
Wed 22nd, 5pm
What you’ll learn:
How to use python requests
* Use BS4 to parse through HTML and find info
* Use python data structures to manage info
* Make a price finder?
3. What is Web Scraping?
The textbook definition
• Web scraping is the act of extracting data from
website
• Typically performed using automated tools to reduce
time spent on collecting data from websites
• Once data is extracted from websites, it can be parsed
to access specific pieces of information relevant to
your search
4. Some common uses
What is it used for?
• Industry statistics/insights
• By consumers to compare prices across different
stores
• Analyze stock or cryptocurrency prices
• Academic research
5. The legality of it
When should it not be used?
• Web scraping is generally considered to be legal when
it’s used to find publicly available information
• Should not be used to collect personal information,
data that the poster did not intend to be publicly
available
• Should not be used to access data that requires you
to create an account or login
6. A Python library for web scraping
• Beautiful Soup is a Python library that helps in web
scraping by providing useful methods to parse
through the HTML code of websites
• Creates python objects out of elements of the HTML
code allowing you to easily search and traverse
through the contents of websites
7. Which is better?
• Selenium refers to multiple different open-source projects, the Selenium
API can be used to control browsers
• Much more complex than Beautiful Soup, can interact with websites by
pressing buttons and filling out fields
• Used for automation and testing
• Can also be used to perform web scraping
• Beautiful Soup is useful for simple projects, can extract information faster
since it doesn’t load up all of the web page content and JavaScript
Beautiful Soup vs Selenium
8. • In order to use Beautiful Soup, you must specify a parser
• Parsers take the HTML code we receive from a website and construct an
object which we can use in Python to navigate and search through this
code
• Python has a built-in HTML parser called html.parser
• We will be using lxml, an external parser, as it is faster at parsing HTML
than html.parser and works for more versions of Python
Choice of Parser
Which one should you use?
9. Live Demo
• You will need to install Beautiful Soup, lxml, and the requests library
The best way to learn is to practice
Run these in a terminal:
pip install beautifulsoup4
pip install lxml
pip install requests
Alternate commands:
pip3 install <module>
Python3 -m pip install <module>
Python3 -m pip3 install <module>
• You can use your favourite IDE/text editor (Pycharm, VS Code, Vim, etc) to
follow along
10. Where can I learn more about this?
• The documentation is a great place to look at all available functions
• Online tutorials, YouTube videos are also a great place to learn more about
web scraping
Beautiful Soup 4 documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Many resources