Web Scraping Workshop Using Beautiful Soup

•

0 recomendaciones•57 vistas

https://youtu.be/zMWYC2ai91w https://gist.github.com/daniyalb/42106c74afe20f2c442b796dba727d28 Web Scraping workshop 2023-02-22 Daniyal Bokhari on February 22, 2023 CSC108 level of knowledge required Wed 22nd, 5pm What you’ll learn: How to use python requests * Use BS4 to parse through HTML and find info * Use python data structures to manage info * Make a price finder?

Tecnología

Web Scraping Workshop
Daniyal Bokhari
Using Beautiful Soup Package
for Python

What is Web Scraping?
The textbook definition
• Web scraping is the act of extracting data from
website
• Typically performed using automated tools to reduce
time spent on collecting data from websites
• Once data is extracted from websites, it can be parsed
to access speciﬁc pieces of information relevant to
your search

Some common uses
What is it used for?
• Industry statistics/insights
• By consumers to compare prices across diﬀerent
stores
• Analyze stock or cryptocurrency prices
• Academic research

The legality of it
When should it not be used?
• Web scraping is generally considered to be legal when
it’s used to ﬁnd publicly available information
• Should not be used to collect personal information,
data that the poster did not intend to be publicly
available
• Should not be used to access data that requires you
to create an account or login

A Python library for web scraping
• Beautiful Soup is a Python library that helps in web
scraping by providing useful methods to parse
through the HTML code of websites
• Creates python objects out of elements of the HTML
code allowing you to easily search and traverse
through the contents of websites

Which is better?
• Selenium refers to multiple diﬀerent open-source projects, the Selenium
API can be used to control browsers
• Much more complex than Beautiful Soup, can interact with websites by
pressing buttons and ﬁlling out ﬁelds
• Used for automation and testing
• Can also be used to perform web scraping
• Beautiful Soup is useful for simple projects, can extract information faster
since it doesn’t load up all of the web page content and JavaScript
Beautiful Soup vs Selenium

• In order to use Beautiful Soup, you must specify a parser
• Parsers take the HTML code we receive from a website and construct an
object which we can use in Python to navigate and search through this
code
• Python has a built-in HTML parser called html.parser
• We will be using lxml, an external parser, as it is faster at parsing HTML
than html.parser and works for more versions of Python
Choice of Parser
Which one should you use?

Live Demo
• You will need to install Beautiful Soup, lxml, and the requests library
The best way to learn is to practice
Run these in a terminal:
pip install beautifulsoup4
pip install lxml
pip install requests
Alternate commands:
pip3 install <module>
Python3 -m pip install <module>
Python3 -m pip3 install <module>
• You can use your favourite IDE/text editor (Pycharm, VS Code, Vim, etc) to
follow along

Where can I learn more about this?
• The documentation is a great place to look at all available functions
• Online tutorials, YouTube videos are also a great place to learn more about
web scraping
Beautiful Soup 4 documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Many resources

Más contenido relacionado

Similar a Web Scraping Workshop Using Beautiful Soup

Web Technology Part 1Thapar Institute

Best practices with development of enterprise-scale SharePoint solutions - Pa...SPC Adriatics

Find maximum bugs in limited timebeched

Getting started developing for share pointRoel Bethlehem

Automated Acceptance Testing from ScratchExcella

BSides Lisbon 2013 - All your sites belong to BurpTiago Mendo

DNN Summit: Robots.txt & Multi-Site DNN InstancesWill Strohl

WTF: Where To Focus when you take over a Drupal projectSymetris

SharePoint 2013 Search OperationsSPC Adriatics

Practical automation for beginnersSeoweon Yoo

DCI - Free Seo ToolsDot Com Infoway - Custom Software, Mobile, Web Application Development and Digital Marketing Company

Exploring Content API Options - March 23rd 2016Jani Tarvainen

If you want to automate, you learn to codeAlan Richardson

Search engine and web crawlervinay arora

From marketplace to WordPress - WordCamp BelfastFellyph Cintra

Wp 3hr-courseRich Webster

Untangling spring week11Derek Jacoby

Web Scrapping Using PythonComputerScienceJunct

Software design with Domain-driven design Allan Mangune

DotNetNuke Urls - Best practice for administrators, editors and developersbrchapman

Similar a Web Scraping Workshop Using Beautiful Soup (20)

Web Technology Part 1

Best practices with development of enterprise-scale SharePoint solutions - Pa...

Find maximum bugs in limited time

Getting started developing for share point

Automated Acceptance Testing from Scratch

BSides Lisbon 2013 - All your sites belong to Burp

DNN Summit: Robots.txt & Multi-Site DNN Instances

WTF: Where To Focus when you take over a Drupal project

SharePoint 2013 Search Operations

Practical automation for beginners

DCI - Free Seo Tools

Exploring Content API Options - March 23rd 2016

If you want to automate, you learn to code

Search engine and web crawler

From marketplace to WordPress - WordCamp Belfast

Wp 3hr-course

Untangling spring week11

Web Scrapping Using Python

Software design with Domain-driven design

DotNetNuke Urls - Best practice for administrators, editors and developers

Más de GDSC UofT Mississauga

CSSC ML WorkshopGDSC UofT Mississauga

ICCIT Council × GDSC: UX / UI and FigmaGDSC UofT Mississauga

Community Projects Info Session Fall 2023GDSC UofT Mississauga

GDSC x Deerhacks - Origami WorkshopGDSC UofT Mississauga

Reverse Engineering 101GDSC UofT Mississauga

Michael's OWASP Juice Shop WorkshopGDSC UofT Mississauga

MCSS × GDSC: Intro to Cybersecurity WorkshopGDSC UofT Mississauga

Basics of CGDSC UofT Mississauga

Discord Bot Workshop SlidesGDSC UofT Mississauga

Devops WorkshopGDSC UofT Mississauga

ExpressGDSC UofT Mississauga

HTML_CSS_JS WorkshopGDSC UofT Mississauga

DevOps Workshop Part 1GDSC UofT Mississauga

Docker workshop GDSC_CSSCGDSC UofT Mississauga

Back-end (Flask_AWS)GDSC UofT Mississauga

Full Stack React Workshop [CSSC x GDSC]GDSC UofT Mississauga

Git Init (Introduction to Git)GDSC UofT Mississauga

Database Workshop SlidesGDSC UofT Mississauga

ChatGPT General MeetingGDSC UofT Mississauga

Elon & Twitter General MeetingGDSC UofT Mississauga

Más de GDSC UofT Mississauga (20)

CSSC ML Workshop

ICCIT Council × GDSC: UX / UI and Figma

Community Projects Info Session Fall 2023

GDSC x Deerhacks - Origami Workshop

Reverse Engineering 101

Michael's OWASP Juice Shop Workshop

MCSS × GDSC: Intro to Cybersecurity Workshop

Basics of C

Discord Bot Workshop Slides

Devops Workshop

Express

HTML_CSS_JS Workshop

DevOps Workshop Part 1

Docker workshop GDSC_CSSC

Back-end (Flask_AWS)

Full Stack React Workshop [CSSC x GDSC]

Git Init (Introduction to Git)

Database Workshop Slides

ChatGPT General Meeting

Elon & Twitter General Meeting

Último

2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong

Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer

Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer

The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad

Presentation on how to chat with PDF using ChatGPT code interpreternaman860154

Real Time Object Detection Using Open CVKhem

04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG

Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko

A Year of the Servo Reboot: Where Are We Now?Igalia

Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung

Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer

Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun

Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo

What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco

[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745

A Call to Action for Generative AI in 2024Results

Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech

🐬 The future of MySQL is Postgres 🐘RTylerCroy

Boost PC performance: How more available memory can improve productivityPrincipled Technologies

Histor y of HAM Radio presentation slidevu2urc

Web Scraping Workshop Using Beautiful Soup

1. Starting Soon…

2. Web Scraping Workshop Daniyal Bokhari Using Beautiful Soup Package for Python

3. What is Web Scraping? The textbook definition • Web scraping is the act of extracting data from website • Typically performed using automated tools to reduce time spent on collecting data from websites • Once data is extracted from websites, it can be parsed to access speciﬁc pieces of information relevant to your search

4. Some common uses What is it used for? • Industry statistics/insights • By consumers to compare prices across diﬀerent stores • Analyze stock or cryptocurrency prices • Academic research

5. The legality of it When should it not be used? • Web scraping is generally considered to be legal when it’s used to ﬁnd publicly available information • Should not be used to collect personal information, data that the poster did not intend to be publicly available • Should not be used to access data that requires you to create an account or login

6. A Python library for web scraping • Beautiful Soup is a Python library that helps in web scraping by providing useful methods to parse through the HTML code of websites • Creates python objects out of elements of the HTML code allowing you to easily search and traverse through the contents of websites

7. Which is better? • Selenium refers to multiple different open-source projects, the Selenium API can be used to control browsers • Much more complex than Beautiful Soup, can interact with websites by pressing buttons and filling out fields • Used for automation and testing • Can also be used to perform web scraping • Beautiful Soup is useful for simple projects, can extract information faster since it doesn’t load up all of the web page content and JavaScript Beautiful Soup vs Selenium

8. • In order to use Beautiful Soup, you must specify a parser • Parsers take the HTML code we receive from a website and construct an object which we can use in Python to navigate and search through this code • Python has a built-in HTML parser called html.parser • We will be using lxml, an external parser, as it is faster at parsing HTML than html.parser and works for more versions of Python Choice of Parser Which one should you use?

9. Live Demo • You will need to install Beautiful Soup, lxml, and the requests library The best way to learn is to practice Run these in a terminal: pip install beautifulsoup4 pip install lxml pip install requests Alternate commands: pip3 install <module> Python3 -m pip install <module> Python3 -m pip3 install <module> • You can use your favourite IDE/text editor (Pycharm, VS Code, Vim, etc) to follow along

10. Where can I learn more about this? • The documentation is a great place to look at all available functions • Online tutorials, YouTube videos are also a great place to learn more about web scraping Beautiful Soup 4 documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/ Many resources

Web Scraping Workshop Using Beautiful Soup

Recomendados

Recomendados

Más contenido relacionado

Similar a Web Scraping Workshop Using Beautiful Soup

Similar a Web Scraping Workshop Using Beautiful Soup (20)

Más de GDSC UofT Mississauga

Más de GDSC UofT Mississauga (20)

Último

Último (20)

Web Scraping Workshop Using Beautiful Soup