A first intro to parsing webpages OK, but this surely can be done more elegantly? Yes! Scaling up Next steps
Big Data and Automated Content Analysis
Week 6 – Monday
»Web scraping«
Damian Trilling
d.c.trilling@uva.nl
@damian0604
www.damiantrilling.net
Afdeling Communicatiewetenschap
Universiteit van Amsterdam
12 March 2018
Big Data and Automated Content Analysis Damian Trilling
Today
1 A first intro to parsing webpages
2 OK, but this surely can be done more elegantly? Yes!
3 Scaling up
4 Next steps
Let’s have a look at a webpage with comments and try to
understand the underlying structure.
Let’s make a plan!
Which elements from the page do we need?
• What do they mean?
• How are they represented in the source code?
What should our output look like?
• What lists do we want?
• . . .
And how can we achieve this?
A first plan
1 Download the page
• Possibly taking measures to deal with cookie walls, being
blocked, etc.
2 Remove all line breaks (\n, but maybe also \n\r or \r) and
TABs (\t): We want one long string
3 Try to isolate the comments
• Do we see any pattern in the source code? ⇒ last week: if we
can see a pattern, we can describe it with a regular expression
import requests
import re

response = requests.get('http://kudtkoekiewet.nl/setcookie.php?t=http://www.geenstijl.nl/mt/archieven/2014/05/das_toch_niet_normaal.html')

tekst = response.text.replace("\n", " ").replace("\t", " ")

comments = re.findall(r'<div class="cmt-content">(.*?)</div>', tekst)
print("There are", len(comments), "comments")
print("These are the first two:")
print(comments[:2])
Some remarks
The regexp
• .*? instead of .* means lazy matching. As .* matches
everything (greedy matching), it would run past the point where
the regexp should stop – we would get the whole rest
of the document (or the line, but we removed all line breaks).
• The parentheses in (.*?) make sure that the function only
returns what’s between them and not the surrounding stuff
(like <div> and </div>)
Optimization
• Parse usernames, date, time, . . .
• Replace <p> tags
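The difference between greedy and lazy matching is easy to check on a toy string (a minimal sketch; the two-comment snippet below is invented for illustration):

```python
import re

tekst = '<div class="cmt-content">first</div> <div class="cmt-content">second</div>'

# greedy: .* runs on to the LAST </div>, swallowing both comments in one match
greedy = re.findall(r'<div class="cmt-content">(.*)</div>', tekst)

# lazy: .*? stops at the FIRST </div> after each opening tag
lazy = re.findall(r'<div class="cmt-content">(.*?)</div>', tekst)

print(greedy)  # one match containing everything up to the last </div>
print(lazy)    # ['first', 'second']
```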
Further reading
Doing this with other sites?
• It’s basically puzzling with regular expressions.
• Look at the source code of the website to see how
well-structured it is.
OK, but this surely can be done more elegantly? Yes!
Scraping
Geenstijl-example
• Worked well (and we could do it with the knowledge we
already had)
• But we can also use existing parsers (that can interpret the
structure of the html page)
• especially when the structure of the site is more complex
The following example is based on http://www.chicagoreader.com/
chicago/best-of-chicago-2011-food-drink/BestOf?oid=4106228. It
uses the module lxml.
What do we need?
• the URL (of course)
• the XPATH of the element we want to scrape (you’ll see in a
minute what this is)
But you can also create your XPATH yourself
There are multiple different XPATHs to address a specific element.
Some things to play around with:
• // means ‘arbitrary depth’ (= may be nested in many higher
levels)
• * means ‘anything’ (p[2] is the second paragraph, p[*] are
all paragraphs)
• If you want to refer to a specific attribute of a HTML tag, you
can use @. For example, //*[@id="reviews-container"] would
grab a tag like <div id="reviews-container" class="user-content">
• Let the XPATH end with /text() to get all text
• Have a look at the source code (via ‘inspect elements’) of the
web page to think of other possible XPATHs!
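These building blocks can be tried out offline on a hand-made snippet with lxml (the HTML below is invented for illustration; real pages need their own XPATHs):

```python
from lxml import html

# hand-made stand-in for a page with reviews
snippet = """
<html><body>
  <div id="reviews-container" class="user-content">
    <p>Great phone.</p>
    <p>Battery lasts two days.</p>
  </div>
</body></html>
"""
tree = html.fromstring(snippet)

# // = arbitrary depth, * = any tag, @ selects an attribute, /text() gets the text
texts = tree.xpath('//*[@id="reviews-container"]/p/text()')
print(texts)  # ['Great phone.', 'Battery lasts two days.']

# p[2] addresses the second paragraph only
second = tree.xpath('//div[@id="reviews-container"]/p[2]/text()')
print(second)  # ['Battery lasts two days.']
```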
Let’s test it!
https://www.kieskeurig.nl/smartphone/product/
3518003-samsung-galaxy-a5-2017-zwart/reviews
Let’s scrape them!
from lxml import html
import requests

htmlsource = requests.get("https://www.kieskeurig.nl/smartphone/product/3518003-samsung-galaxy-a5-2017-zwart/reviews").text
tree = html.fromstring(htmlsource)

reviews = tree.xpath('//*[@class="reviews-single__text"]/text()')

# remove empty reviews
reviews = [r.strip() for r in reviews if r.strip() != ""]

print(len(reviews), "reviews scraped. Showing the first 60 characters of each:")
i = 0
for review in reviews:
    print("Review", i, ":", review[:60])
    i += 1
The output – perfect!
61 reviews scraped. Showing the first 60 characters of each:
Review 0 : Wat een fenomenale smartphone is de nieuwe Samsung Galaxy A5
Review 1 : Bij het openen van de verpakking valt direct op dat dit toes
Review 2 : Tijd om deze badboy eens aan de tand te voelen dus direct he
Review 3 : Na een volle week dit toestel gebruikt te hebben kan ik het
Review 4 : ik ben verliefd op de Samsung Galaxy A5 2017!
Scaling up
But this was on one page only, right?
Next step: Repeat for each relevant page.
Possibility 1: Based on url schemes
If the url of one review page is
https://www.hostelworld.com/hosteldetails.php/
ClinkNOORD/Amsterdam/93919/reviews?page=2
. . . then the next one is probably?
⇒ you can construct a list of all possible URLs:
baseurl = 'https://www.hostelworld.com/hosteldetails.php/ClinkNOORD/Amsterdam/93919/reviews?page='

allurls = [baseurl + str(i + 1) for i in range(20)]
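Putting the pieces together, one can loop over the constructed URLs and extract the matching texts from each page (a sketch only: the XPATH class name in the commented-out call is a placeholder you would need to look up on the real site via ‘inspect element’):

```python
from lxml import html
import requests

baseurl = ('https://www.hostelworld.com/hosteldetails.php/'
           'ClinkNOORD/Amsterdam/93919/reviews?page=')
allurls = [baseurl + str(i + 1) for i in range(20)]
print(allurls[0])   # ends in reviews?page=1
print(allurls[-1])  # ends in reviews?page=20

def scrape_all(urls, xpath):
    """Download every page and collect the non-empty texts that XPATH matches."""
    results = []
    for url in urls:
        tree = html.fromstring(requests.get(url).text)
        results.extend(t.strip() for t in tree.xpath(xpath) if t.strip() != "")
    return results

# the class name below is a placeholder, not the site's real markup:
# reviews = scrape_all(allurls, '//*[@class="reviewtext"]/text()')
```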
But this was on one page only, right?
Next step: Repeat for each relevant page.
Possibility 2: Based on XPATHs
Use XPATH to get the url of the next page (i.e., to get the link
that you would click to get the next review)
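A sketch of that idea on an invented snippet (the rel="next" attribute is an assumption; check the real site's pagination markup via ‘inspect element’):

```python
from urllib.parse import urljoin
from lxml import html

# stripped-down stand-in for a review page with a "next page" link
snippet = '<html><body><a rel="next" href="/reviews?page=3">Next</a></body></html>'
tree = html.fromstring(snippet)

nextlinks = tree.xpath('//a[@rel="next"]/@href')
if nextlinks:
    # resolve the (possibly relative) link against the current page's URL
    nexturl = urljoin('https://www.example.com/reviews?page=2', nextlinks[0])
    print(nexturl)  # https://www.example.com/reviews?page=3
```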
Recap
General idea
1 Identify each element by its XPATH (look it up in your
browser)
2 Read the webpage into a (loooooong) string
3 Use the XPATH to extract the relevant text into a list (with a
module like lxml)
4 Do something with the list (preprocess, analyze, save)
5 Repeat
Alternatives: scrapy, beautifulsoup, regular expressions, . . .
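The general idea above can be sketched as two small functions (a sketch only; the XPATH and file name in the commented-out ‘repeat’ step are placeholders):

```python
import csv
import requests
from lxml import html

def scrape(url, xpath):
    """Steps 2-3: read the page into one long string, extract matches via XPATH."""
    tree = html.fromstring(requests.get(url).text)
    return [t.strip() for t in tree.xpath(xpath) if t.strip() != ""]

def save(texts, filename):
    """Step 4: do something with the list -- here, save it to a CSV file."""
    with open(filename, mode='w', encoding='utf-8', newline='') as f:
        writer = csv.writer(f)
        for text in texts:
            writer.writerow([text])

# Step 5: repeat -- e.g. for every URL in a list of pages (placeholders):
# for url in allurls:
#     save(scrape(url, '//*[@class="reviews-single__text"]/text()'), 'reviews.csv')
```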
Last remarks
There is often more than one way to specify an XPATH
1 Sometimes, you might want to use a different suggestion to
be able to generalize better (e.g., using the attributes rather
than the tags)
2 In that case, it makes sense to look deeper into the structure
of the HTML code, for example with “Inspect Element”, and to
use that information to play around with the XPATH in the
XPath Checker
Next steps
From now on. . .
. . . focus on individual projects!
Big Data and Automated Content Analysis Damian Trilling
Wednesday
• Write a scraper for a website of your choice!
• Prepare a bit, so that you know where to start
• Suggestions: Review texts and grades from http://iens.nl,
articles from a news site of your choice, . . .
Big Data and Automated Content Analysis Damian Trilling
BDACA - Lecture6

Más contenido relacionado

La actualidad más candente

Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014
The Hive
 
Moving Your Machine Learning Models to Production with TensorFlow Extended
Moving Your Machine Learning Models to Production with TensorFlow ExtendedMoving Your Machine Learning Models to Production with TensorFlow Extended
Moving Your Machine Learning Models to Production with TensorFlow Extended
Jonathan Mugan
 

La actualidad más candente (20)

BDACA1516s2 - Lecture7
BDACA1516s2 - Lecture7BDACA1516s2 - Lecture7
BDACA1516s2 - Lecture7
 
BDACA1617s2 - Lecture7
BDACA1617s2 - Lecture7BDACA1617s2 - Lecture7
BDACA1617s2 - Lecture7
 
BDACA1516s2 - Lecture5
BDACA1516s2 - Lecture5BDACA1516s2 - Lecture5
BDACA1516s2 - Lecture5
 
BDACA1516s2 - Lecture2
BDACA1516s2 - Lecture2BDACA1516s2 - Lecture2
BDACA1516s2 - Lecture2
 
BDACA1617s2 - Lecture6
BDACA1617s2 - Lecture6BDACA1617s2 - Lecture6
BDACA1617s2 - Lecture6
 
BDACA1617s2 - Lecture3
BDACA1617s2 - Lecture3BDACA1617s2 - Lecture3
BDACA1617s2 - Lecture3
 
BDACA1617s2 - Lecture5
BDACA1617s2 - Lecture5BDACA1617s2 - Lecture5
BDACA1617s2 - Lecture5
 
Analyzing social media with Python and other tools (1/4)
Analyzing social media with Python and other tools (1/4)Analyzing social media with Python and other tools (1/4)
Analyzing social media with Python and other tools (1/4)
 
BDACA1617s2 - Tutorial 1
BDACA1617s2 - Tutorial 1BDACA1617s2 - Tutorial 1
BDACA1617s2 - Tutorial 1
 
BDACA1617s2 - Lecture 2
BDACA1617s2 - Lecture 2BDACA1617s2 - Lecture 2
BDACA1617s2 - Lecture 2
 
A New Year in Data Science: ML Unpaused
A New Year in Data Science: ML UnpausedA New Year in Data Science: ML Unpaused
A New Year in Data Science: ML Unpaused
 
From SQL to Python - A Beginner's Guide to Making the Switch
From SQL to Python - A Beginner's Guide to Making the SwitchFrom SQL to Python - A Beginner's Guide to Making the Switch
From SQL to Python - A Beginner's Guide to Making the Switch
 
MIT Sloan: Intro to Machine Learning
MIT Sloan: Intro to Machine LearningMIT Sloan: Intro to Machine Learning
MIT Sloan: Intro to Machine Learning
 
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014
 
Three Laws of Trusted Data Sharing: (Building a Better Business Case for Dat...
Three Laws of Trusted Data Sharing:(Building a Better Business Case for Dat...Three Laws of Trusted Data Sharing:(Building a Better Business Case for Dat...
Three Laws of Trusted Data Sharing: (Building a Better Business Case for Dat...
 
Data Science Reinvents Learning?
Data Science Reinvents Learning?Data Science Reinvents Learning?
Data Science Reinvents Learning?
 
Machine Learning using Big data
Machine Learning using Big data Machine Learning using Big data
Machine Learning using Big data
 
Big Data: the weakest link
Big Data: the weakest linkBig Data: the weakest link
Big Data: the weakest link
 
Moving Your Machine Learning Models to Production with TensorFlow Extended
Moving Your Machine Learning Models to Production with TensorFlow ExtendedMoving Your Machine Learning Models to Production with TensorFlow Extended
Moving Your Machine Learning Models to Production with TensorFlow Extended
 
Using the search engine as recommendation engine
Using the search engine as recommendation engineUsing the search engine as recommendation engine
Using the search engine as recommendation engine
 

Similar a BDACA - Lecture6

AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesAI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
VictorSzoltysek
 

Similar a BDACA - Lecture6 (20)

BD-ACA Week6
BD-ACA Week6BD-ACA Week6
BD-ACA Week6
 
SEO for Large Websites
SEO for Large WebsitesSEO for Large Websites
SEO for Large Websites
 
SEO for Large/Enterprise Websites - Data & Tech Side
SEO for Large/Enterprise Websites - Data & Tech SideSEO for Large/Enterprise Websites - Data & Tech Side
SEO for Large/Enterprise Websites - Data & Tech Side
 
Python Homework Help
Python Homework HelpPython Homework Help
Python Homework Help
 
Gaps in the algorithm
Gaps in the algorithmGaps in the algorithm
Gaps in the algorithm
 
Essential Technical SEO learnings from 120+ site migrations
Essential Technical SEO learnings from 120+ site migrationsEssential Technical SEO learnings from 120+ site migrations
Essential Technical SEO learnings from 120+ site migrations
 
Content Management for Publishers
Content Management for PublishersContent Management for Publishers
Content Management for Publishers
 
Lecture 9 Professional Practices
Lecture 9 Professional PracticesLecture 9 Professional Practices
Lecture 9 Professional Practices
 
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesAI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
 
MongoDB World 2019: Don't Panic - The Hitchhiker's Guide to the MongoDB Galaxy
MongoDB World 2019: Don't Panic - The Hitchhiker's Guide to the MongoDB GalaxyMongoDB World 2019: Don't Panic - The Hitchhiker's Guide to the MongoDB Galaxy
MongoDB World 2019: Don't Panic - The Hitchhiker's Guide to the MongoDB Galaxy
 
Migration Best Practices - Search Y 2019, Paris
Migration Best Practices - Search Y 2019, ParisMigration Best Practices - Search Y 2019, Paris
Migration Best Practices - Search Y 2019, Paris
 
SEARCH Y - Bastian Grimm - Migrations Best Practices
SEARCH Y - Bastian Grimm -  Migrations Best PracticesSEARCH Y - Bastian Grimm -  Migrations Best Practices
SEARCH Y - Bastian Grimm - Migrations Best Practices
 
Crawling & Indexing for JavaScript Heavy Sites brightonSEO 2021
Crawling & Indexing for JavaScript Heavy Sites brightonSEO 2021Crawling & Indexing for JavaScript Heavy Sites brightonSEO 2021
Crawling & Indexing for JavaScript Heavy Sites brightonSEO 2021
 
Accessible Websites With Lotus Notes/Domino, presented at the BLUG day event,...
Accessible Websites With Lotus Notes/Domino, presented at the BLUG day event,...Accessible Websites With Lotus Notes/Domino, presented at the BLUG day event,...
Accessible Websites With Lotus Notes/Domino, presented at the BLUG day event,...
 
What book and journal publishers need to know to get accessibility right
What book and journal publishers need to know to get accessibility rightWhat book and journal publishers need to know to get accessibility right
What book and journal publishers need to know to get accessibility right
 
BrightonSEO - Indexation, Cannibalization, Experimentation, Oh My!
BrightonSEO - Indexation, Cannibalization, Experimentation, Oh My!BrightonSEO - Indexation, Cannibalization, Experimentation, Oh My!
BrightonSEO - Indexation, Cannibalization, Experimentation, Oh My!
 
Provenance in Production-Grade Machine Learning
Provenance in Production-Grade Machine LearningProvenance in Production-Grade Machine Learning
Provenance in Production-Grade Machine Learning
 
Developer Grade SEO
Developer Grade SEODeveloper Grade SEO
Developer Grade SEO
 
JavaScript 2.0 in Dreamweaver CS4
JavaScript 2.0 in Dreamweaver CS4JavaScript 2.0 in Dreamweaver CS4
JavaScript 2.0 in Dreamweaver CS4
 
SEO and Accessibility
SEO and AccessibilitySEO and Accessibility
SEO and Accessibility
 

Más de Department of Communication Science, University of Amsterdam

Más de Department of Communication Science, University of Amsterdam (13)

BDACA - Tutorial5
BDACA - Tutorial5BDACA - Tutorial5
BDACA - Tutorial5
 
BDACA - Lecture5
BDACA - Lecture5BDACA - Lecture5
BDACA - Lecture5
 
BDACA - Lecture3
BDACA - Lecture3BDACA - Lecture3
BDACA - Lecture3
 
BDACA - Lecture2
BDACA - Lecture2BDACA - Lecture2
BDACA - Lecture2
 
BDACA - Tutorial1
BDACA - Tutorial1BDACA - Tutorial1
BDACA - Tutorial1
 
BDACA - Lecture1
BDACA - Lecture1BDACA - Lecture1
BDACA - Lecture1
 
BDACA1617s2 - Lecture 1
BDACA1617s2 - Lecture 1BDACA1617s2 - Lecture 1
BDACA1617s2 - Lecture 1
 
Media diets in an age of apps and social media: Dealing with a third layer of...
Media diets in an age of apps and social media: Dealing with a third layer of...Media diets in an age of apps and social media: Dealing with a third layer of...
Media diets in an age of apps and social media: Dealing with a third layer of...
 
Conceptualizing and measuring news exposure as network of users and news items
Conceptualizing and measuring news exposure as network of users and news itemsConceptualizing and measuring news exposure as network of users and news items
Conceptualizing and measuring news exposure as network of users and news items
 
Data Science: Case "Political Communication 2/2"
Data Science: Case "Political Communication 2/2"Data Science: Case "Political Communication 2/2"
Data Science: Case "Political Communication 2/2"
 
Data Science: Case "Political Communication 1/2"
Data Science: Case "Political Communication 1/2"Data Science: Case "Political Communication 1/2"
Data Science: Case "Political Communication 1/2"
 
BDACA1516s2 - Lecture4
 BDACA1516s2 - Lecture4 BDACA1516s2 - Lecture4
BDACA1516s2 - Lecture4
 
Should we worry about filter bubbles?
Should we worry about filter bubbles?Should we worry about filter bubbles?
Should we worry about filter bubbles?
 

Último

Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 

Último (20)

Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024
 
How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17
 
Third Battle of Panipat detailed notes.pptx
Third Battle of Panipat detailed notes.pptxThird Battle of Panipat detailed notes.pptx
Third Battle of Panipat detailed notes.pptx
 
Sociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning ExhibitSociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning Exhibit
 
SOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning PresentationSOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning Presentation
 
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdf
 
Unit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxUnit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptx
 
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptx
 
How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
 
Food safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdfFood safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdf
 
ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
 
Unit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxUnit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptx
 
ComPTIA Overview | Comptia Security+ Book SY0-701
ComPTIA Overview | Comptia Security+ Book SY0-701ComPTIA Overview | Comptia Security+ Book SY0-701
ComPTIA Overview | Comptia Security+ Book SY0-701
 
psychiatric nursing HISTORY COLLECTION .docx
psychiatric  nursing HISTORY  COLLECTION  .docxpsychiatric  nursing HISTORY  COLLECTION  .docx
psychiatric nursing HISTORY COLLECTION .docx
 
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17  How to Extend Models Using Mixin ClassesMixin Classes in Odoo 17  How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
 

BDACA - Lecture6

  • 1. A first intro to parsing webpages OK, but this surely can be doe more elegantly? Yes! Scaling up Next steps Big Data and Automated Content Analysis Week 6 – Monday »Web scraping« Damian Trilling d.c.trilling@uva.nl @damian0604 www.damiantrilling.net Afdeling Communicatiewetenschap Universiteit van Amsterdam 12 March 2018 Big Data and Automated Content Analysis Damian Trilling
  • 2. A first intro to parsing webpages OK, but this surely can be doe more elegantly? Yes! Scaling up Next steps Today 1 A first intro to parsing webpages 2 OK, but this surely can be doe more elegantly? Yes! 3 Scaling up 4 Next steps Big Data and Automated Content Analysis Damian Trilling
  • 3. A first intro to parsing webpages OK, but this surely can be doe more elegantly? Yes! Scaling up Next steps Let’s have a look of a webpage with comments and try to understand the underlying structure. Big Data and Automated Content Analysis Damian Trilling
  • 4.
  • 5.
  • 6.
  • 7. A first intro to parsing webpages OK, but this surely can be doe more elegantly? Yes! Scaling up Next steps Let’s make a plan! Which elements from the page do we need? • What do they mean? • How are they represented in the source code? How should our output look like? • What lists do we want? • . . . And how can we achieve this? Big Data and Automated Content Analysis Damian Trilling
  • 8. A first intro to parsing webpages OK, but this surely can be doe more elegantly? Yes! Scaling up Next steps A first plan Big Data and Automated Content Analysis Damian Trilling
  • 9. A first intro to parsing webpages OK, but this surely can be doe more elegantly? Yes! Scaling up Next steps A first plan 1 Download the page Big Data and Automated Content Analysis Damian Trilling
  • 10. A first intro to parsing webpages OK, but this surely can be doe more elegantly? Yes! Scaling up Next steps A first plan 1 Download the page • Possibly taking measures to deal with cookie walls, being blocked, etc. Big Data and Automated Content Analysis Damian Trilling
  • 11. A first intro to parsing webpages OK, but this surely can be doe more elegantly? Yes! Scaling up Next steps A first plan 1 Download the page • Possibly taking measures to deal with cookie walls, being blocked, etc. 2 Remove all line breaks (n, but maybe also nr or r) and TABs (t): We want one long string Big Data and Automated Content Analysis Damian Trilling
  • 12. A first plan: 1 Download the page • Possibly taking measures to deal with cookie walls, being blocked, etc. 2 Remove all line breaks (\n, but maybe also \n\r or \r) and TABs (\t): we want one long string 3 Try to isolate the comments • Do we see any pattern in the source code? ⇒ last week: if we can see a pattern, we can describe it with a regular expression
  • 13.
import requests
import re

response = requests.get('http://kudtkoekiewet.nl/setcookie.php?t=http://www.geenstijl.nl/mt/archieven/2014/05/das_toch_niet_normaal.html')

tekst = response.text.replace("\n", " ").replace("\t", " ")

comments = re.findall(r'<div class="cmt-content">(.*?)</div>', tekst)
print("There are", len(comments), "comments")
print("These are the first two:")
print(comments[:2])
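The same extraction step can be tried without network access; a minimal sketch on a made-up snippet that mimics the cmt-content structure of the page:

```python
import re

# inline snippet imitating the comment section (made up for illustration)
tekst = ('<div class="cmt-content">Eerste reactie</div> '
         '<div class="cmt-content">Tweede reactie</div>')

comments = re.findall(r'<div class="cmt-content">(.*?)</div>', tekst)
print("There are", len(comments), "comments")  # There are 2 comments
print(comments)
```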
  • 16. Some remarks. The regexp: • .*? instead of .* means lazy matching. As .* matches everything, the part where the regexp should stop would not be analyzed (greedy matching) – we would get the whole rest of the document (or the line, but we removed all line breaks). • The parentheses in (.*?) make sure that the function only returns what’s between them and not the surrounding stuff (like <div> and </div>). Optimization: • Parse usernames, date, time, . . . • Replace <p> tags
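The difference between greedy and lazy matching can be sketched on a made-up snippet:

```python
import re

text = '<div class="cmt-content">first</div> filler <div class="cmt-content">second</div>'

# greedy: .* runs to the end of the string and backtracks to the LAST </div>
greedy = re.findall(r'<div class="cmt-content">(.*)</div>', text)
# lazy: .*? stops at the FIRST </div>, so each comment is matched separately
lazy = re.findall(r'<div class="cmt-content">(.*?)</div>', text)

print(greedy)  # ['first</div> filler <div class="cmt-content">second']
print(lazy)    # ['first', 'second']
```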
  • 17. Further reading. Doing this with other sites? • It’s basically puzzling with regular expressions. • Look at the source code of the website to see how well-structured it is.
  • 18. OK, but this surely can be done more elegantly? Yes!
  • 22. Scraping Geenstijl-example: • Worked well (and we could do it with the knowledge we already had) • But we can also use existing parsers (that can interpret the structure of the html page) • especially when the structure of the site is more complex. The following example is based on http://www.chicagoreader.com/chicago/best-of-chicago-2011-food-drink/BestOf?oid=4106228. It uses the module lxml
  • 23. What do we need? • the URL (of course) • the XPATH of the element we want to scrape (you’ll see in a minute what this is)
  • 32. But you can also create your XPATH yourself. There are multiple different XPATHs to address a specific element. Some things to play around with: • // means ‘arbitrary depth’ (= may be nested in many higher levels) • * means ‘anything’ (p[2] is the second paragraph, p[*] are all paragraphs) • If you want to refer to a specific attribute of a HTML tag, you can use @. For example, //*[@id="reviews-container"] would grab a tag like <div id="reviews-container" class="user-content"> • Let the XPATH end with /text() to get all text • Have a look at the source code (via ‘inspect element’) of the web page to think of other possible XPATHs!
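These building blocks can be tried on a small inline snippet with lxml (a minimal sketch; the id and the paragraph texts are made up):

```python
from lxml import html

snippet = '''
<html><body>
  <div id="reviews-container" class="user-content">
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </div>
</body></html>
'''
tree = html.fromstring(snippet)

# @ addresses an attribute, // means arbitrary depth, * means any tag,
# and /text() at the end returns the text instead of the elements
print(tree.xpath('//*[@id="reviews-container"]/p[2]/text()'))  # ['Second paragraph.']
print(tree.xpath('//*[@id="reviews-container"]/p/text()'))
```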
  • 33. Let’s test it! https://www.kieskeurig.nl/smartphone/product/3518003-samsung-galaxy-a5-2017-zwart/reviews
  • 34. Let’s scrape them!
from lxml import html
import requests

htmlsource = requests.get("https://www.kieskeurig.nl/smartphone/product/3518003-samsung-galaxy-a5-2017-zwart/reviews").text
tree = html.fromstring(htmlsource)

reviews = tree.xpath('//*[@class="reviews-single__text"]/text()')

# remove empty reviews
reviews = [r.strip() for r in reviews if r.strip() != ""]

print(len(reviews), "reviews scraped. Showing the first 60 characters of each:")
for i, review in enumerate(reviews):
    print("Review", i, ":", review[:60])
  • 35. The output – perfect!
61 reviews scraped. Showing the first 60 characters of each:
Review 0 : Wat een fenomenale smartphone is de nieuwe Samsung Galaxy A5
Review 1 : Bij het openen van de verpakking valt direct op dat dit toes
Review 2 : Tijd om deze badboy eens aan de tand te voelen dus direct he
Review 3 : Na een volle week dit toestel gebruikt te hebben kan ik het
Review 4 : ik ben verliefd op de Samsung Galaxy A5 2017!
  • 38. But this was on one page only, right? Next step: repeat for each relevant page. Possibility 1: Based on url schemes. If the url of one review page is https://www.hostelworld.com/hosteldetails.php/ClinkNOORD/Amsterdam/93919/reviews?page=2 . . . then the next one is probably? ⇒ you can construct a list of all possible URLs:
baseurl = 'https://www.hostelworld.com/hosteldetails.php/ClinkNOORD/Amsterdam/93919/reviews?page='

allurls = [baseurl + str(i+1) for i in range(20)]
  • 39. But this was on one page only, right? Next step: repeat for each relevant page. Possibility 2: Based on XPATHs. Use XPATH to get the url of the next page (i.e., to get the link that you would click to get the next review)
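Extracting the next-page link can be sketched like this (the class name next-page and the snippet are made up; check the real site’s source for the actual attributes):

```python
from lxml import html

snippet = '<div class="pagination"><a class="next-page" href="/reviews?page=3">Next</a></div>'
tree = html.fromstring(snippet)

# ending the XPATH with /@href returns the attribute value instead of the element
nexturl = tree.xpath('//a[@class="next-page"]/@href')
print(nexturl)  # ['/reviews?page=3']
```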
  • 40. Recap. General idea: 1 Identify each element by its XPATH (look it up in your browser) 2 Read the webpage into a (loooooong) string 3 Use the XPATH to extract the relevant text into a list (with a module like lxml) 4 Do something with the list (preprocess, analyze, save) 5 Repeat. Alternatives: scrapy, beautifulsoup, regular expressions, . . .
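The recipe above can be wrapped into a small helper function (a sketch, demonstrated here on an inline snippet instead of a live page; the class name review is made up):

```python
from lxml import html

def extract(htmlsource, xpath):
    """Parse an HTML string and return the non-empty, stripped results of an XPATH query."""
    tree = html.fromstring(htmlsource)
    return [r.strip() for r in tree.xpath(xpath) if r.strip() != ""]

snippet = '<div class="review"> Great phone! </div><div class="review">  </div>'
print(extract(snippet, '//div[@class="review"]/text()'))  # ['Great phone!']
```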
  • 41. Last remarks. There is often more than one way to specify an XPATH: 1 Sometimes, you might want to use a different suggestion to be able to generalize better (e.g., using the attributes rather than the tags) 2 In that case, it makes sense to look deeper into the structure of the HTML code, for example with “Inspect Element”, and use that information to play around with in the XPATH Checker
  • 43. Next steps
  • 44. From now on . . . focus on individual projects!
  • 45. Wednesday: • Write a scraper for a website of your choice! • Prepare a bit, so that you know where to start • Suggestions: Review texts and grades from http://iens.nl, articles from a news site of your choice, . . .