10. Filters
html_doc = """
<html>
<head><title>The Dormouse's story</title></head>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
Once upon a time there were three little sisters;
and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
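# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(html_doc, "html.parser")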
18. The API
find_all(name, attrs, recursive, text, limit, **kwargs)
attrs: a dictionary of HTML attributes to match
>>> soup.find_all("a", attrs={"class": "sister"})
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
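A minimal sketch of the attrs filter next to its keyword equivalent, class_. The short html string and its "cousin" link are stand-ins invented for this example, not part of html_doc.

```python
from bs4 import BeautifulSoup

# Stand-in document (an assumption): two "sister" links and one "cousin" link.
html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/other" class="cousin" id="link3">Other</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# attrs maps an attribute name to the value it must have.
by_attrs = soup.find_all("a", attrs={"class": "sister"})

# class_ is the keyword form; the trailing underscore avoids Python's reserved word "class".
by_keyword = soup.find_all("a", class_="sister")

sister_ids = [a["id"] for a in by_attrs]
print(sister_ids)
```

Both calls select the same tags here. The dictionary form is also handy for attribute names that are not valid Python identifiers, such as data-* attributes.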
20. The API
find_all(name, attrs, recursive, text, limit, **kwargs)
text: search for strings in the document instead of tags
>>> import re
>>> soup.find_all(text=re.compile("Dormouse"))
[u"The Dormouse's story", u"The Dormouse's story"]
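A self-contained sketch of the string search. It assumes Python 3 and a Beautiful Soup release (4.4.0 or later) where this argument is spelled string; older releases call it text, as on the slide. The trimmed html string is a stand-in for html_doc.

```python
import re
from bs4 import BeautifulSoup

# Trimmed stand-in for html_doc (an assumption made to keep the example short).
html = """<html><head><title>The Dormouse's story</title></head>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters.</p></html>"""
soup = BeautifulSoup(html, "html.parser")

# A compiled regex is matched against each string in the document with search().
matches = soup.find_all(string=re.compile("Dormouse"))

# The results are strings, not tags; .parent leads back to the enclosing tag.
parents = [s.parent.name for s in matches]

print(matches)
print(parents)
```

The same text appears twice because it occurs once in the title tag and once in the b tag.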
21. The API
find_all(name, attrs, recursive, text, limit, **kwargs)
limit: an integer that caps the number of results;
the search stops once that many matches are found
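This slide has no example, so here is a minimal sketch of limit. The three-link html string is a cut-down stand-in for html_doc.

```python
from bs4 import BeautifulSoup

# Stand-in document (an assumption): just the three links from html_doc.
html = """
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Three links match, but limit=2 keeps only the first two in document order.
first_two = soup.find_all("a", limit=2)
first_two_ids = [a["id"] for a in first_two]

# find() acts like find_all(..., limit=1) but returns the tag itself, or None.
first = soup.find("a")

print(first_two_ids)
print(first["id"])
```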
22. The API
find_all(name, attrs, recursive, text, limit, **kwargs)
keyword: a keyword argument that find_all does not recognize
is treated as a filter on the tag attribute of that name
>>> soup.find_all(id='link2')
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
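A short sketch of other values a keyword filter can take: a plain string, a compiled regular expression, or True for "the attribute is present". The html string is a stand-in assembled from pieces of html_doc.

```python
import re
from bs4 import BeautifulSoup

# Stand-in document (an assumption): one paragraph without an id, three links with ids.
html = """
<p class="title"><b>The Dormouse's story</b></p>
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Exact match on the id attribute.
by_id = soup.find_all(id="link2")

# A regex is searched for within the attribute value.
by_href = soup.find_all(href=re.compile("elsie"))

# True matches every tag that has the attribute, whatever its value.
with_id = soup.find_all(id=True)

print([t.get_text() for t in by_id])
print([t["id"] for t in by_href])
print(len(with_id))
```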
23. Navigating with BS
The easiest way to move down the tree is dot notation:
use the name of the tag you want as an attribute.
It returns only the first tag with that name.
>>> soup.head
<head><title>The Dormouse's story</title></head>
>>> soup.title
<title>The Dormouse's story</title>
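A small sketch of dot notation in a few more situations: chaining, reading the text of a tag, and asking for a tag that is not there. The html string is a shortened stand-in for html_doc.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for html_doc (an assumption).
html = """<html><head><title>The Dormouse's story</title></head>
<body><p class="title"><b>The Dormouse's story</b></p>
<a id="link1">Elsie</a><a id="link2">Lacie</a></body></html>"""
soup = BeautifulSoup(html, "html.parser")

# Dot notation can be chained to zoom in step by step.
title_text = soup.head.title.string
bold = soup.body.b

# Only the first matching tag is returned, even when several exist.
first_link_id = soup.a["id"]

# A tag name that does not occur in the document yields None.
missing = soup.table

print(title_text)
print(bold)
print(first_link_id)
print(missing)
```

To get every tag with a given name rather than just the first, use find_all.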
24. Navigating with BS
You can list the direct children of a tag
with .contents
>>> head_tag = soup.head
>>> head_tag
<head><title>The Dormouse's story</title></head>
>>> head_tag.contents
[<title>The Dormouse's story</title>]
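A sketch that takes .contents one level further and iterates the same children with .children. The one-line html string is a stand-in for html_doc.

```python
from bs4 import BeautifulSoup

# Minimal stand-in document (an assumption).
html = "<html><head><title>The Dormouse's story</title></head><body></body></html>"
soup = BeautifulSoup(html, "html.parser")

head_tag = soup.head

# .contents is a list of direct children only, not all descendants.
title_tag = head_tag.contents[0]

# The title tag's single child is the string inside it.
title_children = title_tag.contents

# .children iterates over the same direct children without building a list.
child_names = [child.name for child in head_tag.children]

print(head_tag.contents)
print(title_children)
print(child_names)
```

Strings are leaves of the tree: they have no .contents of their own.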