Contents
1. Introduction to Beautiful Soup
2. Parsers
3. Installing Beautiful Soup
4. Basic Usage
5. Node Selectors
6. Extracting Information
7. Associated Selection
8. Method Selectors
9. CSS Selectors
1. Introduction to Beautiful Soup
When we extracted web page information with regular expressions earlier, one mistake in the expression meant we could not get the result we wanted. Since a web page has an inherent structure and hierarchy, a powerful parsing tool such as Beautiful Soup can exploit the page's structure and attributes to parse it; compared with regular expressions, it lets us extract page content with far simpler statements.
In short, Beautiful Soup is a Python parsing library for HTML and XML that makes it easy to extract data from web pages.
2. Parsers
Comparing the available parsers shows that the lxml parser handles both HTML and XML, is fast, and is fault-tolerant, so it is the recommended choice. To use it, simply pass lxml as the second argument when initializing Beautiful Soup:

from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>hello</p>', 'lxml')
print(soup.p.string)

Running result:

hello
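The parser is just the second argument, so other parsers are selected the same way. As a minimal sketch, Python's built-in html.parser runs the same snippet without installing lxml:

```python
from bs4 import BeautifulSoup

# Same snippet as above, but with Python's built-in parser
# instead of lxml, so no extra dependency is required.
soup = BeautifulSoup('<p>hello</p>', 'html.parser')
print(soup.p.string)
```

html.parser is slower and less lenient than lxml, but it is always available.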
3. Installing Beautiful Soup
Before using them, make sure both the Beautiful Soup and lxml libraries are installed. They can be installed directly with pip from the command line:
pip install beautifulsoup4
pip install lxml
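A quick way to confirm the installation succeeded is to import the package and print its version (a minimal check; note the import name is bs4, not beautifulsoup4):

```python
# Confirm that beautifulsoup4 installed correctly:
# the package imports as "bs4" and exposes its version string.
import bs4

print(bs4.__version__)
```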
4. Basic Usage

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())  # corrects and pretty-prints the HTML, tolerating errors
print(soup.title.string)  # prints the text content of the title node

Running result (the prettified HTML, abridged, followed by the title text):

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The Dormouse's story
   </b>
  </p>
  ...
 </body>
</html>
The Dormouse's story

First we declare the variable html as a string. Note that it is not a complete HTML document: the closing body and html tags are missing. We pass it to BeautifulSoup as the first argument, with the parser type (here lxml) as the second; this completes the initialization of the BeautifulSoup object, which we assign to the variable soup. After that, we can call soup's methods and attributes to parse this HTML:
① Calling the prettify method automatically corrects and formats the non-standard HTML string.
② Calling soup.title.string outputs the text content of the title node in the HTML.
5. Node Selectors

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "lxml")
print(soup.title)         # the selection result for the title node
print(type(soup.title))   # the type of the title node
print(soup.title.string)  # the text inside the title node
print(soup.head)          # the head node
print(soup.p)             # the first p node

Running result:

<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>
The Dormouse's story
<head><title>The Dormouse's story</title></head>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

[Note] bs4.element.Tag is an important data structure in Beautiful Soup: every result returned by these selectors is of this Tag type.
6. Extracting Information
# The following examples all use this HTML text:
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "lxml")
- Getting the name
The name attribute gives a node's name: select the node, then read its name attribute:
print(soup.title.name)
Running result:
title
- Getting attributes
A node may have several attributes, such as class and id. After selecting a node, call attrs to get all of them:
print(soup.p.attrs)
Running result:
{'class': ['title'], 'name': 'dromouse'}
attrs returns a dictionary of attribute names and values. To get a single attribute's value, index into it:
print(soup.p.attrs['name'])
Running result:
dromouse
There is also a more concise way of getting an attribute value:

print(soup.p['class'])
print(soup.p['name'])

Running result:

['title']
dromouse
Note that the class attribute comes back as a list, while the name attribute comes back as a string. The value of name is unique, so a single string is returned, whereas a node element may carry several classes, so a list is returned. Keep this distinction in mind when processing results.
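The list-versus-string distinction, and the case of a missing attribute, can be seen in a short sketch (the sample tag here is made up for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical tag carrying a multi-valued class and a single-valued name
soup = BeautifulSoup('<p class="title story" name="dromouse">text</p>', 'html.parser')
p = soup.p
print(p['class'])   # class is multi-valued -> a list
print(p['name'])    # name is single-valued -> a string
print(p.get('id'))  # get() returns None for a missing attribute instead of raising
```

Using Tag.get() instead of square brackets avoids a KeyError when an attribute may be absent.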
- Getting content
As used earlier, the string attribute returns the text contained in a node element:
print(soup.p.string)
Running result:
The Dormouse's story
- Nested selection
Since every selection result is of type bs4.element.Tag, a Tag object can itself keep selecting deeper nodes:

html = """<html><head><title>The Dormouse's story</title></head><body></body></html>"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.head.title)
print(type(soup.head.title))
print(soup.head.title.string)

Running result:

<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>
The Dormouse's story
7. Associated Selection
- Child and descendant nodes
① The contents attribute

html = """
<html>
<head><title>The Dormouse's story</title></head>
<body>
<p class="story">
    Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">
        <span>Elsie</span>
    </a>
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
    and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
    and they lived at the bottom of a well.
</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.contents)

Running result:

['\n    Once upon a time there were three little sisters; and their names were\n    ', <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>, '\n', <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, '\n    and\n    ', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, '\n    and they lived at the bottom of a well.\n']

As the result shows, a list is returned: the p node contains both text and nodes, and all of this content comes back together in one list.
Note, however, that every element of the list is a direct child of the p node. The span node inside the first a node is a descendant, but the result does not pick it out separately. So contents yields a list made up of the direct children only.
② The children attribute
Calling children on the same HTML gives the corresponding results:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.children)
for i, child in enumerate(soup.p.children):
    print(i, child)

Running result (whitespace-only text children print as blank entries):

<list_iterator object at 0x...>
0 Once upon a time there were three little sisters; and their names were
1 <a class="sister" href="http://example.com/elsie" id="link1"><span>Elsie</span></a>
2 
3 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
4 and
5 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
6 and they lived at the bottom of a well.
③ The descendants attribute

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.descendants)
for i, child in enumerate(soup.p.descendants):
    print(i, child)

Running result (whitespace-only text nodes again print as blank entries):

<generator object Tag.descendants at 0x...>
0 Once upon a time there were three little sisters; and their names were
1 <a class="sister" href="http://example.com/elsie" id="link1"><span>Elsie</span></a>
2 
3 <span>Elsie</span>
4 Elsie
5 
6 
7 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
8 Lacie
9 and
10 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
11 Tillie
12 and they lived at the bottom of a well.

Like children, descendants returns a generator. Iterating over it shows that the output now includes the span node: descendants recursively walks into every child and yields all descendant nodes, not just the direct children.
- Parent and ancestor nodes
The parent attribute returns a node's direct parent, and parents returns all of its ancestors:

html = """
<html>
<body>
<p class="story">
    Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">
        <span>Elsie</span>
    </a>
</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(list(enumerate(soup.a.parents)))

Running result (abridged): a list of (index, ancestor) pairs climbing outwards from the a node's direct parent:

[(0, <p class="story">...the a node...</p>),
 (1, <body>...both p nodes...</body>),
 (2, <html><body>...</body></html>),
 (3, the whole document)]
- Sibling nodes
The next_sibling and previous_sibling attributes:

html = """
<html>
<body>
<p class="story">
    Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
    Hello
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
    and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
    and they lived at the bottom of a well.
</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print('next_sibling:', soup.a.next_sibling)
print('previous_sibling:', soup.a.previous_sibling)
print('next_siblings:', list(enumerate(soup.a.next_siblings)))
print('previous_siblings:', list(enumerate(soup.a.previous_siblings)))

Running result:

next_sibling: 
    Hello
    
previous_sibling: 
    Once upon a time there were three little sisters; and their names were
    
next_siblings: [(0, '\n    Hello\n    '), (1, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>), (2, '\n    and\n    '), (3, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>), (4, '\n    and they lived at the bottom of a well.\n')]
previous_siblings: [(0, '\n    Once upon a time there were three little sisters; and their names were\n    ')]

Four attributes are called here: next_sibling and previous_sibling return the next and the previous sibling node respectively, while next_siblings and previous_siblings return all following and all preceding siblings respectively.
- Extracting information
With the associated selection above, we can pull out exactly the information we want:

html = """
<html>
<body>
<p class="story">
    Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Bob</a><a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print('Next Sibling:')
print(soup.a.next_sibling)
print(soup.a.next_sibling.string)
print('-----------------------------')
print('parent:')
print(list(soup.a.parents)[0])
print(list(soup.a.parents)[0].attrs['class'])

Running result:

Next Sibling:
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
Lacie
-----------------------------
parent:
<p class="story">
    Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1">Bob</a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
</p>
['story']
8. Method Selectors
- find_all
As the name suggests, find_all queries all elements matching the given conditions. Its signature is:
find_all(name, attrs, recursive, text, **kwargs)
- name
We can query elements by node name via the name parameter:

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1" name="elements">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(name='ul'))
print(type(soup.find_all(name='ul')[0]))

Running result:

[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>, <ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>]
<class 'bs4.element.Tag'>

The result is a list of length 2, and each element in it is of type bs4.element.Tag. We can then iterate over each ul node, fetch its li nodes, and get their text content:

for ul in soup.find_all(name='ul'):
    print(ul.find_all(name='li'))
    for li in ul.find_all(name='li'):
        print(li.string)

Running result:

[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
Foo
Bar
Jay
[<li class="element">Foo</li>, <li class="element">Bar</li>]
Foo
Bar
- attrs
Besides querying by node name, we can also pass in attributes to query by:

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1" name="elements">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(attrs={'id': 'list-1'}))
print(soup.find_all(attrs={'name': 'elements'}))

Running result:

[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]

The attrs parameter passed in here is a dictionary of attribute names and values.
For some common attributes, such as id and class, we do not need attrs at all; they can be passed as keyword arguments instead:

print(soup.find_all(id='list-1'))
print(soup.find_all(class_='element'))  # class is a Python keyword, hence the trailing underscore

Running result:

[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
- text
The text parameter matches against the text of nodes; it accepts either a string or a compiled regular expression object:

import re
html = '''
<div class="panel">
    <div class="panel-body">
        <a>Hello, this is a link</a>
        <a>Hello, this is a link, too</a>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(text=re.compile('link')))
print(soup.find_all(text=re.compile('Hello')))

Running result:

['Hello, this is a link', 'Hello, this is a link, too']
['Hello, this is a link', 'Hello, this is a link, too']

The result is a list of all node texts that match the regular expression.
- find
Besides find_all there is also find, which likewise queries for elements matching the given conditions. The difference is in the return value: find returns a single element, namely the first match, while find_all returns a list of all matching elements. Otherwise find is used exactly like find_all.
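The contrast between the two can be sketched in a few lines (the sample HTML here is made up for illustration):

```python
from bs4 import BeautifulSoup

html = '''
<ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
</ul>
'''
soup = BeautifulSoup(html, 'html.parser')
print(soup.find(name='li'))       # only the first matching element
print(soup.find_all(name='li'))   # every matching element, as a list
print(soup.find(name='table'))    # no match -> None rather than an empty list
```

Note the no-match behavior: find returns None, so check the result before chaining further attribute access on it.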
9. CSS Selectors
To use CSS selectors, call the select method and pass in the corresponding CSS selector:

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.select('.panel .panel-heading'))
print(soup.select('ul li'))
print(soup.select('#list-2 .element'))
print(type(soup.select('ul')[0]))

Running result:

[<div class="panel-heading">
<h4>Hello</h4>
</div>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
<class 'bs4.element.Tag'>
- Nested selection
The select method supports nested selection, for example:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul.select('li'))

Running result:

[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]

This prints, for each ul node, the list of all li nodes under it.
- Getting attributes
Since each selected node is of type Tag, attributes can still be fetched with the same methods as before. Here we get the id attribute of each ul node:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul['id'])
    print(ul.attrs['id'])

Running result:

list-1
list-1
list-2
list-2

As you can see, both passing the attribute name in square brackets and going through the attrs dictionary successfully retrieve the attribute value.
- Getting text
To get the text, we can use the string attribute seen earlier; another option is get_text. For nodes like these li nodes, which contain only a single piece of text, the two give exactly the same result:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for li in soup.select('li'):
    print('get_text:', li.get_text())
    print('string:', li.string)

Running result:

get_text: Foo
string: Foo
get_text: Bar
string: Bar
get_text: Jay
string: Jay
get_text: Foo
string: Foo
get_text: Bar
string: Bar
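One caveat worth knowing: string and get_text only coincide when a node contains a single text child. With mixed content they diverge, as this small sketch with made-up HTML shows:

```python
from bs4 import BeautifulSoup

# A p node with more than one child: a text node plus a b node
soup = BeautifulSoup('<p>Hello <b>world</b></p>', 'html.parser')
print(soup.p.get_text())  # concatenates all descendant text
print(soup.p.string)      # ambiguous (several children) -> None
```

So prefer get_text when a node may contain nested tags, and string when you want to be told (via None) that the content is not a single text node.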