Python Study Notes 4

¶The urllib library

¶The parse module

¶urlparse()

Parses a URL into its components.

Code:

from urllib.parse import urlparse
result = urlparse('https://www.baidu.com/index.html;user?id=5#comment')
print(type(result), result, sep='\n')

Result:

<class 'urllib.parse.ParseResult'>
ParseResult(scheme='https', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')

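The function returns a ParseResult with six parts:

scheme: the protocol (before ://)
netloc: the domain (between :// and the first /)
path: the access path (from the first / after the domain)
params: parameters (after ;)
query: the query string (after ?), typically used in GET URLs
fragment: the anchor (after #), used to jump directly to a position within the page

That is: scheme://netloc/path;params?query#fragment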
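ParseResult is a named tuple, so the six parts can be read by attribute name or by index. A minimal sketch, reusing the URL from the example above:

```python
from urllib.parse import urlparse

result = urlparse('https://www.baidu.com/index.html;user?id=5#comment')

# Attribute access and index access refer to the same field.
print(result.scheme, result[0])
print(result.netloc, result[1])

# The tuple unpacks into exactly six parts, in the order listed above.
scheme, netloc, path, params, query, fragment = result
print(path, params, query, fragment)

# geturl() reassembles the parts into a URL string.
rebuilt = result.geturl()
print(rebuilt)
```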

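¶Parameters

urlstring: required; the URL to parse.

scheme: the default scheme, used only when the URL itself does not specify one.

allow_fragments: whether to parse the fragment separately (default True). If False, the fragment is not split off and stays part of path, params, or query.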
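A short sketch of how the scheme and allow_fragments arguments affect the result. The URLs are variations on the one used above, chosen only for illustration:

```python
from urllib.parse import urlparse

# No scheme in the URL, so the default passed via `scheme` is used.
# Without a leading '//', the host name is treated as part of the path.
r1 = urlparse('www.baidu.com/index.html', scheme='https')
print(r1.scheme, repr(r1.netloc), r1.path)

# The URL already carries a scheme, so the `scheme` argument is ignored.
r2 = urlparse('http://www.baidu.com/index.html', scheme='https')
print(r2.scheme)

# allow_fragments=False: '#comment' stays attached to the query.
r3 = urlparse('https://www.baidu.com/index.html?id=5#comment', allow_fragments=False)
print(r3.query, repr(r3.fragment))

# With no query or params present, it stays attached to the path instead.
r4 = urlparse('https://www.baidu.com/index.html#comment', allow_fragments=False)
print(r4.path, repr(r4.fragment))
```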

¶urlunparse()

Builds a URL from an iterable, which must contain exactly 6 elements.

For example, to build http://www.baidu.com/index.html;user?a=6#comment:

from urllib.parse import urlunparse
data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']
print(urlunparse(data))

¶urlsplit()

Like urlparse(), but does not parse params separately; they stay in path.

Code:

from urllib.parse import urlsplit
result = urlsplit('https://www.baidu.com/index.html;user?id=5#comment')
print(result)

Result:

SplitResult(scheme='https', netloc='www.baidu.com', path='/index.html;user', query='id=5', fragment='comment')

¶urlunsplit()

Works the same way as urlunparse(), except that the iterable must contain exactly 5 elements.
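A minimal sketch of urlunsplit(), mirroring the urlunparse() example but with the params folded into the path:

```python
from urllib.parse import urlsplit, urlunsplit

# Five parts: scheme, netloc, path (params stay inside it), query, fragment.
data = ['http', 'www.baidu.com', 'index.html;user', 'a=6', 'comment']
url = urlunsplit(data)
print(url)

# Splitting the result gives the parts back; the path gains its leading '/'.
parts = urlsplit(url)
print(parts)
```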

¶urljoin()

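First argument: base_url (the base link)

Second argument: the new link

base_url supplies three things: scheme, netloc, and path. Any of these missing from the new link is filled in from base_url; any that the new link already has is taken from the new link.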
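The rule above can be sketched with a few calls. The base URL and the example.com link are made up for illustration:

```python
from urllib.parse import urljoin

base = 'http://www.baidu.com/about.html'

# The new link has only a path: scheme and netloc come from the base.
j1 = urljoin(base, 'FAQ.html')
print(j1)

# The new link is a complete URL: nothing is taken from the base.
j2 = urljoin(base, 'https://example.com/FAQ.html')
print(j2)

# The new link has only a query: scheme, netloc, and path come from the base.
j3 = urljoin(base, '?category=2')
print(j3)

# The new link has its own path, so the base's query and fragment are dropped here.
j4 = urljoin('http://www.baidu.com/about.html?wd=abc#top', 'FAQ.html')
print(j4)
```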

¶urlencode()

Builds the query string for a GET request.

from urllib.parse import urlencode
query = {
    'name': 'germey',
    'age': '12'
}
base_url = 'http://www.baidu.com?'
url = base_url + urlencode(query)
print(url)

Result: http://www.baidu.com?name=germey&age=12

¶parse_qs()

Converts a query string back into a dictionary.

from urllib.parse import parse_qs
query = 'name=germey&age=12'
print(parse_qs(query))

Result: {'name': ['germey'], 'age': ['12']}

¶parse_qsl()

An alternative that converts the query into a list of tuples.

from urllib.parse import parse_qsl
query = 'name=germey&age=12'
print(parse_qsl(query))

Result: [('name', 'germey'), ('age', '12')]
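urlencode(), parse_qs(), and parse_qsl() are inverses of one another, which a small round trip makes visible. The repeated 'tag' key is a made-up example showing why parse_qs() wraps values in lists:

```python
from urllib.parse import urlencode, parse_qs, parse_qsl

query = {'name': 'germey', 'age': '12'}
encoded = urlencode(query)
print(encoded)

# parse_qsl yields (key, value) pairs, so dict() restores the original mapping.
restored = dict(parse_qsl(encoded))
print(restored == query)

# parse_qs collects values into lists because a key may appear more than once.
as_dict = parse_qs('tag=a&tag=b')
as_pairs = parse_qsl('tag=a&tag=b')
print(as_dict)
print(as_pairs)
```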

¶quote()

Converts Chinese (or other non-ASCII) parameters to URL encoding.

from urllib.parse import quote
keyword = '野狼匹'
url = 'https://www.baidu.com/s?wd=' + quote(keyword)
print(url)

Result: https://www.baidu.com/s?wd=%E9%87%8E%E7%8B%BC%E5%8C%B9

¶unquote()

Decodes a URL-encoded string back into Chinese text.

from urllib.parse import unquote
print(unquote('https://www.baidu.com/s?wd=%E9%87%8E%E7%8B%BC%E5%8C%B9'))

Result: https://www.baidu.com/s?wd=野狼匹
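quote() and unquote() undo each other, and quote() leaves '/' unencoded unless told otherwise through its safe parameter. A small sketch; the string 'a b/c' is an arbitrary example:

```python
from urllib.parse import quote, unquote

keyword = '野狼匹'
encoded = quote(keyword)
round_trip = unquote(encoded)
print(round_trip == keyword)

# By default '/' is considered safe and is left as-is.
default_safe = quote('a b/c')
print(default_safe)

# Passing safe='' encodes '/' as well, which suits a single query value.
no_safe = quote('a b/c', safe='')
print(no_safe)
```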

¶The robotparser module

First create a RobotFileParser object:

rp = RobotFileParser()

A URL can also be passed at creation, for example:

rp = RobotFileParser('http://www.jianshu.com/robots.txt')

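Commonly used methods:

set_url(): sets the URL of the robots.txt file if none was passed at creation
read(): fetches and parses the robots.txt file (until it has been read, every check returns False)

read()

A look at the source shows that read() is essentially urlopen() + parse(). Its drawback is that it swallows HTTP errors: on 401 or 403 it disallows everything, and on other 4xx errors it allows everything. This is the trap I fell into: Jianshu's robots.txt can only be fetched with a browser-like User-Agent, so the request failed with 403 without my noticing, and every URL came back False. I recommend calling urlopen() + parse() yourself instead.

parse(): parses the lines of a robots.txt file

urlopen() + parse() with a browser-like User-Agent: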

from urllib.robotparser import RobotFileParser
from urllib.request import Request, urlopen
rp = RobotFileParser()
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'}
req = Request(url='http://www.jianshu.com/robots.txt', headers=headers)
rp.parse(urlopen(req).read().decode('utf-8').splitlines())

The main advantage is that any error is raised directly, so you can handle it yourself.
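One way to handle those errors is sketched below. load_robots and try_load are hypothetical helper names, not part of urllib, and the 10-second timeout is an arbitrary choice:

```python
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen
from urllib.robotparser import RobotFileParser


def load_robots(url, user_agent, timeout=10):
    """Fetch robots.txt with a custom User-Agent and return a parsed RobotFileParser.

    Hypothetical helper: network and HTTP errors are not swallowed,
    they propagate to the caller.
    """
    req = Request(url, headers={'User-Agent': user_agent})
    with urlopen(req, timeout=timeout) as resp:
        lines = resp.read().decode('utf-8').splitlines()
    rp = RobotFileParser(url)
    rp.parse(lines)
    return rp


def try_load(url, user_agent):
    """Return a parser, or None after reporting why the fetch failed."""
    try:
        return load_robots(url, user_agent)
    except HTTPError as e:   # e.g. 403 when the User-Agent is rejected
        print('HTTP error:', e.code)
    except URLError as e:    # DNS failure, refused connection, missing file, ...
        print('Network error:', e.reason)
    return None
```

Calling try_load('http://www.jianshu.com/robots.txt', 'Mozilla/5.0') would then either return a usable parser or print the reason for the failure, instead of silently returning False for every URL.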

can_fetch(): takes a User-Agent and a URL and returns whether that URL may be fetched

print(rp.can_fetch('', 'http://www.jianshu.com/p/a4be10076a6e'))
print(rp.can_fetch('', 'http://www.jianshu.com/search?q=python&page=1&typr=collections'))

The results are True and False.

mtime(): returns the time robots.txt was last fetched and parsed
modified(): sets the last-fetched-and-parsed time of robots.txt to the current time
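The methods above can be tried without any network access by feeding parse() a list of lines directly. The rules below are made up for illustration and are not the real Jianshu robots.txt:

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration only.
rules = """\
User-agent: *
Disallow: /search
Allow: /p/
""".splitlines()

rp = RobotFileParser()

# Nothing has been read or parsed yet, so every check returns False.
before = rp.can_fetch('*', 'http://www.jianshu.com/p/a4be10076a6e')
print(before)

# parse() also records the current time as the last-checked time.
rp.parse(rules)

article_ok = rp.can_fetch('*', 'http://www.jianshu.com/p/a4be10076a6e')
search_ok = rp.can_fetch('*', 'http://www.jianshu.com/search?q=python')
print(article_ok, search_ok)

# mtime() is 0 until something has been parsed, then a Unix timestamp.
print(rp.mtime() > 0)
```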