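Python Study Notes 3

The urllib library

The request module

The BaseHandler class

The parent class of all handlers. It defines the interface for hook methods such as default_open() and protocol_request(), where "protocol" stands for the URL scheme (for example http_request()). BaseHandler itself does not implement these hooks; subclasses define the ones they need.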
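
As a rough illustration of how such a hook is picked up, the sketch below defines a hypothetical handler, HeaderStampHandler, whose http_request() adds a made-up X-Notes-Demo header to each request. The class name, the header name, and the example.com URL are assumptions for illustration only, and the hook is called directly so nothing is sent over the network.

```python
import urllib.request


class HeaderStampHandler(urllib.request.BaseHandler):
    """Hypothetical handler: adds one custom header to every HTTP(S) request."""

    def __init__(self, value):
        self.value = value

    def http_request(self, req):
        # Pre-processing hook named <protocol>_request; it must return the request
        req.add_header('X-Notes-Demo', self.value)
        return req

    # Reuse the same hook for https:// URLs
    https_request = http_request


handler = HeaderStampHandler('hello')
# build_opener() accepts handler instances alongside its default handlers
opener = urllib.request.build_opener(handler)

# Call the hook directly so that no request leaves the machine
req = handler.http_request(urllib.request.Request('http://example.com/'))
# Request stores header names capitalized, i.e. as 'X-notes-demo'
stamped = req.get_header('X-notes-demo')
print(stamped)
```

Calling opener.open(url) would run the same hook automatically before the request is sent.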

Handler subclasses

HTTPDefaultErrorHandler

Handles HTTP error responses by turning them into HTTPError exceptions.

HTTPRedirectHandler

Handles redirects.
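
One way to see what the redirect handler does is to call its redirect_request() method by hand. The sketch below compares the default behaviour with a hypothetical NoRedirectHandler subclass that declines every redirect; the class name and the example.com URLs are assumptions, and no network access takes place.

```python
import urllib.request


class NoRedirectHandler(urllib.request.HTTPRedirectHandler):
    """Hypothetical subclass that declines to follow any redirect."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None means this handler produces no follow-up request
        return None


original = urllib.request.Request('http://example.com/old')

# Default behaviour: a GET answered with 302 becomes a new request for newurl
followed = urllib.request.HTTPRedirectHandler().redirect_request(
    original, None, 302, 'Found', {}, 'http://example.com/new')
print(followed.full_url)

# Overridden behaviour: no follow-up request is produced
declined = NoRedirectHandler().redirect_request(
    original, None, 302, 'Found', {}, 'http://example.com/new')
print(declined)
```

With such a handler passed to build_opener(), a 3xx response should surface as an HTTPError from the default error handler instead of being followed.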

HTTPCookieProcessor

Handles cookies:

Dictionary-style handling

Create a cookie jar with http.cookiejar.CookieJar()

Create a handler with HTTPCookieProcessor, passing in the cookie jar

Create an opener with build_opener() and call open()

import http.cookiejar
import urllib.request

# An in-memory jar that collects the cookies set by the server
cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com/')

# Print every cookie the response stored in the jar
for item in cookie:
    print(item.name + '=' + item.value)
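
Since iterating a CookieJar yields Cookie objects with name and value attributes, the jar can be flattened into a plain dict. The sketch below shows this offline: make_cookie is a hypothetical helper that builds cookies by hand, and the cookie names, values, and the example.com domain are made up for illustration.

```python
import http.cookiejar


def make_cookie(name, value, domain='example.com'):
    """Hypothetical helper: build a session cookie by hand for demonstration."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path='/', path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


jar = http.cookiejar.CookieJar()
jar.set_cookie(make_cookie('BAIDUID', 'abc123'))
jar.set_cookie(make_cookie('lang', 'zh'))

# Each item is a Cookie object, so a dict comprehension is enough
cookie_dict = {item.name: item.value for item in jar}
print(cookie_dict)
```

Note that a dict keyed only by name would merge cookies that share a name across different domains or paths.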
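
Saving cookies

Create the cookie jar with MozillaCookieJar(filename) or LWPCookieJar(filename)

Call cookie.save(ignore_expires=True, ignore_discard=True) to write the file. ignore_discard=True also saves session cookies that are marked to be discarded, and ignore_expires=True also saves cookies that have already expired.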

import http.cookiejar
import urllib.request

filename = 'cookies.txt'
# Mozilla format or LWP format
cookie = http.cookiejar.MozillaCookieJar(filename)
# cookie = http.cookiejar.LWPCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com/')

# Write the collected cookies, including session and expired ones, to the file
cookie.save(ignore_expires=True, ignore_discard=True)
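
To make the effect of the save() flags concrete, the sketch below saves the same jar twice, once with the defaults and once with both flags set, and compares what ends up in the file. It runs offline: make_cookie and saved_names are hypothetical helpers, and the cookie names, values, domain, and the temporary file are assumptions for illustration.

```python
import http.cookiejar
import os
import tempfile


def make_cookie(name, value, expires, discard):
    """Hypothetical helper: build a cookie by hand for demonstration."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain='example.com', domain_specified=False, domain_initial_dot=False,
        path='/', path_specified=True,
        secure=False, expires=expires, discard=discard,
        comment=None, comment_url=None, rest={},
    )


path = os.path.join(tempfile.mkdtemp(), 'cookies.txt')
jar = http.cookiejar.MozillaCookieJar(path)
# A session cookie (no expiry, marked to be discarded) ...
jar.set_cookie(make_cookie('session', 'abc123', expires=None, discard=True))
# ... and a persistent one expiring on 2100-01-01 UTC
jar.set_cookie(make_cookie('persistent', 'xyz789', expires=4102444800, discard=False))


def saved_names(**save_flags):
    """Save the jar with the given flags and list the names found in the file."""
    jar.save(**save_flags)
    reloaded = http.cookiejar.MozillaCookieJar()
    reloaded.load(path, ignore_discard=True, ignore_expires=True)
    return sorted(c.name for c in reloaded)


default_names = saved_names()
all_names = saved_names(ignore_discard=True, ignore_expires=True)
print(default_names, all_names)
```

With the defaults only the persistent cookie is written; the session cookie appears once ignore_discard=True is passed.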

Loading cookies

Call cookie.load(filename, ignore_discard=True, ignore_expires=True) to read the cookie file

import http.cookiejar
import urllib.request

filename = 'cookies.txt'
# Mozilla format or LWP format; use the same class that wrote the file
cookie = http.cookiejar.MozillaCookieJar()
# cookie = http.cookiejar.LWPCookieJar()
cookie.load(filename, ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)

# The loaded cookies are sent along with this request
response = opener.open('http://www.baidu.com/')
print(response.read().decode('utf-8'))
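
The jar class used for loading has to match the format of the file. The offline sketch below writes a small Mozilla-format file by hand, loads it with MozillaCookieJar, and then shows LWPCookieJar rejecting the same file. The file contents (domain, cookie name and value, expiry timestamp) are made up for illustration.

```python
import http.cookiejar
import os
import tempfile

# A hand-written Mozilla/Netscape cookie file: a magic header line, then
# tab-separated fields: domain, domain flag, path, secure, expires, name, value
content = (
    "# Netscape HTTP Cookie File\n"
    ".example.com\tTRUE\t/\tFALSE\t4102444800\tsession\tabc123\n"
)
path = os.path.join(tempfile.mkdtemp(), 'cookies.txt')
with open(path, 'w') as f:
    f.write(content)

# Loading with the matching class restores the cookie
loaded = http.cookiejar.MozillaCookieJar()
loaded.load(path, ignore_discard=True, ignore_expires=True)
restored = {c.name: c.value for c in loaded}
print(restored)

# Loading with the other format's class fails on the header check
try:
    http.cookiejar.LWPCookieJar().load(path, ignore_discard=True, ignore_expires=True)
    mismatch = False
except http.cookiejar.LoadError:
    mismatch = True
print(mismatch)
```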

ProxyHandler

Sets up a proxy

Create a ProxyHandler, passing a dict that maps each URL scheme to a proxy URL

Build an opener with build_opener(handler)

Open the URL with open()

from urllib.request import ProxyHandler, build_opener

# Map each URL scheme to the proxy that should carry it; use your own addresses
proxy_handler = ProxyHandler({
    'http': 'http://127.0.0.1:8080',
    'https': 'https://127.0.0.1:8080',
})
opener = build_opener(proxy_handler)
response = opener.open('https://www.baidu.com')
print(response.read().decode('utf-8'))
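
When no dict is passed, ProxyHandler falls back to the proxies that urllib detects from the environment, and an empty dict turns proxying off. The sketch below illustrates this without opening any connection. The proxy address is the placeholder used in these notes, and the proxies attribute read at the end is simply where CPython keeps the mapping; it is not a documented part of the API, so treat that part as an assumption.

```python
import os
import urllib.request

# Assumed proxy address, matching the placeholder used in these notes
os.environ['http_proxy'] = 'http://127.0.0.1:8080'

# getproxies() reports the proxies urllib would pick up automatically
detected = urllib.request.getproxies()
print(detected.get('http'))

# An explicit dict overrides detection; an empty dict disables proxies
explicit = urllib.request.ProxyHandler({'http': 'http://127.0.0.1:8080'})
direct = urllib.request.ProxyHandler({})

# 'proxies' is a CPython implementation attribute, inspected here only to
# show what each handler will use
print(explicit.proxies, direct.proxies)
```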
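
HTTPPasswordMgr

Manages passwords: it keeps a mapping from (realm, URI) to (user, password)

HTTPBasicAuthHandler

Handles authentication:

Create an HTTPPasswordMgrWithDefaultRealm object p

Register the username, password, and URL with add_password()

Pass p as the argument when constructing the HTTPBasicAuthHandler object

Create an opener object with build_opener()

Open the URL with open()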

from urllib.request import HTTPPasswordMgrWithDefaultRealm, HTTPBasicAuthHandler, build_opener

username = 'username'
password = 'password'
url = 'http://www.baidu.com/'  # replace with a page that requires authentication

# realm=None makes these credentials the default for any realm under this URL
p = HTTPPasswordMgrWithDefaultRealm()
p.add_password(None, url, username, password)
handler = HTTPBasicAuthHandler(p)
opener = build_opener(handler)
result = opener.open(url)
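
The lookup that the auth handler performs against the password manager can be tried on its own. In the sketch below, credentials registered with realm None are returned for any realm, but only for URLs under the registered prefix. The example.com and other.example.org URLs and the realm name 'Any Realm' are assumptions for illustration.

```python
from urllib.request import HTTPPasswordMgrWithDefaultRealm

p = HTTPPasswordMgrWithDefaultRealm()
# realm=None registers default credentials for everything under this URL
p.add_password(None, 'http://example.com/private/', 'username', 'password')

# A page below the registered URL matches, whatever realm the server reports
inside = p.find_user_password('Any Realm', 'http://example.com/private/report.html')

# A URL on another host has no credentials
outside = p.find_user_password('Any Realm', 'http://other.example.org/')

print(inside, outside)
```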

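The error module

The URLError class

Inherits from OSError and is the base class of the exceptions in the error module

Its reason attribute describes the cause of the exception

The HTTPError class

A subclass of URLError, used specifically for HTTP request errors

Attributes:

  • code: the HTTP status code; 404 means the page does not exist, 500 means an internal server error
  • reason: the reason for the error
  • headers: the HTTP response headers of the failed request

For example: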

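from urllib import request, error

try:
    response = request.urlopen('https://yelangpi.github.io/index.htm')  # a page that does not exist
except error.HTTPError as e:
    # Print the reason, the status code, and the response headers on separate lines
    print(e.reason, e.code, e.headers, sep='\n')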
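
Because HTTPError is a subclass of URLError, an except clause for HTTPError has to come before the one for URLError, otherwise it is never reached. The offline sketch below builds both exceptions by hand to show the ordering; describe is a hypothetical helper, and the URL and messages are made up for illustration.

```python
import io
from email.message import Message
from urllib import error


def describe(exc):
    """Hypothetical helper: summarise a urllib exception as one line of text."""
    try:
        raise exc
    except error.HTTPError as e:
        # Must come first: every HTTPError is also a URLError
        return 'HTTP {}: {}'.format(e.code, e.reason)
    except error.URLError as e:
        return 'URL error: {}'.format(e.reason)


# Hand-built exceptions standing in for real failures
not_found = error.HTTPError(
    'http://example.com/missing', 404, 'Not Found', Message(), io.BytesIO(b''))
unreachable = error.URLError('connection refused')

http_line = describe(not_found)
url_line = describe(unreachable)
print(http_line)
print(url_line)
```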

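Note

The reason attribute may be an object rather than a string, for example an instance of socket.timeout (an alias of TimeoutError since Python 3.10)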

import socket
import urllib.request
import urllib.error

try:
    # A very short timeout, chosen so that the request times out
    response = urllib.request.urlopen('https://www.baidu.com', timeout=0.01)
except urllib.error.URLError as e:
    # e.reason is an exception object here, not a string
    print(type(e.reason))
    if isinstance(e.reason, socket.timeout):
        print('TIME OUT')
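
The same check can be exercised without a network by constructing the exceptions directly. In the sketch below, is_timeout is a hypothetical helper and the error messages are made up; it only illustrates that reason can hold either an exception object or a plain string.

```python
import socket
from urllib import error


def is_timeout(exc):
    """Hypothetical helper: True if a URLError was caused by a socket timeout."""
    return isinstance(exc.reason, socket.timeout)


# reason is an exception object in the first case and a string in the second
timed_out = error.URLError(socket.timeout('timed out'))
refused = error.URLError('connection refused')

print(is_timeout(timed_out), is_timeout(refused))
```

Testing with isinstance() rather than comparing strings keeps the check valid for both kinds of reason.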