2019.9.8

Posted on 2019-09-08 | Updated: 2019-09-08 | Category: Chat

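The 9th floor of the library is a wonderful place; I will probably come here to study often from now on. The 14th floor is also a good spot when it is not raining.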

Regular Expression Summary

Posted on 2019-08-09 | Updated: 2019-08-15 | Category: python

¶Metacharacters

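Table

Code	Description
.	Matches any character except a newline
\w	Matches a letter, digit, underscore, or Chinese character
\s	Matches any whitespace character
\d	Matches a digit
\b	Matches the start or end of a word
^	Matches the start of the string
$	Matches the end of the string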
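The metacharacters in the table can be tried out with Python's built-in re module. This is a minimal sketch; the sample string is made up for illustration.

```python
import re

text = "hello world 2019\nregex 你好"

# \d matches one digit; + repeats it one or more times
print(re.findall(r"\d+", text))        # ['2019']

# \w matches letters, digits, underscore and, for str patterns, Chinese characters
print(re.findall(r"\w+", text))        # ['hello', 'world', '2019', 'regex', '你好']

# \b marks a word boundary, so this finds whole words that start with "w"
print(re.findall(r"\bw\w*", text))     # ['world']

# . matches anything except a newline, so the match stops at the line break
print(re.match(r".*", text).group())   # hello world 2019

# ^ and $ anchor the pattern to the start and end of the string
print(bool(re.search(r"^hello", text)))  # True
print(bool(re.search(r"你好$", text)))    # True

# \s matches any whitespace, including the newline
print(re.split(r"\s+", text))          # ['hello', 'world', '2019', 'regex', '你好']
```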

Python Study Notes 6

Posted on 2019-08-04 | Updated: 2019-08-05 | Category: python

¶The requests library

¶File upload

Pass the files parameter when sending the request.


import requests
files = {'file': open('filename', 'rb')}  # open for binary reading; 'wb' would open for binary writing
r = requests.post('http://httpbin.org/post', files=files)

¶cookies

r.cookies is a RequestsCookieJar; cookies.items() yields each key and value.

import requests
r = requests.get("https://github.com/")
print(r.cookies)
for key, value in r.cookies.items():
    print(key+'='+value)

Custom cookies can be sent through headers; the header name is Cookie (singular):

headers = {
    'Cookie': '',
    'Host': '',
    'User-Agent': '',
}

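They can also be passed through the cookies parameter, but the raw string must first be split and loaded into a RequestsCookieJar:

cookies = ''  # paste the raw cookie string here, e.g. 'k1=v1; k2=v2'
jar = requests.cookies.RequestsCookieJar()
for cookie in cookies.split(';'):
    if '=' not in cookie:
        continue  # skip empty fragments
    key, value = cookie.strip().split('=', 1)  # separator, maximum number of splits
    jar.set(key, value)  # store the pair in the jar
r = requests.get('http://www.zhihu.com', cookies=jar, headers=headers)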
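The splitting step can be sketched on its own, without sending any request. The helper name parse_cookie_string and the sample cookie values are hypothetical and only serve as illustration.

```python
def parse_cookie_string(raw):
    """Split a raw 'k1=v1; k2=v2' Cookie header value into a dict."""
    result = {}
    for pair in raw.split(';'):
        pair = pair.strip()              # drop the space that follows each ';'
        if not pair or '=' not in pair:
            continue                     # skip empty or malformed fragments
        key, value = pair.split('=', 1)  # split once: values may contain '='
        result[key] = value
    return result


cookie_dict = parse_cookie_string('sessionid=abc123; token=a=b=c; theme=dark')
print(cookie_dict)
# {'sessionid': 'abc123', 'token': 'a=b=c', 'theme': 'dark'}
```

Each pair can then be stored with jar.set(key, value) on a RequestsCookieJar. requests also accepts a plain dict for its cookies argument, so passing cookies=cookie_dict may be enough in simple cases.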

To be continued

Python Study Notes 5

Posted on 2019-08-03 | Updated: 2019-08-03 | Category: python

¶The requests library

¶Basic functions

¶get()

Requests a web page; the return type is Response.


import requests
r = requests.get('https://www.baidu.com/')
print(type(r))

Result: <class 'requests.models.Response'>

¶Other requests

r = requests.post('http://httpbin.org/post')
r = requests.put('http://httpbin.org/put')
r = requests.delete('http://httpbin.org/delete')
r = requests.head('http://httpbin.org/get')
r = requests.options('http://httpbin.org/get')

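Parameters such as data and headers can be passed along with the request (for a GET request, query-string parameters are normally passed with params instead), for example: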

import requests
data = {
    'name': 'germey',
    'age': 22
}
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'}
r = requests.get("https://www.baidu.com", headers=headers, data=data)

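¶The fake_useragent library (a small tip)

A way to generate a random User-Agent for browser spoofing, for when the long string is too tedious to type by hand.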

from fake_useragent import UserAgent
ua = UserAgent()
print(ua.ie)
print(ua.opera)
print(ua.chrome)
print(ua.firefox)
print(ua.safari)
print(ua.random)

¶json()

Parses the response body and returns it as a dictionary.

r = requests.get('http://httpbin.org/get')
print(r.json())

If the body is not valid JSON, a parsing error is raised: json.decoder.JSONDecodeError

¶Basic attributes

¶text

Page content (str)

¶cookies

Type: RequestsCookieJar (requests.cookies.RequestsCookieJar)

¶status_code

Status code

requests.codes can be used to look up status codes:

_codes = {

    # Informational.
    100: ('continue',),
    101: ('switching_protocols',),
    102: ('processing',),
    103: ('checkpoint',),
    122: ('uri_too_long', 'request_uri_too_long'),
    200: ('ok', 'okay', 'all_ok', 'all_okay', 'all_good', '\\o/', '✓'),
    201: ('created',),
    202: ('accepted',),
    203: ('non_authoritative_info', 'non_authoritative_information'),
    204: ('no_content',),
    205: ('reset_content', 'reset'),
    206: ('partial_content', 'partial'),
    207: ('multi_status', 'multiple_status', 'multi_stati', 'multiple_stati'),
    208: ('already_reported',),
    226: ('im_used',),

    # Redirection.
    300: ('multiple_choices',),
    301: ('moved_permanently', 'moved', '\\o-'),
    302: ('found',),
    303: ('see_other', 'other'),
    304: ('not_modified',),
    305: ('use_proxy',),
    306: ('switch_proxy',),
    307: ('temporary_redirect', 'temporary_moved', 'temporary'),
    308: ('permanent_redirect',
          'resume_incomplete', 'resume',),  # These 2 to be removed in 3.0

    # Client Error.
    400: ('bad_request', 'bad'),
    401: ('unauthorized',),
    402: ('payment_required', 'payment'),
    403: ('forbidden',),
    404: ('not_found', '-o-'),
    405: ('method_not_allowed', 'not_allowed'),
    406: ('not_acceptable',),
    407: ('proxy_authentication_required', 'proxy_auth', 'proxy_authentication'),
    408: ('request_timeout', 'timeout'),
    409: ('conflict',),
    410: ('gone',),
    411: ('length_required',),
    412: ('precondition_failed', 'precondition'),
    413: ('request_entity_too_large',),
    414: ('request_uri_too_large',),
    415: ('unsupported_media_type', 'unsupported_media', 'media_type'),
    416: ('requested_range_not_satisfiable', 'requested_range', 'range_not_satisfiable'),
    417: ('expectation_failed',),
    418: ('im_a_teapot', 'teapot', 'i_am_a_teapot'),
    421: ('misdirected_request',),
    422: ('unprocessable_entity', 'unprocessable'),
    423: ('locked',),
    424: ('failed_dependency', 'dependency'),
    425: ('unordered_collection', 'unordered'),
    426: ('upgrade_required', 'upgrade'),
    428: ('precondition_required', 'precondition'),
    429: ('too_many_requests', 'too_many'),
    431: ('header_fields_too_large', 'fields_too_large'),
    444: ('no_response', 'none'),
    449: ('retry_with', 'retry'),
    450: ('blocked_by_windows_parental_controls', 'parental_controls'),
    451: ('unavailable_for_legal_reasons', 'legal_reasons'),
    499: ('client_closed_request',),

    # Server Error.
    500: ('internal_server_error', 'server_error', '/o\\', '✗'),
    501: ('not_implemented',),
    502: ('bad_gateway',),
    503: ('service_unavailable', 'unavailable'),
    504: ('gateway_timeout',),
    505: ('http_version_not_supported', 'http_version'),
    506: ('variant_also_negotiates',),
    507: ('insufficient_storage',),
    509: ('bandwidth_limit_exceeded', 'bandwidth'),
    510: ('not_extended',),
    511: ('network_authentication_required', 'network_auth', 'network_authentication'),
}

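For example, requests.codes.ok or requests.codes['ok'] stands for 200.

The attribute form cannot be used when the name is a Python keyword, as with 100: continue; use requests.codes['continue'] instead.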

¶url

¶headers

¶history

Request history (the redirect responses that led to this one)

Python Study Notes 4

Posted on 2019-08-01 | Updated: 2019-08-14 | Category: python

¶The urllib library

¶The parse module

¶urlparse()

URL parsing

Code:


from urllib.parse import urlparse
result = urlparse('https://www.baidu.com/index.html;user?id=5#comment')
print(type(result), result, sep='\n')

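Result:

<class 'urllib.parse.ParseResult'>
ParseResult(scheme='https', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')

The function returns a ParseResult with 6 parts:

scheme: the protocol (before ://)
netloc: the domain (between :// and the first /)
path: the access path (from the first / onward)
params: the parameters (after ;)
query: the query string (after ?), generally used for GET-type URLs
fragment: the anchor (after #), used to jump directly to a position within the page

That is: scheme://netloc/path;params?query#fragment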

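¶Parameters

urlstring: required; the URL to parse.

scheme: the default scheme, used only when the URL itself does not specify one.

allow_fragments: whether to split off the fragment. If False, the fragment is not parsed separately and remains part of the path, params, or query.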
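The effect of the scheme and allow_fragments parameters can be seen in a short sketch. It reuses the URL from the example above; nothing is fetched, the strings are only parsed.

```python
from urllib.parse import urlparse

# The URL has no scheme, so the scheme argument supplies the default.
# Without '//' the host is not recognised as netloc and ends up in path.
r1 = urlparse('www.baidu.com/index.html;user?id=5#comment', scheme='https')
print(r1)

# A scheme present in the URL itself wins over the default.
r2 = urlparse('http://www.baidu.com/index.html', scheme='https')
print(r2.scheme)    # http

# allow_fragments=False: '#comment' is kept as part of the query.
r3 = urlparse('https://www.baidu.com/index.html;user?id=5#comment',
              allow_fragments=False)
print(r3.query)     # id=5#comment

# With no params or query, the fragment stays in the path instead.
r4 = urlparse('https://www.baidu.com/index.html#comment', allow_fragments=False)
print(r4.path)      # /index.html#comment
```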

¶urlunparse()

Builds a URL from its parts. It accepts an iterable whose length must be exactly 6.

For example, to build http://www.baidu.com/index.html;user?a=6#comment:


from urllib.parse import urlunparse
data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']
print(urlunparse(data))

¶urlsplit()

Does not parse the params part separately; it stays inside path.

Code:


from urllib.parse import urlsplit
result = urlsplit('https://www.baidu.com/index.html;user?id=5#comment')
print(result)

Result:

SplitResult(scheme='https', netloc='www.baidu.com', path='/index.html;user', query='id=5', fragment='comment')

¶urlunsplit()

Works the same way as urlunparse(), except that the iterable must have length 5.
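A minimal sketch of urlunsplit(), mirroring the urlunparse() example but with five parts instead of six. The part values are illustrative.

```python
from urllib.parse import urlsplit, urlunsplit

# Five parts: scheme, netloc, path, query, fragment (no separate params slot)
data = ['https', 'www.baidu.com', 'index.html', 'a=6', 'comment']
url = urlunsplit(data)
print(url)  # https://www.baidu.com/index.html?a=6#comment

# Splitting the result again gives the parts back
parts = urlsplit(url)
print(parts.path, parts.query, parts.fragment)  # /index.html a=6 comment
```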

¶urljoin()

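First argument: base_url (the base link)

Second argument: the new link

base_url supplies three items: scheme, netloc, and path. Any of these missing from the new link is filled in from base_url; any that the new link already has is taken from the new link.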
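The filling-in rule can be checked with a few calls. This is a sketch; the example.com link is a placeholder chosen for illustration.

```python
from urllib.parse import urljoin

# The new link has no scheme or netloc, so both are taken from base_url
print(urljoin('http://www.baidu.com', 'FAQ.html'))
# http://www.baidu.com/FAQ.html

# The new link is a complete URL, so base_url is ignored entirely
print(urljoin('http://www.baidu.com/about.html', 'https://example.com/FAQ.html'))
# https://example.com/FAQ.html

# Only a query string is given, so scheme, netloc and path come from base_url
print(urljoin('http://www.baidu.com/about.html', '?category=2'))
# http://www.baidu.com/about.html?category=2

# The query and fragment of base_url are not carried over to a new path
print(urljoin('http://www.baidu.com/about.html?wd=abc#top', 'FAQ.html'))
# http://www.baidu.com/FAQ.html
```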

¶urlencode()

Builds the query string for a GET request

from urllib.parse import urlencode
query = {
    'name': 'germey',
    'age': '12'
}
base_url = 'http://www.baidu.com?'
url = base_url + urlencode(query)
print(url)

The result is http://www.baidu.com?name=germey&age=12

¶parse_qs()

Converts a query string back into a dictionary.


from urllib.parse import parse_qs
query = 'name=germey&age=12'
print(parse_qs(query))

The result is {'name': ['germey'], 'age': ['12']}

¶parse_qsl()

An alternative that converts the query string into a list of tuples.


from urllib.parse import parse_qsl
query = 'name=germey&age=12'
print(parse_qsl(query))

The result is [('name', 'germey'), ('age', '12')]
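The three functions fit together as a round trip. The sketch below reuses the query from the examples above; the repeated tag key is made up to show why parse_qs() wraps values in lists.

```python
from urllib.parse import urlencode, parse_qs, parse_qsl

query = {'name': 'germey', 'age': '12'}
encoded = urlencode(query)
print(encoded)                      # name=germey&age=12

# parse_qs keeps every value in a list because a key may repeat
print(parse_qs('tag=a&tag=b'))      # {'tag': ['a', 'b']}

# parse_qsl returns (key, value) pairs; dict() flattens them when keys are unique
restored = dict(parse_qsl(encoded))
print(restored == query)            # True
```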

¶quote()

Converts Chinese parameters into URL encoding (percent-encoding).

from urllib.parse import quote
keyword = '野狼匹'
url = 'https://www.baidu.com/s?wd=' + quote(keyword)
print(url)

The result is https://www.baidu.com/s?wd=%E9%87%8E%E7%8B%BC%E5%8C%B9

¶unquote()

Decodes URL encoding back into Chinese.

from urllib.parse import unquote
print(unquote('https://www.baidu.com/s?wd=%E9%87%8E%E7%8B%BC%E5%8C%B9'))

The result is https://www.baidu.com/s?wd=野狼匹

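¶The robotparser module

First create a RobotFileParser object:

rp = RobotFileParser()

A link can be passed in at creation time, for example:

rp = RobotFileParser('http://www.jianshu.com/robots.txt')

How the main functions are used

set_url(): if no link was passed at creation time, this function sets the robots.txt URL
read(): fetches the robots.txt file and parses it (until something has been read, every check returns False)

read()

The source code shows that read() is essentially urlopen() + parse(). Its drawback is that when the fetch fails, it silently falls back to allowing everything or disallowing everything. I fell into this trap: Jianshu's robots.txt can only be fetched with a browser-like User-Agent, so the request failed with 403 without my noticing, and every URL came back False. I suggest calling urlopen() + parse() yourself.

parse(): parses the lines of a robots.txt file

urlopen() + parse() with a browser-like User-Agent: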

from urllib.robotparser import RobotFileParser
from urllib.request import Request, urlopen
rp = RobotFileParser()
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'}
req = Request(url='http://www.jianshu.com/robots.txt', headers=headers)
rp.parse(urlopen(req).read().decode('utf-8').splitlines())

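The main advantage is that any error is raised directly, so you can handle it yourself.

can_fetch(): takes a User-agent and a URL and returns whether that URL may be fetched

print(rp.can_fetch('', 'http://www.jianshu.com/p/a4be10076a6e'))
print(rp.can_fetch('', 'http://www.jianshu.com/search?q=python&page=1&typr=collections'))

The results are True and False.

mtime(): returns the time robots.txt was last fetched and parsed
modified(): sets the time robots.txt was last fetched and parsed to the current time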
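The behavior of parse(), can_fetch(), and mtime() can be tried offline by feeding parse() a list of lines directly. This is a sketch: the robots.txt rules and the example.com URLs below are hypothetical, not Jianshu's real file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, given as a list of lines
robots_lines = [
    'User-agent: *',
    'Disallow: /search',
    'Allow: /p/',
]

rp = RobotFileParser()

# Nothing has been read or parsed yet, so every URL is reported as not fetchable
before = rp.can_fetch('*', 'http://www.example.com/p/abc')
print(before)   # False

rp.parse(robots_lines)

allowed = rp.can_fetch('*', 'http://www.example.com/p/abc')
blocked = rp.can_fetch('*', 'http://www.example.com/search?q=python')
print(allowed)  # True
print(blocked)  # False

# parse() records the time of the check, so mtime() is now non-zero
print(rp.mtime() > 0)   # True
```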

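Python Study Notes 3

Posted on 2019-08-01 | Updated: 2019-08-01 | Category: python

¶The urllib library

¶The request module

¶The BaseHandler class

The parent class of all handlers; its basic methods include default_open() and protocol_request().

¶Handler subclasses

¶HTTPDefaultErrorHandler

Handles HTTP response errors

¶HTTPRedirectHandler

Handles redirects

¶HTTPCookieProcessor

Handles cookies:

¶Dictionary-style handling

Create the cookie jar with http.cookiejar.CookieJar()

Create a handler with HTTPCookieProcessor, passing in the cookie jar

Create an opener with build_opener() and call open()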

import http.cookiejar
import urllib.request
cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com/')
for item in cookie:
    print(item.name+'='+item.value)

¶Saving cookies

Create the cookie jar with MozillaCookieJar(filename) or LWPCookieJar(filename)

Save it to the file with cookie.save(ignore_expires=True, ignore_discard=True)

import http.cookiejar
import urllib.request
filename = 'cookies.txt'
cookie = http.cookiejar.MozillaCookieJar(filename)
# Mozilla format above; for LWP format use:
# cookie = http.cookiejar.LWPCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com/')
cookie.save(ignore_expires=True, ignore_discard=True)

¶Loading cookies

Read the cookie file with cookie.load(filename, ignore_discard=True, ignore_expires=True)

import http.cookiejar
import urllib.request
filename = 'cookies.txt'
cookie = http.cookiejar.MozillaCookieJar()
# Mozilla format above; for LWP format use:
# cookie = http.cookiejar.LWPCookieJar()
cookie.load(filename, ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com/')
print(response.read().decode('utf-8'))

¶ProxyHandler

Sets a proxy

Create a ProxyHandler, passing a dictionary as the argument

Build an opener with build_opener(handler)

Open the URL with open()

from urllib.request import ProxyHandler, build_opener
proxy_handler = ProxyHandler({
    'http': 'http://127.0.0.1:8080',
    'https': 'https://127.0.0.1:8080'
}
)  # set the proxy addresses yourself
opener = build_opener(proxy_handler)
response = opener.open('https://www.baidu.com')
print(response.read().decode('utf-8'))

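¶HTTPPasswordMgr

Manages passwords

¶HTTPBasicAuthHandler

Manages authentication:

Build an HTTPPasswordMgrWithDefaultRealm object p

Use add_password() to add the username, password, and URL

Pass p when constructing the HTTPBasicAuthHandler object

Create the opener with build_opener()

Open the URL with open()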

from urllib.request import HTTPPasswordMgrWithDefaultRealm, HTTPBasicAuthHandler, build_opener
username = 'username'
password = 'password'
url = 'http://www.baidu.com/'   # use a page that requires authentication
p = HTTPPasswordMgrWithDefaultRealm()
p.add_password(None, url, username, password)
handler = HTTPBasicAuthHandler(p)
opener = build_opener(handler)
result = opener.open(url)

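¶The error module

¶The URLError class

Inherits from OSError and is the base exception class of the error module

Its reason attribute holds the cause of the exception

¶The HTTPError class

A subclass of URLError, dedicated to HTTP request errors

Attributes:

code: the HTTP status code; 404 means the page does not exist, 500 means an internal server error
reason: the reason for the error
headers: the response headers

For example: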

from urllib import request, error
try:
    response = request.urlopen('https://yelangpi.github.io/index.htm')  # a page that does not exist
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')

¶Note

The reason attribute may be an object rather than a string, for example an instance of socket.timeout:

import socket
import urllib.request
import urllib.error
try:
    response = urllib.request.urlopen('https://www.baidu.com', timeout=0.01)
except urllib.error.URLError as e:
    print(type(e.reason))
    if isinstance(e.reason, socket.timeout):
        print('TIME OUT')
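Because HTTPError is a subclass of URLError, the more specific class has to be checked first when both are handled. The sketch below shows that ordering; the helper name describe_error is hypothetical, and the exceptions are built by hand (with a made-up example.com URL) so that it runs without network access.

```python
import io
import socket
import urllib.error


def describe_error(exc):
    """Return a short description for an exception raised by urlopen()."""
    # HTTPError is a subclass of URLError, so it must be checked first
    if isinstance(exc, urllib.error.HTTPError):
        return 'HTTP {}: {}'.format(exc.code, exc.reason)
    if isinstance(exc, urllib.error.URLError):
        # reason may be an exception object rather than a string
        if isinstance(exc.reason, socket.timeout):
            return 'TIME OUT'
        return 'URL error: {}'.format(exc.reason)
    return 'unexpected: {!r}'.format(exc)


# Exceptions constructed manually instead of coming from a real request
http_exc = urllib.error.HTTPError('http://example.com/missing', 404,
                                  'Not Found', {}, io.BytesIO(b''))
timeout_exc = urllib.error.URLError(socket.timeout('timed out'))

print(describe_error(http_exc))      # HTTP 404: Not Found
print(describe_error(timeout_exc))   # TIME OUT
```

In real code the same ordering applies to the except clauses: except HTTPError comes before except URLError.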

test2

Posted on 2019-07-31 | Updated: 2019-07-31 | Category: Test

Python Study Notes 2

Posted on 2019-07-31 | Updated: 2019-07-31 | Category: python

¶The urllib library

¶The request module

¶urlopen()

API:

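urlopen(url, data=None, timeout=socket._GLOBAL_DEFAULT_TIMEOUT, *, cafile=None, capath=None, cadefault=False, context=None)

url: a URL string or a Request object
data: a dict must be URL-encoded and converted to bytes, for example:

data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')

timeout: the timeout in seconds; if no response arrives within this time, a URLError (from the urllib.error module) is raised.
cafile: a CA certificate file
capath: the path to a directory of CA certificates
cadefault: deprecated, defaults to False
context: an ssl.SSLContext that specifies the SSL settings, enabling encrypted SSL transport.

Note that cafile, capath, and cadefault were removed in Python 3.13 in favor of context.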

¶The Request class

urllib.request.Request

class Request:

    def __init__(self, url, data=None, headers={},
                 origin_req_host=None, unverifiable=False,
                 method=None):

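url: the URL to request
data: same as above; it must be of type bytes.
headers: the request headers; they can also be added by calling add_header().

Common headers include:

Accept: the media types the browser can accept, e.g. text/html for HTML documents; * means any type
Accept-Encoding
Accept-Language
Connection:

For example, Connection: keep-alive means that after a page has finished loading, the TCP connection used to carry HTTP data between client and server is not closed; if the client requests another page from the same server, the established connection is reused.

For example, Connection: close means that once a request completes, the TCP connection between client and server is closed, and a new TCP connection must be established for the next request.
Host
Referer
User-Agent:

User-Agent: Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; CIBA; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; .NET4.0C; InfoPath.2; .NET4.0E)

Used to masquerade as a browser
Cache-Control
Cookie

Usage: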

headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
    'Host': 'httpbin.org'
}

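origin_req_host: the host name or IP address of the requesting party
unverifiable: whether the request is unverifiable; defaults to False
method: a string naming the request method, such as GET, POST, or PUT.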
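The parameters above can be combined into one Request object and inspected before anything is sent. This is a sketch: the form field name/Germey and the Referer value are illustrative assumptions, and the actual urlopen() call is left commented out so the example runs offline.

```python
from urllib import parse, request

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
    'Host': 'httpbin.org',
}
# A dict must be URL-encoded and then converted to bytes
data = bytes(parse.urlencode({'name': 'Germey'}), encoding='utf8')

req = request.Request(url=url, data=data, headers=headers, method='POST')
# Headers can also be added after construction
req.add_header('Referer', 'http://httpbin.org/')

print(req.get_method())             # POST
print(req.data)                     # b'name=Germey'
# Request stores header names capitalised, e.g. 'User-agent'
print(req.get_header('User-agent'))
print(req.get_header('Referer'))    # http://httpbin.org/

# Sending is left out here:
# response = request.urlopen(req)
```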

¶HTTPResponse

¶Return type of urlopen()


import urllib.request
response = urllib.request.urlopen('http://www.baidu.com')
print(type(response))

Output: <class 'http.client.HTTPResponse'>

¶Methods

read(): returns the content of the page

print(response.read().decode('utf-8'))

readinto()

getheader(name)

getheaders()

fileno()

¶Attributes

msg

version

status: 200 means success, 404 means the page was not found

reason

debuglevel

closed

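Python Study Notes 1

Posted on 2019-07-22 | Updated: 2019-07-22 | Category: python

¶Anaconda

Go to "Start → Anaconda3 (64-bit) → right-click Anaconda Prompt → Run as administrator". In Anaconda Prompt, run conda list to see the names and versions of the installed packages.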

chromedriver – Chrome

geckodriver – Firefox

PhantomJS

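Git Study Notes

Posted on 2019-07-21 | Updated: 2019-07-21 | Category: git

¶Installing git

Set your name and email

git config --global user.name "Your Name"
git config --global user.email "email@example.com"

Create a directory

mkdir name

Show the current directory

pwd

Initialize a git repository

git init

Create an SSH key

ssh-keygen -t rsa -C "youremail@example.com"

The .ssh directory in your home directory then contains two files, id_rsa and id_rsa.pub: id_rsa is the private key and id_rsa.pub is the public key.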

¶Using git

Add and commit a file

git add file

git commit -m "message"

Working tree status

git status

View changes

git diff file

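¶Version control

HEAD is the current version, HEAD^ is the previous version, HEAD^^ is the one before that, and HEAD~100 is 100 versions back.

Switch versions

git reset --hard commit_id/HEAD~number

View the commit history / command history

git log
git reflog

--pretty=oneline (adding this option gives more concise output)

Concise graph

git log --graph --pretty=oneline --abbrev-commit

Straighten a forked commit history into a single line

git rebase

Discard changes in the working tree

git checkout -- file

Delete a file / delete a file from the repository

rm file

git rm file
git commit -m "message"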

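¶Remote repositories

Link a remote

git remote add origin address

address is the GitHub SSH URL, and origin is the name given to the remote repository.

Gitee is a platform similar to GitHub in China; its advantage is faster download speed.

Push

git push (-u) origin master/branch

Add -u on the first push to link the local master with the remote one.

Clone

git clone address

View remote information

git remote (-v)

Create a local branch that tracks another branch of the remote repository

git checkout -b branch origin/branch

Fetch and merge a branch from the remote

git pull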

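¶Using branches

List branches: git branch

Create a branch: git branch <name>

Switch branches: git checkout <name>

Create and switch to a branch: git checkout -b <name>

Merge a branch into the current branch: git merge <name>

Delete a branch: git branch -d <name>

Force-delete a branch: git branch -D <name>

Branch merge graph: git log --graph

Disable fast-forward mode and add a message when merging

git merge --no-ff -m "message" branch

¶Stashing the working tree

Stash: git stash

List: git stash list

Restore: git stash apply

Delete: git stash drop

Restore and delete: git stash pop

Restore a specific stash: git stash apply stash@{number}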