The urllib Standard Library in Python 3

[TOC]

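Preface

Anyone doing web development deals with HTTP requests all the time, so urllib is a module you have to know. It is also a staple of crawlers, and there is an excellent third-party alternative, requests.
This article covers both.
Note: everything here was tested under Python 3.5.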

urllib

urllib is a package made up of four modules:

urllib.request opens and reads URLs.
urllib.error contains the exceptions raised by urllib.request.
urllib.parse parses URLs.
urllib.robotparser parses robots.txt files.
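
To make the role of urllib.parse a bit more concrete, here is a minimal sketch of its most common helpers. The example.com URLs and the field names are made up for illustration; nothing is sent over the network.

```python
from urllib.parse import parse_qs, urlencode, urljoin, urlparse

# Split a URL into its components (the URL is a made-up example).
parts = urlparse("http://example.com/api/users?page=2&tag=python#top")
print(parts.scheme, parts.netloc, parts.path)  # http example.com /api/users
print(parse_qs(parts.query))                   # {'page': ['2'], 'tag': ['python']}

# Build a query string (or a form body) from a dict; values are percent-encoded.
query = urlencode({"name": "test 1", "lang": "zh-CN"})
print(query)

# Resolve a link found in a page against the URL of that page.
joined = urljoin("http://example.com/api/users", "/static/logo.png")
print(joined)                                  # http://example.com/static/logo.png
```

urlencode returns a str, which is why the POST examples further down call .encode() on its result before handing it to urlopen.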

If you write crawlers, please pay attention to urllib.robotparser and be a well-behaved programmer. ^_^
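
As a rough sketch of what being well-behaved looks like in code, the snippet below checks URLs against a set of robots.txt rules. The rules, the crawler name "my-crawler" and the example.com URLs are invented for this example; a real crawler would more likely call set_url() and read() to download the site's actual robots.txt.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt, given as a list of lines so no network access is needed.
rules = [
    "User-agent: *",
    "Disallow: /admin/",
]

robots = RobotFileParser()
robots.parse(rules)

# Ask before fetching: can_fetch(user_agent, url) returns True or False.
allowed = robots.can_fetch("my-crawler", "http://example.com/index.html")
blocked = robots.can_fetch("my-crawler", "http://example.com/admin/users")
print(allowed)  # True
print(blocked)  # False
```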

request

Two lines of code are enough to fetch a page.

GET request

from urllib.request import urlopen
print(urlopen("http://www.baidu.com").read())

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
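It supports the http, https, and ftp protocols.
urlopen returns a file-like object that provides the following methods:

read(), readline(), readlines(), fileno(), close(): these work exactly as they do on a file object;
info(): returns an http.client.HTTPMessage object holding the headers sent back by the remote server;
getcode(): returns the HTTP status code, for example 200 when the request succeeded. Error statuses such as 404 (not found) are raised by urlopen as urllib.error.HTTPError instead of being returned;
geturl(): returns the URL that was actually retrieved, which can differ from the requested one after a redirect.

Parameters:

url can be a string or a Request object.
data is the payload for a POST request, given as a bytes object.
timeout is the timeout in seconds.
cafile, capath, and cadefault specify CA certificates for HTTPS (deprecated since Python 3.6 in favor of context).
context is an ssl.SSLContext, likewise used for HTTPS requests.

Note that this function speaks HTTP/1.1 and automatically adds Connection: close to the request headers.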
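
Since timeout and the exceptions in urllib.error tend to show up together in real code, here is a minimal sketch that combines them. The helper name fetch and the 5-second default are assumptions made for this example, not part of urllib.

```python
import socket
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

def fetch(url, timeout=5.0):
    """Return the response body as bytes, or None if the request failed.

    Hypothetical helper for illustration only.
    """
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.read()
    except HTTPError as err:
        # The server answered, but with an error status such as 403 or 404.
        # HTTPError is a subclass of URLError, so it has to be caught first.
        print("HTTP error:", err.code, err.reason)
    except URLError as err:
        # No usable answer at all: DNS failure, refused connection, unknown scheme...
        print("URL error:", err.reason)
    except socket.timeout:
        # A timeout while reading the body can surface as a bare socket.timeout.
        print("timed out after", timeout, "seconds")
    return None
```

Calling fetch("http://www.baidu.com") would then return the page bytes, while a blocked or unreachable site prints the reason and returns None.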

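POST request

This can also be done with urllib.request.urlopen: just pass data.
A second way is to pass a Request object as url, giving data to the Request constructor.
If there is no data to send, only the second way works, and the Request constructor must be given method='POST' explicitly.

The full definition of the Request class is:
urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

Parameters:

url is the URL as a string.
data is the data to POST. If it is given, method defaults to 'POST'.
headers is a dict of HTTP headers.
origin_req_host is the host of the original request, used when handling cookies.
unverifiable marks a request the user had no chance to approve (RFC 2965), such as an automatically fetched image; it also matters only for cookie handling and is unrelated to HTTPS certificate checks.
method is the HTTP method.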

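An example:

from urllib import request
from urllib import parse
# As this shows, front-end validation only stops casual users; a programmer can send the HTTP request directly and bypass the front-end JS checks.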
data = {'name':'test1','email':'[email protected]','passwd':'123456'}
response  = request.urlopen('http://awesome.go2live.cn/api/users',parse.urlencode(data).encode('utf-8'))
print(response.getcode())
print(response.read())
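
The rules about data and method described above can be checked without sending anything, because a Request object reports the method it would use. A small sketch, using a placeholder example.com endpoint that is not from the article:

```python
from urllib.parse import urlencode
from urllib.request import Request

URL = "http://example.com/api/users"  # placeholder endpoint for illustration

# No data: the request defaults to GET.
plain = Request(URL)
print(plain.get_method())       # GET

# With data (as bytes): the method becomes POST automatically.
form = urlencode({"name": "test1"}).encode("utf-8")
with_data = Request(URL, data=form)
print(with_data.get_method())   # POST

# No body, but still a POST: the method has to be given explicitly.
empty_post = Request(URL, method="POST")
print(empty_post.get_method())  # POST
```

Any of these objects can be passed to urlopen in place of a URL string.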

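Custom headers

As mentioned above, pass the headers argument when constructing the Request object. You can also call the Request's add_header method afterwards.
This is mainly used to fake the User-Agent and Referer (HTTP_REFERER) headers, because back ends often use these two to block requests.

My blog, for example, has such blocking in place.

A plain request fails.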

from urllib.request import urlopen
print(urlopen("http://www.go2live.cn").read())

Output:

Traceback (most recent call last):
…
urllib.error.HTTPError: HTTP Error 403: Forbidden

With a suitable User-Agent the site can be accessed normally.

from urllib import request
req = request.Request('http://www.go2live.cn',headers={'User-Agent':'Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11'})
response  = request.urlopen(req)
print(response.getcode())

Output:

200

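The Referer header, in turn, is used for hotlink protection: when requesting certain resources, Referer must point to an approved site, otherwise the server refuses with 403.
Another header to watch is Content-Type.
For a POST request, urllib sets it to application/x-www-form-urlencoded automatically.
Many APIs now expect JSON, though, in which case you need application/json.
File uploads differ again: they use multipart/form-data.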
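
For the JSON case, a minimal sketch with plain urllib might look like the following. The helper name build_json_request, the example.com endpoint and the payload are assumptions made for this example; the request is only built here, not sent.

```python
import json
from urllib.request import Request

def build_json_request(url, payload):
    """Return a POST Request whose body is payload serialized as JSON.

    Hypothetical helper for illustration only.
    """
    body = json.dumps(payload).encode("utf-8")
    # Setting Content-Type ourselves keeps urllib from falling back to
    # application/x-www-form-urlencoded.
    return Request(url, data=body, headers={"Content-Type": "application/json"})

req = build_json_request("http://example.com/api/users", {"name": "test1"})
# Headers can still be added after construction.
req.add_header("User-Agent", "my-app/0.0.1")

print(req.get_method())    # POST
# Request stores header names capitalized, e.g. 'Content-type'.
print(req.header_items())
```

Sending it is then a matter of passing req to urlopen.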

Adding cookies

This is mainly for pages that require a login.
It is slightly more involved, since it needs an opener.
The http.cookiejar module stores the cookies.
request.HTTPCookieProcessor handles them.
A complete example:

from urllib import request
from urllib import parse
import http.cookiejar
import hashlib

URL_ROOT = 'http://awesome.go2live.cn'

cookie = http.cookiejar.CookieJar()  # object that stores the cookies
handler = request.HTTPCookieProcessor(cookie)  # handler that reads and sends cookies
opener = request.build_opener(handler)
response = opener.open(URL_ROOT)
print("before login")
for item in cookie:
    print('Name={0}, Value={1}'.format(item.name, item.value))
    
data = {'email':'[email protected]','passwd':'123456'}
data['passwd'] = data['email']+":"+data['passwd']
data['passwd'] = hashlib.sha1(data['passwd'].encode()).hexdigest()
req = request.Request(URL_ROOT+'/api/authenticate', parse.urlencode(data).encode())
response = opener.open(req)
print(response.read())
print("after login")
for item in cookie:
    print('Name={0}, Value={1}'.format(item.name, item.value))

Output:

before login
b'{"admin": 0, "created_at": 1486789465.7326, "id": "001486789465697a2db999d99f84506a24a457bee0eab76000", "name": "test", "email": "[email protected]", "passwd": "******", "image": "http://www.gravatar.com/avatar/bf58432148b643a8b4c41c3901b81d1b?d=mm&s=120"}'
after login
Name=awesession, Value=001486789465697a2db999d99f84506a24a457bee0eab76000-1486887360-220a1b13969736bb0f868cfef9076023a7ea3b02

Uploading files

This still uses urlopen; the painful part is building data by hand.

#!/usr/bin/env python3
import mimetypes
import urllib.request

# Boundary as declared in the Content-Type header; in the body each part starts with '--' + boundary.
BOUNDARY = '---------------------------96951961826872'

def encode_multipart_data(data, files):
    """Build a multipart/form-data body and its headers from form fields and file paths."""

    def get_content_type(filename):
        return mimetypes.guess_type(filename)[0] or 'application/octet-stream'

    def encode_field(field_name):
        # Part headers end with an empty line (\r\n\r\n); the value ends with \r\n.
        return ('--%s\r\n' % BOUNDARY,
                'Content-Disposition: form-data; name="%s"\r\n\r\n' % field_name,
                str(data[field_name]), '\r\n')

    def encode_file(field_name):
        filename = files[field_name]
        with open(filename, 'rb') as f:
            content = f.read()
        return ('--%s\r\n' % BOUNDARY,
                'Content-Disposition: form-data; name="%s"; filename="%s"\r\n' % (field_name, filename),
                'Content-Type: %s\r\n\r\n' % get_content_type(filename),
                content, '\r\n')

    lines = []
    for name in data:
        lines.extend(encode_field(name))
    for name in files:
        lines.extend(encode_file(name))
    # The closing delimiter has an extra '--' after the boundary.
    lines.append('--%s--\r\n' % BOUNDARY)
    # Text pieces are encoded to bytes; file contents are already bytes.
    body = b''.join(x.encode('utf-8') if isinstance(x, str) else x for x in lines)
    headers = {'Content-Type': 'multipart/form-data; boundary=%s' % BOUNDARY,
               'Content-Length': str(len(body))}

    return body, headers

def main():
    url = 'http://awesome.go2live.cn'
    data = {}
    files = {'notePhoto': '01.jpg'}
    body, headers = encode_multipart_data(data, files)
    req = urllib.request.Request(url, body, headers)
    response = urllib.request.urlopen(req)
    print(response.getcode())

if __name__ == '__main__':
    main()

Downloading files

Nothing special here.

from urllib import request
response = request.urlopen('http://stock.gtimg.cn/data/get_hs_xls.php?id=ranka&type=1&metric=chr')
with open('test1.xls', 'wb') as fd:
    fd.write(response.read())
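
The version above reads the whole body into memory with read(). For large files, one possible alternative is to copy the response to disk in chunks. This is only a sketch: the helper name download and the 64 KB chunk size are assumptions chosen for this example.

```python
import shutil
from urllib.request import urlopen

def download(url, dest_path, chunk_size=64 * 1024):
    """Copy the response body to dest_path chunk by chunk.

    Hypothetical helper for illustration only.
    """
    # The response is file-like, so copyfileobj can read it piece by piece
    # instead of holding the whole file in memory.
    with urlopen(url) as response, open(dest_path, "wb") as fd:
        shutil.copyfileobj(response, fd, chunk_size)
```

It would be called as download(url, 'test1.xls') with the same URL as above.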

Debugging

Sometimes you need to see exactly what the HTTP request looks like.
The approach below prints the sent and received packets to the screen.

from urllib import request

http_handler = request.HTTPHandler(debuglevel=1)
https_handler = request.HTTPSHandler(debuglevel=1)
opener = request.build_opener(http_handler,https_handler)
request.install_opener(opener)
response = request.urlopen('http://m.baidu.com')
print(response.read())

Output:

send: b'GET http://m.baidu.com HTTP/1.1\r\nAccept-Encoding…
reply: 'HTTP/1.1 200 OK\r\n'
header: Cache-Control header: Content-Length header: Content-Type…

The output is too long, so the rest is cut off with ….

The requests library

A very capable third-party library. It supports the following features:

International Domains and URLs
Keep-Alive & Connection Pooling
Sessions with Cookie Persistence
Browser-style SSL Verification
Basic/Digest Authentication
Elegant Key/Value Cookies
Automatic Decompression
Automatic Content Decoding
Unicode Response Bodies
Multipart File Uploads
HTTP(S) Proxy Support
Connection Timeouts
Streaming Downloads
.netrc Support
Chunked Requests
Thread-safety

Official documentation

GET request

Just as easy: two lines of code are enough.

import requests
print(requests.get('http://blog.go2live.cn').content)

POST request

The requests API is very simple: call get for a GET request and post for a POST request.

import requests

print(requests.post('http://awesome.go2live.cn/api/users',data={'name':'test1','email':'[email protected]','passwd':'123456'}).content)

Two lines and done, noticeably shorter than the four lines urllib needed.

To send JSON instead, pass the payload through the json parameter.
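
To see what the json parameter changes, the sketch below builds the request without sending it and inspects the result. The example.com endpoint and the payload are placeholders for illustration; in everyday code you would simply call requests.post(url, json=payload).

```python
import json
import requests

payload = {"name": "test1", "passwd": "123456"}

# Build the request without sending it, so the effect of json= is visible.
prepared = requests.Request(
    "POST", "http://example.com/api/users", json=payload
).prepare()

print(prepared.method)                   # POST
print(prepared.headers["Content-Type"])  # application/json
print(prepared.body)                     # the payload serialized as JSON
```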

Custom headers

This is easy too: just add one argument.

>>> url = 'https://api.github.com/some/endpoint'
>>> headers = {'user-agent': 'my-app/0.0.1'}

>>> r = requests.get(url, headers=headers)

Handling cookies

Cookies are available directly from the cookies attribute of the response object.

>>> url = 'http://example.com/some/cookie/setting/url'
>>> r = requests.get(url)

>>> r.cookies['example_cookie_name']
'example_cookie_value'

To send cookies, pass a dict through the cookies argument.

>>> url = 'http://httpbin.org/cookies'
>>> cookies = dict(cookies_are='working')

>>> r = requests.get(url, cookies=cookies)
>>> r.text
'{"cookies": {"cookies_are": "working"}}'

A RequestsCookieJar allows finer-grained settings, such as the domain and path of each cookie.

>>> jar = requests.cookies.RequestsCookieJar()
>>> jar.set('tasty_cookie', 'yum', domain='httpbin.org', path='/cookies')
>>> jar.set('gross_cookie', 'blech', domain='httpbin.org', path='/elsewhere')
>>> url = 'http://httpbin.org/cookies'
>>> r = requests.get(url, cookies=jar)
>>> r.text
'{"cookies": {"tasty_cookie": "yum"}}'

Uploading files

Uploading is simple as well: use the post method and pass a files argument.

>>> url = 'http://httpbin.org/post'
>>> files = {'file': ('report.xls', open('report.xls', 'rb'), 'application/vnd.ms-excel', {'Expires': '0'})}

>>> r = requests.post(url, files=files)
>>> r.text
{
  ...
    "files": {
        "file": "<censored...binary...data>"
    },
    ...
}

Downloading files

This one feels really simple. See below.

import requests
url = 'http://stock.gtimg.cn/data/get_hs_xls.php?id=ranka&type=1&metric=chr'
r = requests.get(url) 
with open('test.xls', "wb") as code:
    code.write(r.content)

Fetching everything in one go can use too much memory when the file is large.
The download approach recommended by requests is:

import requests
r = requests.get('http://stock.gtimg.cn/data/get_hs_xls.php?id=ranka&type=1&metric=chr',stream=True)
with open('test.xls', 'wb') as fd:
    for chunk in r.iter_content(chunk_size=128):
        fd.write(chunk)

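Closing notes

Having compared urllib and requests, I have decided to use requests from now on. It is just too convenient.
Here is the git repository; go ahead and fork it.

bjmayor's other blog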
