Two Python crawler libraries

When writing Python crawlers, you need to simulate and send network requests. The two main libraries for this are the third-party requests library and Python's built-in urllib library. requests is generally recommended; it is a higher-level HTTP client built on top of urllib3 that is more convenient to use.

Two Python crawler libraries

urllib library

The urllib package contains the following modules:

  • urllib.request – open and read URLs.
  • urllib.error – contains the exceptions raised by urllib.request.
  • urllib.parse – parse URLs.
  • urllib.robotparser – parse robots.txt files.
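
The last two modules are not exercised in the cases below; here is a minimal sketch of what they offer (example.com is a placeholder URL, and the robots.txt check needs network access):

from urllib.parse import urlparse, urlencode
from urllib.robotparser import RobotFileParser

# Split a URL into its components
parts = urlparse('https://www.example.com/search?q=python')
print(parts.scheme, parts.netloc, parts.path, parts.query)

# Build a query string from a dict
print(urlencode({'q': 'python', 'page': 2}))   # q=python&page=2

# Check whether robots.txt allows crawling a path
rp = RobotFileParser('https://www.example.com/robots.txt')
rp.read()
print(rp.can_fetch('*', 'https://www.example.com/search'))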

Using the urllib library

To make an HTTP request with urllib, you first build the request (either a URL string or a Request object) and pass it to urllib.request.urlopen(), which performs the HTTP request.

What is returned is an http.client.HTTPResponse object whose body contains the HTML. Calling .read() returns bytes, and .decode() converts them into a str, after which Chinese characters display correctly.
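
A minimal sketch of that flow (any reachable URL will do; baidu.com is reused from the cases below):

import urllib.request

response = urllib.request.urlopen('http://www.baidu.com')
print(type(response))                      # <class 'http.client.HTTPResponse'>
html = response.read().decode('utf-8')     # bytes -> str
print(html[:200])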

urllib.request

urllib.request defines some functions and classes for opening URLs, including authorization verification, redirection, browser cookies, etc.

urllib.request can simulate a browser request initiation process.

We can use the urlopen method of urllib.request to open a URL; its signature is as follows:

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

  • url: the URL to open.
  • data: additional data to send to the server; defaults to None. When data is supplied, the request is sent as a POST.
  • timeout: timeout for the request, in seconds.
  • cafile and capath: cafile is a CA certificate file and capath is a directory of CA certificates; they can be supplied for HTTPS requests (newer Python versions prefer context instead).
  • cadefault: has been deprecated.
  • context: ssl.SSLContext type, used to specify SSL settings.

Experimental case:

import urllib.error
import urllib.parse
import urllib.request

# GET request: urlopen returns an http.client.HTTPResponse object
response = urllib.request.urlopen('http://www.baidu.com')
print(response.read().decode('utf-8'))

# POST request: passing data (as bytes) turns the request into a POST
data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf-8')
response = urllib.request.urlopen('http://www.baidu.com', data=data)
print(type(response))
print(response.status)
print(response.getheaders())
print(response.getheader('Server'))

# A missing page raises urllib.error.HTTPError
try:
    response = urllib.request.urlopen("http://www.baidu.com/no.html")
except urllib.error.HTTPError as e:
    if e.code == 404:
        print(404)   # 404
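
The timeout parameter from the signature above is not used in the case; here is a minimal sketch of handling a timeout (the 0.01-second limit is deliberately tiny so the error is triggered):

import socket
import urllib.error
import urllib.request

try:
    response = urllib.request.urlopen('http://www.baidu.com', timeout=0.01)
except urllib.error.URLError as e:
    # connection timeouts surface as URLError wrapping socket.timeout
    if isinstance(e.reason, socket.timeout):
        print('request timed out')
except socket.timeout:
    # read timeouts may raise socket.timeout directly
    print('request timed out')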

Simulating request headers

When crawling web pages, we generally need to simulate the request headers (page header information). For this we use the urllib.request.Request class:

class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

  • url: the URL to request.
  • data: Additional data objects to send to the server, defaults to None.
  • headers: The header information of the HTTP request, in dictionary format.
  • origin_req_host: The host name or IP address of the original request.
  • unverifiable: Rarely used; indicates whether the request is unverifiable (that is, whether the user had no chance to approve it). The default is False.
  • method: The request method, such as GET, POST, DELETE, PUT, etc.

import urllib.parse
from urllib import request

# Request headers
headers = {
    "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36'
}
# wd = {"wd": "hello"}
# url = "http://www.baidu.com/s?"
url = 'https://www.runoob.com/?s='   # Runoob tutorial search page
keyword = 'Python Tutorial'
key_code = urllib.parse.quote(keyword)   # URL-encode the keyword
url_all = url + key_code

req = request.Request(url_all, headers=headers)
response = request.urlopen(req)
print(type(response))
print(response)
res = response.read().decode()
print(type(res))
print(res)
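
The data and method parameters of Request are not used in the case above; here is a minimal sketch of a POST built the same way (httpbin.org is an echo service, not part of the original example):

import urllib.parse
import urllib.request

url = 'http://httpbin.org/post'
headers = {'User-Agent': 'Mozilla/5.0'}
data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf-8')

req = urllib.request.Request(url, data=data, headers=headers, method='POST')
response = urllib.request.urlopen(req)
print(response.read().decode('utf-8'))   # httpbin echoes the form fields back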

requests library

With the requests library, you call requests.get with the URL and parameters; it returns a Response object, and printing that object shows the response status code.

Advantages of requests:
For Python crawlers, the requests library is generally recommended because it is more convenient than urllib: requests can build and send GET and POST requests in a single call (as sketched below), whereas with urllib.request you must first construct the request and then send it with urlopen.
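
A minimal sketch of that one-call convenience (httpbin.org is used here as an echo service):

import requests

data = {'word': 'hello'}
response = requests.post('http://httpbin.org/post', data=data)
print(response.status_code)
print(response.json()['form'])   # {'word': 'hello'}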

Experimental case – GET request

import requests

# 1. Basic GET request
response = requests.get('http://www.baidu.com')
print('response\n', response)

# 2. GET request with parameters in the URL
response2 = requests.get('http://www.baidu.com/get?name=germy&age=22')
print('response2\n', response2)

# 3. Pass the parameters via params to achieve the same result as in 2
data = {
    'name': 'germy',
    'age': 22
}
response3 = requests.get('http://www.baidu.com', params=data)
print('response3\n', response3)

# 4. Parse JSON (if the response body is JSON, .json() returns it directly)
response4 = requests.get('http://httpbin.org/get')
print('response4\n', response4.json())

# 5. Get binary data (image, video, ...)
response5 = requests.get('http://github.com/favicon.ico')
with open('icon.ico', 'wb') as f:
    f.write(response5.content)

# 6. Add headers (pass the headers parameter)
headers = {
    'User-Agent': '...'
}
response6 = requests.get('http://www.baidu.com', headers=headers)
print('response6\n', response6)
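
requests also accepts a timeout and raises its own exceptions; here is a minimal sketch paralleling the urllib timeout handling above (again with a deliberately tiny limit):

import requests

try:
    response = requests.get('http://www.baidu.com', timeout=0.01)
except requests.exceptions.Timeout:
    print('request timed out')
except requests.exceptions.RequestException as e:
    print('request failed:', e)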

Experimental case – Crawl the web

import requests  

url = 'http://httpbin.org/get'
params = {  
    'name': 'germey',  
    'age': 25
}  
r = requests.get(url, params=params)
print(type(r.json()))
print(r.json())
print(r.json().get('args').get('age'))

Experimental case – Response

The response is the data the server returns after a request is sent. In the examples above we obtained the response body through the text and content attributes of the Response object. Other attributes give further information, such as the status code, response headers, cookies, the URL, and the request history.

import requests
r = requests.get('http://www.baidu.com')
print(type(r.status_code), r.status_code)
print(type(r.headers), r.headers)
print(type(r.cookies), r.cookies)
print(type(r.url), r.url)
print(type(r.history), r.history)

In the example above, status_code, headers, cookies, url, and history are the response status code, response headers, cookies, the final URL, and the request history, respectively.

Note that status_code is the HTTP status code of the response: for example, 200 means the request succeeded and 404 means the resource does not exist. In crawler code we can therefore use the status code to check whether a request succeeded and handle the result accordingly.

import requests

r = requests.get('http://www.baidu.com')
if not r.status_code == requests.codes.ok:
    exit()   # stop here if the request failed
else:
    print('Request Successfully!')

Here, we use requests.codes.ok to represent the 200 status, so we do not have to hard-code numbers such as 200 ourselves, which is more convenient. Of course, there are other built-in status codes; some of the more commonly used ones are listed below for reference (a short usage sketch follows the list):

# Informational status code   
100: ('continue',),  
101: ('switching_protocols',),  
102: ('processing',),  
103: ('checkpoint',),  
122: ('uri_too_long', 'request_uri_too_long'),  

# Success status code   
200: ('ok', 'okay', 'all_ok', 'all_okay', 'all_good', '\\o/', '✓'),  
201: ('created',),  
202: ('accepted',),  
203: ('non_authoritative_info', 'non_authoritative_information'),  
204: ('no_content',),  
205: ('reset_content', 'reset'),  
206: ('partial_content', 'partial'),  
207: ('multi_status', 'multiple_status'),  
208: ('already_reported',),  
226: ('im_used',),  

# Redirect status code   
300: ('multiple_choices',),  
301: ('moved_permanently', 'moved', '\\o-'),  
302: ('found',),  
303: ('see_other', 'other'),  
304: ('not_modified',),  
305: ('use_proxy',),  
306: ('switch_proxy',),  
307: ('temporary_redirect', 'temporary_moved', 'temporary'),  
308: ('permanent_redirect',  
      'resume_incomplete', 'resume',), # These 2 to be removed in 3.0  

# Client error status code   
400: ('bad_request', 'bad'),  
401: ('unauthorized',),  
402: ('payment_required', 'payment'),  
403: ('forbidden',),  
404: ('not_found', '-o-'),  
405: ('method_not_allowed', 'not_allowed'),  
406: ('not_acceptable',),  
407: ('proxy_authentication_required', 'proxy_auth', 'proxy_authentication'),  
408: ('request_timeout', 'timeout'),  
409: ('conflict',),  
410: ('gone',),  
411: ('length_required',),  
412: ('precondition_failed', 'precondition'),  
413: ('request_entity_too_large',),  
414: ('request_uri_too_large',),  
415: ('unsupported_media_type', 'unsupported_media', 'media_type'),  
416: ('requested_range_not_satisfiable', 'requested_range', 'range_not_satisfiable'),  
417: ('expectation_failed',),  
418: ('im_a_teapot', 'teapot', 'i_am_a_teapot'),  
421: ('misdirected_request',),  
422: ('unprocessable_entity', 'unprocessable'),  
423: ('locked',),  
424: ('failed_dependency', 'dependency'),  
425: ('unordered_collection', 'unordered'),  
426: ('upgrade_required', 'upgrade'),  
428: ('precondition_required', 'precondition'),  
429: ('too_many_requests', 'too_many'),  
431: ('header_fields_too_large', 'fields_too_large'),  
444: ('no_response', 'none'),  
449: ('retry_with', 'retry'),  
450: ('blocked_by_windows_parental_controls', 'parental_controls'),  
451: ('unavailable_for_legal_reasons', 'legal_reasons'),  
499: ('client_closed_request',),  

# Server error status code   
500: ('internal_server_error', 'server_error', '/o\\', '✗'),  
501: ('not_implemented',),  
502: ('bad_gateway',),  
503: ('service_unavailable', 'unavailable'),  
504: ('gateway_timeout',),  
505: ('http_version_not_supported', 'http_version'),  
506: ('variant_also_negotiates',),  
507: ('insufficient_storage',),  
509: ('bandwidth_limit_exceeded', 'bandwidth'),  
510: ('not_extended',),  
511: ('network_authentication_required', 'network_auth', 'network_authentication')
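
A short sketch of looking these names up, assuming the table above matches the installed requests version:

import requests

print(requests.codes.ok)          # 200
print(requests.codes.not_found)   # 404
print(requests.codes.teapot)      # 418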
