爬虫中的Proxy-Tunnel自主切换IP

语言: CN / TW / HK

在采集数据的过程这欧中我们经常会遇到这样的问题,目标网站需要登录,获取数据的两个请求在一个IP下完成,这时我们就只需对这组请求设置相同Proxy-Tunnel。例如:Proxy-Tunnel: 12345, 该组请求在代理有效期内使用相同的代理IP。这就是亿牛云的Proxy-Tunnel自主切换IP,该模式适合一些需要登陆、Cookie缓存处理等爬虫需要精确控制IP切换时机的业务。 爬虫程序可以通过设置HTTP头Proxy-Tunnel: 随机数, 当随机数相同时,访问目标网站的代理IP相同。

#! -*- encoding:utf-8 -*-
    import urllib2
    import random
    import httplib


    class HTTPSConnection(httplib.HTTPSConnection):

        def set_tunnel(self, host, port=None, headers=None):
            httplib.HTTPSConnection.set_tunnel(self, host, port, headers)
            if hasattr(self, 'proxy_tunnel'):
                self._tunnel_headers['Proxy-Tunnel'] = self.proxy_tunnel


    class HTTPSHandler(urllib2.HTTPSHandler):
        def https_open(self, req):
            return urllib2.HTTPSHandler.do_open(self, HTTPSConnection, req, context=self._context)


    # 要访问的目标页面
    targetUrlList = [
        "https://httpbin.org/ip",
        "https://httpbin.org/headers",
        "https://httpbin.org/user-agent",
    ]

    # 代理服务器(产品官网 www.16yun.cn)
    proxyHost = "t.16yun.cn"
    proxyPort = "31111"

    # 代理验证信息
    proxyUser = "username"
    proxyPass = "password"

    proxyMeta = "http://%(user)s:%(pass)s@%(host)s:%(port)s" % {
        "host": proxyHost,
        "port": proxyPort,
        "user": proxyUser,
        "pass": proxyPass,
    }

    # 设置 http和https访问都是用HTTP代理
    proxies = {
        "http": proxyMeta,
        "https": proxyMeta,
    }

    #  设置IP切换头
    tunnel = random.randint(1, 10000)
    headers = {"Proxy-Tunnel": str(tunnel)}
    HTTPSConnection.proxy_tunnel = tunnel


    proxy = urllib2.ProxyHandler(proxies)
    opener = urllib2.build_opener(proxy, HTTPSHandler)
    urllib2.install_opener(opener)

    # 访问三次网站,使用相同的tunnel标志,均能够保持相同的外网IP
    for i in range(3):
        for url in targetUrlList:
            r = urllib2.Request(url)
            print(urllib2.urlopen(r).read())