Decode CAPTCHA by Selenium

Reference: http://www.jianshu.com/p/7ed519854be7

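There are two ways to obtain the CAPTCHA image:

1. Fetch the page source and extract the CAPTCHA image from it.

2. Take a screenshot of the page with selenium, locate the CAPTCHA element, and use Image to crop out the CAPTCHA region.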

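Each approach in detail:

1. Fetch the page source and extract the CAPTCHA image

I won't go through how to fetch the source and extract the image; if you are reading this article, that part should be no trouble.
The only point worth analyzing is the drawback: once you extract the CAPTCHA URL, you find that its content changes every time the image is opened.
Take the Sogou CAPTCHA as an example: http://weixin.sogou.com/antispider/util/seccode.php?tc=1486691901. The image is loaded into the page separately rather than embedded, so requesting the extracted URL returns a different image from the one currently shown on the page. A simpler, more convenient method is needed.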
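
The drawback described above can be checked with a small sketch like the following: fetch the same CAPTCHA URL twice and compare digests of the returned bytes. The names `same_image` and `fake_fetch` are hypothetical, and the fetcher is passed in as a parameter so the comparison can be tried without network access. Against a real site you would pass something like `lambda u: requests.get(u).content` instead.

```python
import hashlib
import itertools


def same_image(fetch, url):
    """Fetch `url` twice and report whether both responses are byte-identical.

    `fetch` is any callable that takes a URL and returns the image bytes.
    """
    first = hashlib.sha256(fetch(url)).hexdigest()
    second = hashlib.sha256(fetch(url)).hexdigest()
    return first == second


# Stand-in for a server that generates a new CAPTCHA on every request
# (hypothetical; it only imitates the behaviour described above).
_counter = itertools.count()


def fake_fetch(url):
    return ("captcha-%d" % next(_counter)).encode("utf-8")


if __name__ == "__main__":
    url = "http://weixin.sogou.com/antispider/util/seccode.php?tc=1486691901"
    print(same_image(fake_fetch, url))
```

If the two digests differ, the image saved from the URL cannot be the one the browser is showing, which is why the screenshot approach is used instead.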

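2. Take a screenshot of the page with selenium
selenium.webdriver has a built-in way to capture the current page:

a. The method built into WebDriver.Chrome can only capture the current window; if the area you need extends beyond one screen, you have to find another way.

b. The method built into WebDriver.PhantomJS can capture the entire page.

Either one works here, because CAPTCHA pages are usually simple.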

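# Imports used by the snippets below
from selenium import webdriver
from selenium.webdriver.common.by import By
from PIL import Image

# Open the CAPTCHA page
driver = webdriver.Chrome()
url = "http://weixin.sogou.com/antispider/?from=%2fweixinwap%3Fpage%3d2%26_rtype%3djson%26ie%3dutf8%26type%3d2%26query%3d%E6%91%A9%E6%8B%9C%E5%8D%95%E8%BD%A6%26pg%3dwebSearchList%26_sug_%3dn%26_sug_type_%3d%26"
driver.set_window_size(1200, 800)
# `info` comes from the surrounding crawler code and is not defined in this
# snippet; info['cookies'] is used as a dict of cookie name -> value.
cookies = info['cookies']

# Apply the cookies: load the page once so cookies can be set for its
# domain, add them, then reload the page with the cookies in place
driver.get(url)
for k, v in cookies.items():  # Python 3; the original used Python 2's iteritems()
    cookie_dict = {'name': k, 'value': v}
    driver.add_cookie(cookie_dict)
driver.get(url)

# Take the screenshot (the CrawlResult directory must already exist)
driver.get_screenshot_as_file('CrawlResult/screenshot.png')

# Get the position of the target element
element = driver.find_element(By.ID, 'seccodeImage')
left = int(element.location['x'])
top = int(element.location['y'])
right = int(element.location['x'] + element.size['width'])
bottom = int(element.location['y'] + element.size['height'])

# Crop the CAPTCHA out of the screenshot with Image
im = Image.open('CrawlResult/screenshot.png')
im = im.crop((left, top, right, bottom))
im.save('CrawlResult/code.png')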
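
One thing that may matter in practice: element.location and element.size are reported in CSS pixels, while on a high-DPI display the screenshot can contain more pixels than that, so the crop may land in the wrong place. The sketch below wraps the coordinate arithmetic from the snippet above in a helper with an optional scale factor. `crop_box` and `scale` are hypothetical names, not part of selenium; the scale could be read with `driver.execute_script('return window.devicePixelRatio')`.

```python
def crop_box(location, size, scale=1.0):
    """Return (left, top, right, bottom) in screenshot pixels.

    location: dict with 'x' and 'y', as returned by element.location
    size: dict with 'width' and 'height', as returned by element.size
    scale: ratio of screenshot pixels to CSS pixels (assumed 1.0 by default)
    """
    left = int(location['x'] * scale)
    top = int(location['y'] * scale)
    right = int((location['x'] + size['width']) * scale)
    bottom = int((location['y'] + size['height']) * scale)
    return (left, top, right, bottom)


if __name__ == "__main__":
    # Example values only; real ones come from the located element.
    box = crop_box({'x': 100, 'y': 50}, {'width': 120, 'height': 40}, scale=2.0)
    print(box)
```

The returned tuple can be passed directly to `im.crop(...)`.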

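At this point we have the CAPTCHA image. How do we read it?

1. Process it with pytesser, tesseract, or another OCR library.

2. I did not have many CAPTCHAs to solve, so to improve recognition efficiency and keep things simple I called the API of a CAPTCHA-solving platform (ruokuai). The price is roughly 1 yuan per 100-150 CAPTCHAs, depending on the number of characters and whether digits and letters are mixed.

CAPTCHA types and prices: http://www.ruokuai.com/home/pricetype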
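
For the first option (an OCR library), a minimal sketch might look like the following. It is not what this article uses; it assumes the pytesseract and Pillow packages plus the tesseract binary are installed, and the helper names `clean_ocr_text` and `ocr_captcha` are hypothetical. Accuracy on distorted CAPTCHAs is likely to be poor without further preprocessing.

```python
import re


def clean_ocr_text(text):
    """Keep only ASCII letters and digits from raw OCR output."""
    return re.sub(r'[^0-9A-Za-z]', '', text)


def ocr_captcha(path):
    """Run Tesseract on the image at `path` and return the cleaned text.

    Assumes pytesseract, Pillow and the tesseract binary are installed;
    they are imported here so clean_ocr_text can be used without them.
    """
    import pytesseract
    from PIL import Image

    image = Image.open(path).convert('L')  # convert to grayscale
    # --psm 7 tells Tesseract to treat the image as a single line of text.
    raw = pytesseract.image_to_string(image, config='--psm 7')
    return clean_ocr_text(raw)


if __name__ == "__main__":
    print(clean_ocr_text(" a B1 2\n"))
```

With the dependencies in place, `ocr_captcha('CrawlResult/code.png')` would run it on the image cropped earlier.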

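Here is how to use the solving platform:

1. The developer documentation first: http://wiki.ruokuai.com/

2. The official way to call it (there are two versions, a DOS version and a regular one; the regular version is shown below, and the basic principle is the same).

Principle: send the CAPTCHA image together with the platform username, password, and other fields in the required format by calling the API (requesting a URL).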

from hashlib import md5

import requests


class RClient(object):

    def __init__(self, username, password, soft_id, soft_key):
        self.username = username
        self.password = md5(password.encode('utf-8')).hexdigest()  # md5 needs bytes on Python 3
        self.soft_id = soft_id
        self.soft_key = soft_key
        self.base_params = {
            'username': self.username,
            'password': self.password,
            'softid': self.soft_id,
            'softkey': self.soft_key,
        }
        self.headers = {
            'Connection': 'Keep-Alive',
            'Expect': '100-continue',
            'User-Agent': 'ben',
        }

    def rk_create(self, im, im_type, timeout=60):
        """
        im: image bytes
        im_type: question (CAPTCHA) type
        """
        params = {
            'typeid': im_type,
            'timeout': timeout,
        }
        params.update(self.base_params)
        files = {'image': ('a.jpg', im)}
        r = requests.post('http://api.ruokuai.com/create.json', data=params, files=files, headers=self.headers)
        return r.json()

    def rk_report_error(self, im_id):
        """
        im_id: ID of the question being reported as wrong
        """
        params = {
            'id': im_id,
        }
        params.update(self.base_params)
        r = requests.post('http://api.ruokuai.com/reporterror.json', data=params, headers=self.headers)
        return r.json()

rc = RClient('username', 'password', 'soft_id', 'soft_key')
imagePath = 'CrawlResult/code.png'
with open(imagePath, 'rb') as f:  # read the cropped CAPTCHA as bytes
    im = f.read()
code_json = rc.rk_create(im, 'CAPTCHA type')  # types and prices: http://www.ruokuai.com/home/pricetype
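
The article stops at the raw JSON, so here is a hedged sketch of pulling the recognized text out of it. The key names `'Result'`, `'Id'` and `'Error'` are assumptions about the response format and should be checked against the platform documentation; `extract_code` is a hypothetical helper, not part of the official client.

```python
def extract_code(code_json):
    """Return (text, question_id) from a create.json response dict.

    Assumes a successful response carries 'Result' and 'Id' keys and a
    failed one carries 'Error'; verify this against the platform docs.
    """
    if 'Error' in code_json:
        raise RuntimeError('solving failed: %s' % code_json['Error'])
    return code_json['Result'], code_json['Id']


if __name__ == "__main__":
    # Made-up response used only to illustrate the helper.
    sample = {'Result': 'ab12', 'Id': 'hypothetical-id'}
    text, question_id = extract_code(sample)
    print(text, question_id)
```

If the site rejects the returned text, the question ID can be passed to `rc.rk_report_error(question_id)` to report the wrong answer.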

TimorChow

Zehua Zhou, 2019-06-18
