Breaking CAPTCHA in Python

CAPTCHAs are usually analyzed with neural networks. That is a good approach, but it can be overcomplicated for simple cases. The much shorter algorithm presented below can produce sufficient results for uncomplicated CAPTCHAs.

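In this algorithm, an image of an unknown letter is compared with samples of known letters; the letter in the most similar sample is probably also the letter in the analyzed image. Similarity is measured as the sum of per-pixel differences between grayscale versions of the two images: the lower the sum, the better the match. It is implemented as a Python script; its usage is presented below: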

bash-3.2$ python cracker.py test1.png 
e
bash-3.2$ python cracker.py test2.png 
p

The script can't be used directly on a raw CAPTCHA: first, small artifacts have to be removed from the CAPTCHA; second, each letter has to be stored in a separate image.
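The two preparation steps could be sketched roughly as follows. This is only an illustration, not the preprocessing used for the sample set: the function names `remove_small_artifacts` and `split_letters`, the 3x3 median filter, and the threshold of 128 are all assumptions, and the sketch assumes dark letters on a light background that do not touch each other horizontally.

```python
from PIL import Image, ImageFilter


def remove_small_artifacts(image, threshold=128):
    """Return a black-and-white copy with isolated specks smoothed away.

    Hypothetical helper: the filter size and threshold are assumptions.
    """
    gray = image.convert('L')
    # A 3x3 median filter wipes out specks of only a pixel or two.
    smoothed = gray.filter(ImageFilter.MedianFilter(3))
    # Binarize: dark pixels (ink) become 0, everything else becomes 255.
    return smoothed.point(lambda value: 0 if value < threshold else 255)


def split_letters(image):
    """Cut a cleaned image into one crop per run of columns containing ink.

    Hypothetical helper: assumes letters are separated by empty columns.
    """
    width, height = image.size
    pixels = image.load()
    has_ink = [any(pixels[x, y] == 0 for y in range(height))
               for x in range(width)]

    letters = []
    start = None
    # The extra False acts as a sentinel that closes a run at the right edge.
    for x, ink in enumerate(has_ink + [False]):
        if ink and start is None:
            start = x
        elif not ink and start is not None:
            letters.append(image.crop((start, 0, x, height)))
            start = None
    return letters
```

Each crop returned by `split_letters` could then be saved with `save()` and passed to the script one file at a time. Distorted or overlapping letters would need a smarter segmentation than this column scan.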

The script below requires a samples directory containing one sample image per lowercase letter (a.png to z.png). A sample set and this CAPTCHA breaker can be downloaded from my GitHub (captcha directory).

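import os
import string
import sys

# Requires Pillow, the maintained fork of PIL.
from PIL import Image, ImageChops

if __name__ == "__main__":
    # Load the unknown letter and convert it to grayscale.
    path = sys.argv[1]
    base = Image.open(path).convert('L')

    class Fit:
        """A candidate letter and its total pixel difference."""
        letter = None
        difference = 0

    best = Fit()

    for letter in string.ascii_lowercase:
        current = Fit()
        current.letter = letter

        # Scale the known sample to the size of the analyzed image.
        sample_path = os.path.join('samples', letter + '.png')
        sample = Image.open(sample_path).convert('L').resize(base.size)
        difference = ImageChops.difference(base, sample)

        # Sum the absolute per-pixel differences; lower means more similar.
        for x in range(difference.size[0]):
            for y in range(difference.size[1]):
                current.difference += difference.getpixel((x, y))

        # Keep the sample with the smallest total difference seen so far.
        if not best.letter or best.difference > current.difference:
            best = current

    print(best.letter)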

I was surprised that this task can be done in fewer than 50 lines! Of course, it's not good enough to break complicated CAPTCHAs, but those aren't an easy task for more complicated algorithms either.
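As a side note, the nested pixel loop in the script could probably be shortened further with Pillow's `ImageStat` module. The sketch below is a variation, not the script from the repository: the helper names `total_difference` and `closest_letter` are made up for illustration, and the samples are assumed to be already loaded into a dictionary that maps each letter to its image.

```python
from PIL import Image, ImageChops, ImageStat


def total_difference(base, sample):
    """Sum of absolute per-pixel differences between two images."""
    base = base.convert('L')
    sample = sample.convert('L').resize(base.size)
    difference = ImageChops.difference(base, sample)
    # Stat.sum holds one total per band; a grayscale image has one band.
    return ImageStat.Stat(difference).sum[0]


def closest_letter(base, samples):
    """Return the key of the sample that differs least from base.

    samples is assumed to be a dict such as {'a': <image>, 'b': <image>}.
    """
    return min(samples,
               key=lambda letter: total_difference(base, samples[letter]))
```

The result should match the loop-based version, since both add up the same difference image; only the summation is delegated to the library.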

7 comments:

  1. Wow!
    Great blog. The post is very informative and very useful.
    Keep blogging.

  2. OCR software can be used as well. The problem with letter recognition is that there is no good way to recognise warped letters (the popular "fish eye" effect in Google CAPTCHA or Open Captcha).

    Replies
    1. IMHO OCR is overkill for most CAPTCHAs; in addition, it will probably be too slow.

    2. @Anonymous, do you know any open source libraries that would be good for this?

    3. I heard about a project where a DSL (in Haskell) was used to program the route of a satellite.

    4. @spainman, sounds cool! I will try to find more info about this project :)
