Many sites use images containing text that the user has to retype to prove they are a human and not a spambot. Many of those images (called captchas) contain small lines, dots and other noise to make them harder for spambots to analyze. In this post I will show how to easily remove this noise from a captcha.
I used Python and its PIL library to process the captchas. The first step is to convert the image to greyscale (this makes the rest of the work easier) and blur it (this makes small objects less visible). PIL's BLUR filter did not work well for this, but the SMOOTH_MORE filter works great, although it has to be applied twice to blur the image enough. The next step is to compare every pixel with a constant threshold: pixels whose value is below it become black, and all others become white. This constant can be set from the command line.
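To see why blurring before thresholding helps, here is a minimal sketch in plain Python, without PIL. It assumes a simple 3x3 box blur as a stand-in for PIL's smoothing filter (which uses a different kernel), and the names `box_blur`, `threshold` and the tiny test grid are made up for illustration. A thick dark stroke survives the blur, while an isolated dark speck is averaged towards white and disappears after thresholding.

```python
WHITE, BLACK = 255, 0


def box_blur(grid):
    """Replace each value with the mean of its 3x3 neighbourhood.

    The neighbourhood is cropped at the edges of the grid.
    This is only a stand-in for PIL's smoothing filters.
    """
    height, width = len(grid), len(grid[0])
    blurred = []
    for y in range(height):
        row = []
        for x in range(width):
            neighbours = [
                grid[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
            ]
            row.append(sum(neighbours) // len(neighbours))
        blurred.append(row)
    return blurred


def threshold(grid, pass_factor):
    """Values below pass_factor become black, all others white."""
    return [[BLACK if v < pass_factor else WHITE for v in row] for row in grid]


# A 5x8 "image": a dark stroke three pixels wide on the left,
# plus one isolated dark speck playing the role of noise.
noisy = [[BLACK] * 3 + [WHITE] * 5 for _ in range(5)]
noisy[2][6] = BLACK

without_blur = threshold(noisy, 100)         # the speck is still there
cleaned = threshold(box_blur(noisy), 100)    # the speck is gone, the stroke stays

for row in cleaned:
    print("".join("#" if v == BLACK else "." for v in row))
```

Thresholding alone leaves the speck black, because a single pixel is just as dark as the stroke. After blurring, the speck's value is pulled up by its white neighbours, so it falls on the white side of the threshold.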
Below is an example of how to run the script (the first argument is the path to the image, the second is the constant mentioned above), followed by some sample results and the source code.
rgawron@vk1004:~/noiseReduction$ python noise.py captcha2.jpg 90

(Images: sample results for pass_factor = 100, pass_factor = 90, pass_factor = 100 and pass_factor = 130.)
import sys

# Classic PIL allowed "import Image"; Pillow needs the package-qualified form.
from PIL import Image, ImageFilter


def prepare_image(img):
    """Blur the image and transform it to greyscale"""
    # SMOOTH_MORE is applied twice so that small noise is blurred enough
    img = img.filter(ImageFilter.SMOOTH_MORE)
    img = img.filter(ImageFilter.SMOOTH_MORE)
    if 'L' != img.mode:
        img = img.convert('L')
    return img


def remove_noise(img, pass_factor):
    # img.size is a (width, height) tuple
    for column in range(img.size[0]):
        for line in range(img.size[1]):
            value = remove_noise_by_pixel(img, column, line, pass_factor)
            img.putpixel((column, line), value)
    return img


def remove_noise_by_pixel(img, column, line, pass_factor):
    # Pixels darker than pass_factor become black, the rest white
    if img.getpixel((column, line)) < pass_factor:
        return 0
    return 255


if __name__ == "__main__":
    input_image = sys.argv[1]            # path to the captcha image
    output_image = 'out_' + input_image
    pass_factor = int(sys.argv[2])       # threshold constant

    img = Image.open(input_image)
    img = prepare_image(img)
    img = remove_noise(img, pass_factor)
    img.save(output_image)
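The per-pixel loop in remove_noise is easy to follow but slow on larger images. As a possible alternative, the same threshold can be applied with Pillow's `Image.point`, which maps every pixel value through a function in one call. This is only a sketch, not part of the original script: it assumes Pillow is installed and that the image is already in greyscale ('L') mode, and the name `remove_noise_fast` is made up for illustration.

```python
from PIL import Image


def remove_noise_fast(img, pass_factor):
    """Threshold a greyscale image in a single call.

    Values below pass_factor become black (0), all others white (255),
    which is the same rule as remove_noise_by_pixel.
    """
    return img.point(lambda value: 0 if value < pass_factor else 255)


# Tiny 3x1 greyscale image to check the rule on known values.
demo = Image.new('L', (3, 1))
demo.putpixel((0, 0), 10)    # darker than the threshold
demo.putpixel((1, 0), 90)    # exactly at the threshold
demo.putpixel((2, 0), 200)   # brighter than the threshold

result = remove_noise_fast(demo, 90)
values = [result.getpixel((x, 0)) for x in range(3)]
print(values)
```

Unlike remove_noise, this version returns a new image instead of modifying the one passed in, so the call in the main block would stay `img = remove_noise_fast(img, pass_factor)`.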