Project author: AmritK10

Project description:
Recognises the characters in CAPTCHAs generated by the Really Simple CAPTCHA plugin and hence breaks the CAPTCHA.
Primary language: Jupyter Notebook
Repository: git://github.com/AmritK10/CAPTCHA_Text_Recognition.git
Created: 2019-04-04T17:54:01Z
Project community: https://github.com/AmritK10/CAPTCHA_Text_Recognition

License: MIT License


CAPTCHA_Text_Recognition

Intro

This project trains a model to recognise the characters in CAPTCHAs generated by the “Really Simple CAPTCHA” plugin and hence break the CAPTCHA.

Dataset

The plugin generates 4-letter CAPTCHAs using a random mix of four different fonts.

A dataset of 10,000 CAPTCHA images generated by the plugin was used. The images are in the generated_captcha_images folder.

Because the dataset is small, the model is not trained on whole CAPTCHA images. Instead, each image is segmented into its individual characters (letters), and these form a second dataset on which the model is trained.

In this way the model identifies each letter individually, and the predicted letters are combined to break the CAPTCHA.
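
The combining step can be sketched as follows. This is a minimal illustration, not the notebook's code: read_captcha and classify are hypothetical names, the bounding boxes are made up, and a lookup table stands in for the trained model. The one real requirement it shows is that letters must be ordered left to right before their predictions are joined.

```python
def read_captcha(letter_boxes, classify):
    """Join per-letter predictions into the CAPTCHA text, left to right.

    letter_boxes: list of (x, y, w, h) boxes, one per segmented letter.
    classify: hypothetical callable mapping a box to a predicted character.
    """
    # Segmentation does not guarantee reading order, so sort by x first.
    ordered = sorted(letter_boxes, key=lambda box: box[0])
    return "".join(classify(box) for box in ordered)


# Stand-in classifier for illustration only: a lookup keyed by x position.
fake_predictions = {5: "A", 30: "B", 55: "C", 80: "D"}
boxes = [(55, 4, 18, 22), (5, 6, 20, 20), (80, 5, 19, 21), (30, 3, 17, 24)]
text = read_captcha(boxes, lambda box: fake_predictions[box[0]])
print(text)
```

In a real pipeline, classify would crop the letter from the image and pass it to the trained model.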

OpenCV

OpenCV was used to perform character segmentation and to build the dataset of individual letters. Some of the functions used are:


  • cv2.imread(path): Reads in the image

  • cv2.cvtColor(img, cv2.COLOR_BGR2GRAY): Converts the image to grayscale

  • cv2.copyMakeBorder(gray, 8, 8, 8, 8, cv2.BORDER_REPLICATE): Adds 8 pixels of padding on each side of the image

  • cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)[1]: Thresholds the image (converts it to pure black and white)

  • cv2.findContours(thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE): Finds the contours (continuous blobs of pixels) in the image, which segments the letters

Model

A shallow convolutional model was created and trained on the individual letters. After segmentation, it predicts each letter in a CAPTCHA, thereby breaking the CAPTCHA.

Results

[Image: Screen Shot 2019-09-02 at 1 02 01 AM]
[Image: Screen Shot 2019-09-02 at 1 02 07 AM]
[Image: Screen Shot 2019-09-02 at 1 02 13 AM]

Note for running code

Create a folder named extracted_letter_images in the working directory before running the code.
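
If you prefer to create the folder from Python rather than by hand, a minimal sketch is shown below. It assumes the notebook is run from the directory that should hold the folder.

```python
import os

# Folder name taken from the note above; the path is relative to the
# current working directory. exist_ok=True makes reruns harmless.
os.makedirs("extracted_letter_images", exist_ok=True)
```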