
extract captcha image


Decoding CAPTCHAs
OCR (Optical Character Recognition) is pretty accurate these days and can easily read printed text.

Search terms: rails ocr, ruby ocr, break google captcha


http://stackoverflow.com/search?q=rails+ocr
http://www.wausita.com/captcha/

-----------------------------------------------------------

1. tesseract-x.xx.tar.gz contains all the source code.

2. tesseract-2.xx.<lang>.tar.gz contains the Tesseract 2 language data files for <lang>. You need at least one of these or Tesseract 2 will not work.

3. <lang>.traineddata.gz contains the Tesseract 3 language data file for <lang>. You need at least one of these or Tesseract 3 will not work.

4. Note that tesseract-2.04.tar.gz unpacks to the tesseract-2.04 directory, and
tesseract-2.01.<lang>.tar.gz unpacks to the tessdata directory, which belongs
inside your tesseract-2.04 directory. It is therefore best to download the
language packs into your tesseract-2.04 directory, so you can use "unpack here"
or equivalent. You can unpack as many of the language packs as you care to, as
they all contain different files. Note that if you are using make install, you
should unpack your language data into your source tree before you run make
install. If you unpack them as root into the destination directory of make
install, the user IDs and access permissions might be messed up.


If they are not already installed, you need the following libraries (Ubuntu):

sudo apt-get install libpng12-dev
sudo apt-get install libjpeg62-dev
sudo apt-get install libtiff4-dev
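sudo apt-get install zlibg-dev

E: Unable to locate package zlibg-dev
=> the package name is misspelled. Instead of building zlib from source, install the correctly named package:
sudo apt-get install zlib1g-dev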

You also need to install Leptonica. There is an apt-get package (name unknown), or the sources are at http://www.leptonica.org/.

Download Leptonica from http://www.leptonica.org/source/leptonlib-1.67.tar.gz and unpack it:
tar zxvf leptonlib-1.67.tar.gz

The instructions in the Leptonica README are clear, but basically it is the usual:

./configure
make
sudo make install
sudo ldconfig

Now back to Tesseract. Check out the source from svn:
svn checkout http://tesseract-ocr.googlecode.com/svn/trunk/ tesseract-ocr-read-only
or get the tesseract-3.00.tar.gz package from the download page:
http://code.google.com/p/tesseract-ocr/downloads/list

The same build process as usual applies:

./runautoconf
./configure
make
sudo make install

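Edit the shell profile, system-wide or per user (presumably to set variables such as TESSDATA_PREFIX):
sudo vi /etc/profile
vi ~/.bashrc

The language data files are gzip-compressed; unpack them with:
gunzip FileName.gz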

   1. Download a language data file (e.g. 'wget http://tesseract-ocr.googlecode.com/files/eng.traineddata.gz')
   2. Decompress it ('gzip -d eng.traineddata.gz')
   3. Move it to the installation's tessdata directory (e.g. 'mv eng.traineddata $TESSDATA_PREFIX' if TESSDATA_PREFIX is defined)


You may still get an error when trying to run tesseract:
$ tesseract foo.png bar

tesseract: error while loading shared libraries: libtesseract_api.so.3: cannot open shared object file: No such file or directory

The library was installed, but the runtime linker does not know about it yet. You need to update the linker cache. The following should get you up and running:
$ sudo ldconfig

--------------------------------------------------
copy eng.traineddata  to /usr/local/share/tessdata
pwd
/usr/local/share/tessdata
ls
configs  eng.traineddata  tessconfigs
-------------------------------------------------
tesseract digit only
improve tesseract digits  accuracy
use tesseract to get plain ascii text out of the bitmap.


curl 'http://www.stc.gov.cn/search/image_code.asp?rnd=0.7641146600113322' > /home/simon/Desktop/weizh/ca.jpg

tesseract ca.bmp outputbase -l eng
more outputbase.txt

tesseract ca.bmp outputbase nobatch digits
more outputbase.txt

In this case only the jpg input worked:
curl 'http://www.stc.gov.cn/search/image_code.asp?rnd=0.7641146600111234' > ca.jpg
tesseract ca.jpg outputbase nobatch digits
cat outputbase.txt


Reloading /etc/profile

$ source ~/.profile
$ source /etc/profile

Settings in ~/.profile override those in /etc/profile. You can also use ~/.bash_profile in your home directory to customize your bash shell's profile.

Basically, if you need to load shell variables from any file, just run the .
(dot) command, followed by a space and the path to the file. Give a path that
contains a slash, ideally an absolute one, because a bare file name is looked
up in PATH first. (Be careful which file you load variables from, because
you might overwrite some important environment variables and your system
could become unstable.)
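
As a small illustration of the dot command, the sketch below writes one variable to a temporary file and loads it into the current shell. The file is a throwaway created by mktemp, and the value given to TESSDATA_PREFIX is only an example, not a recommended setting.

```shell
# Create a throwaway file holding one variable (example value only).
envfile=$(mktemp)
echo 'export TESSDATA_PREFIX=/usr/local/share/' > "$envfile"

# Load it into the current shell with the dot command
# ("source" is the bash synonym).
. "$envfile"

echo "$TESSDATA_PREFIX"
rm -f "$envfile"
```

Because the file is read by the current shell rather than a child process, the variable is still set after the command returns.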

$ tesseract wenzhou.jpeg outputbase -l eng
Error openning data file /usr/local/sharetessdata/eng.traineddata
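=> workaround: cp eng.traineddata to /usr/local/sharetessdata

The missing slash in that path suggests that TESSDATA_PREFIX was set to /usr/local/share without a trailing slash. Setting it to /usr/local/share/ is probably the cleaner fix.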


cd /home/simon/Desktop/weizh
curl 'http://117.36.53.122:9081/wfcx/servlet/ValidateCodeServlet?t=1304472587796' > xian.png
tesseract xian.png out /usr/local/share/tessdata/tessconfigs/nobatch /usr/local/share/tessdata/configs/digits

Submitting the recognized code at first returned the error page below. The alert text means "wrong verification code", presumably because the captcha download and the query were not part of the same session:


<html>
<head>
<title></title>
<meta http-equiv="Content-Type" content="text/html; charset=gb2312">
<script>
alert("验证码错误!");
window.close();
</script>
</head>
</html>

Fetch the captcha and save the session cookie:
curl --cookie-jar newcookies.txt 'http://117.36.53.122:9081/wfcx/servlet/ValidateCodeServlet?t=1304494360513'  > xian.png

Read the digits from the image:
tesseract xian.png out /usr/local/share/tessdata/tessconfigs/nobatch /usr/local/share/tessdata/configs/digits

Send the query with the same cookie, passing the recognized code as vcode:
curl --cookie newcookies.txt 'http://117.36.53.122:9081/wfcx/query.do?actiontype=vioSurveil&vcode=2148&hpzl=02&hphm=AUL695&tj=CLSBDH&tj_val=LFV2A11GX93178557'
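
The captcha download, the OCR run and the query could be wrapped in a small script, sketched here under some assumptions. The function names are made up for illustration, the URLs are the ones from the commands in this section, the t parameter is assumed to be a cache-busting timestamp, and the OCR result is assumed to be usable once everything except digits is stripped. Nothing here guarantees that the recognized code is correct.

```shell
BASE='http://117.36.53.122:9081/wfcx'   # host taken from the commands in this section

# Download the captcha to file $1 and remember the session cookie.
# Assumption: t is only a cache-busting timestamp.
fetch_captcha() {
    curl --silent --cookie-jar newcookies.txt \
        "$BASE/servlet/ValidateCodeServlet?t=$(date +%s)" > "$1"
}

# Run tesseract with the nobatch and digits configs on image $1,
# then print only the digits found in out.txt.
ocr_digits() {
    tesseract "$1" out nobatch digits >/dev/null 2>&1
    tr -cd '0-9' < out.txt
}

# Send the query in the same session, passing the code $1 as vcode.
submit_query() {
    curl --silent --cookie newcookies.txt \
        "$BASE/query.do?actiontype=vioSurveil&vcode=$1&hpzl=02&hphm=AUL695&tj=CLSBDH&tj_val=LFV2A11GX93178557"
}

# Usage:
#   fetch_captcha xian.png && submit_query "$(ocr_digits xian.png)"
```

Keeping the three steps in separate functions makes it easy to retry only the download and OCR when the server rejects the code.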



-----------------------------------

cd /usr/local/sharetessdata:
eng.traineddata


/usr/local/share/tessdata:
chi_sim.traineddata 
configs 
eng.traineddata 
tessconfigs


-----------------------------------

$ sudo apt-get install imagemagick
$ dpkg -l |grep imagemagick
imagemagick                                                 
imagemagick-doc                           

$ convert
$ whereis convert
$ which convert
$ convert -compress none -depth 8 -alpha off zhejiang.gif zhejiang.tif

Enlarging the image can improve OCR accuracy.
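
A minimal sketch of enlarging an image before OCR with ImageMagick, reusing the convert options from the command above. The function name upscale and the 300% default are assumptions chosen for illustration; the best factor depends on the image.

```shell
# upscale INPUT OUTPUT [PERCENT]: enlarge an image and write an
# uncompressed 8-bit file without an alpha channel.
upscale() {
    convert "$1" -resize "${3:-300}%" -compress none -depth 8 -alpha off "$2"
}

# Usage:
#   upscale zhejiang.gif zhejiang.tif
#   tesseract zhejiang.tif outputbase nobatch digits
```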

I believe the real challenge in applying OCR to plate recognition is
that plate images are "too dirty" compared to paper documents.
There are frames, skews, uneven shadows, etc. You have to do your own
work to split the plate into separate chars and feed them to the OCR
engine. I don't think tesseract itself can handle this automatically
given the raw image. But I believe it will do pretty well once you have
the binarized separate chars. Basically, plate recognition is more an
image processing problem than an OCR problem.

You can use the grammar (the known format of valid strings) as a post-processing step to make corrections.
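
As a small illustration of grammar-based correction, suppose the expected text is a digits-only code such as the 4-digit vcode used earlier. Letters that OCR tends to confuse with digits can then be mapped back, and everything else dropped. The function name fix_digits and the substitution table are assumptions for this sketch, not part of tesseract.

```shell
# Map look-alike letters to digits (O/o -> 0, I/l -> 1, S -> 5, B -> 8),
# then delete every character that is not a digit.
fix_digits() {
    printf '%s' "$1" | tr 'OoIlSB' '001158' | tr -cd '0-9'
}

fix_digits '2l4B'   # prints 2148
echo
```

A real plate grammar would need position-dependent rules, since a plate mixes letters and digits.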


To convert the PDF I used ImageMagick's convert application. Below is the set of commands that I use (tesseract appends .txt to the output base, so the result ends up in out.txt).
convert -density 288 src.pdf -colorspace Gray -depth 8 -alpha off tmp.tif
tesseract tmp.tif out

how to eliminate noise

 

 

 

 
