这里我们不用神经网络,我们用向量空间引擎的思想。什么是向量空间引擎呢?
拿文章的例子来说:
你有 3 篇文档,我们要怎么计算它们之间的相似度呢?2 篇文档所使用的相同的单词越多,那这两篇文章就越相似!但是这单词太多怎么办,就由我们来选择几个关键单词,选择的单词又被称作特征,每一个特征就好比空间中的一个维度(x,y,z 等),一组特征就是一个矢量,每一个文档我们都能得到这么一个矢量,只要计算矢量之间的夹角就能得到文章的相似度了。
用 Python 类实现向量空间:
import math class VectorCompare: #计算矢量大小 def magnitude(self,concordance): total = 0 for word,count in concordance.iteritems(): total += count ** 2 return math.sqrt(total) #计算矢量之间的 cos 值 def relation(self,concordance1, concordance2): relevance = 0 topvalue = 0 for word, count in concordance1.iteritems(): if concordance2.has_key(word): topvalue += count * concordance2[word] return topvalue / (self.magnitude(concordance1) * self.magnitude(concordance2))
它会比较两个 python 字典类型并输出它们的相似度(用 0~1 的数字表示)
这里取大量验证码提取单个字符图片作为训练集合的工作,但只要是有好好读上文的同学就一定知道这些工作要怎么做,在这里就略去了。可以直接使用提供的训练集合来进行下面的操作。
iconset目录下放的是我们的训练集。
最后追加的内容:
#将图片转换为矢量 def buildvector(im): d1 = {} count = 0 for i in im.getdata(): d1[count] = i count += 1 return d1 v = VectorCompare() iconset = ['0','1','2','3','4','5','6','7','8','9','0','a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z'] #加载训练集 imageset = [] for letter in iconset: for img in os.listdir('./iconset/%s/'%(letter)): temp = [] if img != "Thumbs.db" and img != ".DS_Store": temp.append(buildvector(Image.open("./iconset/%s/%s"%(letter,img)))) imageset.append({letter:temp}) count = 0 #对验证码图片进行切割 for letter in letters: m = hashlib.md5() im3 = im2.crop(( letter[0] , 0, letter[1],im2.size[1] )) guess = [] #将切割得到的验证码小片段与每个训练片段进行比较 for image in imageset: for x,y in image.iteritems(): if len(y) != 0: guess.append( ( v.relation(y[0],buildvector(im3)),x) ) guess.sort(reverse=True) print "",guess[0] count += 1
全部代码:
from PIL import Image import hashlib import time import os import math class VectorCompare: def magnitude(self,concordance): total = 0 for word,count in concordance.iteritems(): total += count ** 2 return math.sqrt(total) def relation(self,concordance1, concordance2): relevance = 0 topvalue = 0 for word, count in concordance1.iteritems(): if concordance2.has_key(word): topvalue += count * concordance2[word] return topvalue / (self.magnitude(concordance1) * self.magnitude(concordance2)) def buildvector(im): d1 = {} count = 0 for i in im.getdata(): d1[count] = i count += 1 return d1 v = VectorCompare() iconset = ['0','1','2','3','4','5','6','7','8','9','0','a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z'] imageset = [] for letter in iconset: for img in os.listdir('./iconset/%s/'%(letter)): temp = [] if img != "Thumbs.db" and img != ".DS_Store": # windows check... temp.append(buildvector(Image.open("./iconset/%s/%s"%(letter,img)))) imageset.append({letter:temp}) im = Image.open("captcha.gif") im2 = Image.new("P",im.size,255) im.convert("P") temp = {} for x in range(im.size[1]): for y in range(im.size[0]): pix = im.getpixel((y,x)) temp[pix] = pix if pix == 220 or pix == 227: # these are the numbers to get im2.putpixel((y,x),0) inletter = False foundletter=False start = 0 end = 0 letters = [] for y in range(im2.size[0]): # slice across for x in range(im2.size[1]): # slice down pix = im2.getpixel((y,x)) if pix != 255: inletter = True if foundletter == False and inletter == True: foundletter = True start = y if foundletter == True and inletter == False: foundletter = False end = y letters.append((start,end)) inletter=False count = 0 for letter in letters: m = hashlib.md5() im3 = im2.crop(( letter[0] , 0, letter[1],im2.size[1] )) guess = [] for image in imageset: for x,y in image.iteritems(): if len(y) != 0: guess.append( ( v.relation(y[0],buildvector(im3)),x) ) guess.sort(reverse=True) print "",guess[0] count += 1
关于字符的训练,大家可以参考我前面的文章