[Mobile Development] Building, Running, and Using an Android OCR Project

Tesseract extracts text from images. Its accuracy is decent, and it can recognize text that mixes several languages.

https://github.com/adaptech-cz/Tesseract4Android

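This project compiles the C++ code directly in Android Studio. The model files have to be downloaded separately (https://github.com/tesseract-ocr/tessdata). The English model, for example, is https://github.com/tesseract-ocr/tessdata/blob/4.0.0/eng.traineddata, which is small at twenty-odd megabytes. After downloading, put it in /sdcard/tesseract/tessdata for testing.

The original repository provides a prebuilt jar and binaries, and as a dependency the library is only a little over ten megabytes. You can also build it yourself. In debug mode the code runs very slowly: recognition that should take about one second took a ridiculous fifteen seconds in a debug build. I tried passing optimization flags to CMake, which made no difference, so the only option was to force a release build even while debugging: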

defaultConfig {
	...
	externalNativeBuild {
		cmake {
			// Force a Release build
			arguments "-DCMAKE_BUILD_TYPE=Release"

			// No effect
			// cFlags "-O"
			// cppFlags "-O"

			// OpenMP parallel build: it compiles, but at runtime the app reports that libmp.so cannot be found and crashes.
			// cFlags "-fopenmp"
			// cppFlags "-fopenmp"
		}
	}
}

The build itself is fairly fast. Remember to set abiFilters so that you do not build every ABI while you are still experimenting.

defaultConfig {
	...
	ndk {
		abiFilters 'arm64-v8a'
	}
}

The test code is on the project's GitHub page, so I won't repeat it here. You pass in a bitmap and get back the recognition result.

getUTF8Text

Returns plain text with no formatting information.

getBoxText

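The format looks odd at first. It matches Tesseract's box-file layout: one character per line, followed by left, bottom, right, top (with y measured up from the bottom of the image) and a page number. For example: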

A 24 380 44 401 0
b 46 380 61 403 0
o 63 380 79 395 0
u 82 380 96 395 0
t 98 380 108 399 0
F 26 305 38 326 0
o 40 305 55 320 0
r 50 305 64 328 0
k 59 305 81 328 0
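
As a rough sketch of how these lines could be consumed, the function below parses each one and flips y so that the origin is the top-left corner, like the bbox values in hOCR. It assumes every line has the form `char left bottom right top page`. `parseBoxText` and `imageHeight` are hypothetical names for this illustration, and 411 is the page height reported in the hOCR output for the same image.

```javascript
// Parse getBoxText-style output into objects with top-left-origin coordinates.
// Assumption: each line is "<char> <left> <bottom> <right> <top> <page>",
// with y measured upward from the bottom edge of the image.
function parseBoxText(text, imageHeight) {
  return text
    .split('\n')
    .map(line => line.trim())
    .filter(line => line.length > 0)
    .map(line => {
      const parts = line.split(/\s+/);
      const [left, bottom, right, top, page] = parts.slice(-5).map(Number);
      // Everything before the five numbers is the recognized character.
      const char = parts.slice(0, -5).join(' ');
      return {
        char,
        page,
        x0: left,
        y0: imageHeight - top,    // flip: distance from the top edge
        x1: right,
        y1: imageHeight - bottom,
      };
    });
}

const boxes = parseBoxText('A 24 380 44 401 0\nb 46 380 61 403 0', 411);
console.log(boxes[0]); // { char: 'A', page: 0, x0: 24, y0: 10, x1: 44, y1: 31 }
```

After the flip, the letters of "About" span y = 8 to 31, which agrees with the word-level bbox `24 8 108 31` in the hOCR output.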

getHOCRText

Returns HTML in which every word carries position and confidence information, for example:

<div class='ocr_page' id='page_1' title='image "unknown"; bbox 0 0 593 411; ppageno 0; scan_res 70 70'>
       <div class='ocr_carea' id='block_1_1' title="bbox 24 8 572 381">
        <p class='ocr_par' id='par_1_1' lang='eng' title="bbox 24 8 572 381">
         <span class='ocr_line' id='line_1_1' title="bbox 24 8 108 31; baseline 0 0; x_size 29.55574; x_descenders 6.5557399; x_ascenders 8">
          <span class='ocrx_word' id='word_1_1' title='bbox 24 8 108 31; x_wconf 92'>About</span>
         </span>
         <span class='ocr_line' id='line_1_2' title="bbox 26 83 572 106; baseline 0 0; x_size 29.55574; x_descenders 6.5557399; x_ascenders 8">
          <span class='ocrx_word' id='word_1_2' title='bbox 26 83 81 106; x_wconf 93'>Fork</span>
          <span class='ocrx_word' id='word_1_3' title='bbox 91 83 117 106; x_wconf 93'>of</span>
          <span class='ocrx_word' id='word_1_4' title='bbox 125 87 237 106; x_wconf 91'>tess-two</span>
          <span class='ocrx_word' id='word_1_5' title='bbox 248 84 363 106; x_wconf 92'>rewritten</span>
          <span class='ocrx_word' id='word_1_6' title='bbox 373 83 434 106; x_wconf 92'>from</span>
          <span class='ocrx_word' id='word_1_7' title='bbox 445 83 536 106; x_wconf 92'>scratch</span>
          <span class='ocrx_word' id='word_1_8' title='bbox 546 87 572 106; x_wconf 92'>to</span>
         </span>
         <span class='ocr_line' id='line_1_3' title="bbox 25 128 545 158; baseline 0 -7; x_size 30; x_descenders 7; x_ascenders 8">
          <span class='ocrx_word' id='word_1_9' title='bbox 25 132 127 158; x_wconf 89'>support</span>
          <span class='ocrx_word' id='word_1_10' title='bbox 138 128 206 151; x_wconf 89'>latest</span>
          <span class='ocrx_word' id='word_1_11' title='bbox 215 129 308 151; x_wconf 91'>version</span>
          <span class='ocrx_word' id='word_1_12' title='bbox 320 128 346 151; x_wconf 91'>of</span>
          <span class='ocrx_word' id='word_1_13' title='bbox 354 130 472 151; x_wconf 92'>Tesseract</span>
          <span class='ocrx_word' id='word_1_14' title='bbox 482 130 545 151; x_wconf 92'>OCR.</span>
         </span>
         <span class='ocr_line' id='line_1_4' title="bbox 46 202 467 232; baseline 0 -9; x_size 22; x_descenders 5; x_ascenders 5">
          <span class='ocrx_word' id='word_1_15' title='bbox 46 206 123 223; x_wconf 91'>android</span>
          <span class='ocrx_word' id='word_1_16' title='bbox 175 211 207 223; x_wconf 91'>ocr</span>
          <span class='ocrx_word' id='word_1_17' title='bbox 258 202 347 232; x_wconf 91'>tesseract</span>
          <span class='ocrx_word' id='word_1_18' title='bbox 399 206 467 228; x_wconf 92'>libjpeg</span>
         </span>
         <span class='ocr_line' id='line_1_5' title="bbox 47 255 436 285; baseline 0 -9; x_size 22; x_descenders 5; x_ascenders 5">
          <span class='ocrx_word' id='word_1_19' title='bbox 47 259 109 281; x_wconf 92'>libpng</span>
          <span class='ocrx_word' id='word_1_20' title='bbox 162 255 254 285; x_wconf 91'>leptonica</span>
          <span class='ocrx_word' id='word_1_21' title='bbox 305 261 436 276; x_wconf 91'>tesseract-ocr</span>
         </span>
         <span class='ocr_line' id='line_1_6' title="bbox 46 311 341 333; baseline 0 -5; x_size 22; x_descenders 5; x_ascenders 5">
          <span class='ocrx_word' id='word_1_22' title='bbox 46 311 341 333; x_wconf 91'>optical-character-recognition</span>
         </span>
         <span class='ocr_line' id='line_1_7' title="bbox 45 364 221 381; baseline 0 0; x_size 22.244591; x_descenders 5.2445917; x_ascenders 5">
          <span class='ocrx_word' id='word_1_23' title='bbox 45 364 221 381; x_wconf 90'>tesseract-android</span>
         </span>
        </p>
       </div>
      </div>

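This format is called hOCR markup. The relevant data is stored in the title attribute of each node, which means that hovering the mouse pointer over an element shows this information as a tooltip.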
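
To make the title attribute easier to work with, here is a minimal sketch that splits it into named properties. It assumes properties are separated by `;` and that each one starts with its name followed by space-separated values, as in the output above. `parseHocrTitle` is a hypothetical helper name, not part of any library.

```javascript
// Split an hOCR title attribute into { name: [values...] }.
// Assumption: properties are ';'-separated and values contain no ';'.
function parseHocrTitle(title) {
  const props = {};
  for (const entry of title.split(';')) {
    const tokens = entry.trim().split(/\s+/);
    const name = tokens.shift();
    if (name) props[name] = tokens;
  }
  return props;
}

const props = parseHocrTitle('bbox 24 8 108 31; x_wconf 92');
console.log(props.bbox.map(Number));   // [ 24, 8, 108, 31 ]
console.log(Number(props.x_wconf[0])); // 92
```

Values are kept as strings so that non-numeric properties such as `image "unknown"` survive unchanged; callers convert what they need.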

https://en.wikipedia.org/wiki/HOCR

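bbox

Property name = "bbox"
Property value = uint uint uint uint
Example
bbox 0 0 100 200
The bbox (short for "bounding box") of an element is a rectangle around that element, defined by its upper-left corner (x0, y0) and lower-right corner (x1, y1).

The values are measured in pixels, relative to the top-left corner of the document image.

The order of the values is x0 y0 x1 y1 = "left top right bottom"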
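
As a small worked example of this definition, the sketch below turns a bbox property into a left/top/width/height rectangle, which is the form CSS positioning needs. `bboxToRect` is a hypothetical name, and the function assumes the property consists of exactly four unsigned integers.

```javascript
// Convert an hOCR bbox property ("bbox x0 y0 x1 y1") into a CSS-like rectangle.
function bboxToRect(bboxProperty) {
  const match = /^\s*bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$/.exec(bboxProperty);
  if (!match) return null; // not a well-formed bbox property
  const [x0, y0, x1, y1] = match.slice(1).map(Number);
  return { left: x0, top: y0, width: x1 - x0, height: y1 - y0 };
}

console.log(bboxToRect('bbox 0 0 100 200')); // { left: 0, top: 0, width: 100, height: 200 }
console.log(bboxToRect('bbox 24 8 108 31')); // { left: 24, top: 8, width: 84, height: 23 }
```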

This markup can be converted to PDF, and the OCR feature of PDF readers seems to be implemented in this way.

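Although hOCR markup contains position information, rendering it directly still produces ordinary strings with no layout. As a quick experiment, let's write a JS script to see what the text looks like when it is laid out according to that position information.

Step one: write a function that traverses the nodes. This is simple and can be done with firstChild and nextSibling.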

function travel(n) {
	if(n){
		console.log(n);
		travel(n.firstChild);
		travel(n.nextSibling);
	}
}
travel(document.body)
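
One caveat with the recursive version: because it recurses into nextSibling, the call depth grows with the number of siblings, which could become a problem on very large pages. A possible alternative, sketched below, follows the same firstChild/nextSibling links with an explicit stack. `walk` and `visit` are hypothetical names. The function only assumes that nodes expose `firstChild` and `nextSibling`, so it can be tried on plain objects as well as DOM nodes.

```javascript
// Depth-first, document-order traversal without recursion.
// Works on anything that exposes firstChild / nextSibling (e.g. DOM nodes).
function walk(root, visit) {
  const stack = [];
  let node = root;
  while (node || stack.length > 0) {
    if (!node) {
      node = stack.pop();       // resume with a postponed sibling
      continue;
    }
    visit(node);
    // Remember the sibling for later, then descend into the children first.
    if (node.nextSibling) stack.push(node.nextSibling);
    node = node.firstChild;
  }
}

// Tiny stand-in tree: a -> (b -> (d), c)
const d = { name: 'd', firstChild: null, nextSibling: null };
const c = { name: 'c', firstChild: null, nextSibling: null };
const b = { name: 'b', firstChild: d, nextSibling: c };
const a = { name: 'a', firstChild: b, nextSibling: null };

const order = [];
walk(a, n => order.push(n.name));
console.log(order.join(' ')); // a b d c
```

In a browser the equivalent call would be `walk(document.body, n => console.log(n))`, which visits nodes in the same order as the recursive function.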

Step two: apply the position information as CSS styles.

function travel(n) {
	if(n){
		applyBBox(n);
		travel(n.firstChild);
		travel(n.nextSibling);
	}
}

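function applyBBox(n) {
	// Text nodes have no title, and some titles carry no bbox; skip both.
	var t = n.title, idx = t ? t.indexOf('bbox') : -1;
	if (idx < 0) return;
	// The bbox entry runs to the next ';' or, when it is the last entry, to the end of the title.
	var end = t.indexOf(';', idx);
	var bbox = t.slice(idx + 4, end < 0 ? t.length : end).trim().split(/\s+/).map(Number);
	console.log(bbox, n);
	var s = n.style;
	s.position = 'fixed';
	s.left = bbox[0] + 'px';
	s.top = bbox[1] + 'px';
	s.width = (bbox[2] - bbox[0]) + 'px';
	s.height = (bbox[3] - bbox[1]) + 'px';
	// Use the box height as the font size so the text roughly fills its box.
	s.fontSize = s.height;
	s.whiteSpace = 'nowrap';
}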

travel(document.body);