Research and Implementation of Layout Analysis and Recognition Post-Processing for Mongolian Document Images

Published: 2018-07-13 07:42

[Abstract]: Research on optical character recognition (OCR) has advanced rapidly in recent years, and text recognition for Chinese, English, and other scripts has achieved notable results. Recognition rate is the most important performance metric of an OCR system. For a printed Mongolian character recognition system, completing the system and raising the recognition rate requires studying and implementing both layout analysis, which precedes recognition, and post-processing, which follows it. This thesis therefore covers two topics: layout analysis of Mongolian document images and post-processing of Mongolian character recognition results.

Layout analysis is essential groundwork for printed Mongolian character recognition, yet it has received little study for Mongolian document images. Such images come in many layouts and often mix text, pictures, tables, and other elements, which makes recognition difficult. This thesis combines bottom-up and top-down layout analysis: connected components are labeled, merged, and removed so that non-text regions are discarded and only text remains. The text is then divided into paragraphs, and the position of each paragraph is recorded for later layout restoration.

In the recognition system, segmenting and recognizing a document image yields Mongolian glyph codes, whereas the encoding in common use today is the international standard encoding, so the recognition results must be converted. The post-processing addressed here is the conversion of glyph-code recognition results into international standard codes. The conversion is based on a lookup dictionary. An existing international-standard-code dictionary, covering 50,553 commonly used Mongolian words, is converted to a Word document, then to a PDF file, and finally to images. These images undergo layout analysis followed by column segmentation, word segmentation, and glyph-unit segmentation. The resulting glyph-unit images are fed to a trained convolutional neural network classifier, whose output is the glyph codes. The existing standard-code dictionary and the obtained glyph codes are then paired one-to-one to form the conversion dictionary. During post-processing, the system looks up the entry whose glyph code matches the recognition result and reads off the corresponding international standard code, which completes the conversion.

The layout analysis technique handles Mongolian document images with a variety of complex layouts, including removing non-text regions, dividing text regions into paragraphs, and marking paragraph positions. Tested on a sample set, it achieved a layout analysis accuracy of 97.87%. The post-processing converts glyph-code recognition results into international standard codes quickly, effectively, and accurately, making the printed Mongolian character recognition system more complete.
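The dictionary-based code conversion described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation. It assumes that a glyph code is a sequence of integer class IDs emitted by the classifier, that the international standard code is a Unicode string, and that the two word lists are already aligned one-to-one. The names `build_conversion_dict` and `convert`, the sample glyph IDs, and the sample words are all hypothetical.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# Assumption: a glyph code is a tuple of integer class IDs, one per glyph unit.
GlyphCode = Tuple[int, ...]

# Hypothetical sample data: two strings from the Unicode Mongolian block and
# made-up glyph-code sequences standing in for classifier output.
SAMPLE_WORDS: List[str] = [
    "\u182e\u1823\u1829\u182d\u1823\u182f",
    "\u182a\u1822\u1834\u1822\u182d",
]
SAMPLE_GLYPH_CODES: List[GlyphCode] = [(3, 7, 12), (5, 9)]


def build_conversion_dict(
    standard_words: Sequence[str], glyph_codes: Sequence[GlyphCode]
) -> Dict[GlyphCode, str]:
    """Pair the i-th glyph code with the i-th standard-code word."""
    if len(standard_words) != len(glyph_codes):
        raise ValueError("word list and glyph-code list must align one-to-one")
    table: Dict[GlyphCode, str] = {}
    for word, code in zip(standard_words, glyph_codes):
        # If two words share a glyph code, keep the first one seen.
        # This collision policy is an assumption of the sketch.
        table.setdefault(tuple(code), word)
    return table


def convert(table: Dict[GlyphCode, str], recognized: Sequence[int]) -> Optional[str]:
    """Return the standard-code word for a recognized glyph code, or None."""
    return table.get(tuple(recognized))


if __name__ == "__main__":
    table = build_conversion_dict(SAMPLE_WORDS, SAMPLE_GLYPH_CODES)
    print(convert(table, [3, 7, 12]) == SAMPLE_WORDS[0])
    print(convert(table, [99]))
```

Keying the table on the whole glyph-code sequence makes each lookup a single hash probe. A recognition result that contains even one misclassified glyph unit will miss the table and return `None`, so a practical system would need a fallback for that case.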
[Degree-granting institution]: Inner Mongolia University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP391.4
