Research on Methods for Extracting Mathematical Formulas from the Web
Published: 2019-01-10 19:05
[Abstract]: With the development of information technology, Web support for mathematical communication has steadily matured, and users who retrieve and manage mathematical formulas on the Web need the support of a mathematical formula search engine. Mathematical formula search engines are one of the research topics of third-generation intelligent search engines, and a crawler targeted at mathematical formulas is a critical component of formula search: its quality directly affects the functionality and performance of the search engine. This thesis focuses on such a formula-oriented crawler, covering mainly the recognition and extraction of mathematical formulas on the Web and the design of the system. Research on mathematical formula recognition has made considerable progress, but its results cannot be applied directly to formula communication and search. This thesis therefore studies the recognition of user-programmable formulas, concentrating on formulas in Web documents written in XML, LaTeX, and infix formats, as well as formulas in Microsoft Office and OpenOffice documents. It summarizes and analyzes how formulas in these representations appear on the Web and what external pattern features they exhibit, and uses pattern matching to recognize and extract them. On this basis, the mathematical crawler system MathCrawler was designed and implemented on top of the open-source software Nutch. MathCrawler has a sound system architecture and can crawl documents on the Internet that contain mathematical formula content and extract the formulas from them. Experiments show that the system performs well and extracts mathematical formulas with fairly good accuracy.
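The pattern-matching idea in the abstract can be illustrated with a minimal sketch. The abstract does not give MathCrawler's actual rules, and MathCrawler itself is built on Nutch, so everything here is an assumption for illustration only: the regular expressions, the function name `extract_formulas`, and the choice of Python. The sketch treats `<math>...</math>` elements as XML (MathML) formulas and the delimiters `$...$`, `$$...$$`, `\(...\)`, and `\[...\]` as LaTeX formulas.

```python
import re

# Hypothetical patterns; the thesis abstract does not list its real rules.
# MathML: any <math ...> ... </math> element, possibly spanning lines.
MATHML_RE = re.compile(r"<math\b[^>]*>.*?</math>", re.DOTALL | re.IGNORECASE)

# LaTeX: four common delimiter pairs, tried in this order at each position.
LATEX_RE = re.compile(
    r"\$\$(.+?)\$\$"      # display math: $$ ... $$
    r"|\\\[(.+?)\\\]"     # display math: \[ ... \]
    r"|\\\((.+?)\\\)"     # inline math:  \( ... \)
    r"|(?<![\\$])\$(?!\$)(.+?)(?<![\\$])\$(?!\$)",  # inline math: $ ... $
    re.DOTALL,
)


def extract_formulas(html: str) -> dict:
    """Return MathML fragments and LaTeX bodies found in an HTML string."""
    mathml = MATHML_RE.findall(html)
    # Blank out MathML first so its content is not scanned again as LaTeX.
    remainder = MATHML_RE.sub(" ", html)
    latex = []
    for match in LATEX_RE.finditer(remainder):
        # Exactly one of the four groups is set for each match.
        body = next(g for g in match.groups() if g is not None)
        latex.append(body.strip())
    return {"mathml": mathml, "latex": latex}


if __name__ == "__main__":
    sample = (
        "<p>Energy: $E = mc^2$ and \\(a^2+b^2=c^2\\)</p>"
        '<math xmlns="http://www.w3.org/1998/Math/MathML"><mi>x</mi></math>'
    )
    print(extract_formulas(sample))
```

Delimiter matching of this kind is deliberately simple: a page that uses dollar signs for currency would produce false positives, so a real crawler would need additional checks on the matched text.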
[Degree-granting institution]: Lanzhou University
[Degree level]: Master's
[Year degree conferred]: 2012
[Classification number]: TP393.09; TP391.3
Article ID: 2406685
Article link: http://www.sikaile.net/kejilunwen/sousuoyinqinglunwen/2406685.html