Research on Key Technologies and Systems for Precise Web Information Extraction

Published: 2018-02-24 15:01

Keywords: precise Web information extraction; browsing and navigation; data integration; data record; data item. Source: Nanjing University, 2017 doctoral dissertation. Document type: degree thesis.

[Abstract]: With the development of Internet technology, the Web has become the main platform on which enterprises and institutions worldwide publish information and deploy applications. The emergence of large numbers of websites and Web applications has caused the volume of data on the Web to grow sharply, and this massive body of data contains much valuable information. To obtain, analyze, and use that information, one usually first needs to obtain precise, useful structured data from the Web and then perform in-depth analysis on it. However, the wide distribution and autonomy of Web systems, the heterogeneity and unstructured nature of Web data, and the mismatch between the presentation structure of Web data and the target data structure make it a significant technical challenge to obtain precise, useful structured data from the Web effectively. Web information extraction is the research field that arose to address this problem. It studies how to extract the data users are interested in from presentation-oriented Web pages and convert it into structured data.

A complete Web information extraction process can be divided into three stages: page browsing and navigation, page data extraction, and page data integration. Most existing work, however, focuses on page data extraction and neglects navigation and integration, and therefore lacks a complete extraction capability and process. In addition, most existing work overemphasizes fully automatic analysis and extraction in a theoretical sense. There are two main kinds of such methods: automatic Web data extraction methods and open heterogeneous Web data extraction methods. The former ignore user needs and extract much redundant data that users are not interested in, so analysis applications must transform, clean, and filter the data in a second pass. The latter use no page-specific extraction rule templates and try to extract the data of interest from heterogeneous pages that describe the same entity, so their extraction precision is usually low.

To address these shortcomings, this thesis seeks to combine automated methods with the practical requirements of precise Web information extraction. Targeting the complete extraction process, it studies the basic models, languages, and key techniques of precise Web information extraction and presents the design and implementation of a corresponding prototype system. Specifically, the main research work and contributions are as follows:

(1) Basic model of three-stage integrated precise Web information extraction. First, a complete three-stage integrated model of precise Web information extraction is proposed. Then, for the three stages respectively, a page navigation model, a page data extraction model, and a page data integration model are proposed. The page navigation model builds an interaction and navigation action model, a navigation path model, and a page link relationship model to describe, respectively, user interaction actions, the navigation process, and the link relationships among pages. The page data extraction model builds a basic page data extraction model, a page data record model, and a data record and data item extraction rule model to describe, respectively, the extraction process, the structural forms of page data records, and the rule framework for extracting data records and data items. The page data integration model describes the basic process of converting source page data into data in the target structure.

(2) Rule system and language for three-stage integrated precise Web information extraction. Based on the basic model above, a three-stage integrated rule system and language for precise Web information extraction is designed. Corresponding to the three stages of the extraction process, it has three parts: a page navigation rule language, a page data extraction rule language, and a page data integration rule language. Compared with existing Web information extraction rule languages, its main advantages are: 1) the navigation rule language can define navigation rules for a wide range of complex navigation processes; 2) the extraction rule language can define extraction rules for data records of various complex structures; 3) the integration rule language allows integration rules to be defined conveniently and flexibly.

(3) Automatic Web data extraction. Existing automatic Web data extraction methods are mainly suited to data records with simple structure (contiguous, fixed-length, linear records) and have difficulty extracting records with complex structure (non-contiguous, variable-length, or nested records). To address this, two automatic extraction methods are proposed: one based on cohesion and a DAG (directed acyclic graph), and one based on deterministic finite automata. The former is suited to contiguous, fixed-length (or variable-length), linear data records, while the latter can extract data records of various simple or complex structures.

(4) Generation of precise Web information extraction rules. To help users efficiently generate robust extraction rules, a rule generation method is proposed that is based on user interaction, automatic page structure analysis, and supervised rule learning. Navigation rules are generated by automatically recording the user's interaction and navigation actions. For extraction rules, pages containing regular data records are first analyzed with the automatic extraction methods above, and the rules are then generated automatically through supervised rule learning; for pages containing irregular data records, the rules are generated from user interaction and supervised rule learning. Integration rules are written as simple scripting-language code.

(5) Design and implementation of a precise Web information extraction prototype system. To verify the effectiveness of the proposed models, rule language, and key techniques, a prototype system is designed and implemented. Experimental results show that the proposed models and key techniques are effective, achieve better extraction precision than existing techniques, and have stronger processing capability.
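The three stages named in the abstract (navigation, extraction, integration) can be illustrated with a minimal Python sketch. This is not the thesis's model or rule language: the in-memory `PAGES` site, the `li.rec` record markup, the `FIELD_TAGS` mapping, and the target schema are all hypothetical, chosen only to show how the stages hand data to one another.

```python
from html.parser import HTMLParser

# Hypothetical in-memory "site" (URL -> HTML); a real system would fetch pages.
PAGES = {
    "/list?page=1": '<ul><li class="rec"><b>Widget</b><i>$3.50</i></li>'
                    '<li class="rec"><b>Gadget</b><i>$12.00</i></li></ul>'
                    '<a rel="next" href="/list?page=2">next</a>',
    "/list?page=2": '<ul><li class="rec"><b>Doohickey</b><i>$7.25</i></li></ul>',
}

# Hypothetical extraction rule: which tag inside a record holds which data item.
FIELD_TAGS = {"b": "name", "i": "price"}


class RecordParser(HTMLParser):
    """Collects one dict per <li class="rec"> and remembers the rel="next" link."""

    def __init__(self):
        super().__init__()
        self.records = []
        self.next_url = None
        self._field = None  # data item currently being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and attrs.get("class") == "rec":
            self.records.append({})          # a new data record starts
        elif tag == "a" and attrs.get("rel") == "next":
            self.next_url = attrs.get("href")
        elif tag in FIELD_TAGS and self.records:
            self._field = FIELD_TAGS[tag]

    def handle_data(self, data):
        if self._field:
            self.records[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag in FIELD_TAGS:
            self._field = None


def extract(html):
    """Stage 2: turn one page into source records plus the next page to visit."""
    parser = RecordParser()
    parser.feed(html)
    return parser.records, parser.next_url


def integrate(record):
    """Stage 3: map a source record onto the (assumed) target schema."""
    return {"product": record["name"],
            "price_usd": float(record["price"].lstrip("$"))}


def run(start_url):
    results, url, seen = [], start_url, set()
    while url in PAGES and url not in seen:   # Stage 1: follow "next" links
        seen.add(url)
        records, url = extract(PAGES[url])
        results.extend(integrate(r) for r in records)
    return results


if __name__ == "__main__":
    for row in run("/list?page=1"):
        print(row)
```

Keeping the stages as separate functions mirrors the abstract's point that navigation and integration are distinct concerns from extraction: each rule (the link to follow, the tag-to-item mapping, the schema mapping) can change without touching the others.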

[Degree-granting institution]: Nanjing University
[Degree level]: Doctoral
[Year degree conferred]: 2017
[Classification number]: TP391.1;TP393.09

Article ID: 1530670


Article link: http://www.sikaile.net/shoufeilunwen/xxkjbs/1530670.html
