
Research on Automatic Indexing of Web Information

Published: 2018-06-27 01:35

  Topic: Web information + automatic indexing. Source: Zhejiang University, 2014 doctoral dissertation


[Abstract]: The growth of the Internet and the advance of informatization projects have turned Web information into a vast resource space that supports information exchange and sharing and affects every aspect of human life. Web information is massive, unordered, heterogeneous, continuously updated, and diverse. To retrieve needed resources from it quickly and accurately, people have come to recognize the importance of organizing and managing Web information and have begun to explore various processing methods, of which automatic indexing is one.

This study starts from the automatic extraction of index terms for Web information. It takes the Web coordinate system, the organizational structure of Web pages, and the reading habits of page viewers as its objects of study, in order to identify the specific factors that influence automatic indexing of Web information. Building on previous work, it proposes the following idea: using the page coordinate system, divide a web page into several regions with split ratios that depend on the site type; determine which region each Web information block belongs to and, for each site type, explore the weight that the blocks of each region should carry in automatic indexing; finally, write programs to verify the idea and complete every stage of automatic indexing. The specific steps are as follows.

(1) Implement Web page collection. As the research requires, both batch collection and manual collection of Web pages are implemented, and problems that arise during collection, such as page encoding conversion and HTML-to-XML conversion, are solved.

(2) Divide each Web page into 9 regions, using the Web page coordinate system together with the reading habits of page viewers. Each region occupies a certain proportion of the page. The information blocks in a region are treated as one cluster of blocks: they share the same indexing weight in later computation and are processed together.

(3) Find suitable page split ratios for different types of websites. Each type of website publishes page information in its own way. News sites, for example, tend to have few images and consist mainly of text reports; most of them also let viewers comment on a news item, so page height varies widely. Pages from news, sports, and science sites are tested with different split ratios to find the suitable ratio for each site type.

(4) Explore the weight of the information blocks of each region in automatic indexing. Viewers of a Web page have a visual focus and habitual reading patterns, so page designers arrange page information with corresponding emphasis. Whether the importance of information in different page regions can be identified therefore directly affects the accuracy of the later indexing results. Through sample experiments, the importance of the information blocks in different regions is explored for pages from news and science sites, and the indexing weights of the regional blocks are obtained for each site type.

(5) Implement automatic indexing of Web pages. Taking the noise in Web page information and the regional characteristics into account, and drawing on the characteristics of text-based methods, a method for automatic indexing of Web information is given, and programs are written to implement and verify it.

In addition, the dissertation examines how page width and page height correlate with the recall and accuracy of information extraction under different page split ratios, in the hope of helping future research in this field. In summary, it explores the key techniques at each stage of automatic indexing of Web information, discusses suitable split ratios for pages of different site types, and studies the relationship between the page coordinate system and the automatic indexing process. It offers a useful reference for related research.
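The region-based weighting idea in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the dissertation: the 3 x 3 grid layout of the 9 regions, the split ratios, the region weights, the `Block` type, and the names `region_of` and `score_terms` are all hypothetical, and whitespace tokenization stands in for real word segmentation.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Block:
    """A page information block, located by its center point (hypothetical type)."""
    x: float   # horizontal center, in pixels from the left edge
    y: float   # vertical center, in pixels from the top edge
    text: str  # visible text of the block


def region_of(block, page_width, page_height,
              x_cuts=(1 / 3, 2 / 3), y_cuts=(1 / 3, 2 / 3)):
    """Return the region index 0..8 (row-major) of a block.

    Assumes the 9 regions form a 3 x 3 grid; x_cuts and y_cuts are the
    split ratios, expressed as fractions of page width and height.
    """
    col = sum(block.x >= cut * page_width for cut in x_cuts)
    row = sum(block.y >= cut * page_height for cut in y_cuts)
    return row * 3 + col


def score_terms(blocks, page_width, page_height, region_weights,
                x_cuts=(1 / 3, 2 / 3), y_cuts=(1 / 3, 2 / 3)):
    """Sum, for every term, the weight of the region of each block it occurs in."""
    scores = Counter()
    for block in blocks:
        region = region_of(block, page_width, page_height, x_cuts, y_cuts)
        weight = region_weights[region]
        # Whitespace tokenization is a placeholder for real word segmentation.
        for term in block.text.lower().split():
            scores[term] += weight
    return scores


# Illustrative weights only: favor the top-center region, discount the bottom-right.
weights = [1.0] * 9
weights[1] = 3.0
weights[8] = 0.5

blocks = [
    Block(x=500, y=50, text="Web indexing"),  # falls in the top-center region
    Block(x=900, y=850, text="web ads"),      # falls in the bottom-right region
]
scores = score_terms(blocks, page_width=1000, page_height=900,
                     region_weights=weights)
top_terms = scores.most_common(2)  # candidate index terms, highest score first
```

In this sketch a site type would correspond to one choice of `x_cuts`, `y_cuts`, and `region_weights`; the abstract states that such values were found experimentally per site type but does not give them.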
[Degree-granting institution]: Zhejiang University
[Degree level]: Doctoral
[Year of degree]: 2014
[Classification number]: TP393.09




Article ID: 2072167


Article link: http://www.sikaile.net/guanlilunwen/ydhl/2072167.html

