A Method for Updating Personalized User Dictionaries Based on Web Information
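Keywords: web information extraction; new word discovery; new word classification; personalized loading; pinyin input method  Source: Harbin Institute of Technology, master's thesis, 2013  Document type: degree thesis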
[Abstract]: Chinese character input is one of the most important problems in Chinese information processing and a key component of intelligent human-computer interfaces. Among Chinese input methods, pinyin input best matches users' habits, and it has now entered its third generation, the cloud input method. Mainstream input methods emphasize personalization, which mainly takes two forms: word-frequency adjustment and automatic lexicon expansion. Word-frequency adjustment means continually adjusting the frequencies stored in the lexicon according to segmentation statistics of the user's input, so that candidate words are presented in the most reasonable order. Automatic lexicon expansion means using search engines or Internet crawling to gather very large training corpora (terabyte scale), so that words of all kinds can be added to the dictionary without restriction. This thesis improves the input method mainly through lexicon expansion. The most important part of lexicon expansion is new word discovery, which is the core of this thesis. To address it, the thesis carries out the following work:
(1) Extraction and processing of web information: a web crawler fetches Sina web pages and extracts their content. Because the extracted content still contains a large amount of junk information (such as advertisements and copyright notices), it must be purified. Web page purification is the process of parsing and filtering every page in the raw page library, extracting the useful information, marking the important information, and removing advertisements, copyright notices and other low-value content. After purification, a raw page becomes a page with clear structure, compact content and unambiguous information.
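The page purification step described above can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the class `PagePurifier`, the function `purify`, the set `SKIP_TAGS` and the junk keywords in `JUNK_PATTERN` are all hypothetical, and the abstract does not state which filtering rules were actually used. The sketch only shows the general idea of parsing a page, keeping the title and visible text, and discarding script/style content and lines that look like advertising or copyright notices.

```python
import re
from html.parser import HTMLParser

# Hypothetical junk markers; the thesis does not list its actual filter rules.
JUNK_PATTERN = re.compile(r"(广告|版权所有|Copyright|All Rights Reserved)", re.IGNORECASE)
# Tags whose text content is never part of the readable page.
SKIP_TAGS = {"script", "style", "noscript", "iframe"}


class PagePurifier(HTMLParser):
    """Collects the <title> text and the visible body text of one page."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.title_parts = []
        self.body_lines = []
        self._skip_depth = 0    # > 0 while inside a tag listed in SKIP_TAGS
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        text = data.strip()
        if not text or self._skip_depth:
            return
        if self._in_title:
            # The title is kept separately because it is marked as important.
            self.title_parts.append(text)
        elif not JUNK_PATTERN.search(text):
            self.body_lines.append(text)


def purify(html):
    """Return the title and the junk-free text lines of an HTML page."""
    parser = PagePurifier()
    parser.feed(html)
    parser.close()
    return {"title": "".join(parser.title_parts), "body": parser.body_lines}
```

Keeping the title apart from the body is a deliberate choice in this sketch, since the thesis later uses the title field to assign categories to new words.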
(2) Design and implementation of new word extraction: new words are extracted from the purified pages using an ordinary repeated-string statistics method. The Chinese text is split at punctuation marks and at entries of a stop-word list; the occurrences of every two-, three- and four-character string are then counted, and strings whose counts exceed a preset threshold become candidate new words. A repeated-string search algorithm then removes redundant substrings, and word-formation rules remove junk strings. Finally, the candidates are compared against the input method's own lexicon to form the new word lexicon.
(3) New word classification and personalized lexicon loading: examination of the raw pages shows that, in the purified page information, the title field also carries the category of the body text, so the category is extracted by matching. New words are classified in this way. According to the user's habits, one or more categories of the new word lexicon can be selectively loaded or removed, reflecting the user's individual preferences.
Finally, to evaluate the system realistically and objectively, precision, recall and F-measure are used to assess new word extraction, and character accuracy and line accuracy are used to compare the input method before and after the new word lexicon is added. The evaluation shows that new word extraction scores well on each measure and that adding the new word lexicon further improves the input method's performance.
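The new word extraction procedure described above (split at punctuation and stop words, count two- to four-character strings, apply a threshold, remove redundant substrings, compare with the existing lexicon) can be sketched in Python as follows. This is an illustration under stated assumptions, not the thesis's code: the function names, the punctuation set and the default threshold are hypothetical. The rule used here for redundant substrings (drop a string when a longer candidate contains it with the same count) is one common heuristic and is assumed, since the abstract does not specify the algorithm. The word-formation rules for junk strings are not implemented.

```python
import re
from collections import Counter

# Assumed punctuation set; the thesis does not list the marks it splits on.
PUNCTUATION = re.compile(r"[,。!?;:、()《》“”\s,.!?;:()]+")


def split_segments(text, stop_words):
    """Split text at punctuation, then at every stop word."""
    segments = [s for s in PUNCTUATION.split(text) if s]
    for stop in stop_words:
        segments = [part for seg in segments for part in seg.split(stop) if part]
    return segments


def count_ngrams(segments, min_len=2, max_len=4):
    """Count every character string of length min_len..max_len."""
    counts = Counter()
    for seg in segments:
        for n in range(min_len, max_len + 1):
            for i in range(len(seg) - n + 1):
                counts[seg[i:i + n]] += 1
    return counts


def extract_new_words(text, stop_words, lexicon, threshold=2):
    """Return candidate new words (with counts) that are not in the lexicon."""
    counts = count_ngrams(split_segments(text, stop_words))
    # Keep strings whose count exceeds the preset threshold.
    candidates = {w: c for w, c in counts.items() if c > threshold}
    # Assumed heuristic: a string is redundant when a longer candidate
    # contains it with the same count, so it never occurs on its own.
    kept = {
        w: c for w, c in candidates.items()
        if not any(w != longer and w in longer and c == candidates[longer]
                   for longer in candidates)
    }
    # Compare with the input method's existing lexicon.
    return {w: c for w, c in kept.items() if w not in lexicon}
```

For example, calling `extract_new_words` on a short text in which "雾霾天气" occurs three times, with stop words `["的"]` and a lexicon containing "天气", keeps only the four-character string, because each shorter piece occurs exactly as often as the string that contains it.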
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year degree awarded]: 2013
[Classification number]: TP391.14