Research on Machine Learning Methods That Exploit Unlabeled Data
Published: 2018-04-23 10:26
Topics: machine learning + semi-supervised learning; Source: Nanjing University, 2017 master's thesis
[Abstract]: Machine learning needs labeled data to train predictive models, but obtaining labels usually requires human effort and is therefore expensive. In many practical applications, unlabeled data can be collected easily and in large quantities, so how to exploit cheap unlabeled data has long been an active research topic in machine learning. Two approaches to exploiting unlabeled data have emerged. The first is semi-supervised learning, which automatically uses unlabeled data alongside labeled data to improve learning performance; most such methods do improve performance, but they all rest on underlying model assumptions, and performance can degrade when those assumptions deviate from the data distribution. The second is crowdsourcing, which labels data at relatively low cost so that the originally unlabeled data can be used accurately, reducing learning risk. This thesis studies semi-supervised learning and crowdsourcing and makes the following contributions. First, because co-training, an important semi-supervised learning paradigm, is easily affected by insufficient views, a new weighted co-training algorithm is proposed. When the views are insufficient, the co-training process produces samples that are inconsistent with the optimal classifier; the algorithm detects potentially inconsistent samples and lowers their weights to reduce their influence on training. Experiments show that, compared with standard co-training, the algorithm generalizes better and is more robust. Second, because the labels obtained through crowdsourcing depend on task difficulty, a new task assignment algorithm is proposed. The algorithm estimates the difficulty of a subset of tasks, uses them to build a training set, learns a model that predicts difficulty, and divides tasks into easy and hard ones. Easy tasks are labeled through crowdsourcing, whereas for hard tasks experts are hired to provide high-quality labels. Experiments show that the algorithm improves label quality while reducing labeling cost. In addition, the thesis studies model reuse with unlabeled data. In this setting the user needs to combine several pre-trained models that cannot be modified, and a new multi-view model reuse algorithm is proposed for it. The algorithm estimates the reliability of the pre-trained models via belief propagation, guides this estimation with multi-view consistency on unlabeled data, and then ensembles the pre-trained models weighted by the estimated reliability. Experiments show that the method significantly improves classification accuracy.
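The reliability-weighted combination of fixed pre-trained models described at the end of the abstract can be illustrated with a small sketch. It is not the thesis's algorithm: the thesis estimates reliability with belief propagation guided by multi-view consistency, whereas the sketch below substitutes a much simpler proxy, namely each model's agreement rate with the majority vote on unlabeled samples. The function names and the toy predictions are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict


def majority_vote(votes):
    """Return the most common label; ties go to the label seen first."""
    return Counter(votes).most_common(1)[0][0]


def estimate_reliability(predictions):
    """Estimate one reliability score per model from unlabeled data.

    predictions[m][i] is the label that fixed model m assigns to
    unlabeled sample i. Assumption: reliability is approximated by the
    fraction of samples on which a model agrees with the majority vote
    (a crude consistency proxy, NOT belief propagation).
    """
    n_samples = len(predictions[0])
    consensus = [
        majority_vote([p[i] for p in predictions]) for i in range(n_samples)
    ]
    return [
        sum(p[i] == consensus[i] for i in range(n_samples)) / n_samples
        for p in predictions
    ]


def weighted_ensemble(predictions, weights):
    """Combine the models' labels with a reliability-weighted vote."""
    combined = []
    for i in range(len(predictions[0])):
        score = defaultdict(float)
        for p, w in zip(predictions, weights):
            score[p[i]] += w  # each model votes with its reliability
        combined.append(max(score, key=score.get))
    return combined


# Toy predictions of three unmodifiable models on six unlabeled samples.
preds = [
    [1, 0, 1, 1, 0, 1],  # model A
    [1, 0, 1, 0, 0, 1],  # model B
    [0, 1, 1, 1, 1, 0],  # model C
]
weights = estimate_reliability(preds)
print([round(w, 2) for w in weights])     # [1.0, 0.83, 0.33]
print(weighted_ensemble(preds, weights))  # [1, 0, 1, 1, 0, 1]
```

In this toy case model C disagrees with the other two on most samples, so it receives a low weight and has little influence on the final labels. No labeled data is used at any point, which is the property the abstract emphasizes.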
[Degree-granting institution]: Nanjing University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP181
Article ID: 1791566
Article link: http://www.sikaile.net/kejilunwen/zidonghuakongzhilunwen/1791566.html