Research on Data Selection and Learning Algorithms under Big Data

Published: 2018-01-03 04:28

Keywords: Research on Data Selection and Learning Algorithms under Big Data　Source: Xidian University, 2015 doctoral dissertation　Document type: degree thesis

[Abstract]: The age of information explosion has brought information that is unprecedented in both variety and volume. With the rapid development and wide adoption of computer communication and Internet technologies, and of the Internet of Things enabled by all kinds of sensors, collecting large amounts of data has become easy and cheap. This supplies the data needed for the rapid progress of machine learning, pattern recognition, and computer vision, which the field of artificial intelligence urgently requires. However, how to select data effectively and how to learn useful information from data have become important problems facing researchers. This thesis systematically studies data selection and the learning of the intrinsic subspace and manifold information of data, through model building, algorithm design, and analysis, and applies the resulting algorithms to engineering problems such as collaborative filtering, image inpainting, and video background modeling. The main contributions are as follows.

1. Manually labeling massive data is costly in labor and time, so active learning, a method well suited to minimizing labeling cost, has attracted growing attention. Among existing active learning algorithms, some exploit the structure of the unlabeled data but need extra computation, such as hierarchical clustering, to select representative points; some must pre-train multiple classifiers in every iteration and pick the data to be labeled from an ensemble perspective; and some consider only the data closest to the optimal decision surface in each iteration. To overcome these shortcomings, we propose a pairwise K-nearest-neighbor pseudo-editing active learning algorithm. The method is inspired by K-nearest-neighbor editing as a preprocessing step; in each iteration it trains only one classifier and considers several data points near the optimal separating hyperplane. We also give the corresponding complexity analysis and parameter analysis. Extensive experiments show that, compared with other mainstream active learning algorithms, the proposed algorithm achieves good classification performance while querying and labeling only a small number of samples.

2. Low-rank matrix completion and recovery is a typical practical problem of learning the intrinsic structure and information of data from known entries. In recent years the problem has been solved well in the batch (data pool) setting by trace-norm minimization or other variants of singular value decomposition. In that setting the scale of the data, the sample size, the number of video frames, and so on are known in advance, so the problem can be solved by performing a singular value decomposition of the (sparse) data matrix in every iteration. The time complexity is very high, however, so such methods are not suited to real-time settings. To model the background of a video stream in real time, this thesis proposes an online gradient descent model on the Grassmannian manifold under a […]-norm framework. With this model, matrix completion and recovery can be solved online in a data-stream setting. By introducing Riemannian manifold optimization, the optimal subspace can be found along geodesics of the Grassmannian manifold. As a form of incremental learning, each iteration involves only one data sample (a vector). The […]-norm framework is designed to approximately recover the original data from data contaminated by sparse large noise (outliers) and Gaussian noise. An iterative algorithm based on the alternating direction method of multipliers and Grassmannian manifold optimization is proposed to solve online robust low-rank matrix completion, robust low-rank matrix recovery, and background modeling in video surveillance. In addition, a novel adaptive step-size strategy is proposed to track changes of the subspace effectively. Extensive experiments on synthetic and real data show that the method is more robust and more effective than other mainstream algorithms.

3. Learning intrinsic subspace information from known data can be generalized to learning the Riemannian quotient manifold structure behind a full-rank matrix factorization, where the low-rank constraint is expressed through the full-rank factorization. To solve more general matrix completion problems, including ill-conditioned and large-scale matrices, this thesis analyzes the mainstream Riemannian manifold optimization algorithms from the viewpoint of the metric and, for the first time, constructs a novel Riemannian metric on the horizontal subspace of the tangent space of the Riemannian quotient manifold, based on the Riemannian geometric structure and the scale information of the objective function. The components needed for optimization on the quotient manifold are redesigned and computed. To validate the constructed metric, a nonlinear conjugate gradient method on the Riemannian quotient manifold is adopted. Extensive numerical experiments comparing convergence show that the proposed Riemannian metric outperforms existing ones, and that the nonlinear conjugate gradient algorithm using the new metric converges better than mainstream low-rank matrix completion algorithms.

4. Combining multiple individual classifiers to improve on a single classifier has become a research hotspot in recent years. The question that follows is whether all of the individual classifiers produced actually help reduce the generalization error of the ensemble. Balancing the diversity among individual classifiers against their individual accuracy is both the starting point and the main difficulty of designing ensemble learning algorithms. This thesis therefore proposes a selective ensemble algorithm based on integer matrix factorization. The algorithm addresses diversity and accuracy separately. To increase the diversity among individual classifiers, it takes their predicted labels as the original targets and brings in the true labels to build an integer matrix representing the individual classifiers; factorizing this matrix yields projection directions for the individual classifiers and, finally, new individuals. To guarantee the performance of the transformed individuals, a standard performance criterion is used to remove poorly performing individuals from the ensemble. Experiments on one-dimensional radar range profiles show that the algorithm effectively balances diversity among individuals and individual accuracy, and that it improves radar target recognition accuracy compared with single classifiers and other ensemble methods.

5. In a supervised learning task, when training samples in the target domain are very scarce, the learning and generalization performance of the target-domain classifier inevitably suffers. One remedy is to use active learning to select informative target-domain samples and label them to enlarge the training set. In addition, in some real settings other labeled samples already exist and are easier to obtain than target-domain training samples, but they follow a different data distribution from the target-domain samples; these differently distributed labeled samples form the source domain. Transfer learning is therefore introduced to handle classification when target-domain training samples are scarce. We propose two new transfer learning algorithms. The first is based on the rotation forest space transformation: it projects source-domain samples into the space formed by the target domain via the rotation forest transformation, and selects usable source-domain samples to help train the target-domain classifier by measuring the similarity between the transformed source-domain samples and the target-domain samples. Text classification experiments show that it achieves better classification performance than other algorithms. The second is a data-driven linear space mapping transfer ensemble algorithm. It projects source-domain samples into the space of target-domain samples that are easily misclassified, and thereby selects samples that help target-domain classification and adds them to the target domain to improve classification performance. In particular, to select source-domain samples more effectively, the source-domain samples are randomly partitioned, the projection is applied to each subset separately, and the results obtained from the subsets are then combined. Classification experiments on UCI data and synthetic aperture radar target images show that the algorithm effectively improves target-domain classification performance compared with other algorithms and reduces the instability of a single transfer.
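To make the online subspace update of contribution 2 more concrete, here is a minimal sketch of a single geodesic gradient step on the Grassmannian from one partially observed vector. It is not the thesis's algorithm: it assumes a plain least-squares loss in the style of the GROUSE algorithm rather than the robust norm framework and multiplier-based solver described in the abstract, and it uses a fixed step size `eta` instead of the adaptive strategy. All function names (`solve`, `fit`, `residual_norm`, `geodesic_step`) are hypothetical, and `U` is assumed to hold an orthonormal basis stored as a list of rows.

```python
import math


def solve(A, b):
    """Solve the small dense system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        if abs(M[piv][c]) < 1e-12:
            raise ValueError("observed rows do not determine the weights")
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def fit(U, idx, vals):
    """Least-squares weights w on the observed rows, prediction p = U w,
    and the residual r (zero on unobserved rows)."""
    n, d = len(U), len(U[0])
    G = [[sum(U[i][a] * U[i][b] for i in idx) for b in range(d)] for a in range(d)]
    g = [sum(U[i][a] * v for i, v in zip(idx, vals)) for a in range(d)]
    w = solve(G, g)
    p = [sum(U[i][a] * w[a] for a in range(d)) for i in range(n)]
    r = [0.0] * n
    for i, v in zip(idx, vals):
        r[i] = v - p[i]
    return w, p, r


def residual_norm(U, idx, vals):
    """Distance from the observed entries to the current subspace."""
    return norm(fit(U, idx, vals)[2])


def geodesic_step(U, idx, vals, eta):
    """Rotate the basis U along the Grassmannian geodesic in the descent direction.

    Assumption: least-squares loss and fixed step size eta (illustrative only).
    The update is U + ((cos t - 1) p/|p| + sin t r/|r|) w^T/|w| with t = eta |r| |p|,
    which keeps the columns of U orthonormal.
    """
    w, p, r = fit(U, idx, vals)
    rn, pn, wn = norm(r), norm(p), norm(w)
    if min(rn, pn, wn) < 1e-12:
        return [row[:] for row in U]  # sample already explained, nothing to update
    t = eta * rn * pn
    cp = (math.cos(t) - 1.0) / pn
    cr = math.sin(t) / rn
    d = len(U[0])
    return [[U[i][a] + (cp * p[i] + cr * r[i]) * w[a] / wn for a in range(d)]
            for i in range(len(U))]


if __name__ == "__main__":
    U = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
    sample = ([0, 1, 2, 3], [1.0, 0.0, 1.0, 0.0])
    print(residual_norm(U, *sample))
    U = geodesic_step(U, *sample, eta=0.1)
    print(residual_norm(U, *sample))
```

Each call touches a single sample, which is what makes the approach usable on a data stream; in the small demo the residual of the sample shrinks after the step while the basis stays orthonormal. A robust variant would replace the least-squares fit in `fit` with a loss that tolerates outliers.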

[Degree-granting institution]: Xidian University
[Degree level]: Doctoral
[Year conferred]: 2015
[Classification number]: TP181


Article ID: 1372378


Article link: http://www.sikaile.net/shoufeilunwen/xxkjbs/1372378.html
