Research on Software Fault Prediction Methods Based on Transfer Learning and PU Learning

Published: 2019-01-03 14:10

[Abstract]: With the continuing development of artificial intelligence, machine learning has been applied to software fault prediction. Traditional machine-learning-based software fault prediction needs a large number of labeled samples to build a model. In practice, labeled software fault data are usually obtained through manual testing, which is time-consuming, labor-intensive, and costly. To reduce the demand of traditional software fault prediction methods for labeled samples in supervised learning settings, this thesis studies two directions, positive and unlabeled learning (PU learning) and transfer learning. For the PU setting, it proposes to predict target fault samples by transferring knowledge from cross-company, cross-project positive and unlabeled fault data. The specific work is as follows.

(1) An instance transfer algorithm based on random forests in the PU setting (the POSTRF algorithm). Working in the PU setting and following the idea of Bayesian cross-class transfer, the algorithm treats the samples to be predicted as the target domain dataset and the cross-company, cross-project software fault samples as the source domain dataset. It samples the source domain dataset with replacement to train multiple PU random decision trees, computes sample weights from the AUC values obtained by testing on the target domain data and from the samples in each sampled set, transfers the samples whose distribution is similar to that of the target domain data, combines them with the target domain data to build a PU dataset, and builds a model based on the POSC4.5 algorithm to predict software fault samples in the target domain. In detail, the algorithm first samples the source domain dataset with replacement at the ratio bagSize to obtain M sampled sets and trains M PU random decision trees. It then randomly draws 75% of the target domain samples as a test set and evaluates the M random decision trees on it. The AUC (Area Under the ROC Curve) of each tree is taken as that tree's weight, the samples of each sampled set are weighted by the tree weight, and the sample weights of all sampled sets are merged into the final sample weights. The samples with the higher weights are transferred at transfer ratio r, which completes the instance transfer. A PU dataset is built from the transferred samples and the target domain dataset under the completely-at-random assumption. The uncertain information gain of each attribute is computed from the number of positive samples, the number of unlabeled samples, and the prior probability of the positive class. The attribute with the largest uncertain information gain is chosen as the branch node, and the tree model is grown recursively from the top down to predict the fault samples of the target domain.

(2) Experiments on the POSTRF algorithm. Eight software fault datasets from the NASA database are used as experimental data. The 0kc3 and cm1 datasets are each used in turn as the target domain dataset, with the remaining datasets as source domain datasets, and the proposed algorithm is compared with the POSC4.5 algorithm. The results show that, on the 0kc3 and cm1 target sets, POSTRF improves classification performance by transferring instances from the other auxiliary sets: the AUC value increases by about 3%-12% and the fault prediction rate (PD) increases by about 5%. The proposed POSTRF algorithm, which transfers knowledge from cross-project and cross-company software fault data, therefore achieves prediction performance on target domain fault samples that is comparable to or better than that of the traditional PU learning algorithm.
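The instance-weighting step described in (1) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the thesis implementation: the function names are hypothetical, `bag_size` and `r` are read as fractions of the source set, the merge rule for sample weights is assumed to be a sum over the sampled sets that contain a sample, and the training of the PU random decision trees is left out, so the per-tree AUC values are passed in directly.

```python
import random


def bootstrap_bags(n_source, bag_size, m, seed=0):
    """Draw m index bags from the source set, sampling with replacement.

    Assumption: bag_size is a fraction of the source set size.
    """
    rng = random.Random(seed)
    size = round(bag_size * n_source)
    return [[rng.randrange(n_source) for _ in range(size)] for _ in range(m)]


def auc(scores, labels):
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def instance_weights(bags, tree_aucs, n_source):
    """Give each source sample the summed AUC of the trees whose bag holds it.

    Assumption: weights are merged by summation, and a sample that is drawn
    several times into the same bag is counted once for that bag.
    """
    weights = [0.0] * n_source
    for bag, tree_auc in zip(bags, tree_aucs):
        for i in set(bag):
            weights[i] += tree_auc
    return weights


def select_transfer(weights, r):
    """Return the indices of the top r fraction of samples by weight.

    Ties are broken by the lower index so the result is deterministic.
    """
    k = max(0, min(len(weights), round(r * len(weights))))
    ranked = sorted(range(len(weights)), key=lambda i: (-weights[i], i))
    return ranked[:k]


# Hypothetical usage: 10 source samples, 3 trees, made-up AUC values.
bags = bootstrap_bags(n_source=10, bag_size=0.8, m=3, seed=42)
tree_aucs = [0.71, 0.64, 0.58]  # illustrative only, not thesis results
weights = instance_weights(bags, tree_aucs, n_source=10)
transferred = select_transfer(weights, r=0.3)
```

In a full pipeline each entry of `tree_aucs` would come from calling `auc` on one tree's scores for the 75% target domain test split, and the indices in `transferred` would be merged with the target domain data before the final tree is built.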
[Degree-granting institution]: Northwest A&F University
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: TP311.53


Article link: http://www.sikaile.net/kejilunwen/ruanjiangongchenglunwen/2399484.html
