
Personal Credit Scoring Modeling and Analysis Based on Data Mining

Published: 2019-04-11 07:22

[Abstract]: As the economy develops, more and more households need credit for housing, cars, education, and daily consumption. Avoiding potential personal credit risk is therefore a major challenge for banks and other lending institutions, and a personal credit model built with statistical methods or data mining techniques that predicts the probability of individual default with reasonable accuracy is of great value to them. Predicting personal credit is essentially a classification problem: individual consumers are divided into those who repay principal and interest on schedule ("good" customers) and those who default ("bad" customers). This thesis models the problem with Logistic regression and decision tree classification, compares the strengths and weaknesses of the two methods, and selects the better model.

The empirical study uses Kaggle competition data together with SAS and SPSS. First, the original data are randomly sampled in SAS and divided into training, validation, and test sets. The data are then preprocessed: missing values and outliers are examined, multicollinearity is tested, and imputation and variable cluster analysis are applied accordingly to screen variables. From the ten variables x1-x10, five (x1, x2, x4, x8, x9) are retained for Logistic regression modeling. The full-model method of Logistic regression analysis then yields three candidate models; parameter estimation and model significance tests reduce these to two prediction models, each with an AUC of 0.714, which the thesis regards as fairly satisfactory predictive performance. To choose a robust and parsimonious model, ROC curves are drawn and AUC values are computed on the validation set, where both models exceed 0.70. A comprehensive comparison gives the final Logistic regression model, built on x2, x8, and x9.

Next, a decision tree classification model is built on the training set in SPSS with the Exhaustive CHAID algorithm, which selects five variables: x1, x3, x4, x7, x9. Checked on the validation set, the model has an AUC of 0.839, indicating good robustness. Finally, the two models are compared on the test set: the sum of squared errors between the predicted default probability p and the actual value is 823.298 for the Logistic regression model and 231.559 for the decision tree model. The decision tree model therefore outperforms the Logistic regression model in both prediction accuracy and robustness.
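
The two comparison metrics named in the abstract, the AUC and the sum of squared errors between the predicted default probability and the actual outcome, can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the thesis's SAS/SPSS workflow: the function names `auc` and `sum_squared_error`, the label coding (1 for a "bad" customer, 0 for a "good" one), and the toy data are all invented for the example.

```python
from typing import Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve, computed by pairwise comparison.

    Equals the probability that a randomly chosen "bad" customer (label 1)
    gets a higher predicted default probability than a randomly chosen
    "good" customer (label 0); ties count as one half.
    """
    bad = [s for y, s in zip(labels, scores) if y == 1]
    good = [s for y, s in zip(labels, scores) if y == 0]
    if not bad or not good:
        raise ValueError("AUC needs at least one example of each class")
    wins = 0.0
    for b in bad:
        for g in good:
            if b > g:
                wins += 1.0
            elif b == g:
                wins += 0.5
    return wins / (len(bad) * len(good))


def sum_squared_error(labels: Sequence[int], probs: Sequence[float]) -> float:
    """Sum of squared differences between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(labels, probs))


# Toy data, made up for illustration only (not taken from the thesis).
y_true = [0, 0, 1, 1, 0, 1]
model_a = [0.10, 0.40, 0.35, 0.80, 0.20, 0.70]  # hypothetical model A
model_b = [0.30, 0.60, 0.40, 0.55, 0.50, 0.45]  # hypothetical model B

for name, probs in (("model_a", model_a), ("model_b", model_b)):
    print(name, auc(y_true, probs), sum_squared_error(y_true, probs))
```

A higher AUC and a lower sum of squared errors both point to the better model, which is the direction of the comparison reported in the abstract.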
[Degree-granting institution]: Central China Normal University
[Degree level]: Master's
[Year degree conferred]: 2016
[Classification number]: TP311.13


Article ID: 2456202


Article link: http://www.sikaile.net/jingjilunwen/jiliangjingjilunwen/2456202.html
