Research on Graduate Income Prediction and Analysis Based on Machine Learning
Published: 2018-04-05 18:37
Topic: educational information. Focus: data mining. Source: Jilin University, master's thesis, 2017
[Abstract]: With the arrival of the information age, information technology keeps reshaping the economy, society, culture and everyday life, and education has been deeply affected as well. Educational databases are growing steadily larger, so the field urgently needs efficient techniques to process, analyze and use these large-scale data, and to mine from them information that is useful to education practitioners at different levels. Against this background, this thesis uses machine learning algorithms to analyze in depth the data set used by Score Card, a U.S. college recommendation website, and builds regression and classification models that take school characteristics as input and the average income of a school's graduates as output. With these models, the average income of a university's graduates can be reasonably predicted from the university's characteristics, which can support the effective allocation of funds such as education-department grants and inform the founding of private schools. The main work of this thesis is as follows:
1. Univariate linear regression is used to model the relationship between each university-level feature and the target value, and the effect of each individual feature on graduates' average income is analyzed and interpreted. The performance of a multivariate regression model and a KNN regression model in predicting graduates' average income is then compared.
2. A KNN polynomial regression algorithm that incorporates KNN regression is proposed. It performs better on the validation set than both multivariate regression and KNN, but takes longer to train. Because the data behind graduate income prediction do not change frequently, the algorithm's advantage on this regression problem remains clear even though its time complexity is the sum of the time complexities of the two base algorithms.
3. Four methods are used to classify graduates' average income: logistic regression, decision tree, KNN and Adaboost. Of the four, Adaboost has the highest classification accuracy and KNN the lowest, performing even worse than random prediction. Logistic regression also produced a special case in which recall was 100%.
4. A recall-based logistic regression algorithm is proposed. If a trained logistic regression model shows excessively high recall or precision on the validation and training sets, the training set can be partitioned according to whichever metric is too high, and a model is trained on each resulting subset. The original single-layer model thus becomes a two-layer model, whose actual accuracy must be checked on the validation set. The procedure can be applied recursively until accuracy on the validation set starts to fall as model depth increases.
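The 100% recall case mentioned for logistic regression is easier to read with the metric definitions at hand. The following is a minimal Python sketch, not the thesis's code: it computes precision and recall for binary labels and shows one common way recall reaches 100%, namely a classifier that labels every example positive. The abstract does not state that this is what happened in the thesis. The function name precision_recall, the toy labels, and the "above-median income" class definition are assumptions made up for illustration.

```python
from typing import Sequence, Tuple


def precision_recall(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (precision, recall) for binary labels, where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up labels for illustration only: 1 = "above-median income", 0 = "below".
y_true = [1, 0, 1, 0, 0, 1, 0, 0]

# A degenerate classifier that labels every school as positive.
always_positive = [1] * len(y_true)

# A classifier that makes one false positive and one false negative.
mixed = [1, 0, 1, 1, 0, 0, 0, 0]

p_all, r_all = precision_recall(y_true, always_positive)
p_mix, r_mix = precision_recall(y_true, mixed)
print(f"always positive: precision={p_all:.3f}, recall={r_all:.3f}")
print(f"mixed:           precision={p_mix:.3f}, recall={r_mix:.3f}")
```

On these toy labels the always-positive classifier reaches full recall while its precision stays low, which is why a very high recall on its own is a warning sign rather than evidence of a good model.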
[Degree-granting institution]: Jilin University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP181
Article ID: 1715970
Article link: http://www.sikaile.net/kejilunwen/zidonghuakongzhilunwen/1715970.html