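Research and Application of Medical Insurance Fraud Detection Based on the Hadoop Platform
Topic: clustering + classification; Reference: University of Electronic Science and Technology of China, 2017 master's thesis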
[Abstract]: As China's medical and economic conditions continue to improve, medical insurance coverage has become very broad, and ordinary people enjoy real benefits from medical insurance policy. At the same time, abuse of medical insurance funds is growing more severe, and more and more funds are being obtained fraudulently, so cracking down on fraud is imperative. At present, medical insurance agencies mainly use rule systems to audit settlement information. These rules depend on a small number of indicators, and because they are incomplete and slow to be updated, relatively static rules are easily deceived by carefully forged data. Computer-assisted review is therefore urgently needed. This thesis analyzes the characteristics of medical insurance data, establishes a fraud detection workflow using data mining techniques, and, in combination with the business system, implements fraud detection and auditing for medical insurance big data. The main contents are as follows:

1. Feature engineering on the raw data. For historical reasons, the existing data set has many defects. The raw data is first processed with feature engineering, including removing noisy data, filling in missing values, and extracting features based on the actual business process.
2. Coarse-grained fraud screening based on DBSCAN. Because the data is extremely imbalanced, the thesis studies the application of unsupervised algorithms to fraud detection. It compares the results of various clustering algorithms on the data set and, with the help of label information, settles on the DBSCAN algorithm to identify anomalous clusters.
3. Precise fraud detection based on density sampling and random forest. Building on the anomalous groups separated out by clustering, a density-based sampling method is proposed to rebalance the data, and the sampling information is used within the random forest algorithm to select and ensemble the sub-classifiers. Combining classification with clustering greatly improves accuracy and yields a complete fraud detection framework.
4. Parallel implementation on the Hadoop platform. For large-scale data, parallel versions of DBSCAN and random forest are proposed and implemented with Map-Reduce on the Hadoop platform, completing a fraud detection and auditing system.

This thesis applies data mining techniques to anomaly detection in medical insurance. Its contributions are as follows. First, it is no longer limited to modeling specific fraud scenarios, so it can identify some relatively rare data and generalizes more broadly. Second, using local density as the link, it proposes a density-based sampling method that combines the DBSCAN algorithm with the random forest algorithm, maintaining high accuracy while effectively controlling over-fitting. Third, alongside the parallel algorithms, it proposes a partitioning method for high-dimensional data that reflects the idea of load balancing.
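The DBSCAN-based coarse screening step described in the abstract can be illustrated with a small, single-machine sketch. This is not the thesis's implementation: the two features per claim, the `eps` and `min_pts` values, the `min_cluster_size` threshold, and the helper name `flag_suspicious` are all assumptions chosen for illustration, and the abstract does not state how anomalous clusters are actually selected. The sketch runs a textbook DBSCAN and then treats noise points and very small clusters as candidates for further review.

```python
from collections import Counter
from math import dist

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id >= 0, or NOISE (-1)."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already assigned to a cluster or marked as noise
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may be relabeled later as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
    return labels


def flag_suspicious(labels, min_cluster_size):
    """Hypothetical rule: noise and undersized clusters go to manual review."""
    sizes = Counter(label for label in labels if label != NOISE)
    return [label == NOISE or sizes[label] < min_cluster_size for label in labels]


# Made-up claims: (total cost in thousands, visits per month).
claims = [
    (1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2), (1.0, 0.8), (0.8, 1.0),
    (5.0, 5.0), (5.1, 5.1), (5.0, 5.2),  # small, dense group far from the rest
    (9.0, 0.5),                          # isolated claim
]
labels = dbscan(claims, eps=0.5, min_pts=3)
flags = flag_suspicious(labels, min_cluster_size=5)
```

With these made-up values, the six similar claims form one large cluster, while the three-claim group and the isolated claim are flagged. In a pipeline like the one the abstract outlines, the flagged records would be passed on to the supervised stage rather than treated as confirmed fraud.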
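[Degree-granting institution]: University of Electronic Science and Technology of China
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: F842.684; TP311.13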
Article ID: 1999822
Article link: http://www.sikaile.net/jingjilunwen/bxjjlw/1999822.html