MOTOR INSURANCE FRAUD DETECTION USING TEXT ANALYSIS AND MACHINE LEARNING

Files

gs631130117.pdf (4.38 MB)

Date

27/5/2022

Publisher

Srinakharinwirot University

Abstract

The purpose of this research was to analyze text attributes together with categorical attributes in order to build a fraud detection model using machine learning techniques. The dataset consisted of motor insurance claims from the Asia Insurance Company 1950 (Public) that arose between January 2020 and December 2020, combined with fraudulent claims data collected from January 2020 to April 2021, for a total of 58,579 records. Four machine learning (ML) algorithms were applied to the dataset: Naive Bayes classifier, Logistic Regression, Random Forest, and Support Vector Machine. Two methods for handling the imbalanced dataset were compared: random oversampling and SMOTE. The models were evaluated using Accuracy, Precision, Recall, and F1-Score. Random Forest with SMOTE achieved the best results: Accuracy=0.99, Precision=0.803, Recall=0.241, and F1-Score=0.371.
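The difference between the two imbalance-handling methods named in the abstract can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the thesis's implementation: the function names, the toy feature values, and the choice of k are all assumptions, and it presumes the text and categorical attributes have already been encoded as numeric vectors.

```python
import math
import random


def random_oversample(minority, n_new, rng):
    """Random oversampling: copy randomly chosen minority rows (with replacement)."""
    return [list(rng.choice(minority)) for _ in range(n_new)]


def smote(minority, n_new, k, rng):
    """SMOTE-style sampling: interpolate between a minority row and one of
    its k nearest minority-class neighbours to create a synthetic row."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority rows
        others = [row for row in minority if row is not base]
        neighbours = sorted(others, key=lambda row: math.dist(base, row))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


rng = random.Random(0)
# Toy numeric features for the rare (fraud) class; the values are made up.
fraud = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]

duplicates = random_oversample(fraud, n_new=6, rng=rng)  # exact copies
synthetic = smote(fraud, n_new=6, k=2, rng=rng)          # new in-between points
```

Random oversampling only repeats existing fraud rows, whereas the SMOTE-style sampler produces new points lying between existing ones. In either case, resampling would normally be applied to the training split only, so that evaluation still reflects the original class balance.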
The objective of this research was to analyze text data together with other attributes and apply machine learning techniques to build a model that predicts the probability that a claim is fraudulent, and to compare the performance of classification models in combination with methods for handling imbalanced data. The dataset consisted of motor insurance claims from the Asia Insurance Company 1950 (Public) for claims that occurred between January 2020 and December 2020, with fraud records collected from January 2020 to April 2021, for a total of 58,579 rows. Four main experiments were carried out: 1. building models on the imbalanced data; 2. building models on data balanced with Random Oversampling; 3. building models on data balanced with SMOTE; 4. tuning the parameters of the selected model and imbalance-handling method. The experiments were compared on Accuracy, Precision, Recall, and F1-Score. The best-performing model was Random Forest with SMOTE as the imbalance-handling method, giving Accuracy=0.99, Precision=0.803, Recall=0.241, and F1-Score=0.371, with a training time of only 12 minutes. Random Forest combined with SMOTE therefore gave better results while requiring little training time. Regarding text versus non-text attributes, the model still assigned more importance to the non-text attributes.
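The reported scores can be related to one another with the standard metric definitions. The sketch below recomputes the F1-Score from the reported Precision and Recall, and uses a hypothetical confusion matrix (the counts are invented for illustration and are not from the thesis) to show how Accuracy can stay near 0.99 on imbalanced data even when Recall is low.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall, f1_score(precision, recall)


# Consistency check on the values reported in the abstract.
reported_f1 = f1_score(0.803, 0.241)
print(round(reported_f1, 3))  # 0.371

# Hypothetical counts (NOT from the thesis): 100 fraud cases in 10,000 claims.
accuracy, precision, recall, f1 = classification_metrics(tp=24, fp=6, fn=76, tn=9894)
print(round(accuracy, 4), round(precision, 2), round(recall, 2))  # 0.9918 0.8 0.24
```

Because legitimate claims dominate the data, correctly classifying them is enough to push Accuracy close to 1, which is why Precision, Recall, and F1-Score are the more informative figures for the fraud class.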

Keywords

Motor Claim Fraud, Text Analytics, Machine Learning, Imbalance Data, Random Forest Technique

URI

http://ir-ithesis.swu.ac.th/dspace/handle/123456789/1700

Collections

Faculty of Science
