TGANs with Machine Learning Models in Automobile Insurance Fraud Detection and Comparative Study with Other Data Imbalance Techniques
Rohan Yashraj Gupta1, Satya Sai Mudigonda2, Pallav Kumar Baruah3

1Rohan Yashraj Gupta*, Department of Mathematics and Computer Science, Sri Sathya Sai Institute of Higher Learning, Puttaparthi, India.
2Satya Sai Mudigonda, Department of Mathematics and Computer Science, Sri Sathya Sai Institute of Higher Learning, Puttaparthi, India.
3Pallav Kumar Baruah, Department of Mathematics and Computer Science, Sri Sathya Sai Institute of Higher Learning, Puttaparthi, India.

Manuscript received on September 20, 2020. | Revised Manuscript received on January 24, 2021. | Manuscript published on January 30, 2021. | PP: 236-244 | Volume-9 Issue-5, January 2021. | Retrieval Number: 100.1/ijrte.E5277019521 | DOI: 10.35940/ijrte.E5277.019521
Open Access | Ethics and Policies | Cite | Mendeley
© The Authors. Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)

Abstract: A data-driven Fraud detection model for insurance business can be seen as a two-phase method. Phase I is data-preprocessing of a given dataset, in which, handling class imbalance is a major challenge. Phase II is that of classification using Machine Learning models. It is important to comprehend if there is any influence of the technique used in Phase I on the efficiency of the model used for Phase II. A natural query that intrigues one is whether there is a golden combination of a technique in Phase I and a specific model in Phase II for assured best performance of a Fraud Detection Model.In this work, we study a few techniques for handling data imbalance issue namely, SMOTE, MWMOTE, ADASYN and TGAN in combination with various classifier models like Random Forest (RF), Decision Trees (DT), Support Vector Machines (SVM), LightGBM, XGBoost and Gradient Boosting Machines (GBM). The study is conducted on a dataset for motor vehicle insurance fraud detection.We present a comparison of various combinations of data imbalance technique and classifier models. It is observed that the combination of TGAN in Phase I and GBM in Phase II gives the best performance. This combination performs best in terms of important metrics such as false positive rate, precision and specificity. We obtained the lowest false positive rate of 0.0011 and precision of 0.9988 which minimizes the most critical risk for the insurance company of falsely classifying a non-fraud claim as a fraud. Finally, the specificity of 0.9989 indicates that the model was also very good at predicting the non-fraudulent claim. 
Keywords: Fraud Detection, Data Imbalance Techniques, Insurance Fraud, Machine Learning, Synthetic Data Generation, Class Imbalance.