Healthc Inform Res. 2024 Jan;30(1):73-82. doi: 10.4258/hir.2024.30.1.73.

Prediction of Diabetes Using Data Mining and Machine Learning Algorithms: A Cross-Sectional Study

Affiliations
  • 1 Infectious Diseases Research Center, Gonabad University of Medical Sciences, Gonabad, Iran
  • 2 Telemedicine Research Center, National Research Institute of Tuberculosis and Lung Diseases (NRITLD), Shahid Beheshti University of Medical Sciences, Tehran, Iran
  • 3 Preventive Medicine and Public Health Research Center, Psychosocial Health Research Institute, Department of Community and Family Medicine, School of Medicine, Iran University of Medical Sciences, Tehran, Iran
  • 4 Vaccine Research Center, Iran University of Medical Sciences, Tehran, Iran

Abstract


Objectives
This study aimed to develop a model to predict fasting blood glucose status using machine learning and data mining, since the early diagnosis and treatment of diabetes can improve outcomes and quality of life.
Methods
This cross-sectional study analyzed data from 3376 adults over 30 years old who participated in a diabetes screening program at 16 comprehensive health service centers in Tehran, Iran. The dataset was balanced using random sampling and the synthetic minority over-sampling technique (SMOTE). The dataset was split into a training set (80%) and a test set (20%). Shapley values were calculated to select the most important features. Noise analysis was performed by adding Gaussian noise to the numerical features to evaluate the robustness of feature importance. Five machine learning algorithms (CatBoost, random forest, XGBoost, logistic regression, and an artificial neural network) were used to model the dataset. Accuracy, sensitivity, specificity, the F1-score, and the area under the curve were used to evaluate the models.
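The balancing step can be illustrated with a minimal sketch of the interpolation idea behind SMOTE. This is an illustration only, not the authors' implementation: the function name `smote_sample`, its parameters, and the example rows are hypothetical, and the abstract does not state which library or settings were used.

```python
import math
import random


def smote_sample(minority, k=5, n_new=10, seed=0):
    """Create n_new synthetic minority rows by SMOTE-style interpolation.

    minority: list of equal-length numeric feature lists (hypothetical data).
    Each synthetic row lies on the line segment between a randomly chosen
    minority row and one of its k nearest minority neighbours.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority rows to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority rows by Euclidean distance to the base row.
        others = [row for row in minority if row is not base]
        neighbours = sorted(others, key=lambda row: math.dist(base, row))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical minority-class rows: [age, body mass index, waist-to-hip ratio].
minority_rows = [
    [52.0, 29.1, 0.95],
    [61.0, 31.4, 0.98],
    [47.0, 27.8, 0.91],
    [58.0, 33.0, 1.01],
]
new_rows = smote_sample(minority_rows, k=2, n_new=3)
```

Because every synthetic row is a convex combination of two real minority rows, each feature value stays within the range already observed for that class.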
Results
Age, waist-to-hip ratio, body mass index, and systolic blood pressure were the most important factors for predicting fasting blood glucose status. Although the models achieved similar predictive ability, the CatBoost model performed slightly better overall, with an area under the curve (AUC) of 0.737.
Conclusions
A gradient-boosted decision tree model accurately identified the most important risk factors related to diabetes. In order of importance, these were age, waist-to-hip ratio, body mass index, and systolic blood pressure. This model can support planning for diabetes management and prevention.

Keywords

Diabetes Mellitus, Machine Learning, Data Mining, Decision Trees, Risk Factors

Figures

  • Figure 1 Division of subjects into normal and abnormal groups. FBG: fasting blood glucose.

  • Figure 2 General steps of data processing for modeling. SMOTE: synthetic minority over-sampling technique, AUC: area under the receiver operating characteristic curve.

  • Figure 3 The relationships between predictive variables and fasting blood glucose status: (A) sex, (B) age, (C) systolic blood pressure, (D) diastolic blood pressure, (E) smoking, (F) body mass index (BMI), (G) waist-to-hip ratio (WHR), and (H) diabetes family history.

  • Figure 4 Boxplots comparing distributions before and after outlier removal: (A) age, (B) systolic blood pressure, (C) diastolic blood pressure, (D) body mass index, and (E) waist-to-hip ratio.

  • Figure 5 Shapley diagram showing the relative importance of features. WHR: waist-to-hip ratio, BMI: body mass index.

  • Figure 6 Confusion matrix for different models: (A) CatBoost, (B) logistic regression, (C) random forest, (D) XGBoost, (E) artificial neural network, and (F) ensemble classifier.

  • Figure 7 Receiver operating characteristic curves of different models. ANN: artificial neural network, AUC: area under the curve.
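The evaluation measures behind the confusion matrices and receiver operating characteristic curves can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, counts, and scores are hypothetical and are not the study's results, and the AUC is computed with the pairwise-ranking formulation (the probability that a randomly chosen positive case scores higher than a randomly chosen negative case), which may differ from the procedure the authors used.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, specificity, and F1-score from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # true positive rate (recall)
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)     # positive predictive value
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }


def roc_auc(labels, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical counts and predicted scores, chosen only for illustration.
metrics = confusion_metrics(tp=40, fp=10, tn=35, fn=15)
auc = roc_auc([1, 1, 0, 1, 0, 0], [0.9, 0.7, 0.6, 0.4, 0.3, 0.2])
```

With these made-up inputs, accuracy is 0.75 and the AUC is 8/9, because eight of the nine positive/negative pairs are ranked correctly.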

