Volume 6, Issue 3 (2025) | J Clinic Care Skill 2025, 6(3): 121-128
Article Type: Descriptive Study
Ethics code: IR.YUMS.REC.1402.152


How to cite this article
Ghaderzadeh M, Salehnasab C. Filter-Based Feature Selection for Type II Diabetes Prediction. J Clinic Care Skill. 2025;6(3):121-128.
URL: http://jccs.yums.ac.ir/article-1-427-en.html

1- Department of Medical Informatics, Boukan School of Medical Sciences, Urmia University of Medical Sciences, Urmia, Iran
2- Social Determinants of Health Research Center, Yasuj University of Medical Sciences, Yasuj, Iran
* Corresponding Author Address: Yasuj University of Medical Sciences, Shahid Motahari Boulevard, Yasuj, Kohgiluyeh and Boyer-Ahmad Province, Iran. Postal Code: 7591994799 (cirruse.salehnasab@gmail.com)
Abstract
Aims: Type 2 diabetes mellitus is a major global health challenge, and early prediction is key to prevention. This study compared three filter-based feature selection methods, namely ANOVA (f-classif), mutual information, and the chi-square test, for identifying predictors of type 2 diabetes and assessed their impact on the performance of logistic regression.
Instrument & Methods: This retrospective study analyzed data from 3,203 adults aged 35-70 years in the Dena-PERSIAN cohort (Yasuj, Kohgiluyeh and Boyer-Ahmad Province, Iran), collected between 2020 and 2022; 402 participants (12.55%) had type 2 diabetes. Preprocessing included imputation, normalization, and class balancing using the synthetic minority oversampling technique (SMOTE). Each feature selection method ranked the predictors, and the top five features from each ranking were used to train a logistic regression model. Model performance was evaluated on a test set using accuracy, precision, recall, and F1-score.
Findings: Fasting blood sugar and age consistently emerged as dominant predictors across methods. ANOVA highlighted metabolic factors (triglycerides, fatty liver, and kidney stones), mutual information emphasized high-density lipoprotein cholesterol and lifestyle behaviors, and the chi-square test prioritized categorical comorbidities. Logistic regression performed best with the ANOVA and mutual information feature sets (accuracy and F1-score both 0.84), slightly outperforming the chi-square feature set (accuracy and F1-score both 0.82).
Conclusion: ANOVA and mutual information produced clinically meaningful and stable feature subsets for type 2 diabetes prediction, centered on fasting glucose, age, and fatty liver.
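
The comparison described in the abstract can be illustrated with a minimal Python sketch using scikit-learn. It is not the authors' code, and several parts are assumptions: the data are synthetic (generated with make_classification, with roughly 12.5% positives), imputation is skipped because the synthetic data have no missing values, min-max scaling stands in for the unspecified normalization step, the 80/20 split is arbitrary, and class_weight="balanced" is used in place of SMOTE so that the example depends on scikit-learn alone. Only the overall shape follows the abstract: rank features with each filter, keep the top five, fit logistic regression, and score on a held-out test set.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2, f_classif, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the cohort data (assumption: NOT the Dena-PERSIAN data).
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5,
    weights=[0.875, 0.125], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0  # split ratio is an assumption
)


def mi_score(X, y):
    """Mutual information with a fixed seed so the ranking is reproducible."""
    return mutual_info_classif(X, y, random_state=0)


# The three filter-based scoring functions compared in the abstract.
scorers = {"anova": f_classif, "mutual_info": mi_score, "chi2": chi2}

results = {}
for name, scorer in scorers.items():
    model = make_pipeline(
        MinMaxScaler(),              # chi2 needs non-negative inputs
        SelectKBest(scorer, k=5),    # keep the five top-ranked features
        # Assumption: class weighting substitutes for SMOTE in this sketch.
        LogisticRegression(class_weight="balanced", max_iter=1000),
    )
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "selected": model.named_steps["selectkbest"].get_support(indices=True),
        "accuracy": accuracy_score(y_test, pred),
        "precision": precision_score(y_test, pred, zero_division=0),
        "recall": recall_score(y_test, pred, zero_division=0),
        "f1": f1_score(y_test, pred, zero_division=0),
    }

for name, res in results.items():
    print(name, list(res["selected"]), round(res["accuracy"], 2), round(res["f1"], 2))
```

Placing the scaler and the selector inside one pipeline means both are fitted on the training split only, so the test set does not influence which five features are chosen. The numbers printed for synthetic data are not comparable to the values reported in the abstract.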
