An Interpretable Machine Learning Strategy for Antimalarial Drug Discovery with LightGBM and SHAP
DOI: https://doi.org/10.62411/faith.2024-16
Keywords: Classification, Gradient boosting, Molecular descriptor, QSAR, Supervised learning
Abstract
Malaria continues to pose a significant global health threat, and the emergence of drug-resistant malaria exacerbates the challenge, underscoring the urgent need for new antimalarial drugs. While several machine learning algorithms have been applied to quantitative structure-activity relationship (QSAR) modeling for antimalarial compounds, there remains a need for more interpretable models that can provide insights into the underlying mechanisms of drug action, facilitating the rational design of new compounds. This study develops a QSAR model using Light Gradient Boosting Machine (LightGBM). The model is integrated with SHapley Additive exPlanations (SHAP) to enhance interpretability. The LightGBM model demonstrated superior performance in predicting antimalarial activity, with an accuracy of 86%, precision of 85%, sensitivity of 81%, specificity of 89%, and an F1-score of 83%. SHAP analysis identified key molecular descriptors such as maxdO and GATS2m as significant contributors to antimalarial activity. The integration of LightGBM with SHAP not only enhances the predictive accuracy of the QSAR model but also provides valuable insights into the importance of features, aiding in the rational design of new antimalarial drugs. This approach bridges the gap between model accuracy and interpretability, offering a robust framework for efficient and effective drug discovery against drug-resistant malaria strains.
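The five metrics reported in the abstract all derive from the four cells of a binary confusion matrix. The following is a minimal sketch of those standard definitions, assuming "active" compounds are encoded as 1 and "inactive" as 0. The names `ConfusionCounts`, `confusion_counts`, and `classification_metrics` are illustrative and not taken from the paper, and the labels are toy values, not the study's data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Cells of a binary confusion matrix."""
    tp: int  # active compounds predicted active
    fp: int  # inactive compounds predicted active
    tn: int  # inactive compounds predicted inactive
    fn: int  # active compounds predicted inactive


def confusion_counts(y_true, y_pred, positive=1):
    """Count the four confusion-matrix cells from paired labels."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return ConfusionCounts(tp, fp, tn, fn)


def _ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(c):
    """Standard definitions of the metrics named in the abstract."""
    precision = _ratio(c.tp, c.tp + c.fp)
    sensitivity = _ratio(c.tp, c.tp + c.fn)  # also called recall
    specificity = _ratio(c.tn, c.tn + c.fp)
    return {
        "accuracy": _ratio(c.tp + c.tn, c.tp + c.fp + c.tn + c.fn),
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        # F1 is the harmonic mean of precision and sensitivity.
        "f1": _ratio(2 * precision * sensitivity, precision + sensitivity),
    }


# Toy labels for illustration only (hypothetical, not from the study).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

counts = confusion_counts(y_true, y_pred)
metrics = classification_metrics(counts)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Reporting sensitivity and specificity alongside accuracy shows separately how well a classifier recovers active compounds and how well it rejects inactive ones, which accuracy alone can hide when the classes are imbalanced.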
Copyright (c) 2024 Journal of Future Artificial Intelligence and Technologies
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.