Evaluating Single and Hybrid Feature Selection for Rainfall Prediction Using XGBoost

Bambang Widoyono, Muhammad Fahmy Nadhif, Ridha Adjie Eryadi

Abstract


Rainfall prediction is challenging due to the complex and nonlinear nature of meteorological data. Previous studies using XGBoost with feature selection have demonstrated superior performance compared to other models, but their evaluations have focused solely on error metrics (RMSE, MSE, MAE). Recent research suggests that predictive models should also be evaluated for generalization, stability, interpretability, and computational efficiency to ensure their reliability. To close this gap, this study uses 8,750 hourly records obtained from Open-Meteo, with 81 engineered features, to evaluate XGBoost under three scenarios: no feature selection, single feature selection (MI, Boruta, SHAP, mRMR, ReliefF), and hybrid feature selection. Our findings demonstrate that feature selection does not always increase accuracy; it does, however, improve interpretability, reduce overfitting, and increase computational efficiency. Among the single methods, SHAP provides the most reliable performance, achieving a lower RMSE (0.72632) and improved stability. Hybrid feature selection produces the most balanced performance (train–test gap = 0.01325) with stable variance (0.03315) while reducing feature complexity to 35 variables. Theoretically, this study shows the value of multidimensional evaluation that goes beyond error metrics; practically, it suggests a feature selection approach for rainfall prediction systems that is effective, reliable, and easy to interpret.
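The hybrid scenario combines the outputs of several single selectors into one feature subset. The abstract does not state the aggregation rule used, so the following is an illustrative sketch only, assuming a simple mean-rank aggregation over the per-method rankings; the feature names and the two example rankings are invented for demonstration.

```python
from collections import defaultdict

def hybrid_select(rankings, k):
    """Aggregate several best-first feature rankings by mean rank
    and return the top-k features (lower mean rank = more important)."""
    rank_sum = defaultdict(float)
    for ranking in rankings:
        for pos, feat in enumerate(ranking):
            rank_sum[feat] += pos
    mean_rank = {f: s / len(rankings) for f, s in rank_sum.items()}
    return sorted(mean_rank, key=mean_rank.get)[:k]

# Hypothetical rankings from two single methods (e.g. MI and SHAP):
mi_rank   = ["humidity", "pressure", "temp", "wind", "cloud"]
shap_rank = ["pressure", "humidity", "cloud", "temp", "wind"]
selected = hybrid_select([mi_rank, shap_rank], k=3)
```

In this sketch, features that rank highly under both selectors survive the cut, which is one common way a hybrid scheme can trade a small amount of accuracy for a much smaller, more interpretable feature set (here, the study's reduction from 81 to 35 variables).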

Keywords


Feature Selection; Hybrid Feature Selection; Machine Learning; Rainfall Prediction; XGBoost
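The abstract evaluates models not only on error but also on a train–test performance gap and a variance-based stability measure. The exact definitions are not given in this excerpt, so this sketch assumes the gap is the absolute difference between train and mean test RMSE and that stability is the variance of RMSE across folds; all numbers below are invented placeholders, not the study's results.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error over paired observations."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Hypothetical per-fold test RMSEs and a training-set RMSE:
fold_rmse = [0.72, 0.74, 0.71, 0.75, 0.73]
train_rmse = 0.70

test_rmse = sum(fold_rmse) / len(fold_rmse)          # mean test error
gap = abs(test_rmse - train_rmse)                    # generalization gap
variance = sum((r - test_rmse) ** 2 for r in fold_rmse) / len(fold_rmse)  # stability
```

Under these assumed definitions, a small gap indicates limited overfitting and a small variance indicates stable performance across data splits, which is how the abstract's gap (0.01325) and variance (0.03315) figures can be read.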





DOI: http://dx.doi.org/10.24014/ijaidm.v9i1.39110


