Evaluating Entropy-Based Feature Selection for Sales Demand Forecasting Using K-Means Clustering and Naive Bayes Classification

Fadhilah Dwi Wulandari, Lindawati Lindawati, Mohammad Fadhli

Abstract


Sales demand forecasting is crucial for inventory optimization in retail, especially for Micro, Small, and Medium Enterprises (MSMEs). This study examines the effect of entropy-based feature selection on the performance of a two-stage machine learning framework comprising K-Means clustering and Naive Bayes classification. The research was conducted on transactional data collected from a footwear MSME in Palembang, Indonesia, covering January to December 2024. Shannon Entropy and Information Gain were applied to identify and retain the most informative features before the clustering and classification tasks. Two experimental scenarios were investigated: (1) using all features without selection and (2) applying entropy-based feature selection with Information Gain thresholds of 0.4 and 0.5 for category-based and quantity-based targets, respectively. The first scenario yielded moderate performance, with a Silhouette Score of 0.5747 and a classification accuracy of 96.97%. In contrast, the second scenario demonstrated superior results, achieving a Silhouette Score of 0.6261 and a classification accuracy of 99.49% when quantity sold was used as the target variable. These findings indicate that entropy-based feature selection reduces data dimensionality, enhances clustering compactness, and improves classification accuracy. This research contributes to the field by presenting a practical framework for sales demand forecasting in retail environments. Future work will focus on integrating additional contextual variables, such as seasonal trends and promotions, and validating the system in real-world retail settings.
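The two-stage pipeline described above can be sketched as follows. This is a minimal illustrative implementation, not the authors' code: the data is synthetic, the Information Gain threshold (0.1) is a placeholder for the paper's 0.4/0.5 values, and scikit-learn's KMeans and GaussianNB stand in for whatever implementations the study used.

```python
# Sketch of the abstract's pipeline: Shannon-entropy/Information-Gain feature
# selection, then K-Means clustering and Naive Bayes classification.
# All data and thresholds below are synthetic placeholders.
import numpy as np
from collections import Counter
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

def shannon_entropy(labels):
    """H(Y) = -sum_y p(y) * log2 p(y) over the label distribution."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(feature, labels):
    """IG(Y; X) = H(Y) - sum_x P(X=x) * H(Y | X=x) for a discrete feature."""
    total = shannon_entropy(labels)
    n = len(labels)
    conditional = 0.0
    for value in np.unique(feature):
        mask = feature == value
        conditional += mask.sum() / n * shannon_entropy(labels[mask])
    return total - conditional

rng = np.random.default_rng(0)
n = 300
# Synthetic discrete transactional features (e.g. encoded size, colour, month).
X = rng.integers(0, 4, size=(n, 5))
y = (X[:, 0] + X[:, 1] > 3).astype(int)  # target depends only on features 0, 1

# Stage 0: entropy-based feature selection with an illustrative IG threshold.
gains = np.array([information_gain(X[:, j], y) for j in range(X.shape[1])])
selected = gains >= 0.1
X_sel = X[:, selected]

# Stage 1: K-Means clustering, evaluated with the Silhouette Score.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_sel)
sil = silhouette_score(X_sel, km.labels_)

# Stage 2: Naive Bayes classification on the selected features.
X_tr, X_te, y_tr, y_te = train_test_split(X_sel, y, random_state=0)
acc = GaussianNB().fit(X_tr, y_tr).score(X_te, y_te)
print(f"selected {selected.sum()}/5 features, silhouette={sil:.3f}, accuracy={acc:.3f}")
```

On this toy data, only the two informative features clear the threshold, mirroring the paper's finding that pruning low-gain features yields more compact clusters and a more accurate classifier.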

Keywords


Entropy-Based Feature Selection; K-Means Clustering; MSME Inventory Optimization; Naive Bayes Classification; Sales Demand Forecasting





DOI: http://dx.doi.org/10.24014/ijaidm.v8i2.37046



Office and Secretariat:

Big Data Research Centre
Puzzle Research Data Technology (Predatech)
Laboratory Building 1st Floor of Faculty of Science and Technology
UIN Sultan Syarif Kasim Riau

Jl. HR. Soebrantas KM. 18.5 No. 155 Pekanbaru Riau – 28293
Website: http://predatech.uin-suska.ac.id/ijaidm
Email: ijaidm@uin-suska.ac.id
e-Journal: http://ejournal.uin-suska.ac.id/index.php/ijaidm
Phone: 085275359942


