Trends and Advances in K-Hyperparameter Tuning Techniques in High-Dimensional Space Clustering

Rufus Kinyua Gikera, Jonathan Mwaura, Elizaphan Maina, Shadrack Mambo

Abstract


Clustering is one of the tasks performed during exploratory data analysis, with a long and rich history across a variety of disciplines. Computational medicine is one application area in which clustering has proliferated in the recent past. K-means algorithms are the most popular because they adapt to new examples and scale to large datasets; they are also easy to understand and implement. However, with k-means algorithms, k-hyperparameter tuning is a long-standing challenge. The sparse and redundant nature of high-dimensional datasets makes k-hyperparameter tuning in high-dimensional space clustering an even more challenging task, and proper k-hyperparameter tuning has a significant effect on the clustering results. A number of state-of-the-art k-hyperparameter tuning techniques for high-dimensional spaces have been proposed; however, these techniques perform differently across high-dimensional datasets and data-dimensionality reduction methods. This article uses a five-step methodology to investigate the trends and advances in state-of-the-art k-hyperparameter tuning techniques for high-dimensional space clustering, the data-dimensionality reduction methods used with these techniques, their tuning strategies, the nature of the datasets to which they are applied, and the challenges associated with cluster analysis in high-dimensional spaces. The metrics used to evaluate these techniques are also reviewed. The results of this review, elaborated in the discussion section, make it efficient for data science researchers to undertake an empirical study of these techniques; such a study subsequently forms the basis for improved solutions to the k-hyperparameter tuning problem.
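The k-hyperparameter tuning problem discussed above can be made concrete with a small sketch. The following example is illustrative only and is not drawn from the article: a minimal Lloyd's k-means implemented in NumPy, with k chosen by the classic elbow heuristic (largest relative drop in within-cluster sum of squares) on a toy two-cluster dataset. All function names and the synthetic data are assumptions for illustration.

```python
# Illustrative sketch (not from the article): choosing k with the elbow
# heuristic, using a minimal Lloyd's k-means on a toy 2-cluster dataset.
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Minimal Lloyd's k-means; returns (centroids, labels, inertia)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest centroid.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute centroids; keep the old centroid if a cluster empties.
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    inertia = ((X - centroids[labels]) ** 2).sum()
    return centroids, labels, inertia

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.3, (100, 2)),
               rng.normal(5, 0.3, (100, 2))])  # two well-separated blobs

inertias = [kmeans(X, k)[2] for k in range(1, 6)]
# Elbow heuristic: the largest relative drop in inertia marks the best k.
drops = [(inertias[i] - inertias[i + 1]) / inertias[i] for i in range(4)]
best_k = drops.index(max(drops)) + 2
print(best_k)  # the drop from k=1 to k=2 dominates, so best_k is 2
```

In high-dimensional settings, such a tuning loop is typically preceded by a dimensionality reduction step (e.g., PCA or UMAP, as surveyed in the article), since distance-based criteria like inertia degrade as dimensionality grows.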

Keywords


Clustering; Unsupervised learning; K-hyperparameter tuning; High-dimensional space





DOI: http://dx.doi.org/10.24014/ijaidm.v6i2.22718


