In recent years, data science has emerged as one of the most significant factors in both research and business. Real-world datasets typically contain missing values, which present a challenge for analysis. A variety of methods exist for dealing with missing values; the most commonly used imputation methods are mean imputation, median imputation, and KNN imputation. The main drawback of the mean and mode methods is that every missing value in a variable is imputed with the same value, which becomes a problem when a significant number of values are missing. This changes the shape of the distribution and reduces the variance relative to its value before imputation. The more values that are missing, the more the variance shrinks. To address this shortcoming of existing imputation methods, we have developed a new imputation method, Multiple Means based on Multiple Clustering (MMMC). When a dataset variable has missing values, MMMC imputation replaces them with several separate means rather than a single mean. These means are obtained by applying multiple clusterings to the other variables in the dataset. The results show that MMMC outperforms the other imputation strategies in a number of respects.
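The variance shrinkage described in the abstract, and the general idea of replacing one global mean with several cluster-wise means, can be illustrated with a small sketch. This is not the authors' MMMC algorithm, whose details are not given here: it assumes a single helper variable, a tiny hand-written 1-D k-means, and synthetic data, and the names `mean_impute`, `kmeans_1d`, and `cluster_mean_impute` are hypothetical.

```python
import random
import statistics


def mean_impute(values):
    """Replace every None with the single mean of the observed values."""
    observed = [v for v in values if v is not None]
    m = statistics.fmean(observed)
    return [m if v is None else v for v in values]


def kmeans_1d(xs, k, iters=20):
    """Tiny 1-D k-means (Lloyd's algorithm); returns one cluster label per point."""
    s = sorted(xs)
    # Deterministic initialisation: centres at evenly spaced quantiles.
    centers = [s[(2 * i + 1) * len(s) // (2 * k)] for i in range(k)]
    labels = [0] * len(xs)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: abs(x - centers[c])) for x in xs]
        for c in range(k):
            members = [x for x, lab in zip(xs, labels) if lab == c]
            if members:
                centers[c] = statistics.fmean(members)
    return labels


def cluster_mean_impute(target, helper, k):
    """Fill None in `target` with the mean of the observed target values
    that fall in the same cluster of `helper` (assumed fully observed)."""
    labels = kmeans_1d(helper, k)
    overall = statistics.fmean([t for t in target if t is not None])
    filled = list(target)
    for c in range(k):
        obs = [t for t, lab in zip(target, labels) if lab == c and t is not None]
        # Fall back to the global mean if a cluster has no observed target.
        fill = statistics.fmean(obs) if obs else overall
        for i, (t, lab) in enumerate(zip(target, labels)):
            if lab == c and t is None:
                filled[i] = fill
    return filled


# Synthetic data (an assumption for illustration): two well-separated groups.
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(100)] + [rng.gauss(10, 1) for _ in range(100)]
y_full = [2 * xi + rng.gauss(0, 1) for xi in x]
# Remove roughly 40% of y completely at random.
y_missing = [None if rng.random() < 0.4 else yi for yi in y_full]

y_mean = mean_impute(y_missing)
y_cluster = cluster_mean_impute(y_missing, x, k=2)

observed = [v for v in y_missing if v is not None]
n_obs = len(observed)
var_obs = statistics.variance(observed)
var_full = statistics.variance(y_full)
var_mean = statistics.variance(y_mean)
var_cluster = statistics.variance(y_cluster)

print("variance, complete data:       ", var_full)
print("variance, single-mean imputed: ", var_mean)
print("variance, cluster-mean imputed:", var_cluster)
```

With a single mean, the sum of squared deviations is unchanged while the sample size grows, so the sample variance is scaled by exactly (n_obs - 1) / (N - 1), where N is the total number of rows. Cluster-wise means keep the between-cluster spread, which is why the imputed variance stays closer to that of the complete data in this sketch.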
Published in: International Journal on Data Science and Technology (Volume 8, Issue 3)
DOI: 10.11648/j.ijdst.20220803.11
Page(s): 48-54
Creative Commons: This is an Open Access article, distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution and reproduction in any medium or format, provided the original work is properly cited.
Copyright: Copyright © The Author(s), 2022. Published by Science Publishing Group
Keywords: Data Preprocessing, Missing Data, Data Imputation, Clustering
APA Style
Raed Rasheed, Wesam Ashour. (2022). Multiple Means Based on Multiple Clustering (MMMC) Imputation. International Journal on Data Science and Technology, 8(3), 48-54. https://doi.org/10.11648/j.ijdst.20220803.11
ACS Style
Raed Rasheed; Wesam Ashour. Multiple Means Based on Multiple Clustering (MMMC) Imputation. Int. J. Data Sci. Technol. 2022, 8(3), 48-54. doi: 10.11648/j.ijdst.20220803.11
@article{10.11648/j.ijdst.20220803.11,
  author = {Raed Rasheed and Wesam Ashour},
  title = {Multiple Means Based on Multiple Clustering (MMMC) Imputation},
  journal = {International Journal on Data Science and Technology},
  volume = {8},
  number = {3},
  pages = {48-54},
  doi = {10.11648/j.ijdst.20220803.11},
  url = {https://doi.org/10.11648/j.ijdst.20220803.11},
  eprint = {https://article.sciencepublishinggroup.com/pdf/10.11648.j.ijdst.20220803.11},
  year = {2022}
}
TY  - JOUR
T1  - Multiple Means Based on Multiple Clustering (MMMC) Imputation
AU  - Raed Rasheed
AU  - Wesam Ashour
Y1  - 2022/10/11
PY  - 2022
N1  - https://doi.org/10.11648/j.ijdst.20220803.11
DO  - 10.11648/j.ijdst.20220803.11
T2  - International Journal on Data Science and Technology
JF  - International Journal on Data Science and Technology
JO  - International Journal on Data Science and Technology
SP  - 48
EP  - 54
PB  - Science Publishing Group
SN  - 2472-2235
UR  - https://doi.org/10.11648/j.ijdst.20220803.11
VL  - 8
IS  - 3
ER  -