Effective Machine Learning Techniques for Handling Missing Data

Authors

  • Kapil Prashar Professor, Department of Computer Science & Engineering, PCTE Institute of Engineering & Technology, Ludhiana, India
  • Ankush Rahuvanshi Student, Department of Computer Applications (MCA), PCTE Group of Institutes, Ludhiana, Punjab, India

Keywords

Imputation, Cross-Validation, Hyperparameter Tuning, Median Imputation, End of Distribution Imputation

Abstract

Missing data is a significant challenge in machine learning because it degrades the performance of predictive models. It can arise from errors in data collection, incomplete responses, or system failures, and handling it incorrectly can lead to biased or inaccurate predictions. This paper explores several imputation methods: mean imputation, median imputation, random imputation, and end-of-distribution imputation. Each method suits particular applications, depending on the nature of the dataset and of the missing information. Mean imputation, which replaces missing values with the average of the available data, is most effective when the data follow a normal distribution. The mean is a reliable measure of central tendency in symmetric distributions, but it may not be suitable for skewed data because it is sensitive to outliers. Median imputation, which replaces missing values with the middle value of the sorted data, is less sensitive to outliers and is therefore better suited to skewed distributions. Random imputation, a more flexible technique, replaces missing values with values selected at random, typically from the observed data, but it may require more computational resources, especially in large datasets. End-of-distribution imputation fills missing values with a value from the extreme of the distribution, either the lowest or the highest. This paper also emphasises the significance of hyperparameter tuning in machine learning models, specifically with GridSearchCV. This tool systematically evaluates combinations of model parameters using cross-validation to find the best-performing set, which helps to limit overfitting and improves generalisation to unseen data. It is particularly useful for complex models that require fine-tuning. The paper concludes that combining robust imputation techniques with hyperparameter optimisation yields machine learning models with greater predictive power and reliability.
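
The four imputation strategies named in the abstract can be illustrated with a minimal sketch that uses only the Python standard library. It is not the authors' implementation: the function names, the use of None to mark a missing value, the fixed random seed, and the sample data are all assumptions made for illustration. The end-of-distribution variant follows the abstract's description (lowest or highest observed value); other conventions exist.

```python
import random
import statistics


def observed(values):
    """Return the non-missing entries (missing values are marked as None)."""
    return [v for v in values if v is not None]


def fill(values, replacement):
    """Replace every missing entry with a single replacement value."""
    return [replacement if v is None else v for v in values]


def mean_impute(values):
    # Mean of the observed data; sensitive to outliers.
    return fill(values, statistics.mean(observed(values)))


def median_impute(values):
    # Middle value of the sorted observed data; less sensitive to outliers.
    return fill(values, statistics.median(observed(values)))


def random_impute(values, seed=0):
    # Each missing entry gets its own random draw from the observed values.
    # The seed is an illustrative assumption, used for reproducibility.
    rng = random.Random(seed)
    pool = observed(values)
    return [rng.choice(pool) if v is None else v for v in values]


def end_of_distribution_impute(values, tail="upper"):
    # Fill with the highest (or lowest) observed value, as the abstract describes.
    pool = observed(values)
    return fill(values, max(pool) if tail == "upper" else min(pool))


# Hypothetical skewed sample with two missing entries and one outlier (400).
incomes = [30, 32, None, 35, 31, None, 400]

mean_filled = mean_impute(incomes)
median_filled = median_impute(incomes)
random_filled = random_impute(incomes)
upper_filled = end_of_distribution_impute(incomes, tail="upper")
lower_filled = end_of_distribution_impute(incomes, tail="lower")
```

On this sample the mean of the observed values is 105.6 while the median is 32, which shows why the single outlier makes mean imputation a poor fit for skewed data.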

Published

2026-01-22