ILA4: Overcoming Missing Values in Machine Learning Datasets –
An Inductive Learning Approach
Abstract
This article introduces ILA4, a new algorithm designed to handle datasets with missing values. ILA4 builds on a series of earlier ILA algorithms that also handle missing data and adds further enhancements. ILA4 was applied to datasets of varying completeness and compared with other known approaches for handling missing values. In the majority of cases, ILA4 performed on a par with established techniques for treating missing values, including those based on the Most Common Value (MCV), the Most Common Value Restricted to a Concept (MCVRC), and the Delete strategy. ILA4 was also compared with three well-known algorithms: Logistic Regression, Naïve Bayes, and Random Forest. Its accuracy was comparable to or better than the best results obtained from these three algorithms.
Keywords: Missing Data, Inductive Learning, Noise Data, Incompleteness, Delete Strategy, Most Common Value.
1. Introduction
Treating missing values in datasets used for machine learning is a crucial task, especially when it is essential to use all of the available data. Although large volumes of data are available, the ratio of missing values in them is often too large for the learning model. Researchers therefore have to decide either to ignore records with missing values or to treat the problem by substituting the missing values with suitable ones. The first choice is not the right approach, because some of the missing values may well be significant for the induction process. For example, gender is an essential feature in patient data, and missing gender values would significantly affect the likelihood of an accurate diagnosis in breast cancer scenarios. Another example is a bank account holder's annual income when deciding whether to grant a loan. Additionally, if the ratio of missing values in a dataset is high, deletion will significantly diminish the depth of the learning data and hence hinder the model's accuracy. In many machine learning scenarios, deletion is probably the least favourable approach [1].
There is a strong argument for solving the missing values problem with more effective, less sweeping solutions than simply deleting instances and bypassing them. Predictive models are traditionally designed for complete datasets, and such solutions provide a reliable operational base that allows these models to be applied to incomplete datasets as well. This is in tune with the argument that allocating significance to missing values maximizes the predictive model's effectiveness [2]. The authors accept that it is impossible to solve all missing-value cases comprehensively. However, as shown in [3], when missing data constitute less than 1% of the total data, the scenario is considered trivial, and a ratio of up to 5% is deemed manageable. Rates above the 5% threshold and approaching 15% necessitate multifaceted treatment methods, and ratios over 15% tend to impact the machine learning model's accuracy quite adversely.
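The thresholds reported in [3] can be illustrated with a short Python sketch. This is only an illustration under stated assumptions: the names `missing_ratio` and `severity` are hypothetical, missing cells are assumed to be encoded as `None`, the ratio is computed over all cells of the table, and the handling of values lying exactly on a boundary (1%, 5%, 15%) is our own choice, since the text does not specify it.

```python
from typing import Optional, Sequence


def missing_ratio(rows: Sequence[Sequence[Optional[object]]]) -> float:
    """Fraction of cells that are missing (None) across the whole table."""
    total = sum(len(row) for row in rows)
    if total == 0:
        return 0.0
    missing = sum(1 for row in rows for cell in row if cell is None)
    return missing / total


def severity(ratio: float) -> str:
    """Map a missing-value ratio to the bands described in the text.

    Boundary handling (< vs <=) is an assumption, not taken from [3].
    """
    if ratio < 0.01:
        return "trivial"
    if ratio <= 0.05:
        return "manageable"
    if ratio <= 0.15:
        return "needs multifaceted treatment"
    return "likely to harm accuracy"


# Hypothetical 2x2 table with one missing cell: ratio 0.25, above the 15% band.
table = [["a", "b"], ["c", None]]
print(severity(missing_ratio(table)))
```

Such a check would typically be run before choosing a treatment strategy, so that the effort spent on imputation matches the degree of incompleteness.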
The impact of missing values in datasets is significant in several ways, including but not limited to:
(i) Decreased efficiency, resulting in fewer extracted patterns and classes and weaker statistical content.
(ii) Complications in data preparation and analysis, as the majority of learning models are designed for complete datasets.
(iii) Discrepancies between incomplete and complete datasets, which quite often result in biased learning, including overfitting and underfitting.
The causes of data missingness are also varied, including incorrect measurements, faulty sensors or algorithms, human error, censored or anonymous data, and many others [1], [4], [5].
The known approaches for dealing with missing data values can be categorized as follows [4]:
(i) Delete strategy, which ignores the data instances with missing feature/attribute values.
(ii) Uniform Treatment approach, which applies the same solution to all scenarios (Any Value, Ignored Value, Common Value, and Special Value strategies).
(iii) Case-by-case approaches. In contrast to (ii) above, these are case/scenario-specific and apply the Pessimistic Value, Predicted Value, and Distributed Value strategies.
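To make the baseline strategies concrete, the following Python sketch shows the Delete strategy alongside the two common-value treatments named in the abstract (MCV and MCVRC). It is a minimal illustration and not the ILA4 algorithm. It assumes the conventional definitions: MCV fills a missing value with the attribute's most frequent value over the whole dataset, and MCVRC uses the most frequent value among instances of the same class. The function names, the `None` encoding for missing values, and the toy data are all hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Optional, Tuple

# One instance: (attribute values, class label). None marks a missing value.
Row = Tuple[List[Optional[str]], str]


def delete_strategy(data: List[Row]) -> List[Row]:
    """Drop every instance that has at least one missing attribute value."""
    return [(attrs, label) for attrs, label in data if None not in attrs]


def _most_common(values: Iterable[Optional[str]]) -> Optional[str]:
    """Most frequent non-missing value, or None if every value is missing."""
    known = [v for v in values if v is not None]
    return Counter(known).most_common(1)[0][0] if known else None


def mcv(data: List[Row]) -> List[Row]:
    """Fill each missing value with the attribute's most common value overall."""
    n = len(data[0][0]) if data else 0
    fill = [_most_common(attrs[j] for attrs, _ in data) for j in range(n)]
    return [([fill[j] if v is None else v for j, v in enumerate(attrs)], label)
            for attrs, label in data]


def mcvrc(data: List[Row]) -> List[Row]:
    """Fill each missing value with the most common value within the same class."""
    n = len(data[0][0]) if data else 0
    fill = {
        c: [_most_common(a[j] for a, lab in data if lab == c) for j in range(n)]
        for c in {label for _, label in data}
    }
    return [([fill[label][j] if v is None else v for j, v in enumerate(attrs)], label)
            for attrs, label in data]


# Toy data (hypothetical): attributes are (outlook, humidity).
data: List[Row] = [
    (["sunny", "high"], "no"),
    (["sunny", "high"], "no"),
    (["sunny", "high"], "no"),
    (["rainy", "low"], "yes"),
    (["rainy", "high"], "yes"),
    ([None, "low"], "yes"),
]

print(len(delete_strategy(data)))  # 5: the incomplete instance is dropped
print(mcv(data)[-1])               # (['sunny', 'low'], 'yes'): overall mode
print(mcvrc(data)[-1])             # (['rainy', 'low'], 'yes'): mode within class "yes"
```

The toy data is chosen so that the two imputations disagree: "sunny" is the most common outlook overall, but "rainy" is the most common outlook among "yes" instances, which is why restricting the count to the concept can give a different and often more class-consistent fill.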