Overview
When values are missing from a dataset, the analyst must decide how to handle them. If the values are missing completely at random, meaning that whether a value is missing is unrelated to both the value itself and the other properties of the record, then the records with missing values can be removed without introducing bias. However, if the fact that a value is missing is correlated with the value itself or with other properties of the record, removing those records can bias the results.
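The bias described above can be seen in a small sketch. The data and the missingness rule here are made up for illustration: it assumes that every value above 70 goes unreported, so whether a value is missing depends on the value itself.

```python
from statistics import mean

# Hypothetical true values for ten records.
true_values = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

# Assumed missingness rule (illustrative only): values above 70 are never
# reported, so missingness is correlated with the missing value.
observed = [v if v <= 70 else None for v in true_values]


def drop_missing(values):
    """Return only the values that are present (not None)."""
    return [v for v in values if v is not None]


complete_cases = drop_missing(observed)

true_mean = mean(true_values)         # mean over all ten records
dropped_mean = mean(complete_cases)   # mean after removing missing records

print(true_mean, dropped_mean)
```

Because only large values are missing, the mean of the remaining records underestimates the true mean. If the missing records had instead been chosen independently of their values, the two means would be expected to be close.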
Topics
- Remove Records with Missing Values
- Imputing a Value - use a statistic or a model to replace each missing value with an estimated value
  - Mean Value (or median or mode) - replace the missing value with the mean (or median or mode) of the observed values for that property
  - k-Nearest Neighbors - replace the missing value with one derived from the k most similar records, for example the average of their values
  - Regression Based Imputation - use a model such as linear regression to predict the missing value from the other properties of the record
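The three imputation approaches listed above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the records, the function names, the choice of k = 2, and the use of a single predictor property x are all assumptions made for the example.

```python
from statistics import mean

# Hypothetical records (x, y); y is missing (None) on one record.
records = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0),
           (4.0, None), (5.0, 10.0), (6.0, 12.0)]


def observed(records):
    """Records whose y value is present."""
    return [(x, y) for x, y in records if y is not None]


def impute_mean(records):
    """Fill each missing y with the mean of the observed y values."""
    fill = mean(y for _, y in observed(records))
    return [(x, fill if y is None else y) for x, y in records]


def impute_knn(records, k=2):
    """Fill each missing y with the mean y of the k records nearest in x."""
    known = observed(records)
    result = []
    for x, y in records:
        if y is None:
            nearest = sorted(known, key=lambda r: abs(r[0] - x))[:k]
            y = mean(ny for _, ny in nearest)
        result.append((x, y))
    return result


def impute_regression(records):
    """Fill each missing y with a least-squares linear prediction from x."""
    known = observed(records)
    x_bar = mean(x for x, _ in known)
    y_bar = mean(y for _, y in known)
    sxx = sum((x - x_bar) ** 2 for x, _ in known)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in known)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return [(x, intercept + slope * x if y is None else y)
            for x, y in records]


print(impute_mean(records)[3])
print(impute_knn(records)[3])
print(impute_regression(records)[3])
```

In this made-up data y grows with x, so the mean fill ignores that relationship, while the nearest-neighbor and regression fills use the record's x value to produce an estimate closer to its neighbors.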