Fundamental approach in dealing with missing data

Share At:

As a Data Scientist we always have to deal with messy data, which is not cleaned up. Incorrect or inconsistent data leads to false conclusions. So, before we inject our data to some machine learning algorithm to get some prediction, one has to clean the data first.

In this article we will be dealing with missing data and will look at various scenarios.

Missing Data refers to “non availability” of information for one or more items or whole unit. Many datasets simply arrive with missing data either because it was not collected, or it never existed.

In Pandas, mostly the missing value are represented as ‘NaN’ or ‘None’ format. However, not all missing value come in nice and clean way — sometimes it comes as “?” or “ — ” Character format too.

If we know what kind of characters are used as missing values in dataset, we can handle them while creating the dataframe using “na_values” parameter( We’ll discuss this later).

Now the question is, how to identify missing values in a dataset? Well, following method can be used in identifying missing values:

df.isna() → This will show the missing value in “True/False” format for each cell in the dataframe.

df.isna().sum() → This will show the count of missing values for every column.

df.isna().any() à This will show the missing value in “True/ False” format for every column. If a column has any missing value, it will be True.

This can be explained below. Consider the following dataframe:

Now with the commands discussed above, we can now get below outcomes:

Now with the commands discussed above, we can now get below outcomes:

Handling data with values like “?” and “ — ” :

Consider we have dataset like this:

The characters “?” and “ — ” cannot be detected as missing values by pandas. We can handle them by using “na_values” parameter.

Another option is to use pandas replace function. Consider this example:

How to Clean Missing Values?

Depending on characteristics of the dataset, we can choose to:-

A. Drop missing values

B. Fill missing values

Let’s see them in detail:

A. Drop Missing values:

We can drop a row or column with missing values using dropna() function.

“how” parameter is used to set the condition to drop.

· how=”any” → Drop if there is any missing value.

· how=”all” → Drop if all the values are missing.

Moreover, using “thresh” parameter, we can set a threshold for missing values for row/column to be dropped.

Ex: df.dropna( axis=0, how=”all”, inplace=”True”)

Where,

axis=0 means row

axis=1 means column

The default value for “how” is “any”, means, we need not specify it explicitly.

df.dropna( axis=0, how=”any”, inplace=”True”) and df.dropna( axis=0, inplace=”True”) both are same.

Let’s consider below examples:

The below Command didn’t drop anything as none of the row has complete missing values:

But, below command drops everything.

Now look at how thresh is being used:

Dropping columns with missing values:

Let’s see how dropping in column works.

Now add one more column (column_D) with all missing values. Our dataframe should look like this:

(a)Drop column if they only have missing values:

(b)Dropping Column using thresh parameter.

Now if the column has minimum 2 “non-NaN “value, keep it, otherwise delete the column.

B. Filling Missing values:

Missing data can be filled in many ways and depends on business understanding. The below methods can be used while filling missing data.

  1. Filling Missing values with Mean
  2. Filling missing value with Median

3. Filling missing values with forward fill.

4. Filling missing values with backward fill.

5. Filling categorical value with mode.

6. Interpolate

Let’s look at the above filling methods with examples:

  1. Filling Missing values with Mean:

2. Filling missing value with Median:

3. Filling missing values with forward fill.

4. Filling missing values with backward fill.

5. Filling categorical value with mode.

let’s add new column — column_D with categorical values:

Now we will be filling Column_D it with the mode value of the column. Calculate the mode of column_D:

Now fill the missing values of column_D with mode value:

6. Interpolate:

Pandas dataframe.interpolate() function is basically used to fill NA values in the dataframe or series. This is a very powerful function to fill the missing values.

It uses various interpolation technique to fill the missing values rather than hard-coding the value. Interplolation uses existing values to fill the missing values and treats the missing values as if they should be equally spaced apart.

Let’s consider this:

Hope you have enjoyed the learning. Happy data cleaning !!!


Share At:
0 0 votes
Article Rating
Subscribe
Notify of
guest
0 Comments
Inline Feedbacks
View all comments
Back To Top

Contact Us