Data Cleansing In Python

Shivani Upare

4 months ago

Most of us already know how important a clean dataset is for a machine learning model to make good predictions. Preparing a clean, usable dataset often takes roughly 60-80% of the time in a machine learning project. If the data is not cleaned properly, the model's accuracy is likely to suffer. One of the most important parts of data cleaning is missing value treatment, which is a major point of focus in making our models more accurate and valid for prediction.
According to an IBM Data Analytics report, you can expect to spend up to 80% of your time cleaning data.

When and Why Is Data Missing?

Before we get into the coding part, it’s important to understand the different sources of missing data. Some typical reasons why data is missing:
  • A user forgot to fill in a field.
  • A user did not want to share personal details.
  • Data was lost while being transferred manually from one source to another.
  • A programming error occurred.
Let’s take an example to make this more concrete.
Consider an online survey for a company's product. People often do not share all the requested information. Some share their experience but not how long they have been using the product; others share how long they have been using the product and their experience, but not their contact information. In one way or another, some part of the data is almost always missing, and this is very common in real-world data.
By now you should have an idea of how important it is to treat the missing values in our data. So, let’s see how to do it.

How to handle missing values (say NA or NaN) using Pandas?

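# import the pandas and NumPy libraries
import pandas as pd
import numpy as np

# 5 rows of random numbers, labelled a, c, e, f, h
data = pd.DataFrame(np.random.randn(5, 3),
                    index=['a', 'c', 'e', 'f', 'h'],
                    columns=['Column1', 'Column2', 'Column3'])
# reindexing with the extra labels b, d, g creates rows filled with NaN
data = data.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(data)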
Output:
    Column1   Column2   Column3
a -0.067397 -1.570255 -0.898418
b       NaN       NaN       NaN
c  1.311982  1.972563  0.743876
d       NaN       NaN       NaN
e  0.516474 -0.436298 -0.336320
f  0.587955  0.928367  1.014634
g       NaN       NaN       NaN
h -0.200502 -0.418049 -1.471458
In the above example we have created a DataFrame that has missing values, which are represented as NaN (Not a Number). Because the values are random, your numbers will differ from those shown.

How to check Missing values using pandas?

Pandas provides functions such as isnull() and notnull() to detect missing values in our dataset, which makes our life much easier. These methods can be applied to both Series and DataFrame objects.
Let’s take an example:
import pandas as pd
import numpy as np
data = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
'h'],columns=['Column1', 'Column2', 'Column3'])
data = data.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(data['Column1'].isnull())
Output:
a    False
b     True
c    False
d     True
e    False
f    False
g     True
h    False
Name: Column1, dtype: bool
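A per-value mask is often only the first step; in practice we usually want a count of missing values per column. The following is a minimal sketch of that idea, assuming the same kind of DataFrame as above. The fixed random seed and the names present_mask and missing_per_column are illustrative choices, not part of the original example.

```python
import numpy as np
import pandas as pd

# Fixed seed so the example is reproducible (an illustrative choice).
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.standard_normal((5, 3)),
                    index=['a', 'c', 'e', 'f', 'h'],
                    columns=['Column1', 'Column2', 'Column3'])
data = data.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

# notnull() is the inverse of isnull(): True where a value is present.
present_mask = data['Column1'].notnull()

# Summing a boolean mask counts its True entries,
# which gives the number of missing values in each column.
missing_per_column = data.isnull().sum()
print(missing_per_column)
```

Here each column should report three missing values, one for each of the rows b, d and g added by reindex().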

How to Clean / Fill Missing Data in pandas?

There are different methods to fill or clean missing values. Which one to use depends entirely on the problem statement and the column type. Here, I will give an example using a simple function, fillna(), to fill the missing values.
The fillna() function can “fill in” NA values with non-null data in a couple of ways.
Let’s see them one by one.

Replacing NaN with a Scalar Value

import pandas as pd
import numpy as np
data = pd.DataFrame(np.random.randn(3, 3), index=['a', 'c', 'e'],
                    columns=['Column1', 'Column2', 'Column3'])
data = data.reindex(['a', 'b', 'c'])
print(data)
print("NaN replaced with '0':")
print(data.fillna(0))
Output
    Column1   Column2   Column3
a -1.145282 -1.204689 -0.011520
b       NaN       NaN       NaN
c  1.054585  0.450895 -1.765849
NaN replaced with '0':
    Column1   Column2   Column3
a -1.145282 -1.204689 -0.011520
b  0.000000  0.000000  0.000000
c  1.054585  0.450895 -1.765849
Here, we filled the NaN values with zero; we could fill them with any other value instead.
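Zero is not always a sensible replacement. One common alternative is to fill each column with its own mean. This is a minimal sketch under that assumption; the small hand-written DataFrame and the name filled are illustrative and not from the original example.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'Column1': [1.0, np.nan, 3.0],
                     'Column2': [10.0, 20.0, np.nan]},
                    index=['a', 'b', 'c'])

# data.mean() computes each column's mean, skipping NaN by default.
# Passing that Series to fillna() fills each column with its own mean.
filled = data.fillna(data.mean())
print(filled)
```

Whether the mean, the median, or a constant is appropriate depends on the column and on the problem, so treat this as one option rather than a default.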

Fill NA Forward and Backward

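We can also fill missing values by propagating neighbouring values forward or backward. Forward fill (the 'pad' method of fillna(), also available as ffill()) copies the last valid value into the gap; backward fill (the 'bfill' method, also available as bfill()) copies the next valid value.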
Example
import pandas as pd
import numpy as np
data = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
'h'],columns=['Column1', 'Column2', 'Column3'])
data = data.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
# forward fill; older code spells this data.fillna(method='pad'),
# which is deprecated since pandas 2.1
print(data.ffill())
Output
    Column1   Column2   Column3
a  0.863373  0.113220  0.167150
b  0.863373  0.113220  0.167150
c  0.175815  0.526849  0.074818
d  0.175815  0.526849  0.074818
e -0.203824 -0.921412  1.200571
f  0.864100  1.263429 -0.200021
g  0.864100  1.263429 -0.200021
h  1.774977 -0.118278  0.415756
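The example above shows only the forward direction. A minimal sketch of the backward direction follows, using a small hand-written column so the result is easy to check; the data and the name backward are illustrative assumptions.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'Column1': [1.0, np.nan, 3.0, np.nan]},
                    index=['a', 'b', 'c', 'd'])

# Backward fill: each NaN takes the next valid value below it.
# Row d has no later value to copy, so it stays NaN.
backward = data.bfill()
print(backward)
```

Note that a gap at the end of the data is left unfilled by a backward fill, just as a gap at the start is left unfilled by a forward fill.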

How to Drop Missing values?

The pandas package also provides a function, dropna(), to drop missing values. It is used along with the axis argument. By default, axis=0, i.e., along rows, which means that if any value within a row is NA then the whole row is dropped.
Example
import pandas as pd
import numpy as np
data = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
'h'],columns=['Column1', 'Column2', 'Column3'])
data = data.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(data.dropna())
Output
    Column1   Column2   Column3
a -0.316294  0.890039  0.349166
c -1.297559  0.113461  0.884424
e -2.175159  0.379806  2.231736
f -2.385318  1.803276 -0.342873
h  1.372849  1.482879 -0.349323
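Dropping every row that contains any NaN can discard a lot of data. The sketch below shows a few other ways to call dropna(), using its how, thresh and axis parameters. The small DataFrame and the result names are illustrative assumptions, not part of the original example.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'Column1': [1.0, np.nan, 3.0],
                     'Column2': [4.0, np.nan, np.nan],
                     'Column3': [7.0, np.nan, 9.0]},
                    index=['a', 'b', 'c'])

# Default: drop rows that contain at least one NaN (only row a survives).
rows_any = data.dropna()

# how='all': drop only rows in which every value is NaN (row b).
rows_all = data.dropna(how='all')

# thresh=2: keep rows that have at least 2 non-NaN values (rows a and c).
rows_thresh = data.dropna(thresh=2)

# axis=1: drop columns instead of rows. Applied after removing the empty row,
# only Column2 still contains a NaN, so it is the one dropped.
cols_clean = rows_all.dropna(axis=1)

print(rows_any, rows_all, rows_thresh, cols_clean, sep='\n\n')
```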

How to Replace Missing (or) Generic Values?

Sometimes we need to replace a generic value with some specific value. We can do this with the replace() method.
Replacing NaN with a scalar value using replace() is equivalent to using fillna().
Example
import pandas as pd
import numpy as np
data = pd.DataFrame({'Column1':[10,20,30,40,50,2000],
'Column2':[1000,0,30,40,50,60]})
print(data.replace({1000:10,2000:60}))
Output
   Column1  Column2
0       10       10
1       20        0
2       30       30
3       40       40
4       50       50
5       60       60
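The equivalence between replace() and fillna() for NaN mentioned above can be checked directly. This is a minimal sketch; the small DataFrame and the names via_replace and via_fillna are illustrative assumptions.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'Column1': [10.0, np.nan, 30.0],
                     'Column2': [np.nan, 0.0, 60.0]})

# Two ways of turning NaN into 0.
via_replace = data.replace(np.nan, 0)
via_fillna = data.fillna(0)

# equals() compares the two results element by element.
print(via_replace.equals(via_fillna))
```

For missing values alone fillna() is the more direct choice; replace() is more useful when ordinary values, such as the 1000 and 2000 in the example above, need to be swapped.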
I hope you enjoyed reading this article and now have a working knowledge of data cleansing in Python.
For more blogs and courses on data science, machine learning, artificial intelligence and emerging technologies, visit us at InsideAIML.
Thanks for reading…
Happy Learning…
