Data sets are not perfect. Sometimes they end up with invalid, corrupt, or missing values. For the project I was working on, I could not have any values that are null or empty. This How-To will walk you through writing a simple Python script to see if your data set has null or empty values, and if so, it will propose two options for how to modify your data.
This simple data set shows you a flight and tells you its airline, flight number, and the reason it was cancelled. However, if a flight wasn't cancelled, it will have no cancelled reason, and therefore has a null/empty value.
Example of CSV missing data in excel - there is no "Cancelled Reason" for AwesomeAir flight 456:
Example of same CSV missing data in traditional comma-separated format - there is no "Cancelled Reason" for AwesomeAir flight 456:
- Python
- Python Data Analysis Library (Pandas)
- A CSV dataset
- Optional: iPython interactive shell
- This can significantly increase the size of your data set, because you are adding values to it. With large data sets, the pandas commands can take time.
- These may not be the best solutions for your data. For more information on other ways to handle missing data with pandas, please refer to Handling missing data
I used this kaggle data set to perform these operations, and it includes some more instructions on using Pandas and other Python libraries to explore your data
The following steps can be written in a Python script and run at once, but I find it more interesting to explore in an interactive Python shell like iPython
If you want to see what the scripts look like all together, please check out Solution 1 and Solution 2. Otherwise, keep reading and follow along step by step.
- Import pandas
import pandas as pd
- Import csv into a Pandas DataFrame object
flights = pd.read_csv('flights.csv')
- Check the shape of your data in (rows, columns) format
flights.shape
- (Optional) Check for all null values in your dataset. This will return a boolean stating if each cell is null. This can take a long time and may not be particularly useful in a very large dataset.
flights.isnull()
- Explore how many null values are in each column of your dataset
flights.isnull().sum()
- (Optional) Check how many null values are in a specific column, substituting the name of your column in string form where it says 'col'
flights[col].isnull().sum()
At this point, you will either replace your values with a space or remove them entirely
- Fill all null or empty cells in your original DataFrame with an empty space and set that to a new DataFrame variable, here, called 'modifiedFlights'*.
modifiedFlights=flights.fillna(“ “)
- Verify that you no longer have any null values by running
modifiedFlights.isnull().sum()
- Save your modified dataset to a new CSV, replacing 'modifiedFlights.csv' with whatever you would like to name your new file.
modifiedFlights.to_csv('modifiedFlights.csv',index=False)
*If you wish, you can replace your original DataFrame, using flights=flights.fillna(" ")
- If there are only a few null values and you know that deleting values will not cause adverse effects on your result, remove them from your DataFrame and store that in a new DataFrame*
modifiedFlights = flights.dropna()
- Verify that you no longer have any null values by running
modifiedFlights.isnull().sum()
- Save your modified dataset to a new CSV, replacing 'modifiedFlights.csv' with whatever you would like to name your new file.
modifiedFlights.to_csv('modifiedFlights.csv',index=False)
*If you wish, you can replace your original DataFrame, using flights=flights.dropna()