Pandas Data Cleaning: Handle Duplicates, Missing Values, and Time Logic with Real Examples

Cleaning up messy data is often the most time-consuming part of data analysis, but it is also the most important.
We will walk through some examples that show how to clean and prepare your dataset using Pandas: removing duplicates, filling in missing values, checking time logic, and adding useful new fields. You will learn practical tools to turn raw data into something ready for analysis.
Good data leads to good decisions. With a clean dataset, your numbers will be accurate, your charts will make sense, and your analysis will actually be useful.
My Data Cleaning Workflow
- Deduplication: Remove duplicate records to prevent skewed analysis
- Missing Value Handling: Ensure data completeness and integrity
- Data Type Conversion: Prepare data for accurate computations
- Column Standardization: Improve readability and consistency
- Time Logic Validation: Eliminate chronologically impossible records
- Data Enhancement: Add analytical value through feature engineering
These steps will help you turn messy, raw data into something clean, reliable, and analysis-ready.
Not every dataset needs all six steps. It all depends on your data source, your business logic, and the kinds of insights you are after.
Let’s walk through each step with simple code examples and beginner-friendly tips.
Deduplication: duplicated() & drop_duplicates()
In parking data, it is common to see the same vehicle recorded multiple times. This could be due to network delays, system glitches, or errors in license plate recognition.
To clean this up, we use two basic but powerful Pandas tools: duplicated() to find duplicates, and drop_duplicates() to remove them.
Duplicate records can skew your analysis by inflating counts or distorting metrics like revenue or parking duration. However, before removing duplicates, consider your data's time precision:
- High precision (seconds): The same vehicle at the exact same second is likely a system error.
- Low precision (day/hour): The same vehicle in the same time period might be legitimate multiple entries.
Always verify that duplicate removal won't eliminate valid business scenarios.
Understanding the keep parameter
- keep='first': (default) Keep the first occurrence and mark later ones as duplicates.
- keep='last': Keep the last occurrence and mark earlier ones as duplicates.
- keep=False: Mark all duplicate entries as True, which is useful for viewing or exporting every copy, including the first. The sketch below shows how the three settings differ.
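For instance, here is a quick sketch of the three settings on a tiny made-up Series:
import pandas as pd
s = pd.Series(['A', 'A', 'B'])
print(s.duplicated(keep='first').tolist()) # [False, True, False]
print(s.duplicated(keep='last').tolist()) # [True, False, False]
print(s.duplicated(keep=False).tolist()) # [True, True, False]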
Using inplace=True
Setting inplace=True applies the change directly to the original DataFrame instead of returning a modified copy.
This can reduce memory use when working with large datasets, but the change is permanent unless you make a backup copy of the data first.
import pandas as pd
# Sample parking data
data = {
    'license_plate': ['ABC123', 'ABC123', 'XYZ789', 'DEF456'],
    'entry_time': ['2025-08-17 08:00:00', '2025-08-17 08:00:00', '2025-08-17 09:00:00', '2025-08-17 10:00:00'],
    'exit_time': ['2025-08-17 09:00:00', '2025-08-17 09:00:00', '2025-08-17 10:00:00', '2025-08-17 11:00:00'],
    'parking_lot': ['Lot A', 'Lot A', 'Lot B', 'Lot A']
}
df = pd.DataFrame(data)
# Check for duplicates
print(df.duplicated())
# Output:
# 0 False
# 1 True
# 2 False
# 3 False
# dtype: bool
# Remove duplicates, keeping the first occurrence
df.drop_duplicates(keep='first', inplace=True)
print(df)
# Output:
# license_plate entry_time exit_time parking_lot
# 0 ABC123 2025-08-17 08:00:00 2025-08-17 09:00:00 Lot A
# 2 XYZ789 2025-08-17 09:00:00 2025-08-17 10:00:00 Lot B
# 3 DEF456 2025-08-17 10:00:00 2025-08-17 11:00:00 Lot A
Tip: Always check which columns should define a duplicate. For example, use df.duplicated(subset=['license_plate', 'entry_time']) to consider only specific columns.
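For instance, a minimal sketch that pairs subset with keep=False to pull out every copy of a suspected duplicate; run on the original four-row DataFrame (before the drop_duplicates() call above), it surfaces both copies:
# Show all rows sharing a plate and entry time, including the first occurrence
dupe_mask = df.duplicated(subset=['license_plate', 'entry_time'], keep=False)
print(df[dupe_mask])
# Output (on the original four-row data):
# license_plate entry_time exit_time parking_lot
# 0 ABC123 2025-08-17 08:00:00 2025-08-17 09:00:00 Lot A
# 1 ABC123 2025-08-17 08:00:00 2025-08-17 09:00:00 Lot A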
Missing Value Handling: isna() & fillna()
In real-world datasets, missing values are almost always present—especially in parking lot systems. They can be caused by system errors, unstable network connections, failed license plate recognition, or incomplete transactions.
To handle them, we use isna() to detect missing values and fillna() to fill in missing values with defaults.
Why this matters
Missing data can cause errors in analysis or break your calculations altogether—especially when dealing with things like revenue, parking duration, or vehicle counts.
That’s why checking for missing values is usually my second step after removing duplicates.
Using inplace=True
One caveat here: calling fillna(..., inplace=True) on a single column (e.g., df['exit_time'].fillna(..., inplace=True)) is chained assignment, which modern pandas deprecates and which will stop working under Copy-on-Write. It is safer to assign the result back to the column instead, as in the example below.
Example
Let’s say some exit_time and parking_lot values are missing.
import pandas as pd
# Sample data with missing values
data = {
    'license_plate': ['ABC123', 'XYZ789', 'DEF456'],
    'entry_time': ['2025-08-17 08:00:00', '2025-08-17 09:00:00', '2025-08-17 10:00:00'],
    'exit_time': ['2025-08-17 09:00:00', None, '2025-08-17 11:00:00'],
    'parking_lot': ['Lot A', 'Lot B', None]
}
df = pd.DataFrame(data)
# Check for missing values
print(df.isna())
# Output:
# license_plate entry_time exit_time parking_lot
# 0 False False False False
# 1 False False True False
# 2 False False False True
# Fill missing values
df['exit_time'] = df['exit_time'].fillna('2025-08-17 23:59:59') # Default to end of day
df['parking_lot'] = df['parking_lot'].fillna('Unknown') # Default to 'Unknown'
print(df)
# Output:
# license_plate entry_time exit_time parking_lot
# 0 ABC123 2025-08-17 08:00:00 2025-08-17 09:00:00 Lot A
# 1 XYZ789 2025-08-17 09:00:00 2025-08-17 23:59:59 Lot B
# 2 DEF456 2025-08-17 10:00:00 2025-08-17 11:00:00 Unknown
Tip: Choose fill values that make sense for your analysis. For example, use the median for numerical data or a placeholder like “Unknown” for categorical data.
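For numerical columns, a median fill might look like this minimal sketch, assuming a hypothetical fee column:
import pandas as pd
# Hypothetical parking fees with one missing value
fees = pd.Series([5.0, None, 7.5, 10.0])
fees = fees.fillna(fees.median()) # median of the non-missing values is 7.5
print(fees.tolist())
# Output: [5.0, 7.5, 7.5, 10.0]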
Data Type Conversion
Correct data types are crucial for calculations and analysis. In parking data, entry_time and exit_time are often stored as strings but need to be converted to datetime for time-based operations.
Similarly, license_plate should be a string, and any numerical fields (like fees) should be numbers.
Why this matters
Incorrect data types can lead to errors, like trying to subtract strings or treating numbers as text. Converting to the right types ensures smooth calculations, such as finding parking duration.
Example
import pandas as pd
# Check data types
print(df.dtypes)
# Output:
# license_plate object
# entry_time object
# exit_time object
# parking_lot object
# dtype: object
# Convert to appropriate types
df['entry_time'] = pd.to_datetime(df['entry_time'])
df['exit_time'] = pd.to_datetime(df['exit_time'])
df['license_plate'] = df['license_plate'].astype(str)
print(df.dtypes)
# Output:
# license_plate object
# entry_time datetime64[ns]
# exit_time datetime64[ns]
# parking_lot object
# dtype: object
Tip: Use pd.to_datetime() for dates and times, and check for errors with errors='coerce' to handle invalid formats gracefully.
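For example, here is a quick sketch of errors='coerce' turning an unparseable string into NaT instead of raising an error (made-up values):
import pandas as pd
times = pd.Series(['2025-08-17 08:00:00', 'not a date'])
print(pd.to_datetime(times, errors='coerce'))
# Output:
# 0 2025-08-17 08:00:00
# 1 NaT
# dtype: datetime64[ns]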
Column Standardization
Standardizing column names and values improves readability and consistency. For example, column names like entry_time and ExitTime should follow a consistent format (e.g., all lowercase with underscores). Similarly, values like Lot A and lot a should be standardized.
Why this matters
Inconsistent naming or values can cause confusion or errors in analysis, especially when merging datasets or sharing with others.
import pandas as pd
# Standardize column names
df.columns = [col.lower().replace(' ', '_') for col in df.columns]
print(df.columns)
# Output: Index(['license_plate', 'entry_time', 'exit_time', 'parking_lot'], dtype='object')
# Standardize parking_lot values (capitalize each word)
df['parking_lot'] = df['parking_lot'].str.title()
print(df['parking_lot'])
# Output:
# 0 Lot A
# 1 Lot B
# 2 Unknown
# Name: parking_lot, dtype: object
Tip: Use str.lower(), str.title(), or str.replace() to clean text data consistently.
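As a quick illustration, a minimal sketch that chains those methods to normalize made-up, inconsistent labels:
import pandas as pd
lots = pd.Series(['Lot A', 'lot a', 'LOT  B'])
# Collapse repeated spaces, trim, then fix casing
lots = lots.str.replace(r'\s+', ' ', regex=True).str.strip().str.title()
print(lots.tolist())
# Output: ['Lot A', 'Lot A', 'Lot B']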
Time Logic Validation
In parking data, exit_time should always be after entry_time. Records where this isn’t true are invalid and should be flagged or removed.
Why this matters
Chronologically impossible records can skew metrics like parking duration or revenue calculations.
Example
import pandas as pd
# Flag invalid records
df['is_valid'] = df['exit_time'] > df['entry_time']
print(df[['license_plate', 'entry_time', 'exit_time', 'is_valid']])
# Output:
# license_plate entry_time exit_time is_valid
# 0 ABC123 2025-08-17 08:00:00 2025-08-17 09:00:00 True
# 1 XYZ789 2025-08-17 09:00:00 2025-08-17 23:59:59 True
# 2 DEF456 2025-08-17 10:00:00 2025-08-17 11:00:00 True
# Remove invalid records
df = df[df['is_valid']].drop(columns=['is_valid'])
Tip: Always validate time logic before calculating durations or other time-based metrics.
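Since every record above happens to be valid, here is a minimal sketch with a deliberately impossible row to show the check catching something:
import pandas as pd
bad = pd.DataFrame({
    'entry_time': pd.to_datetime(['2025-08-17 10:00:00']),
    'exit_time': pd.to_datetime(['2025-08-17 09:00:00']) # exits before it enters
})
print((bad['exit_time'] > bad['entry_time']).tolist())
# Output: [False] -> this row would be flagged and removed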
Data Enhancement
Feature engineering adds new columns to make analysis easier. For parking data, we can calculate parking duration or categorize time of day (e.g., morning, afternoon).
Why this matters
New features can reveal insights, like which parking lots are busiest or how long vehicles typically stay.
Example
import pandas as pd
df['duration_hours'] = (df['exit_time'] - df['entry_time']).dt.total_seconds() / 3600
# Categorize entry time by hour of day
def categorize_time(hour):
    if 6 <= hour < 12:
        return 'Morning'
    elif 12 <= hour < 18:
        return 'Afternoon'
    else:
        return 'Evening'

df['time_of_day'] = df['entry_time'].dt.hour.apply(categorize_time)
print(df[['license_plate', 'duration_hours', 'time_of_day']])
# Output:
# license_plate duration_hours time_of_day
# 0 ABC123 1.000000 Morning
# 1 XYZ789 14.999722 Morning
# 2 DEF456 1.000000 Morning
Tip: Create features that align with your analysis goals, like revenue per hour or peak parking times.
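For instance, a minimal sketch that bins the durations above into stay-length categories with pd.cut() (the bin edges are made-up assumptions):
# Bin durations into short/medium/long stays
df['stay_type'] = pd.cut(df['duration_hours'], bins=[0, 2, 8, 24], labels=['Short', 'Medium', 'Long'])
print(df[['license_plate', 'duration_hours', 'stay_type']])
# Output:
# license_plate duration_hours stay_type
# 0 ABC123 1.000000 Short
# 1 XYZ789 14.999722 Long
# 2 DEF456 1.000000 Short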
Conclusion
Cleaning data with Pandas is a critical skill for any data analyst. By following these six steps—deduplication, missing value handling, data type conversion, column standardization, time logic validation, and data enhancement—you can turn raw parking lot data into a clean, analysis-ready dataset.
Practice these techniques on your own data to improve both your Pandas skills and your ability to communicate insights clearly. Check out my blog at jakedaily.com for more data science tips!