Pandas Data Cleaning: Handle Duplicates, Missing Values, and Time Logic with Real Examples

Cleaning up messy data is often the most time-consuming part of data analysis, but it is also the most important.
We will walk through some examples that show how to clean and prepare your dataset using Pandas: removing duplicates, filling in missing values, checking time logic, and adding useful new fields. You will learn practical tools to turn raw data into something ready for analysis.
Good data leads to good decisions. With a clean dataset, your numbers will be accurate, your charts will make sense, and your analysis will actually be useful.
My Data Cleaning Workflow
- Deduplication: Remove duplicate records to prevent skewed analysis
- Missing Value Handling: Ensure data completeness and integrity
- Data Type Conversion: Prepare data for accurate computations
- Column Standardization: Improve readability and consistency
- Time Logic Validation: Eliminate chronologically impossible records
- Data Enhancement: Add analytical value through feature engineering
These steps will help you turn messy, raw data into something clean, reliable, and analysis-ready.
Not every dataset needs all six steps. It all depends on your data source, your business logic, and the kinds of insights you are after.
Let’s walk through each step with simple code examples and beginner-friendly tips.
Deduplication: duplicated() & drop_duplicates()
In parking data, it is common to see the same vehicle recorded multiple times. This could be due to network delays, system glitches, or errors in license plate recognition.
To clean this up, we use two basic but powerful Pandas tools: duplicated() to find duplicates, and drop_duplicates() to remove them.
Duplicate records can skew your analysis by inflating counts or distorting metrics like revenue or parking duration. However, before removing duplicates, consider your data's time precision:
- High precision (seconds): The same vehicle at the exact same second is likely a system error.
- Low precision (day/hour): The same vehicle in the same time period might be legitimate multiple entries.
Always verify that duplicate removal won't eliminate valid business scenarios.
Understanding the keep parameter
- keep='first': (default) Keep the first occurrence and mark later ones as duplicates.
- keep='last': Keep the last occurrence and mark earlier ones as duplicates.
- keep=False: Mark all duplicate entries as True, which is useful for viewing or exporting every copy, including the first. The sketch below shows how the three settings differ.
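For instance, here is a quick sketch of the three settings on a tiny made-up Series:
import pandas as pd
s = pd.Series(['A', 'A', 'B'])
print(s.duplicated(keep='first').tolist()) # [False, True, False]
print(s.duplicated(keep='last').tolist()) # [True, False, False]
print(s.duplicated(keep=False).tolist()) # [True, True, False]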
Using inplace=True
Setting inplace=True applies the change directly to the original DataFrame instead of returning a modified copy.
This can reduce memory use when working with large datasets, but the change is permanent unless you make a backup copy of the data first.
import pandas as pd
# Sample parking data
data = {
    'license_plate': ['ABC123', 'ABC123', 'XYZ789', 'DEF456'],
    'entry_time': ['2025-08-17 08:00:00', '2025-08-17 08:00:00', '2025-08-17 09:00:00', '2025-08-17 10:00:00'],
    'exit_time': ['2025-08-17 09:00:00', '2025-08-17 09:00:00', '2025-08-17 10:00:00', '2025-08-17 11:00:00'],
    'parking_lot': ['Lot A', 'Lot A', 'Lot B', 'Lot A']
}
df = pd.DataFrame(data)
# Check for duplicates
print(df.duplicated())
# Output:
# 0 False
# 1 True
# 2 False
# 3 False
# dtype: bool
# Remove duplicates, keeping the first occurrence
df.drop_duplicates(keep='first', inplace=True)
print(df)
# Output:
# license_plate entry_time exit_time parking_lot
# 0 ABC123 2025-08-17 08:00:00 2025-08-17 09:00:00 Lot A
# 2 XYZ789 2025-08-17 09:00:00 2025-08-17 10:00:00 Lot B
# 3 DEF456 2025-08-17 10:00:00 2025-08-17 11:00:00 Lot A
Tip: Always check which columns should define a duplicate. For example, use df.duplicated(subset=['license_plate', 'entry_time']) to consider only specific columns.
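For instance, a minimal sketch that pairs subset with keep=False to pull out every copy of a suspected duplicate; run on the original four-row DataFrame (before the drop_duplicates() call above), it surfaces both copies:
# Show all rows sharing a plate and entry time, including the first occurrence
dupe_mask = df.duplicated(subset=['license_plate', 'entry_time'], keep=False)
print(df[dupe_mask])
# Output (on the original four-row data):
# license_plate entry_time exit_time parking_lot
# 0 ABC123 2025-08-17 08:00:00 2025-08-17 09:00:00 Lot A
# 1 ABC123 2025-08-17 08:00:00 2025-08-17 09:00:00 Lot A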
Missing Value Handling: isna() & fillna()
In real-world datasets, missing values are almost always present—especially in parking lot systems. They can be caused by system errors, unstable network connections, failed license plate recognition, or incomplete transactions.
To handle them, we use isna() to detect missing values and fillna() to fill in missing values with defaults.
Why this matters
Missing data can cause errors in analysis or break your calculations altogether—especially when dealing with things like revenue, parking duration, or vehicle counts.
That’s why checking for missing values is usually my second step after removing duplicates.
Using inplace=True
One caveat here: calling fillna(..., inplace=True) on a single column (e.g., df['exit_time'].fillna(..., inplace=True)) is chained assignment, which modern pandas deprecates and which will stop working under Copy-on-Write. It is safer to assign the result back to the column instead, as in the example below.
Example
Let’s say some exit_time and parking_lot values are missing.
import pandas as pd
# Sample data with missing values
data = {
    'license_plate': ['ABC123', 'XYZ789', 'DEF456'],
    'entry_time': ['2025-08-17 08:00:00', '2025-08-17 09:00:00', '2025-08-17 10:00:00'],
    'exit_time': ['2025-08-17 09:00:00', None, '2025-08-17 11:00:00'],
    'parking_lot': ['Lot A', 'Lot B', None]
}
df = pd.DataFrame(data)
# Check for missing values
print(df.isna())
# Output:
# license_plate entry_time exit_time parking_lot
# 0 False False False False
# 1 False False True False
# 2 False False False True
# Fill missing values
df['exit_time'] = df['exit_time'].fillna('2025-08-17 23:59:59') # Default to end of day
df['parking_lot'] = df['parking_lot'].fillna('Unknown') # Default to 'Unknown'
print(df)
# Output:
# license_plate entry_time exit_time parking_lot
# 0 ABC123 2025-08-17 08:00:00 2025-08-17 09:00:00 Lot A
# 1 XYZ789 2025-08-17 09:00:00 2025-08-17 23:59:59 Lot B
# 2 DEF456 2025-08-17 10:00:00 2025-08-17 11:00:00 Unknown
Tip: Choose fill values that make sense for your analysis. For example, use the median for numerical data or a placeholder like “Unknown” for categorical data.
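For numerical columns, a median fill might look like this minimal sketch, assuming a hypothetical fee column:
import pandas as pd
# Hypothetical parking fees with one missing value
fees = pd.Series([5.0, None, 7.5, 10.0])
fees = fees.fillna(fees.median()) # median of the non-missing values is 7.5
print(fees.tolist())
# Output: [5.0, 7.5, 7.5, 10.0]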
Data Type Conversion
Correct data types are crucial for calculations and analysis. In parking data, entry_time and exit_time are often stored as strings but need to be converted to datetime for time-based operations.
Similarly, license_plate should be a string, and any numerical fields (like fees) should be numbers.
Why this matters
Incorrect data types can lead to errors, like trying to subtract strings or treating numbers as text. Converting to the right types ensures smooth calculations, such as finding parking duration.
Example
import pandas as pd
# Check data types
print(df.dtypes)
# Output:
# license_plate object
# entry_time object
# exit_time object
# parking_lot object
# dtype: object
# Convert to appropriate types
df['entry_time'] = pd.to_datetime(df['entry_time'])
df['exit_time'] = pd.to_datetime(df['exit_time'])
df['license_plate'] = df['license_plate'].astype(str)
print(df.dtypes)
# Output:
# license_plate object
# entry_time datetime64[ns]
# exit_time datetime64[ns]
# parking_lot object
# dtype: object
Tip: Use pd.to_datetime() for dates and times, and check for errors with errors='coerce' to handle invalid formats gracefully.
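For example, here is a quick sketch of errors='coerce' turning an unparseable string into NaT instead of raising an error (made-up values):
import pandas as pd
times = pd.Series(['2025-08-17 08:00:00', 'not a date'])
print(pd.to_datetime(times, errors='coerce'))
# Output:
# 0 2025-08-17 08:00:00
# 1 NaT
# dtype: datetime64[ns]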
Column Standardization
Standardizing column names and values improves readability and consistency. For example, column names like entry_time and ExitTime should follow a consistent format (e.g., all lowercase with underscores). Similarly, values like Lot A and lot a should be standardized.
Why this matters
Inconsistent naming or values can cause confusion or errors in analysis, especially when merging datasets or sharing with others.
import pandas as pd
# Standardize column names
df.columns = [col.lower().replace(' ', '_') for col in df.columns]
print(df.columns)
# Output: Index(['license_plate', 'entry_time', 'exit_time', 'parking_lot'], dtype='object')
# Standardize parking_lot values (capitalize each word)
df['parking_lot'] = df['parking_lot'].str.title()
print(df['parking_lot'])
# Output:
# 0 Lot A
# 1 Lot B
# 2 Unknown
# Name: parking_lot, dtype: object
Tip: Use str.lower(), str.title(), or str.replace() to clean text data consistently.
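As a quick illustration, a minimal sketch that chains those methods to normalize made-up, inconsistent labels:
import pandas as pd
lots = pd.Series(['Lot A', 'lot a', 'LOT  B'])
# Collapse repeated spaces, trim, then fix casing
lots = lots.str.replace(r'\s+', ' ', regex=True).str.strip().str.title()
print(lots.tolist())
# Output: ['Lot A', 'Lot A', 'Lot B']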
Time Logic Validation
In parking data, exit_time should always be after entry_time. Records where this isn’t true are invalid and should be flagged or removed.
Why this matters
Chronologically impossible records can skew metrics like parking duration or revenue calculations.
Example
import pandas as pd
# Flag invalid records
df['is_valid'] = df['exit_time'] > df['entry_time']
print(df[['license_plate', 'entry_time', 'exit_time', 'is_valid']])
# Output:
# license_plate entry_time exit_time is_valid
# 0 ABC123 2025-08-17 08:00:00 2025-08-17 09:00:00 True
# 1 XYZ789 2025-08-17 09:00:00 2025-08-17 23:59:59 True
# 2 DEF456 2025-08-17 10:00:00 2025-08-17 11:00:00 True
# Remove invalid records
df = df[df['is_valid']].drop(columns=['is_valid'])
Tip: Always validate time logic before calculating durations or other time-based metrics.
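Since every record above happens to be valid, here is a minimal sketch with a deliberately impossible row to show the check catching something:
import pandas as pd
bad = pd.DataFrame({
    'entry_time': pd.to_datetime(['2025-08-17 10:00:00']),
    'exit_time': pd.to_datetime(['2025-08-17 09:00:00']) # exits before it enters
})
print((bad['exit_time'] > bad['entry_time']).tolist())
# Output: [False] -> this row would be flagged and removed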
Data Enhancement
Feature engineering adds new columns to make analysis easier. For parking data, we can calculate parking duration or categorize time of day (e.g., morning, afternoon).
Why this matters
New features can reveal insights, like which parking lots are busiest or how long vehicles typically stay.
Example
import pandas as pd
df['duration_hours'] = (df['exit_time'] - df['entry_time']).dt.total_seconds() / 3600
# Categorize entry time by hour of day
def categorize_time(hour):
    if 6 <= hour < 12:
        return 'Morning'
    elif 12 <= hour < 18:
        return 'Afternoon'
    else:
        return 'Evening'

df['time_of_day'] = df['entry_time'].dt.hour.apply(categorize_time)
print(df[['license_plate', 'duration_hours', 'time_of_day']])
# Output:
# license_plate duration_hours time_of_day
# 0 ABC123 1.000000 Morning
# 1 XYZ789 14.999722 Morning
# 2 DEF456 1.000000 Morning
Tip: Create features that align with your analysis goals, like revenue per hour or peak parking times.
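For instance, a minimal sketch that bins the durations above into stay-length categories with pd.cut() (the bin edges are made-up assumptions):
# Bin durations into short/medium/long stays
df['stay_type'] = pd.cut(df['duration_hours'], bins=[0, 2, 8, 24], labels=['Short', 'Medium', 'Long'])
print(df[['license_plate', 'duration_hours', 'stay_type']])
# Output:
# license_plate duration_hours stay_type
# 0 ABC123 1.000000 Short
# 1 XYZ789 14.999722 Long
# 2 DEF456 1.000000 Short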
Conclusion
Cleaning data with Pandas is a critical skill for any data analyst. By following these six steps—deduplication, missing value handling, data type conversion, column standardization, time logic validation, and data enhancement—you can turn raw parking lot data into a clean, analysis-ready dataset.
Practice these techniques on your own data to improve both your Pandas skills and your ability to communicate insights clearly. Check out my blog at jakedaily.com for more data science tips!