"Data scientists spend 80% of their time cleaning and preparing data, and only 20% actually building models."
If you've heard this statistic before, you understand why mastering data manipulation is non-negotiable in machine learning. Enter Pandas—the Swiss Army knife of tabular data in Python. Whether you're loading a CSV, cleaning messy data, or engineering new features, Pandas is your best friend.
In this guide, you'll go from Pandas novice to confident practitioner, learning how to wrangle data efficiently for machine learning. By the end, you'll be able to:
✅ Load and explore datasets
✅ Clean and preprocess messy data
✅ Engineer meaningful features
✅ Merge, group, and reshape data
✅ Prepare data for Scikit-learn
Let’s dive in!
1. Getting Started with Pandas
Installation
First, install Pandas (if you haven’t already):
pip install pandas
Importing Pandas
The standard convention is:
import pandas as pd
Meet the DataFrame and Series
DataFrame: A 2D table (like an Excel sheet) with rows and columns.
Series: A single column (a 1D array).
# Creating a Series
s = pd.Series([10, 20, 30, 40], name='My_Numbers')
print("Series:\n", s)
# Creating a DataFrame
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Age': [25, 30, 35, 28],
'City': ['New York', 'London', 'Paris', 'Tokyo']
}
df = pd.DataFrame(data)
print("\nDataFrame:\n", df)
Loading Data
Pandas can read data from multiple sources:
# From CSV
df = pd.read_csv('data.csv')
# From Excel
df = pd.read_excel('data.xlsx')
# From JSON
df = pd.read_json('data.json')
# From SQL (example with SQLite)
import sqlite3
conn = sqlite3.connect('database.db')
df = pd.read_sql('SELECT * FROM table_name', conn)
Quick Data Inspection
df.head() # First 5 rows
df.info() # Data types and missing values
df.describe() # Summary statistics
df.shape # (rows, columns)
2. Data Exploration and Selection
Accessing Columns
df['column_name'] # Returns a Series
df.column_name # Same as above (if column name has no spaces)
Selecting Rows with .loc[]
and .iloc[]
.loc[] -
(Label-based indexing): Used for selecting rows and columns by their labels (index and column names)..iloc[] -
(Integer-location based indexing): Used for selecting rows and columns by their integer position (0-indexed).
# Select first 3 rows and 'age' column
df.loc[0:2, 'age']
# Select first 3 rows and first 2 columns (by position)
df.iloc[0:3, 0:2]
Filtering Data
# Filter rows where age > 30
df[df['age'] > 30]
# Multiple conditions
df[(df['age'] > 30) & (df['income'] > 50000)]
3. Data Cleaning and Preprocessing
Handling Missing Data
df.isnull().sum() # Count missing values per column
# Drop rows with missing values
df.dropna()
# Fill missing values
df.fillna(0)
df.fillna(df.mean()) # Fill with column mean
Changing Data Types
df['age'] = df['age'].astype(int) # Convert to integer
Renaming Columns
df.rename(columns={'old_name': 'new_name'}, inplace=True)
String Operations
df['name'] = df['name'].str.lower() # Lowercase all names
df['email'] = df['email'].str.contains('gmail') # Check for Gmail
Applying Functions
.apply()
: Applies a function along an axis of the DataFrame or Series..map()
: Maps values from one Series to another using a dictionary or function (Series only).lambda
: Anonymous (unnamed) functions, often used inline withapply
ormap
.
# Apply a function to a column
df['age_squared'] = df['age'].apply(lambda x: x ** 2)
# Map 'gender' to numerical values using .map()
gender_mapping = {'male': 0, 'female': 1}
df['gender_encoded'] = df['Sex'].map(gender_mapping)
4. Feature Engineering
Binning (Converting Numeric to Categorical)
Transforming continuous numerical features into discrete categories.
pd.cut()
: Bins values into fixed-width bins.pd.qcut()
: Bins values into equal-sized bins (based on quantiles).
# Create age groups using fixed bins
df['age_group'] = pd.cut(df['age'], bins=[0, 18, 35, 60, 100], labels=['child', 'young', 'adult', 'senior'])
# Create Fare quartiles (equal number of passengers in each bin)
titanic_df['Fare_Quartile'] = pd.qcut(titanic_df['Fare'], q=4, labels=False, duplicates='drop') # labels=False gives integer labels
One-Hot Encoding (for Categorical Variables)
Converts categorical variables into a format that machine learning algorithms can understand. Creates new binary columns for each category.
df = pd.get_dummies(df, columns=['gender'])
DateTime Features
df['date'] = pd.to_datetime(df['date'])
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
5. Grouping and Aggregation
GroupBy
df.groupby('department')['salary'].mean() # Avg salary per department
Pivot Tables
pd.pivot_table(df, values='salary', index='department', columns='gender', aggfunc='mean')
6. Merging and Joining
Concatenation
pd.concat([df1, df2], axis=0) # Stack vertically
Merging (Like SQL JOIN)
pd.merge(df1, df2, on='key_column', how='inner') # inner, left, right, outer
7. Preparing Data for Machine Learning
Separate Features (X) and Target (y)
X = df.drop('target_column', axis=1)
y = df['target_column']
Scaling (Brief Mention)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
8. Best Practices
Chain operations for cleaner code:
(df.fillna(0)
.rename(columns={'old': 'new'})
.query('age > 30'))
Avoid chained indexing (use
.loc[]
instead).Profile your data with
pandas_profiling
:
from pandas_profiling import ProfileReport
profile = ProfileReport(df)
profile.to_file('report.html')
Conclusion
You've just walked through the essential functionalities of Pandas for a complete Machine Learning workflow! From initially loading raw data to cleaning, transforming, creating features, and finally preparing your datasets for model training, Pandas is your indispensable ally.
Your next step? Practice! Grab an open dataset from Kaggle (like the original Titanic dataset, Iris, or Boston Housing) or UCI Machine Learning Repository. Apply what you've learned. Experiment, make mistakes, and learn from them. The more you use Pandas, the more intuitive and powerful it becomes.
Official Pandas Documentation: https://pandas.pydata.org/docs/
Pandas Cheat Sheet: https://pandas.pydata.org/pandas-docs/stable/getting_started/overview.html#cheat-sheet
Feel free to drop any questions or share your own favorite Pandas tips in the comments below!
Support My Work
If you found this post valuable, consider supporting my work with a small contribution via UPI. Your support:
Helps me and my team spend more time simplifying complex ML topics
Keeps deep-dive content accessible to everyone
Fuels more research and better-quality posts
UPI ID: manishmazumder-1@okaxis (scan below QR!)
Thank you for being part of this learning journey!