Pandas for ML Interviews: Master Data Prep, Ace the Technical Round

Jun 20, 2025

"Data scientists spend 80% of their time cleaning and preparing data, and only 20% actually building models."

If you've heard this statistic before, you understand why mastering data manipulation is non-negotiable in machine learning. Enter Pandas—the Swiss Army knife of tabular data in Python. Whether you're loading a CSV, cleaning messy data, or engineering new features, Pandas is your best friend.

In this guide, you'll go from Pandas novice to confident practitioner, learning how to wrangle data efficiently for machine learning. By the end, you'll be able to:

✅ Load and explore datasets
✅ Clean and preprocess messy data
✅ Engineer meaningful features
✅ Merge, group, and reshape data
✅ Prepare data for Scikit-learn

Let’s dive in!

1. Getting Started with Pandas

Installation

First, install Pandas (if you haven’t already):

pip install pandas

Importing Pandas

The standard convention is:

import pandas as pd

Meet the DataFrame and Series

DataFrame: A 2D table (like an Excel sheet) with rows and columns.
Series: A single column (a 1D array).

# Creating a Series
s = pd.Series([10, 20, 30, 40], name='My_Numbers')
print("Series:\n", s)

# Creating a DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 28],
    'City': ['New York', 'London', 'Paris', 'Tokyo']
}
df = pd.DataFrame(data)
print("\nDataFrame:\n", df)

Loading Data

Pandas can read data from multiple sources:

# From CSV
df = pd.read_csv('data.csv')

# From Excel
df = pd.read_excel('data.xlsx')

# From JSON
df = pd.read_json('data.json')

# From SQL (example with SQLite)
import sqlite3
conn = sqlite3.connect('database.db')
df = pd.read_sql('SELECT * FROM table_name', conn)

Quick Data Inspection

df.head()       # First 5 rows
df.info()       # Data types and missing values
df.describe()   # Summary statistics
df.shape        # (rows, columns)

2. Data Exploration and Selection

Accessing Columns

df['column_name']  # Returns a Series
df.column_name     # Same as above (if column name has no spaces)

Selecting Rows with `.loc[]` and `.iloc[]`

.loc[] - (Label-based indexing): Used for selecting rows and columns by their labels (index and column names).
.iloc[] - (Integer-location based indexing): Used for selecting rows and columns by their integer position (0-indexed).

# Select first 3 rows and 'age' column
df.loc[0:2, 'age']  

# Select first 3 rows and first 2 columns (by position)
df.iloc[0:3, 0:2]

Filtering Data

# Filter rows where age > 30
df[df['age'] > 30]  

# Multiple conditions
df[(df['age'] > 30) & (df['income'] > 50000)]

3. Data Cleaning and Preprocessing

Handling Missing Data

df.isnull().sum()  # Count missing values per column  

# Drop rows with missing values
df.dropna()  

# Fill missing values
df.fillna(0)  
df.fillna(df.mean())  # Fill with column mean

Changing Data Types

df['age'] = df['age'].astype(int)  # Convert to integer

Renaming Columns

df.rename(columns={'old_name': 'new_name'}, inplace=True)

String Operations

df['name'] = df['name'].str.lower()  # Lowercase all names
df['email'] = df['email'].str.contains('gmail')  # Check for Gmail

Applying Functions

.apply(): Applies a function along an axis of the DataFrame or Series.
.map(): Maps values from one Series to another using a dictionary or function (Series only).
lambda: Anonymous (unnamed) functions, often used inline with apply or map.

# Apply a function to a column
df['age_squared'] = df['age'].apply(lambda x: x ** 2)

# Map 'gender' to numerical values using .map()
gender_mapping = {'male': 0, 'female': 1}
df['gender_encoded'] = df['Sex'].map(gender_mapping)

4. Feature Engineering

Binning (Converting Numeric to Categorical)

Transforming continuous numerical features into discrete categories.

pd.cut(): Bins values into fixed-width bins.
pd.qcut(): Bins values into equal-sized bins (based on quantiles).

# Create age groups using fixed bins
df['age_group'] = pd.cut(df['age'], bins=[0, 18, 35, 60, 100], labels=['child', 'young', 'adult', 'senior'])

# Create Fare quartiles (equal number of passengers in each bin)
titanic_df['Fare_Quartile'] = pd.qcut(titanic_df['Fare'], q=4, labels=False, duplicates='drop') # labels=False gives integer labels

One-Hot Encoding (for Categorical Variables)

Converts categorical variables into a format that machine learning algorithms can understand. Creates new binary columns for each category.

df = pd.get_dummies(df, columns=['gender'])

DateTime Features

df['date'] = pd.to_datetime(df['date'])
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month

5. Grouping and Aggregation

GroupBy

df.groupby('department')['salary'].mean()  # Avg salary per department

Pivot Tables

pd.pivot_table(df, values='salary', index='department', columns='gender', aggfunc='mean')

6. Merging and Joining

Concatenation

pd.concat([df1, df2], axis=0)  # Stack vertically

Merging (Like SQL JOIN)

pd.merge(df1, df2, on='key_column', how='inner')  # inner, left, right, outer

7. Preparing Data for Machine Learning

Separate Features (X) and Target (y)

X = df.drop('target_column', axis=1)  
y = df['target_column']

Scaling (Brief Mention)

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

8. Best Practices

Chain operations for cleaner code:

(df.fillna(0)
   .rename(columns={'old': 'new'})
   .query('age > 30'))

Avoid chained indexing (use .loc[] instead).
Profile your data with pandas_profiling:

from pandas_profiling import ProfileReport
profile = ProfileReport(df)
profile.to_file('report.html')

Conclusion

You've just walked through the essential functionalities of Pandas for a complete Machine Learning workflow! From initially loading raw data to cleaning, transforming, creating features, and finally preparing your datasets for model training, Pandas is your indispensable ally.

Your next step? Practice! Grab an open dataset from Kaggle (like the original Titanic dataset, Iris, or Boston Housing) or UCI Machine Learning Repository. Apply what you've learned. Experiment, make mistakes, and learn from them. The more you use Pandas, the more intuitive and powerful it becomes.

Official Pandas Documentation: https://pandas.pydata.org/docs/
Pandas Cheat Sheet: https://pandas.pydata.org/pandas-docs/stable/getting_started/overview.html#cheat-sheet

Feel free to drop any questions or share your own favorite Pandas tips in the comments below!

Support My Work

If you found this post valuable, consider supporting my work with a small contribution via UPI. Your support:

Helps me and my team spend more time simplifying complex ML topics
Keeps deep-dive content accessible to everyone
Fuels more research and better-quality posts

UPI ID: manishmazumder-1@okaxis (scan below QR!)

Thank you for being part of this learning journey!

Manish Mazumder

Discussion about this post