Exploratory Data Analysis for Machine Learning
Abstract:
Exploratory Data Analysis (EDA) is a foundational step in any machine learning workflow. This section focuses on the practical aspects of retrieving, cleaning, transforming, and exploring data to prepare it for modeling. You will learn how to access data from various sources (CSV, JSON, SQL databases, APIs, and cloud storage), inspect its structure, manage missing values and outliers, encode categorical features, and apply key statistical transformations. The goal is to develop a robust data preprocessing pipeline that ensures data quality and model readiness.
Loading Data from CSV, JSON, SQL, APIs, and Cloud Storage
In real-world data science projects, data originates from diverse sources. Understanding how to load data into a Python environment is a critical first step in exploratory data analysis.
Common Sources and Tools:
- CSV Files: Comma-separated files, widely used for tabular data. Tool: pandas.read_csv()
- JSON Files: JavaScript Object Notation, suitable for hierarchical data. Tool: pandas.read_json()
- SQL Databases: Structured data stored in relational databases. Tools: sqlite3, SQLAlchemy, pandas.read_sql()
- APIs: Data accessed over the internet using HTTP requests. Tools: requests, pandas.json_normalize()
- Cloud Storage: Files stored in services like Google Drive, AWS S3, or Google Cloud Storage. Tools: gcsfs, boto3, fsspec, or authenticated URLs with pandas.read_csv()
The goal is to create a unified interface for loading heterogeneous data sources into pandas DataFrames for further processing.
Example: Multi-source Customer Data Aggregation
A company collects customer data from CRM exports (CSV), online behavior logs (JSON via API), and transaction records from a cloud-based SQL database. Loading and integrating these datasets is the first step before performing segmentation or predictive modeling.
import pandas as pd
import sqlite3
import requests
# Load CSV
csv_df = pd.read_csv('data/customers.csv')
# Load JSON
json_df = pd.read_json('data/web_logs.json')
# Load from SQL (SQLite example)
conn = sqlite3.connect('data/transactions.db')
sql_df = pd.read_sql("SELECT * FROM transactions", conn)
# Load from API (simulate a GET request returning JSON)
url = "https://jsonplaceholder.typicode.com/posts"
response = requests.get(url)
api_df = pd.json_normalize(response.json())
# Load from Google Cloud Storage (public file, using fsspec)
# Requires: pip install gcsfs
gcs_path = 'gs://your-bucket-name/data/file.csv'
cloud_df = pd.read_csv(gcs_path) # if publicly accessible
# Summary
print("CSV Data Shape:", csv_df.shape)
print("JSON Data Shape:", json_df.shape)
print("SQL Data Shape:", sql_df.shape)
print("API Data Shape:", api_df.shape)
print("Cloud Storage Data Shape:", cloud_df.shape)
Accessing Data from Databases
Databases are a common source of structured data in real-world applications. In data science, accessing data from databases allows analysts to query large datasets efficiently and retrieve only what is needed for analysis.
Two primary database types:
- Relational Databases (SQL): Data is stored in tables with fixed schemas (e.g., PostgreSQL, MySQL, SQLite).
- NoSQL Databases: Flexible schemas, suited for unstructured or hierarchical data (e.g., MongoDB).
Accessing data involves:
1. Establishing a connection to the database.
2. Writing and executing SQL queries (for relational DBs).
3. Reading query results into a DataFrame for analysis.
Tools:
- sqlite3, psycopg2, mysql.connector, or SQLAlchemy for SQL-based databases.
- pymongo for MongoDB.
- pandas.read_sql() to directly load query results into DataFrames.
Example: Customer Analytics Dashboard
A company stores customer and transaction data in a PostgreSQL database. A data scientist writes SQL queries to extract relevant data—like customer segments and recent purchases—then loads it into a DataFrame for modeling or visualization in a dashboard. The demo below uses an in-memory SQLite database; a PostgreSQL variant is sketched after it.
import sqlite3
import pandas as pd
# Create a connection to an SQLite database (in-memory for demo)
conn = sqlite3.connect(':memory:')
# Create a sample table and insert data
conn.execute('CREATE TABLE customers (id INTEGER, name TEXT, region TEXT)')
conn.executemany('INSERT INTO customers VALUES (?, ?, ?)', [
(1, 'Alice', 'North'),
(2, 'Bob', 'East'),
(3, 'Carol', 'West')
])
# Use SQL to retrieve data
query = "SELECT * FROM customers"
df = pd.read_sql(query, conn)
# Close the connection
conn.close()
# Display the result
print(df)
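Because the scenario above describes PostgreSQL rather than SQLite, here is a minimal sketch of the same pattern using SQLAlchemy; the connection string (user, password, host, database name) and the query columns are hypothetical and need to be adapted.
import pandas as pd
from sqlalchemy import create_engine
# Hypothetical connection string -- requires the psycopg2 driver
engine = create_engine('postgresql+psycopg2://user:password@localhost:5432/crm_db')
query = """
SELECT customer_id, segment, last_purchase_date
FROM customers
WHERE last_purchase_date >= '2023-01-01'
"""
pg_df = pd.read_sql(query, engine)
print(pg_df.head())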
Reading and Exploring Data in Jupyter Notebooks
Jupyter Notebooks provide an interactive environment ideal for data exploration. After loading data into a DataFrame, the next step is to inspect its structure, quality, and content before cleaning or modeling.
Key exploratory commands in pandas:
- df.head(n): View the first n rows
- df.tail(n): View the last n rows
- df.shape: Get the number of rows and columns
- df.columns: List column names
- df.info(): Summary of data types and non-null counts
- df.describe(): Statistical summary of numerical columns
- df.sample(n): Random subset of the data
- df.nunique(): Count of unique values per column
These tools help identify missing values, inconsistent types, suspicious outliers, and the overall structure of the dataset.
Example: Initial Data Audit for Sales Data
A data analyst loads transaction data in Jupyter and uses basic .info(), .describe(), and .head() commands to check data quality before continuing with transformation or modeling.
import pandas as pd
# Simulated retail dataset
data = {
'customer_id': [1001, 1002, 1003, 1004, 1005, 1006],
'region': ['North', 'South', 'East', 'West', 'North', 'South'],
'purchase_amount': [250.5, 300.0, 150.0, None, 200.0, 180.0],
'is_returned': [False, False, True, False, True, False],
'signup_date': pd.to_datetime(['2022-01-10', '2022-01-12', '2022-01-15', '2022-01-17', '2022-01-19', '2022-01-21'])
}
df = pd.DataFrame(data)
# View the first and last few records
print("First 3 rows:\n", df.head(3))
print("\nLast 2 rows:\n", df.tail(2))
# Check dimensions and column names
print("\nData Shape:", df.shape)
print("\nColumn Names:", df.columns.tolist())
# Summary info on data types and missing values
print("\nDataFrame Info:")
df.info()
# Descriptive statistics for numeric columns
print("\nSummary Statistics:\n", df.describe())
# Random sample of rows
print("\nRandom Sample:\n", df.sample(2))
# Count unique values in each column
print("\nUnique Values Per Column:\n", df.nunique())
Data Cleaning Techniques and Best Practices
Data cleaning is a critical step in preparing raw data for analysis or modeling. Real-world datasets often contain issues such as missing values, inconsistent types, duplicates, or invalid entries. Cleaning ensures data quality, consistency, and reliability.
Key Techniques
- Inspect structure: Use .info(), .describe(), .head(), .shape
- Drop or impute missing values: df.dropna(), df.fillna()
- Fix data types: df.astype() for conversions
- Remove duplicates: df.drop_duplicates()
- Rename columns: df.rename(columns=...)
- Filter rows and columns: df.loc[], df.iloc[], or conditionals
- Standardize values: Use .str.lower(), .replace(), etc.
Example: Cleaning a Product Dataset Before Analysis
import pandas as pd
import numpy as np
# Sample raw data with typical issues
raw_data = {
'ProductID': [101, 102, 103, 103, 104, 105],
'ProductName': ['Widget', 'Gadget', 'Widget', 'Widget', np.nan, 'Gizmo'],
'Price': ['25.00', '30', 'N/A', '25.00', '20.5', '15.0'],
'Stock': [10, None, 5, 5, 0, 20],
'Category': ['Tools', 'Tools', 'Tools', 'Tools', 'Misc', 'misc']
}
df = pd.DataFrame(raw_data)
# Inspect structure
print("Initial Data Overview:")
print(df.info())
print(df.head())
# Clean 'Price' column: Convert to numeric, coerce errors
df['Price'] = pd.to_numeric(df['Price'], errors='coerce')
# Drop duplicates
df = df.drop_duplicates()
# Handle missing values: Drop rows with missing ProductName or Price
df = df.dropna(subset=['ProductName', 'Price'])
# Fill missing stock values with 0
df['Stock'] = df['Stock'].fillna(0)
# Standardize 'Category' values
df['Category'] = df['Category'].str.lower()
# Rename columns for consistency
df = df.rename(columns={'ProductID': 'product_id', 'ProductName': 'product_name'})
# Final cleaned data
print("\nCleaned DataFrame:")
print(df)
Select Column and Row, Count Categorical, Stat Continuous
After loading and cleaning a dataset, targeted exploration often involves selecting specific rows, columns, or subsets of the data. It is also essential to explore categorical variables (e.g., counts, modes) and summarize continuous variables (e.g., mean, std, min, max).
Key Operations
- Select columns: df['col'], df[['col1', 'col2']]
- Select rows: df.loc[condition], df.iloc[row_idx]
- Count categorical values: df['col'].value_counts(), df['col'].mode()
- Summarize continuous variables: df['col'].mean(), .std(), .min(), .max(), .describe()
Example: Targeted Exploration of a Customer Dataset
import pandas as pd
# Sample dataset
data = {
'customer_id': [201, 202, 203, 204, 205],
'region': ['East', 'West', 'East', 'North', 'South'],
'spend': [250.0, 150.5, 300.0, 400.0, 120.0],
'membership': ['Basic', 'Premium', 'Basic', 'Premium', 'Basic']
}
df = pd.DataFrame(data)
# Select a single column
spend_column = df['spend']
print("Spend Column:\n", spend_column)
# Select multiple columns
subset = df[['customer_id', 'membership']]
print("\nCustomer ID and Membership:\n", subset)
# Select rows based on condition
high_spenders = df[df['spend'] > 250]
print("\nHigh Spenders:\n", high_spenders)
# Select row by index
second_row = df.iloc[1]
print("\nSecond Row:\n", second_row)
# Count categorical values
region_counts = df['region'].value_counts()
membership_mode = df['membership'].mode().iloc[0]
print("\nRegion Counts:\n", region_counts)
print("\nMost Common Membership Type:", membership_mode)
# Stats for continuous column
spend_stats = df['spend'].describe()
print("\nSpend Statistics:\n", spend_stats)
Pre-processing: Feature Selection, Extraction, and Conversion
Pre-processing prepares raw data for modeling by identifying relevant features, engineering new ones, and converting them into suitable formats. This step directly impacts model performance, interpretability, and training efficiency.
Key Concepts
- Feature Selection : Choosing relevant features for modeling (e.g., removing redundant or low-variance columns).
- Feature Extraction : Creating new informative features from existing ones (e.g., extracting year from date).
- Feature Conversion : Transforming features into model-ready formats (e.g., categorical to numeric, datetime parsing).
These operations help reduce noise, improve generalization, and optimize learning.
Example: Preparing E-Commerce Data for Modeling
import pandas as pd
# Simulated data
data = {
'user_id': [1, 2, 3, 4],
'signup_date': ['2022-01-05', '2022-02-10', '2022-01-20', '2022-03-01'],
'last_login': ['2022-04-01', '2022-04-03', '2022-04-01', '2022-04-02'],
'country': ['US', 'UK', 'US', 'CA'],
'age': [25, 30, 22, 35],
'has_purchased': [1, 0, 1, 0]
}
df = pd.DataFrame(data)
# --- Feature Conversion ---
# Convert date columns to datetime
df['signup_date'] = pd.to_datetime(df['signup_date'])
df['last_login'] = pd.to_datetime(df['last_login'])
# --- Feature Extraction ---
# Days since signup
df['days_since_signup'] = (df['last_login'] - df['signup_date']).dt.days
# --- Feature Selection ---
# Drop irrelevant or redundant columns
df_model = df.drop(columns=['user_id', 'signup_date', 'last_login'])
# Final feature matrix
print("Prepared Feature Matrix:\n", df_model)
Common Variable Transformation Methods in Python
A transformation is the application of a mathematical function to a variable in order to change its distribution, scale, or functional form—often to better satisfy modeling assumptions or enhance predictive power. Variable transformation is a crucial step in data preprocessing. It helps improve model performance, normalize distributions, handle skewness, and encode relationships between variables more effectively.
Common Transformations
- Log Transformation : Stabilizes variance and reduces right skew.
- Square Root / Power Transforms : Alternative for mild skew or variance stabilization.
- Standardization (Z-score) : Scales features to have mean 0 and std 1.
- Min-Max Normalization : Rescales data to a [0, 1] range.
- Binning : Converts continuous variables into categorical bins (e.g., age groups).
- Box-Cox / Yeo-Johnson : Normalize non-normal variables (requires positive values for Box-Cox).
These transformations are often necessary for algorithms sensitive to feature scaling (e.g., k-NN, SVM, PCA).
Example: Transforming Income Data for Modeling
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, PowerTransformer
# Simulated income data
df = pd.DataFrame({
'income': [35000, 42000, 59000, 120000, 30000, 100000]
})
# Log transformation
df['income_log'] = np.log(df['income'])
# Standardization (Z-score)
scaler = StandardScaler()
df['income_z'] = scaler.fit_transform(df[['income']])
# Min-max normalization
minmax = MinMaxScaler()
df['income_minmax'] = minmax.fit_transform(df[['income']])
# Binning (quartiles)
df['income_bin'] = pd.qcut(df['income'], q=4, labels=False)
# Yeo-Johnson transformation (can handle zero/negative values)
pt = PowerTransformer(method='yeo-johnson')
df['income_yeojohnson'] = pt.fit_transform(df[['income']])
print(df)
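The list above also mentions Box-Cox, which the example omits. A minimal sketch is shown below on the same simulated income values; Box-Cox requires strictly positive inputs, which these satisfy.
import pandas as pd
from sklearn.preprocessing import PowerTransformer
incomes = pd.DataFrame({'income': [35000, 42000, 59000, 120000, 30000, 100000]})
# Box-Cox variant of the power transform (positive values only)
bc = PowerTransformer(method='box-cox')
incomes['income_boxcox'] = bc.fit_transform(incomes[['income']])
print(incomes)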
Managing Missing Data and Identifying Outliers
Missing values and outliers are common in real-world datasets and can negatively impact analysis and modeling if not handled properly. Managing these correctly helps preserve data integrity and model reliability.
Managing Missing Data
- Detection: df.isna().sum(), df.info()
- Removal: df.dropna() — removes rows/columns with missing values
- Imputation: df.fillna(value), or statistical imputation (mean, median, mode)
Identifying Outliers
- Visual methods: boxplot(), hist()
- Statistical methods:
  - Z-score method: values with \(|z| > 3\) often considered outliers
  - IQR method: values below \(Q1 - 1.5 \times IQR\) or above \(Q3 + 1.5 \times IQR\)
Example: Detecting and Treating Missing Values and Outliers in Sensor Data
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
# Simulated sensor readings
df = pd.DataFrame({
'temperature': [22.5, 23.0, np.nan, 21.5, 500.0, 22.7, 23.1, 22.6, np.nan, 21.8]
})
# --- Handling Missing Values ---
print("Missing Values:\n", df.isna().sum())
# Impute missing values with the mean
df['temperature_imputed'] = df['temperature'].fillna(df['temperature'].mean())
# --- Identifying Outliers using Z-score ---
z_scores = stats.zscore(df['temperature_imputed'])
df['z_score'] = z_scores
df['is_outlier_z'] = np.abs(z_scores) > 3
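# Note: with only ten readings, the extreme 500.0 value inflates the standard
# deviation enough that its z-score stays just under 3, so the |z| > 3 rule does
# not flag it here; the IQR method below does. On small samples the IQR rule is
# usually the more robust choice.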
# --- Identifying Outliers using IQR ---
Q1 = df['temperature_imputed'].quantile(0.25)
Q3 = df['temperature_imputed'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
df['is_outlier_iqr'] = (df['temperature_imputed'] < lower_bound) | (df['temperature_imputed'] > upper_bound)
# --- Plotting ---
plt.boxplot(df['temperature_imputed'], vert=False)
plt.title("Boxplot of Imputed Temperature Readings")
plt.xlabel("Temperature")
plt.show()
print(df)
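The imputation bullet above also mentions median and mode. As a minimal alternative sketch, scikit-learn's SimpleImputer applies median imputation to the same simulated readings; the median is far less affected by the extreme 500.0 value than the mean used above.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
readings = pd.DataFrame({
    'temperature': [22.5, 23.0, np.nan, 21.5, 500.0, 22.7, 23.1, 22.6, np.nan, 21.8]
})
# Median imputation is robust to the extreme reading
imputer = SimpleImputer(strategy='median')
readings['temperature_median'] = imputer.fit_transform(readings[['temperature']]).ravel()
print(readings)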
Encoding Categorical Features
Many machine learning models require numerical input, so categorical features (e.g., gender, region, product type) must be converted into numeric representations; this process is known as encoding. The choice of encoding technique depends on whether the categorical variable is nominal (no order) or ordinal (has a natural order), and on how many unique categories it contains.
Common Encoding Techniques
- Label Encoding : Assigns a unique integer to each category. Suitable for ordinal features.
- One-Hot Encoding : Converts categories into binary columns. Suitable for nominal features with low cardinality.
- Ordinal Encoding : Maps categories to integers using a meaningful order defined by domain knowledge.
- Frequency Encoding : Replaces categories with their frequency counts.
- Target Encoding : Replaces categories with the mean of the target variable for each category (used in supervised learning).
Example: Encoding Multiple Categorical Features from a Retail Dataset
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder
from sklearn.model_selection import train_test_split
# Simulated dataset
df = pd.DataFrame({
'product_id': [101, 102, 103, 104, 105],
'category': ['Electronics', 'Clothing', 'Clothing', 'Home', 'Electronics'],
'size': ['Small', 'Medium', 'Large', 'Medium', 'Small'],
'region': ['East', 'West', 'East', 'South', 'North'],
'rating': [4.5, 3.8, 4.2, 4.0, 4.3],
'purchased': [1, 0, 1, 0, 1] # binary target
})
# 1. Label Encoding (for nominal variable — simple demo)
label_encoder = LabelEncoder()
df['region_encoded'] = label_encoder.fit_transform(df['region'])
# 2. Ordinal Encoding (for ordinal variable like size)
size_order = ['Small', 'Medium', 'Large']
ordinal_encoder = OrdinalEncoder(categories=[size_order])
df['size_encoded'] = ordinal_encoder.fit_transform(df[['size']]).astype(int)
# 3. One-Hot Encoding (for 'category')
df_onehot = pd.get_dummies(df, columns=['category'], prefix='cat')
# 4. Frequency Encoding (for region)
region_freq = df['region'].value_counts(normalize=False)
df['region_freq_encoded'] = df['region'].map(region_freq)
# 5. Target Encoding (for category, using training data only)
X_train, X_test = train_test_split(df, test_size=0.4, random_state=0)
X_test = X_test.copy()  # work on an explicit copy before adding columns
category_means = X_train.groupby('category')['purchased'].mean()
X_test['category_target_encoded'] = X_test['category'].map(category_means)
# Show final DataFrames
print("Original Data with Label and Ordinal Encoding:\n", df[['region', 'region_encoded', 'size', 'size_encoded']])
print("\nOne-Hot Encoded Data:\n", df_onehot.head())
print("\nFrequency Encoded Region:\n", df[['region', 'region_freq_encoded']])
print("\nTarget Encoded Category (Test Set Only):\n", X_test[['category', 'category_target_encoded']])
Using Residuals to Handle Outliers, Missing Values, and Duplicates
Residuals, defined as the difference between observed and predicted values, can be a powerful diagnostic tool in data cleaning. By modeling expected behavior and examining residuals, we can:
- Detect Outliers : Large residuals may indicate abnormal or inconsistent observations.
- Impute Missing Values : Use predicted values from regression models to fill in missing data.
- Identify Duplicates with Inconsistent Values : Duplicates with large residual differences suggest data entry issues.
This approach assumes that deviations from the expected pattern signal anomalies worth reviewing or cleaning.
Example: Cleaning a Sales Dataset Using Residuals from a Predictive Model
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
# Simulated dataset
df = pd.DataFrame({
'store_id': [1, 2, 3, 4, 5, 6],
'region': ['North', 'South', 'East', 'West', 'North', 'East'],
'employees': [10, 15, 12, 8, 10, 12],
'sales': [100, 150, 130, 90, 500, np.nan] # note: one outlier and one missing value
})
# Drop rows with missing target for now (we'll impute later)
df_model = df.dropna(subset=['sales'])
# Fit a simple regression model: sales ~ employees
X = df_model[['employees']]
y = df_model['sales']
model = LinearRegression()
model.fit(X, y)
# Predict and compute residuals
df_model['predicted_sales'] = model.predict(X)
df_model['residual'] = df_model['sales'] - df_model['predicted_sales']
# Detect outliers using residuals (e.g., abs(residual) > 2 * std)
threshold = 2 * df_model['residual'].std()
df_model['is_outlier'] = np.abs(df_model['residual']) > threshold
# Impute missing sales using model prediction
missing_row = df[df['sales'].isna()]
df.loc[df['sales'].isna(), 'sales'] = model.predict(missing_row[['employees']])
# Check for duplicates
df_duplicates = df.duplicated(subset=['store_id', 'region', 'employees'], keep=False)
# Visualization
sns.residplot(x=df_model['employees'], y=df_model['sales'], lowess=True)
plt.axhline(0, color='red', linestyle='--')
plt.title("Residual Plot: Sales vs Employees")
plt.xlabel("Employees")
plt.ylabel("Residuals")
plt.show()
print("Data with Residuals and Outlier Flag:\n", df_model[['store_id', 'employees', 'sales', 'predicted_sales', 'residual', 'is_outlier']])
print("\nRows flagged as potential duplicates:\n", df[df_duplicates])
print("\nDataset after imputing missing values:\n", df)
Estimate Skewness and Kurtosis
Skewness and kurtosis are statistical metrics that describe the shape of a distribution. They are useful for identifying non-normality and guiding transformations or model choices.
Key Concepts
- Skewness measures the asymmetry of the distribution.
  - Positive skew: long tail on the right.
  - Negative skew: long tail on the left.
  - A normal distribution has skewness ≈ 0.
- Kurtosis measures the "tailedness" or peak of the distribution.
  - High kurtosis (>3): heavy tails and a sharp peak.
  - Low kurtosis (<3): light tails and a flat peak.
  - A normal distribution has kurtosis ≈ 3 (excess kurtosis ≈ 0).
High skewness or kurtosis may suggest the need for transformation (e.g., log, square root) before applying statistical models.
Example: Assessing Distribution Shape of Customer Spend Data
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import skew, kurtosis
# Simulated skewed spending data
np.random.seed(42)
spend = np.random.exponential(scale=100, size=500)
df = pd.DataFrame({'spend': spend})
# Calculate skewness and kurtosis
skew_val = skew(df['spend'])
kurt_val = kurtosis(df['spend'], fisher=False) # Fisher=False to get kurtosis comparable to normal distribution
print(f"Skewness: {skew_val:.2f}")
print(f"Kurtosis: {kurt_val:.2f}")
# Visualization
sns.histplot(df['spend'], bins=40, kde=True)
plt.title('Distribution of Customer Spend')
plt.xlabel('Spend')
plt.ylabel('Frequency')
plt.show()
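Since high skewness often calls for a transformation, a short follow-up sketch re-checks skewness after a log1p transform of the same simulated spend values (log1p rather than log is used so that zeros, if any, are handled safely).
import numpy as np
from scipy.stats import skew
# Same exponential spend sample as above
np.random.seed(42)
spend = np.random.exponential(scale=100, size=500)
print(f"Skewness before transform: {skew(spend):.2f}")
print(f"Skewness after log1p: {skew(np.log1p(spend)):.2f}")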
Visual Techniques for EDA
Visualization is essential in Exploratory Data Analysis (EDA) to understand the structure, distribution, and relationships within a dataset. Visual tools help detect patterns, trends, outliers, and data quality issues that are hard to capture through summary statistics alone.
Common Visualization Techniques
- Histogram : Shows distribution of a single numeric variable.
- Box Plot : Detects outliers and compares distributions across groups.
- Scatter Plot : Visualizes relationships between two numeric variables.
- Pair Plot : Displays pairwise relationships across multiple numeric variables.
- Bar Plot : Summarizes categorical variables.
- Heatmap : Visualizes correlation or missing value patterns.
- Line Plot : Tracks trends over time or ordered sequences.
These plots provide intuitive insights and help guide preprocessing and feature selection decisions.
Example: Visualizing Sales and Customer Behavior for EDA
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# Simulated dataset
np.random.seed(0)
df = pd.DataFrame({
'region': np.random.choice(['North', 'South', 'East', 'West'], size=100),
'sales': np.random.normal(loc=200, scale=50, size=100),
'discount': np.random.uniform(0, 0.3, size=100),
'returns': np.random.poisson(lam=2, size=100),
'day': pd.date_range(start='2023-01-01', periods=100, freq='D')
})
# 1. Histogram of sales
plt.figure(figsize=(6, 4))
sns.histplot(df['sales'], bins=20, kde=True)
plt.title("Histogram: Sales Distribution")
plt.xlabel("Sales")
plt.ylabel("Frequency")
plt.show()
# 2. Boxplot of sales by region
plt.figure(figsize=(6, 4))
sns.boxplot(x='region', y='sales', data=df)
plt.title("Boxplot: Sales by Region")
plt.show()
# 3. Scatter plot of sales vs. discount
plt.figure(figsize=(6, 4))
sns.scatterplot(x='discount', y='sales', data=df)
plt.title("Scatter Plot: Sales vs. Discount")
plt.xlabel("Discount")
plt.ylabel("Sales")
plt.show()
# 4. Pair plot for numeric variables
sns.pairplot(df[['sales', 'discount', 'returns']])
plt.suptitle("Pair Plot: Numeric Features", y=1.02)
plt.show()
# 5. Bar plot: Region counts
plt.figure(figsize=(6, 4))
sns.countplot(x='region', data=df)
plt.title("Bar Plot: Region Frequency")
plt.ylabel("Count")
plt.show()
# 6. Correlation heatmap
plt.figure(figsize=(5, 4))
sns.heatmap(df[['sales', 'discount', 'returns']].corr(), annot=True, cmap='coolwarm')
plt.title("Heatmap: Feature Correlations")
plt.show()
# 7. Line plot: Sales over time
plt.figure(figsize=(8, 4))
sns.lineplot(x='day', y='sales', data=df)
plt.title("Line Plot: Sales Over Time")
plt.xlabel("Date")
plt.ylabel("Sales")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
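The technique list above also mentions heatmaps of missing-value patterns, which the simulated data (having no gaps) cannot demonstrate. Below is a minimal sketch on a small hypothetical frame with scattered missing values.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Hypothetical frame with scattered missing values
df_miss = pd.DataFrame({
    'sales': [200, np.nan, 180, 220, np.nan],
    'discount': [0.10, 0.20, np.nan, 0.15, 0.05],
    'returns': [1, 0, 2, np.nan, 1]
})
plt.figure(figsize=(5, 3))
sns.heatmap(df_miss.isna(), cbar=False)  # cells equal to True mark missing entries
plt.title("Heatmap: Missing Value Pattern")
plt.show()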
Grouping and Aggregating Data for Deeper Insights
Grouping and aggregating are fundamental techniques for summarizing and analyzing large datasets, enabling you to move from raw records to structured insights that support decision-making. Grouping involves dividing the data into subsets based on the values of one or more categorical variables (e.g., region, product type, customer segment), while aggregating applies statistical functions to these groups to extract patterns and summaries (e.g., average sales per region, total returns per product).
This process is especially useful for:
- Identifying trends across categories or time
- Comparing performance across groups
- Generating descriptive statistics for reporting or dashboards
- Creating features for machine learning models (e.g., group-level averages), as sketched after the example below
Key Operations
- Group by one or more columns: df.groupby('col')
- Aggregate functions: mean(), sum(), count(), median(), min(), max(), or .agg()
- Multi-level grouping: Group by more than one column
- Custom aggregations: Apply multiple functions to different columns
- Reset index: Convert the groupby result back into a regular DataFrame
These techniques are essential for cohort analysis, performance tracking, and understanding subgroup behaviors.
Example: Analyzing Sales by Region and Return Behavior
import pandas as pd
import numpy as np
# Simulated dataset
np.random.seed(0)
df = pd.DataFrame({
'region': np.random.choice(['North', 'South', 'East', 'West'], size=100),
'product': np.random.choice(['A', 'B', 'C'], size=100),
'sales': np.random.normal(200, 50, size=100),
'discount': np.random.uniform(0, 0.3, size=100),
'returns': np.random.poisson(1, size=100)
})
# 1. Grouping by one column: average sales per region
avg_sales = df.groupby('region')['sales'].mean().reset_index()
print("Average Sales by Region:\n", avg_sales)
# 2. Grouping by two columns: total sales and returns by region and product
region_product_agg = df.groupby(['region', 'product'])[['sales', 'returns']].sum().reset_index()
print("\nTotal Sales and Returns by Region and Product:\n", region_product_agg)
# 3. Multiple aggregations per group
multi_metrics = df.groupby('region').agg(
avg_sales=('sales', 'mean'),
total_sales=('sales', 'sum'),
sales_std=('sales', 'std'),
avg_discount=('discount', 'mean'),
transaction_count=('sales', 'count'),
total_returns=('returns', 'sum'),
max_discount=('discount', 'max'),
min_discount=('discount', 'min'),
median_sales=('sales', 'median')
).reset_index()
print("\nMultiple Aggregation Metrics by Region:\n", multi_metrics)
# 4. Using .agg() with a dictionary for column-specific functions
custom_agg = df.groupby('product').agg({
'sales': ['mean', 'max', 'min', 'std'],
'discount': ['mean', 'median'],
'returns': 'sum'
}).reset_index()
print("\nCustom Aggregation by Product:\n", custom_agg)
Foundations of Feature Engineering and Transformation
Feature engineering is the process of creating, modifying, or selecting variables (features) to improve model performance. It is a critical step in the data preprocessing pipeline that can have a larger impact on predictive performance than the choice of model itself.
Feature transformation , a subset of feature engineering, refers to changing the scale, form, or distribution of features to meet model assumptions or expose important relationships in the data.
Key Concepts
- Creation : Deriving new features from raw data (e.g., extracting day from a timestamp, calculating ratios).
- Selection : Choosing only the most relevant variables based on statistical tests, correlation, or model-based importance.
- Transformation : Applying functions (e.g., log, binning, scaling) to improve the quality or usefulness of a variable.
- Encoding : Converting categorical variables into numeric formats.
- Interaction Terms : Combining features (e.g., product of two columns) to model non-linear interactions.
Effective feature engineering often requires domain knowledge , creativity, and an understanding of the model's requirements (e.g., linear models assume additive relationships and normally distributed residuals).
Example: Feature Engineering from a Customer Transactions Dataset
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
# Simulated dataset
df = pd.DataFrame({
'customer_id': [101, 102, 103, 104],
'signup_date': pd.to_datetime(['2023-01-01', '2023-01-10', '2023-01-15', '2023-01-20']),
'last_purchase': pd.to_datetime(['2023-02-01', '2023-02-15', '2023-03-01', '2023-02-25']),
'total_spent': [200, 150, 300, 250],
'num_purchases': [5, 3, 7, 6]
})
# --- Feature Creation ---
# Days between signup and last purchase
df['days_active'] = (df['last_purchase'] - df['signup_date']).dt.days
# Average spend per purchase
df['avg_purchase_value'] = df['total_spent'] / df['num_purchases']
# --- Feature Transformation ---
# Log-transform total_spent to reduce skew
df['log_total_spent'] = np.log(df['total_spent'])
# Scale average purchase value
scaler = MinMaxScaler()
df['avg_purchase_scaled'] = scaler.fit_transform(df[['avg_purchase_value']])
# --- Feature Interaction ---
# Spending intensity = total_spent * days_active
df['spending_intensity'] = df['total_spent'] * df['days_active']
# Final engineered features
print(df[['customer_id', 'days_active', 'avg_purchase_value', 'log_total_spent', 'avg_purchase_scaled', 'spending_intensity']])
Merging
Merging is the process of combining data from two or more tables (DataFrames) based on shared keys or indices. It allows analysts to integrate different aspects of data—such as joining customer info with transactions or linking metadata to experimental results.
Types of Joins
- Inner Join : Keeps only matching rows in both tables.
- Left Join : Keeps all rows from the left table and adds matching rows from the right.
- Right Join : Keeps all rows from the right table.
- Outer Join : Includes all rows from both tables and fills missing matches with NaNs.
Merging can be performed using pd.merge(), by aligning indices with .join(), or with pd.concat() for vertical or horizontal stacking (both alternatives are sketched after the example below).
Example: Combining Customer Info with Transaction History
import pandas as pd
# Customer info
customers = pd.DataFrame({
'customer_id': [1, 2, 3, 4],
'name': ['Alice', 'Bob', 'Carol', 'Dan'],
'region': ['North', 'South', 'East', 'West']
})
# Transactions
transactions = pd.DataFrame({
'customer_id': [1, 2, 2, 4, 5],
'purchase_amount': [100, 150, 200, 130, 90],
'date': pd.to_datetime(['2023-01-01', '2023-01-03', '2023-01-08', '2023-01-10', '2023-01-15'])
})
# 1. Inner join: only customers with transactions
inner = pd.merge(customers, transactions, on='customer_id', how='inner')
print("Inner Join:\n", inner)
# 2. Left join: keep all customers
left = pd.merge(customers, transactions, on='customer_id', how='left')
print("\nLeft Join:\n", left)
# 3. Outer join: keep all customers and all transactions
outer = pd.merge(customers, transactions, on='customer_id', how='outer')
print("\nOuter Join:\n", outer)
Correlations
Correlation analysis quantifies the strength and direction of linear relationships between numeric variables. It is a fundamental step in exploratory data analysis (EDA), helping identify redundant features, detect multicollinearity, and uncover patterns worth further investigation.
Key Concepts
- Pearson correlation coefficient (r):
  - Measures linear correlation between two variables.
  - Ranges from -1 (perfect negative) to +1 (perfect positive); 0 indicates no linear relationship.
  - Assumes normality and linearity.
- Spearman rank correlation:
  - Non-parametric; measures monotonic relationships.
  - Useful for ordinal data or non-linear relationships.
- Correlation matrix:
  - A table showing pairwise correlation coefficients for multiple variables.
Use Cases
- Remove highly correlated variables in feature selection (e.g., multicollinearity in regression).
- Detect potential dependencies between outcome and predictor variables.
- Explore relationships among features prior to modeling.
Example: Analyzing Relationships Between Financial Metrics
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# Simulated dataset
np.random.seed(0)
df = pd.DataFrame({
'revenue': np.random.normal(1000, 200, 100),
'expenses': np.random.normal(700, 150, 100),
'profit': np.random.normal(300, 80, 100),
'growth_rate': np.random.uniform(0.01, 0.10, 100)
})
# Calculate correlation matrix (Pearson)
corr_matrix = df.corr()
# Display correlation matrix
print("Correlation Matrix (Pearson):\n", corr_matrix)
# Heatmap visualization
plt.figure(figsize=(6, 4))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation Heatmap (Pearson)")
plt.tight_layout()
plt.show()
# Spearman correlation (non-linear/ordinal)
spearman_corr = df.corr(method='spearman')
print("\nSpearman Correlation Matrix:\n", spearman_corr)
ANOVA
ANOVA (Analysis of Variance) is a statistical method used to compare the means of three or more independent groups to determine if at least one group mean is significantly different from the others. It generalizes the two-sample t-test to multiple groups.
Key Concepts
- Null Hypothesis ($H_0$): All group means are equal.
- Alternative Hypothesis ($H_1$): At least one group mean is different.
- F-statistic: Ratio of between-group variability to within-group variability: $F = \frac{\text{Between-group variability}}{\text{Within-group variability}}$
- If the F-statistic is large and the p-value is small (below a chosen significance level), we reject $H_0$.
Assumptions (a quick check of these is sketched after the example below):
- Observations are independent.
- Groups have approximately equal variance (homoscedasticity).
- Residuals are approximately normally distributed.
Example: Testing If Average Sales Differ Across Product Categories
import pandas as pd
import numpy as np
from scipy.stats import f_oneway
import seaborn as sns
import matplotlib.pyplot as plt
# Simulated dataset
np.random.seed(0)
df = pd.DataFrame({
'product_category': np.repeat(['A', 'B', 'C'], [33, 34, 33]),  # aligns categories with the sales blocks below
'sales': np.concatenate([
np.random.normal(200, 20, 33),
np.random.normal(220, 25, 34),
np.random.normal(210, 15, 33)
])
})
# Visual inspection
plt.figure(figsize=(6, 4))
sns.boxplot(x='product_category', y='sales', data=df)
plt.title("Sales by Product Category")
plt.show()
# Extract groups
grouped = [df[df['product_category'] == cat]['sales'] for cat in df['product_category'].unique()]
# Perform one-way ANOVA
f_stat, p_val = f_oneway(*grouped)
print(f"F-statistic: {f_stat:.3f}")
print(f"p-value: {p_val:.4f}")
if p_val < 0.05:
print("Result: Statistically significant difference between group means.")
else:
print("Result: No significant difference between group means.")
Summary: Data Retrieval and Cleaning Workflow
Data retrieval and cleaning are foundational stages in the data science pipeline. They ensure that downstream analyses and models are based on consistent, complete, and interpretable inputs. This process integrates multiple steps that transform raw and messy data into well-structured and high-quality DataFrames.
Key Steps in the Workflow
- Data Retrieval
  - Load structured or semi-structured data from local files (CSV, JSON), databases (SQL), APIs, or cloud storage (e.g., GCS, AWS S3).
  - Combine data from multiple sources using joins or merges.
- Initial Exploration
  - Inspect data shape, types, missing values, and basic statistics using .head(), .info(), .describe(), and .value_counts().
- Data Cleaning
  - Handle missing values via imputation or deletion.
  - Detect and treat outliers using visual or statistical methods.
  - Drop or correct duplicates.
  - Fix inconsistent data types and standardize formatting.
- Feature Engineering
  - Create new features from existing ones (e.g., ratios, date differences).
  - Extract and transform variables (e.g., log, scale, encode).
  - Select informative features based on correlation, variance, or domain logic.
- Encoding and Transformation
  - Convert categorical features using one-hot, label, ordinal, or target encoding.
  - Normalize or standardize numerical features.
  - Apply transformations to address skewness, scale issues, or distribution constraints.
- EDA and Visualization
  - Use histograms, box plots, scatter plots, and heatmaps to assess distributions and relationships.
  - Analyze group-level summaries using .groupby() and .agg().
Together, these steps form a reproducible and systematic approach for converting raw data into a reliable format for modeling and inference.
An End-to-End EDA and Preprocessing Workflow
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, StandardScaler, PowerTransformer
from scipy.stats import skew, kurtosis, f_oneway
# ------------------------
# Step 1: Data Retrieval
# ------------------------
# Simulate loading from CSV/SQL
df = pd.DataFrame({
'customer_id': range(1, 11),
'region': ['North', 'South', 'East', 'West', 'East', 'South', 'North', 'West', 'East', 'South'],
'signup_date': pd.to_datetime(['2023-01-01', '2023-01-05', '2023-01-07', '2023-01-10',
'2023-01-12', '2023-01-15', '2023-01-18', '2023-01-20',
'2023-01-23', '2023-01-25']),
'last_login': pd.to_datetime(['2023-03-01', '2023-03-04', '2023-03-07', '2023-03-01',
'2023-03-03', '2023-03-08', '2023-03-05', '2023-03-07',
'2023-03-06', '2023-03-05']),
'spend': [100, 250, 190, None, 500, 180, 110, 270, 800, None],
'returns': [0, 1, 0, 0, 2, 0, 1, 0, 1, 0],
'membership': ['Basic', 'Premium', 'Basic', 'Premium', 'Basic', 'Basic', 'Premium', 'Basic', 'Premium', 'Basic']
})
# ------------------------
# Step 2: Initial Exploration
# ------------------------
print("Initial Overview:\n", df.head())
print("\nMissing Values:\n", df.isna().sum())
print("\nSkewness:", skew(df['spend'].dropna()))
print("Kurtosis:", kurtosis(df['spend'].dropna(), fisher=False))
# ------------------------
# Step 3: Data Cleaning
# ------------------------
# Fill missing spend with median
df['spend'] = df['spend'].fillna(df['spend'].median())
# Check duplicates
print("\nDuplicate rows:", df.duplicated().sum())
# ------------------------
# Step 4: Feature Engineering
# ------------------------
# Time active (derived feature)
df['days_active'] = (df['last_login'] - df['signup_date']).dt.days
# Average spend per day
df['spend_per_day'] = df['spend'] / df['days_active']
# ------------------------
# Step 5: Encoding
# ------------------------
label_encoder = LabelEncoder()
df['region_encoded'] = label_encoder.fit_transform(df['region'])
# One-hot encoding
df = pd.get_dummies(df, columns=['membership'], drop_first=True)
# ------------------------
# Step 6: Transformation
# ------------------------
scaler = StandardScaler()
df['spend_scaled'] = scaler.fit_transform(df[['spend']])
pt = PowerTransformer(method='yeo-johnson')
df['spend_transformed'] = pt.fit_transform(df[['spend']])
# ------------------------
# Step 7: Grouping & Aggregation
# ------------------------
summary = df.groupby('region').agg(
avg_spend=('spend', 'mean'),
total_returns=('returns', 'sum'),
avg_days_active=('days_active', 'mean')
).reset_index()
print("\nGroup-wise Summary:\n", summary)
# ------------------------
# Step 8: Correlation & ANOVA
# ------------------------
print("\nCorrelation Matrix:\n", df[['spend', 'days_active', 'returns']].corr())
# ANOVA: Is spend significantly different across regions?
groups = [group['spend'].values for _, group in df.groupby('region')]
f_stat, p_val = f_oneway(*groups)
print(f"\nANOVA F-statistic: {f_stat:.2f}, p-value: {p_val:.4f}")
# ------------------------
# Step 9: Visualization
# ------------------------
# Histogram
plt.figure()
sns.histplot(df['spend'], bins=10, kde=True)
plt.title("Spend Distribution")
plt.show()
# Boxplot by region
plt.figure()
sns.boxplot(x='region', y='spend', data=df)
plt.title("Spend by Region")
plt.show()
# Correlation heatmap
plt.figure()
sns.heatmap(df[['spend', 'days_active', 'returns']].corr(), annot=True)
plt.title("Correlation Heatmap")
plt.show()
# Pairplot
sns.pairplot(df[['spend', 'days_active', 'returns']])
plt.suptitle("Pairwise Relationships", y=1.02)
plt.show()
# Line plot of spend over time (simulated)
df['login_order'] = df['last_login'].rank(method='first').astype(int)  # 'first' breaks ties in login dates
df = df.sort_values('login_order')
plt.figure()
sns.lineplot(x='login_order', y='spend', data=df)
plt.title("Spend Trend by Login Order")
plt.xlabel("Login Order")
plt.ylabel("Spend")
plt.show()