Raw data contains errors. It has missing values. It has outliers. Features have different scales. You must prepare data before training models. Data preparation transforms raw data into clean features. Clean features improve model performance. Poor preparation produces poor models.
Data preparation includes collection, cleaning, transformation, and validation. You collect data from sources. You clean errors and inconsistencies. You transform features into usable formats. You validate data quality. Each step affects final model performance.
The workflow starts with raw data. You identify issues. You handle missing values. You remove or transform outliers. You normalize features. You validate results. The output is clean data ready for training.
Data Collection
You collect data from multiple sources. Databases store historical records. APIs provide real-time data. Files contain structured or unstructured data. Sensors capture measurements. Each source has different formats and quality levels.
Assess data quality early. Check completeness. Check accuracy. Check relevance. Identify missing values. Identify duplicates. Identify inconsistencies. Document data sources and collection methods. Track data lineage for reproducibility.
-- Combine customer records from an API feed (api_import is a placeholder function) and an external database
CREATE TABLE raw_customer_data AS
SELECT * FROM (
    SELECT id, name, email, age, income
    FROM api_import('https://api.example.com/customers')
    UNION ALL
    SELECT id, name, email, age, income
    FROM external_db.customers
) AS combined_data;
SELECT COUNT(*) AS total_records FROM raw_customer_data;
-- Result:
-- total_records
-- ---------------
-- 1500
-- (1 row)
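A quick quality check covers completeness, duplicates, types, and key uniqueness. The sketch below uses pandas; the file name and the id column are assumptions.
# Early data quality assessment (file name and id column are assumptions)
import pandas as pd

df = pd.read_csv('raw_customer_data.csv')
print(df.shape)                   # total rows and columns
print(df.isna().mean().round(3))  # fraction missing per column
print(df.duplicated().sum())      # exact duplicate rows
print(df.dtypes)                  # data type per column
print(df['id'].is_unique)         # primary key uniqueness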
Data collection requires planning. Define what data you need. Identify available sources. Assess access requirements. Plan collection schedules. Handle rate limits for APIs. Manage storage for large datasets. Ensure data privacy compliance.
Handling Missing Values
Missing values appear as null, NaN, or empty fields. They occur from collection errors, optional fields, or data corruption. Missing values break many algorithms. You must handle them before training.
Three main approaches exist. Removal deletes rows or columns with missing values. Imputation fills missing values with estimates. Prediction uses models to predict missing values. Choose based on missing data amount and pattern.
The diagram shows different strategies. Complete case analysis removes all rows with any missing value. Mean imputation fills numeric missing values with column means. Mode imputation fills categorical missing values with most common values. Model-based imputation predicts missing values using other features.
# Missing Values Handling
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer
# Small example frame with gaps (values assumed)
df = pd.DataFrame({'age': [25, 30, np.nan, 35, 40, np.nan], 'income': [50000, 75000, np.nan, 80000, np.nan, 90000]})
df_mean = pd.DataFrame(SimpleImputer(strategy='mean').fit_transform(df), columns=df.columns)
df_knn = pd.DataFrame(KNNImputer(n_neighbors=2).fit_transform(df), columns=df.columns)
-- Method 1: Complete case analysis (keep only rows with no missing values)
SELECT * FROM customer_data
WHERE age IS NOT NULL AND income IS NOT NULL AND city IS NOT NULL;
-- Method 2: Mean imputation
UPDATE customer_data
SET age = COALESCE(age, (SELECT AVG(age) FROM customer_data WHERE age IS NOT NULL))
WHERE age IS NULL;
UPDATE customer_data
SET income = COALESCE(income, (SELECT AVG(income) FROM customer_data WHERE income IS NOT NULL))
WHERE income IS NULL;
-- Method 3: Mode imputation for categorical
UPDATE customer_data
SET city = COALESCE(city, (
    SELECT city FROM customer_data
    WHERE city IS NOT NULL
    GROUP BY city
    ORDER BY COUNT(*) DESC
    LIMIT 1
))
WHERE city IS NULL;
SELECT * FROM customer_data;
-- Result:
-- id | age | income | city
-- ----+-----+--------+------
-- 1 | 25 | 50000 | NYC
-- 2 | 30 | 75000 | SF
-- 3 | 32 | 75000 | NYC
-- 4 | 35 | 80000 | NYC
-- 5 | 40 | 75000 | LA
-- 6 | 32 | 90000 | SF
-- (6 rows)
Missing value patterns matter. Missing completely at random means no pattern exists. Missing at random means pattern depends on observed data. Missing not at random means pattern depends on missing values themselves. Understanding patterns guides handling strategy.
Detailed Missing Value Analysis
Analyze missing value patterns before choosing a strategy. Check missing percentages per column. Identify correlations between missing values. Test if missingness depends on other features. Visualize missing patterns using heatmaps.
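A minimal sketch of this analysis in pandas, on a small assumed example frame:
# Missing value pattern analysis (example values assumed)
import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [22, 25, 31, 45, 52, 60],
                   'income': [np.nan, np.nan, 40000, 52000, 61000, 70000],
                   'city': ['NYC', 'SF', None, 'LA', 'NYC', 'SF']})
print(df.isna().mean())                               # share missing per column
print(df.isna().corr())                               # which columns tend to go missing together
print(df.groupby(df['income'].isna())['age'].mean())  # does missing income depend on observed age?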
Missing completely at random occurs when probability of missing is independent of observed and unobserved data. Example: random data corruption. You can safely use deletion or simple imputation. Missing at random occurs when probability of missing depends only on observed data. Example: income missing more often for young people. You can use model-based imputation. Missing not at random occurs when probability of missing depends on missing values themselves. Example: high-income people less likely to report income. This requires specialized handling.
Advanced imputation uses machine learning to predict missing values. Iterative imputation uses multiple models. Each feature with missing values becomes a target. Other features become inputs. Models predict missing values iteratively.
Multiple imputation creates several complete datasets. Each dataset has different imputed values. You train models on each dataset. You combine results to account for imputation uncertainty. This provides better uncertainty estimates.
# Advanced Imputation Techniques
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor
# Iterative imputation: each column with gaps is predicted from the others (example values assumed)
X = np.array([[25.0, 50000], [30, np.nan], [np.nan, 75000], [40, 90000]])
X_imputed = IterativeImputer(estimator=RandomForestRegressor(n_estimators=50, random_state=0)).fit_transform(X)
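Multiple imputation can be approximated with IterativeImputer by sampling from the posterior and varying the random seed; this is a sketch, not a full multiple-imputation procedure.
# Multiple imputation sketch: several plausible completed datasets, not just one
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[25.0, 50000], [30, np.nan], [np.nan, 75000], [40, 90000]])
imputed_sets = [
    IterativeImputer(sample_posterior=True, random_state=seed).fit_transform(X)
    for seed in range(5)
]
# Train a model on each completed dataset and pool the results downstream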
Choose strategy based on missing percentage. If less than 5% missing, deletion is acceptable. If 5-20% missing, use imputation. If more than 20% missing, consider if feature is necessary. Very high missing rates may indicate data quality issues.
For numeric features, use mean or median for symmetric distributions. Use median for skewed distributions. Use KNN or iterative imputation for complex patterns. For categorical features, use mode for low cardinality. Use separate category for high cardinality. Consider if missingness itself is informative.
Always validate imputation quality. Compare distributions before and after. Check if imputed values are reasonable. Test model performance with imputed data. Document imputation methods for reproducibility.
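One simple check, sketched with assumed values: compare summary statistics before and after filling.
# Compare the distribution of a column before and after imputation (values assumed)
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

raw = pd.DataFrame({'income': [50000, np.nan, 75000, 80000, np.nan, 90000]})
filled = pd.DataFrame(SimpleImputer(strategy='median').fit_transform(raw), columns=raw.columns)
print(raw['income'].describe())     # distribution with gaps
print(filled['income'].describe())  # distribution after median imputation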
Outlier Detection and Treatment
Outliers are values far from the majority. They occur from measurement errors, data entry mistakes, or rare events. Outliers distort statistics and model training. You must identify and handle them appropriately.
Detection methods include statistical tests, distance measures, and visualization. Z-scores flag values beyond thresholds. Interquartile range identifies values outside quartile bounds. Isolation forests detect anomalies automatically. Visualization shows outliers in scatter plots.
Detailed Outlier Detection Methods
Z-score method calculates standardized scores. Z = (x - μ) / σ. Values with |Z| > 3 are outliers. This assumes normal distribution. It works well for symmetric data. It fails for skewed distributions.
Modified Z-score uses median and median absolute deviation. It is more robust to outliers. MAD = median(|x - median(x)|). Modified Z = 0.6745 × (x - median) / MAD. Values with |modified Z| > 3.5 are outliers.
Interquartile range method uses quartiles. Q1 is 25th percentile. Q3 is 75th percentile. IQR = Q3 - Q1. Lower bound = Q1 - 1.5×IQR. Upper bound = Q3 + 1.5×IQR. Values outside bounds are outliers. This method is distribution-free. It works for any distribution shape.
Isolation forest uses tree-based anomaly detection. It isolates outliers using random splits. Outliers require fewer splits to isolate. It works for high-dimensional data. It handles multiple outliers well.
# Detailed Outlier Detection
from sklearn.ensemble import IsolationForest
from scipy import stats
import numpy as np

data = np.array([10, 12, 11, 13, 15, 14, 100, 16, 12, 11, 200, 13, 14, 15, 12])
# Z-score rule: flag values more than 3 standard deviations from the mean
z_outliers = data[np.abs(stats.zscore(data)) > 3]
# Isolation forest: fit_predict returns -1 for anomalies and 1 for normal points
forest_labels = IsolationForest(contamination=0.15, random_state=0).fit_predict(data.reshape(-1, 1))
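The modified Z-score and IQR rules described above, applied to the same sample as a sketch:
# Modified Z-score and IQR bounds on the same sample
import numpy as np

data = np.array([10, 12, 11, 13, 15, 14, 100, 16, 12, 11, 200, 13, 14, 15, 12])
med = np.median(data)
mad = np.median(np.abs(data - med))
modified_z = 0.6745 * (data - med) / mad
print(data[np.abs(modified_z) > 3.5])  # flagged by the modified Z-score rule

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
print(data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)])  # flagged by the IQR rule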
Multivariate outliers are unusual combinations of features. They may not be outliers in individual dimensions. Methods include Mahalanobis distance, local outlier factor, and DBSCAN clustering.
Mahalanobis distance measures distance from distribution center. It accounts for feature correlations. D = √((x - μ)ᵀ Σ⁻¹ (x - μ)). Large distances indicate outliers. Threshold typically uses chi-square distribution.
Local outlier factor compares local density. It identifies points with lower density than neighbors. LOF > 1 indicates outlier. Higher values mean more anomalous. It works well for clusters with varying densities.
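A sketch of local outlier factor on a small assumed two-feature sample; fit_predict labels outliers with -1.
# Local outlier factor on a small two-feature sample (values assumed)
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

X = np.array([[25, 50000], [30, 52000], [28, 51000], [27, 49000],
              [26, 50500], [29, 53000], [80, 20000]])  # last row sits far from the rest
lof = LocalOutlierFactor(n_neighbors=3)
labels = lof.fit_predict(X)             # -1 marks outliers, 1 marks inliers
print(labels)
print(-lof.negative_outlier_factor_)    # LOF scores; values well above 1 are anomalous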
Treatment depends on outlier cause and impact. Legitimate outliers represent rare events. They should be kept but handled carefully. Erroneous outliers should be removed or corrected.
Removal deletes outlier records. Use when outliers are errors. Use when outliers are few. Use when removal doesn't affect sample size significantly. Capping limits extreme values. Set values beyond thresholds to threshold values. Use when outliers are legitimate but extreme. Use when you want to preserve sample size.
Transformation reduces outlier impact. Log transformation compresses large values. Square root transformation moderates extremes. Box-Cox transformation normalizes distributions. Use when outliers are legitimate. Use when you want to preserve all data.
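A sketch of capping and the transformations above, using numpy and scipy on the earlier sample:
# Capping and variance-reducing transforms for a skewed sample
import numpy as np
from scipy import stats

values = np.array([10, 12, 11, 13, 15, 14, 100, 16, 12, 11, 200, 13, 14, 15, 12])
capped = np.clip(values, np.percentile(values, 1), np.percentile(values, 99))  # cap at the 1st/99th percentile
logged = np.log1p(values)                  # log transform compresses large values
boxcox_values, lam = stats.boxcox(values)  # Box-Cox requires strictly positive data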
Separate modeling treats outliers differently. Build models for normal and outlier cases. Use when outliers represent different populations. Use when outliers have different patterns.
The diagram shows outlier detection methods. Z-score method marks values beyond 3 standard deviations. IQR method marks values below Q1-1.5×IQR or above Q3+1.5×IQR. Isolation forest separates outliers using tree structures. Each method has different sensitivity and assumptions.
-- IQR filter: keep only rows within the computed bounds (table and alias names assumed)
SELECT m.*
FROM measurements m
CROSS JOIN iqr_bounds b
WHERE m.value >= b.lower_bound AND m.value <= b.upper_bound;
Outlier treatment depends on context. Removal works when outliers are errors. Capping limits extreme values to bounds. Transformation reduces outlier impact. Separate modeling handles legitimate rare cases. Domain knowledge guides appropriate treatment.
Normalization and Standardization
Features have different scales. Age ranges from 0 to 100. Income ranges from 0 to 1,000,000. Distance algorithms treat larger numbers as more important. Normalization and standardization make features comparable.
Normalization scales values to 0-1 range. Formula is (x - min) / (max - min). Standardization centers values around zero with unit variance. Formula is (x - mean) / std. Choose based on algorithm requirements.
The diagram shows scaling transformations. Original data has different ranges. Normalization maps all values to 0-1. Standardization centers at zero with unit spread. Both methods preserve relationships while making features comparable.
# Normalization and Standardization
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler
# Sample data with different scales
data = pd.DataFrame({
    'age': [25, 30, 35, 40, 45],
    'income': [50000, 75000, 100000, 125000, 150000],
    'score': [0.5, 0.7, 0.8, 0.9, 1.0]
})
# Method 1: Min-Max Normalization (0-1 range)
scaler_minmax = MinMaxScaler()
data_normalized = pd.DataFrame(
scaler_minmax.fit_transform(data),
columns=data.columns
)
print("Normalized data:")
print(data_normalized)
# Method 2: Standardization (zero mean, unit variance)
scaler_standard = StandardScaler()
data_standardized = pd.DataFrame(
    scaler_standard.fit_transform(data),
    columns=data.columns
)
print("Standardized data:")
print(data_standardized)
Scaling requirements vary by algorithm. Distance-based algorithms such as k-nearest neighbors need scaling. Neural networks train faster and more stably on scaled inputs. Tree-based algorithms are scale-invariant. Linear models with regularization benefit from standardization. Check algorithm documentation for requirements.
Feature Engineering
Feature engineering creates new features from existing data. It transforms raw inputs into useful representations. Good features improve model performance more than algorithm selection. Domain knowledge guides feature creation.
Common techniques include interaction features, polynomial features, time-based features, and text features. Interaction features combine multiple inputs. Polynomial features capture non-linear relationships. Time features extract temporal patterns. Text features convert words to numbers.
The diagram shows feature engineering transformations. Raw features include price and area. Interaction feature multiplies price and area. Polynomial feature squares area. Time feature extracts month from date. Each transformation captures different patterns.
# Feature Engineering
import pandas as pd
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
# Degree-2 polynomial and interaction terms from two numeric columns (values assumed)
X = pd.DataFrame({'price': [300000, 450000, 250000], 'area': [1200, 2000, 900]})
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
# Output columns: price, area, price^2, price*area, area^2
-- Time-based features derived from the sale date
SELECT CASE WHEN EXTRACT(DOW FROM sale_date) IN (0, 6) THEN 1 ELSE 0 END AS is_weekend,
       EXTRACT(MONTH FROM sale_date) AS sale_month
FROM property_data;
Feature engineering requires iteration. Start with domain knowledge. Create candidate features. Test feature importance. Remove redundant features. Monitor model performance. Automated feature engineering tools exist but manual engineering often performs better.
Feature Selection
Feature selection reduces dimensionality. It removes irrelevant or redundant features. Fewer features mean faster training and less overfitting. Selection methods include filter, wrapper, and embedded approaches.
Filter methods use statistical tests. They rank features independently. Wrapper methods use model performance. They search feature subsets. Embedded methods use model internals. They select during training.
The diagram shows selection methods. Filter method scores each feature independently. Wrapper method tests feature subsets with models. Embedded method uses model weights or importance. Each method has different computational cost and effectiveness.
# Feature Selection
import pandas as pd
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
# Synthetic data stands in for a real feature matrix (assumption for illustration)
X, y = make_classification(n_samples=200, n_features=10, n_informative=4, random_state=0)
X_filter = SelectKBest(score_func=f_classif, k=5).fit_transform(X, y)                           # filter method
X_wrapper = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit_transform(X, y)  # wrapper method
importances = RandomForestClassifier(random_state=0).fit(X, y).feature_importances_             # embedded method
Feature selection balances performance and complexity. More features can improve accuracy but increase overfitting risk. Fewer features reduce complexity but may miss important patterns. Use cross-validation to evaluate selection strategies.
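One way to evaluate selection strategies, sketched on synthetic data: wrap each selector in a pipeline and compare cross-validation scores.
# Compare feature-selection settings with cross-validation (synthetic data assumed)
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=0)
for k in (5, 10, 20):
    pipe = make_pipeline(SelectKBest(score_func=f_classif, k=k), LogisticRegression(max_iter=1000))
    scores = cross_val_score(pipe, X, y, cv=5)
    print(k, scores.mean())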
Data Validation
Data validation checks data quality after preparation. It verifies completeness, correctness, and consistency. Validation catches errors before training. It ensures data meets model requirements.
Validation checks include range validation, type validation, constraint validation, and relationship validation. Range validation ensures values fall within expected bounds. Type validation ensures correct data types. Constraint validation checks business rules. Relationship validation verifies referential integrity.
Validation should be automated. Create validation rules early. Run validation after each preparation step. Document validation failures. Fix errors systematically. Re-validate after fixes. Maintain validation logs for auditing.
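A minimal sketch of automated validation rules in Python; the column names and thresholds are assumptions.
# Automated validation rules for prepared customer data (column names and thresholds assumed)
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3], 'age': [25, 32, 40], 'income': [50000.0, 75000.0, 90000.0]})
checks = {
    'no_missing_values': df.notna().all().all(),
    'age_in_range': df['age'].between(0, 120).all(),
    'income_non_negative': (df['income'] >= 0).all(),
    'income_is_numeric': pd.api.types.is_numeric_dtype(df['income']),
    'id_is_unique': df['id'].is_unique,
}
failed = [name for name, passed in checks.items() if not passed]
assert not failed, f"Validation failed: {failed}"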
Complete Example: Customer Data Preparation
This example demonstrates a complete data preparation workflow for customer data.
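The sketch below walks through the steps of this chapter on a small assumed customer extract; the columns, values, and thresholds are illustrative, not the original dataset.
# Complete example: customer data preparation (columns, values, and thresholds assumed)
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# 1. Collect: a small raw extract with gaps and one extreme income
raw = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5, 6],
    'age': [25, 30, np.nan, 35, 40, 32],
    'income': [50000, 75000, 75000, 80000, np.nan, 900000],
    'city': ['NYC', 'SF', 'NYC', None, 'LA', 'SF'],
})

# 2. Clean: median imputation for numeric gaps, mode imputation for the categorical gap
raw['age'] = raw['age'].fillna(raw['age'].median())
raw['income'] = raw['income'].fillna(raw['income'].median())
raw['city'] = raw['city'].fillna(raw['city'].mode()[0])

# 3. Outliers: cap income at the IQR upper bound
q1, q3 = raw['income'].quantile([0.25, 0.75])
raw['income'] = raw['income'].clip(upper=q3 + 1.5 * (q3 - q1))

# 4. Feature engineering: a simple ratio feature
raw['income_per_year_of_age'] = raw['income'] / raw['age']

# 5. Scale numeric features, one-hot encode the categorical feature
numeric_cols = ['age', 'income', 'income_per_year_of_age']
raw[numeric_cols] = StandardScaler().fit_transform(raw[numeric_cols])
prepared = pd.get_dummies(raw, columns=['city'])

# 6. Validate: no missing values and a unique key before handing off to training
assert prepared.notna().all().all()
assert prepared['customer_id'].is_unique
print(prepared)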
Data preparation transforms raw data into clean features. You collect data from multiple sources. You handle missing values through removal, imputation, or prediction. You detect and treat outliers using statistical methods. You normalize or standardize features to comparable scales. You engineer new features from existing data. You select relevant features to reduce dimensionality. You validate data quality throughout the process. Each step improves model performance. Proper preparation requires domain knowledge and iterative refinement.