1. Explain Data Analytics and Provide Examples in Detail
Definition of Data Analytics
Data analytics refers to the process of examining, cleaning, transforming, and modeling raw data to discover useful insights, draw conclusions, and support decision-making. It involves applying statistical and computational techniques to extract meaningful patterns from structured and unstructured data.
Key Components of Data Analytics
- Descriptive Analytics: Summarizes past data to understand what has happened (e.g., sales reports, dashboards).
- Diagnostic Analytics: Identifies why something happened (e.g., root cause analysis of a drop in customer satisfaction).
- Predictive Analytics: Forecasts future outcomes based on historical trends (e.g., predicting stock prices).
- Prescriptive Analytics: Recommends actions to achieve desired outcomes (e.g., optimizing supply chain logistics).
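These four layers build on each other. As a minimal illustration of the descriptive layer, the sketch below summarizes a hypothetical list of monthly sales figures using Python's standard `statistics` module (the numbers are made up):

```python
import statistics

# Hypothetical monthly sales figures (units sold) -- illustrative data only.
sales = [120, 135, 128, 150, 142, 160]

# Descriptive analytics: summarize what has happened.
mean_sales = statistics.mean(sales)      # average monthly sales
median_sales = statistics.median(sales)  # middle value of the sorted data
best_month = max(sales)                  # peak month

print(mean_sales, median_sales, best_month)
```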
Examples of Data Analytics
- Healthcare: Predicting patient readmission rates using predictive analytics.
- Retail: Segmenting customers based on purchasing behavior for targeted marketing campaigns.
- Finance: Detecting fraudulent transactions using anomaly detection algorithms.
- Manufacturing: Optimizing production schedules using prescriptive analytics.
2. Difference Between Present Technologies Used in Data Analytics OR Usage of Data Analytics in the Market
Difference Between Technologies
| Technology | Description | Use Case |
|---|---|---|
| Excel | Basic tool for small-scale data manipulation and visualization. | Financial reporting, simple trend analysis. |
| SQL | Query language for managing relational databases. | Extracting data for business intelligence. |
| Python/R | Programming languages with libraries for advanced analytics and machine learning. | Building predictive models, statistical analysis. |
| Tableau/Power BI | Visualization tools for creating interactive dashboards. | Business performance monitoring. |
| Hadoop/Spark | Big data frameworks for processing large datasets. | Analyzing social media data, IoT sensor data. |
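The SQL row of the table can be made concrete with Python's built-in `sqlite3` module; the `sales` table and its rows below are made up for illustration:

```python
import sqlite3

# Build a small in-memory database with a hypothetical sales table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("North", 100.0), ("South", 250.0), ("North", 175.0)])

# A typical business-intelligence extraction: total sales per region.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('North', 275.0), ('South', 250.0)]
conn.close()
```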
Usage of Data Analytics in the Market
- Customer Insights: Companies use analytics to understand customer preferences and improve products/services.
- Risk Management: Banks use predictive analytics to assess credit risk and detect fraud.
- Supply Chain Optimization: Retailers optimize inventory levels and delivery routes using prescriptive analytics.
- Market Basket Analysis: Identifying product associations to design effective promotions (e.g., “buy one, get one free”).
- Personalized Marketing: Recommender systems in e-commerce (e.g., Amazon’s “Customers who bought this also bought”).
3. Write in Detail About Regression and Segmentation
Regression
Regression is a statistical method used to model the relationship between a dependent variable (target) and one or more independent variables (predictors). It is widely used for prediction and forecasting.
Types of Regression
- Linear Regression:
- Formula: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \varepsilon$
- $y$: Dependent variable
- $x_1, \dots, x_n$: Independent variables
- $\beta_0, \dots, \beta_n$: Coefficients
- $\varepsilon$: Error term
- Example: Predicting house prices ($y$) based on area ($x_1$) and number of bedrooms ($x_2$).
- Logistic Regression:
- Used for classification problems where the target variable is binary (e.g., yes/no, 0/1).
- Formula: $P(y = 1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_n x_n)}}$
- Example: Predicting whether a customer will churn ($y = 1$) based on their usage patterns ($x$).
Key Concepts in Regression
- Least Square Estimation: Minimizes the sum of squared errors to find the best-fit line.
- Assumptions:
- Linearity: Relationship between variables is linear.
- Independence: Observations are independent.
- Homoscedasticity: Constant variance of residuals.
- Normality: Residuals follow a normal distribution.
Applications of Regression
- Sales forecasting
- Risk assessment
- Demand prediction
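To make the logistic-regression idea concrete, the sketch below evaluates the sigmoid on a linear score to get a churn probability; the coefficients `b0` and `b1` are made-up values for illustration, not fitted ones:

```python
import math

def sigmoid(z):
    # Logistic function: maps any real number into (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients for a churn model:
# z = b0 + b1 * monthly_usage_hours
b0, b1 = 2.0, -0.5

usage_hours = 10.0
p_churn = sigmoid(b0 + b1 * usage_hours)  # probability of churn
print(round(p_churn, 3))  # 0.047
```

With these assumed coefficients, heavier usage drives the churn probability toward zero, which is the kind of relationship a fitted model would quantify.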
Segmentation
Segmentation divides a dataset into distinct groups (clusters) based on shared characteristics. It is widely used in marketing and customer analysis.
Types of Segmentation
- K-Means Clustering:
- Groups data points into $k$ clusters by minimizing the distance between points within each cluster.
- Formula: Minimize $J = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2$
- $C_i$: Cluster $i$
- $\mu_i$: Centroid of cluster $i$
- Hierarchical Clustering:
- Builds a tree-like structure (dendrogram) to group data points.
- Methods: Agglomerative (bottom-up) or Divisive (top-down).
Applications of Segmentation
- Customer segmentation for targeted marketing.
- Image segmentation in computer vision.
- Fraud detection by grouping unusual transactions.
Important Formulas
- Linear Regression: $y = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n + \varepsilon$
- Logistic Regression: $P(y = 1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_n x_n)}}$
- K-Means Objective Function: $J = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2$
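The K-means objective can be evaluated directly for a given assignment of points to clusters. A minimal 1-D sketch with hypothetical clusters:

```python
# Within-cluster sum of squares J for a hypothetical 1-D clustering.
def wcss(clusters):
    # J = sum over clusters of squared distances to the cluster centroid.
    total = 0.0
    for points in clusters:
        centroid = sum(points) / len(points)
        total += sum((p - centroid) ** 2 for p in points)
    return total

clusters = [[1.0, 2.0, 3.0], [10.0, 11.0]]
print(wcss(clusters))  # 2.5
```

K-means searches over assignments and centroids to make this quantity as small as possible.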
4. Write Clearly About Vast Data Sources Like Sensors, GPS, and Others
Introduction to Data Sources
Data sources are the origins of raw data used for analysis. In today’s world, vast amounts of data are generated from diverse sources, including IoT devices, sensors, GPS systems, social media, and more. These sources provide structured, semi-structured, and unstructured data that can be analyzed to derive actionable insights.
Types of Data Sources
- Sensors:
- Definition: Devices that detect or measure physical properties (e.g., temperature, pressure, motion) and convert them into digital signals.
- Examples:
- Temperature sensors in weather stations.
- Motion sensors in security systems.
- Wearable health trackers (e.g., Fitbit).
- Applications:
- Monitoring industrial equipment for predictive maintenance.
- Environmental monitoring (air quality, water levels).
- GPS (Global Positioning System):
- Definition: A satellite-based navigation system that provides location and time information.
- Examples:
- Location tracking in smartphones.
- Fleet management in logistics.
- Applications:
- Route optimization in transportation.
- Geotagging in social media posts.
- Social Media:
- Definition: Platforms where users generate text, images, videos, and other content.
- Examples:
- Tweets on Twitter.
- Posts on Facebook or Instagram.
- Applications:
- Sentiment analysis for brand reputation.
- Trend analysis for marketing campaigns.
- Transactional Data:
- Definition: Data generated from business transactions (e.g., purchases, payments).
- Examples:
- Sales receipts from retail stores.
- Bank transaction records.
- Applications:
- Fraud detection in banking.
- Inventory management in retail.
- IoT (Internet of Things):
- Definition: Network of interconnected devices that collect and exchange data.
- Examples:
- Smart home devices (e.g., thermostats, lights).
- Connected cars.
- Applications:
- Energy consumption monitoring.
- Traffic management.
- Web Scraping:
- Definition: Extracting data from websites using automated tools.
- Examples:
- Extracting product prices from e-commerce websites.
- Gathering news articles for sentiment analysis.
- Applications:
- Competitor price monitoring.
- News aggregation platforms.
Challenges with Data Sources
- Volume: Managing large-scale data (Big Data).
- Velocity: Handling real-time data streams.
- Variety: Dealing with structured, semi-structured, and unstructured data.
- Veracity: Ensuring data quality (noise, missing values, outliers).
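Veracity problems are easy to demonstrate. The sketch below filters a hypothetical stream of sensor readings, flagging dropped (`None`) and implausible values; the valid range is an assumption chosen for illustration:

```python
# Hypothetical temperature readings from a sensor; None marks a dropped reading.
readings = [21.5, 22.0, None, 21.8, 95.0, 22.1]

VALID_RANGE = (-40.0, 60.0)  # assumed plausible range for this sensor

clean, flagged = [], []
for r in readings:
    if r is None or not (VALID_RANGE[0] <= r <= VALID_RANGE[1]):
        flagged.append(r)   # missing or implausible (a veracity problem)
    else:
        clean.append(r)

print(clean)    # [21.5, 22.0, 21.8, 22.1]
print(flagged)  # [None, 95.0]
```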
5. Explain in Detail About Types of Data, Variables, and Various Data Modeling Techniques
Types of Data
- Structured Data:
- Organized in a predefined format (e.g., rows and columns in databases).
- Example: Relational databases (SQL tables).
- Semi-Structured Data:
- Partially organized but lacks a rigid structure (e.g., JSON, XML files).
- Example: Emails, log files.
- Unstructured Data:
- Not organized in a predefined manner (e.g., text, images, videos).
- Example: Social media posts, medical records.
Types of Variables
- Categorical Variables:
- Represent categories or groups.
- Nominal: No inherent order (e.g., gender, city).
- Ordinal: Ordered categories (e.g., education level: high school < bachelor’s < master’s).
- Numerical Variables:
- Represent measurable quantities.
- Discrete: Countable values (e.g., number of students).
- Continuous: Infinite range of values (e.g., height, weight).
Data Modeling Techniques
Data modeling involves creating representations of data to understand relationships and patterns. Below are key techniques:
- Descriptive Modeling:
- Summarizes data characteristics.
- Example: Mean, median, mode, standard deviation.
- Predictive Modeling:
- Forecasts future outcomes based on historical data.
- Example: Linear regression, decision trees.
- Prescriptive Modeling:
- Recommends actions to achieve desired outcomes.
- Example: Optimization algorithms.
- Clustering:
- Groups similar data points together.
- Example: K-Means clustering.
- Dimensionality Reduction:
- Reduces the number of variables while retaining important information.
- Example: Principal Component Analysis (PCA).
- Association Rule Mining:
- Identifies relationships between variables.
- Example: Market basket analysis (“If a customer buys bread, they are likely to buy butter”).
- Time Series Analysis:
- Analyzes data collected over time.
- Example: Stock price forecasting.
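Association rule mining reduces to counting: a rule's support and confidence can be computed directly. A minimal sketch over hypothetical transactions:

```python
# Market basket analysis on hypothetical transactions:
# support(X) = fraction of transactions containing X
# confidence(X -> Y) = support(X and Y) / support(X)
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "eggs"},
]

n = len(transactions)
support_bread = sum("bread" in t for t in transactions) / n          # 0.75
support_both = sum({"bread", "butter"} <= t for t in transactions) / n  # 0.5
confidence = support_both / support_bread  # P(butter | bread) = 2/3

print(support_bread, support_both, confidence)
```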
Important Formulas
- Mean: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
- Standard Deviation: $\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2}$
- Principal Component Analysis (PCA): Eigenvalue decomposition of the covariance matrix.
- K-Means Clustering Objective Function: $J = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2$
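The mean and standard deviation formulas above can be checked numerically against Python's `statistics` module, using a small made-up sample:

```python
import math
import statistics

x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(x)

mean = sum(x) / n                                         # formula for the mean
sigma = math.sqrt(sum((xi - mean) ** 2 for xi in x) / n)  # population std dev

print(mean, sigma)  # 5.0 2.0
```

`statistics.pstdev(x)` returns the same value, confirming the hand-written formula.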
6. Explain in Detail About Different Regression Concepts and the Least Square Estimation Method
What is Regression?
Regression is a statistical technique used to model and analyze the relationship between a dependent variable (target) and one or more independent variables (predictors). It helps in understanding trends, making predictions, and identifying relationships.
Key Regression Concepts
- Linear Regression:
- Models the linear relationship between a dependent variable ($y$) and one or more independent variables ($x_1, \dots, x_n$).
- Formula: $y = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n + \varepsilon$
- $\beta_0$: Intercept (value of $y$ when all $x_i = 0$).
- $\beta_1, \dots, \beta_n$: Coefficients (rate of change of $y$ with respect to each $x_i$).
- $\varepsilon$: Error term (unexplained variation).
- Example: Predicting house prices based on area ($x_1$) and number of bedrooms ($x_2$).
- Logistic Regression:
- Used for classification problems where the dependent variable is binary (e.g., yes/no, 0/1).
- Formula: $P(y = 1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_n x_n)}}$
- $P(y = 1)$: Probability of the event occurring.
- The sigmoid function ensures the output is between 0 and 1.
- Example: Predicting whether a customer will churn ($y = 1$) based on usage patterns ($x$).
- Multiple Regression:
- Extends linear regression to include multiple predictors.
- Example: Predicting employee performance based on experience, education level, and training hours.
- Polynomial Regression:
- Captures non-linear relationships by including polynomial terms of predictors.
- Formula: $y = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_m x^m + \varepsilon$
- Example: Modeling the relationship between temperature and crop yield.
- Ridge and Lasso Regression:
- Regularization techniques to handle multicollinearity and overfitting.
- Ridge adds a penalty term on the sum of squared coefficients: $\min \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j} \beta_j^2$
- Lasso adds an absolute penalty term: $\min \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j} |\beta_j|$
Least Square Estimation (LSE)
The least square estimation method minimizes the sum of squared differences between observed values ($y_i$) and predicted values ($\hat{y}_i$).
- Objective Function: $\min \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
- $y_i$: Actual value.
- $\hat{y}_i$: Predicted value.
- Steps:
- Define the regression equation (e.g., $\hat{y} = \beta_0 + \beta_1 x$).
- Compute residuals ($e_i = y_i - \hat{y}_i$).
- Minimize the sum of squared residuals using calculus or optimization algorithms.
- Assumptions:
- Linearity: Relationship between variables is linear.
- Independence: Observations are independent.
- Homoscedasticity: Constant variance of residuals.
- Normality: Residuals follow a normal distribution.
- Example: Suppose we have data points $(x, y)$: (1, 2), (2, 3), (3, 5). Using LSE, we find the best-fit line $\hat{y} = 0.33 + 1.5x$.
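The fit in this example can be reproduced with the closed-form least-squares formulas for simple linear regression:

```python
# Least-squares fit for the data points above: (1, 2), (2, 3), (3, 5).
xs, ys = [1.0, 2.0, 3.0], [2.0, 3.0, 5.0]
n = len(xs)

x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Closed-form solution for simple linear regression:
# slope = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
        / sum((x - x_bar) ** 2 for x in xs)
intercept = y_bar - slope * x_bar

print(round(slope, 2), round(intercept, 2))  # 1.5 0.33
```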
7. Illustrate Techniques for Handling Missing Values with Examples
Why Handle Missing Values?
Missing values can lead to biased results, reduced accuracy, and incomplete analysis. Proper handling ensures robust models.
Techniques for Handling Missing Values
- Deletion:
- Remove rows or columns with missing values.
- Listwise Deletion: Remove entire rows with missing values.
- Pairwise Deletion: Use available data for specific analyses.
- Limitation: Leads to loss of information.
- Example: If a dataset has 100 rows and 10 rows have missing values, remove those 10 rows.
- Imputation:
- Replace missing values with estimated values.
- Mean/Median/Mode Imputation: Replace with mean (numerical), median (skewed data), or mode (categorical).
- Example: Replace missing “age” values with the mean age of the dataset.
- K-Nearest Neighbors (KNN) Imputation: Use similar data points to estimate missing values.
- Example: Replace missing “income” values based on neighbors’ income.
- Regression Imputation: Predict missing values using regression models.
- Example: Predict missing “sales” values based on other predictors like “advertising spend.”
- Advanced Techniques:
- Expectation-Maximization (EM): Iteratively estimates missing values and updates model parameters.
- Multiple Imputation: Creates multiple datasets with imputed values and combines results.
- Example: Impute missing values 5 times and average the results.
- Indicator Variables:
- Add a binary column indicating whether a value was missing.
- Example: Add a column “is_age_missing” to indicate if “age” was missing.
Choosing the Right Technique
- Small Datasets: Avoid deletion; use imputation.
- Large Datasets: Deletion may be acceptable if missing values are minimal.
- Sensitive Data: Use advanced techniques like EM or multiple imputation.
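Mean imputation, the usual starting point, takes only a few lines. A minimal sketch over a hypothetical "age" column, where `None` marks a missing value:

```python
# Mean imputation on a hypothetical "age" column; None marks a missing value.
ages = [25, 30, None, 35, None, 40]

observed = [a for a in ages if a is not None]
mean_age = sum(observed) / len(observed)  # 32.5

imputed = [mean_age if a is None else a for a in ages]
print(imputed)  # [25, 30, 32.5, 35, 32.5, 40]
```

Note that every imputed cell gets the same value, which shrinks the variance of the column; that is exactly the bias that KNN or regression imputation tries to avoid.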
Important Formulas
- Linear Regression: $y = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n + \varepsilon$
- Logistic Regression: $P(y = 1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_n x_n)}}$
- Least Square Estimation Objective: $\min \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
8. Explain Databases, Their Types, and Data Characteristics
What is a Database?
A database is an organized collection of structured data stored and accessed electronically. It allows efficient storage, retrieval, and management of large volumes of information.
Types of Databases
- Relational Databases (RDBMS):
- Data is stored in tables with rows and columns.
- Relationships between tables are defined using keys (e.g., primary key, foreign key).
- Examples: MySQL, PostgreSQL, Oracle, SQL Server.
- Use Case: Banking systems, inventory management.
- NoSQL Databases:
- Designed for unstructured or semi-structured data.
- Categories:
- Document Stores: Store data as documents (e.g., JSON). Example: MongoDB.
- Key-Value Stores: Store data as key-value pairs. Example: Redis.
- Column-Family Stores: Organize data into columns. Example: Cassandra.
- Graph Databases: Store relationships between data points. Example: Neo4j.
- Use Case: Social media platforms, IoT applications.
- Object-Oriented Databases:
- Store data as objects (similar to object-oriented programming).
- Example: db4o.
- Use Case: Multimedia applications, CAD systems.
- Hierarchical Databases:
- Data is organized in a tree-like structure with parent-child relationships.
- Example: IBM Information Management System (IMS).
- Use Case: Legacy systems like file directories.
- Network Databases:
- Extends hierarchical databases by allowing multiple parent-child relationships.
- Example: Integrated Data Store (IDS).
- Use Case: Complex organizational structures.
Data Characteristics
- Volume:
- The amount of data generated and stored.
- Example: Social media platforms generate terabytes of data daily.
- Velocity:
- The speed at which data is generated and processed.
- Example: Real-time stock trading systems.
- Variety:
- The diversity of data types (structured, semi-structured, unstructured).
- Example: Text, images, videos, sensor data.
- Veracity:
- The quality and reliability of data.
- Example: Handling noisy or incomplete data.
- Value:
- The usefulness of data in deriving insights.
- Example: Customer purchase history used for targeted marketing.
9. Write a Short Note on the BLUE Property Assumption and the Purpose of Clustering Modeling
BLUE Property Assumption
BLUE stands for Best Linear Unbiased Estimator, a property of estimators in linear regression models.
- Key Components:
- Best: Provides the lowest variance among all unbiased estimators.
- Linear: The estimator is a linear function of the dependent variable.
- Unbiased: The expected value of the estimator equals the true parameter value.
- Estimator: A rule or formula used to estimate parameters.
- Assumptions:
- Linearity: The relationship between variables is linear.
- Independence: Observations are independent.
- Homoscedasticity: Constant variance of residuals.
- Normality: Residuals follow a normal distribution.
- Purpose: Ensures that the regression model provides reliable and efficient estimates.
Purpose of Clustering Modeling
Clustering is an unsupervised learning technique used to group similar data points together based on their characteristics.
- Key Objectives:
- Pattern Discovery: Identify hidden patterns or structures in data.
- Segmentation: Divide data into meaningful groups for analysis.
- Dimensionality Reduction: Simplify complex datasets by grouping similar features.
- Applications:
- Customer Segmentation: Group customers based on purchasing behavior.
- Image Segmentation: Identify regions of interest in images.
- Anomaly Detection: Detect outliers by identifying unusual clusters.
- Popular Techniques:
- K-Means Clustering:
- Groups data into $k$ clusters by minimizing the distance between points within each cluster.
- Objective Function: $J = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2$
- Hierarchical Clustering:
- Builds a tree-like structure (dendrogram) to group data points.
- Methods: Agglomerative (bottom-up) or Divisive (top-down).
10. Illustrate Data Modeling Techniques in Detail
What is Data Modeling?
Data modeling involves creating a conceptual representation of data to understand relationships, patterns, and structures. It helps in organizing, analyzing, and interpreting data effectively.
Types of Data Modeling Techniques
- Descriptive Modeling:
- Focuses on summarizing the characteristics of data.
- Example: Mean, median, mode, standard deviation.
- Use Case: Understanding central tendencies and variability.
- Predictive Modeling:
- Forecasts future outcomes based on historical trends.
- Example: Linear regression, decision trees, neural networks.
- Use Case: Stock price prediction, customer churn prediction.
- Prescriptive Modeling:
- Recommends actions to achieve desired outcomes.
- Example: Optimization algorithms, simulation models.
- Use Case: Supply chain optimization, resource allocation.
- Clustering:
- Groups similar data points together based on shared characteristics.
- Example: K-Means clustering, hierarchical clustering.
- Use Case: Customer segmentation, anomaly detection.
- Dimensionality Reduction:
- Reduces the number of variables while retaining important information.
- Example: Principal Component Analysis (PCA), t-SNE.
- Use Case: Visualizing high-dimensional data, feature selection.
- Association Rule Mining:
- Identifies relationships between variables.
- Example: Market basket analysis (“If a customer buys bread, they are likely to buy butter”).
- Use Case: Cross-selling, product recommendations.
- Time Series Analysis:
- Analyzes data collected over time.
- Example: ARIMA models, exponential smoothing.
- Use Case: Sales forecasting, weather prediction.
Steps in Data Modeling
- Define the Problem: Clearly identify the objective (e.g., prediction, segmentation).
- Data Collection: Gather relevant data from various sources.
- Data Cleaning: Handle missing values, outliers, and noise.
- Feature Engineering: Create meaningful features for modeling.
- Model Selection: Choose appropriate techniques based on the problem.
- Model Training: Train the model on historical data.
- Validation: Evaluate model performance using metrics like accuracy, precision, recall, etc.
- Deployment: Implement the model in real-world applications.
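The steps above can be sketched end to end on synthetic data; the numbers are made up and the "model" is a simple least-squares line, so this is an illustration of the workflow rather than a real pipeline:

```python
# Toy walk-through of the modeling steps with synthetic data:
# collect -> clean -> train -> validate, for a rough y = 2x trend.
data = [(1, 2.1), (2, 3.9), (3, None), (4, 8.2), (5, 9.8), (6, 12.1)]

# Data cleaning: drop the record with a missing target.
clean = [(x, y) for x, y in data if y is not None]

# Train/validation split (last record held out).
train, test = clean[:-1], clean[-1:]

# Model training: least-squares line through the training points.
n = len(train)
xb = sum(x for x, _ in train) / n
yb = sum(y for _, y in train) / n
slope = (sum((x - xb) * (y - yb) for x, y in train)
         / sum((x - xb) ** 2 for x, _ in train))
intercept = yb - slope * xb

# Validation: mean absolute error on the held-out point.
mae = sum(abs((intercept + slope * x) - y) for x, y in test) / len(test)
print(round(slope, 2), round(mae, 2))
```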
11. What is K-Means Clustering? Illustrate Its Importance in Regard to Modeling
What is K-Means Clustering?
K-Means clustering is an unsupervised learning algorithm that groups data into clusters by minimizing the distance between data points within each cluster.
- Objective Function: $J = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2$
- $k$: Number of clusters.
- $C_i$: Cluster $i$.
- $\mu_i$: Centroid of cluster $i$.
- $x$: Data point.
- Steps:
- Initialize centroids randomly.
- Assign each data point to the nearest centroid.
- Recalculate centroids as the mean of all points in the cluster.
- Repeat until convergence (no further changes in cluster assignments).
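The four steps can be sketched directly. This minimal 1-D implementation uses made-up points and is not tuned for production use:

```python
import random

def kmeans_1d(points, k, iters=100, seed=0):
    # Step 1: initialize centroids randomly from the data.
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # Step 2: assign each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Step 3: recalculate centroids as cluster means.
        new_centroids = [sum(c) / len(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        # Step 4: stop at convergence (no further centroid movement).
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return sorted(centroids)

points = [1.0, 1.2, 0.8, 9.8, 10.0, 10.2]
result = kmeans_1d(points, k=2)
print(result)  # centroids near 1.0 and 10.0
```

On these well-separated points the algorithm converges to the two obvious group means regardless of the random initialization.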
Importance of K-Means Clustering in Modeling
- Segmentation:
- Divides data into meaningful groups for targeted analysis.
- Example: Grouping customers based on purchasing behavior.
- Anomaly Detection:
- Identifies unusual clusters or outliers.
- Example: Detecting fraudulent transactions.
- Dimensionality Reduction:
- Simplifies complex datasets by grouping similar features.
- Example: Reducing high-dimensional image data.
- Pattern Discovery:
- Reveals hidden structures in unlabeled data.
- Example: Identifying regions of interest in satellite images.
- Scalability:
- Efficiently handles large datasets due to its simplicity and speed.
- Example: Segmenting millions of users in social media platforms.
Advantages and Limitations
- Advantages:
- Simple and easy to implement.
- Works well with large datasets.
- Provides interpretable results.
- Limitations:
- Requires prior knowledge of $k$ (the number of clusters).
- Sensitive to initial centroid placement.
- Struggles with non-spherical clusters.
Example
Suppose you have a dataset of customer purchases with two features: “Total Spend” and “Frequency of Purchases.” Using K-Means clustering with $k = 3$, you can group customers into three segments:
- High-value frequent buyers.
- Moderate spenders.
- Infrequent low spenders.
12. Demonstrate Data Validation Methods in Detail with Examples
What is Data Validation?
Data validation ensures that the data collected and used for analysis is accurate, complete, and consistent. It involves verifying that the data meets predefined rules or constraints.
Key Data Validation Methods
- Field-Level Validation:
- Validates individual fields to ensure they meet specific criteria.
- Example: Checking if an email address contains the ”@” symbol.
- Use Case: Ensuring correct format for user input in forms.
- Range Validation:
- Ensures values fall within a specified range.
- Example: Validating that age is between 0 and 120.
- Use Case: Preventing unrealistic values in surveys.
- Consistency Validation:
- Ensures relationships between data points are logical.
- Example: Verifying that “Date of Birth” is earlier than “Joining Date.”
- Use Case: Cross-checking dates in employee records.
- Uniqueness Validation:
- Ensures no duplicate entries exist for unique fields.
- Example: Validating that each customer has a unique ID.
- Use Case: Avoiding duplicate records in databases.
- Cross-Field Validation:
- Checks dependencies between multiple fields.
- Example: If “Marital Status” is “Single,” then “Spouse Name” should be empty.
- Use Case: Ensuring logical consistency in forms.
- Statistical Validation:
- Uses statistical methods to identify anomalies.
- Example: Detecting outliers using z-scores or interquartile range (IQR).
- Use Case: Identifying errors in sensor data.
- Referential Integrity Validation:
- Ensures foreign keys match primary keys in related tables.
- Example: Verifying that all “Customer IDs” in the “Orders” table exist in the “Customers” table.
- Use Case: Maintaining database integrity.
- Pattern Matching Validation:
- Ensures data matches a specific format or pattern.
- Example: Validating phone numbers using a regex pattern (e.g., `+91-XXXXXXXXXX`).
- Use Case: Ensuring standardized formats for contact information.
Examples of Data Validation
- Example 1: In a survey form, validate that “Age” is a positive integer and “Email” follows the format `name@domain.com`.
- Example 2: In a financial dataset, validate that “Transaction Amount” is non-negative and “Account Balance” does not go below zero.
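Two of these rules (range validation and pattern matching) can be combined into a single checker. The email regex below is deliberately simplified and the records are hypothetical:

```python
import re

EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$")  # simplified email pattern

def validate_record(record):
    # Collect rule violations instead of failing on the first one.
    errors = []
    if not (0 <= record.get("age", -1) <= 120):      # range validation
        errors.append("age out of range")
    if not EMAIL_RE.match(record.get("email", "")):  # pattern matching
        errors.append("invalid email")
    return errors

ok = validate_record({"age": 30, "email": "user@example.com"})
bad = validate_record({"age": 150, "email": "not-an-email"})
print(ok, bad)  # [] ['age out of range', 'invalid email']
```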
13. Qualitative Variables Are Not Always Categorical. Justify with a Suitable Example
Qualitative vs. Categorical Variables
- Qualitative Variables: Represent characteristics or qualities (e.g., color, taste, opinions). They can be either nominal (categorical) or ordinal.
- Categorical Variables: A subset of qualitative variables where categories have no inherent order (e.g., gender, city).
Justification
Not all qualitative variables are purely categorical because some qualitative variables have an inherent order (ordinal). These variables cannot be treated as nominal categorical variables due to their ordered nature.
Example
- Variable: Education Level
- Categories: “High School,” “Bachelor’s Degree,” “Master’s Degree,” “Ph.D.”
- This is a qualitative variable because it represents levels of education (a quality).
- However, it is not purely categorical because there is a clear order (hierarchy) among the categories.
- Implication: Treating this variable as purely categorical would ignore the inherent order, leading to incorrect analysis.
Key Points
- Qualitative variables can be nominal (categorical) or ordinal (ordered).
- Ordinal variables require special handling in analysis (e.g., assigning ranks or weights).
Possible Exam Questions
- Differentiate between qualitative and categorical variables.
- Provide an example of a qualitative variable that is not categorical.
- Why is it important to distinguish between nominal and ordinal variables?
- How does the treatment of ordinal variables differ from nominal variables?
14. Demonstrate Data Preprocessing Techniques in Detail
What is Data Preprocessing?
Data preprocessing involves cleaning, transforming, and organizing raw data to make it suitable for analysis. It is a critical step in ensuring high-quality results.
Key Data Preprocessing Techniques
- Data Cleaning:
- Handling Missing Values:
- Imputation: Replace missing values with mean, median, or mode.
- Example: Replace missing “Age” values with the average age.
- Deletion: Remove rows or columns with excessive missing values.
- Outlier Detection:
- Use techniques like z-scores, IQR, or visualization (boxplots).
- Example: Identify and handle extreme values in sales data.
- Noise Reduction:
- Smooth noisy data using smoothing techniques (e.g., moving averages).
- Example: Smoothing stock price fluctuations.
- Data Transformation:
- Normalization/Scaling:
- Rescales data to a standard range (e.g., 0 to 1).
- Formula: $x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$
- Example: Normalize features for machine learning models.
- Log Transformation:
- Reduces skewness in data.
- Example: Transform highly skewed income data.
- Encoding Categorical Variables:
- Convert categories into numerical values.
- One-Hot Encoding: Creates binary columns for each category.
- Example: Encode “Color” (Red, Blue, Green) into three binary columns.
- Label Encoding: Assigns a unique number to each category.
- Example: Encode “Size” (Small, Medium, Large) as 1, 2, 3.
- Feature Engineering:
- Create new features to improve model performance.
- Example: Add a “Total Spend” feature by summing individual purchase amounts.
- Dimensionality Reduction:
- Reduce the number of features while retaining important information.
- Techniques: Principal Component Analysis (PCA), t-SNE.
- Example: Reduce 100 features to 10 principal components.
- Data Integration:
- Combine data from multiple sources into a unified dataset.
- Example: Merge customer data from CRM and transaction data from ERP systems.
- Data Aggregation:
- Summarize data at a higher level.
- Example: Aggregate daily sales data into monthly totals.
Example of Data Preprocessing
Suppose you have a dataset of customer transactions:
- Data Cleaning: Handle missing “Amount” values using mean imputation.
- Data Transformation: Normalize “Transaction Amount” to a range of 0 to 1.
- Encoding: Convert “Payment Method” (Cash, Credit Card, Debit Card) into one-hot encoded columns.
- Feature Engineering: Add a “Purchase Frequency” feature based on the number of transactions per customer.
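The normalization and encoding steps of this example can be sketched without any libraries; the amounts and payment methods below are made up:

```python
# Min-max normalization and one-hot encoding on hypothetical transaction data.
amounts = [100.0, 250.0, 400.0]
methods = ["Cash", "Credit Card", "Cash"]

# Normalization: x' = (x - min) / (max - min), rescaling to [0, 1].
lo, hi = min(amounts), max(amounts)
normalized = [(a - lo) / (hi - lo) for a in amounts]

# One-hot encoding: one binary column per payment method.
categories = sorted(set(methods))  # ['Cash', 'Credit Card']
one_hot = [[1 if m == c else 0 for c in categories] for m in methods]

print(normalized)  # [0.0, 0.5, 1.0]
print(one_hot)     # [[1, 0], [0, 1], [1, 0]]
```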