Data Mining Bloom's Taxonomy Questions

UNIT I: Introduction to Data Mining & Data Preprocessing

Introduction to Data Mining

  • What is Data Mining?
    → Definition, Relationship with KDD (Knowledge Discovery in Databases), Data Science, and Machine Learning
  • Kinds of Data
    → Structured, Semi-structured, Unstructured; Relational, Transactional, Time-Series, Spatial, Multimedia, Web, Text
  • Knowledge Discovery Process (KDD)
    → Data Selection → Preprocessing → Transformation → Data Mining → Interpretation/Evaluation
  • Data Mining Functionalities
    → Classification, Clustering, Association, Regression, Summarization, Outlier Detection
  • Kinds of Patterns
    → Descriptive vs. Predictive Patterns, Frequent Patterns, Discriminative Patterns
  • Major Issues in Data Mining
    → Scalability, Dimensionality, Data Quality, Privacy, Integration with DBMS, Visualization, Interpretability

Data Understanding & Preprocessing

  • Data Objects and Attribute Types
    → Nominal, Ordinal, Interval, Ratio; Discrete vs. Continuous
  • Basic Statistical Descriptions of Data
    → Mean, Median, Mode, Variance, Standard Deviation, Quartiles, Boxplots, Skewness
  • Data Visualization Techniques
    → Histograms, Scatter Plots, Heatmaps, Parallel Coordinates, PCA Plots
  • Measuring Data Similarity and Dissimilarity
    → Euclidean, Manhattan, Minkowski, Cosine, Jaccard, Hamming Distance; Similarity for Mixed Types
  • Major Tasks in Data Preprocessing
    • Data Cleaning → Handle missing values, noise, outliers
    • Data Integration → Entity resolution, redundancy removal
    • Data Reduction → Dimensionality reduction (PCA), Numerosity reduction (sampling, histograms)
    • Data Transformation → Normalization, Aggregation, Generalization
    • Data Discretization → Binning, Entropy-based, ChiMerge
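
The measures listed under Measuring Data Similarity and Dissimilarity can be sketched in plain Python. This is a minimal illustration rather than a library API: the function names are chosen here for readability, and the inputs are assumed to be equal-length numeric sequences (or sets, for Jaccard).

```python
import math

def minkowski(x, y, p):
    """Minkowski distance of order p; p=1 gives Manhattan, p=2 gives Euclidean."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def cosine_similarity(x, y):
    """Cosine of the angle between x and y (assumes neither is the zero vector)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def jaccard_similarity(a, b):
    """Size of the intersection over size of the union (assumes a non-empty union)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def hamming_distance(s, t):
    """Number of positions at which two equal-length sequences differ."""
    if len(s) != len(t):
        raise ValueError("sequences must have equal length")
    return sum(c1 != c2 for c1, c2 in zip(s, t))
```

For example, minkowski((0, 0), (3, 4), 2) gives the Euclidean distance 5.0, while p=1 on the same points gives the Manhattan distance 7.0.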

UNIT II: Association Analysis

Fundamentals

  • Basic Concepts → Itemsets, Support, Confidence, Lift, Conviction
  • Market Basket Analysis → Retail applications, cross-selling
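
The rule measures named under Basic Concepts can be computed directly from a list of transactions. The sketch below uses a small made-up basket dataset (an assumption for illustration only) and hypothetical helper names.

```python
# Toy market-basket data, invented for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

def lift(antecedent, consequent, transactions):
    """Confidence divided by the consequent's support; 1 suggests independence."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)

def conviction(antecedent, consequent, transactions):
    """(1 - support(consequent)) / (1 - confidence); infinite when confidence is 1."""
    conf = confidence(antecedent, consequent, transactions)
    if conf == 1:
        return float("inf")
    return (1 - support(consequent, transactions)) / (1 - conf)
```

On this toy data the rule {diapers} → {beer} has support 0.6, confidence 0.75, lift 1.25, and conviction 1.6 (up to floating-point rounding).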

Algorithms

  • Apriori Algorithm
    → Candidate generation, pruning, frequent itemset mining
  • FP-Growth Algorithm
    → FP-Tree construction, pattern growth without candidate generation
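
The candidate generation and pruning steps of Apriori can be sketched as follows. This is a simplified, unoptimized version for small datasets; the function name and the use of an absolute support count (rather than a fraction) are assumptions made here.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {itemset: count} for every itemset found in at least min_count transactions."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count individual items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_count}
    result = dict(frequent)

    k = 2
    while frequent:
        prev = set(frequent)
        # Join step: combine frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        # Count step: scan the transactions once per level.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_count}
        result.update(frequent)
        k += 1
    return result
```

FP-Growth avoids the repeated candidate generation above by compressing the transactions into an FP-Tree and growing patterns from it; that variant is not shown here.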

Advanced Topics

  • From Association to Correlation Analysis
    → Correlation rules, χ² test, all_confidence, max_confidence, Kulczynski
  • Multilevel & Multidimensional Associations
    → Rules mined across levels of a concept hierarchy (e.g., levels “electronics → laptops → gaming laptops”)
    → Rules with multiple dimensions (e.g., “buys(X, laptop) ∧ age(X, "20–30") → buys(X, mouse)”)
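
Three of the correlation measures listed under From Association to Correlation Analysis (all_confidence, max_confidence, Kulczynski) depend only on the supports of A, B, and A ∪ B. A minimal sketch, with hypothetical function names and supports passed in as counts or fractions:

```python
def all_confidence(sup_a, sup_b, sup_ab):
    """sup(A ∪ B) / max(sup(A), sup(B)), i.e. the smaller of P(A|B) and P(B|A)."""
    return sup_ab / max(sup_a, sup_b)

def max_confidence(sup_a, sup_b, sup_ab):
    """The larger of P(A|B) and P(B|A)."""
    return sup_ab / min(sup_a, sup_b)

def kulczynski(sup_a, sup_b, sup_ab):
    """Average of the two conditional probabilities P(A|B) and P(B|A)."""
    return 0.5 * (sup_ab / sup_a + sup_ab / sup_b)
```

None of the three depends on the total number of transactions, so transactions containing neither A nor B do not affect them.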

UNIT III: Classification

Fundamentals

  • Basic Concepts → Training set, Test set, Class labels, Overfitting, Confusion Matrix

Classification Methods

  • Decision Tree Induction → ID3, C4.5, CART; Splitting criteria: Information Gain, Gini Index
  • Bayes Classification Methods → Naïve Bayes, Bayesian Belief Networks
  • Rule-Based Classification → IF-THEN rules, Sequential Covering (RIPPER), Rule Pruning
  • Multilayer Feedforward Neural Networks → Backpropagation, Activation Functions, Hidden Layers
  • Support Vector Machines (SVM) → Maximum margin hyperplane, Kernel Trick (Linear, RBF, Polynomial)
  • k-Nearest Neighbor (k-NN) → Lazy learning, Distance-weighted voting, Curse of Dimensionality
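
The two splitting criteria named under Decision Tree Induction can be illustrated on lists of class labels. The helper names below are assumptions for this sketch; a split is represented simply as a list of label partitions.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini index: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent, partitions):
    """Entropy of the parent minus the size-weighted entropy of its partitions."""
    n = len(parent)
    weighted = sum(len(p) / n * entropy(p) for p in partitions)
    return entropy(parent) - weighted
```

A balanced two-class node has entropy 1.0 and Gini index 0.5; a split that separates the classes perfectly has an information gain of 1.0.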

Model Evaluation & Enhancement

  • Metrics for Classifier Performance
    → Accuracy, Precision, Recall, F1-Score, ROC Curve, AUC, Cost Matrix
  • Ensemble Methods
    → Bagging (Random Forest), Boosting (AdaBoost, Gradient Boosting), Stacking
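
Accuracy, precision, recall, and F1-score all follow from the four cells of a binary confusion matrix. The sketch below is a minimal illustration; the function name and the choice to return 0.0 when a denominator is zero are assumptions made here.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```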

UNIT IV: Cluster Analysis & Outlier Detection

Cluster Analysis

  • Requirements → Scalability, Handling different attribute types, Discovering clusters of arbitrary shape, Minimal domain knowledge for input parameters
  • Partitioning Methods
    • k-Means → Centroid-based, iterative refinement
    • k-Medoids (PAM) → Medoid-based, robust to outliers
  • Hierarchical Methods
    • AGNES (Agglomerative) → Bottom-up merging
    • DIANA (Divisive) → Top-down splitting
    • BIRCH → CF-Tree for large datasets
  • Density-Based Methods
    • DBSCAN → Core/Border/Noise points, ε and MinPts parameters
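
The iterative refinement in k-Means alternates between assigning points to the nearest centroid and recomputing each centroid as the mean of its cluster. The sketch below assumes the initial centroids are supplied by the caller (real implementations usually pick them randomly or with a seeding heuristic) and uses squared Euclidean distance.

```python
def kmeans(points, centroids, max_iter=100):
    """Return (centroids, clusters) after iterating until the centroids stop moving."""
    centroids = [tuple(c) for c in centroids]
    clusters = []
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            tuple(sum(col) / len(cluster) for col in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters
```

k-Medoids follows the same loop but restricts each cluster representative to an actual data point, which is what makes it less sensitive to outliers.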

Outlier Analysis

  • Types of Outliers → Global, Contextual, Collective
  • Challenges → High dimensionality, noise masking, interpretability
  • Overview of Detection Methods
    → Statistical (Z-score, Grubbs' test), Distance-based (k-NN), Density-based (LOF), Clustering-based
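
As an example of the statistical approach, a Z-score detector flags values that lie more than a chosen number of standard deviations from the mean. The function name and the default threshold are assumptions for this sketch; the population standard deviation is used.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute Z-score exceeds threshold."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    if stdev == 0:
        return []  # all values identical, nothing can be an outlier
    return [v for v in values if abs(v - mean) / stdev > threshold]
```

Because an extreme value inflates the mean and standard deviation it is measured against, small samples may need a lower threshold: in the ten-value sample [10, 11, 9, 10, 12, 10, 11, 9, 10, 50], the value 50 has a Z-score just under 3 and is only flagged with a threshold such as 2.5.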

UNIT V: Advanced Concepts in Data Mining

Web Mining

  • Web Content Mining → Text mining, HTML parsing, NLP for web pages
  • Web Structure Mining → Link analysis (PageRank, HITS), Web graph mining
  • Web Usage Mining → Clickstream analysis, Session reconstruction, User profiling
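
The link analysis behind PageRank can be sketched as a power iteration over a small web graph. This is a simplified version under stated assumptions: the graph is a dict mapping each page to the list of pages it links to, every link target also appears as a key, and pages with no outgoing links spread their rank evenly over all pages.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Return {page: rank} after a fixed number of power-iteration steps."""
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1 / n for v in nodes}
    for _ in range(iterations):
        # Every page receives the "random jump" share first.
        new_rank = {v: (1 - damping) / n for v in nodes}
        for v in nodes:
            out_links = graph[v]
            if out_links:
                share = damping * rank[v] / len(out_links)
                for w in out_links:
                    new_rank[w] += share
            else:
                # Dangling page: distribute its rank over all pages.
                for w in nodes:
                    new_rank[w] += damping * rank[v] / n
        rank = new_rank
    return rank
```

For the three-page graph {"A": ["B", "C"], "B": ["C"], "C": ["A"]}, page C ends up with the highest rank and B with the lowest, and the ranks sum to 1.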

Spatial Data Mining

  • Spatial Data Overview → Maps, coordinates, topology, raster/vector models
  • Spatial Data Mining Primitives → Spatial predicates, distance measures, neighborhood definitions
  • Spatial Rules → Co-location patterns (e.g., “school near hospital”)
  • Spatial Classification Algorithms → Spatial decision trees, spatial autocorrelation (Moran’s I)
  • Spatial Clustering Algorithms → STING, CLIQUE, spatial extensions of DBSCAN (e.g., GDBSCAN)
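
Spatial autocorrelation as measured by Moran's I can be computed from attribute values and a spatial weight matrix. The sketch below assumes a dense list-of-lists weight matrix (for example, 1 for neighboring locations and 0 otherwise) and values that are not all identical.

```python
def morans_i(values, weights):
    """Moran's I: (n / W) * sum_ij w_ij * d_i * d_j / sum_i d_i^2, with d = deviation from the mean."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    total_weight = sum(sum(row) for row in weights)
    cross = sum(
        weights[i][j] * dev[i] * dev[j]
        for i in range(n) for j in range(n)
    )
    return (n / total_weight) * cross / sum(d * d for d in dev)
```

With four locations in a row and adjacency weights, the clustered values [1, 1, 5, 5] give a positive I (about 0.33), while the alternating values [1, 5, 1, 5] give I = -1, indicating that neighbors tend to be dissimilar.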

Temporal Data Mining

  • Modeling Temporal Events → Timestamps, intervals, sequences
  • Time Series Analysis → Trend, seasonality, autocorrelation, ARIMA, forecasting
  • Pattern Detection → Periodic patterns, calendar-based patterns
  • Sequences → Sequential pattern mining (GSP, PrefixSpan)
  • Temporal Association Rules → Rules with time constraints (e.g., “A occurs → B occurs within 7 days”)
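
The autocorrelation mentioned under Time Series Analysis measures how strongly a series resembles a lagged copy of itself, which helps reveal seasonality. A minimal sketch of the sample autocorrelation at a given lag, with a hypothetical function name and assuming a non-constant series:

```python
def autocorrelation(series, lag):
    """Sample autocorrelation: lagged covariance divided by the overall variance term."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    numerator = sum(dev[t] * dev[t + lag] for t in range(n - lag))
    denominator = sum(d * d for d in dev)
    return numerator / denominator
```

For a series that repeats [1, 0, -1, 0] five times, the autocorrelation is strongly positive at lag 4 (one full period, 0.8) and strongly negative at lag 2 (half a period, -0.9).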