Data Mining Bloom's Taxonomy Questions
UNIT I: Introduction to Data Mining & Data Preprocessing
Introduction to Data Mining
- What is Data Mining?
→ Definition, Relationship with KDD (Knowledge Discovery in Databases), Data Science, and Machine Learning
- Kinds of Data
→ Structured, Semi-structured, Unstructured; Relational, Transactional, Time-Series, Spatial, Multimedia, Web, Text
- Knowledge Discovery Process (KDD)
→ Data Selection → Preprocessing → Transformation → Data Mining → Interpretation/Evaluation
- Data Mining Functionalities
→ Classification, Clustering, Association, Regression, Summarization, Outlier Detection
- Kinds of Patterns
→ Descriptive vs. Predictive Patterns, Frequent Patterns, Discriminative Patterns
- Major Issues in Data Mining
→ Scalability, Dimensionality, Data Quality, Privacy, Integration with DBMS, Visualization, Interpretability
Data Understanding & Preprocessing
- Data Objects and Attribute Types
→ Nominal, Ordinal, Interval, Ratio; Discrete vs. Continuous
- Basic Statistical Descriptions of Data
→ Mean, Median, Mode, Variance, Standard Deviation, Quartiles, Boxplots, Skewness
- Data Visualization Techniques
→ Histograms, Scatter Plots, Heatmaps, Parallel Coordinates, PCA Plots
- Measuring Data Similarity and Dissimilarity
→ Euclidean, Manhattan, Minkowski, Cosine, Jaccard, Hamming Distance; Similarity for Mixed Types
- Major Tasks in Data Preprocessing
  - Data Cleaning → Handle missing values, noise, outliers
  - Data Integration → Entity resolution, redundancy removal
  - Data Reduction → Dimensionality reduction (PCA), Numerosity reduction (sampling, histograms)
  - Data Transformation → Normalization, Aggregation, Generalization
  - Data Discretization → Binning, Entropy-based, ChiMerge
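The similarity measures and the normalization step listed above can be illustrated with a few short functions. This is a minimal sketch using only the Python standard library; the function names are illustrative choices, not part of the syllabus, and the code assumes equal-length numeric vectors, non-zero vectors for cosine similarity, non-empty sets for Jaccard, and a non-constant attribute for min-max normalization.

```python
import math


def minkowski(a, b, p):
    """Minkowski distance of order p (p=1 is Manhattan, p=2 is Euclidean)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (assumes neither is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def jaccard_similarity(s, t):
    """Size of the intersection over size of the union (assumes a non-empty union)."""
    s, t = set(s), set(t)
    return len(s & t) / len(s | t)


def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def min_max_normalize(values):
    """Rescale values linearly to [0, 1] (assumes max(values) > min(values))."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]
```

For example, `minkowski([0, 0], [3, 4], 2)` is the Euclidean distance of a 3-4-5 triangle, and `hamming_distance("karolin", "kathrin")` counts three differing positions.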
UNIT II: Association Analysis
Fundamentals
- Basic Concepts → Itemsets, Support, Confidence, Lift, Conviction
- Market Basket Analysis → Retail applications, cross-selling
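The rule measures named above can be computed directly from a list of transactions. The following is a small sketch with a made-up basket dataset; the item names and function names are assumptions for illustration only.

```python
# Hypothetical market-basket data: each transaction is a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(A ∪ B) / support(A): how often B appears when A does."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


def lift(antecedent, consequent, transactions):
    """confidence(A → B) / support(B); values above 1 suggest positive association."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)


def conviction(antecedent, consequent, transactions):
    """(1 - support(B)) / (1 - confidence(A → B)); infinite for a rule that never fails."""
    conf = confidence(antecedent, consequent, transactions)
    if conf == 1:
        return float("inf")
    return (1 - support(consequent, transactions)) / (1 - conf)


# Rule {diapers} → {beer}: support about 0.6, confidence about 0.75, lift about 1.25.
print(support({"diapers", "beer"}, transactions))
print(confidence({"diapers"}, {"beer"}, transactions))
print(lift({"diapers"}, {"beer"}, transactions))
```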
Algorithms
- Apriori Algorithm
→ Candidate generation, pruning, frequent itemset mining
- FP-Growth Algorithm
→ FP-Tree construction, pattern growth without candidate generation
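The candidate generation and pruning steps of Apriori can be sketched in a few lines. This is a simplified, unoptimized illustration rather than a reference implementation: it assumes `min_support` is an absolute transaction count, and it builds candidates by taking unions of frequent (k-1)-itemsets instead of the prefix-based join used in textbook versions.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: count} for every itemset found in at least min_support transactions."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count individual items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        candidates = set()
        # Join step: combine pairs of frequent (k-1)-itemsets into k-itemsets.
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) != k:
                continue
            # Prune step: every (k-1)-subset of a candidate must itself be frequent.
            if all(frozenset(sub) in current for sub in combinations(union, k - 1)):
                candidates.add(union)
        # Count step: scan the transactions once per level.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(current)
        k += 1
    return frequent


baskets = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
for itemset, count in sorted(apriori(baskets, 2).items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

FP-Growth avoids the repeated candidate generation shown here by compressing the transactions into an FP-Tree and growing patterns from it.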
Advanced Topics
- From Association to Correlation Analysis
→ Correlation rules, χ² test, all_confidence, max_confidence, Kulczynski
- Multilevel & Multidimensional Associations
→ Taxonomy-based rules (e.g., “electronics → laptops → gaming laptops”)
→ Rules with multiple dimensions (e.g., “buys(X, laptop) ∧ age(X, "20–30") → buys(X, mouse)”)
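The all_confidence, max_confidence, and Kulczynski measures listed under correlation analysis are all built from the two conditional probabilities P(B|A) and P(A|B). A minimal sketch, assuming the inputs are raw support counts and using a hypothetical function name:

```python
def correlation_measures(sup_a, sup_b, sup_ab):
    """Null-invariant measures from support counts of A, B, and A together with B."""
    p_b_given_a = sup_ab / sup_a
    p_a_given_b = sup_ab / sup_b
    return {
        # Smaller of the two conditional probabilities.
        "all_confidence": min(p_b_given_a, p_a_given_b),
        # Larger of the two conditional probabilities.
        "max_confidence": max(p_b_given_a, p_a_given_b),
        # Arithmetic mean of the two conditional probabilities.
        "kulczynski": 0.5 * (p_b_given_a + p_a_given_b),
    }


# Illustrative counts: A appears 1000 times, B 100 times, and they co-occur 100 times.
print(correlation_measures(sup_a=1000, sup_b=100, sup_ab=100))
```

None of these measures depends on the number of transactions that contain neither A nor B, which is why they are often preferred over lift on sparse data.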
UNIT III: Classification
Fundamentals
- Basic Concepts → Training set, Test set, Class labels, Overfitting, Confusion Matrix
Classification Methods
- Decision Tree Induction → ID3, C4.5, CART; Splitting criteria: Information Gain, Gini Index
- Bayes Classification Methods → Naïve Bayes, Bayesian Belief Networks
- Rule-Based Classification → IF-THEN rules, Sequential Covering (RIPPER), Rule Pruning
- Multilayer Feedforward Neural Networks → Backpropagation, Activation Functions, Hidden Layers
- Support Vector Machines (SVM) → Maximum margin hyperplane, Kernel Trick (Linear, RBF, Polynomial)
- k-Nearest Neighbor (k-NN) → Lazy learning, Distance-weighted voting, Curse of Dimensionality
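The splitting criteria named under Decision Tree Induction can be written out directly. The sketch below computes entropy, the Gini index, and the information gain of a candidate split; the function names and the toy labels are illustrative assumptions.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini index: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())


def information_gain(parent, partitions):
    """Entropy of the parent minus the size-weighted entropy of its partitions."""
    n = len(parent)
    weighted = sum(len(p) / n * entropy(p) for p in partitions)
    return entropy(parent) - weighted


labels = ["yes", "yes", "no", "no"]
# A split that separates the classes perfectly recovers the full 1 bit of entropy.
print(entropy(labels), gini(labels))
print(information_gain(labels, [["yes", "yes"], ["no", "no"]]))
```

ID3 selects the attribute with the highest information gain, while CART uses the Gini index; C4.5 normalizes information gain into a gain ratio, which is not shown here.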
Model Evaluation & Enhancement
- Metrics for Classifier Performance
→ Accuracy, Precision, Recall, F1-Score, ROC Curve, AUC, Cost Matrix
- Ensemble Methods
→ Bagging (Random Forest), Boosting (AdaBoost, Gradient Boosting), Stacking
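The threshold-based metrics above follow directly from the four cells of a binary confusion matrix. A minimal sketch, assuming the counts are already available and that no denominator is zero:

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from binary confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)  # of the predicted positives, how many are correct
    recall = tp / (tp + fn)     # of the actual positives, how many are found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Illustrative counts: 40 true positives, 10 false positives,
# 20 false negatives, 30 true negatives.
print(classification_metrics(tp=40, fp=10, fn=20, tn=30))
```

With these counts, accuracy is 0.7 and precision is 0.8 while recall is only about 0.67, which shows why accuracy alone can hide weak performance on the positive class.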
UNIT IV: Cluster Analysis & Outlier Detection
Cluster Analysis
- Requirements → Scalability, Handling different attributes, Arbitrary shapes, Minimal domain knowledge
- Partitioning Methods
  - k-Means → Centroid-based, iterative refinement
  - k-Medoids (PAM) → Medoid-based, robust to outliers
- Hierarchical Methods
  - AGNES (Agglomerative) → Bottom-up merging
  - DIANA (Divisive) → Top-down splitting
  - BIRCH → CF-Tree for large datasets
- Density-Based Method
  - DBSCAN → Core/Border/Noise points, ε and MinPts parameters
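The iterative refinement behind k-Means can be sketched as alternating assignment and update steps. The version below is a bare-bones illustration, assuming the caller supplies the initial centroids (real implementations usually pick them randomly or with k-means++) and that points are tuples of equal dimension; it requires Python 3.8+ for `math.dist`.

```python
import math


def k_means(points, centroids, max_iter=100):
    """Lloyd's algorithm: returns (final centroids, clusters as lists of points)."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            idx = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[idx].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append(tuple(sum(dim) / len(cluster) for dim in zip(*cluster)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, clusters


points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(points, centroids=[(1, 1), (8, 8)])
print(centroids)
print(clusters)
```

k-Medoids follows the same loop but restricts each cluster representative to an actual data point, which is what makes it less sensitive to outliers.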
Outlier Analysis
- Types of Outliers → Global, Contextual, Collective
- Challenges → High dimensionality, noise masking, interpretability
- Overview of Detection Methods
→ Statistical (Z-score, Grubbs), Distance-based (k-NN), Density-based (LOF), Clustering-based
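As one example of the statistical approach, a Z-score test flags values that lie many standard deviations from the mean. This is a minimal sketch using the population standard deviation; the threshold and sample data are illustrative assumptions, and the method presumes roughly normal data with a non-zero spread.

```python
import statistics


def z_score_outliers(values, threshold=3.0):
    """Return the values whose absolute Z-score exceeds the threshold."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    return [v for v in values if abs(v - mean) / stdev > threshold]


readings = [10, 12, 11, 13, 12, 11, 10, 12, 11, 100]
# With only 10 points, no Z-score can exceed (n - 1) / sqrt(n), about 2.85,
# so a lower threshold than the usual 3.0 is used for this small sample.
print(z_score_outliers(readings, threshold=2.5))
```

The example also illustrates masking: the extreme value inflates the mean and standard deviation, which pulls its own Z-score down.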
UNIT V: Advanced Concepts in Data Mining
Web Mining
- Web Content Mining → Text mining, HTML parsing, NLP for web pages
- Web Structure Mining → Link analysis (PageRank, HITS), Web graph mining
- Web Usage Mining → Clickstream analysis, Session reconstruction, User profiling
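The link analysis mentioned under Web Structure Mining can be illustrated with a short power-iteration version of PageRank. This is a simplified sketch: the damping factor of 0.85, the fixed iteration count, and the even redistribution of rank from pages without out-links are common conventions assumed here, and every link target is assumed to appear as a key in the graph.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """graph maps each page to the list of pages it links to; returns {page: rank}."""
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1 / n for node in nodes}
    for _ in range(iterations):
        # Every page receives a baseline share from the random-jump term.
        new_rank = {node: (1 - damping) / n for node in nodes}
        for node, out_links in graph.items():
            if out_links:
                # Split this page's rank evenly among the pages it links to.
                share = damping * rank[node] / len(out_links)
                for target in out_links:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for target in nodes:
                    new_rank[target] += damping * rank[node] / n
        rank = new_rank
    return rank


web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
for page, score in sorted(pagerank(web).items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```

In this toy graph, page C collects links from both A and B and therefore ends up with the highest rank.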
Spatial Data Mining
- Spatial Data Overview → Maps, coordinates, topology, raster/vector models
- Spatial Data Mining Primitives → Spatial predicates, distance measures, neighborhood definitions
- Spatial Rules → Co-location patterns (e.g., “school near hospital”)
- Spatial Classification Algorithms → Spatial decision trees, spatial autocorrelation (Moran’s I)
- Spatial Clustering Algorithms → STING, CLIQUE, spatial extensions of DBSCAN (e.g., GDBSCAN)
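Spatial autocorrelation, mentioned above with Moran's I, measures whether neighboring locations tend to carry similar values. A minimal sketch of the global statistic, assuming the spatial weights are supplied as a dictionary keyed by index pairs (a representation chosen here for illustration):

```python
def morans_i(values, weights):
    """Global Moran's I; weights maps (i, j) index pairs to a spatial weight w_ij."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    total_weight = sum(weights.values())
    # Weighted cross-products of deviations between neighboring locations.
    cross = sum(w * dev[i] * dev[j] for (i, j), w in weights.items())
    denom = sum(d * d for d in dev)
    return (n / total_weight) * (cross / denom)


# Four locations on a line; each location neighbors the ones next to it.
line_weights = {(0, 1): 1, (1, 0): 1, (1, 2): 1, (2, 1): 1, (2, 3): 1, (3, 2): 1}
print(morans_i([1, 2, 3, 4], line_weights))    # smooth gradient: positive
print(morans_i([1, -1, 1, -1], line_weights))  # alternating values: negative
```

Values near +1 indicate that similar values cluster together in space, values near -1 indicate a checkerboard-like pattern, and values near 0 suggest no spatial structure.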
Temporal Data Mining
- Modeling Temporal Events → Timestamps, intervals, sequences
- Time Series Analysis → Trend, seasonality, autocorrelation, ARIMA, forecasting
- Pattern Detection → Periodic patterns, calendar-based patterns
- Sequences → Sequential pattern mining (GSP, PrefixSpan)
- Temporal Association Rules → Rules with time constraints (e.g., “A occurs → B occurs within 7 days”)
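A rule such as “A occurs → B occurs within 7 days” can be checked by measuring how often an A event is followed by a B event inside the time window. The sketch below assumes a single event sequence of `(date, label)` pairs and defines confidence as the fraction of A occurrences with at least one later B inside the window; both the data and this definition are illustrative assumptions.

```python
from datetime import date, timedelta


def temporal_confidence(events, antecedent, consequent, window_days):
    """Fraction of antecedent events followed by a consequent within window_days."""
    a_times = [t for t, label in events if label == antecedent]
    b_times = [t for t, label in events if label == consequent]
    if not a_times:
        return 0.0
    window = timedelta(days=window_days)
    # An A event counts as a hit if some B occurs strictly after it, within the window.
    hits = sum(
        any(timedelta(0) < b - a <= window for b in b_times)
        for a in a_times
    )
    return hits / len(a_times)


events = [
    (date(2024, 1, 1), "A"), (date(2024, 1, 5), "B"),   # 4 days later: hit
    (date(2024, 1, 10), "A"), (date(2024, 1, 20), "B"),  # 10 days later: miss
    (date(2024, 2, 1), "A"), (date(2024, 2, 8), "B"),    # exactly 7 days later: hit
]
print(temporal_confidence(events, "A", "B", window_days=7))
```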