Tools: Weka / Pentaho / Python
LIST OF EXPERIMENTS
1. Data Processing Techniques
- (i) Data Cleaning
→ Handle missing values, remove duplicates, correct inconsistencies
- (ii) Data Transformation – Normalization
→ Min-Max, Z-Score, Decimal Scaling normalization techniques
- (iii) Data Integration
→ Merge datasets from multiple sources, resolve schema/entity conflicts
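A minimal Python sketch of the three normalization techniques, using pandas/numpy on a made-up column (the values are illustrative only):

```python
import numpy as np
import pandas as pd

# Toy numeric column (hypothetical values for illustration).
x = pd.Series([12.0, 45.0, 7.0, 88.0, 33.0])

# Min-Max normalization: rescale to [0, 1].
min_max = (x - x.min()) / (x.max() - x.min())

# Z-Score normalization: zero mean, unit standard deviation.
z_score = (x - x.mean()) / x.std()

# Decimal scaling: divide by 10^j, where j is the smallest integer
# such that max(|x / 10^j|) < 1.
j = int(np.ceil(np.log10(x.abs().max())))
decimal_scaled = x / (10 ** j)

print(pd.DataFrame({"raw": x, "min_max": min_max,
                    "z_score": z_score, "decimal": decimal_scaled}))
```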
2. Partitioning Methods
- Horizontal Partitioning
- Vertical Partitioning
- Round Robin Partitioning
- Hash-Based Partitioning
→ Compare performance, use cases, and efficiency of each method
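The four strategies can be prototyped on a single pandas DataFrame before measuring them on a real database; the table, key column, and partition count below are hypothetical:

```python
import pandas as pd

# Hypothetical customer table to partition four ways.
df = pd.DataFrame({"id": range(1, 9),
                   "region": ["N", "S", "E", "W"] * 2,
                   "sales": [10, 20, 30, 40, 50, 60, 70, 80]})
n_parts = 2

# Horizontal: split rows by a predicate (here, by region value).
horizontal = {r: g for r, g in df.groupby("region")}

# Vertical: split by column groups, repeating the key in each fragment.
vertical = [df[["id", "region"]], df[["id", "sales"]]]

# Round robin: row i goes to partition i mod n_parts.
round_robin = [df.iloc[i::n_parts] for i in range(n_parts)]

# Hash-based: partition by a hash of the key (modulo stands in here).
hash_based = [g for _, g in df.groupby(df["id"].mod(n_parts))]
```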
3. Data Warehouse Schemas
- Design and implement:
- Star Schema
- Snowflake Schema
- Fact Constellation (Galaxy) Schema
→ Analyze trade-offs: complexity vs. query performance vs. storage
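A small star-schema sketch in pandas (table names and values are invented for illustration); a snowflake variant would further normalize dim_product by moving category into its own table:

```python
import pandas as pd

# Star schema: one fact table referencing denormalized dimension tables.
dim_product = pd.DataFrame({"product_id": [1, 2],
                            "name": ["Pen", "Book"],
                            "category": ["Stationery", "Media"]})
dim_time = pd.DataFrame({"time_id": [1, 2], "year": [2023, 2024]})
fact_sales = pd.DataFrame({"product_id": [1, 2, 1],
                           "time_id": [1, 1, 2],
                           "amount": [100, 250, 120]})

# A typical star-join query: total sales per category per year.
report = (fact_sales
          .merge(dim_product, on="product_id")
          .merge(dim_time, on="time_id")
          .groupby(["category", "year"])["amount"].sum())
print(report)
```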
4. Data Cube Construction & OLAP Operations
- Build multidimensional data cubes
- Perform OLAP operations:
- Roll-up, Drill-down
- Slice, Dice
- Pivot (Rotate)
→ Use tools like Pandas (Python), Saiku, or Pentaho for visualization
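A minimal sketch of cube construction and the OLAP operations using pandas (the sales records are synthetic):

```python
import pandas as pd

# Hypothetical sales records with three dimensions and one measure.
sales = pd.DataFrame({"region": ["N", "N", "S", "S"],
                      "year": [2023, 2024, 2023, 2024],
                      "product": ["Pen", "Pen", "Book", "Book"],
                      "amount": [100, 150, 200, 250]})

# Data cube via pivot_table: region x year, measure = sum(amount).
cube = sales.pivot_table(values="amount", index="region",
                         columns="year", aggfunc="sum")

# Roll-up: aggregate away the year dimension.
rollup = sales.groupby("region")["amount"].sum()

# Drill-down: the reverse, adding product back for finer granularity.
drilldown = sales.groupby(["region", "year", "product"])["amount"].sum()

# Slice: fix one dimension (year == 2023).
slice_2023 = sales[sales["year"] == 2023]

# Dice: select a sub-cube on two or more dimensions.
dice = sales[(sales["year"] == 2023) & (sales["region"] == "N")]

# Pivot (rotate): swap the row and column axes.
pivoted = cube.T
```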
5. ETL Operations (Extract, Transform, Load)
- Extract data from heterogeneous sources (CSV, SQL, APIs)
- Transform: Clean, aggregate, derive attributes
- Load into target warehouse or data mart
→ Implement using Pentaho Data Integration (Kettle) or Python (Pandas + SQLAlchemy)
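One possible pipeline with pandas + SQLAlchemy, as the experiment suggests; the CSV content is inlined so the sketch is self-contained, and SQLite stands in for the target warehouse:

```python
import io
import pandas as pd
from sqlalchemy import create_engine

# Extract: read a CSV source (inlined here to keep the sketch runnable).
csv_source = io.StringIO(
    "date,store,amount\n2024-01-03,A,100\n2024-01-15,A,\n"
    "2024-01-20,B,250\n2024-02-02,A,120\n2024-01-20,B,250\n")
raw = pd.read_csv(csv_source)

# Transform: clean (drop nulls and duplicates), derive month, aggregate.
raw = raw.dropna(subset=["amount"]).drop_duplicates()
raw["month"] = pd.to_datetime(raw["date"]).dt.to_period("M").astype(str)
monthly = raw.groupby(["store", "month"], as_index=False)["amount"].sum()

# Load: write into a warehouse table (SQLite stands in for the target DB).
engine = create_engine("sqlite:///warehouse.db")
monthly.to_sql("fact_monthly_sales", engine, if_exists="replace", index=False)
```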
6. Attribute-Oriented Induction Algorithm
- Generalize data by attribute removal/abstraction
- Concept hierarchy generation
- Generate characteristic rules from relational data
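A compact pandas interpretation of the induction steps; the relation, concept hierarchy, and bins below are all invented for illustration:

```python
import pandas as pd

# Toy relation and a hand-made concept hierarchy (both hypothetical).
df = pd.DataFrame({"name": ["Ann", "Bob", "Cal", "Dee"],
                   "major": ["Physics", "Math", "Biology", "Chemistry"],
                   "gpa": [3.9, 3.6, 2.4, 2.1]})
hierarchy = {"Physics": "Science", "Math": "Science",
             "Biology": "Science", "Chemistry": "Science"}

# Attribute removal: drop attributes with no higher-level concept and
# too many distinct values to generalize (e.g. 'name').
df = df.drop(columns=["name"])

# Attribute generalization: climb the hierarchy / discretize numerics.
df["major"] = df["major"].map(hierarchy)
df["gpa"] = pd.cut(df["gpa"], bins=[0, 3.0, 4.0],
                   labels=["low", "high"]).astype(str)

# Merge identical generalized tuples, keeping a count (the vote).
prime = df.value_counts().reset_index(name="count")
print(prime)  # reads off as a characteristic rule per generalized tuple
```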
7. Apriori Algorithm Implementation
- Mine frequent itemsets
- Generate association rules (Support, Confidence, Lift)
- Analyze market basket datasets (e.g., retail transactions)
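A from-scratch Apriori sketch on a toy basket dataset (transactions and thresholds are made up): a level-wise search for frequent itemsets, then rule generation with support, confidence, and lift:

```python
from itertools import combinations

# Toy market-basket transactions (hypothetical).
transactions = [{"bread", "milk"}, {"bread", "diaper", "beer"},
                {"milk", "diaper", "beer"}, {"bread", "milk", "diaper"},
                {"bread", "milk", "beer"}]
min_support, min_conf, n = 0.4, 0.7, len(transactions)

def support(itemset):
    return sum(itemset <= t for t in transactions) / n

# Level-wise search: frequent 1-itemsets, then join level k-1 into level k.
items = {frozenset([i]) for t in transactions for i in t}
level = {i for i in items if support(i) >= min_support}
frequent, k = {}, 1
while level:
    frequent.update({s: support(s) for s in level})
    k += 1
    level = {a | b for a in level for b in level
             if len(a | b) == k and support(a | b) >= min_support}

# Rules X -> Y: confidence = sup(X u Y) / sup(X), lift = conf / sup(Y).
for itemset, sup in frequent.items():
    for r in range(1, len(itemset)):
        for lhs in map(frozenset, combinations(itemset, r)):
            conf = sup / frequent[lhs]
            if conf >= min_conf:
                lift = conf / frequent[itemset - lhs]
                print(set(lhs), "->", set(itemset - lhs),
                      f"sup={sup:.2f} conf={conf:.2f} lift={lift:.2f}")
```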
8. FP-Growth Algorithm Implementation
- Build FP-Tree from transactional database
- Mine frequent patterns without candidate generation
- Compare efficiency with Apriori on large datasets
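A short sketch assuming the third-party mlxtend package is installed; its fpgrowth mines the same frequent itemsets as its apriori, so the two calls can be compared directly:

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth, apriori

transactions = [["bread", "milk"], ["bread", "diaper", "beer"],
                ["milk", "diaper", "beer"], ["bread", "milk", "diaper"]]

# One-hot encode the transactions into a boolean DataFrame.
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)

# FP-Growth builds an FP-tree and mines conditional trees recursively,
# with no candidate-generation step.
patterns = fpgrowth(onehot, min_support=0.5, use_colnames=True)
print(patterns)
```

On a large dataset, timing both fpgrowth and apriori on the same one-hot frame (e.g. with %timeit in a notebook) makes the efficiency comparison concrete.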
9. Decision Tree Induction
- Implement ID3 or C4.5 algorithm
- Build tree using entropy/information gain
- Prune tree to avoid overfitting
→ Use Weka’s J48 or sklearn’s DecisionTreeClassifier for validation
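A minimal validation sketch with sklearn: criterion="entropy" gives information-gain splits in the spirit of ID3/C4.5, and ccp_alpha turns on cost-complexity pruning (the iris dataset and the alpha value are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Entropy-based splits; a nonzero ccp_alpha prunes low-value branches.
tree = DecisionTreeClassifier(criterion="entropy", ccp_alpha=0.01,
                              random_state=0).fit(X_tr, y_tr)
print(export_text(tree))
print("test accuracy:", tree.score(X_te, y_te))
```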
10. Calculating Information Gain Measures
- Compute Entropy, Information Gain, Gain Ratio
- Use for attribute selection in decision trees
- Manual calculation + programmatic implementation
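A direct numpy implementation of the three measures; the weather-style toy arrays are small enough to verify the program output by hand:

```python
import numpy as np

def entropy(labels):
    """H(S) = -sum p_i * log2(p_i) over the class proportions p_i."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def info_gain(labels, attribute):
    """Gain(S, A) = H(S) - sum_v |S_v|/|S| * H(S_v)."""
    total = entropy(labels)
    for v in np.unique(attribute):
        subset = labels[attribute == v]
        total -= len(subset) / len(labels) * entropy(subset)
    return total

def gain_ratio(labels, attribute):
    """C4.5's gain ratio: gain normalized by the split information."""
    return info_gain(labels, attribute) / entropy(attribute)

# Toy data: does 'windy' predict 'play'? (perfectly, so gain = 1 bit)
play = np.array(["yes", "yes", "no", "no", "yes", "no"])
windy = np.array(["F", "F", "T", "T", "F", "T"])
print(entropy(play), info_gain(play, windy), gain_ratio(play, windy))
```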
11. Classification using Bayesian Approach
- Implement Naïve Bayes Classifier
- Handle categorical and continuous features
- Apply Laplace smoothing for zero-probability handling
→ Validate using Weka’s NaiveBayes or sklearn
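A from-scratch categorical Naïve Bayes sketch with Laplace smoothing (the training table is invented); continuous features would instead use, e.g., a Gaussian likelihood per class:

```python
import numpy as np
import pandas as pd

# Toy categorical training data (hypothetical).
df = pd.DataFrame({"outlook": ["sunny", "sunny", "rain", "rain", "overcast"],
                   "windy":   ["T", "F", "T", "F", "F"],
                   "play":    ["no", "yes", "no", "yes", "yes"]})
features, target = ["outlook", "windy"], "play"

def predict(row, alpha=1.0):
    """Pick the class maximizing log P(c) + sum_i log P(x_i | c)."""
    scores = {}
    for c, group in df.groupby(target):
        log_p = np.log(len(group) / len(df))      # prior P(c)
        for f in features:
            count = (group[f] == row[f]).sum()
            k = df[f].nunique()                   # distinct values of f
            # Laplace smoothing avoids zero probability for unseen pairs.
            log_p += np.log((count + alpha) / (len(group) + alpha * k))
        scores[c] = log_p
    return max(scores, key=scores.get)

print(predict({"outlook": "sunny", "windy": "T"}))
```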
12. Classification using K-Nearest Neighbour (KNN)
- Implement KNN from scratch (Euclidean/Manhattan distance)
- Tune ‘k’ using cross-validation
- Analyze impact of distance metrics and scaling
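A from-scratch KNN sketch supporting both distance metrics; the toy points are invented, and the features are standardized first to show why scaling matters when distances mix units:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3, metric="euclidean"):
    """Classify x by majority vote among its k nearest training points."""
    diff = X_train - x
    if metric == "euclidean":
        dist = np.sqrt((diff ** 2).sum(axis=1))
    else:                                     # Manhattan (L1) distance
        dist = np.abs(diff).sum(axis=1)
    nearest = np.argsort(dist)[:k]
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy 2-D data: two tight classes; standardize before measuring distance.
X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.1, 4.9]])
y = np.array(["A", "A", "B", "B"])
mu, sigma = X.mean(axis=0), X.std(axis=0)
query = (np.array([1.1, 0.9]) - mu) / sigma
print(knn_predict((X - mu) / sigma, y, query, k=3))
```

Tuning k is then a loop over candidate values, scoring each with cross-validation on held-out folds.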
13. K-Means Clustering Algorithm
- Implement centroid-based clustering
- Initialize centroids (random/k-means++)
- Evaluate with Silhouette Score, Elbow Method
→ Compare with sklearn.cluster.KMeans
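A short sklearn sketch on synthetic blobs: printing inertia (within-cluster SSE) for several k values gives the numbers behind an elbow plot, alongside the silhouette score:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Two well-separated synthetic blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])

# k-means++ initialization; inertia should show an elbow at k = 2.
for k in range(2, 6):
    km = KMeans(n_clusters=k, init="k-means++", n_init=10,
                random_state=0).fit(X)
    print(k, "inertia:", round(km.inertia_, 1),
          "silhouette:", round(silhouette_score(X, km.labels_), 3))
```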
14. BIRCH Algorithm Implementation
- Build CF (Clustering Feature) Tree
- Handle large datasets with limited memory
- Multi-phase clustering: scanning, condensing, global clustering
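A minimal sketch with sklearn.cluster.Birch on synthetic data: threshold bounds each subcluster's radius in the CF tree, branching_factor bounds CF-node size, and n_clusters runs the final global clustering over the tree's leaves:

```python
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (500, 2)), rng.normal(5, 0.5, (500, 2))])

# Build the CF tree incrementally, then cluster its leaf entries globally.
birch = Birch(threshold=0.5, branching_factor=50, n_clusters=2).fit(X)
print(np.bincount(birch.labels_))
```

When the dataset does not fit in memory, Birch's partial_fit can absorb the data in chunks, which is the scenario this experiment targets.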
15. PAM (Partitioning Around Medoids) Algorithm
- Implement k-medoids clustering
- Use real medoids (actual data points) instead of centroids
- Compare robustness to noise vs. K-Means
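Core scikit-learn ships no k-medoids (the third-party scikit-learn-extra package does), so below is a minimal from-scratch PAM sketch on invented points: random initialization plus the greedy swap phase; a fuller implementation would add the standard BUILD step:

```python
import numpy as np
from itertools import product

def pam(X, k, random_state=0):
    """Basic PAM: start from random medoids, then greedily apply the best
    medoid/non-medoid swap while the total assignment cost decreases."""
    rng = np.random.default_rng(random_state)
    D = np.abs(X[:, None] - X[None, :]).sum(axis=2)  # Manhattan distances
    medoids = list(rng.choice(len(X), size=k, replace=False))
    cost = D[:, medoids].min(axis=1).sum()
    improved = True
    while improved:
        improved = False
        for m, o in product(range(k), range(len(X))):
            if o in medoids:
                continue
            trial = medoids.copy()
            trial[m] = o
            trial_cost = D[:, trial].min(axis=1).sum()
            if trial_cost < cost:
                medoids, cost, improved = trial, trial_cost, True
    return np.array(medoids), D[:, medoids].argmin(axis=1)

X = np.array([[1.0, 1.0], [1.5, 1.2], [1.2, 0.8],
              [5.0, 5.0], [5.5, 5.2], [5.2, 4.8]])
medoids, labels = pam(X, k=2)
print("medoids (actual data points):", X[medoids], "labels:", labels)
```

Because medoids are actual data points minimizing total dissimilarity, they shift less under noise than k-means centroids, which average in every outlier.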
16. DBSCAN Algorithm Implementation
- Density-based clustering
- Identify core, border, and noise points
- Tune eps and min_samples parameters
- Handle clusters of arbitrary shapes and outliers
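A minimal sketch with sklearn.cluster.DBSCAN on the classic two-moons shape, recovering the core/border/noise split from labels_ and core_sample_indices_ (the eps and min_samples values are illustrative):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: non-convex shapes k-means cannot separate.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)

# labels_ == -1 marks noise; core_sample_indices_ lists the core points;
# the remaining clustered points are border points.
core = np.zeros(len(X), dtype=bool)
core[db.core_sample_indices_] = True
noise = db.labels_ == -1
border = ~core & ~noise
print("clusters:", len(set(db.labels_)) - (1 if noise.any() else 0))
print("core:", core.sum(), "border:", border.sum(), "noise:", noise.sum())
```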