Tools: Weka / Pentaho / Python
LIST OF EXPERIMENTS
1. Data Processing Techniques
- (i) Data Cleaning
→ Handle missing values, remove duplicates, correct inconsistencies
- (ii) Data Transformation – Normalization
→ Min-Max, Z-Score, Decimal Scaling normalization techniques
- (iii) Data Integration
→ Merge datasets from multiple sources, resolve schema/entity conflicts
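A minimal Python sketch of the three normalization techniques, using pandas/numpy on a made-up column (the values are illustrative only):

```python
import numpy as np
import pandas as pd

# Toy numeric column (hypothetical values for illustration).
x = pd.Series([12.0, 45.0, 7.0, 88.0, 33.0])

# Min-Max normalization: rescale to [0, 1].
min_max = (x - x.min()) / (x.max() - x.min())

# Z-Score normalization: zero mean, unit standard deviation.
z_score = (x - x.mean()) / x.std()

# Decimal scaling: divide by 10^j, where j is the smallest integer
# such that max(|x / 10^j|) < 1.
j = int(np.ceil(np.log10(x.abs().max())))
decimal_scaled = x / (10 ** j)

print(pd.DataFrame({"raw": x, "min_max": min_max,
                    "z_score": z_score, "decimal": decimal_scaled}))
```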
2. Partitioning Methods
- Horizontal Partitioning
- Vertical Partitioning
- Round Robin Partitioning
- Hash-Based Partitioning
→ Compare performance, use cases, and efficiency of each method
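The four strategies can be prototyped on a single pandas DataFrame before measuring them on a real database; the table, key column, and partition count below are hypothetical:

```python
import pandas as pd

# Hypothetical customer table to partition four ways.
df = pd.DataFrame({"id": range(1, 9),
                   "region": ["N", "S", "E", "W"] * 2,
                   "sales": [10, 20, 30, 40, 50, 60, 70, 80]})
n_parts = 2

# Horizontal: split rows by a predicate (here, by region value).
horizontal = {r: g for r, g in df.groupby("region")}

# Vertical: split by column groups, repeating the key in each fragment.
vertical = [df[["id", "region"]], df[["id", "sales"]]]

# Round robin: row i goes to partition i mod n_parts.
round_robin = [df.iloc[i::n_parts] for i in range(n_parts)]

# Hash-based: partition by a hash of the key (modulo stands in here).
hash_based = [g for _, g in df.groupby(df["id"].mod(n_parts))]
```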
3. Data Warehouse Schemas
- Design and implement:
- Star Schema
- Snowflake Schema
- Fact Constellation (Galaxy) Schema
→ Analyze trade-offs: complexity vs. query performance vs. storage
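A small star-schema sketch in pandas (table names and values are invented for illustration); a snowflake variant would further normalize dim_product by moving category into its own table:

```python
import pandas as pd

# Star schema: one fact table referencing denormalized dimension tables.
dim_product = pd.DataFrame({"product_id": [1, 2],
                            "name": ["Pen", "Book"],
                            "category": ["Stationery", "Media"]})
dim_time = pd.DataFrame({"time_id": [1, 2], "year": [2023, 2024]})
fact_sales = pd.DataFrame({"product_id": [1, 2, 1],
                           "time_id": [1, 1, 2],
                           "amount": [100, 250, 120]})

# A typical star-join query: total sales per category per year.
report = (fact_sales
          .merge(dim_product, on="product_id")
          .merge(dim_time, on="time_id")
          .groupby(["category", "year"])["amount"].sum())
print(report)
```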
4. Data Cube Construction & OLAP Operations
- Build multidimensional data cubes
- Perform OLAP operations:
- Roll-up, Drill-down
- Slice, Dice
- Pivot (Rotate)
→ Use tools like Pandas (Python), Saiku, or Pentaho for visualization
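A minimal sketch of cube construction and the OLAP operations using pandas (the sales records are synthetic):

```python
import pandas as pd

# Hypothetical sales records with three dimensions and one measure.
sales = pd.DataFrame({"region": ["N", "N", "S", "S"],
                      "year": [2023, 2024, 2023, 2024],
                      "product": ["Pen", "Pen", "Book", "Book"],
                      "amount": [100, 150, 200, 250]})

# Data cube via pivot_table: region x year, measure = sum(amount).
cube = sales.pivot_table(values="amount", index="region",
                         columns="year", aggfunc="sum")

# Roll-up: aggregate away the year dimension.
rollup = sales.groupby("region")["amount"].sum()

# Drill-down: the reverse, adding product back for finer granularity.
drilldown = sales.groupby(["region", "year", "product"])["amount"].sum()

# Slice: fix one dimension (year == 2023).
slice_2023 = sales[sales["year"] == 2023]

# Dice: select a sub-cube on two or more dimensions.
dice = sales[(sales["year"] == 2023) & (sales["region"] == "N")]

# Pivot (rotate): swap the row and column axes.
pivoted = cube.T
```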
5. ETL Operations (Extract, Transform, Load)
- Extract data from heterogeneous sources (CSV, SQL, APIs)
- Transform: Clean, aggregate, derive attributes
- Load into target warehouse or data mart
→ Implement using Pentaho Data Integration (Kettle) or Python (Pandas + SQLAlchemy)
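One possible pipeline with pandas + SQLAlchemy, as the experiment suggests; the CSV content is inlined so the sketch is self-contained, and SQLite stands in for the target warehouse:

```python
import io
import pandas as pd
from sqlalchemy import create_engine

# Extract: read a CSV source (inlined here to keep the sketch runnable).
csv_source = io.StringIO(
    "date,store,amount\n2024-01-03,A,100\n2024-01-15,A,\n"
    "2024-01-20,B,250\n2024-02-02,A,120\n2024-01-20,B,250\n")
raw = pd.read_csv(csv_source)

# Transform: clean (drop nulls and duplicates), derive month, aggregate.
raw = raw.dropna(subset=["amount"]).drop_duplicates()
raw["month"] = pd.to_datetime(raw["date"]).dt.to_period("M").astype(str)
monthly = raw.groupby(["store", "month"], as_index=False)["amount"].sum()

# Load: write into a warehouse table (SQLite stands in for the target DB).
engine = create_engine("sqlite:///warehouse.db")
monthly.to_sql("fact_monthly_sales", engine, if_exists="replace", index=False)
```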
6. Attribute-Oriented Induction Algorithm
- Generalize data by attribute removal/abstraction
- Concept hierarchy generation
- Generate characteristic rules from relational data
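A compact pandas interpretation of the induction steps; the relation, concept hierarchy, and bins below are all invented for illustration:

```python
import pandas as pd

# Toy relation and a hand-made concept hierarchy (both hypothetical).
df = pd.DataFrame({"name": ["Ann", "Bob", "Cal", "Dee"],
                   "major": ["Physics", "Math", "Biology", "Chemistry"],
                   "gpa": [3.9, 3.6, 2.4, 2.1]})
hierarchy = {"Physics": "Science", "Math": "Science",
             "Biology": "Science", "Chemistry": "Science"}

# Attribute removal: drop attributes with no higher-level concept and
# too many distinct values to generalize (e.g. 'name').
df = df.drop(columns=["name"])

# Attribute generalization: climb the hierarchy / discretize numerics.
df["major"] = df["major"].map(hierarchy)
df["gpa"] = pd.cut(df["gpa"], bins=[0, 3.0, 4.0],
                   labels=["low", "high"]).astype(str)

# Merge identical generalized tuples, keeping a count (the vote).
prime = df.value_counts().reset_index(name="count")
print(prime)  # reads off as a characteristic rule per generalized tuple
```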
7. Apriori Algorithm Implementation
- Mine frequent itemsets
- Generate association rules (Support, Confidence, Lift)
- Analyze market basket datasets (e.g., retail transactions)
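A from-scratch Apriori sketch on a toy basket dataset (transactions and thresholds are made up): a level-wise search for frequent itemsets, then rule generation with support, confidence, and lift:

```python
from itertools import combinations

# Toy market-basket transactions (hypothetical).
transactions = [{"bread", "milk"}, {"bread", "diaper", "beer"},
                {"milk", "diaper", "beer"}, {"bread", "milk", "diaper"},
                {"bread", "milk", "beer"}]
min_support, min_conf, n = 0.4, 0.7, len(transactions)

def support(itemset):
    return sum(itemset <= t for t in transactions) / n

# Level-wise search: frequent 1-itemsets, then join level k-1 into level k.
items = {frozenset([i]) for t in transactions for i in t}
level = {i for i in items if support(i) >= min_support}
frequent, k = {}, 1
while level:
    frequent.update({s: support(s) for s in level})
    k += 1
    level = {a | b for a in level for b in level
             if len(a | b) == k and support(a | b) >= min_support}

# Rules X -> Y: confidence = sup(X u Y) / sup(X), lift = conf / sup(Y).
for itemset, sup in frequent.items():
    for r in range(1, len(itemset)):
        for lhs in map(frozenset, combinations(itemset, r)):
            conf = sup / frequent[lhs]
            if conf >= min_conf:
                lift = conf / frequent[itemset - lhs]
                print(set(lhs), "->", set(itemset - lhs),
                      f"sup={sup:.2f} conf={conf:.2f} lift={lift:.2f}")
```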
8. FP-Growth Algorithm Implementation
- Build FP-Tree from transactional database
- Mine frequent patterns without candidate generation
- Compare efficiency with Apriori on large datasets
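A short sketch assuming the third-party mlxtend package is installed; its fpgrowth mines the same frequent itemsets as its apriori, so the two calls can be compared directly:

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth, apriori

transactions = [["bread", "milk"], ["bread", "diaper", "beer"],
                ["milk", "diaper", "beer"], ["bread", "milk", "diaper"]]

# One-hot encode the transactions into a boolean DataFrame.
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)

# FP-Growth builds an FP-tree and mines conditional trees recursively,
# with no candidate-generation step.
patterns = fpgrowth(onehot, min_support=0.5, use_colnames=True)
print(patterns)
```

On a large dataset, timing both fpgrowth and apriori on the same one-hot frame (e.g. with %timeit in a notebook) makes the efficiency comparison concrete.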
9. Decision Tree Induction
- Implement ID3 or C4.5 algorithm
- Build tree using entropy/information gain
- Prune tree to avoid overfitting
→ Use Weka’s J48 or sklearn’s DecisionTreeClassifier for validation
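A minimal validation sketch with sklearn: criterion="entropy" gives information-gain splits in the spirit of ID3/C4.5, and ccp_alpha turns on cost-complexity pruning (the iris dataset and the alpha value are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Entropy-based splits; a nonzero ccp_alpha prunes low-value branches.
tree = DecisionTreeClassifier(criterion="entropy", ccp_alpha=0.01,
                              random_state=0).fit(X_tr, y_tr)
print(export_text(tree))
print("test accuracy:", tree.score(X_te, y_te))
```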
10. Calculating Information Gain Measures
- Compute Entropy, Information Gain, Gain Ratio
- Use for attribute selection in decision trees
- Manual calculation + programmatic implementation
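A direct numpy implementation of the three measures; the weather-style toy arrays are small enough to verify the program output by hand:

```python
import numpy as np

def entropy(labels):
    """H(S) = -sum p_i * log2(p_i) over the class proportions p_i."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def info_gain(labels, attribute):
    """Gain(S, A) = H(S) - sum_v |S_v|/|S| * H(S_v)."""
    total = entropy(labels)
    for v in np.unique(attribute):
        subset = labels[attribute == v]
        total -= len(subset) / len(labels) * entropy(subset)
    return total

def gain_ratio(labels, attribute):
    """C4.5's gain ratio: gain normalized by the split information."""
    return info_gain(labels, attribute) / entropy(attribute)

# Toy data: does 'windy' predict 'play'? (perfectly, so gain = 1 bit)
play = np.array(["yes", "yes", "no", "no", "yes", "no"])
windy = np.array(["F", "F", "T", "T", "F", "T"])
print(entropy(play), info_gain(play, windy), gain_ratio(play, windy))
```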
11. Classification using Bayesian Approach
- Implement Naïve Bayes Classifier
- Handle categorical and continuous features
- Apply Laplace smoothing for zero-probability handling
→ Validate using Weka’s NaiveBayes or sklearn
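A from-scratch categorical Naïve Bayes sketch with Laplace smoothing (the training table is invented); continuous features would instead use, e.g., a Gaussian likelihood per class:

```python
import numpy as np
import pandas as pd

# Toy categorical training data (hypothetical).
df = pd.DataFrame({"outlook": ["sunny", "sunny", "rain", "rain", "overcast"],
                   "windy":   ["T", "F", "T", "F", "F"],
                   "play":    ["no", "yes", "no", "yes", "yes"]})
features, target = ["outlook", "windy"], "play"

def predict(row, alpha=1.0):
    """Pick the class maximizing log P(c) + sum_i log P(x_i | c)."""
    scores = {}
    for c, group in df.groupby(target):
        log_p = np.log(len(group) / len(df))      # prior P(c)
        for f in features:
            count = (group[f] == row[f]).sum()
            k = df[f].nunique()                   # distinct values of f
            # Laplace smoothing avoids zero probability for unseen pairs.
            log_p += np.log((count + alpha) / (len(group) + alpha * k))
        scores[c] = log_p
    return max(scores, key=scores.get)

print(predict({"outlook": "sunny", "windy": "T"}))
```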
12. Classification using K-Nearest Neighbour (KNN)
- Implement KNN from scratch (Euclidean/Manhattan distance)
- Tune ‘k’ using cross-validation
- Analyze impact of distance metrics and scaling
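A from-scratch KNN sketch supporting both distance metrics; the toy points are invented, and the features are standardized first to show why scaling matters when distances mix units:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3, metric="euclidean"):
    """Classify x by majority vote among its k nearest training points."""
    diff = X_train - x
    if metric == "euclidean":
        dist = np.sqrt((diff ** 2).sum(axis=1))
    else:                                     # Manhattan (L1) distance
        dist = np.abs(diff).sum(axis=1)
    nearest = np.argsort(dist)[:k]
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy 2-D data: two tight classes; standardize before measuring distance.
X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.1, 4.9]])
y = np.array(["A", "A", "B", "B"])
mu, sigma = X.mean(axis=0), X.std(axis=0)
query = (np.array([1.1, 0.9]) - mu) / sigma
print(knn_predict((X - mu) / sigma, y, query, k=3))
```

Tuning k is then a loop over candidate values, scoring each with cross-validation on held-out folds.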
13. K-Means Clustering Algorithm
- Implement centroid-based clustering
- Initialize centroids (random/k-means++)
- Evaluate with Silhouette Score, Elbow Method
→ Compare with sklearn.cluster.KMeans
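A short sklearn sketch on synthetic blobs: printing inertia (within-cluster SSE) for several k values gives the numbers behind an elbow plot, alongside the silhouette score:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Two well-separated synthetic blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])

# k-means++ initialization; inertia should show an elbow at k = 2.
for k in range(2, 6):
    km = KMeans(n_clusters=k, init="k-means++", n_init=10,
                random_state=0).fit(X)
    print(k, "inertia:", round(km.inertia_, 1),
          "silhouette:", round(silhouette_score(X, km.labels_), 3))
```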
14. BIRCH Algorithm Implementation
- Build CF (Clustering Feature) Tree
- Handle large datasets with limited memory
- Multi-phase clustering: scanning, condensing, global clustering
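A minimal sketch with sklearn.cluster.Birch on synthetic data: threshold bounds each subcluster's radius in the CF tree, branching_factor bounds CF-node size, and n_clusters runs the final global clustering over the tree's leaves:

```python
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (500, 2)), rng.normal(5, 0.5, (500, 2))])

# Build the CF tree incrementally, then cluster its leaf entries globally.
birch = Birch(threshold=0.5, branching_factor=50, n_clusters=2).fit(X)
print(np.bincount(birch.labels_))
```

When the dataset does not fit in memory, Birch's partial_fit can absorb the data in chunks, which is the scenario this experiment targets.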
15. PAM (Partitioning Around Medoids) Algorithm
- Implement k-medoids clustering
- Use real medoids (actual data points) instead of centroids
- Compare robustness to noise vs. K-Means
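Core scikit-learn ships no k-medoids (the third-party scikit-learn-extra package does), so below is a minimal from-scratch PAM sketch on invented points: random initialization plus the greedy swap phase; a fuller implementation would add the standard BUILD step:

```python
import numpy as np
from itertools import product

def pam(X, k, random_state=0):
    """Basic PAM: start from random medoids, then greedily apply the best
    medoid/non-medoid swap while the total assignment cost decreases."""
    rng = np.random.default_rng(random_state)
    D = np.abs(X[:, None] - X[None, :]).sum(axis=2)  # Manhattan distances
    medoids = list(rng.choice(len(X), size=k, replace=False))
    cost = D[:, medoids].min(axis=1).sum()
    improved = True
    while improved:
        improved = False
        for m, o in product(range(k), range(len(X))):
            if o in medoids:
                continue
            trial = medoids.copy()
            trial[m] = o
            trial_cost = D[:, trial].min(axis=1).sum()
            if trial_cost < cost:
                medoids, cost, improved = trial, trial_cost, True
    return np.array(medoids), D[:, medoids].argmin(axis=1)

X = np.array([[1.0, 1.0], [1.5, 1.2], [1.2, 0.8],
              [5.0, 5.0], [5.5, 5.2], [5.2, 4.8]])
medoids, labels = pam(X, k=2)
print("medoids (actual data points):", X[medoids], "labels:", labels)
```

Because medoids are actual data points minimizing total dissimilarity, they shift less under noise than k-means centroids, which average in every outlier.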
16. DBSCAN Algorithm Implementation
- Density-based clustering
- Identify core, border, and noise points
- Tune eps and min_samples parameters
- Handle clusters of arbitrary shapes and outliers
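A minimal sketch with sklearn.cluster.DBSCAN on the classic two-moons shape, recovering the core/border/noise split from labels_ and core_sample_indices_ (the eps and min_samples values are illustrative):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: non-convex shapes k-means cannot separate.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)

# labels_ == -1 marks noise; core_sample_indices_ lists the core points;
# the remaining clustered points are border points.
core = np.zeros(len(X), dtype=bool)
core[db.core_sample_indices_] = True
noise = db.labels_ == -1
border = ~core & ~noise
print("clusters:", len(set(db.labels_)) - (1 if noise.any() else 0))
print("core:", core.sum(), "border:", border.sum(), "noise:", noise.sum())
```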