Tools: Weka / Pentaho / Python


LIST OF EXPERIMENTS

1. Data Preprocessing Techniques

  • (i) Data Cleaning
    → Handle missing values, remove duplicates, correct inconsistencies
  • (ii) Data Transformation – Normalization
    → Min-Max, Z-Score, Decimal Scaling normalization techniques
  • (iii) Data Integration
    → Merge datasets from multiple sources, resolve schema/entity conflicts
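
  Sketch (Python/pandas) – a minimal illustration of all three techniques; the file,
  column, and key names are placeholders for your own dataset:

      import pandas as pd

      df = pd.read_csv("students.csv")                  # hypothetical source

      # (i) Cleaning: fill missing numeric values, drop exact duplicates
      df["marks"] = df["marks"].fillna(df["marks"].median())
      df = df.drop_duplicates()

      # (ii) Transformation: three normalization techniques on one column
      x = df["marks"]
      df["minmax"]  = (x - x.min()) / (x.max() - x.min())     # Min-Max -> [0, 1]
      df["zscore"]  = (x - x.mean()) / x.std()                # Z-Score
      df["decimal"] = x / 10 ** len(str(int(x.abs().max())))  # Decimal Scaling

      # (iii) Integration: merge a second source on a shared key; pandas
      # surfaces column-name (schema) conflicts via suffixes for resolution
      other = pd.read_csv("attendance.csv")
      merged = df.merge(other, on="student_id", how="inner")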

2. Partitioning Methods

  • Horizontal Partitioning
  • Vertical Partitioning
  • Round Robin Partitioning
  • Hash-Based Partitioning
    → Compare the performance and typical use cases of each method
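
  Sketch (Python/pandas) – the four partitioning strategies applied to one table;
  the table and key column are placeholders:

      import numpy as np
      import pandas as pd

      df = pd.read_csv("sales.csv")    # hypothetical table
      n = 4                            # number of partitions

      # Horizontal: contiguous row ranges
      horizontal = np.array_split(df, n)

      # Vertical: column projections that share the key
      vertical = [df[["order_id", "customer"]], df[["order_id", "amount"]]]

      # Round robin: row i goes to partition i mod n
      round_robin = [df.iloc[i::n] for i in range(n)]

      # Hash-based: bucket rows by a hash of the key column
      buckets = pd.util.hash_pandas_object(df["order_id"]) % n
      hashed = [df[buckets == i] for i in range(n)]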

3. Data Warehouse Schemas

  • Design and implement:
    • Star Schema
    • Snowflake Schema
    • Fact Constellation (Galaxy) Schema
      → Analyze trade-offs: complexity vs. query performance vs. storage
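
  Sketch (Python/sqlite3) – a star schema in DDL; sqlite3 and all table/column
  names are illustrative choices, not prescribed by the manual. Normalizing
  dim_product into separate product and category tables would turn this star into
  a snowflake; adding a second fact table over the same dimensions would give a
  fact constellation:

      import sqlite3

      con = sqlite3.connect("warehouse.db")
      cur = con.cursor()

      # Star schema: denormalized dimension tables around one fact table
      cur.execute("""CREATE TABLE dim_time (
          time_id INTEGER PRIMARY KEY, day INTEGER, month TEXT, year INTEGER)""")
      cur.execute("""CREATE TABLE dim_product (
          product_id INTEGER PRIMARY KEY, name TEXT, category TEXT)""")
      cur.execute("""CREATE TABLE fact_sales (
          time_id INTEGER REFERENCES dim_time(time_id),
          product_id INTEGER REFERENCES dim_product(product_id),
          units_sold INTEGER, revenue REAL)""")
      con.commit()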

4. Data Cube Construction & OLAP Operations

  • Build multidimensional data cubes
  • Perform OLAP operations:
    • Roll-up, Drill-down
    • Slice, Dice
    • Pivot (Rotate)
      → Use tools like Pandas (Python), Saiku, or Pentaho for visualization
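
  Sketch (Python/pandas) – a tiny cube plus the OLAP operations; the sales data
  is toy data made up for the example:

      import pandas as pd

      sales = pd.DataFrame({
          "year":    [2023, 2023, 2024, 2024],
          "region":  ["East", "West", "East", "West"],
          "product": ["A", "B", "A", "B"],
          "revenue": [100, 150, 120, 180]})

      # Cube: one cell per (year, region, product)
      cube = sales.pivot_table(values="revenue", index=["year", "region"],
                               columns="product", aggfunc="sum")

      rollup = cube.groupby(level="year").sum()   # Roll-up: region -> year
      # Drill-down is the inverse: return to the finer (year, region) cube
      slice_ = sales[sales["year"] == 2024]       # Slice: fix one dimension
      dice = sales[(sales["year"] == 2024) &
                   (sales["region"] == "East")]   # Dice: fix several
      pivot = cube.T                              # Pivot: rotate the axes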

5. ETL Operations (Extract, Transform, Load)

  • Extract data from heterogeneous sources (CSV, SQL, APIs)
  • Transform: Clean, aggregate, derive attributes
  • Load into target warehouse or data mart
    → Implement using Pentaho Data Integration (Kettle) or Python (Pandas + SQLAlchemy)
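
  Sketch (Python, pandas + SQLAlchemy) – one pass through all three ETL stages;
  the file name, API endpoint, and column names are hypothetical:

      import pandas as pd
      from sqlalchemy import create_engine

      # Extract: heterogeneous sources
      csv_part = pd.read_csv("orders.csv")
      api_part = pd.read_json("https://example.com/orders.json")

      # Transform: clean, derive an attribute, aggregate
      df = pd.concat([csv_part, api_part], ignore_index=True).drop_duplicates()
      df["total"] = df["quantity"] * df["unit_price"]
      daily = df.groupby("order_date", as_index=False)["total"].sum()

      # Load: write the result into the target warehouse table
      engine = create_engine("sqlite:///warehouse.db")
      daily.to_sql("fact_daily_sales", engine, if_exists="replace", index=False)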

6. Attribute-Oriented Induction Algorithm

  • Generalize data by attribute removal/abstraction
  • Concept hierarchy generation
  • Generate characteristic rules from relational data
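
  Sketch (Python/pandas) – the core generalize-and-merge step of AOI on a toy
  relation; the hand-built city → state hierarchy stands in for hierarchies that
  would normally come from metadata:

      import pandas as pd

      # Concept hierarchy: city -> state (hand-built for the example)
      hierarchy = {"Chennai": "Tamil Nadu", "Madurai": "Tamil Nadu",
                   "Mumbai": "Maharashtra"}

      df = pd.DataFrame({"city":   ["Chennai", "Madurai", "Mumbai", "Chennai"],
                         "degree": ["B.E", "B.E", "M.E", "B.E"]})

      # Attribute abstraction: climb the hierarchy one level; an attribute with
      # many distinct values and no hierarchy would instead be removed
      df["city"] = df["city"].map(hierarchy)

      # Merge identical generalized tuples; the counts feed characteristic rules
      generalized = df.value_counts().reset_index(name="count")
      print(generalized)   # e.g. (Tamil Nadu, B.E) covers 3 of 4 tuples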

7. Apriori Algorithm Implementation

  • Mine frequent itemsets
  • Generate association rules (Support, Confidence, Lift)
  • Analyze market basket datasets (e.g., retail transactions)
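
  Sketch (Python) – a from-scratch level-wise Apriori on a toy basket database;
  the classic subset-based prune step is omitted for brevity, and the support and
  confidence thresholds are arbitrary:

      from itertools import combinations

      transactions = [{"bread", "milk"}, {"bread", "butter"},
                      {"milk", "butter"}, {"bread", "milk", "butter"}]
      n, min_support = len(transactions), 0.5

      def support(itemset):
          return sum(itemset <= t for t in transactions) / n

      # Level-wise search: frequent 1-itemsets, then join upward
      frequent = {}
      k_sets = [frozenset([i]) for t in transactions for i in t]
      while k_sets:
          k_sets = [s for s in set(k_sets) if support(s) >= min_support]
          frequent.update({s: support(s) for s in k_sets})
          k_sets = list({a | b for a in k_sets for b in k_sets
                         if len(a | b) == len(a) + 1})

      # Rules X -> Y with confidence and lift
      for s in (s for s in frequent if len(s) > 1):
          for r in range(1, len(s)):
              for x in map(frozenset, combinations(s, r)):
                  conf = frequent[s] / support(x)
                  if conf >= 0.7:
                      lift = conf / support(s - x)
                      print(set(x), "->", set(s - x),
                            f"conf={conf:.2f} lift={lift:.2f}")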

8. FP-Growth Algorithm Implementation

  • Build FP-Tree from transactional database
  • Mine frequent patterns without candidate generation
  • Compare efficiency with Apriori on large datasets
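
  Sketch (Python) – this uses the mlxtend library (an assumption; the manual
  names no package), which offers both algorithms behind one interface, so the
  timing comparison with Apriori is direct:

      import time
      import pandas as pd
      from mlxtend.preprocessing import TransactionEncoder
      from mlxtend.frequent_patterns import apriori, fpgrowth

      transactions = [["bread", "milk"], ["bread", "butter"],
                      ["milk", "butter"], ["bread", "milk", "butter"]]

      # One-hot encode the transactional database
      te = TransactionEncoder()
      onehot = pd.DataFrame(te.fit_transform(transactions), columns=te.columns_)

      # Same task, both algorithms; time them (meaningful differences only
      # appear on realistically large datasets)
      for algo in (apriori, fpgrowth):
          t0 = time.perf_counter()
          freq = algo(onehot, min_support=0.5, use_colnames=True)
          print(algo.__name__, f"{time.perf_counter() - t0:.4f}s")
          print(freq)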

9. Decision Tree Induction

  • Implement ID3 or C4.5 algorithm
  • Build tree using entropy/information gain
  • Prune the tree to avoid overfitting
    → Validate using Weka’s J48 or sklearn’s DecisionTreeClassifier
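
  Sketch (Python/sklearn) – DecisionTreeClassifier with the entropy criterion
  approximates ID3/C4.5-style induction; ccp_alpha enables cost-complexity
  pruning (the dataset and parameter values are illustrative):

      from sklearn.datasets import load_iris
      from sklearn.model_selection import train_test_split
      from sklearn.tree import DecisionTreeClassifier, export_text

      iris = load_iris()
      X_tr, X_te, y_tr, y_te = train_test_split(iris.data, iris.target,
                                                random_state=42)

      # criterion="entropy" -> information-gain splits;
      # ccp_alpha > 0 -> cost-complexity pruning against overfitting
      tree = DecisionTreeClassifier(criterion="entropy",
                                    ccp_alpha=0.01).fit(X_tr, y_tr)
      print("test accuracy:", tree.score(X_te, y_te))
      print(export_text(tree, feature_names=iris.feature_names))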

10. Calculating Information Gain Measures

  • Compute Entropy, Information Gain, Gain Ratio
  • Use for attribute selection in decision trees
  • Manual calculation + programmatic implementation
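
  Sketch (Python) – entropy, information gain, and gain ratio from first
  principles, checked on a fragment of the classic weather/play toy data:

      import numpy as np
      import pandas as pd

      def entropy(labels):
          p = labels.value_counts(normalize=True)
          return -(p * np.log2(p)).sum()

      def info_gain(df, attr, target):
          weighted = sum(len(g) / len(df) * entropy(g[target])
                         for _, g in df.groupby(attr))
          return entropy(df[target]) - weighted

      def gain_ratio(df, attr, target):
          split_info = entropy(df[attr])   # intrinsic information of the split
          return info_gain(df, attr, target) / split_info if split_info else 0.0

      df = pd.DataFrame({
          "outlook": ["sunny", "sunny", "overcast", "rain", "rain"],
          "play":    ["no", "no", "yes", "yes", "no"]})
      print(info_gain(df, "outlook", "play"),
            gain_ratio(df, "outlook", "play"))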

11. Classification using Bayesian Approach

  • Implement Naïve Bayes Classifier
  • Handle categorical and continuous features
  • Apply Laplace smoothing to handle zero probabilities
    → Validate using Weka’s NaiveBayes or sklearn
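
  Sketch (Python/sklearn) – GaussianNB handles continuous features directly; for
  the categorical case the features are first discretized and passed to
  CategoricalNB, whose alpha parameter is exactly Laplace smoothing (the dataset
  and bin count are illustrative):

      from sklearn.datasets import load_wine
      from sklearn.model_selection import cross_val_score
      from sklearn.naive_bayes import GaussianNB, CategoricalNB
      from sklearn.preprocessing import KBinsDiscretizer

      X, y = load_wine(return_X_y=True)

      # Continuous features: Gaussian likelihoods
      print("GaussianNB:", cross_val_score(GaussianNB(), X, y, cv=5).mean())

      # Categorical route: discretize, then alpha=1.0 applies Laplace smoothing
      # so unseen (feature, class) pairs never get zero probability
      X_cat = KBinsDiscretizer(n_bins=4, encode="ordinal",
                               strategy="quantile").fit_transform(X).astype(int)
      nb = CategoricalNB(alpha=1.0, min_categories=4)
      print("CategoricalNB:", cross_val_score(nb, X_cat, y, cv=5).mean())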

12. Classification using K-Nearest Neighbour (KNN)

  • Implement KNN from scratch (Euclidean/Manhattan distance)
  • Tune ‘k’ using cross-validation
  • Analyze impact of distance metrics and scaling
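
  Sketch (Python) – KNN from scratch with both distance metrics; the toy points
  are made up, and in a real run the features should be scaled first:

      import numpy as np
      from collections import Counter

      def knn_predict(X_train, y_train, x, k=3, metric="euclidean"):
          # Distance from the query point to every training row
          if metric == "euclidean":
              d = np.sqrt(((X_train - x) ** 2).sum(axis=1))
          else:                                   # manhattan
              d = np.abs(X_train - x).sum(axis=1)
          # Majority vote among the k nearest neighbours
          nearest = y_train[np.argsort(d)[:k]]
          return Counter(nearest).most_common(1)[0][0]

      X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
      y = np.array([0, 0, 0, 1, 1, 1])
      print(knn_predict(X, y, np.array([2, 2]), k=3))                       # 0
      print(knn_predict(X, y, np.array([7, 8]), k=3, metric="manhattan"))   # 1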

13. K-Means Clustering Algorithm

  • Implement centroid-based clustering
  • Initialize centroids (random/k-means++)
  • Evaluate with the Silhouette Score and the Elbow Method
    → Compare with sklearn.cluster.KMeans
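
  Sketch (Python/sklearn) – Elbow (inertia) and Silhouette evaluation over a
  range of k on synthetic blobs; the data and the k range are illustrative:

      import numpy as np
      from sklearn.cluster import KMeans
      from sklearn.metrics import silhouette_score

      rng = np.random.default_rng(0)
      X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])

      # Inertia falls as k grows (elbow method); silhouette peaks at a good k
      for k in range(2, 6):
          km = KMeans(n_clusters=k, init="k-means++", n_init=10,
                      random_state=0).fit(X)
          print(k, round(km.inertia_, 1),
                round(silhouette_score(X, km.labels_), 3))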

14. BIRCH Algorithm Implementation

  • Build CF (Clustering Feature) Tree
  • Handle large datasets with limited memory
  • Multi-phase clustering: scanning, condensing, global clustering
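
  Sketch (Python/sklearn) – sklearn.cluster.Birch builds the CF tree internally;
  threshold bounds the subcluster radius and n_clusters triggers the final
  global clustering phase (parameters and data are illustrative):

      import numpy as np
      from sklearn.cluster import Birch

      rng = np.random.default_rng(0)
      X = np.vstack([rng.normal(c, 0.5, (200, 2)) for c in (0, 5, 10)])

      birch = Birch(threshold=0.8, branching_factor=50, n_clusters=3).fit(X)
      print("CF-tree leaf subclusters:", len(birch.subcluster_centers_))
      print("final labels:", sorted(set(birch.labels_)))

      # For the limited-memory scenario, feed the data in chunks instead:
      # for chunk in np.array_split(X, 10): birch.partial_fit(chunk)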

15. PAM (Partitioning Around Medoids) Algorithm

  • Implement k-medoids clustering
  • Use medoids (actual data points) as cluster representatives instead of centroids
  • Compare robustness to noise vs. K-Means
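
  Sketch (Python) – a simplified alternating k-medoids (Voronoi iteration)
  rather than full PAM’s BUILD/SWAP search; the deliberate outlier at (50, 50)
  shows medoids staying on actual data points:

      import numpy as np

      def k_medoids(X, k, n_iter=50, seed=0):
          rng = np.random.default_rng(seed)
          # Pairwise Manhattan distance matrix
          D = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)
          medoids = rng.choice(len(X), size=k, replace=False)
          for _ in range(n_iter):
              labels = D[:, medoids].argmin(axis=1)    # nearest medoid
              new = medoids.copy()
              for i in range(k):                       # best point per cluster
                  members = np.where(labels == i)[0]
                  if len(members):
                      within = D[np.ix_(members, members)].sum(axis=1)
                      new[i] = members[within.argmin()]
              if np.array_equal(new, medoids):
                  break
              medoids = new
          return medoids, labels

      X = np.array([[1., 1], [1, 2], [2, 2], [8, 8], [8, 9], [50, 50]])
      medoids, labels = k_medoids(X, k=2)
      print(X[medoids], labels)   # the outlier cannot drag a medoid off-data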

16. DBSCAN Algorithm Implementation

  • Density-based clustering
  • Identify core, border, and noise points
  • Tune eps and min_samples
  • Handle clusters of arbitrary shapes and outliers
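
  Sketch (Python/sklearn) – DBSCAN on two blobs plus uniform noise, with core,
  border, and noise points separated explicitly (the data and the eps /
  min_samples values are illustrative starting points):

      import numpy as np
      from sklearn.cluster import DBSCAN

      rng = np.random.default_rng(0)
      X = np.vstack([rng.normal(0, 0.3, (100, 2)),     # dense blob 1
                     rng.normal(4, 0.3, (100, 2)),     # dense blob 2
                     rng.uniform(-2, 6, (10, 2))])     # scattered noise

      db = DBSCAN(eps=0.5, min_samples=5).fit(X)
      labels = db.labels_                              # -1 marks noise
      core = np.zeros(len(X), dtype=bool)
      core[db.core_sample_indices_] = True
      border = (labels != -1) & ~core                  # clustered, but not core

      print("clusters:", len(set(labels)) - (-1 in labels))
      print("core:", core.sum(), "border:", border.sum(),
            "noise:", int((labels == -1).sum()))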