Experiment No: 9
Title: Decision Tree Induction for Classification
The goal of this experiment is to implement a Decision Tree Induction algorithm to classify data. Decision trees are a fundamental machine learning tool that builds a model in the form of a tree structure to make predictions.
Aim
To implement the Decision Tree Induction algorithm using Python and its libraries to classify a dataset, evaluate the model’s performance, and visualize the resulting decision tree.
Theory
A decision tree is a non-parametric supervised learning algorithm for classification and regression. It creates a tree-like model of if-then-else rules by recursively partitioning data into homogeneous subsets.
- Internal nodes: Test on an attribute
- Branches: Outcomes of the test
- Leaf nodes: Class labels
The tree is built via recursive partitioning, selecting the best attribute at each step using metrics like:
- Gini Index: Measures node impurity; lower values indicate more homogeneous nodes.
- Information Gain (Entropy): Reduction in entropy after a split; higher values indicate better splits.
Partitioning continues until stopping criteria are met, such as all instances in a node belonging to one class or reaching a minimum node size.
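The two impurity measures can be sketched in a few lines of Python; the helper names below (gini_index, entropy) are illustrative and not part of this experiment's program:

```python
# Minimal sketch: Gini index and entropy of a node's class labels.
import numpy as np

def gini_index(labels):
    """Gini impurity: 1 - sum(p_i^2). 0 means a pure node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Entropy: -sum(p_i * log2(p_i)). 0 means a pure node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# A pure node has impurity 0; a 50/50 node is maximally impure.
print(gini_index([0, 0, 0, 0]))   # 0.0
print(gini_index([0, 0, 1, 1]))   # 0.5
print(entropy([0, 0, 1, 1]))      # 1.0
```

A split is scored by the weighted impurity of its children; the attribute whose split lowers impurity the most is chosen.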
Procedure
- Library Import: Import pandas, matplotlib, and scikit-learn modules such as train_test_split, DecisionTreeClassifier, and metrics.
- Dataset Loading: Load a suitable dataset, e.g., the built-in Iris dataset.
- Data Preprocessing: Separate the features (X) and target (y), and optionally standardize the features.
- Data Splitting: Use train_test_split to divide the data into training (e.g., 80%) and testing (20%) sets.
- Model Training: Instantiate DecisionTreeClassifier and fit it to the training data.
- Prediction: Predict class labels on the testing set.
- Evaluation: Measure performance with accuracy, precision, recall, and the confusion matrix.
- Visualization: Optionally visualize the tree (e.g., with plot_tree or graphviz) to understand the decision rules.
Program (Python)
# Program to implement Decision Tree Induction for Classification
# 1. Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
import seaborn as sns
# 2. Load the dataset
# We'll use the Iris dataset, a classic for classification problems
iris = load_iris()
X = iris.data # Features
y = iris.target # Target variable (species)
# 3. Preprocessing (already done in this dataset, but good practice to show)
# We can create a pandas DataFrame for better visualization and handling
df = pd.DataFrame(X, columns=iris.feature_names)
df['species'] = y
print("First 5 rows of the dataset:")
print(df.head())
print("\nDataset info:")
df.info()
# 4. Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"\nTraining set size: {len(X_train)} samples")
print(f"Testing set size: {len(X_test)} samples")
# 5. Train the Decision Tree Classifier
# You can specify the criterion: 'gini' or 'entropy'
# We'll use the default 'gini' index
clf = DecisionTreeClassifier(random_state=42)
# Fit the model to the training data
clf.fit(X_train, y_train)
# 6. Predict on the test set
y_pred = clf.predict(X_test)
# 7. Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred, target_names=iris.target_names)
print(f"\nModel Accuracy: {accuracy * 100:.2f}%")
print("\nConfusion Matrix:")
print(conf_matrix)
print("\nClassification Report:")
print(class_report)
# Visualize the confusion matrix using seaborn
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues',
xticklabels=iris.target_names, yticklabels=iris.target_names)
plt.title('Confusion Matrix')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()
# 8. Visualize the Decision Tree
plt.figure(figsize=(20, 15))
plot_tree(clf,
feature_names=iris.feature_names,
class_names=iris.target_names,
filled=True,
rounded=True,
fontsize=12)
plt.title('Decision Tree for Iris Classification')
plt.show()
Output

> python exp9.py
First 5 rows of the dataset:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) species
0 5.1 3.5 1.4 0.2 0
1 4.9 3.0 1.4 0.2 0
2 4.7 3.2 1.3 0.2 0
3 4.6 3.1 1.5 0.2 0
4 5.0 3.6 1.4 0.2 0
Dataset info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 sepal length (cm) 150 non-null float64
1 sepal width (cm) 150 non-null float64
2 petal length (cm) 150 non-null float64
3 petal width (cm) 150 non-null float64
4 species 150 non-null int64
dtypes: float64(4), int64(1)
memory usage: 6.0 KB
Training set size: 120 samples
Testing set size: 30 samples
Model Accuracy: 100.00%
Confusion Matrix:
[[10 0 0]
[ 0 9 0]
[ 0 0 11]]
Classification Report:
precision recall f1-score support
setosa 1.00 1.00 1.00 10
versicolor 1.00 1.00 1.00 9
virginica 1.00 1.00 1.00 11
accuracy 1.00 30
macro avg 1.00 1.00 1.00 30
weighted avg 1.00 1.00 1.00 30
Result
The Decision Tree model perfectly classified the Iris dataset with 100% accuracy.
Experiment No: 10
Title: Calculation of Information Gain for Attribute Selection
This experiment focuses on calculating Information Gain to identify the most effective attribute for splitting a dataset in the construction of a decision tree.
Aim
To implement a Python program that calculates the Information Gain for various attributes in a given dataset, and to identify the attribute with the highest Information Gain, which is the optimal choice for the root or a node split in a decision tree.
Theory
Information Gain (IG) is used in the ID3 decision tree algorithm to measure the reduction in entropy when splitting a dataset on an attribute. Higher IG indicates a more effective attribute for classification.
Steps to compute IG:
- Entropy of the parent node: measures impurity:
  Entropy(S) = − Σᵢ pᵢ log₂(pᵢ)
  where S is the dataset, the sum runs over the c classes, and pᵢ is the proportion of instances in class i.
- Weighted entropy of the child nodes: split S on attribute A and compute:
  IG(S, A) = Entropy(S) − Σᵥ (|Sᵥ| / |S|) · Entropy(Sᵥ)
  where Sᵥ is the subset of S in which A takes the value v.
The attribute with the maximum IG is chosen for splitting, as it best reduces uncertainty.
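As a quick check of these formulas, the play-tennis counts used later in this experiment (14 instances, 9 "Yes" / 5 "No"; Outlook splits them Sunny 2+/3−, Overcast 4+/0−, Rain 3+/2−) can be worked by hand:

```python
# Worked example: Information Gain of Outlook on the play-tennis counts.
from math import log2

def H(pos, neg):
    """Entropy of a node with pos/neg class counts."""
    total = pos + neg
    result = 0.0
    for c in (pos, neg):
        if c:  # skip empty classes (0 * log 0 = 0)
            p = c / total
            result -= p * log2(p)
    return result

parent = H(9, 5)                                    # ≈ 0.9403
weighted = (5/14)*H(2, 3) + (4/14)*H(4, 0) + (5/14)*H(3, 2)
print(f"IG(Outlook) = {parent - weighted:.4f}")     # ≈ 0.2467
```

This matches the value the program below computes for 'Outlook'.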
Procedure
- Dataset Creation: Create a small, clear CSV dataset with categorical features and a target class (e.g., “weather” or “play tennis”).
- Load Data: Use pandas to read the CSV into a DataFrame.
- Define Functions:
- Entropy: Compute entropy of a column based on the distribution of its unique values.
- Information Gain: Calculate IG by computing the target’s entropy, weighted entropies of attribute subsets, and subtracting to get IG.
- Iterate and Calculate: Loop through all attributes and compute IG for each.
- Identify Best Attribute: Determine the attribute with the highest IG for the first split.
- Display Results: Print IG values for all attributes and highlight the best one.
Program (Python)
# Program to calculate Information Gain for attributes
import pandas as pd
import numpy as np
# Helper function to calculate entropy
def calculate_entropy(target_column):
"""
Calculates the entropy of a given column.
"""
# Get the value counts and their proportions
value_counts = target_column.value_counts()
proportions = value_counts / len(target_column)
# Calculate entropy using the formula
entropy = -np.sum(proportions * np.log2(proportions))
return entropy
# Function to calculate Information Gain
def calculate_information_gain(df, attribute, target):
"""
Calculates the Information Gain of a specific attribute.
"""
# Calculate entropy of the full dataset (parent node)
parent_entropy = calculate_entropy(df[target])
# Get unique values of the attribute
unique_values = df[attribute].unique()
# Calculate the weighted average entropy of the child nodes
weighted_entropy = 0
for value in unique_values:
# Subset the data for each attribute value
subset = df[df[attribute] == value]
# Calculate entropy of the subset
subset_entropy = calculate_entropy(subset[target])
# Calculate the weight of the subset
weight = len(subset) / len(df)
weighted_entropy += weight * subset_entropy
# Calculate Information Gain
information_gain = parent_entropy - weighted_entropy
return information_gain
# --- Main Program ---
# 1. Create a sample CSV file with data
data = {
'Outlook': ['Sunny', 'Sunny', 'Overcast', 'Rain', 'Rain', 'Rain', 'Overcast', 'Sunny', 'Sunny', 'Rain', 'Sunny', 'Overcast', 'Overcast', 'Rain'],
'Temperature': ['Hot', 'Hot', 'Hot', 'Mild', 'Cool', 'Cool', 'Cool', 'Mild', 'Cool', 'Mild', 'Mild', 'Mild', 'Hot', 'Mild'],
'Humidity': ['High', 'High', 'High', 'High', 'Normal', 'Normal', 'Normal', 'High', 'Normal', 'Normal', 'Normal', 'High', 'Normal', 'High'],
'Wind': ['Weak', 'Strong', 'Weak', 'Weak', 'Weak', 'Strong', 'Strong', 'Weak', 'Weak', 'Weak', 'Strong', 'Strong', 'Weak', 'Strong'],
'PlayTennis': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No']
}
df = pd.DataFrame(data)
df.to_csv('play_tennis.csv', index=False)
print("Data saved to 'play_tennis.csv'\n")
# 2. Load the dataset from the CSV file
df = pd.read_csv('play_tennis.csv')
print("Loaded dataset:")
print(df.head())
print("-" * 30)
# 3. Calculate and display Information Gain for each attribute
target_attribute = 'PlayTennis'
attributes = ['Outlook', 'Temperature', 'Humidity', 'Wind']
information_gains = {}
print("Calculating Information Gain for each attribute:")
for attr in attributes:
ig = calculate_information_gain(df, attr, target_attribute)
information_gains[attr] = ig
print(f"Information Gain for '{attr}': {ig:.4f}")
print("-" * 30)
# 4. Identify the best attribute
best_attribute = max(information_gains, key=information_gains.get)
max_ig = information_gains[best_attribute]
# 5. Display the result
print(f"The best attribute to split on is '{best_attribute}' with an Information Gain of {max_ig:.4f}.")
Output
> python run.py
Data saved to 'play_tennis.csv'
Loaded dataset:
Outlook Temperature Humidity Wind PlayTennis
0 Sunny Hot High Weak No
1 Sunny Hot High Strong No
2 Overcast Hot High Weak Yes
3 Rain Mild High Weak Yes
4 Rain Cool Normal Weak Yes
------------------------------
Calculating Information Gain for each attribute:
Information Gain for 'Outlook': 0.2467
Information Gain for 'Temperature': 0.0292
Information Gain for 'Humidity': 0.1518
Information Gain for 'Wind': 0.0481
------------------------------
The best attribute to split on is 'Outlook' with an Information Gain of 0.2467.
Result
The program successfully calculated the Information Gain for each attribute in the “Play Tennis” dataset. The attribute ‘Outlook’ was found to have the highest Information Gain (0.2467). This result demonstrates that splitting the dataset on ‘Outlook’ provides the greatest reduction in entropy, making it the most informative attribute for the first split in building the decision tree. This confirms the theoretical principle that the attribute with the maximum Information Gain is the optimal choice for a split.
Experiment No: 11
Title: Classification Using the Naive Bayes Algorithm
This experiment focuses on implementing the Naive Bayes algorithm, a probabilistic classifier based on Bayes’ theorem, to classify a dataset.
Aim
To implement the Naive Bayes classification algorithm in Python, using scikit-learn, to classify a dataset and evaluate the model’s performance by calculating its accuracy and other relevant metrics.
Theory
Naive Bayes is a probabilistic classifier based on Bayes’ theorem with a “naive” assumption that features are conditionally independent. Despite this simplification, it performs well in tasks like text classification and spam detection.
Bayes’ theorem is given by:
P(C | X) = P(X | C) · P(C) / P(X)
For classification, with the naive independence assumption over features x₁, …, xₙ:
P(C | x₁, …, xₙ) ∝ P(C) · Πᵢ P(xᵢ | C)
The classifier predicts the class with the highest probability. Variants include:
- Gaussian Naive Bayes: Features follow a normal distribution.
- Multinomial Naive Bayes: For discrete counts (e.g., text).
- Bernoulli Naive Bayes: For binary features.
Gaussian Naive Bayes is suitable for continuous datasets like Iris.
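A minimal sketch of the probability computation behind Gaussian Naive Bayes, on hypothetical 1-D toy data (the full model handles several features the same way, multiplying the per-feature likelihoods):

```python
# Sketch: what GaussianNB learns per class (mean, variance) and how it
# combines likelihoods with priors via Bayes' theorem. Toy data only.
import numpy as np

def gaussian_pdf(x, mean, var):
    """Likelihood of x under a normal distribution N(mean, var)."""
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

X0 = np.array([1.0, 1.2, 0.8])   # training values for class 0
X1 = np.array([3.0, 3.3, 2.9])   # training values for class 1
prior = 0.5                      # equal class priors

def posterior_scores(x):
    # P(C|x) is proportional to P(x|C) * P(C)
    s0 = gaussian_pdf(x, X0.mean(), X0.var()) * prior
    s1 = gaussian_pdf(x, X1.mean(), X1.var()) * prior
    return s0, s1

s0, s1 = posterior_scores(1.1)                  # a point near class 0's mean
print("predicted class:", 0 if s0 > s1 else 1)  # prints: predicted class: 0
```

The classifier simply returns the class whose posterior score is largest.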
Procedure
- Library Import: Import necessary libraries, including pandas for data handling and sklearn for the dataset, model, and evaluation metrics.
- Dataset Loading: Load a suitable dataset for classification. The Iris dataset, a classic for demonstrating classification algorithms, is ideal for this experiment due to its continuous features and well-defined classes.
- Data Splitting: Divide the dataset into training and testing sets. The training set is used to train the model, while the testing set is reserved for evaluating its performance on unseen data. Here a 75/25 split is used.
- Model Training: Instantiate the GaussianNB classifier and train it on the training data using the .fit() method. The model learns the mean and standard deviation of each feature for each class.
- Prediction: Use the trained model to predict the class labels for the features in the test set using the .predict() method.
- Evaluation: Evaluate the model’s performance by comparing the predicted labels with the true labels of the test set. Common metrics include accuracy, precision, recall, and the confusion matrix.
Program (Python)
# Program to implement Naive Bayes Classification
# 1. Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
import seaborn as sns
# 2. Load the dataset
# We'll use the Iris dataset
iris = load_iris()
X = iris.data # Features
y = iris.target # Target variable (species)
# Create a DataFrame for better data handling and viewing
df = pd.DataFrame(X, columns=iris.feature_names)
df['species'] = y
print("First 5 rows of the dataset:")
print(df.head())
print("\nDataset info:")
df.info()
# 3. Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
print(f"\nTraining set size: {len(X_train)} samples")
print(f"Testing set size: {len(X_test)} samples")
# 4. Train a Naive Bayes Classifier (Gaussian Naive Bayes)
model = GaussianNB()
model.fit(X_train, y_train)
# 5. Predict on the test set
y_pred = model.predict(X_test)
# 6. Evaluate and display the results
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred, target_names=iris.target_names)
print(f"\nModel Accuracy: {accuracy * 100:.2f}%")
print("\nConfusion Matrix:")
print(conf_matrix)
print("\nClassification Report:")
print(class_report)
# Visualize the confusion matrix using seaborn
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues',
xticklabels=iris.target_names, yticklabels=iris.target_names)
plt.title('Confusion Matrix for Naive Bayes Classifier')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()
Output

> python exp11.py
First 5 rows of the dataset:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) species
0 5.1 3.5 1.4 0.2 0
1 4.9 3.0 1.4 0.2 0
2 4.7 3.2 1.3 0.2 0
3 4.6 3.1 1.5 0.2 0
4 5.0 3.6 1.4 0.2 0
Dataset info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 sepal length (cm) 150 non-null float64
1 sepal width (cm) 150 non-null float64
2 petal length (cm) 150 non-null float64
3 petal width (cm) 150 non-null float64
4 species 150 non-null int64
dtypes: float64(4), int64(1)
memory usage: 6.0 KB
Training set size: 112 samples
Testing set size: 38 samples
Model Accuracy: 100.00%
Confusion Matrix:
[[15 0 0]
[ 0 11 0]
[ 0 0 12]]
Classification Report:
precision recall f1-score support
setosa 1.00 1.00 1.00 15
versicolor 1.00 1.00 1.00 11
virginica 1.00 1.00 1.00 12
accuracy 1.00 38
macro avg 1.00 1.00 1.00 38
weighted avg 1.00 1.00 1.00 38
Result
Gaussian Naive Bayes classified all instances correctly with 100% accuracy.
Experiment No: 12
Title: Classification Using the K-Nearest Neighbors (KNN) Algorithm
This experiment focuses on implementing the K-Nearest Neighbors (KNN) algorithm to classify data points based on the classes of their closest neighbors.
Aim
To implement the K-Nearest Neighbors (KNN) algorithm using Python’s scikit-learn library to classify a dataset and evaluate its performance.
Theory
K-Nearest Neighbors (KNN) is a non-parametric, lazy learning algorithm used for classification and regression. Instead of building a model, KNN stores the training data and classifies a new point by finding its nearest neighbors. The new point’s label is determined by majority vote (classification) or averaging (regression).
Key aspects:
- Choice of k: A small k makes the model sensitive to noise, while a large k may blur class boundaries.
- Distance Metrics: Most commonly Euclidean distance, but alternatives like Manhattan distance can be used.
KNN is simple and effective for non-linear decision boundaries but can be computationally expensive on large datasets.
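The decision rule described above can be sketched in a few lines; knn_predict and the toy arrays below are illustrative, not scikit-learn API:

```python
# Sketch of the KNN decision rule: Euclidean distance to every training
# point, then a majority vote among the k closest. Toy data only.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances
    nearest = np.argsort(dists)[:k]               # indices of k closest points
    votes = Counter(y_train[nearest])             # majority vote
    return votes.most_common(1)[0][0]

X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.5, 0.5])))  # → 0
print(knn_predict(X_train, y_train, np.array([5.5, 5.5])))  # → 1
```

Note that no model is fitted: all the work happens at prediction time, which is why KNN is called a lazy learner.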
Procedure
- Library Import: Import necessary libraries, including pandas for data handling, sklearn.model_selection for splitting the data, sklearn.neighbors for the KNN model, and sklearn.metrics for evaluation.
- Dataset Loading: Load a suitable classification dataset. We will use the Iris dataset again, as its features and classes are well-suited for demonstrating KNN.
- Data Splitting: Divide the dataset into training and testing sets. This is a crucial step to ensure the model’s performance is evaluated on data it has not seen before. A 75/25 split is commonly used.
- Model Fitting: Create an instance of KNeighborsClassifier. The most important hyperparameter is n_neighbors, the ‘k’ value. Fit the model to the training data.
- Prediction: Use the .predict() method to make predictions on the test set.
- Evaluation: Compare the predicted labels with the true labels of the test set to calculate the model’s accuracy. A classification report provides more detailed metrics such as precision, recall, and F1-score for each class.
Program (Python)
# Program to implement K-Nearest Neighbors Classification
# 1. Import required libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
import seaborn as sns
# 2. Load the dataset
# Using the Iris dataset
iris = load_iris()
X = iris.data # Features
y = iris.target # Target variable (species)
# Create a DataFrame for better data handling and viewing
df = pd.DataFrame(X, columns=iris.feature_names)
df['species'] = y
print("First 5 rows of the dataset:")
print(df.head())
print("\nDataset info:")
df.info()
# 3. Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
print(f"\nTraining set size: {len(X_train)} samples")
print(f"Testing set size: {len(X_test)} samples")
# 4. Fit the KNN model
# We'll choose k=3 for this example
k = 3
knn_model = KNeighborsClassifier(n_neighbors=k)
knn_model.fit(X_train, y_train)
# 5. Make predictions on the test set
y_pred = knn_model.predict(X_test)
# 6. Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred, target_names=iris.target_names)
print(f"\nModel Accuracy (with k={k}): {accuracy * 100:.2f}%")
print("\nConfusion Matrix:")
print(conf_matrix)
print("\nClassification Report:")
print(class_report)
# Optional: Visualize the confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Greens',
xticklabels=iris.target_names, yticklabels=iris.target_names)
plt.title(f'Confusion Matrix for KNN (k={k})')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()
Output

> python exp12.py
First 5 rows of the dataset:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) species
0 5.1 3.5 1.4 0.2 0
1 4.9 3.0 1.4 0.2 0
2 4.7 3.2 1.3 0.2 0
3 4.6 3.1 1.5 0.2 0
4 5.0 3.6 1.4 0.2 0
Dataset info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 sepal length (cm) 150 non-null float64
1 sepal width (cm) 150 non-null float64
2 petal length (cm) 150 non-null float64
3 petal width (cm) 150 non-null float64
4 species 150 non-null int64
dtypes: float64(4), int64(1)
memory usage: 6.0 KB
Training set size: 112 samples
Testing set size: 38 samples
Model Accuracy (with k=3): 100.00%
Confusion Matrix:
[[15 0 0]
[ 0 11 0]
[ 0 0 12]]
Classification Report:
precision recall f1-score support
setosa 1.00 1.00 1.00 15
versicolor 1.00 1.00 1.00 11
virginica 1.00 1.00 1.00 12
accuracy 1.00 38
macro avg 1.00 1.00 1.00 38
weighted avg 1.00 1.00 1.00 38
Result
The KNN algorithm, when implemented on the Iris dataset with k=3, achieved an accuracy of 100%. The confusion matrix shows that all 38 instances in the test set were correctly classified, with zero misclassifications. This demonstrates that the KNN algorithm, for this particular dataset and choice of k, is highly effective at classifying new data points based on their proximity to existing data.
Experiment No: 13
Title: Implementation of K-Means Clustering Algorithm
This experiment implements the K-Means clustering algorithm, an unsupervised machine learning technique used to group data points into clusters.
Aim
To implement the K-Means clustering algorithm using Python and scikit-learn to partition a dataset into a pre-defined number of clusters and visualize the results.
Theory
K-Means clustering is an unsupervised algorithm that partitions data into non-overlapping clusters. It iteratively assigns points to the nearest centroid and updates centroids as the mean of assigned points. The objective is to minimize the within-cluster sum of squared distances:
J = Σₖ Σ_{x ∈ Cₖ} ‖x − μₖ‖²
where μₖ is the centroid of cluster Cₖ.
Key steps:
- Select K: Choose the number of clusters K (via the elbow method or domain knowledge).
- Initialize Centroids: Pick random points as initial centroids.
- Assign Clusters: Allocate points to nearest centroid.
- Update Centroids: Recompute cluster means.
- Repeat: Continue until convergence or max iterations.
Performance is evaluated by inertia (lower is better).
Procedure
- Import Libraries: Import pandas for data handling, sklearn.cluster for the K-Means algorithm, and matplotlib.pyplot and seaborn for visualization.
- Load Dataset: Load a suitable dataset. For this experiment, we will use the Iris dataset and focus on two of its features for easy visualization.
- Apply K-Means: Create an instance of KMeans with the desired number of clusters (K) and fit the model to the data.
- Get Results: After fitting, retrieve the cluster labels assigned to each data point and the final cluster centroids.
- Visualize Data: Use a scatter plot to show the data points colored by their assigned cluster labels, and plot the final cluster centroids.
- Evaluate: (Optional but recommended) Calculate the inertia to assess the clustering quality.
Program (Python)
# Program to implement K-Means Clustering
# 1. Import required libraries
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
import seaborn as sns
# 2. Load the dataset
# We'll use the Iris dataset but only two features for better visualization
iris = load_iris()
X = iris.data[:, [0, 2]] # Use sepal length and petal length
df = pd.DataFrame(X, columns=['sepal_length', 'petal_length'])
print("First 5 rows of the dataset:")
print(df.head())
print("-" * 30)
# 3. Apply the K-Means algorithm
# We know there are 3 species in the Iris dataset, so we'll choose k=3
k = 3
kmeans = KMeans(n_clusters=k, random_state=42, n_init='auto')
# Fit the model to the data and get the cluster labels
df['cluster'] = kmeans.fit_predict(df)
# Get the final cluster centroids
centroids = kmeans.cluster_centers_
print(f"Final cluster centroids for k={k}:")
print(centroids)
print("-" * 30)
# 4. Visualize the clustered data
plt.figure(figsize=(10, 7))
sns.scatterplot(x='sepal_length', y='petal_length', hue='cluster', data=df,
palette='viridis', style='cluster', s=100)
# Plot the centroids
plt.scatter(centroids[:, 0], centroids[:, 1], s=300, c='red', marker='X', label='Centroids')
plt.title(f'K-Means Clustering with K={k}')
plt.xlabel('Sepal Length (cm)')
plt.ylabel('Petal Length (cm)')
plt.legend()
plt.show()
# 5. Display the results
# Show the number of points in each cluster
print("Number of points in each cluster:")
print(df['cluster'].value_counts().sort_index())
print("-" * 30)
# Evaluate using inertia
inertia = kmeans.inertia_
print(f"Inertia (Sum of squared distances): {inertia:.2f}")
Output

> python exp13.py
First 5 rows of the dataset:
sepal_length petal_length
0 5.1 1.4
1 4.9 1.4
2 4.7 1.3
3 4.6 1.5
4 5.0 1.4
------------------------------
Final cluster centroids for k=3:
[[6.83902439 5.67804878]
[5.00784314 1.49215686]
[5.87413793 4.39310345]]
------------------------------
Number of points in each cluster:
cluster
0 41
1 51
2 58
Name: count, dtype: int64
------------------------------
Inertia (Sum of squared distances): 53.81
Result
The K-Means clustering algorithm was successfully implemented on the Iris dataset with K=3. The program partitioned the data points into three clusters, as shown in the visualization. The final centroids for each cluster were determined, and the number of data points assigned to each cluster was displayed. The inertia value of 53.81 measures how tightly the clusters are formed, with lower values indicating a better fit. The visualization clearly shows three distinct groups of data points centered around the calculated centroids, demonstrating the algorithm’s effectiveness in finding natural groupings within the data.
Experiment No: 14
Title: Implementation of BIRCH Clustering Algorithm
This experiment focuses on implementing the BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) algorithm, a clustering method specifically designed for large datasets.
Aim
To implement the BIRCH clustering algorithm using Python and scikit-learn and apply it to a dataset to create a hierarchical clustering structure, which can then be visualized and evaluated.
Theory
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) is a hierarchical clustering algorithm designed for large datasets. Instead of computing all pairwise distances, it compresses data into a Clustering Feature (CF) Tree, making the process efficient.
The algorithm works in two phases:
- Build CF Tree: Each data point is inserted into the tree, where a CF stores:
- N: number of points
- LS: linear sum of points
- SS: squared sum of points
Controlled by parameters: Threshold (T) (max subcluster diameter) and Branching Factor (B) (max children per node).
- Cluster CF Tree: Apply another clustering method (e.g., K-Means) to the tree’s leaf nodes for refinement.
This approach yields speed and memory efficiency.
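The CF triple (N, LS, SS) is what makes this compression work: two CFs merge by simple addition, and the stored sums are enough to recover a subcluster's centroid without revisiting the raw points. A minimal sketch with hypothetical helper names (cf, merge):

```python
# Sketch of Clustering Feature (CF) arithmetic used by the CF tree.
import numpy as np

points_a = np.array([[1.0, 2.0], [2.0, 2.0]])
points_b = np.array([[3.0, 4.0]])

def cf(points):
    # (N, linear sum of points, squared sum of entries)
    return len(points), points.sum(axis=0), (points ** 2).sum()

def merge(cf1, cf2):
    # CFs are additive: merging subclusters just adds the triples
    return cf1[0] + cf2[0], cf1[1] + cf2[1], cf1[2] + cf2[2]

n, ls, ss = merge(cf(points_a), cf(points_b))
print("centroid:", ls / n)   # equals the mean of all three raw points
```

Statistics such as the subcluster radius and diameter (used against the threshold T) are likewise derivable from N, LS, and SS alone.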
Procedure
- Import Libraries: Import necessary libraries, including pandas for data handling, numpy for numerical operations, matplotlib for plotting, and sklearn.cluster for the Birch algorithm.
- Generate Dataset: Create a synthetic dataset with distinct clusters using sklearn.datasets.make_blobs. This allows for a clear visualization and a good test of the algorithm.
- Apply BIRCH: Initialize the Birch model with a specified number of clusters (n_clusters), threshold, and branching_factor. Fit the model to the generated data.
- Visualize Clusters: Plot the data points using a scatter plot, coloring each point according to the cluster label assigned by the BIRCH algorithm.
- Display Cluster Labels: Print the cluster labels and the number of points in each cluster to verify the results.
Program (Python)
# Program to implement BIRCH Clustering Algorithm
# 1. Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs
# 2. Generate a synthetic dataset
# We'll create a dataset with 1000 data points and 4 distinct clusters
X, y = make_blobs(n_samples=1000, centers=4, n_features=2, random_state=42)
print("Shape of the dataset:", X.shape)
print("First 5 data points:\n", X[:5])
print("-" * 30)
# 3. Apply the BIRCH algorithm
# The number of clusters is set to 4, which is the number of clusters in our synthetic data
# The threshold and branching factor can be tuned for different datasets
birch = Birch(n_clusters=4, threshold=0.5, branching_factor=50)
birch.fit(X)
# Get the cluster labels assigned to each data point
labels = birch.labels_
n_clusters = len(np.unique(labels))
print(f"Number of clusters found: {n_clusters}")
print("-" * 30)
# 4. Visualize the clusters
plt.figure(figsize=(10, 7))
scatter = plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=50)
plt.title('BIRCH Clustering')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend(*scatter.legend_elements(), title="Clusters")
plt.show()
# 5. Display cluster labels
print("Number of points assigned to each cluster:")
unique_labels, counts = np.unique(labels, return_counts=True)
for label, count in zip(unique_labels, counts):
print(f" Cluster {label}: {count} points")
Output

> python exp14.py
Shape of the dataset: (1000, 2)
First 5 data points:
[[-8.55503989 7.06461794]
[-6.13753182 -6.58081701]
[-6.32130028 -6.8041042 ]
[ 4.18051794 1.12332531]
[ 4.38028748 0.47002673]]
------------------------------
Number of clusters found: 4
------------------------------
Number of points assigned to each cluster:
Cluster 0: 250 points
Cluster 1: 250 points
Cluster 2: 250 points
Cluster 3: 250 points
Result
The BIRCH algorithm was successfully implemented and applied to a synthetic dataset with four distinct clusters. With n_clusters=4, the program correctly partitioned the data points into the four intended clusters, each containing 250 points. The visualization clearly shows the four well-separated groups. This experiment highlights BIRCH’s efficiency for handling large datasets: the data is first summarized into a compact CF tree, and the final clustering step operates on that summary rather than on the raw points.
Experiment No: 15
Title: Implementation of PAM (Partitioning Around Medoids) Clustering Algorithm
This experiment implements the PAM (Partitioning Around Medoids) clustering algorithm, a method similar to K-Means but with a different approach to cluster centers.
Aim
To implement the PAM clustering algorithm using Python and the scikit-learn-extra library, and to perform clustering on a dataset by selecting representative data points (medoids) to define the clusters.
Theory
PAM (Partitioning Around Medoids) is a partition-based clustering algorithm that improves robustness to noise and outliers compared to K-Means. Instead of using the mean (centroid) as the cluster center, PAM selects an actual data point (medoid), making it less sensitive to extreme values.
The algorithm has two phases:
- Build Phase: Select k medoids (randomly or strategically), then assign other points to their nearest medoid.
- Swap Phase: Iteratively test swaps between medoids and non-medoids, keeping swaps that reduce the total cost (sum of distances to medoids).
This ensures clusters remain representative and resistant to outliers.
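The swap phase above can be sketched with plain NumPy. The helper name `total_cost` and the initial medoid choice are ours, not part of any library; this is a simplified sketch of the cost-driven swap loop, not a full PAM implementation.

```python
import numpy as np

def total_cost(X, medoid_idx):
    """Sum of distances from each point to its nearest medoid."""
    # Pairwise distances from every point to every medoid
    d = np.linalg.norm(X[:, None, :] - X[medoid_idx][None, :, :], axis=2)
    return d.min(axis=1).sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))

medoids = [0, 1]                 # simplified build phase: arbitrary initial medoids
best = total_cost(X, medoids)

# Swap phase: try replacing each medoid with each non-medoid,
# keeping any swap that lowers the total cost
improved = True
while improved:
    improved = False
    for m in range(len(medoids)):
        for h in range(len(X)):
            if h in medoids:
                continue
            trial = medoids.copy()
            trial[m] = h
            c = total_cost(X, trial)
            if c < best:
                medoids, best = trial, c
                improved = True

print("Final medoids:", medoids, "cost:", round(best, 3))
```

Because every accepted swap strictly lowers the total cost, the loop terminates at a local optimum.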
Procedure
- Install Library: Install scikit-learn-extra, which provides the KMedoids implementation (it is not part of standard scikit-learn).
- Import Libraries: Use pandas, numpy, and matplotlib for data handling and visualization, plus KMedoids from sklearn_extra.cluster.
- Generate Dataset: Create synthetic clusters with sklearn.datasets.make_blobs for demonstration.
- Apply PAM: Initialize KMedoids with the desired number of clusters (k) and fit it to the data.
- Visualize: Plot points colored by cluster and mark medoids distinctly.
- Evaluate: Use attributes like labels_ and cluster_centers_ to assess results.
Program (Python)
# Program to implement PAM (Partitioning Around Medoids) Clustering
# 1. Install scikit-learn-extra
# !pip install scikit-learn-extra
# 2. Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn_extra.cluster import KMedoids
from sklearn.datasets import make_blobs
# 3. Generate a synthetic dataset
# We'll create a dataset with 500 data points and 3 distinct clusters
X, _ = make_blobs(n_samples=500, centers=3, n_features=2, random_state=42)
print("Shape of the dataset:", X.shape)
print("First 5 data points:\n", X[:5])
print("-" * 30)
# 4. Apply the PAM algorithm using KMedoids
# We set the number of clusters to 3
k = 3
kmedoids = KMedoids(n_clusters=k, random_state=42)
kmedoids.fit(X)
# Get the cluster labels and medoid indices
labels = kmedoids.labels_
medoid_indices = kmedoids.medoid_indices_
medoids = X[medoid_indices]
print(f"Number of clusters found: {len(np.unique(labels))}")
print(f"Indices of the medoids: {medoid_indices}")
print("-" * 30)
# 5. Visualize the clusters and medoids
plt.figure(figsize=(10, 7))
scatter = plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=50)
# Plot the medoids with a different marker and color
plt.scatter(medoids[:, 0], medoids[:, 1], c='red', marker='X', s=300, label='Medoids')
plt.title(f'PAM Clustering with K={k}')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.show()
# 6. Evaluate the clustering
print("Number of points in each cluster:")
unique_labels, counts = np.unique(labels, return_counts=True)
for label, count in zip(unique_labels, counts):
    print(f"  Cluster {label}: {count} points")
print("-" * 30)
print(f"Total distance to medoids (cost): {kmedoids.inertia_:.2f}")
Output

> python exp15.py
Shape of the dataset: (500, 2)
First 5 data points:
[[-5.73035386 -7.58328602]
[ 1.94299219 1.91887482]
[ 6.82968177 1.1648714 ]
[-2.90130578 7.55077118]
[ 5.84109276 1.56509431]]
------------------------------
Number of clusters found: 3
Indices of the medoids: [462 449 93]
------------------------------
Number of points in each cluster:
Cluster 0: 166 points
Cluster 1: 167 points
Cluster 2: 167 points
------------------------------
Total distance to medoids (cost): 613.31
Result
The PAM clustering algorithm was successfully implemented on a synthetic dataset with three clusters. The program correctly identified the three clusters and selected an actual data point within each cluster to serve as its medoid. The visualization clearly shows the clustered data points along with the location of the medoids. The output also confirms the number of points in each cluster, demonstrating the algorithm’s effectiveness in partitioning the data. The inertia value, which represents the total cost of the clustering, was calculated and displayed, providing a quantitative measure of the result. This experiment highlights the key difference between PAM and K-Means: using actual data points as cluster centers, which makes the algorithm more robust.
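The robustness claim can be checked with a tiny one-dimensional example of our own: a single extreme value drags the mean far from the data, while the medoid stays on a real point inside the cluster.

```python
import numpy as np

# Five points clustered near 10, plus one extreme outlier
x = np.array([9.0, 10.0, 10.5, 11.0, 9.5, 100.0])

mean = x.mean()  # pulled toward the outlier

# The medoid is the actual data point minimizing total distance to all others
costs = np.abs(x[:, None] - x[None, :]).sum(axis=1)
medoid = x[np.argmin(costs)]

print(f"Mean: {mean:.2f}")    # 25.00 -- far from the bulk of the data
print(f"Medoid: {medoid}")    # 10.0  -- a real point inside the cluster
```

This is the same contrast the experiment illustrates in two dimensions: K-Means centroids behave like the mean, PAM medoids like the medoid.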
Experiment No: 16
Title: Implementation of DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
This experiment focuses on implementing the DBSCAN algorithm, a powerful density-based clustering method capable of identifying arbitrarily shaped clusters and outliers.
Aim
To implement the DBSCAN clustering algorithm using Python and scikit-learn to cluster a dataset, identify dense regions as clusters, and mark low-density points as outliers (noise).
Theory
DBSCAN is a density-based, unsupervised clustering algorithm that forms clusters of arbitrary shape without requiring the number of clusters in advance. Unlike K-Means, it can detect outliers (noise points). DBSCAN depends on two parameters:
- eps (ε): the radius defining a point's neighborhood.
- min_samples: the minimum number of points needed to form a dense region.
Points are classified as:
- Core points: have at least min_samples points within eps.
- Border points: have fewer than min_samples neighbors but lie within eps of a core point.
- Noise points: neither core nor border.
Clusters grow from core points, expanding through density-reachable neighbors until all points are visited.
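The core/border/noise classification can be reproduced directly from the definitions above. The helper `classify_points` and the tiny dataset are ours, written only to make the three categories concrete; neighborhood counts include the point itself, as scikit-learn's DBSCAN does.

```python
import numpy as np

def classify_points(X, eps, min_samples):
    """Label each point 'core', 'border', or 'noise' per the DBSCAN definitions."""
    # Pairwise distance matrix and eps-neighborhood membership
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = d <= eps
    counts = neighbors.sum(axis=1)       # includes the point itself
    core = counts >= min_samples
    kinds = []
    for i in range(len(X)):
        if core[i]:
            kinds.append("core")
        elif neighbors[i][core].any():   # within eps of some core point
            kinds.append("border")
        else:
            kinds.append("noise")
    return kinds

# Four dense points, one isolated point, one point on the cluster's edge
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [0.1, 0.1], [5.0, 5.0], [0.25, 0.0]])
print(classify_points(X, eps=0.2, min_samples=4))
# -> ['core', 'core', 'core', 'core', 'noise', 'border']
```

The edge point has only three neighbors (itself included), so it is not core, but it sits within eps of a core point and is therefore border rather than noise.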
Procedure
- Import Libraries: Import pandas, numpy, and matplotlib for data handling and visualization, as well as DBSCAN from sklearn.cluster.
- Generate Dataset: Create a synthetic dataset with distinct, non-spherical clusters and some noise using sklearn.datasets.make_moons or make_blobs; such shapes showcase DBSCAN's strengths over K-Means.
- Apply DBSCAN: Initialize the DBSCAN model with chosen values for eps and min_samples, then fit it to the data.
- Visualize Results: Use a scatter plot, coloring each point by its assigned cluster label; outliers are labeled -1.
- Display and Interpret: Print the cluster labels and the number of points in each cluster, including outliers, and analyze how the algorithm grouped the dense regions and identified the noise points.
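A common way to choose eps, not covered in the procedure above, is a k-distance plot: sort every point's distance to its k-th nearest neighbor and look for the "elbow" in the curve. The sketch below uses k=5 to match the min_samples value used in this experiment; note that scikit-learn's kneighbors counts each training point as its own nearest neighbor.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=200, noise=0.05, random_state=42)

k = 5  # matches min_samples
nn = NearestNeighbors(n_neighbors=k).fit(X)
distances, _ = nn.kneighbors(X)

# Sort the distance to the k-th nearest neighbor across all points
k_dist = np.sort(distances[:, -1])

plt.plot(k_dist)
plt.xlabel('Points sorted by k-distance')
plt.ylabel(f'Distance to {k}th nearest neighbor')
plt.title('k-distance plot for choosing eps')
plt.show()

# A sharp bend ("elbow") in this curve suggests a suitable eps:
# values well above the bend merge clusters, values well below fragment them.
```

Reading the elbow is a judgment call, but it turns the eps choice from a blind guess into an inspection of the data's density.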
Program (Python)
# Program to implement DBSCAN Clustering Algorithm
# 1. Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
# 2. Generate a synthetic dataset
# We'll create a dataset with two crescent-shaped clusters and some noise
X, y = make_moons(n_samples=200, noise=0.05, random_state=42)
print("Shape of the dataset:", X.shape)
print("First 5 data points:\n", X[:5])
print("-" * 30)
# 3. Apply the DBSCAN algorithm
# We'll choose eps and min_samples based on the dataset's characteristics
dbscan = DBSCAN(eps=0.2, min_samples=5)
dbscan.fit(X)
# Get the cluster labels. Labels with -1 indicate noise points.
labels = dbscan.labels_
n_clusters = len(np.unique(labels)) - (1 if -1 in labels else 0)
print(f"Number of clusters found: {n_clusters}")
print(f"Outlier points (noise) found: {list(labels).count(-1)}")
print("-" * 30)
# 4. Visualize the clusters and outliers
plt.figure(figsize=(10, 7))
unique_labels = np.unique(labels)
colors = plt.cm.Spectral(np.linspace(0, 1, len(unique_labels)))
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black color for noise points
        col = 'k'
    # Create a mask for points belonging to the current cluster
    class_member_mask = (labels == k)
    # Plot the points
    xy = X[class_member_mask]
    plt.scatter(xy[:, 0], xy[:, 1], c=[col], s=50, label=f'Cluster {k}' if k != -1 else 'Noise')
plt.title('DBSCAN Clustering')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.show()
# 5. Display and interpret the results
print("Number of points in each cluster (including outliers):")
unique_labels, counts = np.unique(labels, return_counts=True)
for label, count in zip(unique_labels, counts):
    if label == -1:
        print(f"  Noise points (-1): {count}")
    else:
        print(f"  Cluster {label}: {count}")
Output

> python exp16.py
Shape of the dataset: (200, 2)
First 5 data points:
[[-1.02069027 0.10551754]
[ 0.9058265 0.45785751]
[ 0.61842175 0.75708632]
[ 1.22770701 -0.42518512]
[ 0.32935594 -0.20694568]]
------------------------------
Number of clusters found: 2
Outlier points (noise) found: 0
------------------------------
Number of points in each cluster (including outliers):
Cluster 0: 100
Cluster 1: 100
Result
The DBSCAN algorithm was successfully implemented on a synthetic dataset with two crescent-shaped clusters. The program correctly grouped the data points into two distinct clusters; with the low noise level used (noise=0.05) and eps=0.2, no points were sparse enough to be labeled as outliers, which the output confirms. The visualization clearly shows how DBSCAN handled the non-linear shapes of the clusters, which would be difficult for algorithms like K-Means. This experiment demonstrates DBSCAN's suitability for datasets with complex shapes, together with its built-in ability to flag low-density points as noise when they occur.