Prerequisites
- Data Structures: Familiarity with basic data structures such as lists, dictionaries, trees, etc.
- Finite Automata: Understanding of automata theory and regular expressions.
- Probability Theory: Knowledge of probability distributions, Bayes’ theorem, and statistical methods.
Course Objectives
- To develop an understanding of the challenges and solutions in NLP.
- To explore linguistic phenomena and model them using formal grammars.
- To implement NLP algorithms and tools using Python and NLTK.
Course Outcomes
Upon completion of this course, students will be able to:
4. Linguistic Sensitivity: Demonstrate sensitivity to linguistic phenomena and model them using formal grammars.
5. NLTK Proficiency: Gain hands-on experience with the NLTK library for NLP tasks.
6. String and Tree Manipulation: Manipulate strings and tree structures for text analysis.
7. Parameter Estimation: Estimate parameters using supervised and unsupervised training methods.
List of Experiments
1. Text Preprocessing
Write a Python program to perform the following tasks on text:
- a) Tokenization: Split text into individual words or tokens.
- b) Stop Word Removal: Remove common words (e.g., “the”, “is”) that do not contribute to meaning.
Example Code Snippet:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
text = "Natural Language Processing is fascinating."
tokens = word_tokenize(text)
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
print(filtered_tokens)

2. Porter Stemmer Algorithm
Write a Python program to implement the Porter Stemmer algorithm for stemming (reducing words to their root form).
Example Code Snippet:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
words = ["running", "runner", "runs"]
stemmed_words = [stemmer.stem(word) for word in words]
print(stemmed_words)

3. Word Analysis and Generation
Write a Python program for:
- a) Word Analysis: Analyze words using morphological features.
- b) Word Generation: Generate variations of a given word.
Example Code Snippet:
from nltk.corpus import wordnet
# Word Analysis
synonyms = []
for syn in wordnet.synsets("happy"):
    for lemma in syn.lemmas():
        synonyms.append(lemma.name())
print(synonyms)
# Word Generation
from nltk.corpus import words
valid_words = [w for w in words.words() if w.startswith("run")]
print(valid_words)

4. Word Sense Disambiguation (WSD)
Create a sample list of at least 5 words with ambiguous senses (e.g., “bank”) and write a Python program to implement WSD.
Example Code Snippet:
from nltk.wsd import lesk
from nltk.tokenize import word_tokenize
sentence = "I went to the bank to deposit money."
ambiguous_word = "bank"
context = word_tokenize(sentence)
sense = lesk(context, ambiguous_word)
print(f"Sense: {sense}, Definition: {sense.definition()}")

5. NLTK Toolkit - Stemming
Install the NLTK toolkit and perform stemming using different stemmers (Porter, Snowball, Lancaster).
Example Code Snippet:
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer
stemmers = {
    "Porter": PorterStemmer(),
    "Snowball": SnowballStemmer("english"),
    "Lancaster": LancasterStemmer()
}
word = "happiness"
for name, stemmer in stemmers.items():
    print(f"{name}: {stemmer.stem(word)}")

6. Part-of-Speech (POS) Tagging
Create a sample list of at least 10 words and perform POS tagging. Find the POS for any given word.
Example Code Snippet:
from nltk import pos_tag
from nltk.tokenize import word_tokenize
words = ["run", "jump", "happy", "bank", "deposit"]
tagged_words = pos_tag(words)
print(tagged_words)

7. Morphological Analysis and N-Grams
Write a Python program to:
- a) Perform Morphological Analysis: Analyze word structure using NLTK.
- b) Generate N-Grams: Create bigrams, trigrams, etc., from text.
- c) Implement N-Gram Smoothing: Handle zero-frequency issues.
Example Code Snippet:
from nltk import ngrams
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import Laplace
# Morphological Analysis
from nltk.corpus import wordnet
synsets = wordnet.synsets("bank")
print([syn.definition() for syn in synsets])
# N-Grams
text = "Natural Language Processing is fun".split()
bigrams = list(ngrams(text, 2))
print(bigrams)
# N-Gram Smoothing: Laplace (add-one) smoothing gives unseen n-grams non-zero probability
train_data, vocab = padded_everygram_pipeline(2, [text])
model = Laplace(2)
model.fit(train_data, vocab)
print(model.score("Language", ["Natural"]))

8. Audio-Text Conversion
Using Python speech libraries (e.g., SpeechRecognition and gTTS; NLTK itself does not handle audio), convert an audio file to text and vice versa.
Example Code Snippet:
import speech_recognition as sr
from gtts import gTTS
import os
# Audio to Text
r = sr.Recognizer()
with sr.AudioFile("audio_file.wav") as source:
    audio_data = r.record(source)
text = r.recognize_google(audio_data)
print(text)
# Text to Audio
tts = gTTS(text="Hello, this is a test.")
tts.save("output.mp3")
os.system("start output.mp3")  # Windows; use "open" (macOS) or "xdg-open" (Linux)

Textbooks
- Multilingual Natural Language Processing Applications by Daniel M. Bikel and Imed Zitouni.
- Practical Natural Language Processing by Sowmya Vajjala et al. (O’Reilly).
- Speech and Language Processing by Daniel Jurafsky and James H. Martin.
Reference Books
- Natural Language Processing with Python by Steven Bird, Ewan Klein, and Edward Loper.