Mastering Hidden Markov Models for ASL Recognition: A Step-by-Step Guide to CS6601 Assignment 6
Learn how to build a word recognizer for American Sign Language using Hidden Markov Models. This tutorial breaks down CS6601 Assignment 6, covering HMM encoding, transition probabilities, and emission parameters with real ASL data.
Introduction to Hidden Markov Models in AI
Hidden Markov Models (HMMs) are a cornerstone of modern artificial intelligence, powering applications from speech recognition to gesture analysis. In the context of CS6601 Assignment 6, you'll apply HMMs to recognize isolated American Sign Language (ASL) words. This tutorial will guide you through the core concepts and implementation steps without solving the assignment outright. Whether you're a student tackling this assignment or a developer exploring probabilistic models, understanding HMMs is essential for pattern recognition tasks.
Understanding the ASL Recognition Problem
American Sign Language recognition involves interpreting hand movements over time. In this assignment, you'll work with Y-coordinates of the right hand and right thumb from video frames. The words to recognize are "ALLIGATOR", "NUTS", and "SLEEP". Each word is modeled as a sequence of hidden states that generate observable coordinates. This mirrors real-world AI applications like gesture-controlled interfaces or assistive technologies for the deaf community.
Think of an HMM like an AI music generator: the hidden states are the underlying musical notes, and the observations are the sounds you hear. By analyzing the sequence of sounds, you infer the most likely notes. Similarly, from hand coordinates, you infer the sign being performed.
HMM Structure for CS6601 Assignment 6
Each word's HMM has exactly three hidden states. Transitions are restricted: you can stay in the current state or move to the next state. This left-to-right topology is common in speech and gesture recognition. The model parameters include:
- Prior probabilities: Probability of starting in each state (always state 1 for this assignment).
- Transition probabilities: Likelihood of moving from one state to another.
- Emission parameters: For each state, a Gaussian distribution (mean and standard deviation) describing the observed Y-coordinates.
For example, in the training data, the word "ALLIGATOR" has sequences like (31,28,28,37,68,49,64,66,22,17,53,73,81,78,48,49,47). These are split into three groups corresponding to the three hidden states. Your job is to estimate the parameters from these splits.
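Before diving into the steps, it helps to see the shape of what you're estimating. Here is one way you might organize a word's parameters in Python; the layout and every number below are illustrative placeholders, not the assignment's answers:

# Illustrative parameter layout for one word's 3-state HMM.
# All numbers are placeholders, not computed from the training data.
alligator_hmm = {
    "prior": [1.0, 0.0, 0.0],        # always start in state 1
    "transition": [                  # A[i][j] = P(next state j | current state i)
        [0.8, 0.2, 0.0],             # state 1: stay, or move to state 2
        [0.0, 0.8, 0.2],             # state 2: stay, or move to state 3
        [0.0, 0.0, 1.0],             # state 3: final state (exit handling varies)
    ],
    "emission": [(45.0, 16.0),       # (mean, std) of each state's Gaussian
                 (50.0, 22.0),
                 (60.0, 15.0)],
}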
Step 1: Encoding the HMM Parameters
To encode the HMM, you need to compute transition and emission probabilities from the training samples. The assignment provides initial state boundaries for each sequence. For instance, for the first ALLIGATOR sample (17 frames):
- State 1: frames 1-6 (31,28,28,37,68,49)
- State 2: frames 7-12 (64,66,22,17,53,73)
- State 3: frames 13-17 (81,78,48,49,47)
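Note that the assignment's frame numbers are 1-indexed while Python slices are 0-indexed; a quick sketch of the split to avoid off-by-one errors:

sample1 = [31, 28, 28, 37, 68, 49, 64, 66, 22, 17, 53, 73, 81, 78, 48, 49, 47]

# 1-indexed frame ranges from the assignment -> 0-indexed Python slices
state1 = sample1[0:6]    # frames 1-6
state2 = sample1[6:12]   # frames 7-12
state3 = sample1[12:17]  # frames 13-17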
From these, you calculate:
- Prior: π = [1, 0, 0] (always start in state 1).
- Transition matrix A: Count transitions between states across all training sequences. For example, from state 1 to state 1 (staying) vs. state 1 to state 2 (moving). Normalize to get probabilities.
- Emission parameters: For each state, compute the mean and standard deviation of all observations assigned to that state across all samples.
Remember to round all values to 3 decimal places using the specified rounding rules (e.g., 0.1234 becomes 0.123, 0.2345 becomes 0.235). Do not use Python's round() function, which rounds half to even; implement the rounding yourself.
Here's a Python snippet to compute the rounded mean and standard deviation without round() (math.floor keeps the half-up rule correct even for negative observations such as the -4 in the ALLIGATOR data, where int() would truncate toward zero):

import math

def custom_round(value, decimals=3):
    # Round half up (0.2345 -> 0.235); math.floor, unlike int(),
    # also handles negative values correctly.
    factor = 10 ** decimals
    return math.floor(value * factor + 0.5) / factor

observations = [31, 28, 28, 37, 68, 49]  # e.g., state 1 of the first ALLIGATOR sample
mean = sum(observations) / len(observations)
mean_rounded = custom_round(mean)
variance = sum((x - mean) ** 2 for x in observations) / len(observations)
std = variance ** 0.5
std_rounded = custom_round(std)

Apply this to each state's observations across all training samples for a given word.
Step 2: Handling State Boundary Adjustments
The assignment warns about states being "squeezed out" during training. If you follow the Baum-Welch or Viterbi training procedure, you might find that a state ends up with zero observations. To prevent this, always keep at least one observation per state. For example, if adjusting boundaries would leave state 1 empty, stop the adjustment. This is a practical constraint to ensure the HMM remains valid.
In the provided example, if moving the boundary left would remove all observations from state 1, you must keep the original boundary. This reflects real-world HMM training, where a minimum state duration is enforced.
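The guard itself can be small. Here is a minimal sketch, assuming boundaries are stored as each state's 0-indexed starting frame (this representation and the function name are assumptions for illustration):

def shift_boundary_left(boundaries, state):
    # Move `state`'s starting frame one step left, unless that would
    # leave the previous state with zero observations.
    new_start = boundaries[state] - 1
    if new_start <= boundaries[state - 1]:   # previous state squeezed out
        return boundaries                    # keep the original boundary
    adjusted = list(boundaries)
    adjusted[state] = new_start
    return adjusted

boundaries = [0, 6, 12]   # start indices of states 1-3 in the first ALLIGATOR sample
boundaries = shift_boundary_left(boundaries, 1)   # -> [0, 5, 12]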
Step 3: Building the One-Dimensional Model (Part 1)
In Part 1a, you will hardcode the computed parameters for each word. The assignment provides a table of training samples. Use all samples for a word to compute a single set of parameters. For instance, for "ALLIGATOR", you have three samples. Combine observations for each state across samples:
- State 1 observations: (31,28,28,37,68,49) from sample1, (25,62,75,80) from sample2, (-4,69,59,45,62,22) from sample3. Total 16 observations.
- Compute mean and std from these 16 values.
Similarly, for transitions, count how many times the sequence moves from state i to state j. For example, in sample1, state 1 has 6 frames, so transitions: state1→state1 (5 times), state1→state2 (1 time). Sum over all samples and normalize.
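A sketch of that counting logic, assuming each sample's split is flattened into a list of per-frame state labels (an assumed representation, not the assignment's API):

def transition_matrix(state_sequences, num_states=3):
    # Estimate A[i][j] by counting consecutive state pairs across all samples.
    counts = [[0.0] * num_states for _ in range(num_states)]
    for states in state_sequences:
        for a, b in zip(states, states[1:]):   # consecutive frame pairs
            counts[a - 1][b - 1] += 1          # states are labeled 1..3
    rows = []
    for row in counts:
        total = sum(row)                       # normalize each row to sum to 1
        rows.append([c / total if total else 0.0 for c in row])
    return rows

sample1 = [1] * 6 + [2] * 6 + [3] * 5   # the boundary split from Step 1
A = transition_matrix([sample1])
# A[0] -> [0.833..., 0.166..., 0.0]  (5 stays and 1 move out of 6 transitions)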
Once you have the parameters, you can implement the Viterbi algorithm to decode the most likely state sequence for a new observation sequence. This is the core of the word recognizer.
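For reference, here is a minimal log-space Viterbi sketch under the one-dimensional Gaussian emission model; all names are illustrative, and log probabilities are used to avoid numerical underflow:

import math

def gaussian_log_pdf(x, mean, std):
    # Log density of a univariate Gaussian with the given mean and std.
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)

def viterbi(observations, prior, transition, emission):
    # Return the most likely 0-indexed state path for one observation sequence.
    n_states = len(prior)

    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # Initialization: score the first frame against each state's Gaussian.
    trellis = [[log(prior[s]) + gaussian_log_pdf(observations[0], *emission[s])
                for s in range(n_states)]]
    backpointers = []

    # Recursion: extend the best path into each state, one frame at a time.
    for obs in observations[1:]:
        prev = trellis[-1]
        row, back = [], []
        for s in range(n_states):
            best = max(range(n_states), key=lambda r: prev[r] + log(transition[r][s]))
            row.append(prev[best] + log(transition[best][s])
                       + gaussian_log_pdf(obs, *emission[s]))
            back.append(best)
        trellis.append(row)
        backpointers.append(back)

    # Termination and backtrace from the best final state.
    state = max(range(n_states), key=lambda s: trellis[-1][s])
    path = [state]
    for back in reversed(backpointers):
        state = back[state]
        path.append(state)
    return list(reversed(path))

To recognize a word, you would score a test sequence under each word's trained HMM and report the word whose model gives the highest likelihood.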
Step 4: Going Multidimensional (Part 2)
Part 2 extends the model to two dimensions: right hand Y and right thumb Y. Now each observation is a pair (y_hand, y_thumb). The emission distribution becomes a bivariate Gaussian with mean vector and covariance matrix. You'll need to compute these parameters similarly, but now for two coordinates.
The transition and prior probabilities remain the same as in Part 1 (since they depend only on state transitions, not observations). However, the emission probabilities change. For each state, compute the mean of hand Y and thumb Y, and the covariance between them.
For example, if state 1 has observations hand=[31,28,28] and thumb=[10,12,11], then mean_hand=29 and mean_thumb=11, and the covariance matrix is the 2x2 matrix with the two variances on its diagonal and the hand-thumb covariance off the diagonal. The Viterbi algorithm now uses the multivariate Gaussian likelihood.
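A sketch of those per-state computations, using the population (divide-by-N) estimates as in the Step 1 snippet; confirm which estimator the assignment expects:

def bivariate_params(hand, thumb):
    # Mean vector and 2x2 covariance matrix for one state's paired observations.
    n = len(hand)
    mean_h = sum(hand) / n
    mean_t = sum(thumb) / n
    var_h = sum((x - mean_h) ** 2 for x in hand) / n
    var_t = sum((y - mean_t) ** 2 for y in thumb) / n
    cov_ht = sum((x - mean_h) * (y - mean_t) for x, y in zip(hand, thumb)) / n
    return [mean_h, mean_t], [[var_h, cov_ht], [cov_ht, var_t]]

mean, cov = bivariate_params([31, 28, 28], [10, 12, 11])
# mean -> [29.0, 11.0]
# cov  -> [[2.0, -1.0], [-1.0, 0.666...]]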
Real-World Applications and Trends
HMMs are not just for assignments; they have long powered production AI systems. Speech recognizers, including earlier versions of Apple's Siri and Google Assistant, were built on HMMs. In gaming, gesture recognition using HMMs enables hands-free controls. With the rise of AI video generation (like Sora), understanding temporal models like HMMs is crucial for generating coherent motion sequences. By mastering this assignment, you're building skills applicable to cutting-edge AI research.
Common Pitfalls and Tips
- Rounding: Use custom rounding, not Python's round(). Write a helper function.
- State boundaries: Ensure each state has at least one observation. If a boundary adjustment would leave a state empty, revert.
- Transition counts: The final frame of a sequence has no outgoing transition; count transitions only between consecutive frames within each sequence.
- Testing: Run the provided unit tests frequently. Comment out unrelated tests when working step by step.
Use the Edstem challenge questions and Canvas lectures for additional support. The Wikipedia page linked in the resources has been edited to match this assignment's conventions, so rely on that version.
Conclusion
Hidden Markov Models are a powerful tool for sequential data analysis. This CS6601 assignment gives you hands-on experience with HMM encoding, parameter estimation, and decoding. By following this guide, you'll build a robust ASL word recognizer. Remember to submit your submission.py file to Gradescope, and use your 10 submissions wisely. Good luck!