Mastering Probability, MLE, PCA, and Clustering: A Comprehensive Tutorial for CSE/ISYE 6740
Dive into key concepts from CSE/ISYE 6740 homework assignments, including probability, maximum likelihood estimation, principal component analysis, and clustering, with timely examples and clear explanations.
Introduction
If you're tackling the CSE/ISYE 6740 homework assignments, you're facing a rich mix of probability theory, maximum likelihood estimation (MLE), principal component analysis (PCA), and clustering. These topics form the backbone of machine learning and data science. In this tutorial, we'll break down each concept with intuitive explanations, step-by-step derivations, and examples drawn from sports, medical testing, and image compression to make the material stick. By the end, you'll be ready to solve similar problems with confidence.
Probability: Bayes' Theorem in Action
The first homework problem tests your grasp of conditional probability and Bayes' theorem. Let's walk through a classic scenario.
Example: Employee Resignations
Stores A, B, and C have 50, 75, and 100 employees, of whom 50%, 60%, and 70% are women, respectively. Assume every employee is equally likely to resign, regardless of store or sex. If a resigned employee is a woman, what's the probability she worked at store C? This is a textbook Bayes problem. Let W be the event that the employee is a woman. We want P(C|W). By Bayes' theorem: P(C|W) = P(W|C)P(C) / P(W). Here P(C) = 100/225, P(W|C) = 0.7, and by the law of total probability P(W) = (50*0.5 + 75*0.6 + 100*0.7)/225 = 140/225. The numerator is 0.7 * 100/225 = 70/225, so P(C|W) = 70/140 = 0.5. The same type of calculation underlies applications such as spam filtering and medical diagnostics.
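The store calculation can be checked with exact fractions. This is a minimal sketch; the variable names are illustrative, and the prior assumes every employee is equally likely to resign.

```python
from fractions import Fraction

# Employee counts and fraction of women per store (from the problem statement).
employees = {"A": 50, "B": 75, "C": 100}
frac_women = {"A": Fraction(1, 2), "B": Fraction(3, 5), "C": Fraction(7, 10)}

total = sum(employees.values())

# Prior P(store): assumes every employee is equally likely to resign.
prior = {s: Fraction(n, total) for s, n in employees.items()}

# Law of total probability: P(W) = sum over stores of P(W | store) * P(store).
p_woman = sum(frac_women[s] * prior[s] for s in employees)

# Bayes' theorem: P(store | W) = P(W | store) * P(store) / P(W).
posterior = {s: frac_women[s] * prior[s] / p_woman for s in employees}

print(posterior["C"])  # 1/2
```

Using Fraction avoids rounding, so the posteriors over the three stores sum to exactly 1.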
Medical Testing and False Positives
Another problem: a test is 95% effective at detecting a disease, with a 1% false positive rate, and 0.5% prevalence. Given a positive test, what's the probability you have the disease? Again, Bayes: P(D|+) = (0.95 * 0.005) / (0.95*0.005 + 0.01*0.995) ≈ 0.323. Surprisingly low! This illustrates why even accurate tests can produce many false positives when the condition is rare—a crucial lesson in public health, especially relevant during pandemic tracking.
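The same formula is easy to wrap in a small helper to see how the answer moves with prevalence. The function name and argument names below are illustrative choices, not part of the assignment.

```python
def posterior_given_positive(sensitivity, false_positive_rate, prevalence):
    """P(disease | positive test) via Bayes' theorem."""
    true_pos = sensitivity * prevalence                    # P(+ | D) P(D)
    false_pos = false_positive_rate * (1.0 - prevalence)   # P(+ | not D) P(not D)
    return true_pos / (true_pos + false_pos)

p = posterior_given_positive(0.95, 0.01, 0.005)
print(round(p, 3))  # 0.323
```

Raising the prevalence argument while keeping the test characteristics fixed shows the posterior climbing quickly, which is the base-rate effect described above.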
Baseball Playoff Probability
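The Braves, Giants, and Dodgers are tied with 3 games left. The Giants play the Dodgers in all 3; the Braves play the Padres in all 3. Each game is an independent 50-50 toss-up. What's the probability the Braves win outright? And the probability of a playoff? This requires enumerating outcomes. Exactly one of the Giants and Dodgers wins the majority of their series: the series winner adds 3 wins with probability 1/4 (a sweep) and 2 wins with probability 3/4. The Braves finish alone in first only if they win all 3 games while the Giants-Dodgers series splits 2-1: (1/8)(3/4) = 3/32. A playoff occurs when the Braves match the series winner: both add 3 wins, with probability (1/8)(1/4) = 1/32, or both add 2 wins, with probability (3/8)(3/4) = 9/32, for a total of 10/32 = 5/16. The Giants and Dodgers cannot tie each other because they play an odd number of games. These problems mirror scenarios in sports analytics, like predicting NBA play-in tournament outcomes.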
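A brute-force enumeration confirms the counting. The sketch below assumes the Braves play all three remaining games against the Padres and that all six games are independent fair coin flips; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

braves_outright = 0
playoff = 0
outcomes = 0

# Each tuple holds six coin flips: the first three are Braves-vs-Padres games
# (1 = Braves win), the last three are Giants-vs-Dodgers games (1 = Giants win).
for flips in product((0, 1), repeat=6):
    braves = sum(flips[:3])
    giants = sum(flips[3:])
    dodgers = 3 - giants
    best = max(braves, giants, dodgers)
    leaders = [w for w in (braves, giants, dodgers) if w == best]
    outcomes += 1
    if len(leaders) > 1:
        playoff += 1          # two teams share first place
    elif braves == best:
        braves_outright += 1  # Braves alone in first place

print(Fraction(braves_outright, outcomes))  # 3/32
print(Fraction(playoff, outcomes))          # 5/16
```

Because the teams start tied, only the wins added over the last three games matter, which is why the loop compares those counts directly.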
Maximum Likelihood Estimation
MLE is a cornerstone of statistical modeling. You'll derive estimators for Poisson, Multinomial, and Gaussian distributions.
Poisson MLE
Given i.i.d. samples from Poisson(λ), the log-likelihood is ℓ(λ) = Σ (k_i log λ - λ - log k_i!). Differentiate and set to zero: Σ (k_i/λ - 1) = 0 → λ̂ = (1/n) Σ k_i. So the sample mean is the MLE. This is used in modeling count data, like the number of goals in a World Cup match or website visits per hour.
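As a sanity check, the closed-form estimate can be compared against a grid search over the log-likelihood. This is a sketch on made-up counts; the function name and the grid are arbitrary choices.

```python
import math

def poisson_log_likelihood(lam, counts):
    """Log-likelihood of i.i.d. Poisson(lam) observations."""
    # lgamma(k + 1) equals log(k!) and avoids overflow for large counts.
    return sum(k * math.log(lam) - lam - math.lgamma(k + 1) for k in counts)

counts = [2, 0, 3, 1, 4, 2, 1, 3]    # made-up count data
lam_hat = sum(counts) / len(counts)  # closed-form MLE: the sample mean

# Compare against a coarse grid of candidate rates from 0.1 to 10.0.
grid = [0.1 * i for i in range(1, 101)]
best_on_grid = max(grid, key=lambda lam: poisson_log_likelihood(lam, counts))

print(lam_hat, best_on_grid)
```

The grid maximizer should land at (or next to) the sample mean, since the log-likelihood is concave in λ.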
Multinomial MLE
For a multinomial distribution over K categories, with n trials in total and x_j observations in category j, the MLE of θ_j is the proportion of counts in category j: θ̂_j = x_j / n. This is intuitive: if you roll a die 100 times, the estimated probability of each face is its observed frequency.
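The result follows from maximizing the log-likelihood subject to the probabilities summing to one. A sketch of the derivation, where K (the number of categories) and the Lagrange multiplier λ are notation introduced here:

```latex
\begin{aligned}
\ell(\theta) &= \sum_{j=1}^{K} x_j \log \theta_j + \text{const},
  \qquad \text{subject to } \sum_{j=1}^{K} \theta_j = 1, \\
\mathcal{L}(\theta, \lambda) &= \sum_{j=1}^{K} x_j \log \theta_j
  + \lambda \Bigl( 1 - \sum_{j=1}^{K} \theta_j \Bigr), \\
\frac{\partial \mathcal{L}}{\partial \theta_j} &= \frac{x_j}{\theta_j} - \lambda = 0
  \;\Rightarrow\; \theta_j = \frac{x_j}{\lambda}, \\
\sum_{j=1}^{K} \theta_j = 1 &\;\Rightarrow\; \lambda = \sum_{j=1}^{K} x_j = n
  \;\Rightarrow\; \hat{\theta}_j = \frac{x_j}{n}.
\end{aligned}
```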
Gaussian MLE
For univariate Gaussian, the MLE for μ is the sample mean, and for σ² it's the biased sample variance (dividing by n, not n-1). This is a classic result: μ̂ = (1/n) Σ x_i, σ̂² = (1/n) Σ (x_i - μ̂)². In practice, we often use the unbiased version, but MLE gives the biased one.
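The difference between the two variance estimates is easy to see numerically. The sketch below uses a made-up sample and compares hand-computed values with the standard library's statistics.pvariance (divides by n) and statistics.variance (divides by n - 1).

```python
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up sample
n = len(data)

mu_hat = sum(data) / n
sum_sq = sum((x - mu_hat) ** 2 for x in data)

sigma2_mle = sum_sq / n             # MLE: divides by n (biased)
sigma2_unbiased = sum_sq / (n - 1)  # unbiased estimator: divides by n - 1

print(mu_hat, sigma2_mle, sigma2_unbiased)
print(statistics.pvariance(data), statistics.variance(data))
```

The two estimates differ by a factor of n / (n - 1), so the gap shrinks as the sample grows.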
Principal Component Analysis
PCA can be derived as minimizing reconstruction error. Given D-dimensional data, we want to represent each point x_n as a projection onto M principal components plus a constant offset for the remaining dimensions.
Optimal Coefficients
For the first M dimensions, the optimal z_i^n is the projection of x_n onto u_i: z_i^n = x_n^T u_i. For the remaining dimensions, the optimal constant b_j is the mean of the projections over all N data points: b_j = (1/N) Σ_n x_n^T u_j = u_j^T x̄. So the best low-dimensional representation keeps each point's own coordinates in the principal subspace and replaces its coordinates in the discarded directions with those of the mean x̄.
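These coefficients come from differentiating the reconstruction error. A sketch of the steps, assuming the u_i form an orthonormal basis and writing N for the number of data points and J for the mean squared reconstruction error (both symbols are notation introduced here):

```latex
\begin{aligned}
\tilde{x}_n &= \sum_{i=1}^{M} z_i^n u_i + \sum_{j=M+1}^{D} b_j u_j, \\
J &= \frac{1}{N} \sum_{n=1}^{N} \lVert x_n - \tilde{x}_n \rVert^2
   = \frac{1}{N} \sum_{n=1}^{N} \Bigl[ \sum_{i=1}^{M} \bigl( x_n^T u_i - z_i^n \bigr)^2
     + \sum_{j=M+1}^{D} \bigl( x_n^T u_j - b_j \bigr)^2 \Bigr], \\
\frac{\partial J}{\partial z_i^n} &= 0 \;\Rightarrow\; z_i^n = x_n^T u_i, \\
\frac{\partial J}{\partial b_j} &= 0 \;\Rightarrow\; b_j = \frac{1}{N} \sum_{n=1}^{N} x_n^T u_j = u_j^T \bar{x}.
\end{aligned}
```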
Choosing the Basis
The optimal u_i are eigenvectors of the sample covariance matrix S. With the optimal coefficients plugged in, the reconstruction error equals the sum of the eigenvalues of the discarded directions, so it is minimized by keeping the M eigenvectors with the largest eigenvalues. This is exactly what PCA does: it finds the directions of maximum variance. In practice, PCA is used for dimensionality reduction, such as compressing images or visualizing high-dimensional data.
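The link between eigenvalues and reconstruction error can be checked numerically. This is a minimal NumPy sketch on synthetic data (the data, the seed, and M = 2 are arbitrary assumptions), with S computed by dividing by N.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 3-D data with most variance along the first axis (made up).
X = rng.normal(size=(500, 3)) * np.array([3.0, 1.0, 0.2])

x_bar = X.mean(axis=0)
S = (X - x_bar).T @ (X - x_bar) / X.shape[0]  # sample covariance (divide by N)

# eigh handles symmetric matrices and returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]             # re-sort to descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

M = 2
U = eigvecs[:, :M]            # top-M principal directions
Z = (X - x_bar) @ U           # coordinates in the principal subspace
X_rec = x_bar + Z @ U.T       # reconstruction from M components plus the mean

mse = np.mean(np.sum((X - X_rec) ** 2, axis=1))
print(mse, eigvals[M:].sum())  # the two values should agree
```

The mean squared reconstruction error matches the sum of the discarded eigenvalues up to floating-point error, which is why keeping the largest eigenvalues is optimal.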
Clustering: K-Means and Hierarchical
Clustering groups similar data points. K-means minimizes the within-cluster sum of squares.
K-Means Convergence
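The algorithm alternates between assigning each point to its nearest centroid and updating each centroid to the mean of its cluster. Neither step can increase the distortion (the within-cluster sum of squared distances): reassignment picks the closest centroid for every point, and the mean minimizes the squared distance to the points in a cluster. Because the distortion is bounded below by zero and there are only finitely many ways to assign points to K clusters, the algorithm converges in a finite number of iterations, though only to a local optimum that depends on the initialization. The centroid update formula μ_k = (1/n_k) Σ x_n, summed over the n_k points in cluster k, comes from setting the derivative of the distortion with respect to μ_k to zero.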
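The monotone decrease is easy to observe by recording the distortion after every iteration. Below is a plain-Python sketch of the two alternating steps; the function names, the random initialization from data points, and the synthetic two-blob data are all assumptions made for illustration.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=20, seed=0):
    """Alternate assignment and update steps; return centroids, labels, distortions."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialize from k distinct data points
    history = []
    labels = []
    for _ in range(iters):
        # Assignment step: index of the nearest centroid for each point.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        # Update step: each centroid becomes the mean of its assigned points.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
        history.append(sum(sq_dist(p, centroids[l]) for p, l in zip(points, labels)))
    return centroids, labels, history

# Two made-up Gaussian blobs in 2-D.
data_rng = random.Random(1)
points = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(50)]
points += [(data_rng.gauss(5, 1), data_rng.gauss(5, 1)) for _ in range(50)]

centroids, labels, history = kmeans(points, k=2)
print(history[:5])
```

The recorded distortions never increase from one iteration to the next, and they stop changing once the assignments stabilize.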
Hierarchical Clustering Linkage
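Single, complete, and average linkage define the distance between two clusters as the minimum, maximum, and mean pairwise distance between their points, respectively. For data shaped like two moons, single linkage can follow each elongated moon and separate the two when the gap between them is clean, but a few noisy points in the gap let it chain across and merge them. Complete linkage favors compact clusters of similar diameter, so it tends to cut across the moons rather than follow them. Average linkage is a compromise between the two. For K-means-like clusters (roughly spherical, similar size), average or complete linkage tends to work well.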
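The three linkage criteria differ only in how they summarize the pairwise distances between two clusters. A small sketch with hypothetical helper names and two made-up clusters on a line:

```python
import math
from itertools import product

def pairwise(a, b):
    """All Euclidean distances between a point in cluster a and a point in cluster b."""
    return [math.dist(p, q) for p, q in product(a, b)]

def single_linkage(a, b):
    return min(pairwise(a, b))    # closest pair

def complete_linkage(a, b):
    return max(pairwise(a, b))    # farthest pair

def average_linkage(a, b):
    d = pairwise(a, b)
    return sum(d) / len(d)        # mean over all pairs

cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(3.0, 0.0), (5.0, 0.0)]
print(single_linkage(cluster_a, cluster_b))    # 2.0
print(complete_linkage(cluster_a, cluster_b))  # 5.0
print(average_linkage(cluster_a, cluster_b))   # 3.5
```

Agglomerative clustering repeatedly merges the pair of clusters with the smallest such distance, so the choice of summary determines which shapes get merged first.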
Programming: Image Compression with Clustering
The final assignment involves compressing an image by clustering pixels in RGB space. Each pixel is a 3D point (R,G,B). K-means groups similar colors, and each pixel is replaced by its cluster centroid, reducing the number of distinct colors to at most K. This is color quantization, the same idea behind palette-based image formats such as GIF, where each pixel stores an index into a small color table. For example, with K=16, a full-color image becomes a 16-color posterized version: each pixel needs only a 4-bit palette index instead of 24 bits of RGB, plus the 16-entry palette itself, which cuts storage substantially while preserving the overall look.
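The pipeline can be sketched in a few lines of NumPy. This is not the assignment's reference solution: the function name, the random initialization, the fixed iteration count, and the random stand-in image are all assumptions made to keep the example self-contained.

```python
import numpy as np

def quantize_colors(image, k, iters=10, seed=0):
    """Replace each pixel with its K-means centroid color; return image and palette."""
    pixels = image.reshape(-1, 3).astype(float)
    rng = np.random.default_rng(seed)
    # Initialize centroids from k distinct pixel positions (fancy indexing copies).
    centroids = pixels[rng.choice(len(pixels), size=k, replace=False)]
    for _ in range(iters):
        # Squared distance from every pixel to every centroid: shape (n_pixels, k).
        d2 = ((pixels[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for c in range(k):
            members = pixels[labels == c]
            if len(members):  # leave an empty cluster's centroid unchanged
                centroids[c] = members.mean(axis=0)
    # Final assignment against the updated centroids.
    d2 = ((pixels[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    palette = np.clip(np.rint(centroids), 0, 255).astype(np.uint8)
    return palette[labels].reshape(image.shape), palette

# Random stand-in for a real photo (an assumption, to avoid needing an image file).
image = np.random.default_rng(1).integers(0, 256, size=(32, 32, 3), dtype=np.uint8)
compressed, palette = quantize_colors(image, k=16)
print(compressed.shape, len(np.unique(compressed.reshape(-1, 3), axis=0)))
```

Storing the labels and the palette, rather than the reconstructed array, is what actually saves space. For a real photo, the image array would come from an image-loading library of your choice.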
Conclusion
These homework problems cover essential ML concepts: probability, MLE, PCA, and clustering. By understanding the derivations and practicing with real-world analogies—from sports playoffs to image compression—you'll build a strong foundation for more advanced topics. Keep experimenting with code, and you'll master these techniques in no time.