Parallel Radix Sort with OpenMP: A Step-by-Step Tutorial for CME213
Learn how to implement parallel Radix Sort using OpenMP in C++. This tutorial covers histogram computation, parallel prefix sum, and bucket reordering with practical examples and trend-inspired analogies.
Introduction to Parallel Radix Sort
Radix Sort is a non-comparative sorting algorithm that sorts integers by processing individual digits or bits. In the CME213 assignment, you will implement Radix Sort in parallel using OpenMP, a shared-memory parallel programming API. This tutorial will guide you through the key concepts and implementation steps, focusing on the parallelization strategies that can speed up sorting on modern multi-core CPUs.
Radix Sort is particularly relevant in big data and AI applications where large datasets need to be sorted efficiently. For example, sorting millions of user IDs in a social media feed or ranking game scores in real-time leaderboards often leverages parallel sorting algorithms.
Understanding Radix Sort Basics
Radix Sort processes elements bit by bit, from the least significant bit (LSB) to the most significant bit (MSB). In each pass, it examines a group of numBits bits and sorts the elements into 2^numBits buckets. The steps are:
- Histogram: Count the frequency of each bucket.
- Prefix Sum: Compute the starting positions for each bucket.
- Reorder: Place elements into a temporary array based on the bucket positions.
For a 32-bit integer with 4 bits per pass, we need 8 passes over 16 buckets per pass. The algorithm is stable, and its time complexity is O(n * k), where k is the number of passes.
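The three steps above can be sketched as a sequential LSD Radix Sort (a minimal sketch; the function name and use of unsigned 32-bit keys are illustrative, not from the assignment):

```cpp
#include <cstdint>
#include <vector>

// Sequential LSD Radix Sort: 4 bits per pass, 8 passes, 16 buckets.
// Each pass runs the three steps: histogram, prefix sum, stable reorder.
void radix_sort(std::vector<uint32_t>& a) {
    const int numBits = 4;
    const int numBuckets = 1 << numBits;          // 16
    std::vector<uint32_t> tmp(a.size());
    for (int shift = 0; shift < 32; shift += numBits) {
        int histogram[numBuckets] = {0};
        for (uint32_t x : a)                      // 1. histogram
            histogram[(x >> shift) & (numBuckets - 1)]++;
        int prefix[numBuckets] = {0};
        for (int b = 1; b < numBuckets; b++)      // 2. exclusive prefix sum
            prefix[b] = prefix[b - 1] + histogram[b - 1];
        for (uint32_t x : a)                      // 3. stable reorder
            tmp[prefix[(x >> shift) & (numBuckets - 1)]++] = x;
        a.swap(tmp);
    }
}
```

Because each pass preserves the relative order of equal digits (stability), earlier passes are never undone by later ones.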
Parallelization with OpenMP
OpenMP allows you to parallelize loops easily using compiler directives. In Radix Sort, the histogram and reordering steps can be parallelized. However, care must be taken to avoid race conditions. Common techniques include:
- Parallel histogram with reduction: Use #pragma omp parallel for reduction(+:histogram[:numBuckets]) to safely update the histogram from multiple threads.
- Local histograms: Each thread computes its own local histogram; the local histograms are then merged.
- Parallel prefix sum: Use OpenMP's scan directive (OpenMP 5.0+) or a manual parallel prefix sum for global bucket positions.
For reordering, each thread can process a portion of the input array and write to the correct positions in the output array using the global offsets.
Step-by-Step Implementation
1. Setting Up OpenMP
Add -fopenmp to your compiler flags (e.g., g++ -fopenmp main.cpp). Set the number of threads via omp_set_num_threads() or the OMP_NUM_THREADS environment variable.
#include <omp.h>
int main() {
omp_set_num_threads(4);
// ...
}

2. Parallel Histogram
Suppose we have an array arr of size n and want to compute the histogram for the current group of bits. We can use an array-section reduction clause:
int histogram[16] = {0};
#pragma omp parallel for reduction(+:histogram[:16])
for (int i = 0; i < n; i++) {
int bucket = (arr[i] >> shift) & 0xF;
histogram[bucket]++;
}

This works well for a small number of buckets. For larger bucket counts, per-thread local histograms may reduce the reduction overhead.
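A local-histogram variant might look like this (a sketch; the function name is illustrative, and numBuckets is assumed to be a power of two so the mask works):

```cpp
#include <omp.h>
#include <vector>

// Each thread counts into a private histogram, then merges it into the
// shared one once, instead of synchronizing on every single increment.
void local_histograms(const int* arr, int n, int shift, int numBuckets,
                      std::vector<int>& histogram) {
    histogram.assign(numBuckets, 0);
    #pragma omp parallel
    {
        std::vector<int> local(numBuckets, 0);   // private to this thread
        #pragma omp for nowait
        for (int i = 0; i < n; i++) {
            local[(arr[i] >> shift) & (numBuckets - 1)]++;
        }
        #pragma omp critical                     // one merge per thread
        for (int b = 0; b < numBuckets; b++) histogram[b] += local[b];
    }
}
```

The critical section is entered once per thread rather than once per element, so its cost is negligible for large n.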
3. Parallel Prefix Sum
The prefix sum computes the starting positions for each bucket. This is a sequential dependency, but we can parallelize it using a two-phase approach: first compute partial sums per thread, then combine.
Alternatively, use OpenMP's scan directive if available (OpenMP 5.0+). For simplicity, a sequential prefix sum is often acceptable because the number of buckets is small.
int prefix[17] = {0};
for (int i = 0; i < 16; i++) {
prefix[i+1] = prefix[i] + histogram[i];
}

4. Parallel Reordering
Create a temporary array temp. Use the prefix sums to determine where each element goes. Each thread processes a contiguous chunk of the input and writes through per-thread positions derived from the global offsets. Atomic updates to shared positions would also avoid races, but they destroy the stable ordering that Radix Sort depends on across passes.
int *temp = new int[n];
int max_threads = omp_get_max_threads();
// local_hist[t][b] counts elements of bucket b in thread t's chunk
std::vector<std::array<int, 16>> local_hist(max_threads);
#pragma omp parallel
{
    int tid = omp_get_thread_num();
    int nthreads = omp_get_num_threads();
    int chunk = n / nthreads;
    int start = tid * chunk;
    int end = (tid == nthreads - 1) ? n : start + chunk;
    // Phase 1: count this chunk's bucket occupancy
    local_hist[tid].fill(0);
    for (int i = start; i < end; i++) {
        local_hist[tid][(arr[i] >> shift) & 0xF]++;
    }
    #pragma omp barrier
    // Phase 2: thread tid starts writing bucket b at the global bucket
    // start plus the counts contributed by threads 0..tid-1
    int local_pos[16];
    for (int b = 0; b < 16; b++) {
        local_pos[b] = prefix[b];
        for (int t = 0; t < tid; t++) local_pos[b] += local_hist[t][b];
    }
    for (int i = start; i < end; i++) {
        int bucket = (arr[i] >> shift) & 0xF;
        temp[local_pos[bucket]++] = arr[i];
    }
}
// copy temp back to arr
memcpy(arr, temp, n * sizeof(int));
delete[] temp;

With per-thread histograms, each thread writes to a disjoint, contiguous slice of every bucket, so the pass stays stable and needs no atomics; the only synchronization is the barrier between the counting and writing phases. (The snippet assumes <vector>, <array>, and <cstring> are included.)
Optimization Tips
- Cache efficiency: Process data in cache-friendly blocks.
- Load balancing: Ensure each thread gets roughly equal work.
- Reduce false sharing: Pad histogram arrays to avoid cache line contention.
- Choose the bits per pass appropriately: 4 or 8 bits per pass often work well; more bits mean fewer passes but larger histograms.
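For the false-sharing tip, one common sketch is to pad each thread's histogram to a cache-line boundary (64 bytes is assumed here, not queried; the struct name is illustrative):

```cpp
#include <vector>

// One 16-counter histogram per thread, aligned so that two threads'
// histograms never share a 64-byte cache line.
struct alignas(64) PaddedHistogram {
    int counts[16] = {0};   // 16 * 4 bytes = exactly one assumed cache line
};

// Sum bucket b across all per-thread histograms.
int bucket_total(const std::vector<PaddedHistogram>& per_thread, int b) {
    int sum = 0;
    for (const auto& h : per_thread) sum += h.counts[b];
    return sum;
}
```

Without the alignment, adjacent threads' counters can land on the same cache line, and every increment forces the line to bounce between cores even though no data is logically shared.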
Trend-Inspired Analogy: Sorting Game Leaderboards
Imagine you are developing a real-time leaderboard for a popular battle royale game like Fortnite. Thousands of players' scores need to be sorted every second to display the top players. Using a parallel Radix Sort with OpenMP on a multi-core server can handle this efficiently. Each pass of Radix Sort is like a round of sorting players by a specific digit of their score. With OpenMP, you can assign each core to process a subset of players, compute histograms, and reorder them, ensuring the leaderboard updates quickly even under heavy load.
Testing and Debugging
Start with a small array (e.g., 1000 elements) and compare your parallel sort with a sequential version. Use assert to verify correctness. Gradually increase the array size and number of threads. Monitor speedup using omp_get_wtime(). Common pitfalls include incorrect bucket indexing, off-by-one errors in prefix sums, and race conditions in the reordering step.
Conclusion
Parallel Radix Sort with OpenMP is a powerful technique for sorting large datasets. By understanding histogram computation, prefix sums, and parallel reordering, you can achieve significant speedups on multi-core CPUs. Apply these concepts to your CME213 assignment and beyond, whether you're sorting data for AI models, financial transactions, or gaming leaderboards.