COSC407 A8 (25 marks) Focus: CUDA (G): Performance Optimization

The aim of this lab is to implement the reduction algorithm we learned in class. You are required to write
a full CUDA program to compute the sum of a 1D float array with 2^24 (16777216) elements. In this assignment, you
need to implement and time four versions of the kernel that will do the reduction:
• Version (1) uses the shared memory but has more divergence within the warps. This version is
similar to the example on page 46 in the lecture notes “05G_CUDA_BestPractices”.
• Version (2) uses the shared memory and has less divergence. This is similar to the example on
page 49 in the lecture notes.
• Version (3) is similar to version (1) but it does not use the shared memory (writes directly to the
global memory).
• Version (4) is similar to version (2) but it does not use the shared memory.
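As a rough illustration of the difference between versions (1) and (2), the two kernels below sketch the usual interleaved (modulo test) and sequential (tid < stride) reduction patterns. The lecture examples on pages 46 and 49 are not reproduced here, so the kernel names, the dynamically sized shared buffer, and the choice to write one partial sum per block are all assumptions. Both kernels also assume the block size is a power of two.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel for version (1): shared memory, more divergence.
// A thread works only if its index is a multiple of 2*stride, so in the
// early steps every warp contains both working and idle threads.
__global__ void reduceSharedMoreDivergence(const float *in, float *blockSums,
                                           unsigned int n)
{
    extern __shared__ float partial[];        // blockDim.x floats, sized at launch
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + tid;

    partial[tid] = (i < n) ? in[i] : 0.0f;    // one input element per thread
    __syncthreads();

    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
        if (tid % (2 * stride) == 0)
            partial[tid] += partial[tid + stride];
        __syncthreads();                      // kept outside the if so all threads reach it
    }

    if (tid == 0)
        blockSums[blockIdx.x] = partial[0];   // partial sum of this block
}

// Hypothetical kernel for version (2): shared memory, less divergence.
// The working threads are always the first `stride` threads of the block,
// so whole warps become idle together as the stride shrinks.
__global__ void reduceSharedLessDivergence(const float *in, float *blockSums,
                                           unsigned int n)
{
    extern __shared__ float partial[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + tid;

    partial[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    for (unsigned int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        blockSums[blockIdx.x] = partial[0];
}
```

Either kernel would be launched with the shared buffer size as the third launch parameter, for example `reduceSharedLessDivergence<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_blockSums, n);`, where `d_in` and `d_blockSums` are hypothetical device pointers. The per-block sums still have to be added together afterwards, on the host or in a second launch.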
In the host code (the main function), create a 1D array with 2^24 random float numbers from 0 to 255.
Then, launch each of the above four kernels, one at a time. The output should be similar to the one below:
Reducing an array of 16777216 floats on a grid of (32768,1,1) blocks, each block with (512,1,1) threads
Using shared memory, More divergence: GPU time: 227.364 ms GPU sum: 2139115520.00
Using shared memory, Less divergence: GPU time: 37.777 ms GPU sum: 2139115520.00
Using global memory, More divergence: GPU time: 234.256 ms GPU sum: 2139115520.00
Using global memory, Less divergence: GPU time: 45.022 ms GPU sum: 2139115520.00
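One possible shape for the host side that produces a report line like those above is sketched here. It assumes a version (4)-style kernel that reduces each block's slice of the global array in place, CUDA events that time only the kernel launch (not the copies), and a final addition of the per-block sums on the host. All function names are hypothetical, and error checking is left out to keep the sketch short.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical version (4)-style kernel: less divergence, no shared memory.
// Each block reduces its own slice of `data` in place and leaves the block
// sum in the first element of that slice. Assumes blockDim.x is a power of
// two and that the grid covers the array exactly.
__global__ void reduceGlobalLessDivergence(float *data)
{
    unsigned int tid = threadIdx.x;
    float *slice = data + blockIdx.x * blockDim.x;
    for (unsigned int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            slice[tid] += slice[tid + stride];
        __syncthreads();
    }
}

// Fills h_data with n pseudo-random floats in the range [0, 255].
void fillRandom(float *h_data, unsigned int n)
{
    for (unsigned int i = 0; i < n; ++i)
        h_data[i] = 255.0f * ((float)rand() / (float)RAND_MAX);
}

// Copies h_in to the GPU, times one kernel launch with CUDA events, and
// returns the total. The input is copied each time because the kernel
// overwrites its data. n must be a multiple of `threads`.
float timedReduce(const float *h_in, unsigned int n, unsigned int threads,
                  float *ms)
{
    unsigned int blocks = n / threads;          // 16777216 / 512 = 32768
    size_t bytes = (size_t)n * sizeof(float);
    float *d_data = nullptr;
    cudaMalloc(&d_data, bytes);
    cudaMemcpy(d_data, h_in, bytes, cudaMemcpyHostToDevice);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    reduceGlobalLessDivergence<<<blocks, threads>>>(d_data);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);                 // wait until the kernel has finished
    cudaEventElapsedTime(ms, start, stop);      // elapsed time in milliseconds

    // Assumption: the per-block sums are added on the host.
    float *h_out = (float *)malloc(bytes);
    cudaMemcpy(h_out, d_data, bytes, cudaMemcpyDeviceToHost);
    double sum = 0.0;
    for (unsigned int b = 0; b < blocks; ++b)
        sum += h_out[(size_t)b * threads];

    free(h_out);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return (float)sum;
}

// Prints one line in the format of the sample output.
void reportLine(const char *label, float ms, float sum)
{
    printf("%s: GPU time: %.3f ms GPU sum: %.2f\n", label, ms, sum);
}
```

A main function built on this sketch would allocate `1u << 24` floats, call `fillRandom` once, and then call a timing helper and `reportLine` once per kernel, using 512 threads per block to match the grid shown in the sample output.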
In addition, choose any of the four kernels and compare its execution time: (i) once using the
‘multiplication’ operator (*) to compute the ‘stride’, and (ii) another time using the ‘shift’ operator (<<).
Report the difference in time. Note: ignore the error that Visual Studio displays for using __syncthreads().
Your code should still run with this error.
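The comparison could be set up by keeping two builds of the same kernel that differ only in how the stride is doubled. The template flag below is one hypothetical way to do that without duplicating the loop; the kernel name and the in-place global-memory layout are assumptions rather than anything taken from the lecture notes. Compilers commonly turn a multiplication by 2 into a shift on their own, so the measured difference may turn out to be small.

```cuda
#include <cuda_runtime.h>

// Hypothetical more-divergence, global-memory kernel with a compile-time
// switch: UseShift = false doubles the stride with *, true uses <<.
// Assumes blockDim.x is a power of two and the grid covers the array exactly.
template <bool UseShift>
__global__ void reduceGlobalStride(float *data)
{
    unsigned int tid = threadIdx.x;
    float *slice = data + blockIdx.x * blockDim.x;   // this block's elements

    for (unsigned int stride = 1; stride < blockDim.x; ) {
        // The only line that differs between the two variants.
        unsigned int next = UseShift ? (stride << 1) : (stride * 2);
        if (tid % next == 0)
            slice[tid] += slice[tid + stride];
        __syncthreads();
        stride = next;
    }
}
```

The two variants would then be launched as `reduceGlobalStride<false><<<blocks, threads>>>(d_data);` and `reduceGlobalStride<true><<<blocks, threads>>>(d_data);`, each on a fresh copy of the input and each timed the same way as the other kernels.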
Marking guide:
+16 for 4 kernels. The marks are distributed for each kernel as follows:
– +1 for using global/shared memory
– +1 for more/less divergence
– +2 for correctness
+8 for the host code distributed as follows:
– +2 for timing the execution
– +4 for kernel configuration and launch (+1 for each kernel)
– +2 for other code (e.g., copying to/from GPU, freeing up the memory, etc).
+1 for the difference between using * and <<
+3 bonus for creating and using a macro to handle errors for CUDA function calls.
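For the bonus item, an error-handling macro is commonly written along the following lines. The macro name `CUDA_CHECK` and the helper `allocDeviceFloats` are hypothetical; `cudaGetErrorString` is the runtime call that turns an error code into a readable message. This is a minimal sketch that simply prints the error and exits.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical macro: wraps any call that returns cudaError_t and stops the
// program with the file, line, and error text if the call failed.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",              \
                    __FILE__, __LINE__, cudaGetErrorString(err_));    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Example use: allocate and zero a device buffer of `count` floats.
float *allocDeviceFloats(size_t count)
{
    float *d_ptr = nullptr;
    CUDA_CHECK(cudaMalloc(&d_ptr, count * sizeof(float)));
    CUDA_CHECK(cudaMemset(d_ptr, 0, count * sizeof(float)));
    return d_ptr;
}
```

A kernel launch does not return an error code itself, so one option is to follow each launch with `CUDA_CHECK(cudaGetLastError());` and, when timing is not affected, `CUDA_CHECK(cudaDeviceSynchronize());`.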
Submission Instructions
For this assignment, you need to do the following:
1- Compress all your files (the source code and the image) into one zip file and give it a name that
matches your ID (e.g., 1234567.zip).
2- Submit the zipped file to Canvas.
Note that you can resubmit an assignment, but the new submission overwrites the old submission and
receives a new timestamp.