Hey there!
I found this code on GitHub which solves the N-body problem using traditional Newtonian gravitational equations. The repository owner, pchapin, has already tried various parallelization methods such as pthreads, OpenMP, MPI, and CUDA.
While going through the programs and running them on different inputs, I discovered a few spots for improvement in the CUDA code. So I compiled the CUDA code with nvcc and profiled it with nvprof.
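If you want to reproduce this, the steps are essentially the following (the source file name below is just a placeholder; use whatever the repository's own build setup expects):

# Build the CUDA version and profile it (source file name is a placeholder)
nvcc -O2 -o CUDA nbody_cuda.cu
nvprof ./CUDA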
Here is the output I got by running nvprof.
$ nvprof ./CUDA
...
==9140== Profiling application: ./CUDA
==9140== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 98.52%  15.9990s      8766  1.8251ms  1.7794ms  1.8741ms  do_calculations(Object*, ObjectDynamics*, ObjectDynamics*)
  0.91%  147.44ms      8766  16.819us  13.153us  43.331us  [CUDA memcpy DtoH]
  0.57%  92.554ms      8767  10.557us  1.7600us  24.609us  [CUDA memcpy HtoD]

==9140== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 98.78%  16.7852s     17533  957.35us  8.1400us  6.0742ms  cudaMemcpy
  0.66%  112.41ms         3  37.470ms  7.3140us  112.21ms  cudaMalloc
  0.50%  84.234ms      8766  9.6090us  6.3370us  2.3381ms  cudaLaunch
  0.04%  6.0357ms     26298     229ns     172ns  11.220us  cudaSetupArgument
  0.02%  3.9477ms      8766     450ns     254ns  389.01us  cudaConfigureCall
  0.00%  341.91us        91  3.7570us     168ns  140.66us  cuDeviceGetAttribute
  0.00%  309.89us         3  103.30us  15.360us  178.17us  cudaFree
  0.00%  70.259us         1  70.259us  70.259us  70.259us  cuDeviceTotalMem
  0.00%  50.684us         1  50.684us  50.684us  50.684us  cuDeviceGetName
  0.00%  12.056us         3  4.0180us     293ns  11.350us  cuDeviceGet
  0.00%  2.4960us         3     832ns     166ns  1.6640us  cuDeviceGetCount
Here we can see that the do_calculations kernel is the elephant in the room: it takes almost all the time, about 16 seconds. The do_calculations function calculates the force on the i-th object by applying Newton's gravitational law with every j-th object. The kernel launches one thread per object, so each thread calculates the total force on its own object.
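In equation form, what each thread accumulates for its object i is simply Newton's law of gravitation summed over all other objects:

F_i = \sum_{j \neq i} \frac{G \, m_i \, m_j}{d_{ij}^{2}} \, \hat{d}_{ij}, \qquad d_{ij} = \lvert r_j - r_i \rvert

This is exactly what the loop below computes: force_magnitude is G·m_i·m_j/d², and dividing by d once more turns the raw displacement into the unit direction vector.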
Code
The following is the part of the code that I improved.
int object_i = blockIdx.x*blockDim.x + threadIdx.x;

for( int object_j = 0; object_j < OBJECT_COUNT; ++object_j ) {
    if( object_i == object_j ) continue;

    Vector3 displacement = cuda_v3_subtract( current_dynamics[object_j].position,
                                             current_dynamics[object_i].position );
    float distance_squared = cuda_magnitude_squared( displacement );
    float distance = sqrt( distance_squared );
    float t1 = object_array[object_i].mass / distance;
    float t2 = object_array[object_j].mass / distance;
    float force_magnitude = ( G * t1 ) * t2;
    Vector3 force = cuda_v3_multiply( ( force_magnitude / distance ), displacement );
    total_force = cuda_v3_add( total_force, force );
}
Some parts that are not relevant have been omitted above.
Some information on the variables:

- current_dynamics: Contains the current position (Vector3) of all objects.
- next_dynamics: Where the updated position (Vector3) of each object is written.
- object_array: Contains the mass of all the objects.
- total_force: A temporary variable that accumulates the total force for the current object. It is later used to update the array next_dynamics.
- object_i: The index of the current object for which we are calculating the force.

The variables current_dynamics, next_dynamics and object_array are stored in CUDA global memory. total_force is a local variable.
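For context, here is a rough sketch of how the surrounding kernel could be structured. The struct layouts and the integration step are my assumptions for illustration; the actual repository code may differ.

// Rough sketch only: struct layouts and the integration step are assumptions,
// not the repository's exact code.
struct Vector3        { float x, y, z; };
struct Object         { float mass;    /* ... */ };
struct ObjectDynamics { Vector3 position; Vector3 velocity; /* ... */ };

__global__ void do_calculations( Object *object_array,
                                 ObjectDynamics *current_dynamics,
                                 ObjectDynamics *next_dynamics )
{
    int object_i = blockIdx.x * blockDim.x + threadIdx.x;
    Vector3 total_force = { 0.0f, 0.0f, 0.0f };

    // ... the pairwise force loop shown above accumulates into total_force ...

    // total_force is then used to update next_dynamics[object_i]
    // (the velocity/position integration details are omitted here).
}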
Optimization
As you can see in the code, current_dynamics is accessed repeatedly through object_j, and the access is contiguous. So only one thing came to my mind: shared memory. I rewrote the code to use shared memory. This was my initial idea:
- Create a shared array of the same size as the global array.
- Then copy the whole array to shared memory.
But this won't work, for two reasons:
- Shared memory is shared only among threads of the same block, so threads in other blocks cannot use it.
- It needs a lot of memory: up to N * NO_OF_BLOCKS * sizeof(Object) in total, since every block would need its own copy (see the sketch after this list).
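To make the problem concrete, the naive version would have looked something like this (a sketch of the rejected idea, not code I kept):

// Naive idea (rejected): give every block its own full copy of the data.
// Shared memory lives per block and is limited to a few tens of KB, so the
// whole array may not fit, every block needs its own copy, and blocks still
// cannot see each other's shared memory.
__shared__ Vector3 shared_position[OBJECT_COUNT];
__shared__ float   shared_mass[OBJECT_COUNT];

for( int k = threadIdx.x; k < OBJECT_COUNT; k += blockDim.x ) {
    shared_position[k] = current_dynamics[k].position;
    shared_mass[k]     = object_array[k].mass;
}
__syncthreads();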
So my idea was to create shared arrays of block size (i.e. the number of threads in a block) and split the calculation into N / block_size sections, where every section makes full use of the shared memory.
__shared__ Vector3 shared_position[BLOCK_SIZE];
__shared__ float shared_mass[BLOCK_SIZE];

int blocks = (OBJECT_COUNT + BLOCK_SIZE - 1) / BLOCK_SIZE;
for (unsigned block_i = 0; block_i < blocks-1; block_i++) {
    shared_position[threadIdx.x] = current_dynamics[block_i * blockDim.x + threadIdx.x].position;
    shared_mass[threadIdx.x] = object_array[block_i * blockDim.x + threadIdx.x].mass;
    __syncthreads();

    for (int object_j = 0; object_j < BLOCK_SIZE; ++object_j) {
        if( object_i == block_i * blockDim.x + object_j ) continue;
        // same force calculation
    }
    __syncthreads();
}
Here I had to use a __syncthreads() before the calculation so that the calculation runs only after the shared memory has been populated, and another __syncthreads() after the calculation to make sure the shared-memory load for the next section starts only after all threads have finished their calculations.
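For completeness, the force calculation hidden behind the comment is the same as before, except that object_j's position and mass now come from the shared arrays. Roughly:

// Inside the tile loop: same math as the original, but object_j's data now
// comes from shared memory instead of global memory.
Vector3 displacement = cuda_v3_subtract( shared_position[object_j],
                                         current_dynamics[object_i].position );
float distance_squared = cuda_magnitude_squared( displacement );
float distance = sqrt( distance_squared );
float t1 = object_array[object_i].mass / distance;
float t2 = shared_mass[object_j] / distance;
float force_magnitude = ( G * t1 ) * t2;
Vector3 force = cuda_v3_multiply( ( force_magnitude / distance ), displacement );
total_force = cuda_v3_add( total_force, force );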
I also need to handle the last section separately, because OBJECT_COUNT may not be an exact multiple of BLOCK_SIZE, so the last section may be only partially filled. That's why the tile loop runs only blocks - 1 times.
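Here is one way the last, possibly partial, section can be handled (my own sketch for illustration; the details can be arranged in several ways):

// Handle the final, possibly partial, tile: only the first `remaining`
// threads of the block have a valid element to load.
int last_block = blocks - 1;
int remaining  = OBJECT_COUNT - last_block * blockDim.x;

if( threadIdx.x < remaining ) {
    shared_position[threadIdx.x] =
        current_dynamics[last_block * blockDim.x + threadIdx.x].position;
    shared_mass[threadIdx.x] =
        object_array[last_block * blockDim.x + threadIdx.x].mass;
}
__syncthreads();

for( int object_j = 0; object_j < remaining; ++object_j ) {
    if( object_i == last_block * blockDim.x + object_j ) continue;
    // same force calculation as above
}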
Now it's time to run this code.
==29238== Profiling application: ./CUDA
==29238== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 96.91%  7.70050s      8766  878.45us  863.91us  901.16us  do_calculations(Object*, ObjectDynamics*, ObjectDynamics*)
  1.91%  151.40ms      8766  17.271us  13.248us  42.848us  [CUDA memcpy DtoH]
  1.18%  93.956ms      8767  10.717us  1.6320us  25.120us  [CUDA memcpy HtoD]
...
Shared memory reduced the kernel time from almost 16 seconds to 7.7 seconds for an input of 1000 objects, which is less than half of the previous time. I didn't expect such a huge improvement, because I didn't expect shared memory to be this much faster than global memory.
Also, as you can see in the code, the array element current_dynamics[object_i] is accessed multiple times from global memory, so I tried storing that value in a local variable. The same goes for object_array. It improved the performance by a few milliseconds, so not a great improvement.
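Concretely, the change amounts to something like this (my paraphrase of the idea):

// Cache this thread's own position and mass in local variables (registers)
// so they are read from global memory only once, not on every iteration.
Vector3 my_position = current_dynamics[object_i].position;
float   my_mass     = object_array[object_i].mass;

// Inside the force loop, use the cached values instead of re-reading them:
Vector3 displacement     = cuda_v3_subtract( shared_position[object_j], my_position );
float   distance_squared = cuda_magnitude_squared( displacement );
float   distance         = sqrt( distance_squared );
float   force_magnitude  = ( G * ( my_mass / distance ) ) * ( shared_mass[object_j] / distance );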
System information
Processor: Intel® Core™ i5-4200U CPU @ 1.60GHz × 4
Memory: 3.6 GiB
Graphics: GeForce GT 740M/PCIe/SSE2
OS Type: 64-bit