N-body simulation in CUDA

Hey there!

I found this code on GitHub which solves the N-body problem using the traditional Newtonian gravitational equations. The repository owner, pchapin, has already tried various parallelization methods: pthreads, OpenMP, MPI, and CUDA.

While going through the programs and running them for different inputs, I discovered a few spots for improvement in the CUDA code. So I compiled the CUDA code with nvcc and ran it under nvprof.

Here is the output I got by running nvprof.

$ nvprof ./CUDA
...
==9140== Profiling application: ./CUDA
==9140== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 98.52%  15.9990s      8766  1.8251ms  1.7794ms  1.8741ms  do_calculations(Object*, ObjectDynamics*, ObjectDynamics*)
  0.91%  147.44ms      8766  16.819us  13.153us  43.331us  [CUDA memcpy DtoH]
  0.57%  92.554ms      8767  10.557us  1.7600us  24.609us  [CUDA memcpy HtoD]

==9140== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 98.78%  16.7852s     17533  957.35us  8.1400us  6.0742ms  cudaMemcpy
  0.66%  112.41ms         3  37.470ms  7.3140us  112.21ms  cudaMalloc
  0.50%  84.234ms      8766  9.6090us  6.3370us  2.3381ms  cudaLaunch
  0.04%  6.0357ms     26298     229ns     172ns  11.220us  cudaSetupArgument
  0.02%  3.9477ms      8766     450ns     254ns  389.01us  cudaConfigureCall
  0.00%  341.91us        91  3.7570us     168ns  140.66us  cuDeviceGetAttribute
  0.00%  309.89us         3  103.30us  15.360us  178.17us  cudaFree
  0.00%  70.259us         1  70.259us  70.259us  70.259us  cuDeviceTotalMem
  0.00%  50.684us         1  50.684us  50.684us  50.684us  cuDeviceGetName
  0.00%  12.056us         3  4.0180us     293ns  11.350us  cuDeviceGet
  0.00%  2.4960us         3     832ns     166ns  1.6640us  cuDeviceGetCount

Here we can see that the do_calculations kernel is the elephant in the room: it takes practically all the time, about 16 seconds.

The do_calculations kernel calculates the force on the i-th object by applying Newton's law of gravitation against every j-th object. The kernel launches one thread per object, so each thread accumulates the total force on its own object.
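To make the "one thread per object" part concrete, here is a rough sketch of how such a kernel might be launched. BLOCK_SIZE, the device pointer names, and the rounded-up grid size are my assumptions for illustration, not necessarily the repository's actual launch code.

// Hedged sketch: one thread per object, grid size rounded up so no object is missed.
int grid_size = (OBJECT_COUNT + BLOCK_SIZE - 1) / BLOCK_SIZE;
do_calculations<<<grid_size, BLOCK_SIZE>>>(dev_object_array, dev_current_dynamics, dev_next_dynamics);
cudaDeviceSynchronize();   // or let the following cudaMemcpy act as the synchronization point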

Code

The following is the part of the code that I improved.

int object_i = blockIdx.x*blockDim.x + threadIdx.x;
for( int object_j = 0; object_j < OBJECT_COUNT; ++object_j ) {
    if( object_i == object_j ) continue;   // skip the self-interaction
    // Vector from object i to object j and its length.
    Vector3 displacement = cuda_v3_subtract( current_dynamics[object_j].position, current_dynamics[object_i].position );
    float distance_squared = cuda_magnitude_squared( displacement );
    float distance = sqrt( distance_squared );
    // ( G * t1 ) * t2 == G * m_i * m_j / distance^2, Newton's law of gravitation.
    float t1 = object_array[object_i].mass / distance;
    float t2 = object_array[object_j].mass / distance;
    float force_magnitude = ( G * t1 ) * t2;
    // Dividing by distance once more normalizes displacement, giving the force vector.
    Vector3 force = cuda_v3_multiply( (force_magnitude / distance ), displacement );
    total_force = cuda_v3_add( total_force, force );
}

There are some parts that are not relevant here, which I have skipped in the snippet above.

Some information on the variables.

current_dynamics: Contains the current position (Vector3) of all objects.

next_dynamics: Where the updated position (Vector3) of the object is written to.

object_array: This contains the mass of all the objects.

total_force: A temporary variable that accumulates the total force on the object. It is later used to update the object's entry in next_dynamics.

object_i: The index of the current object for which we are calculating the force.

The variables current_dynamics, next_dynamics, and object_array are stored in CUDA global memory. total_force is a local variable.
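For readers who want to follow along, here is a minimal sketch of what these types might look like, based only on the fields the snippets use; the actual definitions in the repository may differ.

// Minimal guesses at the data structures used by the snippets. Field names and
// any extra members are assumptions based on how the variables are used above.
struct Vector3 {
    float x, y, z;
};

struct Object {
    float mass;              // only the mass is used in the force calculation
};

struct ObjectDynamics {
    Vector3 position;        // read for every (i, j) pair
    Vector3 velocity;        // presumably updated along with the position each step
};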

Optimization

As you can see in the code, current_dynamics is accessed repeatedly through object_j, and the accesses are contiguous. Only one thing struck my mind: shared memory. So I rewrote the code to use shared memory. This was my initial idea:

  1. Create a shared array of the same size as the global array.
  2. Then copy the whole array to shared memory.

But that won't work, for two reasons:

  1. Shared memory is shared only across threads in the same block, so threads in other blocks cannot use it.
  2. It needs a lot of memory: since every block would hold its own copy of the whole array, it needs up to N * NO_OF_BLOCKS * sizeof(Object) in total.

So my idea was to create shared arrays of block size (i.e. the number of threads in a block) and split the calculation into N / block size sections, where every section fills and then fully uses the shared memory.

 __shared__ Vector3 shared_position[BLOCK_SIZE];
 __shared__ float shared_mass[BLOCK_SIZE];
 int blocks = (OBJECT_COUNT + BLOCK_SIZE - 1) / BLOCK_SIZE;
 for (unsigned block_i = 0; block_i < blocks-1; block_i++) {
     // Each thread loads one object of the current section into shared memory.
     shared_position[threadIdx.x] = current_dynamics[block_i * blockDim.x + threadIdx.x].position;
     shared_mass[threadIdx.x] = object_array[block_i * blockDim.x + threadIdx.x].mass;
     __syncthreads();   // wait until the whole section has been loaded
     for (int object_j = 0; object_j < BLOCK_SIZE; ++object_j) {
         if( object_i == block_i * blockDim.x + object_j ) continue;   // skip the self-interaction
         // same force calculation as before, but reading
         // shared_position[object_j] and shared_mass[object_j] instead of global memory
     }
     __syncthreads();   // wait before the next section overwrites shared memory
 }

Here I had to use a __syncthreads() so that the calculation part runs only after the shared memory has been populated, and another __syncthreads() after the calculation to make sure the shared memory load for the next section starts only after all threads have finished their calculations.

Also, I need to handle the last section separately, since it may be only partially full. That's why the loop above runs only blocks - 1 times. A rough sketch of how that last section can be handled is below.
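This is only a sketch of one way to handle the final, possibly partial, section; the exact code I used differs, and tail_start and tail_count are names I made up for illustration.

// Handle the last section, which may contain fewer than BLOCK_SIZE objects.
int tail_start = (blocks - 1) * blockDim.x;
int tail_count = OBJECT_COUNT - tail_start;            // objects left in the last section
if (threadIdx.x < tail_count) {                        // only load valid objects
    shared_position[threadIdx.x] = current_dynamics[tail_start + threadIdx.x].position;
    shared_mass[threadIdx.x] = object_array[tail_start + threadIdx.x].mass;
}
__syncthreads();
for (int object_j = 0; object_j < tail_count; ++object_j) {
    if (object_i == tail_start + object_j) continue;
    // same force calculation, using shared_position[object_j] and shared_mass[object_j]
}
__syncthreads();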

Now it's time to run this code.

==29238== Profiling application: ./CUDA
==29238== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 96.91%  7.70050s      8766  878.45us  863.91us  901.16us  do_calculations(Object*, ObjectDynamics*, ObjectDynamics*)
  1.91%  151.40ms      8766  17.271us  13.248us  42.848us  [CUDA memcpy DtoH]
  1.18%  93.956ms      8767  10.717us  1.6320us  25.120us  [CUDA memcpy HtoD]
...

Shared memory reduced the kernel time from almost 16 seconds to 7.7 seconds for an input of 1000 objects, less than half of the previous time. I didn't expect to get such a huge improvement, because I didn't expect shared memory to be this much faster than global memory.

Also, as you can see in the code, the array element current_dynamics[object_i] is read from global memory on every iteration, so I stored that value in a local variable, and did the same for the mass from object_array. It improved the performance by a few milliseconds, so not a great improvement.
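For completeness, the change looks roughly like this; my_position and my_mass are illustrative names, and the surrounding loop is the shared-memory version shown earlier.

// Cache the per-thread values in registers instead of re-reading global memory.
Vector3 my_position = current_dynamics[object_i].position;   // read once, before the loops
float my_mass = object_array[object_i].mass;                 // likewise read once
// ... then inside the inner loop over object_j:
Vector3 displacement = cuda_v3_subtract( shared_position[object_j], my_position );
float distance = sqrt( cuda_magnitude_squared( displacement ) );
float t1 = my_mass / distance;
float t2 = shared_mass[object_j] / distance;
float force_magnitude = ( G * t1 ) * t2;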

System information

Processor:  Intel® Core™ i5-4200U CPU @ 1.60GHz × 4
Memory:     3.6 GiB
Graphics:   GeForce GT 740M/PCIe/SSE2
OS Type:    64-bit
