Hey there!

This is a continuation the open source project I worked on the previous blog.

Here I found few hotspots to optimize the OpenMP code in the same repository.

### Code

Following is the part of the code which I improved.

#pragma omp parallel for for( object_i = 0; object_i < OBJECT_COUNT; ++object_i ) { Vector3 total_force = { 0.0, 0.0, 0.0 }; for( int object_j = 0; object_j < OBJECT_COUNT; ++object_j ) { if( object_i == object_j ) continue; Vector3 displacement = v3_subtract( current_dynamics[object_j].position, current_dynamics[object_i].position ); double distance_squared = magnitude_squared( displacement ); double distance = sqrt( distance_squared ); double force_magnitude = ( G * object_array[object_i].mass * object_array[object_j].mass ) / distance_squared; Vector3 force = v3_multiply( (force_magnitude / distance ), displacement ); total_force = v3_add( total_force, force ); } Vector3 acceleration = v3_divide( total_force, object_array[object_i].mass); Vector3 delta_v = v3_multiply( TIME_STEP, acceleration ); Vector3 delta_position = v3_multiply( TIME_STEP, current_dynamics[object_i].velocity ); next_dynamics[object_i].velocity = v3_add( current_dynamics[object_i].velocity, delta_v ); next_dynamics[object_i].position = v3_add( current_dynamics[object_i].position, delta_position ); }

In the code snippet above, the force is calculated for every object (`i`

) by appling the Newton’s gravitational equation with every other object (`j`

).

#### Some information on the variables.

`current_dynamics`

: Contains the current position (Vector3) of all objects.

`next_dynamics`

: Where the updated position (Vector3) of the object is written to.

`object_array`

: This contains the mass of all the objects.

### Optimization

I applied Newton’s most famous law – Newton’s 3rd law. According to Newton’s 3rd law

To every action there is always opposed an equal reaction

The force on object `i`

because of object `j`

is same as the force on object `j`

because of object `i`

. So there is repeated force calculation going on here in the code. So I change the inner for loop to start from `object_i + 1`

instead of `0`

. Here comes the catch, doing this will unbalance the distribution of the work across threads. The threads which gets lower index values will have to lot of work compared to threads getting higher index values. So I changed the scheduling of the parallel for loop to **dynamic scheduling**.

Also while calculating the force for object `i`

due to object `j`

I need to add the force to the object `j`

. So I created an additional array to store it. Here I am storing the acceleration instead the force to avoid dividing it by mass again and again.

Also, I had to split the updating of the `next_dynamics`

with the new position into an another `for`

loop because I need to wait for all threads to complete its work of calculating the force.

Here is my code after the improvement.

#pragma omp parallel for schedule(dynamic) for( object_i = 0; object_i < OBJECT_COUNT; ++object_i ) { Vector3 total_acc = { 0.0, 0.0, 0.0 }; for( int object_j = object_i+1; object_j < OBJECT_COUNT; ++object_j ) { Vector3 displacement = v3_subtract( current_dynamics[object_j].position, current_dynamics[object_i].position ); double distance_squared = magnitude_squared( displacement ); double distance = sqrt( distance_squared ); double acc_magnitude = ( G * object_array[object_j].mass ) / distance_squared; Vector3 accel = v3_multiply( (acc_magnitude / distance ), displacement ); total_acc = v3_add( total_acc, accel ); acc[object_j] = v3_add( acc[object_j], v3_multiply( object_array[object_i].mass / object_array[object_j].mass, accel )); } acc[object_i] = v3_add( acc[object_i], total_acc ); } #pragma omp parallel for for( object_i = 0; object_i < OBJECT_COUNT; ++object_i ) { Vector3 acceleration = acc[object_i]; Vector3 delta_v = v3_multiply( TIME_STEP, acceleration ); Vector3 delta_position = v3_multiply( TIME_STEP, current_dynamics[object_i].velocity ); next_dynamics[object_i].velocity = v3_add( current_dynamics[object_i].velocity, delta_v ); next_dynamics[object_i].position = v3_add( current_dynamics[object_i].position, delta_position ); }

Now time to run this code.

real 0m5.564s user 0m21.012s sys 0m0.008s

Initially, it was

real 0m7.667s user 0m28.116s sys 0m0.000s

So this optimization improved the performance of the code from almost 7.7 secs to 5.5 secs. This code simulates the motion over a given time. Here the time was `1/16`

when compared the CUDA.

### System information

Processor: Intel® Core™ i5-4200U CPU @ 1.60GHz × 4 Memory: 3.6 GiB Graphics: GeForce GT 740M/PCIe/SSE2 OS Type: 64-bit

Amazing post Shreyas, you have done the real work. 🙂

LikeLike

Thanks. 🙂

LikeLike