Wednesday, February 13, 2013

obstacle detection from velodyne, a story of code optimization

I started working on an algorithm for detecting obstacles in a velodyne scan (about 140,000 points per revolution across 64 beams). The algorithm is as follows:

1:    for each beam b  
2:        pb ← partner beam (a beam slightly further)
3:        for each point P in beam pb  
4:            y ← distance(P)  
5:            for each point Q in beam b, around P (+/- few degrees)  
6:                x ← distance(Q)  
7:                E ← expected distance for P, as a function of x  
8:                if y < E  
9:                    P is an obstacle  
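The pseudocode above could look like the following in C++ (a minimal sketch, not the original code: `Point`, `label_obstacles`, and the `expected` callback are illustrative names):

```cpp
#include <cmath>
#include <vector>

// One return of one laser beam: its range and angular position.
struct Point {
    double range;            // distance from the sensor
    double angle;            // angular position, in radians
    bool obstacle = false;
};

// beams[b] holds beam b's points; partner[b] is the pre-computed partner
// beam; expected(x, b) gives the expected distance for beam pair b as a
// function of the neighbor's distance x.
void label_obstacles(std::vector<std::vector<Point>>& beams,
                     const std::vector<int>& partner,
                     double (*expected)(double x, int b),
                     double window) {
    for (std::size_t b = 0; b < beams.size(); ++b) {
        std::vector<Point>& cur = beams[b];
        std::vector<Point>& pb = beams[partner[b]];
        for (Point& P : pb) {
            double y = P.range;
            for (const Point& Q : cur) {
                // only neighbors within a few degrees of P
                if (std::fabs(Q.angle - P.angle) > window) continue;
                if (y < expected(Q.range, static_cast<int>(b)))  // the distance test
                    P.obstacle = true;
            }
        }
    }
}
```

This is the naive version, before any of the optimizations described below.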

That can be up to a million distance tests (line 8) depending on the situation. Velodyne data comes once every 100ms and we need to do a lot more once points have been labeled, so this should run in about 10ms. When I started coding it, it was running in a couple of seconds. This is the story of how I brought it down to 12ms on a single core of an average laptop (without using a GPU or OpenMP).

Since the structure of the velodyne is fixed, we can pre-compute the beam association (beam partnering) and the formula for the expected distance, which is a function of the vertical angles of the beams.
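A minimal sketch of what that pre-computation could look like. The post does not give the actual expected-distance formula, so the flat-ground geometry below is only an illustrative stand-in, and all the names are assumptions:

```cpp
#include <cmath>
#include <vector>

// Per-beam constants, computed once at startup.
struct BeamConsts {
    int partner;    // index of the partner beam (slightly further out)
    double ratio;   // expected-distance multiplier: E = ratio * x
};

std::vector<BeamConsts> precompute(const std::vector<double>& vertical_angle) {
    std::vector<BeamConsts> c(vertical_angle.size());
    for (std::size_t b = 0; b + 1 < vertical_angle.size(); ++b) {
        // Assume beams are sorted by vertical angle, so the next beam
        // looks slightly further away.
        c[b].partner = static_cast<int>(b) + 1;
        // On flat ground, range scales with the inverse sine of the
        // depression angle (illustrative geometry only).
        c[b].ratio = std::sin(-vertical_angle[b]) / std::sin(-vertical_angle[b + 1]);
    }
    return c;
}
```

At run time the inner-loop test then costs one multiply and one compare instead of re-deriving the geometry per point.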

Then I realized that once a point had been labeled as an obstacle, there was no need to compare it with its remaining neighbors. This cut the execution time from a few seconds to about 500ms.
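In code, the early exit is just a `break` out of the inner loop once the test fires (a sketch with assumed names; `expected` stands in for the pre-computed expected-distance function):

```cpp
#include <vector>

struct Point { double range; bool obstacle = false; };

// Test one point P of the partner beam against its neighbors in beam b.
void test_point(Point& P, const std::vector<Point>& neighbors,
                double (*expected)(double x)) {
    for (const Point& Q : neighbors) {
        if (P.range < expected(Q.range)) {
            P.obstacle = true;
            break;  // already labeled: no need to compare P with anyone else
        }
    }
}
```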

I organized the data in a 2D array, with the beam index as the first dimension. For each point I stored its angular position, so that I could search for neighbors by angular distance. After running it through a profiler (gperftools) I realized that this organization was slowing the algorithm down a lot: the algorithm didn't really know where to look for neighboring points, so some searching was necessary, and the angle comparisons were too slow.

So I reorganized the points into a 3D array, with points binned into small groups according to their angular position, measured in encoder ticks (an integer). I am using 700 bins, leading to a couple of points per bin. Then finding neighboring points is straightforward: knowing the encoder value for a point, simply divide it by the max value and multiply by the number of bins. This led to a huge speedup, down to about 35ms.
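The bin lookup then reduces to one integer multiply and divide. A sketch: only the 700 bins come from the post, the tick count per revolution is an assumption:

```cpp
#include <cstdint>

constexpr int kTicksPerRev = 36000;  // encoder ticks per revolution (assumed)
constexpr int kBins = 700;           // angular bins, a couple of points each

// Knowing a point's encoder value, divide by the max value and scale by
// the number of bins to get its bin index (integer arithmetic only).
inline int bin_index(int encoder_ticks) {
    return static_cast<int>(static_cast<std::int64_t>(encoder_ticks) * kBins
                            / kTicksPerRev);
}
```

Neighbor search then becomes a scan over a handful of adjacent bins instead of a search through the whole beam.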

Again, I inspected the code with the profiler and noticed that it was spending a lot of time accessing the points in the array. So I cached the pointer values when the same location was used several times.

 Obj &p = table[i][j];  
 for ( ... )  
   do something with p  

is faster than

 for ( ... )  
   do something with table[i][j]  

This brought down the execution time to about 20ms.

Letting the compiler optimize the code (until now I had been running a debug build) brought the execution time down to 12ms!
