The Need For Speed 2: OpenMP

I’ve been getting frustrated with just how long each run is taking, which is currently about a week. This means that every time that I make a small change, its several days before I know whether it’s successful or not.

I’ve been through the code looking to remove anything that is not essential and pre-compute as much data as possible so it doesn’t need doing in the main CPU intensive evolutionary generations. I’ve reduced the maximum syntax tree depth to 6 to limit the complexity of solutions and  therefore the processing time. I’ve filtered the training data set to exclude the quiet hours overnight when I don’t intend to trade but still it is taking far too long.

Then I discovered OpenMP – a standard for parallel processing in C that is supported by GCC. This means that by adding a simple pragma to my code in the main evaluation loop, the code will automatically run across all the CPU cores in my machine (only 2) giving an instant doubling of performance (almost) :-

 while (g < num_generations)

  {
     #pragma omp parallel for private(i, p)

     for (i=0; i<num_traders; i++)

     {

       p = evaluate_trader(&traders[i], 0);

       traders[i].profit = p;

     }

     sort_t(traders, 0, num_traders);

Notice the “#pragma omp parallel for”  ?  This splits the loop into two separate loops running in parallel in two separate threads on two cores. Since this loop has 256 iterations and no dependencies from one iteration to the next this is ideal for parallelism and could be split many more ways if I had more CPU’s / cores.

Comments

3 Responses to “The Need For Speed 2: OpenMP”

  1. Nick says:

    Maybe Nvidia CUDA is something you would like to try out.

  2. andrew says:

    Hi Nick,

    Thanks for the comment, I have been looking at CUDA and it looks amazing on paper. Unfortunately I haven’t got any hardware suitable at the moment and I’m also slightly concerned about the memory addressing limitations of the GPU’s and the long latency of accessing main memory. I’ve heard that with some problems a speed-up of only around 3.5 times was achieved on a 240 core GPU whereas I could get that with no rework by moving to a quad core CPU for similar cost. I’m sure it all hinges on whether you can partition your data up into small enough packets to fully enable independant parallel processing and when you can’t you’ll get disappointing results. But it is definately something that I want to look at.

    Andrew

  3. Nic Watson says:

    There are two dimensions along which one can parallelize computation: one can distribute the data across the computing nodes, or you can distribute the code across the computing nodes. OpenMP is a library for distributing data across nodes. GP screams for distributing code across compute nodes: each node gets a set of individuals to evaluate, and they only need inter-node communication for crossover/sorting.

Leave a Reply