Working in Parallel

April 12, 2016 - 8 minutes read - 1607 words

Moving from a sequential to a parallel implementation of an algorithm usually means that something has to change, and it may mean that the whole algorithm has to be rethought. There are some important concepts that help us predict what kinds of changes will make an algorithm perform well when run in parallel. This article is a brief introduction.

Simple Performance

In the sequential case, we typically talk about algorithms in terms of the order of the algorithm (typically called Big O). This lets us talk about how an algorithm behaves as the number of items changes, without worrying too much about specifics of how long it takes to actually do the work.

For example, if we have data stored in an unsorted list structure, and we need to find out if a particular value is in the list, our only option is to check each item in the list until we find the item or reach the end of the list. In Big O notation we call this O(n), indicating that as the length n of the list grows, we should expect the time it takes to search it to increase in a linear way.

The key here is that we don’t care how long it takes to step through the list and look at each element, and we don’t care that sometimes we’ll find it on the first try and exit very quickly. We only care about the general relationship between the size of the list and the time it takes to run. In this case, if the list gets twice as long, the average run time will get about twice as long.

Similarly, if we had an unsorted list, and we had to see if it contains any duplicated elements, we would call this O(n^2), because we are going to have to do n searches through the list, each of which we already said is O(n). Regular math works here, and O(n) times O(n) equals O(n^2). Again, the time to do the comparison or the fact that we might find a duplicate very quickly doesn’t matter; we just care about the fact that if the list gets three times as long, the average run time will be about nine times as long.

Work and Depth

When we move from working sequentially to working in parallel, we start to worry not just about how many steps an algorithm takes, but also how easy it is to do more than one thing at the same time. For example, in our case of searching an unordered list, while we have to step through the whole list, every single comparison that we do is independent of every other, so we could hand out each comparison to a separate processor (if we had that many) and get them all done at once.

To talk about how easy it is to break an algorithm up, we need to talk about more than just a single Big O value for the algorithm. Instead, the convention is to talk about it in terms of “work” and “depth”. Work is the same as we saw earlier; it is how the run time grows as the input size grows. Depth also uses Big O notation, but it uses it to express how easy it is to split the problem up so it can be run in parallel.

We use the term “depth” because we are thinking in terms of “divide and conquer”: we expect to have a recursive function that hands off smaller and smaller pieces of the problem to new versions of itself. In that kind of setup, the flatter (shallower) the recursion, the better, because it means we can spread out across multiple processes more easily. In the case of our search in an unordered list, we would say that depth is O(1), also known as “constant time”. No matter how many extra items there are in the list, we can in theory break it up into that number of pieces.

We can analyze our unsorted list duplicate search in the same way. We need to compare each item in the list with every other item. No problem, we just create n^2 separate tasks, each with a different “left” and “right” index for the comparison, and we can do all the comparisons in one step. So the depth is still O(1).

At this point, hopefully alarm bells will be ringing about how feasible this is, but I’m not quite ready to let the real world intrude yet. So bear with me.

Available Parallelism

So we have some figures for work and depth, in Big O notation. While these figures are useful on their own, there is also value in putting them together. We can define the “available parallelism” as the work divided by the depth; in this case, bigger is better.

With our search through an unsorted list, the work was O(n) and the depth was O(1), giving an available parallelism of O(n). What this says is that while the amount of work we need to do increases in a linear way as the size of the input increases, our ability to do the work in parallel also increases in a linear way, so as long as we have more processors to throw at the problem we can get the work done in about the same amount of time (ignoring for a moment the overhead of splitting up the work).

To give a marginally more realistic example, let’s say that instead of just identifying duplicates, we wanted a count of the number of duplicates for each duplicate we find. Now, instead of just comparing each item in the list to every other item, we also need to keep track of how many matches we’ve found. That means we can’t split up the comparisons completely; the best we can do is split up the “left” side of the comparison so we count the number of matches in parallel for each item in the list. Of course, this is a very poor approach, because we are finding the number of duplicates for every duplicate, which is a lot of wasted work; bear with me.

So for this example, while the work is still O(n^2), the depth is now O(n). This means our available parallelism is O(n), which is still quite good because it tells us that if we have more processors, we can put them to use, but it also means that we would expect the running time of the algorithm to increase in a linear way as the size of the input increases.

It would be nice to avoid that wasted work. Those of you who are experienced in map / reduce have probably noticed that rather than comparing each item to every other item, we can have a map emit a count for each item, then add up the count of each item in a reducer. In fact this is the default WordCount example from Hadoop. The mapper in this case has work O(n), and if the reducer is written correctly this can make the depth O(log n). Hopefully in a future article I can expand on this, since right associativity is cool.

One final note: it is interesting to see that while the work is much worse in the first example, because of all the wasted comparisons, it has slightly better available parallelism than the map / reduce example, because it fully preserves independence of all the comparisons. In this case, that is not enough reason to prefer it, but it does illustrate an important rule in parallel programming, which is that sometimes it is necessary to waste work in order to improve parallelism.

The Real World Will Not Stop Hassling Me

So far, with everything we considered, to have a good parallel algorithm we want to increase our available parallelism, because then we can just throw more hardware at the problem to get it to run faster. Unfortunately, while that can be true, it isn’t the full story.

First of all, in the real world servers cost money, and so does electricity. There is bound to be a limit on our ability to buy more hardware or spawn more cloud instances. At that point, no matter what the theoretical speedup of our algorithm is, we won’t see any actual advantages, because we’ll just be queueing up more tasks than we have cores to run them on.

Second, Big O notation hides a lot of important differences between algorithms. There’s a definite cost in a real implementation in creating a task or kicking off a Goroutine. In most real-world implementations of parallel algorithms, there is a tuning step that happens where the number of parallel tasks we actually spawn is far less than the theoretical maximum for the algorithm. There’s a reason Hadoop lets you carefully configure split size and block size. This illustrates what was absurd about breaking up our duplicate search into n^2 pieces; the overhead of doing that is going to be many times greater than the time it takes to do the single comparison of two items.

Third, as we saw above, to get higher available parallelism we sometimes have to do extra work, not just extra overhead. Sometimes that extra work is justified by the speedup we get; sometimes it is not.

Conclusion

This is a pretty basic discussion of how parallel algorithms are analyzed and compared to each other. If you’d like to see how parallel code might be written in practice, I have a GitHub repository that runs a Net Present Value simulator using Java fork / join RecursiveTask that might be of interest.