Shared Hash Lookup in Perl

2007-09-14 02:47PM PDT/Home

Philip Aaronson

Lucky me, I got the dirty job of trying to enhance some old Perl code I didn't write, whose author has since moved on to greener pastures. We do what we've got to do. The issue was that it was running out of memory.

For a simplified version of the problem, consider a Perl program which forks off n processes, each looking up values in its own identical copy of a hash. At some point, either when the hash gets large enough or when the number of processes gets large enough, or both, you run out of memory. Which gets you thinking: damn, it would be really worthwhile to have a single copy of that hash, and not n. Of course, right?
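Here's a rough sketch of that setup; the hash contents and worker count are made up for illustration:

    use strict;
    use warnings;

    # Stand-in for the real lookup table; imagine something much larger.
    my %lookup;
    $lookup{$_} = $_ * 2 for 1 .. 1_000_000;

    my $n = 8;    # number of worker processes
    for (1 .. $n) {
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {
            # Child process: copy-on-write helps briefly, but as Perl touches
            # the hash (reference counts change), each child can end up paying
            # for its own copy.
            my $value = $lookup{ int(rand(1_000_000)) + 1 };
            exit 0;
        }
    }
    wait() for 1 .. $n;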

Which is really a very nice fundamental question for any language. I tried a couple of different approaches to fix this in Perl:

  1. tie the hash to a dbm file: 10x slower
  2. create threads and put the hash in shared memory: 6x slower

Nothing was really satisfactory. While they both "solved" the memory issue, they traded away too much. The problem with the dbm approach is that we're doing a read from disk on every lookup. The problem with threading and a shared hash in Perl is most likely overhead. Rough sketches of both approaches are below.
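Tying the hash to a dbm file looks roughly like this (DB_File is one option; SDBM_File or GDBM_File work the same way, and the file name and key here are made up):

    use strict;
    use warnings;
    use Fcntl;      # O_RDWR, O_CREAT
    use DB_File;    # exports $DB_HASH

    my %lookup;
    tie %lookup, 'DB_File', 'lookup.db', O_RDWR | O_CREAT, 0666, $DB_HASH
        or die "Cannot tie lookup.db: $!";

    # Forked children now share the file on disk instead of duplicating the
    # hash in memory, but every lookup goes through the DBM layer.
    my $value = $lookup{'some_key'};

    untie %lookup;

And the threads version, with the hash in shared memory (again just a sketch, with placeholder data and thread count):

    use strict;
    use warnings;
    use threads;
    use threads::shared;

    # A single copy of the hash, visible to every thread.
    my %lookup : shared;
    $lookup{some_key} = 42;    # placeholder data

    my @workers = map {
        threads->create(sub {
            # Each read goes through threads::shared's tied interface,
            # which is presumably where the overhead comes from.
            return $lookup{some_key};
        });
    } 1 .. 4;

    print $_->join(), "\n" for @workers;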

For work, since no obvious solution jumped out at me, I took a different tack and refactored the hash to make it smaller, which solved the memory problem. But this question is still nagging me.

Update: A very related question is posed by Tim Bray, who explores Erlang. Also of note: a very interesting comment about running a MapReduce implementation on Hadoop by Tom White, as well as a reference to Pig.