The other day I was writing some code that needed to build up a list of strings in a large file, and identify any duplicates that occurred. On average, the file would contain about 10000 unique strings, and I’d be looking at finding at least 500 duplicates per string. That means that throughout the parsing of my file, I’d have to add about 10000 strings to the list, and probably do about 5000000 lookups on them. This needed to be a fast process…
This blog entry is about an awesome trick that Mr Alex Evans taught me when I went and asked him for advice about this problem. It’s such a good lesson I thought I’d write it up and shove it on the blog. I’d come up with various ways I could approach this – hash tables, maps, binary trees, specialised string trees and couldn’t pick how to attack it. His advice was that if an algorithm can do anything at all that you don’t need it to do, it’s probably slower than it needs to be.
A perfect solution to a problem will do exactly what it needs to – no more, no less. That way you know that even if the code isn’t perfectly optimised, it’s not spending time doing things it doesn’t need to. It occurred to me that even nature follows this approach – a perfectly evolved creature for some scenario will be exactly as good as it needs to be at things – anything extra is a waste of valuable resources. Cheetahs don’t run at 150mph because 70mph is enough! Clearly it’s a good way of thinking of things, so I set about this approach for my string searching.
First, my requirements:
- I only ever add to the list – I never remove
- I never need to extract any sorted data from it or anything fancy like that
- I never need to iterate over the list
- I’m going to be doing on average 500 times more lookups than I am adds, and the majority are likely to be successful (i.e. they will find an entry)
- The strings themselves will always be in memory (as they’re loaded when I load the file), so they don’t need copying when adding
- I have loads and loads of memory to burn!
I started off with a std::map, which could do:
Adds, gets, deletes
- Any sort of types for the key and value, with any sort of comparison function
- Easy iteration of the list
- Easy extraction of sorted data from the list
- Automatically manages memory allocation
Clearly the map does way too much – it supports lots of operations I don’t need, and can deal with any arbitrary types, whereas I just need a null terminated string for a key and a single unsigned int for the value. To cut a long story short, after several rounds of following Alex’s advice, I found myself with:
- A custom hash table – uses a very simple 64 bit hash algorithm on the string, then uses the bottom 20 bits as an index into a table
- Each entry in the table is a list of strings, their full 64 bit hash, their length, and their corresponding unsigned int.
- This means my lookups can identify mismatches very fast – only in the event of 2 strings having exactly the same hash and length will I need to do an actual strcmp to verify they’re identical. The odds are I’ll end up doing 1 strcmp per lookup
- All memory in the table is allocated linearly in one big mega chunk (which gets bigger if it really has to). This avoids lots of dynamic allocation, but prevents me being able to delete things.
- Specialised to deal with strings + unsigned ints – doesn’t handle arbitrary class types at all
- It’s only 150 lines of code
The result – parsing my file and doing all the string lookups goes from 22.5s to 1.8s.
Anyway, the moral of the story and the lesson this blog post is all about: