OK, I decided to revive this blog from hibernation. I guess it is more like a website than a real blog that you can subscribe to. Anyway, I was recently evaluating a pretty good search engine and hit a performance issue that might be interesting to some people. The issue was with the speed of indexing data for search: after some basic perf tuning I reached a certain speed (in "documents per second"), but it was still not sufficient for me. So I did more parameter tweaking to see if it could be improved, but nothing helped. It looked like I had hit its upper perf limit.
This graph shows how different resources were utilized during most of the indexing time (there were some minor variations, but this graph shows the most representative part):
As you can see, the most used resource was the disk IO (not a surprise). Specifically, the engine was writing data most of the time. This makes sense, since it is creating the search index on the disk :). What's interesting is that even though the search indexer needs to do word breaking and some textual processing, which are processor-intensive operations, the CPU was idle most of the time. The hardware that I had was: Intel quad-core 2.3 GHz, 8GB RAM and two 500GB SATA hard drives, with one hard drive dedicated to indexing (the OS ran on the other drive). So my computer had plenty of CPU power, a medium amount of memory and pretty slow commodity hard drives. When I ran the same indexing on a better SCSI drive it worked faster (as expected).
I started thinking about how to make it run faster on my existing SATA drive and tried different variations of parameters (increased memory caching, changed the number of threads, changed the number of documents per indexing batch, etc.), but it had almost zero effect on the speed of indexing. Then I stumbled on some parameters that controlled compression of index chunks and of some temporary text files used during indexing. The manual for this search engine said clearly that if I turned compression "off", indexing should run faster. This made sense, since I remembered that in the "old" days compression was expensive, so I turned it off. To my big surprise, the indexing process became much slower without compression. When I turned it back on, the indexing speed went back up.
At this point I realized that in this setup, where the biggest bottleneck is the disk IO and where CPU power and memory are in abundance, compression can actually help speed up disk-intensive operations: data gets compressed or decompressed in memory (indexing chunks in my case were pretty small), and the smaller compressed form is then written to or read from the disk much faster. If you have plenty of CPU power and your data compresses well, the time to compress or decompress small chunks of data in memory is negligible compared to the time needed to write them to or read them from the disk. My intuition from the older days, when compression was costly and was all about saving hard disk space, was wrong. Many modern systems are bound by the disk IO and not by CPU or memory, so compression can improve the performance of disk-intensive applications and servers.
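To make this reasoning concrete, here is a rough back-of-envelope model in Python. It is only a sketch: the function names are mine, the numbers (a disk writing 50 MB/s, a compressor handling 200 MB/s, data shrinking to 40% of its size) are made-up illustrations and not measurements from my machine, and it pessimistically assumes that compression and disk writes do not overlap.

```python
def write_time_seconds(size_mb, disk_mb_s, ratio=1.0, compress_mb_s=None):
    """Estimate the time to persist size_mb of raw data.

    ratio         -- compressed size / raw size (1.0 means no savings)
    compress_mb_s -- compressor throughput on raw data; None means no compression
    Assumes CPU work and disk IO happen one after another (no overlap).
    """
    cpu_time = 0.0 if compress_mb_s is None else size_mb / compress_mb_s
    io_time = size_mb * ratio / disk_mb_s
    return cpu_time + io_time


def break_even_compress_mb_s(disk_mb_s, ratio):
    """Minimum compressor throughput at which compression starts to pay off.

    Derived from size/c + ratio*size/d < size/d, i.e. c > d / (1 - ratio).
    """
    if not 0 <= ratio < 1:
        raise ValueError("compression only helps when 0 <= ratio < 1")
    return disk_mb_s / (1.0 - ratio)


# Hypothetical numbers, for illustration only.
uncompressed = write_time_seconds(1000, disk_mb_s=50)
compressed = write_time_seconds(1000, disk_mb_s=50, ratio=0.4, compress_mb_s=200)
print(f"uncompressed: {uncompressed:.1f}s, compressed: {compressed:.1f}s")
print(f"break-even compressor speed: {break_even_compress_mb_s(50, 0.4):.1f} MB/s")
```

With these assumed numbers the compressed path wins as long as the compressor is faster than about 83 MB/s. On a faster disk, or with data that compresses poorly, the break-even point moves up and compression can become a net loss.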
Later, I found that other people knew this fact all along. For example, in the excellent book "Introduction to Information Retrieval", I found the following:
“The second more subtle advantage of compression is faster transfer of data from disk to memory … We can reduce input/output (IO) time by loading a much smaller compressed posting list, even when you add on the cost of decompression. So, in most cases, the retrieval system runs faster on compressed postings lists than on uncompressed postings lists.”
And later also:
“Choosing the optimal encoding for an inverted index is an ever-changing game for the system builder, because it is strongly dependent on underlying computer technologies and their relative speeds and sizes. Traditionally, CPUs were slow, and so highly compressed techniques were not optimal. Now CPUs are fast and disk is slow, so reducing disk postings list size dominates. However, if you’re running a search engine with everything in memory, the equation changes again.”
So, if you develop data-intensive applications, compression might be your friend when the disk IO is the bottleneck. This may change again with the arrival of solid state disks…
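If you want to check whether this applies to your own data, the two numbers that matter are how well the data compresses and how fast your CPU can compress it. Below is a minimal Python sketch that measures both in memory using the standard `zlib` module. The helper name `measure_compression` and the sample text are my own placeholders, and zlib is just a convenient stand-in: I do not know which compressor the search engine used internally.

```python
import time
import zlib


def measure_compression(chunks, level=1):
    """Compress each chunk in memory.

    Returns (ratio, throughput) where ratio is compressed/raw size and
    throughput is MB of raw input compressed per second.
    """
    raw_bytes = 0
    packed_bytes = 0
    start = time.perf_counter()
    for chunk in chunks:
        packed = zlib.compress(chunk, level)  # level 1 = fastest zlib setting
        raw_bytes += len(chunk)
        packed_bytes += len(packed)
    elapsed = time.perf_counter() - start
    if raw_bytes == 0:
        raise ValueError("no data to measure")
    ratio = packed_bytes / raw_bytes
    throughput = raw_bytes / (1024 * 1024) / elapsed if elapsed > 0 else float("inf")
    return ratio, throughput


# Placeholder data: highly repetitive text, so it compresses far better
# than real documents would. Substitute chunks of your own data here.
sample = [b"the quick brown fox jumps over the lazy dog " * 200 for _ in range(50)]
ratio, mb_per_s = measure_compression(sample)
print(f"ratio={ratio:.3f}, throughput={mb_per_s:.0f} MB/s")
```

Compare the measured throughput with the sustained write speed of your disk: if the compressor is much faster than the disk and the ratio is well below 1, compression is likely to help.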