Robin Harris at StorageMojo links to several presentations from the Seattle Conference on Scalability hosted by Google. Barry Brumitt gives a talk on Using Map-Reduce with Large Geographic Datasets that looked interesting to me.
The best part comes right at the beginning of the presentation, where he talks about the need for index files to complement the raw data:
"Once we have index files, then we load those up onto servers, usually in RAM so that we can actually answer queries from users. It’s really not that useful to have things that are used even slightly frequently on disk because you are limited by the bandwidth to the disk and the number of seeks you can do. So you can do 100 maybe 1000 queries per second on a disk. If you are Google and answering the kind of queries we do, that just doesn’t do it."
Last year we noted another mention of Google's Storage Performance Gap, when Luiz Barroso outlined the problem at the Intel Developers Conference.
Of course, only some companies have to deal with that scale of queries. And only a select few have the ability to architect a compute infrastructure from the ground up the way Google has.
But many companies do suffer from I/O bottlenecks, and most have longstanding infrastructure in place that cannot simply be discarded for a clean-slate approach. In those cases, performance can be added more easily by deploying a scalable caching appliance: a centralized storage caching solution from Gear6.