February 23, 2007

Has Google hit 10 petabytes yet?

I’ve been musing about how big Google’s core database might be. Figuring that out is not a trivial problem, unless they’ve published the answer somewhere that I’m not aware of. But here’s a big clue, from an announcement about their n-gram data:

We processed 1,024,908,267,229 words of running text

My guess is that this number bears a close relationship to the total amount of text that Google had on disk in spring 2006, probably including Gmail and the web-search cache alike, unless some confidentiality issue keeps Gmail out of the analysis. (Note: These n-grams aren’t letter combinations; they’re phrases. So Google published a list of all the 5-word phrases it has stored 40 or more times each. Hence my thoughts about confidentiality.) If so, to a first approximation we’re talking multiple petabytes of raw data, although that would be heavily adjusted (both up and down) by factors such as compression, mirroring, and the storage of related information such as HTML markup instructions.
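Here is a rough sketch of the arithmetic behind that kind of estimate. The average of 6 bytes per word (letters plus a trailing space) is an assumption for illustration, not a figure from the announcement, and the function names are hypothetical. Under that assumption the word count by itself comes to a few terabytes of plain text, so any petabyte-scale total has to come from the adjustment factors (markup, mirroring, non-text data) rather than from the words alone.

```python
# Back-of-envelope: how many bytes is the quoted word count as plain text?

WORDS = 1_024_908_267_229  # word count quoted in the n-gram announcement

# ASSUMPTION: average bytes per word, including one separator character.
# This is an illustrative guess, not a number published by Google.
BYTES_PER_WORD = 6

TB = 10**12  # decimal terabyte
PB = 10**15  # decimal petabyte


def raw_text_bytes(words: int, bytes_per_word: float) -> float:
    """Size of the bare text, ignoring markup, compression, and copies."""
    return words * bytes_per_word


def multiplier_to_reach(target_bytes: float, base_bytes: float) -> float:
    """How many times larger than the bare text the stored data must be."""
    return target_bytes / base_bytes


raw = raw_text_bytes(WORDS, BYTES_PER_WORD)
print(f"bare text: {raw / TB:.1f} TB")

for target_pb in (1, 5, 10):
    factor = multiplier_to_reach(target_pb * PB, raw)
    print(f"to reach {target_pb} PB, stored data must be ~{factor:,.0f}x the bare text")
```

Changing `BYTES_PER_WORD` to any plausible value between 5 and 10 moves the result by less than a factor of two, so the conclusion is not very sensitive to that guess.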

An alternative approach, estimating top-down from the hardware, gets nowhere. Google is known to have many hundreds of thousands of processing nodes, obviously with a lot of redundancy. But how much disk is attached to each? Less than 100 gigabytes? More than a terabyte? Is that information publicly known?
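To see why the hardware route is so inconclusive, here is a minimal sketch of the calculation. Every input is an assumption chosen only to span the uncertainty described above: the node counts, the per-node disk sizes, and the replication factor of 3 are all illustrative, and none is a published Google figure. The sketch also assumes the disks are full, which overstates the data held.

```python
# Top-down estimate: nodes x disk per node, divided by redundancy.

# ASSUMPTIONS (illustrative only, not published figures):
NODE_COUNTS = (300_000, 900_000)  # "many hundreds of thousands" of nodes
DISK_PER_NODE_GB = (100, 1_000)   # somewhere between 100 GB and 1 TB each
REPLICATION = 3                   # hypothetical number of copies of each item

GB_PER_PB = 1_000_000  # decimal units


def unique_petabytes(nodes: int, disk_gb: float, replication: int) -> float:
    """Unique data in PB if every disk were full and data were replicated."""
    return nodes * disk_gb / replication / GB_PER_PB


low = unique_petabytes(NODE_COUNTS[0], DISK_PER_NODE_GB[0], REPLICATION)
high = unique_petabytes(NODE_COUNTS[1], DISK_PER_NODE_GB[1], REPLICATION)

print(f"low end:  {low:.0f} PB")
print(f"high end: {high:.0f} PB")
print(f"spread:   {high / low:.0f}x")
```

With inputs this loose, the low and high ends differ by more than an order of magnitude, which is why this method cannot pin down the size without real per-node figures.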

I think the only thing we can say for sure about Google’s database is that the raw data is a Saganesque “petabytes and petabytes,” probably in the range of 5-10 petabytes total. But if somebody has more accurate information or analysis, I’d love for you to share.
