Google spans entire planet with GPS-powered database

September 20, 2012
599px-The_Earth_seen_from_Apollo_17

(Credit: NASA)

Wired Enterprise reports that Google has published a research paper (open access) detailing Spanner, which Google says is the first database that can quickly store and retrieve information across a worldwide network of data centers while keeping that information “consistent” — meaning all users see the same collection of information at all times.

Spanner borrows techniques from some of the other massive software platforms Google built for its data centers, but at its heart is something completely new. Spanner plugs into a network of servers equipped with super-precise atomic clocks and GPS antennas, using these time keepers to more accurately synchronize the distribution of data across such a vast network.

“If you want to know what the large-scale, high-performance data processing infrastructure of the future looks like, my advice would be to read the Google research papers that are coming out right now,” Mike Olson, the CEO of Hadoop specialist Cloudera, said at recent event in Silicon Valley.

Facebook is already building a system that’s somewhat similar to Spanner, in that it aims to juggle information across multiple data centers. Judging from our discussions with Facebook about this system — known as Prism — it’s quite different from Google’s creation.

The genius of the platform lies in something Google calls the TrueTime API. API is short for application programming interface, but in this case, Google is referring to a central data feed that its servers plug into. Basically, TrueTime uses those GPS antennas and atomic clocks to get Google’s entire network running in lock step.

To understand TrueTime, you have to understand the limits of existing databases. Today, there are many databases designed to store data across thousands of servers. Most were inspired either by Google’s BigTable database or a similar storage system built by Amazon known as Dynamo. They work well enough, but they aren’t designed to juggle information across multiple data centers — at least not in a way that keeps the information consistent at all times.

According to Andy Gross — the principal architect at Basho, whose Riak database is based on Amazon Dynamo — the problem is that servers must constantly communicate to ensure they correctly store and retrieve data, and all this back-and-forth ends up bogging down the system if you spread it across multiple geographic locations. “You have to a do a whole lot of communication to decide the correct order for all the transactions,” Gross says, “and the latencies you get are typically prohibitive for a fast database.”