Mike Olson runs a company that specializes in the world’s hottest software. He’s the CEO of Cloudera, a Silicon Valley startup that deals in Hadoop, an open source software platform based on tech that turned Google into the most dominant force on the web.
Hadoop is expected to fuel an $813 million software market by the year 2016. But even Olson says it’s already old news.
Hadoop sprang from two research papers Google published in late 2003 and 2004. One described the Google File System, a way of storing massive amounts of data across thousands of dirt-cheap computer servers, and the other detailed MapReduce, which pooled the processing power inside all those servers and crunched all that data into something useful. Eight years later, Hadoop is widely used across the web, for data analysis and all sorts of other number-crunching tasks. But Google has moved on.
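To give a rough flavor of the MapReduce model, here’s a minimal word-count sketch in Python. It runs both phases in a single process, whereas Google’s system spreads them across thousands of machines; none of the names below come from Google’s papers.

```python
from collections import defaultdict

# Toy illustration of the MapReduce model: count words across "documents".
# Google's real system shards this work across thousands of servers; here
# both phases run in one process purely to show the shape of the model.

def map_phase(documents):
    """Map step: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield word, 1

def reduce_phase(pairs):
    """Reduce step: sum the counts emitted for each distinct word."""
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

docs = ["the web is big", "the web is fast"]
print(reduce_phase(map_phase(docs)))
# {'the': 2, 'web': 2, 'is': 2, 'big': 1, 'fast': 1}
```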
In 2009, the web giant started replacing GFS and MapReduce with new technologies, and Mike Olson will tell you that these technologies are where the world is going. “If you want to know what the large-scale, high-performance data processing infrastructure of the future looks like, my advice would be to read the Google research papers that are coming out right now,” Olson said during a recent panel discussion alongside Wired.
Since the rise of Hadoop, Google has published three particularly interesting papers on the infrastructure that underpins its massive web operation. One details Caffeine, the software platform that builds the index for Google’s web search engine. Another shows off Pregel, a “graph database” designed to map the relationships between vast amounts of online information. But the most intriguing paper is the one that describes a tool called Dremel.
“If you had told me beforehand what Dremel claims to do, I wouldn’t have believed you could build it,” says Armando Fox, a professor of computer science at the University of California, Berkeley, who specializes in these sorts of data-center-sized software platforms.
Dremel is a way of analyzing information. Running across thousands of servers, it lets you “query” large amounts of data, such as a collection of web documents or a library of digital books or even the data describing millions of spam messages. This is akin to analyzing a traditional database using SQL, the Structured Query Language that has been widely used across the software world for decades. If you have a collection of digital books, for instance, you could run an ad hoc query that gives you a list of all the authors — or a list of all the authors who cover a particular subject.
“You have a SQL-like language that makes it very easy to formulate ad hoc queries or recurring queries — and you don’t have to do any programming. You just type the query into a command line,” says Urs Hölzle, the man who oversees the Google infrastructure.
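For illustration, here’s what such an ad hoc query might look like in practice. This is a minimal sketch using Google’s present-day google-cloud-bigquery Python client (an assumption of this example, not something described here), run against a hypothetical table of digital books.

```python
# Sketch of the kind of ad hoc query Hölzle describes. The client library
# and the project/dataset/table names are assumptions of this example.
from google.cloud import bigquery

client = bigquery.Client()  # assumes Google Cloud credentials are configured

sql = """
    SELECT DISTINCT author
    FROM `my_project.library.books`   -- hypothetical table of digital books
    WHERE subject = 'computer science'
"""

for row in client.query(sql).result():  # blocks until the query finishes
    print(row.author)
```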
The difference is that Dremel can handle web-sized amounts of data at blazing fast speed. According to Google’s paper, you can run queries on multiple petabytes — millions of gigabytes — in a matter of seconds.
Hadoop already provides tools for running SQL-like queries on large datasets. Sister projects such as Pig and Hive were built for this very reason. But with Hadoop, there’s lag time. It’s a “batch processing” platform. You give it a task. It takes a few minutes to run the task — or a few hours. And then you get the result. But Dremel was specifically designed for instant queries.
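For comparison, here’s a sketch of roughly the same query routed through Hive. It assumes a running HiveServer2 instance and the third-party PyHive library, neither of which is mentioned here; the point is that Hive compiles the statement into MapReduce jobs, which is where the minutes-long lag comes from.

```python
# Roughly the same query through Hive, Hadoop's SQL-like layer. Assumes a
# reachable HiveServer2 instance and the third-party PyHive library; the
# host name is a hypothetical placeholder. Hive turns the statement into a
# sequence of MapReduce jobs, so the call may take minutes, not seconds.
from pyhive import hive

conn = hive.Connection(host="hive.example.com", port=10000)
cursor = conn.cursor()
cursor.execute("SELECT DISTINCT author FROM books")  # launches MapReduce jobs
for (author,) in cursor.fetchall():
    print(author)
```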
“Dremel can execute many queries over such data that would ordinarily require a sequence of MapReduce jobs, but at a fraction of the execution time,” reads Google’s Dremel paper. Hölzle says it can run a query on a petabyte of data in about three seconds.
According to Armando Fox, this is unprecedented. Hadoop is the centerpiece of the “Big Data” movement, a widespread effort to build tools that can analyze extremely large amounts of information. But with today’s Big Data tools, there’s often a drawback. You can’t quite analyze the data with the speed and precision you expect from traditional data analysis or “business intelligence” tools. But with Dremel, Fox says, you can.
“They managed to combine large-scale analytics with the ability to really drill down into the data, and they’ve done it in a way that I wouldn’t have thought was possible,” he says. “The size of the data and the speed with which you can comfortably explore the data is really impressive. People have done Big Data systems before, but before Dremel, no one had really done a system that was that big and that fast.
“Usually, you have to do one or the other. The more you do one, the more you have to give up on the other. But with Dremel, they did both.”
According to Google’s paper, the platform has been used inside Google since 2006, with “thousands” of Googlers using it to analyze everything from the software crash reports for various Google services to the behavior of disks inside the company’s data centers. Sometimes the tool is used with tens of servers, sometimes with thousands.
Despite Hadoop’s undoubted success, Cloudera’s Mike Olson says that the companies and developers who built the platform were rather slow off the blocks. And we’re seeing the same thing with Dremel. Google published the Dremel paper in 2010, but we’re still a long way from seeing the platform mimicked by developers outside the company. A team of Israeli engineers is building a clone called OpenDremel, though one of these developers, David Gruzman, tells us that coding is only just beginning again after a long hiatus.
Mike Miller — an affiliate professor of particle physics at the University of Washington and the chief scientist of Cloudant, a company that’s tackling many of the same data problems Google has faced over the years — is amazed we haven’t seen some big-name venture capitalist fund a startup dedicated to reverse-engineering Dremel.
That said, you can use Dremel today — even if you’re not a Google engineer. Google now offers a Dremel web service it calls BigQuery. You can use the platform via an online API, or application programming interface. Basically, you upload your data to Google, and it lets you run queries on its internal infrastructure.
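Here’s a minimal sketch of that flow, again assuming the present-day google-cloud-bigquery Python client and hypothetical project, dataset, and file names: upload a file to Google, then query it on Google’s infrastructure.

```python
# Sketch of the BigQuery flow: upload your data to Google, then run ad hoc
# queries on Google's infrastructure. Client library and all names are
# assumptions of this example.
from google.cloud import bigquery

client = bigquery.Client(project="my_project")  # hypothetical project ID

# Step 1: upload a local CSV file into a BigQuery table.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    autodetect=True,  # let BigQuery infer the schema from the file
)
with open("books.csv", "rb") as f:
    load_job = client.load_table_from_file(
        f, "my_project.library.books", job_config=job_config
    )
load_job.result()  # wait for the upload to finish

# Step 2: run an ad hoc query against the uploaded table.
query_job = client.query("SELECT COUNT(*) AS n FROM `my_project.library.books`")
print(list(query_job.result())[0].n)
```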
This is part of a growing number of cloud services offered by the company. First, it let you build, run, and host entire applications atop its infrastructure using a service called Google App Engine, and now it offers various other utilities that run atop this same infrastructure, including BigQuery and the Google Compute Engine, which serves up instant access to virtual servers.
The rest of the world may lag behind Google. But Google is bringing itself to the rest of the world.