Tuesday, June 10, 2014

Google’s Dremel Makes Big Data Look Small

Old news: http://www.wired.com/2012/08/googles-dremel-makes-big-data-look-small/


Mike Olson runs a company that specializes in the world’s hottest software. He’s the CEO of Cloudera, a Silicon Valley startup that deals in Hadoop, an open source software platform based on tech that turned Google into the most dominant force on the web.
Hadoop is expected to fuel an $813 million software market by the year 2016. But even Olson says it’s already old news.
Hadoop sprang from two research papers Google published in late 2003 and 2004. One described the Google File System, a way of storing massive amounts of data across thousands of dirt-cheap computer servers, and the other detailed MapReduce, which pooled the processing power inside all those servers and crunched all that data into something useful. Eight years later, Hadoop is widely used across the web, for data analysis and all sorts of other number-crunching tasks. But Google has moved on.
In 2009, the web giant started replacing GFS and MapReduce with new technologies, and Mike Olson will tell you that these technologies are where the world is going. “If you want to know what the large-scale, high-performance data processing infrastructure of the future looks like, my advice would be to read the Google research papers that are coming out right now,” Olson said during a recent panel discussion alongside Wired.
Since the rise of Hadoop, Google has published three particularly interesting papers on the infrastructure that underpins its massive web operation. One details Caffeine, the software platform that builds the index for Google’s web search engine. Another shows off Pregel, a “graph database” designed to map the relationships between vast amounts of online information. But the most intriguing paper is the one that describes a tool called Dremel.
“If you had told me beforehand what Dremel claims to do, I wouldn’t have believed you could build it,” says Armando Fox, a professor of computer science at the University of California, Berkeley, who specializes in these sorts of data-center-sized software platforms.
Dremel is a way of analyzing information. Running across thousands of servers, it lets you “query” large amounts of data, such as a collection of web documents or a library of digital books or even the data describing millions of spam messages. This is akin to analyzing a traditional database using SQL, the Structured Query Language that has been widely used across the software world for decades. If you have a collection of digital books, for instance, you could run an ad hoc query that gives you a list of all the authors — or a list of all the authors who cover a particular subject.
“You have a SQL-like language that makes it very easy to formulate ad hoc queries or recurring queries — and you don’t have to do any programming. You just type the query into a command line,” says Urs Hölzle, the man who oversees the Google infrastructure.
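To ground the example, here is a toy version of that ad hoc query in Python, using SQLite purely as a stand-in. The books table and its columns are invented for illustration; Dremel’s feat is running this kind of SQL-like statement not over a few rows on one machine but over petabytes spread across thousands of servers.

# A toy version of the "list the authors who cover a subject" query.
# SQLite is only a stand-in here; the point is the shape of the query.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (title TEXT, author TEXT, subject TEXT)")
conn.executemany(
    "INSERT INTO books VALUES (?, ?, ?)",
    [
        ("On Search", "A. Chu", "information retrieval"),
        ("Graphs at Scale", "B. Rao", "graph processing"),
        ("Index Everything", "A. Chu", "information retrieval"),
    ],
)

# The ad hoc query: every distinct author covering a given subject.
for (author,) in conn.execute(
    "SELECT DISTINCT author FROM books WHERE subject = ?",
    ("information retrieval",),
):
    print(author)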
The difference is that Dremel can handle web-sized amounts of data at blazing fast speed. According to Google’s paper, you can run queries on multiple petabytes — millions of gigabytes — in a matter of seconds.
Hadoop already provides tools for running SQL-like queries on large datasets. Sister projects such as Pig and Hive were built for this very reason. But with Hadoop, there’s lag time. It’s a “batch processing” platform. You give it a task. It takes a few minutes to run the task — or a few hours. And then you get the result. But Dremel was specifically designed for instant queries.
“Dremel can execute many queries over such data that would ordinarily require a sequence of MapReduce jobs, but at a fraction of the execution time,” reads Google’s Dremel paper. Hölzle says it can run a query on a petabyte of data in about three seconds.
According to Armando Fox, this is unprecedented. Hadoop is the centerpiece of the “Big Data” movement, a widespread effort to build tools that can analyze extremely large amounts of information. But with today’s Big Data tools, there’s often a drawback. You can’t quite analyze the data with the speed and precision you expect from traditional data analysis or “business intelligence” tools. But with Dremel, Fox says, you can.
“They managed to combine large-scale analytics with the ability to really drill down into the data, and they’ve done it in a way that I wouldn’t have thought was possible,” he says. “The size of the data and the speed with which you can comfortably explore the data is really impressive. People have done Big Data systems before, but before Dremel, no one had really done a system that was that big and that fast.
“Usually, you have to do one or the other. The more you do one, the more you have to give up on the other. But with Dremel, they did both.”
According to Google’s paper, the platform has been used inside Google since 2006, with “thousands” of Googlers using it to analyze everything from the software crash reports for various Google services to the behavior of disks inside the company’s data centers. Sometimes the tool is used with tens of servers, sometimes with thousands.
Despite Hadoop’s undoubted success, Cloudera’s Mike Olson says that the companies and developers who built the platform were rather slow off the blocks. And we’re seeing the same thing with Dremel. Google published the Dremel paper in 2010, but we’re still a long way from seeing the platform mimicked by developers outside the company. A team of Israeli engineers is building a clone they call OpenDremel, though one of these developers, David Gruzman, tells us that coding is only just beginning again after a long hiatus.
Mike Miller — an affiliate professor of particle physics at the University of Washington and the chief scientist of Cloudant, a company that’s tackling many of the same data problems Google has faced over the years — is amazed we haven’t seen some big-name venture capitalist fund a startup dedicated to reverse-engineering Dremel.
That said, you can use Dremel today — even if you’re not a Google engineer. Google now offers a Dremel web service it calls BigQuery. You can use the platform via an online API, or application programming interface. Basically, you upload your data to Google, and it lets you run queries on its internal infrastructure.
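For a sense of what that looks like from a developer’s seat, here is a minimal sketch using Google’s google-cloud-bigquery Python client. A few hedges: that client library postdates this article (BigQuery was originally driven through a REST API), and the my_project.library.books table is a hypothetical stand-in for data you would have uploaded yourself.

# A minimal BigQuery sketch: run a SQL query against an uploaded table
# and stream back the rows. Credentials and the default project are
# picked up from the environment.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
    SELECT author, COUNT(*) AS works
    FROM `my_project.library.books`
    GROUP BY author
    ORDER BY works DESC
    LIMIT 10
"""

for row in client.query(sql).result():  # blocks until the job finishes
    print(row.author, row.works)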
This is part of a growing number of cloud services offered by the company. First, it let you build, run, and host entire applications atop its infrastructure with a service called Google App Engine, and now it offers various other utilities that run atop this same infrastructure, including BigQuery and the Google Compute Engine, which serves up instant access to virtual servers.
The rest of the world may lag behind Google. But Google is bringing itself to the rest of the world.

Saturday, June 7, 2014

Does Knowledge Change Fate? A "Post-90s" Phoenix Man: It Is Hard for a Humble Home to Produce a Noble Son

http://www.wenxuecity.com/news/2014/06/07/3336884.html

Admitted from a village in southwest China to an elite university in Shanghai, a once cocky young man discovered that behind the appearance of "knowledge changes fate" lie many realities, and futures, beyond his control. Compared with classmates from middle- and upper-class families, he lacks the awareness and follow-through to plan his life. And their fates keep diverging.

The Arrogance of Youth

I am 24 this year. On the map of my life so far there have been two great geographic migrations: one to Shanghai for university, one to Beijing for work. Everything before that was confined to a not-quite-poor village in southwest China: home in the countryside, middle school in town, high school in the district seat.

The year I turned 12 and started middle school, my parents took me there themselves. Probably owing to my father's standing as the village doctor, and to my mother's attentive, easygoing manner, the town parents all politely complimented me. My mother replied, as courtesy required: "A child raised in the countryside still has some catching up to do with the town kids." At the time I dismissed those words completely.

Over the three years of middle school that followed, in grades, in sports and arts, in manners and bearing, I used my own record to "demolish" my mother's "fallacy." Even at the district high school I proved myself by coming from behind. If anything, it was the town classmates who left a poor impression on me.

Yet even in a village like ours, lower-middle or outright bottom rung, every child with good grades was warned that "school and society are not the same thing," while for the children with poor grades, many parents stopped insisting and instead worked their relatives, friends, and acquaintances to line up some other way out, what is colloquially called "pindie," competing on whose father has more pull.

Take the gaokao. The year I sat it, how high was Peking University's admission line in our region? Exactly one point below the raw score of that year's top scorer! To get in, you had to have bonus points. There were several kinds, merit points for special talents, compensatory points for ethnic minorities, and so on. On paper the policy was a good one, broadening the selection criteria and promoting fairness in admissions; in practice, with assessment standards and oversight falling short, one bonus category after another became a trophy for whoever "worked their connections" hardest. And even taking the talent route at face value: I was a farm kid. How was I supposed to cultivate a talent like the violin or the guzheng?

The upshot was that the talent-bonus policy left us out entirely.

Their Foresight

With effort and luck, my gaokao score turned out passable and I got into Fudan, to everyone's delight. At the celebration banquet every relative congratulated my parents; the whole family's fate, everyone agreed, had now been changed by knowledge.

But at university, and even more after stepping into the working world, my mother's words kept rising out of my memory. Only now the gap between "country kids" and "town kids" had become the distance between the diaosi, the have-nots, and the gaofushuai and baifumei, the rich, good-looking, and well-born. And "gap" no longer meant sheer individual effort; it had stretched to cover how much social capital you could mobilize behind you.

Four years of university were enough to make the place familiar, and enough stunning encounters to grind my old conceit into humility. I worked hard at improving myself, yet my greatest deficit was not my English pronunciation, my entry-level computer skills, or my thin repertoire of song and dance. The real gap was this: next to classmates born into the middle and upper classes, I lacked the awareness, and the follow-through, to plan a life.

For the 2010 World Expo, the university mobilized a large contingent of volunteers, and most of my department signed up. I didn't want to go: it was harvest season at home, and without me, the main field hand, my parents would have to do double the work. After much back and forth, my parents insisted I go; a chance to see an occasion that grand, they reasoned, was too rare to pass up. Meanwhile, classmates who had long since mapped out graduate school or study abroad were drawing up meticulous schedules around this rare volunteering credential. They knew it was a valuable chip for the next gate.

For them, planning reached into everything. My phone still holds one text message, the reply to the only time in university I told a girl how I felt: "It's rare to click with someone this well, and you are excellent. But I'm a local and my family's only daughter; I could never move to your hometown, still less wear myself out bridging the gap between our two families. I know the life I want. I wish you well!"

At the time I found it impossibly cold. Only after starting work did I slowly come to see the rationality and tact in it, and I can't help marveling: some people's every step in life stockpiles strength for the next advance, while others truly wait until the cart reaches the mountain before looking for a road, strolling along, glancing around. For someone barely in their twenties, awakening to that kind of planning, and sustaining it, depends on a family's nurture and foresight. Here the logic of knowledge changing fate gets tangled.

The Final Divergence

After graduation, everyone's paths split fast.

My dorm room was fairly typical. Of the four of us, two were Shanghai locals, one from an official's family, one from a businessman's; after graduation they went off to the United States and Britain for further study. A third came from an ordinary wage-earning family; his parents pulled strings to land him a job at a large state-owned enterprise back in his quasi-first-tier hometown. As for me, after repeated setbacks job hunting back home, on the eve of graduation I had no choice but to drift to Beijing.

I've now worked for two years at a public institution, where everyone knows something, openly or under the table, about the leaders' backgrounds, and the generational handover of class, circles, connections, and resources is an everyday sight. Lately one office leader has been busy arranging to send his child, a senior at Beijing No. 4 High School, to an American university, which sets me thinking about where my classmates from three eras, primary and middle school, high school, and university, have ended up:

Plainly, knowledge really does change some people's fate. Those of us who made it from the countryside into university will almost none of us farm again, and we are far better off than peers who left to do migrant work. But for us, city life will never be easy either: on the big life stakes, housing, career advancement, we are on our own, and the long march of settling a family in a big city and amassing comfortable social resources for the next generation is especially slow and hard.

Heading home this past Spring Festival, still on the long-distance bus on the morning of New Year's Eve, I ran into a primary school classmate. He works in Wenzhou, wife and kids in tow, and told me train tickets had been impossible to buy; he had transferred through several stations and tried every trick to make it back. Exhausted as he was, his face still shone with the joy of the reunion ahead.

We had little left to talk about. He flattered me that book learning was the real way out; I praised his adorable child. My stop came first, and after much hesitation I gave the child a hundred yuan as New Year money. He refused over and over, finally accepted, then tore open his bags to press on me the local specialties he had carried back from Wenzhou. As I watched him bustle about, hands and feet all going at once, the phone in my pocket started buzzing without pause: my university classmates' WeChat group, where at that very moment they were handing out red envelopes, posting holiday photos from the Maldives, and happily comparing year-end bonuses...

Firecrackers crackled on and on in my ears; the families who had eaten lunch early were already out tending the graves. Carrying a bag of Wenzhou specialties, I walked quickly toward home. Waiting there were expectant family and a steaming, lavish table, yet I knew this was a hometown I could no longer return to; and Beijing, a thousand miles away, where I live all year round, will for any future I can foresee remain, in my heart, a city I am only passing through.