Big Data Integration

James

Hadoop uber-alles? YARN unwinds the grip of MapReduce uber-alles

IBM Big Data Evangelist. Senior Program Director, Product Mktg, Big Data Analytics. Editor-in-Chief, IBM Data Magazine

Hadoop has always been a catch-all for disparate open-source initiatives that combine into a more or less unified big-data architecture. Some would claim that Hadoop has always been, at its very heart, simply a distributed file system (HDFS), but the range of HDFS-alternative databases, including HBase and Cassandra, undermines that assertion.

Until recently, Hadoop has been, down deep, a specific job-execution layer--MapReduce--that executes on one or more alternative, massively parallel data-persistence layers, one of which happens to be HDFS. But the recent introduction of the next-generation execution layer for Hadoop--known as YARN (Yet Another Resource Negotiator)--eliminates the strict dependency of Hadoop environments on MapReduce. Just as critical, YARN eliminates a job-execution bottleneck that has bedeviled MapReduce from the start: the fact that all MapReduce jobs (pre-YARN) have had to run as batch processes through a single daemon (JobTracker), a constraint that limits scalability and dampens processing speed. These MapReduce constraints have spurred many vendors to implement their own speedups, such as IBM's Adaptive MapReduce, to get around the bottleneck of native MapReduce.

All of which makes one wonder what, specifically, "Hadoop" means anymore, in terms of an identifiable "stack" that is distinct from other big data and analytics platforms and tools. But that's just a definitional quibble, because YARN is a foundational component of the evolving big-data mosaic. YARN puts traditional Hadoop into a larger context of composable, fit-to-purpose platforms for processing the full gamut of data management, analytics, and transactional computing jobs.

YARN transforms Hadoop (however defined) into a general-purpose distributed job-execution layer of the sort that the open-source initiative's original definition (still on the Apache website) alludes to. Though it retains backward compatibility with the MapReduce API and continues to execute MapReduce jobs, a YARN engine is capable of executing a wide range of jobs that were developed in other languages and for processing models other than MapReduce.

Just as important, YARN can become a unifying thread for diverse Apache open-source initiatives around big data. As this recent InfoWorld article noted (http://ow.ly/pTpjU): "The biggest win of all here is how MapReduce itself becomes just one possible way of many to mine data through Hadoop."

That's the YARN promise, but seeing it realized requires that the industry retool its Hadoop stacks and tools to work with it. Per the article, "Apache claims that any distributed application can run on YARN, albeit with some porting. To that end, Apache's maintained a list of YARN-compatible applications, such as the social-graph analysis system Apache Giraph (which Facebook uses). More are on the way from other parties, too."

This is good, but notice that disclaimer: "albeit with some porting." The article notes that YARN's true test will be in the extent to which vendors port their analytic development tools to output jobs that are conformant with YARN. As the author states, porting development languages to YARN "isn't a trivial effort."

Whether, and to what extent, this takes place consistently throughout the industry and across diverse Apache and other open-source communities will determine the extent to which YARN, the defining feature of what some call "Hadoop 2.0," truly takes hold.

  • 10 months ago