All the investment and innovation that’s occurred in the Big Data infrastructure space over the last decade will have gone for naught if data scientists and application developers can’t productionize analytic insights.
That’s why YARN, a sub-project of Apache Hadoop released last fall, is such a big deal. YARN enables developers to build applications on Hadoop that process data in multiple new ways beyond just batch processing.
Still, YARN, like HDFS and MapReduce before it, is simply an enabler. Developers still need to actually build Big Data applications. This requires toolsets that allow developers to integrate multiple data streams, to apply predictive models at scale, to create intuitive user interfaces and more. And even then, significant training is needed, especially for data scientists whose expertise is in analyzing data, not building applications for end-users.
So I’ve been encouraged by a handful of recent announcements from Hadoop ecosystem vendors aimed at lowering the barriers to successful Big Data application development.
Hadoop meets application development tooling
On the tooling side of the equation, Hortonworks recently expanded its partnership with Concurrent, which sells support services for the open source Cascading application development framework. When I spoke with the company last fall, Concurrent Founder and CTO Chris Wensel described Cascading as a Java library used by application developers to quickly create complex, data-oriented applications. Concurrent’s Cascading SDK abstracts away the complexity of dealing with things like MapReduce and Pig, allowing developers to integrate data sources via APIs and easily migrate predictive models into Hadoop. (Sample Cascading-based apps are available on GitHub.)
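To give a sense of the programming model, here’s a minimal sketch of a Cascading word-count flow, loosely patterned on the library’s canonical tutorial example. The class name, HDFS paths and field names are illustrative assumptions, not taken from Concurrent’s materials:

```java
// Hypothetical word-count flow using the Cascading 2.x API
// (paths and names are illustrative placeholders).
import java.util.Properties;

import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.hadoop.TextLine;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class WordCountFlow {
  public static void main(String[] args) {
    // Source and sink taps bind the flow to HDFS locations (placeholder paths)
    Tap docTap = new Hfs(new TextLine(new Fields("line")), "hdfs:///data/docs");
    Tap wcTap  = new Hfs(new TextLine(), "hdfs:///data/wordcounts", SinkMode.REPLACE);

    // Describe the data flow: split each line into words, group by word, count
    Pipe wcPipe = new Pipe("wordcount");
    wcPipe = new Each(wcPipe, new Fields("line"),
        new RegexGenerator(new Fields("word"), "\\S+"));
    wcPipe = new GroupBy(wcPipe, new Fields("word"));
    wcPipe = new Every(wcPipe, new Count(new Fields("count")));

    FlowDef flowDef = FlowDef.flowDef()
        .setName("wordcount")
        .addSource(wcPipe, docTap)
        .addTailSink(wcPipe, wcTap);

    // The connector plans the declared flow into MapReduce jobs and runs it
    new HadoopFlowConnector(new Properties()).connect(flowDef).complete();
  }
}
```

The developer works in terms of taps, pipes and operations; the framework’s planner translates that description into the underlying MapReduce jobs, which is exactly the kind of plumbing Wensel says Cascading is meant to hide.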
As part of the expanded partnership, Hortonworks said it will ensure ongoing compatibility of Cascading-based apps with the Hortonworks Data Platform and will provide level 1 and level 2 Cascading support for customers (Concurrent will still handle level 3 support). This compatibility includes the ability to execute Cascading-based apps on Apache Tez, a recently developed Hadoop-based execution engine for real-time Big Data workloads. While Concurrent itself is still in its early days, open source Cascading is quite popular with application developers, garnering over 90,000 downloads per month.
Training needed to make the most of Hadoop
Even with better tooling, application developers need to learn new skills in order to build enterprise-grade Big Data apps. This requires training, a cause Cloudera has taken on as its own. At its analyst day event in March, I learned from CEO Tom Reilly that Cloudera has trained over 50,000 practitioners on Hadoop since the company’s founding in 2009. But its training efforts really took off in December when it formed a partnership with Udacity, a provider of MOOCs focused on computer and data science. Since then, Cloudera has trained over 30,000 practitioners. Cloudera estimates its educational services, overseen by Sarah Sproehnle, have trained over 80% of all practitioners who have taken some form of Hadoop training.
Earlier this month Cloudera announced a new training service to provide developers with hands-on training in building Big Data applications on Hadoop. Cloudera says the purpose of the four-day course is to prepare “data professionals to use an EDH’s full capabilities to build custom, converged applications that enable their organizations to achieve greater value from data and solve real-world problems.”
EDH refers to Cloudera’s Enterprise Data Hub, which layers multiple data processing engines, including Cloudera Impala and Cloudera Search, on top of its core Hadoop distribution. While the ability to process data in multiple ways on a single platform is a positive in the abstract, it means Hadoop application developers must be fluent in a number of data processing approaches. Cloudera’s new training course is designed to help developers learn these skills so they can take advantage of Hadoop’s new multi-engine data processing capabilities.
A new, hybrid role emerging
Both announcements are good signs. Together, they signal to me that Hadoop is starting to move beyond the “feeds-and-speeds” stage of its life-cycle and into a new stage focused on business value. What good is all that data you can now store, process and analyze relatively inexpensively in Hadoop if you can’t surface actionable insights to the end users who are responsible for moving the business forward? Not much. That’s why the development of Big Data applications is so critical.
But tools and tools-related training are just two legs of the stool. There are softer skills data scientists and application developers need to learn, and a not-insignificant change-management challenge lurking in the background.
Namely, as enterprise applications become more data-centric, the roles of data scientist and application developer are merging. In the short term, this means the two roles must learn to collaborate more effectively, and both must adopt new ways of thinking. For data scientists, this means starting to think more about how the insights they uncover can be translated into repeatable form factors consumable by end users. And application developers need to gain a better understanding of data flows and how analytic requirements impact application performance.
CIOs, too, have a job to do in facilitating this transition. They should take steps now to enable and encourage collaboration between data scientists and application developers, and to help each role understand the challenges of the other. This may require developing new incentives and new ways of measuring (and rewarding) outcomes that focus on the mutual success of these two previously siloed roles.
In the long term, this gradual merging of roles may result in an entirely new role: a hybrid data scientist-application developer. There are a few already out there (though they may not think of themselves in these terms), but they are rare. I don’t know exactly what we might call this role, but if you think data scientists are valuable and hard to find today, just think what the demand for this hybrid data scientist-application developer will be in the years to come.