July 4, 2013 | Software Consultancy

Spring Data Hadoop – Contextual Analysis

In which situations can Spring Data Hadoop (SDH) add value, and in which would it be a poor choice? This article follows on from an objective summary of SDH's features.

WRITTEN BY

Daniel Jones


The Problem with Spring Data Hadoop

SDH, following the mantra of all Spring projects, attempts to subsume Hadoop activities into the set of Spring conventions; users of the wider Hadoop ecosystem, however, will have quite different expectations.

Since the Hadoop activities an organisation engages in will be a superset of those SDH supports, in all but the most trivial use cases a division along lines of team responsibility will inevitably occur.

Who should use Spring Data Hadoop?

Criteria that would suggest a good fit for the use of SDH include:

  • Spring's interaction with Hadoop is either minimal or all-encompassing
  • Existing Spring applications that need to invoke Hadoop-related tasks
  • Spring applications that interact with HBase (see the sketch below)
  • Applications that are to be maintained by staff with more Spring than Hadoop knowledge
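
To make the HBase case concrete, the sketch below shows a Spring bean reading rows through SDH's HbaseTemplate. It is a minimal sketch rather than a recipe: the table, column family and qualifier names ("users", "cf", "name") are illustrative, and the exact find/RowMapper signatures may differ between SDH releases.

```java
// Minimal sketch: reading HBase rows via Spring Data Hadoop's HbaseTemplate.
// Table and column names are illustrative; API details may vary by SDH version.
import java.util.List;

import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;
import org.springframework.data.hadoop.hbase.HbaseTemplate;
import org.springframework.data.hadoop.hbase.RowMapper;

public class UserRepository {

    private final HbaseTemplate hbaseTemplate;

    // The template would typically be declared as a Spring bean wired with
    // an HBase Configuration (e.g. via SDH's XML namespace) and injected here.
    public UserRepository(HbaseTemplate hbaseTemplate) {
        this.hbaseTemplate = hbaseTemplate;
    }

    /** Scans the "cf" family of the "users" table and extracts each row's "name" cell. */
    public List<String> findAllNames() {
        return hbaseTemplate.find("users", "cf", new RowMapper<String>() {
            @Override
            public String mapRow(Result result, int rowNum) throws Exception {
                return Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("name")));
            }
        });
    }
}
```

The appeal for Spring developers is plain: HBase access looks and feels like any other Spring data-access template, with no HBase connection management in sight.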

Who should not use Spring Data Hadoop?

SDH would be a poor choice if its full feature set were adopted in an environment that has dedicated Hadoop operations staff, or that runs a Hadoop distribution bundling the full suite of ecosystem tools.

SDH addresses several challenges for which solutions already exist in the Hadoop ecosystem. Moving workflow management out of the realm of dedicated Hadoop solutions and into bespoke Spring applications should be the preserve of those organisations with Hadoop operations small enough to have them under the remit of staff fluent in Spring/Java.
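
To make "bespoke Spring applications" concrete, here is a minimal sketch of a Hadoop job wrapped as a step in a Spring Batch workflow. SDH ships dedicated Spring Batch tasklets for Hadoop, Hive and Pig work; a plain Tasklet is used here to keep the sketch self-contained, and the bean and job names are illustrative.

```java
// Minimal sketch: a Hadoop job as one step of a Spring Batch workflow.
// Bean names are illustrative; SDH also provides dedicated tasklets for this.
import org.apache.hadoop.mapreduce.Job;

import org.springframework.batch.core.Step;
import org.springframework.batch.core.StepContribution;
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.core.scope.context.ChunkContext;
import org.springframework.batch.core.step.tasklet.Tasklet;
import org.springframework.batch.repeat.RepeatStatus;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
@EnableBatchProcessing
public class AnalysisWorkflow {

    // Wraps a pre-configured Hadoop job (a bean defined elsewhere) as a
    // batch step, so the workflow is expressed entirely in Spring terms.
    @Bean
    public Step hadoopStep(StepBuilderFactory steps, final Job wordCountJob) {
        return steps.get("hadoopStep").tasklet(new Tasklet() {
            @Override
            public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) throws Exception {
                wordCountJob.waitForCompletion(true);
                return RepeatStatus.FINISHED;
            }
        }).build();
    }

    @Bean
    public org.springframework.batch.core.Job analysisJob(JobBuilderFactory jobs, Step hadoopStep) {
        return jobs.get("analysisJob").start(hadoopStep).build();
    }
}
```

This is exactly the trade-off at issue: such a workflow is visible to anyone fluent in Spring Batch, and invisible to anyone who only knows Oozie.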

To draw an analogy: using the full spectrum of SDH's job invocation and scheduling would be a little like having a Spring application rotate its own logs, a task normally handled by automation that system administrators provide.

Compromises to consider

Organisational responsibility

An organisation running its own Hadoop cluster will sooner or later face compromises in which the more flexible option is to forgo SDH in favour of dedicated Hadoop solutions.

A central tenet of SDH is to take the operation of Hadoop jobs and ancillary tasks and ground them firmly in the Java/Spring realm. This benefits Spring developers looking to avoid learning Hadoop's operational idiosyncrasies, but potentially at the cost of DevOps engineers' visibility into how jobs are scheduled and configured.
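
What that grounding looks like in practice is sketched below: a Hadoop job triggered from Spring's own scheduling support. The cron expression and the "wordCountJob" bean name are illustrative; the sketch assumes @EnableScheduling is active and that the job bean is prototype-scoped, since a Hadoop Job instance can only be submitted once.

```java
// Minimal sketch: Hadoop job scheduling moved into application code.
// Assumes @EnableScheduling is active and that "wordCountJob" is a
// prototype-scoped Hadoop Job bean defined elsewhere (e.g. via SDH's
// XML namespace).
import org.apache.hadoop.mapreduce.Job;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.context.ApplicationContext;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class NightlyAggregation {

    @Autowired
    private ApplicationContext context;

    // Runs at 02:00 every night. Note the trade-off: the schedule now lives
    // in Java code, where operations staff used to cron or Oozie won't see it.
    @Scheduled(cron = "0 0 2 * * *")
    public void runNightlyJob() throws Exception {
        // A Hadoop Job can only be submitted once, so fetch a fresh
        // prototype-scoped instance for each scheduled run.
        Job job = context.getBean("wordCountJob", Job.class);
        job.waitForCompletion(true);
    }
}
```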

The Hadoop ecosystem changes rapidly and is used in many areas where Spring is not, so it is reasonable to assume that the myriad tools will not coalesce around Spring conventions. It is also true that self-hosting a Hadoop cluster is a non-trivial undertaking, and that the specialist staff required would likely be familiar with dedicated Hadoop solutions to the challenges SDH addresses.

An organisation acting as client to a managed Hadoop cluster would likely benefit from SDH's scheduling and workflow features. It is not hard to imagine a Spring/Java team with little operational Hadoop experience prototyping against a naive Hadoop installation, or perhaps iterating on non-critical functionality against a cluster maintained by another department. In such an example the mission-critical jobs would be managed by a traditional Hadoop scheduler and maintained by Hadoop operations staff.
