
June 30, 2013 | Software Consultancy

Spring Data Hadoop – Objective Overview

Spring Data Hadoop (SDH) is an offshoot of the Spring project that allows Hadoop tasks to be configured and invoked within a Spring application context. It offers support for Hadoop jobs, HBase, Pig, Hive and Cascading, as well as JSR-223 scripting for job preparation and tidy-up.

It is best suited to organisations with existing Spring applications or an existing investment in Spring expertise. Some SDH features replicate the functionality of Hadoop ecosystem tools with which the DevOps engineers who maintain a Hadoop cluster will already be familiar.

 

Written by Daniel Jones

Those looking for dependency injection and inversion of control within MapReduce jobs will be disappointed to find that SDH offers no job-level API.

What Spring Data Hadoop offers

SDH offers benefits in four main areas:

  • Hadoop jobs as beans
  • Hadoop configuration via Spring
  • JSR-223 scripting to interact with HDFS
  • Integration with Hadoop tools (HBase, Pig, Hive and Cascading)

What Spring Data Hadoop is not

Spring Data Hadoop is an anomalous child of its umbrella project parent: SDH is not primarily a data-access technology. It does not offer a repository view of HDFS, and only one small component of SDH (HBaseTemplate, an analogue of JdbcTemplate, MongoTemplate et al.) bears any apparent resemblance to other Spring Data projects.

SDH does not attempt to bring the values of Spring, or the benefits of dependency injection and inversion of control, into MapReduce jobs themselves. There is no API to tap into when coding a job, meaning that working with MapReduce will feel as cumbersome as it always has.

Hadoop jobs as beans

SDH offers a means to represent Hadoop jobs as Spring beans within an application context. These can then be invoked at context start-up, triggered using standard Spring scheduling techniques, or called programmatically by other beans. Namespace configuration is also available for performing Hadoop jobs as part of a Spring Batch workflow.

What constitutes a Hadoop job as far as SDH is concerned? In line with its philosophy of integration, SDH can invoke jobs packaged in a variety of ways:

  • job-runner specifies mapper and reducer classes, input and output paths
  • tool-runner runs existing jobs written to the Tool API using the current classpath
  • jar-runner will execute a job packaged in an external standalone jar

There is enough flexibility in how SDH can invoke jobs that it should be easy to integrate with whatever legacy code one may have.
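
As a rough sketch, assuming the standard hdp XML namespace described in the SDH documentation (the mapper, reducer and path values below are hypothetical), a job and a runner that launches it at context start-up might be declared like this:

<!-- A MapReduce job declared as a bean; class and path names are placeholders -->
<hdp:job id="wordCountJob"
         mapper="com.example.WordCountMapper"
         reducer="com.example.WordCountReducer"
         input-path="/data/input"
         output-path="/data/output"/>

<!-- Invokes the job above when the application context starts -->
<hdp:job-runner id="wordCountRunner" job-ref="wordCountJob" run-at-startup="true"/>

The tool-runner and jar-runner elements follow the same pattern, pointing at a Tool implementation or a standalone jar respectively.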


Hadoop configuration via Spring

SDH will load existing Hadoop configuration from the classpath, as is standard (albeit sometimes problematic) behaviour in Hadoop applications.

Spring developers will be pleased to see that SDH will also allow Hadoop configuration to be specified as part of a Spring application’s configuration, meaning that invaluable tools like property placeholders and SpEL are available to exploit. This isn’t limited to XML configuration: Java-based configuration is available too. It’s worthwhile to note that configuration resident on Hadoop nodes in the cluster may conflict with the values specified in the Spring application.
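
For example, assuming SDH's hdp XML namespace, Hadoop properties and property placeholders can be declared inline within the configuration element, as sketched below: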

<hdp:configuration>
    mapreduce.job.reduces=10
    mapreduce.task.io.sort.factor=${props.hadoop.sortFactor}
</hdp:configuration>

SDH provides namespace configuration for Hadoop’s distributed cache, allowing files on HDFS to be replicated across nodes in the cluster.
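
A minimal sketch of that support, again assuming the hdp namespace (the file paths are hypothetical and attribute details may vary between SDH versions):

<!-- Distributes a supporting jar and a data file to the nodes that run the job -->
<hdp:cache create-symlink="true">
    <hdp:classpath value="/libs/lookup-support.jar" />
    <hdp:cache value="/data/lookup-table.txt" />
</hdp:cache>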


JSR-223 scripting to interact with HDFS

JSR-223 scripts allow set-up and tear-down tasks to be performed around jobs, expressed more fluently than they could be in Java code. SDH populates the scripting context with implicit variables for the Hadoop configuration, the HDFS shell, resource loaders and more.

Much of the work of interacting with Hadoop is not suited to the verbosity of a language like Java, and in the wild one will often see DevOps engineers prop up MapReduce jobs with shell scripts. The inclusion of JSR-223 scripting makes these book-end tasks easier to perform whilst still allowing the code to reside in a single artefact, in a language in which development teams are proficient.
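
A sketch of such a book-end task as an inline Groovy script bean, assuming the hdp namespace; fsh is the implicit HDFS shell variable SDH places in the scripting context, and the output path is hypothetical:

<!-- Removes any stale output directory before the job runs -->
<hdp:script id="clean-output" language="groovy" run-at-startup="true">
    if (fsh.test("/data/output")) {
        fsh.rmr("/data/output")
    }
</hdp:script>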


Integration with Hadoop tools (HBase, Hive, Pig, Cascading)

The component of SDH most in-line with expectations set by the rest of Spring Data is its support for HBase via HBaseTemplate. This offers a pattern of interaction familiar to Spring developers who have used other Spring Data projects, or even the JdbcTemplate from the core framework. This is a more direct form of access than HBase-JPA mapping technologies like DataNucleus and Kundera.
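
A wiring sketch, assuming the hdp namespace (the ZooKeeper host is hypothetical; note that the class itself is named HbaseTemplate):

<!-- HBase-specific configuration, registered by default as 'hbaseConfiguration' -->
<hdp:hbase-configuration zk-quorum="zookeeper-host" />

<!-- Callback- and row-mapper-based access to HBase, analogous to JdbcTemplate -->
<bean id="hbaseTemplate" class="org.springframework.data.hadoop.hbase.HbaseTemplate">
    <property name="configuration" ref="hbaseConfiguration" />
</bean>

The template can then be injected into application code and used in the same callback style as the other Spring *Template classes.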

Hive allows data in HDFS to be mapped onto the relational model, so it is most welcome that core Spring functionality already allows Hive to be exposed through a JdbcTemplate via Hive's JDBC driver. SDH expands on this by allowing a Hive server to be started and Hive Thrift clients to be created from the application context. Hive scripts can be declared as Spring beans and run accordingly.
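
A sketch of the namespace support, with hypothetical host, port and script location:

<!-- Starts an embedded Hive server on the given port -->
<hdp:hive-server port="10000" />

<!-- Creates Hive Thrift clients for the given host and port -->
<hdp:hive-client-factory host="hive-host" port="10000" />

<!-- Runs a Hive script when the context starts -->
<hdp:hive-runner run-at-startup="true">
    <hdp:script location="classpath:analysis.hql" />
</hdp:hive-runner>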

SDH allows the invocation of Pig scripts and Cascading workflows and, as with regular Hadoop jobs, offers them as Spring beans to the rest of the context. Developers often produce Hadoop jobs across a spectrum of agility, from ad-hoc Pig scripts to modularised, re-usable Cascading workflows. Support for these forms of MapReduce job enables the full lifecycle of Hadoop development to be managed through SDH.
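
A similar sketch for Pig, assuming the hdp namespace and a hypothetical script location:

<!-- Creates PigServer instances for the rest of the context -->
<hdp:pig-factory />

<!-- Runs a Pig script at context start-up -->
<hdp:pig-runner run-at-startup="true">
    <hdp:script location="classpath:top-urls.pig" />
</hdp:pig-runner>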

 
