New Blog Series: Spark - The Pragmatic Bits

Nicki Watt

April 25, 2017

Apache Spark is a powerful open source processing engine that is fast becoming our technology of choice for data analytics projects here at OpenCredo. For many years now we have been helping our clients to practically implement and take advantage of various big data technologies, including the likes of Apache Cassandra.

We very much see Apache Spark as a key complementary technology which, with its focus on speed and ease of use, is opening up even greater opportunities for gaining speedy insight and analysis into business critical data.

In order to share our experience in dealing with Spark, we have decided to put together a series of articles and a webinar.

These will explore how Spark can be used to address some of the common data processing challenges and pain points in a pragmatic and practical way.

The series starts off with Dávid Borsós's blog post, "Data Analytics using Cassandra and Spark". Cassandra is a highly performant database for storing large amounts of data and running the queries for which it has been optimized.

However, when it comes to trying to analyze and gain broader insight from the data captured, Cassandra can be cumbersome to work with, and may not be as performant and scalable as needed. This article demonstrates how you can practically combine Apache Spark with Apache Cassandra in order to better deal with such scenarios.

Dávid will then provide practical instructions in the second blog post of the series, "Deploy Spark with an Apache Cassandra cluster". This post will show how you can deploy the open source version of Apache Spark alongside an Apache Cassandra cluster. It also includes a programmable infrastructure code example.

The third part of this series is Matt Long's blog post, "Testing a Spark Application".

The ability to write and run ad hoc Spark queries is helpful for getting immediate insight into certain data problems, but what happens when these queries need to form part of a bigger software system? Matt takes you on a journey looking at how you may need to take existing Spark code (in fact the same demo code used in Dávid’s first article), and refactor it in order to make it more testable.

The conclusion of the series is a webinar which explores the use case of “Detecting stolen AWS credential usage with Spark”. This will focus on how Spark’s relatively new high-level Structured Streaming API (still only alpha) can be used to detect this scenario.