Recently, the sad news has emerged that Basho, which developed the Riak distributed database, has gone into receivership. This would appear to present a problem for those who have adopted the commercial version of the Riak database (Riak KV) supported by Basho.
Data analytics isn’t a field commonly associated with testing, but there’s no reason we can’t treat it like any other application. Data analytics services are often deployed in production, and production services should be properly tested. This post covers some basic approaches for the testing of Cassandra/Spark code. There will be some code examples, but the focus is on how to structure your code to ensure it is testable!
My recent blogpost I explored a few cases where using Cassandra and Spark together can be useful. My focus was on the functional behaviour of such a stack and what you need to do as a developer to interact with it. However, it did not describe any details about the infrastructure setup that is capable of running such Spark code or any deployment considerations. In this post, I will explore this in more detail and show some practical advice in how to deploy Spark and Apache Cassandra.
Apache Spark is a powerful open source processing engine which is fast becoming our technology of choice for data analytic projects here at OpenCredo. For many years now we have been helping our clients to practically implement and take advantage of various big data technologies including the like of Apache Cassandra amongst others.
In recent years, Cassandra has become one of the most widely used NoSQL databases: many of our clients use Cassandra for a variety of different purposes. This is no accident as it is a great datastore with nice scalability and performance characteristics.
However, adopting Cassandra as a single, one size fits all database has several downsides. The partitioned/distributed data storage model makes it difficult (and often very inefficient) to do certain types of queries or data analytics that are much more straightforward in a relational database.
One of the default Cassandra strategies to deal with more sophisticated queries is to create CQL tables that contain the data in a structure that matches the query itself (denormalization). Cassandra 3.0 introduces a new CQL feature, Materialized Views which captures this concept as a first-class construct.
One of the simplest and best-understood models of computation is the Finite State Machine (FSM). An FSM has fixed range of states it can be in, and is always in one of these states. When an input arrives, this triggers a transition in the FSM from its current state to the next state. There may be several possible transitions to several different states, and which transition is chosen depends on the input.
In the culmination of our blog series on the topic, on October 6th 2016 OpenCredo Consultants Dominic Fox, Alla Babkina and Guy Richardson, and hosted by Marco Cullen, presented the common design and implementation issues that they have come across in real-world Apache Cassandra deployments.
If there is one thing to understand about Cassandra, it is the fact that it is optimised for writes. In Cassandra everything is a write including logical deletion of data which results in tombstones – special deletion records. We have noticed that lack of understanding of tombstones is often the root cause of production issues our clients experience with Cassandra. We have decided to share a compilation of the most common problems with Cassandra tombstones and some practical advice on solving them.
Cassandra isn’t a relational database management system, but it has some features that make it look a bit like one. Chief among these is CQL, a query language with an SQL-like syntax. CQL isn’t a bad thing in itself – in fact it’s very convenient – but it can be misleading since it gives developers the illusion that they are working with a familiar data model, when things are really very different under the hood.
A growing number of clients are asking OpenCredo for help with using Apache Cassandra and solving specific problems they encounter. Clients have different use cases, requirements, implementation and teams but experience similar issues. We have noticed that Cassandra data modelling problems are the most consistent cause of Cassandra failing to meet their expectations. Data modelling is one of the most complex areas of using Cassandra and has many considerations.
At OpenCredo we have been working with Cassandra since 2012 and we are big fans of both open source Apache Cassandra and the capabilities of DataStax Enterprise. Over the years we have collected a great deal of experience throughout the company on how to deliver the benefits of Cassandra in real world projects and have also seen some common pitfalls that businesses have fallen into.
At OpenCredo we are seeing an increase in adoption of Apache Cassandra as a leading NoSQL database for managing large data volumes, but we have also seen many clients experiencing difficulty converting their high expectations into operational Cassandra performance. Here we present a high-level technical overview of the major strengths and limitations of Cassandra that we have observed over the last few years while helping our clients resolve the real-world issues that they have experienced.
Perhaps the most important of Cassandra’s selling points is its completely distributed architecture and its ability to easily extend the cluster with virtually any number of nodes. Implementing a classical RDBMS-style transaction consisting of “put locks on the database, modify the data, then commit the transaction”-style operations are simply not feasible in such an architecture (i.e. that doesn’t scale well).
Cassandra 2.0 was released in early September this year and came with some interesting new features, including “lightweight transactions” and triggers.
Despite the rising interest in the various non-relational databases in recent years, there are still numerous use-cases for which a relational database system is a better choice. The latest major release of Cassandra (version 2.0) provides some interesting features that aim to close this gap, and offers its fast and distributed storage engine enhanced with new options that will make users’ lives easier.
Strictly Necessary Cookies
Strictly Necessary Cookie should be enabled at all times so that we can save your preferences for cookie settings.
If you disable this cookie, we will not be able to save your preferences. This means that every time you visit this website you will need to enable or disable cookies again.