Tuesday, September 3, 2013

Implementing Cassandra at Kenshoo - Lessons Learned

We had a problem...

Our tracking system at Kenshoo got too big.
Too much data.
Spread over many isolated MySQL servers.
So around two years ago we turned to NoSQL.

How to choose a NoSQL solution?

Out of the many options out there, we initially considered three: MongoDB, Hadoop (HBase) and Cassandra.

We finally picked Cassandra, because:
  1. we liked its clean, symmetric design.
  2. our benchmarks showed it seemed to be good at writes (tracking systems are generally very write-intensive).
  3. it was easy to set up (contra Hadoop).
  4. it had a more appropriate data model (contra MongoDB).
  5. it had a growing, active community (they all do).


Lesson #1: Cassandra is very good at writes, but the benchmarks we did were not helpful. Even if you get to run your tests on production-like scenarios, machines and network topology (we didn’t), the actual performance you eventually see in Cassandra is extremely sensitive to server configuration, real-life workloads, and internal operations (compaction, read-repair etc.) done by Cassandra. There are benchmarks out there showing that Cassandra performs very well compared with other leading NoSQL solutions (it outperforms them on writes, is comparable on reads, and scales linearly with the number of nodes). But we suggest taking even these (presumably) more carefully crafted benchmarks with more than a grain of salt.

Lesson #2: NoSQL frameworks are “fast-moving targets”, currently evolving very rapidly, copying features from one another, closing some of the gaps with traditional SQL, and so on. Since implementing them over existing projects usually takes quite some time (over a year in our case), by the time you finish some of the original considerations may no longer be valid. There’s not much one can do here, aside from maybe the obvious advice to look at these projects’ road-maps and hope they do reflect their future trajectory.

To rewrite or not to rewrite?

The next issue we faced was just how much rewriting of the existing tracking system we should do. Our existing code was complicated, tightly coupling business logic, database access, configuration and customizations. The temptation to simply “start afresh” with a new system was strong, though that would mean continuous merges of any bug-fixes/new-features from the “MySQL branch” to the “Cassandra branch”. We eventually chose the other path - gradual evolution from one system to the next. This meant first of all adding a lot of missing tests - unit, end-to-end etc. - to gain confidence in the change, then abstracting the repository and database-access layers, implementing the new Cassandra-facing repository, and finally being able to work in a “composite” mode against the two very different systems - MySQL and Cassandra.
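To give a feel for what that abstraction looked like, here is a simplified sketch in Java (the class and method names are invented for this post, not our actual code): a single repository interface, with the “composite” implementation writing to both stores and reading from whichever one is configured.

import java.util.List;

// Hypothetical names, for illustration only - not our actual code.

// A minimal event value object.
class Event {
    final long eventTime;
    final String type;
    Event(long eventTime, String type) { this.eventTime = eventTime; this.type = type; }
}

// A narrow repository interface hiding which store sits behind it;
// there is one implementation per store (MySQL-backed and Cassandra-backed).
interface EventRepository {
    void save(String userId, Event event);
    List<Event> findByUser(String userId);
}

// "Composite" mode: write to both stores, read from whichever is configured.
class CompositeEventRepository implements EventRepository {
    private final EventRepository mysql;
    private final EventRepository cassandra;
    private final boolean readFromCassandra;

    CompositeEventRepository(EventRepository mysql, EventRepository cassandra,
                             boolean readFromCassandra) {
        this.mysql = mysql;
        this.cassandra = cassandra;
        this.readFromCassandra = readFromCassandra;
    }

    public void save(String userId, Event event) {
        mysql.save(userId, event);      // keep the old store up to date...
        cassandra.save(userId, event);  // ...while populating the new one
    }

    public List<Event> findByUser(String userId) {
        return readFromCassandra ? cassandra.findByUser(userId)
                                 : mysql.findByUser(userId);
    }
}

The nice property of this shape is that moving read traffic between the two stores becomes a configuration change rather than a code change.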

Lesson #3: although not a universal maxim, it seems that overcoming the temptations of rewrites usually pays off.

Data model

We store most of our tracking data in a single “events” column-family:

Events column-family:
  • Rows are indexed by user-id
  • Column names represent single events
  • Column values are JSONs like so:

{
  "event-time": 1367515779,
  "type": "click",
  "device": "mobile",
  ...
}
Since we occasionally need to look up data that’s not user-specific, we also created an additional “index-lookup” column-family. This data model allows us to serve the most time-critical queries (as opposed to plain “get all events for a given user” queries) in the fastest manner. Alas, we did not at first duplicate the data appropriately in the indexes, so each index lookup required two round-trips to the cluster: 1) find the index entry, then 2) fetch the additional data needed for the event.
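To illustrate the duplication point, here is a toy Java sketch that uses in-memory maps as stand-ins for the two column families (the click-id index key is just an invented example). The important bit is that the index stores the event payload itself, not merely a reference that forces a second read.

import java.util.HashMap;
import java.util.Map;

// Toy illustration only - maps standing in for the two column families.
public class IndexDuplicationSketch {

    // events CF: row key = user-id, column name = event-id, value = event JSON
    private final Map<String, Map<String, String>> events =
            new HashMap<String, Map<String, String>>();

    // index-lookup CF: row key = the indexed value (say, a click-id),
    // value = the duplicated event JSON itself - not just a pointer to it
    private final Map<String, String> indexLookup = new HashMap<String, String>();

    public void saveEvent(String userId, String eventId, String clickId, String eventJson) {
        Map<String, String> userRow = events.get(userId);
        if (userRow == null) {
            userRow = new HashMap<String, String>();
            events.put(userId, userRow);
        }
        userRow.put(eventId, eventJson);

        // Duplicate the payload into the index, so that a lookup by click-id
        // is a single read instead of index-read plus a second event-read.
        indexLookup.put(clickId, eventJson);
    }

    public String findByClickId(String clickId) {
        return indexLookup.get(clickId); // one round-trip, no second fetch
    }
}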

Lesson #4: model your data to best serve your critical lookups, and don’t be shy about duplicating data - round-trips to the cluster are costly.

Lesson #5: JSON is great in terms of flexibility and readability, but there are faster binary serializations out there, and once you’ve trimmed all other factors, serialization can in turn become the issue.
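As a rough, back-of-the-envelope illustration (using only the JDK, and assuming the event fields can be packed into fixed-width values and enum ordinals), compare the size of the JSON text above with a hand-rolled binary encoding of the same fields:

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Rough size comparison: textual JSON vs. a hand-rolled binary encoding
// of the same (assumed) event fields.
public class SerializationSizeSketch {
    public static void main(String[] args) throws IOException {
        String json = "{\"event-time\": 1367515779, \"type\": \"click\", \"device\": \"mobile\"}";

        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(1367515779);  // event-time as a 4-byte epoch-seconds int
        out.writeByte(1);          // type encoded as a small enum ordinal
        out.writeByte(2);          // device likewise
        out.flush();

        System.out.println("JSON:   " + json.getBytes("UTF-8").length + " bytes");
        System.out.println("binary: " + bytes.size() + " bytes");
        // Prints roughly 60+ bytes for the JSON vs. 6 bytes for the binary form.
    }
}

In practice you would reach for an established binary format (Thrift, Avro, Protocol Buffers and the like) rather than rolling your own, but the size and parsing-cost gap is of this order.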

Consistency

Cassandra is an eventually consistent system. We ran into consistency issues in two cases: 1) nodes going down; and 2) read-before-write race conditions (which are actually an “anti-pattern”). The first issue was solved by scheduling read-repairs at reasonable intervals, so Cassandra could overcome the inconsistencies itself. The second required changes in the application workflow - locally caching results and making sure no read-before-write race conditions occur.
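One way to picture that workflow change is a small write-through cache in front of the cluster (names invented for this post): the application remembers what it has just written, so it never needs to read a value back before Cassandra has had a chance to propagate it.

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Sketch of the workflow change: instead of reading a row back right after
// writing it (a read-before-write race under eventual consistency), keep the
// last written value in a local cache and consult that first.
public class WriteThroughCacheSketch {
    private final ConcurrentMap<String, String> lastWritten =
            new ConcurrentHashMap<String, String>();

    void write(String key, String value) {
        lastWritten.put(key, value);   // remember locally what we just wrote
        cassandraWrite(key, value);
    }

    String read(String key) {
        String cached = lastWritten.get(key);
        if (cached != null) {
            return cached;             // no round-trip, no stale read
        }
        return cassandraRead(key);     // fall back to the cluster
    }

    // Stubs standing in for real cluster calls in this sketch.
    private void cassandraWrite(String key, String value) { /* ... */ }
    private String cassandraRead(String key) { return null; /* ... */ }
}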

Lesson #6: consistency is obviously something you need to think about in these types of systems. Anti-patterns are generally best avoided.


Erasing data

We have two basic scenarios where we need to erase data: cleaning up data that’s too old, and correcting erroneous data. We started out by implementing an erasing process over Cassandra, as we did with MySQL. This turned out to be not only ineffective but actually completely unneeded: old data can be cleared very easily using Cassandra’s built-in time-to-live feature, and instead of deleting erroneous data we now keep all events with an “event version”, allowing us to use only the latest.
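For illustration only, here is roughly what that looks like in CQL via the DataStax Java driver (we are not describing our actual client or schema here - the keyspace, table, columns and the 90-day TTL are all invented for this post):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

// Hypothetical sketch: old data expires on its own via TTL, and "erroneous"
// events are never deleted - a newer event_version supersedes them at read time.
public class TtlAndVersioningSketch {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("tracking");

        // Write: let Cassandra expire the column after 90 days (7,776,000 seconds).
        session.execute(
            "INSERT INTO events (user_id, event_id, event_version, payload) " +
            "VALUES ('u42', 'e1', 2, '{...}') USING TTL 7776000");

        // Read: all versions come back; no delete needed.
        ResultSet rs = session.execute(
            "SELECT event_id, event_version, payload FROM events WHERE user_id = 'u42'");
        for (Row row : rs) {
            // selection of the max event_version per event_id would go here
        }

        cluster.close();
    }
}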

Lesson #7: Since tracking data is a type of CRAP, we gave up on the update and delete parts of CRUD, and stuck with creating and reading only.

Current installation specs and numbers

  • 16 node-cluster
  • Dell R720 servers
  • 2 x CPU sockets with a total of 24 cores (12 physical cores plus hyperthreading).
  • 32GB of RAM
  • 6 x 600GB SAS 10K Drives
  • 2 x 2TB SATA (for backup)
  • 2 x 1Gbit/sec NICs (for redundancy)
  • Currently about 10TB of data in the cluster
  • Average write latency: 0.8 ms/op (min 0.5 ms/op, max: 2.5 ms/op)
  • Average read latency: 23 ms/op (min: 11.7 ms/op, max: 40 ms/op)

Development and testing environment

Although setting up a local cluster for development and testing is relatively easy, it nevertheless has some pitfalls: how do you clear data between tests? How do you let developers work without Cassandra at all? What about QA and staging? Currently, our tests run on both an embedded server and a dev cluster; there is a separate cluster for staging; and the application can be configured to load without Cassandra at all.
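As a sketch of that “no Cassandra at all” switch (the property name and classes are invented for this post), a simple factory can hand tests and plain dev machines an in-memory store, while everything else gets the cluster-backed one:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

// Hypothetical config-driven switch between a real and an in-memory store.
public class EventStoreFactory {

    interface EventStore {
        void save(String userId, String eventJson);
    }

    // In-memory stand-in for tests and Cassandra-less dev machines;
    // wiping state between tests is just clearing the map.
    static class InMemoryEventStore implements EventStore {
        private final Map<String, List<String>> data = new HashMap<String, List<String>>();
        public void save(String userId, String eventJson) {
            List<String> events = data.get(userId);
            if (events == null) {
                events = new ArrayList<String>();
                data.put(userId, events);
            }
            events.add(eventJson);
        }
        public void clear() { data.clear(); }
    }

    // Cluster-backed implementation (cluster calls omitted in this sketch).
    static class CassandraEventStore implements EventStore {
        public void save(String userId, String eventJson) { /* ... */ }
    }

    static EventStore create(Properties config) {
        if (Boolean.parseBoolean(config.getProperty("cassandra.enabled", "true"))) {
            return new CassandraEventStore();
        }
        return new InMemoryEventStore();
    }
}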

Lesson #8: as always, dev-ops tasks are tedious and time-consuming, and must be accounted for in any effort estimation.


Transitioning to Cassandra

For risk-management reasons, we didn’t want to transition our tracking application from MySQL to Cassandra all at once. The path chosen was to:
  1. have the data written both to Cassandra and to MySQL but read only from MySQL, just to see how the cluster performs. Then,
  2. gradually move application servers to read from Cassandra, all the while continuing to write to MySQL in case we needed to backtrack (which we indeed did, more than once). And finally,
  3. stop using MySQL for tracking.
We obviously also had to migrate old data from MySQL to Cassandra - we wrote a dedicated process for this. The only downside of this play-it-safe approach is that being able to easily “switch back to MySQL” can create a negative incentive to move forward in the face of obstacles.
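The migration process itself was, in essence, a long-running read-from-MySQL / write-to-Cassandra loop. A rough sketch (plain JDBC on the MySQL side; the connection details, table, columns and the cassandraWrite hook are invented for this post):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Rough shape of a dedicated migration process: stream old rows out of MySQL
// and write them into the cluster. Assumes the MySQL JDBC driver is on the classpath.
public class MySqlToCassandraMigration {
    public static void main(String[] args) throws Exception {
        Connection mysql = DriverManager.getConnection(
                "jdbc:mysql://mysql-host/tracking", "user", "password");
        Statement stmt = mysql.createStatement();
        ResultSet rows = stmt.executeQuery(
                "SELECT user_id, event_id, event_json FROM events");

        while (rows.next()) {
            cassandraWrite(rows.getString("user_id"),
                           rows.getString("event_id"),
                           rows.getString("event_json"));
        }

        rows.close();
        stmt.close();
        mysql.close();
    }

    private static void cassandraWrite(String userId, String eventId, String json) {
        // batched writes to the cluster would go here
    }
}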

Lesson #9: keep the old data around - you’ll probably need it.