Our new PoC is a real-time, streaming data filtering and segmentation engine.
This is one of those projects that comes up with its own name…
Even before the first line of code, the idea (if brought to life) seemed so insanely fast and efficient that the word Whack kept coming up.
So we kept "Whack" as the working title / code name / pet name.
It is able to chew through so much data, at such a crazy pace, that Whack seems to describe it correctly.
It draws inspiration from a lot of high performance / big data projects created over the years.
Some of the inspiration comes from functional languages like Haskell (https://www.haskell.org).
Some of the ideas come from concepts and implementation techniques in
Differential Dataflow (https://github.com/TimelyDataflow/differential-dataflow)
and the underlying
Timely Dataflow (https://github.com/timelydataflow/timely-dataflow).
Both of these projects are based on the paper
"Naiad: a timely dataflow system"
and the paper "Differential dataflow".
Frank McSherry (https://github.com/frankmcsherry/blog) was one of the authors of both papers while working at Microsoft Research.
Frank went on to build an SQL engine on top of it all, called Materialize (https://materialize.io).
Besides outside inspiration, it is also a product of our internal shared knowledge and experience from decades of working with high performance and huge amounts of data.
We have drawn from our experience with pushing databases to their limits and building custom multi-server databases when general-purpose databases weren't enough.
There's also a lot of focus on the data itself: what it is, how it looks and how to extract distilled info from it in the fastest way possible (that we can think of, at least).
The engine is built in Go (https://golang.org) and utilizes Protocol Buffers (Protobuf) (https://github.com/protocolbuffers/protobuf) and gRPC (https://www.grpc.io) for streaming data in and out.
There's always room for optimization. We have lots of ideas and know of at least a couple of areas where we can push the limits even further.
The goal is to make the network the only bottleneck left, which would push the throughput to several million profile updates per second on a 10 Gbit network.
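To get a feel for what "the network as the only bottleneck" could mean, here is a back-of-the-envelope sketch in Go. The average wire size of a profile update is not stated anywhere in this post, so the sizes below are purely illustrative assumptions, and framing and protocol overhead are ignored.

```go
package main

import "fmt"

// maxUpdatesPerSecond returns the upper bound on updates per second that a
// link of linkBitsPerSecond can carry if every update takes bytesPerUpdate
// bytes on the wire. Framing and protocol overhead are ignored.
func maxUpdatesPerSecond(linkBitsPerSecond, bytesPerUpdate float64) float64 {
	return linkBitsPerSecond / 8 / bytesPerUpdate
}

func main() {
	const tenGbit = 10e9 // 10 Gbit/s

	// Hypothetical average message sizes; these are not measured values.
	for _, size := range []float64{100, 250, 500, 1000} {
		fmt.Printf("%4.0f bytes/update -> %.2f million updates/s\n",
			size, maxUpdatesPerSecond(tenGbit, size)/1e6)
	}
}
```

Under these assumptions, an average update of 250 bytes would cap a 10 Gbit link at 5 million updates per second, which is in line with "several million".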
When handling 4 million profiles, we're currently able to stream 500,000 profile updates through the system. Each. Second...
How much is 500,000 profile updates?
A profile update is any change of data or events on a customer.
It is 500,000 pageviews, webshop checkouts, products viewed, products bought, logins, clicks in emails, newsletter subscriptions...
For comparison, it is more than 100 times the page views per second on Black Friday 2018 across the US top 100 retailers (Amazon included)!
Source: https://www.similarweb.com/blog/holiday-season-2018-thanksgiving-black-friday-numbers
That is on 1 cloud server with 8 virtual cores being pushed by 4 cloud servers with 2 virtual cores each.
Memory? Don't worry about it. Whack chews up ~4 GB of RAM under these loads, while the CPU cores are being utilized at a safe ~100%.
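The figures quoted above can be broken down a little further. The following Go sketch only does arithmetic on the numbers in this post; it assumes the load is spread evenly over the cores and the pushing servers, and it reads "~4 GB" as decimal gigabytes.

```go
package main

import (
	"fmt"
	"time"
)

// Figures quoted in the text.
const (
	updatesPerSecond = 500_000
	engineCores      = 8
	pusherServers    = 4
	profiles         = 4_000_000
	ramBytes         = 4_000_000_000 // ~4 GB, taken as decimal gigabytes
)

// perCoreBudget returns how much CPU time one core can spend on each update
// if the load is spread evenly over all cores (an assumption).
func perCoreBudget(updates, cores int) time.Duration {
	perCore := updates / cores
	return time.Second / time.Duration(perCore)
}

func main() {
	fmt.Println("updates per core per second:", updatesPerSecond/engineCores)
	fmt.Println("time budget per update:", perCoreBudget(updatesPerSecond, engineCores))
	fmt.Println("updates per pusher per second:", updatesPerSecond/pusherServers)
	fmt.Println("average RAM per profile in bytes:", ramBytes/profiles)
}
```

That works out to 62,500 updates per core per second, or roughly 16 microseconds of CPU time per update, with about 1 KB of RAM per profile, all overhead included.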
In our tests, we have used a real-time faker to ensure that we have random data that is also valid data in our live system.
Can it scale to more servers? Well, yes of course it can.
It is a truly simple, Scandinavian design. We have cut the bloat, kept the necessary and functional parts and solved hard issues pragmatically.
It is designed to be used like memcached by sharding on the keys.
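As a rough illustration of what sharding on keys can look like, here is a minimal Go sketch. It is not Whack's actual code: the function name shardFor, the FNV-1a hash and the plain modulo scheme are assumptions chosen for brevity.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor maps a profile key to one of numShards engine instances by
// hashing the key. The same key always lands on the same shard.
func shardFor(key string, numShards int) int {
	h := fnv.New32a()
	h.Write([]byte(key)) // Write on a hash.Hash never returns an error
	return int(h.Sum32() % uint32(numShards))
}

func main() {
	const numShards = 4 // hypothetical number of engine instances
	for _, key := range []string{"profile-1", "profile-2", "profile-3"} {
		fmt.Printf("%s -> shard %d\n", key, shardFor(key, numShards))
	}
}
```

Plain modulo sharding moves most keys to a different shard when the number of shards changes. A consistent-hashing scheme, which many memcached clients offer, would reduce that reshuffling.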
It is also built to be tolerant of failures. When it goes down, it just needs a stream of full profiles, which it chews through in a matter of seconds.
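The recovery idea can be sketched in a few lines of Go: the state is rebuilt simply by draining a stream of full profiles. The Profile type, its fields and the rebuild function are hypothetical, since the real message layout is not described here, and a Go channel stands in for the gRPC stream.

```go
package main

import "fmt"

// Profile is a hypothetical full profile snapshot; the real message layout
// is not described in the text.
type Profile struct {
	ID         string
	Attributes map[string]string
}

// rebuild drains a stream of full profiles into a fresh in-memory state.
// A later snapshot for the same ID replaces the earlier one.
func rebuild(stream <-chan Profile) map[string]Profile {
	state := make(map[string]Profile)
	for p := range stream {
		state[p.ID] = p
	}
	return state
}

func main() {
	stream := make(chan Profile)
	go func() {
		defer close(stream)
		stream <- Profile{ID: "p1", Attributes: map[string]string{"country": "DK"}}
		stream <- Profile{ID: "p2", Attributes: map[string]string{"country": "SE"}}
		stream <- Profile{ID: "p1", Attributes: map[string]string{"country": "NO"}}
	}()

	state := rebuild(stream)
	fmt.Println(len(state), state["p1"].Attributes["country"]) // prints "2 NO"
}
```

Because every message is a full profile rather than a delta, replaying the stream is idempotent: no write-ahead log or ordering across profiles is needed to get back to a consistent state.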
memcached, redis, Apache Kafka, Apache Spark, Apache Storm, Timely Dataflow, Differential Dataflow, Haskell, gRPC
Now go make your own… Or talk to us! :)