Sampling from data streams

A simple sampling strategy for data streams

Sampling data from a continuous stream of data can be useful in a variety of ways. Whether you’d like to extrapolate data based on samples or get an overview of your data based on a statistically sound subset of your data, there are a few questions that you have to answer.

  1. How to traverse a stream of data ideally?
  2. How to select probes during traversal?
  3. Given a continuous, infinite stream of data, how to curtail the set of samples to have a representative set of data with a finite size?

There are a couple of sampling strategies in literature that vary in their degree of complexity. I’d like to introduce you to a rather simple sampling strategy that is easy to implement as well as easy to reason about and might take you a long way until you have to go for more advanced solutions. I’m talking about Bernoulli sampling.

This is a blog post from our Community Stream: by developers, for developers. Don’t miss to stop by our community to find similar articles or join the conversation.

Bernoulli sampling

The basic idea is that we add a record with probability p \in (0, 1) to the set of samples. As you can see, this makes it easy to reason about the expected size of the set of sample: If we processed a total of N records at a specific point in time, the set of samples would have an expected size of pN.

For each record seen we toss a coin, where the probability of showing heads is p and the probability of showing tails is 1-p. Each time we see heads, we add the record to the set of samples. Each time we see tails, we process the record normally, but don’t add it to the set of samples. In theory, we would perform this toss of coin for each and every record that we encounter. This seems somewhat impractical and inefficient though. Luckily, we’re able to calculate the number of records to skip, before the next record is added to the set of samples in such a way that the probability of inclusion p is respected. If you’d like to look it up yourself, we’re getting into the properties of the inverse binomial distribution.

Consider the following problem:

A farmer produces and packages apples for sale in packs. The chance of an apple being bruised in transit is 5%. How many apples should the farmer put in a pack if he wants to claim that there is an 80% chance that they will arrive with no bruises at all?

The mathematical discipline of stochastics teaches us that this problem can be regarded as a repeated Bernoulli experiment. The probability that an apple bruises is p = 0.05 (5%), whereas the probability that the apple remains healthy is 1-p=0.95 (95%). We’re looking for the occurence of event X=0, in which no apples have been bruised during transit: P(X=0)=0.8. This equates to (1-p)^n=0.8, where n is the number of apples that the farmer should pack (or: the number of subsequent Bernoulli experiments that we need to consider). Solving this for n yields:

(1-p)^n = 0.8

0.95^n = 0.8

\log(0.95)^n = \log(0.8)

n \log(0.95) = \log(0.8)

n = \frac{\log(0.8)}{\log(0.95)}

Apples or oranges, what has this to do with sampling records? Well, the problem stated above can actually be translated to suit our needs. Let’s see.

An external system continuously streams events. The chance that an intermediate stream processor samples such an event is 5%. How many events should the stream processor skip, before the next event should be included into the set of samples? (such that the size of the resulting sample set is a binomially distributed variable)

We’re now solving P(X=0) = U where U \in \text{Unif}(0,1). This means that U is a uniformly distributed random variable. If you want to read up on the details, be sure to look up Inverse transform sampling for starters. We’re basically turning a uniform distribution into a binomial distribution such that it respects the parameters of the given problem. We’ll have to solve (1-p)^n=U for n. Let’s have a go, shall we?

(1-p)^n = U

\log(1-p)^n = \log(U)

n \log(1-p) = \log(U)

n = \frac{\log(U)}{\log(1-p)}

We’re looking for an integer-based value, so we’ll have to floor the result. This yields n = \lfloor\frac{\log(U)}{\log(1-p)}\rfloor, where n is the number of events to skip before the next event is added to the set of samples.

This can easily be expressed in Java code, as the following listing suggests.

Random sourceOfRandomness = new Random();
double p = 0.05;
double U = sourceOfRandomness.nextDouble();
int skip = (int) Math.floor(Math.log(U) / Math.log(1-p))

Integrating the skip technique into a Bernoulli sampler saves us from costly calls to a random number generator for every record seen. We’ll only have to do this whenever a record has been added to set of samples.

Curtailing strategy

We still have to deal with an infinitely growing set of samples. To compensate this, the set of samples should be restricted to a given maximum size. As long as the size of the sample set is not exhausted, we’ll include every record that we see. As soon as its capacity is reached though, we’ll apply the semantics of the Bernoulli sampler. On the event that a record is to be included into the set of samples, we have to make the decision which one of the already existing records can be exchanged for it. I’d suggest to start off easy and choose a random offset, remove the old sample associated with the offset and add the new one. In any case, the set of samples evolves with time, incorporating new samples from the tail of the stream while also giving a glimpse into the past of the stream by retaining old samples as well. Of course, these kind of curtailing strategies are usually a bit more complex. It all depends on your use case.

Showcase

I put together a small showcase that implements a curtailed Bernoulli sampling strategy (cf. class CurtailedBernoulliSampler) that integrates with a Kafka consumer to sample domain events from a dedicated Kafka topic. In case you’re interested, you’ll find the sources over at my GitHub.

This article has been originally published at Markus’s blog.

About Markus

Markus is an independent freelance software architect (iSAQB certified), developer and trainer with main focus on Java-/Kotlin-based enterprise solutions, systems integration and modern architectural styles. He is working with Apache Kafka and its ecosystem for a couple of years now. Apart from day-to-day business, Markus regularly publishes articles on messaging solutions and streaming concepts, both in print and on his blog. He is also the maintainer of the small, but crafty testing library Kafka for JUnit. In recognition of his work for the Kafka community, he has been elected as a Confluent Community Catalyst (Class of 2020-2021). You can reach Markus via his e-mail.