
Kafka Streams Introduction

So this post will be an introductory one on Kafka Streams. It is not intended to be an introduction to Apache Kafka itself; for that there are many good books, posts and documents available that cover the topic.

 

That said there are still a few key concepts that we will need to introduce before we can start talking about Kafka Streams, such as

 

  • Kafka overall architecture
  • Topics
  • Commit Log
  • Consumer group

 

After we have talked about these (and I will only be skimming the surface of them; you really would be better off reading about Kafka properly if this is brand new territory for you), I will then move on to talk about some of the basic building blocks of Kafka Streams.

 

Kafka overall architecture

[Diagram: a Kafka cluster of brokers, with producers, consumers and ZooKeeper]

 

As shown in the diagram above, Kafka is a message broker and is intended to run as a cluster of brokers (which might be a single broker, but that is obviously not what’s recommended). It also makes use of ZooKeeper for various state storage and metadata.

 

Topics

Kafka uses topics as a way of organizing what data is produced/read. Producers push data to topics, whilst consumers read data from topics.

 

Topics are divided into a number of partitions (there is a default partitioning strategy at play, but you can write your own). Partitions allow you to parallelize a topic by potentially storing its data across multiple machines, and the partitions may also be read in parallel.
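To make that concrete, here is a rough sketch of a producer sending keyed records to a topic. The topic name "page-views" and the serializer choices are illustrative assumptions, not anything from this series; the point is that, under the default partitioner, records sharing a key land on the same partition.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class SimpleProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // With the default partitioner, records with the same key always go to the
                // same partition, which is what preserves per-key ordering.
                producer.send(new ProducerRecord<>("page-views", "alice", "/home"));
                producer.send(new ProducerRecord<>("page-views", "alice", "/checkout"));
            }
        }
    }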

 

Another key part of a topic is the offset, which is maintained by the consumer and indicates where the consumer should resume reading within a partition of the topic. As you might imagine, for this to be possible the messages must be ordered; Kafka guarantees ordering within a partition for you.

 

[Diagram: a Kafka topic split into partitions, each an ordered, append-only sequence of records identified by offsets]

Commit Log

Kafka retains messages for a configurable amount of time. This means that a consumer may go down and restart again, and provided it is within the log retention period, it will just start reading messages from where it left off.

 

Consumer Groups

Kafka has consumers, which read data from topics, and consumers can be organized into consumer groups for a topic. Within a group, each partition is assigned to exactly one consumer, and the group as a whole reads the entire topic. If you organize your consumers in such a way that you have more consumers than partitions, some will sit idle; if you have more partitions than consumers, some consumers will receive messages from multiple partitions.

 

[Diagram: a consumer group whose members each read from one or more partitions of a topic]
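As a minimal sketch of a consumer-group member (the topic and group names here are again assumptions for illustration): setting group.id is all it takes to join a group, and Kafka assigns partitions to the members.

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class GroupMember {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "page-view-readers"); // the consumer group
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("page-views"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<String, String> record : records) {
                        // The partition/offset pair is where this record lives in the commit log.
                        System.out.printf("partition=%d offset=%d key=%s value=%s%n",
                                record.partition(), record.offset(), record.key(), record.value());
                    }
                }
            }
        }
    }

Running a second copy of this program with the same group.id would cause Kafka to rebalance the topic’s partitions across the two instances.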

 

Further Reading

 

As I say, I did not want to get too bogged down in the Kafka fundamentals, as this series is really about Kafka Streams, but the few points above, together with the official Apache Kafka documentation, should get you to a pretty decent position to understand the rest of this post.

 

Kafka Streams Introduction

Kafka Streams is a client library for processing and analyzing data stored in Kafka. It builds upon important stream processing concepts such as properly distinguishing between event time and processing time, windowing support, and simple yet efficient management and real-time querying of application state.

Kafka Streams has a low barrier to entry: You can quickly write and run a small-scale proof-of-concept on a single machine; and you only need to run additional instances of your application on multiple machines to scale up to high-volume production workloads. Kafka Streams transparently handles the load balancing of multiple instances of the same application by leveraging Kafka’s parallelism model.
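To give a feel for that low barrier to entry, here is a minimal, hypothetical Kafka Streams application. The topic names are made up for illustration; the real topologies for this series will come in later posts.

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;

    public class HelloStreams {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "hello-streams"); // also used as the consumer group id
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();
            KStream<String, String> input = builder.stream("input-topic");
            input.mapValues(value -> value.toUpperCase()) // a trivial per-record transformation
                 .to("output-topic");

            KafkaStreams streams = new KafkaStreams(builder.build(), props);
            streams.start();
            // Scaling up is simply running more instances of this same application.
            Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
        }
    }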

 

Stream Processing Topology

  • A stream is the most important abstraction provided by Kafka Streams: it represents an unbounded, continuously updating data set. A stream is an ordered, replayable, and fault-tolerant sequence of immutable data records, where a data record is defined as a key-value pair.
  • A stream processing application is any program that makes use of the Kafka Streams library. It defines its computational logic through one or more processor topologies, where a processor topology is a graph of stream processors (nodes) that are connected by streams (edges).
  • A stream processor is a node in the processor topology; it represents a processing step that transforms data in streams by receiving one input record at a time from its upstream processors in the topology, applying its operation to it, and subsequently producing one or more output records to its downstream processors.
    There are two special processors in the topology:
    Source Processor: A source processor is a special type of stream processor that does not have any upstream processors. It produces an input stream to its topology from one or multiple Kafka topics by consuming records from these topics and forwarding them to its downstream processors.
    Sink Processor: A sink processor is a special type of stream processor that does not have downstream processors. It sends any received records from its upstream processors to a specified Kafka topic.
    Note that in normal processor nodes, other remote systems can also be accessed while processing the current record, so the processed results can either be streamed back into Kafka or written to an external system. (A minimal sketch of such a topology follows below.)
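Here is a minimal sketch of a source → processor → sink topology using the lower-level Processor API as it looks in the 2.x releases that the documentation links in this post refer to. All names (topics, processor names, the transformation) are assumptions for illustration.

    import org.apache.kafka.streams.Topology;
    import org.apache.kafka.streams.processor.Processor;
    import org.apache.kafka.streams.processor.ProcessorContext;

    public class TopologySketch {
        public static Topology build() {
            Topology topology = new Topology();
            topology.addSource("Source", "input-topic");            // source processor: consumes from Kafka
            topology.addProcessor("Uppercase",                      // an ordinary processor node
                    UppercaseProcessor::new, "Source");
            topology.addSink("Sink", "output-topic", "Uppercase");  // sink processor: writes back to Kafka
            return topology;
        }

        static class UppercaseProcessor implements Processor<String, String> {
            private ProcessorContext context;

            @Override
            public void init(ProcessorContext context) { this.context = context; }

            @Override
            public void process(String key, String value) {
                // one input record in, one output record forwarded downstream
                context.forward(key, value == null ? null : value.toUpperCase());
            }

            @Override
            public void close() { }
        }
    }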

Although this post stays largely at the conceptual level (the snippets above and below are only illustrative sketches), I want to introduce a few things that will be coming up in the next posts, namely

  • KStream
  • KTable
  • GlobalKTable
  • State Stores

 

KStream

A KStream is an abstraction of a record stream, where each data record represents a self-contained datum in the unbounded data set. Using the table analogy, data records in a record stream are always interpreted as an “INSERT” — think: adding more entries to an append-only ledger — because no record replaces an existing row with the same key. Examples are a credit card transaction, a page view event, or a server log entry.

To illustrate, let’s imagine the following two data records are being sent to the stream:

(“alice”, 1) -> (“alice”, 3)

If your stream processing application were to sum the values per user, it would return 4 for alice. Why? Because the second data record would not be considered an update of the previous record.

https://kafka.apache.org/21/documentation/streams/developer-guide/dsl-api.html#streams_concepts_kstream
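A rough sketch of that per-user sum over a KStream might look like the following; the topic names and serdes are assumptions. Both (“alice”, 1) and (“alice”, 3) feed the aggregation, giving 4.

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.kstream.Consumed;
    import org.apache.kafka.streams.kstream.Grouped;
    import org.apache.kafka.streams.kstream.KStream;
    import org.apache.kafka.streams.kstream.KTable;
    import org.apache.kafka.streams.kstream.Produced;

    public class KStreamSum {
        public static void main(String[] args) {
            StreamsBuilder builder = new StreamsBuilder();

            KStream<String, Integer> values =
                    builder.stream("user-values", Consumed.with(Serdes.String(), Serdes.Integer()));

            // Every record is an INSERT, so ("alice", 1) and ("alice", 3) both count: 1 + 3 = 4.
            KTable<String, Integer> sums = values
                    .groupByKey(Grouped.with(Serdes.String(), Serdes.Integer()))
                    .reduce(Integer::sum);

            sums.toStream().to("user-sums", Produced.with(Serdes.String(), Serdes.Integer()));
            // Config and KafkaStreams start-up omitted; see the earlier HelloStreams skeleton.
        }
    }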

 

KTable

A KTable is an abstraction of a changelog stream, where each data record represents an update. More precisely, the value in a data record is interpreted as an “UPDATE” of the last value for the same record key, if any (if a corresponding key doesn’t exist yet, the update will be considered an INSERT). Using the table analogy, a data record in a changelog stream is interpreted as an UPSERT aka INSERT/UPDATE because any existing row with the same key is overwritten. Also, null values are interpreted in a special way: a record with a null value represents a “DELETE” or tombstone for the record’s key.

To illustrate, let’s imagine the following two data records are being sent to the stream:

(“alice”, 1) -> (“alice”, 3)

If your stream processing application were to sum the values per user, it would return 3 for alice. Why? Because the second data record would be considered an update of the previous record.

https://kafka.apache.org/21/documentation/streams/developer-guide/dsl-api.html#streams_concepts_ktable
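Reading the same topic as a KTable instead is essentially a one-line change; a sketch under the same assumed names:

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.kstream.Consumed;
    import org.apache.kafka.streams.kstream.KTable;

    public class KTableView {
        public static void main(String[] args) {
            StreamsBuilder builder = new StreamsBuilder();

            // Each record upserts the row for its key, so after ("alice", 1) and ("alice", 3)
            // the table holds alice -> 3; a null value would delete (tombstone) the row.
            KTable<String, Integer> latestPerUser =
                    builder.table("user-values", Consumed.with(Serdes.String(), Serdes.Integer()));

            latestPerUser.toStream().to("user-latest");
        }
    }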

 

GlobalKTable

Like a KTable, a GlobalKTable is an abstraction of a changelog stream, where each data record represents an update.

A GlobalKTable differs from a KTable in the data that they are being populated with, i.e. which data from the underlying Kafka topic is being read into the respective table. Slightly simplified, imagine you have an input topic with 5 partitions. In your application, you want to read this topic into a table. Also, you want to run your application across 5 application instances for maximum parallelism.

  • If you read the input topic into a KTable, then the “local” KTable instance of each application instance will be populated with data from only 1 partition of the topic’s 5 partitions.
  • If you read the input topic into a GlobalKTable, then the local GlobalKTable instance of each application instance will be populated with data from all partitions of the topic.

GlobalKTable provides the ability to look up current values of data records by keys.

https://kafka.apache.org/21/documentation/streams/developer-guide/dsl-api.html#streams_concepts_globalktable
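A common use of a GlobalKTable is enriching a stream via key lookups, since every application instance holds all partitions of the table locally. A rough sketch, with made-up topic names and plain String values for brevity:

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.kstream.Consumed;
    import org.apache.kafka.streams.kstream.GlobalKTable;
    import org.apache.kafka.streams.kstream.KStream;
    import org.apache.kafka.streams.kstream.Produced;

    public class GlobalLookupJoin {
        public static void main(String[] args) {
            StreamsBuilder builder = new StreamsBuilder();

            // Orders keyed by orderId, with the value holding a customerId.
            KStream<String, String> orders =
                    builder.stream("orders", Consumed.with(Serdes.String(), Serdes.String()));

            // Customer profiles keyed by customerId, fully replicated to every instance.
            GlobalKTable<String, String> customers =
                    builder.globalTable("customers", Consumed.with(Serdes.String(), Serdes.String()));

            KStream<String, String> enriched = orders.join(
                    customers,
                    (orderId, customerId) -> customerId,                     // which key to look up in the table
                    (customerId, profile) -> customerId + " : " + profile);  // combine order value with looked-up row

            enriched.to("orders-enriched", Produced.with(Serdes.String(), Serdes.String()));
        }
    }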

 

State Stores

Kafka Streams provides so-called state stores, which can be used by stream processing applications to store and query data. This is an important capability when implementing stateful operations. Every task in Kafka Streams embeds one or more state stores that can be accessed via APIs to store and query data required for processing. These state stores can either be a persistent key-value store, an in-memory hashmap, or another convenient data structure. Kafka Streams offers fault-tolerance and automatic recovery for local state stores.

Kafka Streams allows direct read-only queries of the state stores by methods, threads, processes or applications external to the stream processing application that created the state stores. This is provided through a feature called Interactive Queries. All stores are named and Interactive Queries exposes only the read operations of the underlying implementation.

https://kafka.apache.org/21/documentation/streams/core-concepts#streams_state
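As an illustrative sketch of an interactive query against a named key-value store, assuming the topology materialized a store called "user-sums-store" (the name is made up, e.g. via Materialized.as on an aggregation), and using the store(name, type) overload from the 2.x API that the link above refers to:

    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.state.QueryableStoreTypes;
    import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

    public class StoreQuery {
        // Looks up the current aggregated value for a user from the local state store.
        public static Integer lookup(KafkaStreams streams, String user) {
            ReadOnlyKeyValueStore<String, Integer> store =
                    streams.store("user-sums-store", QueryableStoreTypes.<String, Integer>keyValueStore());
            // Interactive Queries expose only read operations: get/range/all, no writes.
            return store.get(user);
        }
    }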

 

That’s It

So that’s it for now. I have borrowed a lot of material from the official docs for this post, but as we move through the series we will start to form our own code/topologies, and as such our own tests/descriptions. So please forgive me this one.
