This post does not cover all the theory behind kafka but it focuses on the important aspects that you need to know to start working and be productive with kafka.
Now I will summarise below the minimal concepts required to start.
- A kafka cluster is made of brokers (nodes). Kafka uses Zookeeper to coordinate brokers.
- Zookeeper is a key-value store, used to coordinate distributed workloads, to configure services, etc.You interact with kafka mainly by consuming or publishing data to topics
- Each broker stores / replicates the state (i.e.: topics).
- To scale a topic you create partitions.
- kafka is known to scale up to 200k partitions per cluster. When creating a topic you need to provide the number of partitions and the replication factor. The replication factor means how many times the topic’s state is duplicated.
- The default number of partitions is 6 and the default replication number is 3.
- An offset is an unique identifier of a message within a partition.
- Consumers groups is a Kafka abstraction that enables supporting both point-to-point and publish/subscribe messaging.
- Consumer groups is also a way of supporting parallel consumption of the data i.e. different consumers of the same consumer group consume data in parallel from different partitions.
- The consumers in a group then divides the topic partitions as fairly amongst themselves as possible by establishing that each partition is only consumed by a single consumer from the group. i.e.: Ideally, the number of partitions is equal to the number of consumers.
- If the number of consumers is greater, the excess consumers are idle, wasting client resources. If the number of partitions is greater, some consumers will read from multiple partitions which should not be an issue unless the ordering of messages is important to the use case.
- You can use the CLI tools to create API keys, topics, create ACLs (incl. consumer groups), create schemas (incl. schema registry) and service accounts, creation of connectors, ksql
You need clients (consumers/producers) and a kafka cluster.
Setting up clients
To build clients, you will need the kafka client libraries and/or kafka CLI tools (link to download here)
Setting up clusters
There are endless ways to setup a kafka cluster, incl. running the binaries yourself, running kafka in kubernetes, consuming kafka as a service, etc. In this post I will use kafka as a service from confluent. Obviously confluent is not the only kafka as a service provider, you can also get it from AWS, Google, etc.
Browse to https://confluent.cloud, create an account and create a cluster and then create an API key which we will use to connect to consume and produce topics.
The next step is to create a topic, you can do that with the UI or with the CLI. We will focus on the CLI.
To create a topic using the CLI, you need to create a configuration file (jaas.conf). You can just copy and paste the required values from confluent cloud’s CLI Tool Configuration page. I also have an gist example below:
Then you use the CLI as per below:
Now you can start publishing messages into those topics as follows:
To list the existing kafka topics you use the command below:
That’s all for prerequisites 🙂
Using kafka topics
You can use kafka in two modes: as a traditional queue or as a streaming system. Both options are explained below.
Kafka as a traditional queue
The code below better illustrates this 🙂
kafka as a streaming platform
When using kafka as a stream platform you can (extract from here):
- To publish (write) and subscribe to (read) streams of events, including continuous import/export of your data from other systems.
- To store streams of events durably and reliably for as long as you want.
- To process streams of events as they occur or retrospectively.
An example of a kafka Streaming sample app in Java is available in my gitrepo here: https://github.com/jacace/sample_tech_kafka_java_streaming
All consumers work in a load-balanced mode; in other words, each message will be seen by one consumer in the group. If a consumer goes away, the partition is assigned to another consumer in the group — that’s rebalance.
“On start up, a broker is marked as the coordinator for the subset of consumer groups which receive the RegisterConsumer Request from consumers and returns the RegisterConsumer Response containing the list of partitions they should own. The coordinator also starts failure detection to check if the consumers are alive or dead. When the consumer fails to send a heartbeat to the coordinator broker before the session timeout, the coordinator marks the consumer as dead and a rebalance is set in place to occur” (extracted from here)
Adding partitions doesn’t change the partitioning of existing data so this may disturb consumers if they rely on that partition.
Other topics for later
- Kafka connect
- Schema Registry
- KStreams (inserts9 and KTable (upserts)
- KStreams and KTable (advanced operations, stateful -count VS multiply-) #6
- Jjoins KStreams to GlobalKTable #9
- Testing kafkastreams & kafka applications #10
Originally published at http://jacace.wordpress.com on February 1, 2021.