Kafka Schema Registry

This is the third post in a small series on Apache Kafka + Avro, with Scala as the programming language. If you want to run the code that goes along with each post, you will need the prerequisites below. I am mainly a Windows user, so the instructions and scripts have a Windows bias; if you are not a Windows user you will need to find the equivalent instructions for your OS of choice.

 

Prerequisites

So go and grab that lot if you want to follow along.

 

Last time we talked about how to create a Kafka Producer/Consumer which uses the KafkaAvroSerializer when using Avro data of a specific type. We also used the Kafka Schema Registry and had a tiny introduction as to what the Kafka Schema Registry was all about. This time we will be looking at the Kafka Schema Registry in a lot more detail.

 

 

Where is the code for this post?

You can grab the code that goes with this post from here : https://github.com/sachabarber/KafkaAvroExamples/tree/master/KafkaSpecificAvroBrokenProducer and here too : https://github.com/sachabarber/KafkaAvroExamples/tree/master/KafkaSchemaRegistryTests

 

Kafka Schema Registry Deep Dive

The idea behind the Schema Registry is that Confluent provides it as a service that exposes a REST API and integrates closely with the rest of the Kafka stack. The Schema Registry acts as a single store for schema metadata, kept as a versioned history. It also provides compatibility settings that govern how schemas are allowed to evolve. As we have already seen in the last post, there are also serializers that work with the Registry and the Avro format. Thus far we have only seen a Producer/Consumer, both of which use the KafkaAvroSerializer, but there is also a Serde (serializer/deserializer) available when working with Kafka Streams, which we will talk about in the next post.

 

You may be wondering why you would want a schema registry at all. Have you ever stored a JSON blob in a row in a database, and then later changed the format of the object that produced those historic rows? You then find yourself in the unenviable position of no longer being able to easily deserialize your old JSON data. You need to either write some mapping code or evolve the data to the latest schema format. This is not an easy task; I have done it by hand once or twice and it's hard to get right. This is essentially the job of the Kafka Schema Registry: it is a guiding light that helps you through this problem.

 

I have taken the next section from the official docs, as it contains detail that is important to the rest of this post, and it really is a case of the official docs saying it best. So let’s just borrow this bit of text; the rest of the post will put it into action.

 

The Schema Registry is a distributed storage layer for Avro Schemas which uses Kafka as its underlying storage mechanism. Some key design decisions:

  • Assigns globally unique id to each registered schema. Allocated ids are guaranteed to be monotonically increasing but not necessarily consecutive.
  • Kafka provides the durable backend, and functions as a write-ahead changelog for the state of the Schema Registry and the schemas it contains.
  • The Schema Registry is designed to be distributed, with single-master architecture, and ZooKeeper/Kafka coordinates master election (based on the configuration).

 

Kafka Backend

Kafka is used as the Schema Registry storage backend. The special Kafka topic <kafkastore.topic> (default _schemas), with a single partition, is used as a highly available write ahead log. All schemas, subject/version and id metadata, and compatibility settings are appended as messages to this log. A Schema Registry instance therefore both produces and consumes messages under the _schemas topic. It produces messages to the log when, for example, new schemas are registered under a subject, or when updates to compatibility settings are registered. The Schema Registry consumes from the _schemas log in a background thread, and updates its local caches on consumption of each new _schemas message to reflect the newly added schema or compatibility setting. Updating local state from the Kafka log in this manner ensures durability, ordering, and easy recoverability.
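To make the "log plus local cache" idea a bit more concrete, here is a minimal sketch of how state could be rebuilt by replaying an ordered log. The message and state types are hypothetical names of my own for illustration; they are not the Schema Registry's real internal classes.

```scala
// Hypothetical, simplified model of replaying the _schemas log into a local cache.
// None of these types come from the Schema Registry code base.
object SchemaLogReplay {

  sealed trait LogMessage
  // A schema registered under a subject, with the version and global id it was given
  final case class SchemaRegistered(subject: String, version: Int, id: Int, schema: String) extends LogMessage
  // A compatibility setting updated for a subject
  final case class CompatibilityUpdated(subject: String, level: String) extends LogMessage

  final case class RegistryState(
    schemasById: Map[Int, String] = Map.empty,
    versionsBySubject: Map[String, Map[Int, Int]] = Map.empty, // subject -> (version -> id)
    compatibility: Map[String, String] = Map.empty
  )

  // Apply a single log message to the local state
  def applyMessage(state: RegistryState, msg: LogMessage): RegistryState = msg match {
    case SchemaRegistered(subject, version, id, schema) =>
      val versions = state.versionsBySubject.getOrElse(subject, Map.empty[Int, Int]) + (version -> id)
      state.copy(
        schemasById = state.schemasById + (id -> schema),
        versionsBySubject = state.versionsBySubject + (subject -> versions))
    case CompatibilityUpdated(subject, level) =>
      state.copy(compatibility = state.compatibility + (subject -> level))
  }

  // Rebuilding the cache is a left fold over the ordered log
  def replay(log: Seq[LogMessage]): RegistryState =
    log.foldLeft(RegistryState())(applyMessage)

  def main(args: Array[String]): Unit = {
    val log = Seq(
      SchemaRegistered("Kafka-key", 1, 1, "\"string\""),
      SchemaRegistered("Kafka-value", 1, 1, "\"string\""),
      CompatibilityUpdated("Kafka-value", "NONE"))
    println(replay(log))
  }
}
```

Because the state is only ever derived from the log, a restarted instance can recover by replaying from the start, which is the recoverability point made above.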

 

Compatibility

The Schema Registry server can enforce certain compatibility rules when new schemas are registered in a subject. Currently, we support the following compatibility rules.

  • Backward compatibility (default): A new schema is backwards compatible if it can be used to read the data written in the latest (older) registered schema.
  • Transitive backward compatibility: A new schema is transitively backwards compatible if it can be used to read the data written in all previously registered schemas. Backward compatibility is useful for loading data into systems like Hadoop since one can always query data of all versions using the latest schema.
  • Forward compatibility: A new schema is forward compatible if the latest (old) registered schema can read data written in this schema, or to put this another way “data encoded with a newer schema can be read with an older schema.”
  • Transitive forward compatibility: A new schema is transitively forward compatible if all previous schemas can read data written in this schema. Forward compatibility is useful for consumer applications that can only deal with data in a particular version that may not always be the latest version.
  • Full compatibility: A new schema is fully compatible if it’s both backward and forward compatible with the latest registered schema.
  • Transitive full compatibility: A new schema is transitively full compatible if it’s both backward and forward compatible with all previously registered schemas.
  • No compatibility: A new schema can be any schema as long as it’s a valid Avro.
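The difference between the plain and the transitive rules is easier to see in code. The sketch below is my own illustration and not the Registry's implementation: it takes a canRead(reader, writer) function as a parameter and only shows which previous versions each rule looks at. The "schema as a set of required field names" used in main is a deliberately crude toy.

```scala
// Illustrative only: which historic versions each compatibility rule compares against.
object CompatibilityModes {
  // canRead(reader, writer) is true when data written with `writer` can be read using `reader`
  type CanRead[S] = (S, S) => Boolean

  // Backward: the candidate must read data written with the latest registered schema
  def backward[S](canRead: CanRead[S])(candidate: S, history: Seq[S]): Boolean =
    history.lastOption.forall(latest => canRead(candidate, latest))

  // Transitive backward: the candidate must read data written with every registered schema
  def backwardTransitive[S](canRead: CanRead[S])(candidate: S, history: Seq[S]): Boolean =
    history.forall(older => canRead(candidate, older))

  // Forward: the latest registered schema must read data written with the candidate
  def forward[S](canRead: CanRead[S])(candidate: S, history: Seq[S]): Boolean =
    history.lastOption.forall(latest => canRead(latest, candidate))

  // Transitive forward: every registered schema must read data written with the candidate
  def forwardTransitive[S](canRead: CanRead[S])(candidate: S, history: Seq[S]): Boolean =
    history.forall(older => canRead(older, candidate))

  def full[S](canRead: CanRead[S])(candidate: S, history: Seq[S]): Boolean =
    backward(canRead)(candidate, history) && forward(canRead)(candidate, history)

  def fullTransitive[S](canRead: CanRead[S])(candidate: S, history: Seq[S]): Boolean =
    backwardTransitive(canRead)(candidate, history) && forwardTransitive(canRead)(candidate, history)

  def main(args: Array[String]): Unit = {
    // Toy schema: the set of field names a reader requires (no defaults)
    val toyCanRead: CanRead[Set[String]] = (reader, writer) => reader.subsetOf(writer)
    val history = Seq(Set("id"), Set("id", "name"))
    val candidate = Set("id", "name")
    println(s"backward:            ${backward(toyCanRead)(candidate, history)}")
    println(s"backward transitive: ${backwardTransitive(toyCanRead)(candidate, history)}")
  }
}
```

In the toy run the candidate can read the latest version but not the very first one, so it passes the backward rule and fails the transitive backward rule.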

 

An Evolution Example

Say we had this initial schema

{"namespace": "barbers.avro",
 "type": "record",
 "name": "person",
 "fields": [
     {"name": "name", "type": "string"},
     {"name": "age",  "type": "int"}
 ]
}

We can evolve this schema into this one, where we supply a default location.

{"namespace": "barbers.avro",
 "type": "record",
 "name": "person",
 "fields": [
     {"name": "name", "type": "string"},
     {"name": "age",  "type": "int"},
     {"name": "location", "type": "string", "default": "uk"}
 ]
}

 

This allows data encoded with the old schema to be read using the new schema, so the change is backward compatible. If we had not supplied the default for “location”, it would no longer be backward compatible.

 

Forward compatibility means that data encoded with a newer schema can be read with an older schema. This new schema is also forward compatible with the original schema, since we can just drop the defaulted new field “location” when projecting new data into the old. Had the new schema chosen to no longer include the “age” field, it would not be forward compatible, as we would not know how to fill in this field for the old schema.
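The person example can be checked with a few lines of Scala. This is a deliberately simplified sketch of the resolution rule described above (a reader field must either exist in the writer with the same type, or carry a default). The Field and Record case classes are my own stand-ins, not Avro classes, and the check ignores type promotion, aliases and unions that real Avro handles.

```scala
// Simplified stand-in for Avro record resolution; not the Avro library itself.
object RecordEvolution {
  final case class Field(name: String, avroType: String, default: Option[String] = None)
  final case class Record(name: String, fields: List[Field])

  // True when data written with `writer` can be read using `reader`
  def canRead(reader: Record, writer: Record): Boolean =
    reader.name == writer.name && reader.fields.forall { readerField =>
      writer.fields.find(_.name == readerField.name) match {
        case Some(writerField) => writerField.avroType == readerField.avroType
        case None              => readerField.default.isDefined // missing data needs a default
      }
    }

  val personV1 = Record("person", List(Field("name", "string"), Field("age", "int")))
  val personV2 = Record("person", personV1.fields :+ Field("location", "string", Some("uk")))
  val personV2NoDefault = Record("person", personV1.fields :+ Field("location", "string"))
  val personWithoutAge = Record("person", List(Field("name", "string"), Field("location", "string", Some("uk"))))

  def main(args: Array[String]): Unit = {
    println(s"backward (v2 reads v1 data):            ${canRead(personV2, personV1)}")
    println(s"backward without a default:             ${canRead(personV2NoDefault, personV1)}")
    println(s"forward (v1 reads v2 data):             ${canRead(personV1, personV2)}")
    println(s"forward after dropping age (v1 reads):  ${canRead(personV1, personWithoutAge)}")
  }
}
```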

For more information on this you should have a bit more of a read here : https://docs.confluent.io/current/avro.html

 

Ok so I think that gives us a flavor of what the Schema Registry is all about, let’s carry on and see some code examples.

 

So how do I run this stuff?

 

As I stated above you will need to download a few things, but once you have those in place you may find the small PowerShell script inside the projects, “RunThePipeline.ps1”, useful. This script does a few things: it cleans the Kafka/Zookeeper logs, stops any previous instances, starts new instances and creates the Kafka topic (which must exist before you can use the code).

IMPORTANT NOTE : I have altered the Kafka log paths, and where Zookeeper logs to. This can be done using the relevant properties files from the Confluent installation

 


 

server.properties

This line was changed

# A comma seperated list of directories under which to store log files
log.dirs=c:/temp/kafka-logs

 

zookeeper.properties

This line was changed

# the directory where the snapshot is stored.
dataDir=c:/temp/zookeeper

 

The PowerShell script makes assumptions based on where you changed these values to, so if you have different settings for these please edit the PowerShell script too.

 

Essentially you need to have the following running before you can run the code example. This is what the PowerShell script automates for you, though you may prefer to use the Confluent CLI. I just had this script from a previous project; it also cleans out old data, which is useful when you are experimenting, and creates the required topic.

 

  • Zookeeper
  • Kafka broker
  • Schema Registry
  • Kafka topic created
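If you want a quick sanity check that these pieces are up before running the demos, something like the following sketch can help. It only tests whether anything is listening on each port; it does not verify the topic exists. Ports 9092 and 8081 are the ones used by the code in this post, while 2181 for Zookeeper is an assumption on my part (it is the usual default), so adjust it if your zookeeper.properties says otherwise.

```scala
import java.net.{InetSocketAddress, Socket}

// Minimal "is anything listening?" check for the local pipeline.
object PipelineCheck {
  // 2181 is assumed (usual Zookeeper default); 9092 and 8081 match the demo code
  val services: Seq[(String, Int)] =
    Seq("Zookeeper" -> 2181, "Kafka broker" -> 9092, "Schema Registry" -> 8081)

  // Try to open a TCP connection; any I/O failure (refused, timed out) counts as "not up"
  def isListening(host: String, port: Int, timeoutMs: Int = 500): Boolean = {
    val socket = new Socket()
    try {
      socket.connect(new InetSocketAddress(host, port), timeoutMs)
      true
    } catch {
      case _: java.io.IOException => false
    } finally {
      socket.close()
    }
  }

  def main(args: Array[String]): Unit =
    services.foreach { case (name, port) =>
      val status = if (isListening("localhost", port)) "up" else "NOT reachable"
      println(s"$name (localhost:$port): $status")
    }
}
```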

 

 

Using the producer to test out the Registry

This is this project in the GitHub demo code : https://github.com/sachabarber/KafkaAvroExamples/tree/master/KafkaSpecificAvroBrokenProducer

 

Since we worked with the Schema Registry and a Kafka Producer that made use of the Registry in the last post, I thought it might be a good idea to make some changes to the publisher from the last post, to see if we can create some compatibility tests against the Schema Registry that the Producer is using.

 

So before we see the running code, let’s examine some schemas that are run in the PublisherApp in this exact order

 

User schema

{
    "namespace": "com.barber.kafka.avro",
     "type": "record",
     "name": "User",
     "fields":[
         {
            "name": "id", "type": "int"
         },
         {
            "name": "name",  "type": "string"
         }
     ]
}

 

User without name schema

{
    "namespace": "com.barber.kafka.avro",
     "type": "record",
     "name": "User",
     "fields":[
         {
            "name": "id", "type": "int"
         }
     ]
}

 

User with boolean “Id” field schema

{
    "namespace": "com.barber.kafka.avro",
     "type": "record",
     "name": "User",
     "fields":[
         {
            "name": "id", "type": "boolean"
         }
     ]
}

 

Another completely new type with string “Id” field schema

{
    "namespace": "com.barber.kafka.avro",
     "type": "record",
     "name": "AnotherExampleWithStringId",
     "fields":[
         {
            "name": "id", "type": "string"
         }
     ]
}

 

Since the Kafka Producer is set up to use the Kafka Schema Registry and sends Avro using the KafkaAvroSerializer for the value (the key uses a plain StringSerializer), we start with the 1st schema (User schema) shown above being the one that is registered with the Schema Registry under the subject for the topic’s value, which shows up in the output further down as avro-specific-demo-topic-value. We will see more of the Registry API below; for now just understand that when using the Schema Registry, the serializer automatically registers the schema against the topic’s value subject, and the first producer to use the topic is the one that sets the schema for that subject.
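The subject the serializer registers against follows the default naming convention of topic name plus “-key” or “-value”. A tiny sketch of that convention (the helper names are my own, not part of any Confluent API):

```scala
// Helper names are hypothetical; they only mirror the default "<topic>-key" / "<topic>-value" convention.
object SubjectNames {
  def subjectFor(topic: String, isKey: Boolean): String =
    if (isKey) s"$topic-key" else s"$topic-value"

  // The registration endpoint for a subject, relative to the registry base URL
  def versionsUrl(registryUrl: String, topic: String, isKey: Boolean): String =
    s"$registryUrl/subjects/${subjectFor(topic, isKey)}/versions"

  def main(args: Array[String]): Unit =
    println(versionsUrl("http://localhost:8081", "avro-specific-demo-topic", isKey = false))
}
```

For the demo topic this gives the same URL that appears in the serializer’s debug output when it POSTs a schema.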

 

Ok so we have stated that we start out with the 1st schema (User Schema) shown above, and then we progress through each of the other schemas shown above and try and send the Avro serialized data across the Kafka topic, here is the code from the Producer App

package com.barber.kafka.avro

import java.util.{Properties, UUID}

import io.confluent.kafka.serializers.KafkaAvroSerializer
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

class KafkaDemoAvroPublisher(val topic:String) {

  private val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092")
  props.put("schema.registry.url", "http://localhost:8081")
  props.put("key.serializer", classOf[StringSerializer].getCanonicalName)
  props.put("value.serializer",classOf[KafkaAvroSerializer].getCanonicalName)
  props.put("client.id", UUID.randomUUID().toString())

  private val producer =   new KafkaProducer[String,User](props)
  private val producerUserWithoutName =   new KafkaProducer[String,UserWithoutName](props)
  private val producerUserWithBooleanIdBad =   new KafkaProducer[String,UserWithBooleanId](props)
  private val producerAnotherExampleWithStringIdBad = new KafkaProducer[String,AnotherExampleWithStringId](props)

  def send(): Unit = {
    try {
      val rand =  new scala.util.Random(44343)

      //we expect this to work, as it's the one that is going to define the Avro format of the topic
      //since it's the 1st published message on the topic (assuming you have not preregistered the topic + Avro schema
      //with the schema registry already)
      for(i <- 1 to 10) {
        val id = rand.nextInt()
        val itemToSend = User(id , "ishot.com")
        println(s"Producer sending data ${itemToSend.toString}")
        producer.send(new ProducerRecord[String, User](topic, itemToSend))
        producer.flush()
      }


      //we expect this to work, as removing the "name" field is a backward compatible change:
      //a reader using this schema simply ignores "name" in data written with the User schema
      for(i <- 1 to 10) {
        val id = rand.nextInt()
        val itemToSend = UserWithoutName(id)
        println(s"Producer sending data ${itemToSend.toString}")
        producerUserWithoutName.send(new ProducerRecord[String, UserWithoutName](topic, itemToSend))
        producerUserWithoutName.flush()
      }


      //we expect this to fail as its trying to send a different incompatible Avro object on the topic
      //which is currently using the "User" (Avro object / Schema)
      sendBadProducerValue("UserWithBooleanId", () => {
        val itemToSend = UserWithBooleanId(true)
        println(s"Producer sending data ${itemToSend.toString}")
        producerUserWithBooleanIdBad.send(new ProducerRecord[String, UserWithBooleanId](topic, itemToSend))
        producerUserWithBooleanIdBad.flush()
      })

      //we expect this to fail as its trying to send a different incompatible Avro object on the topic
      //which is currently using the "User" (Avro object / Schema)
      sendBadProducerValue("AnotherExampleWithStringId", () => {
        val itemToSend = AnotherExampleWithStringId("fdfdfdsdsfs")
        println(s"Producer sending data ${itemToSend.toString}")
        producerAnotherExampleWithStringIdBad.send(new ProducerRecord[String, AnotherExampleWithStringId](topic, itemToSend))
        producerAnotherExampleWithStringIdBad.flush()
      })

    } catch {
      case ex: Exception =>
        //printStackTrace() returns Unit, so print the message and then the trace once
        println(s"Unexpected failure: ${ex.getMessage}")
        ex.printStackTrace()
    }
  }

  def sendBadProducerValue(itemType: String, produceCallback: () => Unit) : Unit = {
    try {
      //we expect this to fail as its trying to send a different incompatible
      // Avro object on the topic which is currently using the "User" (Avro object / Schema)
      println(s"Sending $itemType")
      produceCallback()
    } catch {
      case ex: Exception => {
        println("==============================================================================\r\n")
        println(s" We were expecting this due to incompatble '$itemType' item being sent\r\n")
        println("==============================================================================\r\n")
        //printStackTrace() returns Unit, so this line prints the trace and then "()"
        println(ex.printStackTrace().toString)
        println()
        ex.printStackTrace()
      }
    }
  }
}

 

Now let’s see the relevant output

 

NOTE : This assumes a clean run with a new empty topic

 

 

We start by sending some User Schema data, which is fine as it’s the first data on the topic

 

Producer sending data {"id": -777306865, "name": "ishot.com"}
06:47:05.816 [kafka-producer-network-thread | 6c9d7365-1216-4c4e-9cfe-da3e3d5654d1] DEBUG org.apache.kafka.clients.NetworkClient - [Producer clientId=6c9d7365-1216-4c4e-9cfe-da3e3d5654d1] Using older server API v3 to send PRODUCE {acks=1,timeout=30000,partitionSizes=[avro-specific-demo-topic-0=88]} with correlation id 8 to node 0
Producer sending data {"id": 1227013473, "name": "ishot.com"}
06:47:05.818 [kafka-producer-network-thread | 6c9d7365-1216-4c4e-9cfe-da3e3d5654d1] DEBUG org.apache.kafka.clients.NetworkClient - [Producer clientId=6c9d7365-1216-4c4e-9cfe-da3e3d5654d1] Using older server API v3 to send PRODUCE {acks=1,timeout=30000,partitionSizes=[avro-specific-demo-topic-0=88]} with correlation id 9 to node 0
Producer sending data {"id": 1899269599, "name": "ishot.com"}
06:47:05.821 [kafka-producer-network-thread | 6c9d7365-1216-4c4e-9cfe-da3e3d5654d1] DEBUG org.apache.kafka.clients.NetworkClient - [Producer clientId=6c9d7365-1216-4c4e-9cfe-da3e3d5654d1] Using older server API v3 to send PRODUCE {acks=1,timeout=30000,partitionSizes=[avro-specific-demo-topic-0=88]} with correlation id 10 to node 0
Producer sending data {"id": -1764893671, "name": "ishot.com"}

 

We then try to send some User Without Name schema data, which is fine as it’s compatible with the User schema already registered in the Schema Registry

 

Producer sending data {"id": -1005363706}
06:47:05.895 [kafka-producer-network-thread | 6c9d7365-1216-4c4e-9cfe-da3e3d5654d1] DEBUG org.apache.kafka.clients.NetworkClient - [Producer clientId=6c9d7365-1216-4c4e-9cfe-da3e3d5654d1] Using older server API v3 to send PRODUCE {acks=1,timeout=30000,partitionSizes=[avro-specific-demo-topic-0=78]} with correlation id 4 to node 0
Producer sending data {"id": -528430870}
06:47:05.900 [kafka-producer-network-thread | 6c9d7365-1216-4c4e-9cfe-da3e3d5654d1] DEBUG org.apache.kafka.clients.NetworkClient - [Producer clientId=6c9d7365-1216-4c4e-9cfe-da3e3d5654d1] Using older server API v3 to send PRODUCE {acks=1,timeout=30000,partitionSizes=[avro-specific-demo-topic-0=78]} with correlation id 5 to node 0
Producer sending data {"id": -877322591}
06:47:05.905 [kafka-producer-network-thread | 6c9d7365-1216-4c4e-9cfe-da3e3d5654d1] DEBUG org.apache.kafka.clients.NetworkClient - [Producer clientId=6c9d7365-1216-4c4e-9cfe-da3e3d5654d1] Using older server API v3 to send PRODUCE {acks=1,timeout=30000,partitionSizes=[avro-specific-demo-topic-0=78]} with correlation id 6 to node 0
Producer sending data {"id": 2048308094}

 

We then try to send some User With Boolean Id Field schema data, which we expect to be NOT acceptable/incompatible with the User schema already registered in the Schema Registry. This is because we have changed the Id field into a boolean, where the already registered User schema expects this field to be an “int”. So does it fail?

 

In a word, yes. Here is the relevant output

 

Sending UserWithBooleanId
Producer sending data {"id": true}

06:47:05.954 [main] DEBUG io.confluent.kafka.schemaregistry.client.rest.RestService - Sending POST with input {"schema":"{\"type\":\"record\",\"name\":\"User\",\"namespace\":\"com.barber.kafka.avro\",\"fields\":[{\"name\":\"id\",\"type\":\"boolean\"}]}"} to http://localhost:8081/subjects/avro-specific-demo-topic-value/versions
==============================================================================

We were expecting this due to incompatble 'UserWithBooleanId' item being sent

==============================================================================

org.apache.kafka.common.errors.SerializationException: Error registering Avro schema: {"type":"record","name":"User","namespace":"com.barber.kafka.avro","fields":[{"name":"id","type":"boolean"}]}
Caused by: io.confluent.kafka.schemaregistry.client.rest.exceptions.RestClientException: Schema being registered is incompatible with an earlier schema; error code: 409
    at io.confluent.kafka.schemaregistry.client.rest.RestService.sendHttpRequest(RestService.java:170)
    at io.confluent.kafka.schemaregistry.client.rest.RestService.httpRequest(RestService.java:188)
    at io.confluent.kafka.schemaregistry.client.rest.RestService.registerSchema(RestService.java:245)
    at io.confluent.kafka.schemaregistry.client.rest.RestService.registerSchema(RestService.java:237)
    at io.confluent.kafka.schemaregistry.client.rest.RestService.registerSchema(RestService.java:232)
    at io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient.registerAndGetId(CachedSchemaRegistryClient.java:59)
    at io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient.register(CachedSchemaRegistryClient.java:91)
    at io.confluent.kafka.serializers.AbstractKafkaAvroSerializer.serializeImpl(AbstractKafkaAvroSerializer.java:72)
    at io.confluent.kafka.serializers.KafkaAvroSerializer.serialize(KafkaAvroSerializer.java:54)
    at org.apache.kafka.common.serialization.ExtendedSerializer$Wrapper.serialize(ExtendedSerializer.java:65)
    at org.apache.kafka.common.serialization.ExtendedSerializer$Wrapper.serialize(ExtendedSerializer.java:55)
    at org.apache.kafka.clients.producer.KafkaProducer.doSend(KafkaProducer.java:807)
    at org.apache.kafka.clients.producer.KafkaProducer.send(KafkaProducer.java:784)
    at org.apache.kafka.clients.producer.KafkaProducer.send(KafkaProducer.java:671)
    at com.barber.kafka.avro.KafkaDemoAvroPublisher.$anonfun$send$3(KafkaDemoAvroPublisher.scala:54)
    at com.barber.kafka.avro.KafkaDemoAvroPublisher$$Lambda$17/746074699.apply$mcV$sp(Unknown Source)
    at com.barber.kafka.avro.KafkaDemoAvroPublisher.sendBadProducerValue(KafkaDemoAvroPublisher.scala:79)
    at com.barber.kafka.avro.KafkaDemoAvroPublisher.send(KafkaDemoAvroPublisher.scala:51)
    at com.barber.kafka.avro.PublisherApp$.delayedEndpoint$com$barber$kafka$avro$PublisherApp$1(PublisherApp.scala:6)
    at com.barber.kafka.avro.PublisherApp$delayedInit$body.apply(PublisherApp.scala:3)
    at scala.Function0.apply$mcV$sp(Function0.scala:34)
    at scala.Function0.apply$mcV$sp$(Function0.scala:34)
    at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
    at scala.App.$anonfun$main$1$adapted(App.scala:76)
    at scala.App$$Lambda$5/511833308.apply(Unknown Source)
    at scala.collection.immutable.List.foreach(List.scala:378)
    at scala.App.main(App.scala:76)
    at scala.App.main$(App.scala:74)
    at com.barber.kafka.avro.PublisherApp$.main(PublisherApp.scala:3)
    at com.barber.kafka.avro.PublisherApp.main(PublisherApp.scala)

 

 

We then try to send some completely different object (AnotherExampleWithStringId) data, which we expect to be NOT acceptable/incompatible with the User schema already registered in the Schema Registry. This is because the Id field is now a string, where the already registered schema expects this field to be an “int”, and because this is a different record type altogether (its record name is no longer User).

So does it fail?

 

In a word, yes. Here is the relevant output

 

Sending AnotherExampleWithStringId
Producer sending data {"id": "fdfdfdsdsfs"}

06:47:06.059 [main] DEBUG io.confluent.kafka.schemaregistry.client.rest.RestService - Sending POST with input {"schema":"{\"type\":\"record\",\"name\":\"AnotherExampleWithStringId\",\"namespace\":\"com.barber.kafka.avro\",\"fields\":[{\"name\":\"id\",\"type\":\"string\"}]}"} to http://localhost:8081/subjects/avro-specific-demo-topic-value/versions
==============================================================================

We were expecting this due to incompatble 'AnotherExampleWithStringId' item being sent

==============================================================================

()

org.apache.kafka.common.errors.SerializationException: Error registering Avro schema: {"type":"record","name":"AnotherExampleWithStringId","namespace":"com.barber.kafka.avro","fields":[{"name":"id","type":"string"}]}
Caused by: io.confluent.kafka.schemaregistry.client.rest.exceptions.RestClientException: Schema being registered is incompatible with an earlier schema; error code: 409
    at io.confluent.kafka.schemaregistry.client.rest.RestService.sendHttpRequest(RestService.java:170)
    at io.confluent.kafka.schemaregistry.client.rest.RestService.httpRequest(RestService.java:188)
    at io.confluent.kafka.schemaregistry.client.rest.RestService.registerSchema(RestService.java:245)
    at io.confluent.kafka.schemaregistry.client.rest.RestService.registerSchema(RestService.java:237)
    at io.confluent.kafka.schemaregistry.client.rest.RestService.registerSchema(RestService.java:232)
    at io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient.registerAndGetId(CachedSchemaRegistryClient.java:59)
    at io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient.register(CachedSchemaRegistryClient.java:91)
    at io.confluent.kafka.serializers.AbstractKafkaAvroSerializer.serializeImpl(AbstractKafkaAvroSerializer.java:72)
    at io.confluent.kafka.serializers.KafkaAvroSerializer.serialize(KafkaAvroSerializer.java:54)
    at org.apache.kafka.common.serialization.ExtendedSerializer$Wrapper.serialize(ExtendedSerializer.java:65)
    at org.apache.kafka.common.serialization.ExtendedSerializer$Wrapper.serialize(ExtendedSerializer.java:55)
    at org.apache.kafka.clients.producer.KafkaProducer.doSend(KafkaProducer.java:807)
    at org.apache.kafka.clients.producer.KafkaProducer.send(KafkaProducer.java:784)
    at org.apache.kafka.clients.producer.KafkaProducer.send(KafkaProducer.java:671)
    at com.barber.kafka.avro.KafkaDemoAvroPublisher.$anonfun$send$4(KafkaDemoAvroPublisher.scala:63)
    at com.barber.kafka.avro.KafkaDemoAvroPublisher$$Lambda$18/871790326.apply$mcV$sp(Unknown Source)
    at com.barber.kafka.avro.KafkaDemoAvroPublisher.sendBadProducerValue(KafkaDemoAvroPublisher.scala:79)
    at com.barber.kafka.avro.KafkaDemoAvroPublisher.send(KafkaDemoAvroPublisher.scala:60)
    at com.barber.kafka.avro.PublisherApp$.delayedEndpoint$com$barber$kafka$avro$PublisherApp$1(PublisherApp.scala:6)
    at com.barber.kafka.avro.PublisherApp$delayedInit$body.apply(PublisherApp.scala:3)
    at scala.Function0.apply$mcV$sp(Function0.scala:34)
    at scala.Function0.apply$mcV$sp$(Function0.scala:34)
    at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
    at scala.App.$anonfun$main$1$adapted(App.scala:76)
    at scala.App$$Lambda$5/511833308.apply(Unknown Source)
    at scala.collection.immutable.List.foreach(List.scala:378)
    at scala.App.main(App.scala:76)
    at scala.App.main$(App.scala:74)
    at com.barber.kafka.avro.PublisherApp$.main(PublisherApp.scala:3)
    at com.barber.kafka.avro.PublisherApp.main(PublisherApp.scala)
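To tie the four attempts together, here is a rough in-memory sketch of the gate the Registry applied during this run: each candidate is checked against the latest accepted version (the default backward rule) and either appended or rejected. This is my own toy model, not Registry code. It compares types for exact equality only, so it ignores Avro's type promotions and defaults, and the case classes are stand-ins for the four schemas above.

```scala
// Toy model of "register under a subject with BACKWARD compatibility"; not Registry code.
object RegistrationRun {
  final case class Field(name: String, avroType: String)
  final case class Record(name: String, fields: List[Field])

  // Reasons the candidate (as reader) cannot read data written with the latest version
  def problems(candidate: Record, latest: Record): List[String] = {
    val nameProblem =
      if (candidate.name == latest.name) Nil
      else List(s"record name ${candidate.name} differs from ${latest.name}")
    val fieldProblems = candidate.fields.flatMap { cf =>
      latest.fields.find(_.name == cf.name) match {
        case Some(lf) if lf.avroType == cf.avroType => Nil
        case Some(lf) => List(s"field ${cf.name}: ${lf.avroType} data cannot be read as ${cf.avroType}")
        case None     => List(s"field ${cf.name} is new and has no default")
      }
    }
    nameProblem ++ fieldProblems
  }

  // Accept when the subject is empty or the candidate passes against the latest version
  def register(versions: List[Record], candidate: Record): Either[List[String], List[Record]] =
    versions.lastOption.map(latest => problems(candidate, latest)).getOrElse(Nil) match {
      case Nil     => Right(versions :+ candidate)
      case reasons => Left(reasons)
    }

  val user = Record("User", List(Field("id", "int"), Field("name", "string")))
  val userWithoutName = Record("User", List(Field("id", "int")))
  val userWithBooleanId = Record("User", List(Field("id", "boolean")))
  val anotherExample = Record("AnotherExampleWithStringId", List(Field("id", "string")))

  def main(args: Array[String]): Unit = {
    // Same order as the producer: rejected schemas leave the version list untouched
    val attempts = List(user, userWithoutName, userWithBooleanId, anotherExample)
    val accepted = attempts.foldLeft(List.empty[Record]) { (versions, candidate) =>
      register(versions, candidate) match {
        case Right(updated) =>
          println(s"accepted: $candidate")
          updated
        case Left(reasons) =>
          println(s"rejected: $candidate (${reasons.mkString("; ")})")
          versions
      }
    }
    println(s"versions kept: ${accepted.size}")
  }
}
```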

Ok so that gives us a glimpse of what it’s like to work with the Kafka Producer and the Schema Registry. But surely there is more we can do with the Schema Registry REST API that is mentioned above?

 

Well yeah there is, we will now look at a second example in the codebase which will try a few more Schema Registry REST API calls.

 

 

 

 

Understanding Registry API

 

This is this project in the GitHub demo code : https://github.com/sachabarber/KafkaAvroExamples/tree/master/KafkaSchemaRegistryTests

 

I have chosen to use Akka Http for the REST API calls.

 

These are the 3 resources used by the code

 

compatibilityBACKWARD.json

{"compatibility": "BACKWARD"}

 

compatibilityNONE.json

{"compatibility": "NONE"}

 

simpleStringSchema.avsc

{"schema": "{\"type\": \"string\"}"}
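Note that the registration payload is JSON wrapped around JSON: the Avro schema is sent as an escaped string inside a “schema” property. If you would rather build that payload from a raw schema than keep a pre-escaped file, a small helper along these lines would do it. This is a sketch with hypothetical helper names; in real code a JSON library would be the safer choice.

```scala
// Hypothetical helper: wraps a raw Avro schema string in the {"schema": "..."} envelope.
object RegistryPayload {
  // Escape the characters a JSON string literal cannot contain raw
  def escapeJson(s: String): String = s.map {
    case '"'  => "\\\""
    case '\\' => "\\\\"
    case '\n' => "\\n"
    case '\r' => "\\r"
    case '\t' => "\\t"
    case c    => c.toString
  }.mkString

  def wrapSchema(avroSchemaJson: String): String =
    s"""{"schema": "${escapeJson(avroSchemaJson)}"}"""

  def main(args: Array[String]): Unit =
    println(wrapSchema("""{"type": "string"}"""))
}
```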

 

 

Here is the entire codebase for various operations against the Schema Registry

import scala.io.Source
import scala.concurrent._
import scala.concurrent.duration._
import scala.util.{Failure, Success}
import akka.actor.ActorSystem
import akka.http.scaladsl.Http
import akka.http.scaladsl.model._
import akka.http.scaladsl.unmarshalling.Unmarshal
import akka.stream.ActorMaterializer
import AkkaHttpImplicits.{executionContext, materializer, system}

object RegistryApp extends App {

  var payload = ""
  var result = ""
  val schemaRegistryMediaType = MediaType.custom("application/vnd.schemaregistry.v1+json",false)
  implicit  val c1 = ContentType(schemaRegistryMediaType, () => HttpCharsets.`UTF-8`)

  //These queries are the same ones found here, but instead of using curl I am using Akka Http
  //https://docs.confluent.io/current/schema-registry/docs/intro.html


  //  # Register a new version of a schema under the subject "Kafka-key"
  //  $ curl -X POST -H "Content-Type: application/vnd.schemaregistry.v1+json" \
  //  --data '{"schema": "{\"type\": \"string\"}"}' \
  //  http://localhost:8081/subjects/Kafka-key/versions
  //  {"id":1}
  payload = Source.fromURL(getClass.getResource("/simpleStringSchema.avsc")).mkString
  result = post(payload,"http://localhost:8081/subjects/Kafka-key/versions")
  println("Register a new version of a schema under the subject \"Kafka-key\"")
  println("EXPECTING {\"id\":1}")
  println(s"GOT ${result}\r\n")


  //  # Register a new version of a schema under the subject "Kafka-value"
  //  $ curl -X POST -H "Content-Type: application/vnd.schemaregistry.v1+json" \
  //  --data '{"schema": "{\"type\": \"string\"}"}' \
  //  http://localhost:8081/subjects/Kafka-value/versions
  //  {"id":1}
  payload = Source.fromURL(getClass.getResource("/simpleStringSchema.avsc")).mkString
  result = post(payload,"http://localhost:8081/subjects/Kafka-value/versions")
  println("Register a new version of a schema under the subject \"Kafka-value\"")
  println("EXPECTING {\"id\":1}")
  println(s"GOT ${result}\r\n")


  //  # List all subjects
  //  $ curl -X GET http://localhost:8081/subjects
  //    ["Kafka-value","Kafka-key"]
  result = get("http://localhost:8081/subjects")
  println("List all subjects")
  println("EXPECTING [\"Kafka-value\",\"Kafka-key\"]")
  println(s"GOT ${result}\r\n")


  //  # Fetch a schema by globally unique id 1
  //  $ curl -X GET http://localhost:8081/schemas/ids/1
  //  {"schema":"\"string\""}
  result = get("http://localhost:8081/schemas/ids/1")
  println("Fetch a schema by globally unique id 1")
  println("EXPECTING {\"schema\":\"\\\"string\\\"\"}")
  println(s"GOT ${result}\r\n")


  //  # List all schema versions registered under the subject "Kafka-value"
  //  $ curl -X GET http://localhost:8081/subjects/Kafka-value/versions
  //    [1]
  result = get("http://localhost:8081/subjects/Kafka-value/versions")
  println("List all schema versions registered under the subject \"Kafka-value\"")
  println("EXPECTING [1]")
  println(s"GOT ${result}\r\n")


  //  # Fetch version 1 of the schema registered under subject "Kafka-value"
  //  $ curl -X GET http://localhost:8081/subjects/Kafka-value/versions/1
  //  {"subject":"Kafka-value","version":1,"id":1,"schema":"\"string\""}
  result = get("http://localhost:8081/subjects/Kafka-value/versions/1")
  println("Fetch version 1 of the schema registered under subject \"Kafka-value\"")
  println("EXPECTING {\"subject\":\"Kafka-value\",\"version\":1,\"id\":1,\"schema\":\"\\\"string\\\"\"}")
  println(s"GOT ${result}\r\n")


  //  # Deletes version 1 of the schema registered under subject "Kafka-value"
  //  $ curl -X DELETE http://localhost:8081/subjects/Kafka-value/versions/1
  //  1
  result = delete("http://localhost:8081/subjects/Kafka-value/versions/1")
  println("Deletes version 1 of the schema registered under subject \"Kafka-value\"")
  println("EXPECTING 1")
  println(s"GOT ${result}\r\n")


  //  # Register the same schema under the subject "Kafka-value"
  //  $ curl -X POST -H "Content-Type: application/vnd.schemaregistry.v1+json" \
  //  --data '{"schema": "{\"type\": \"string\"}"}' \
  //  http://localhost:8081/subjects/Kafka-value/versions
  //  {"id":1}
  payload = Source.fromURL(getClass.getResource("/simpleStringSchema.avsc")).mkString
  result = post(payload,"http://localhost:8081/subjects/Kafka-value/versions")
  println("Register the same schema under the subject \"Kafka-value\"")
  println("EXPECTING {\"id\":1}")
  println(s"GOT ${result}\r\n")


  //  # Deletes the most recently registered schema under subject "Kafka-value"
  //  $ curl -X DELETE http://localhost:8081/subjects/Kafka-value/versions/latest
  //  2
  result = delete("http://localhost:8081/subjects/Kafka-value/versions/latest")
  println("Deletes the most recently registered schema under subject \"Kafka-value\"")
  println("EXPECTING 2")
  println(s"GOT ${result}\r\n")


  //  # Register the same schema under the subject "Kafka-value"
  //  $ curl -X POST -H "Content-Type: application/vnd.schemaregistry.v1+json" \
  //  --data '{"schema": "{\"type\": \"string\"}"}' \
  //  http://localhost:8081/subjects/Kafka-value/versions
  //  {"id":1}
  payload = Source.fromURL(getClass.getResource("/simpleStringSchema.avsc")).mkString
  result = post(payload,"http://localhost:8081/subjects/Kafka-value/versions")
  println("Register the same schema under the subject \"Kafka-value\"")
  println("EXPECTING {\"id\":1}")
  println(s"GOT ${result}\r\n")


  //  # Fetch the schema again by globally unique id 1
  //  $ curl -X GET http://localhost:8081/schemas/ids/1
  //  {"schema":"\"string\""}
  result = get("http://localhost:8081/schemas/ids/1")
  println("Fetch the schema again by globally unique id 1")
  println("EXPECTING {\"schema\":\"\\\"string\\\"\"}")
  println(s"GOT ${result}\r\n")


  //  # Check whether a schema has been registered under subject "Kafka-key"
  //  $ curl -X POST -H "Content-Type: application/vnd.schemaregistry.v1+json" \
  //  --data '{"schema": "{\"type\": \"string\"}"}' \
  //  http://localhost:8081/subjects/Kafka-key
  //  {"subject":"Kafka-key","version":1,"id":1,"schema":"\"string\""}
  payload = Source.fromURL(getClass.getResource("/simpleStringSchema.avsc")).mkString
  result = post(payload,"http://localhost:8081/subjects/Kafka-key")
  println("Check whether a schema has been registered under subject \"Kafka-key\"")
  println("EXPECTING {\"subject\":\"Kafka-key\",\"version\":1,\"id\":1,\"schema\":\"\\\"string\\\"\"}")
  println(s"GOT ${result}\r\n")


  //  # Test compatibility of a schema with the latest schema under subject "Kafka-value"
  //  $ curl -X POST -H "Content-Type: application/vnd.schemaregistry.v1+json" \
  //  --data '{"schema": "{\"type\": \"string\"}"}' \
  //  http://localhost:8081/compatibility/subjects/Kafka-value/versions/latest
  //  {"is_compatible":true}
  payload = Source.fromURL(getClass.getResource("/simpleStringSchema.avsc")).mkString
  result = post(payload,"http://localhost:8081/compatibility/subjects/Kafka-value/versions/latest")
  println("Test compatibility of a schema with the latest schema under subject \"Kafka-value\"")
  println("EXPECTING {\"is_compatible\":true}")
  println(s"GOT ${result}\r\n")


  //  # Get top level config
  //    $ curl -X GET http://localhost:8081/config
  //    {"compatibilityLevel":"BACKWARD"}
  result = get("http://localhost:8081/config")
  println("Get top level config")
  println("EXPECTING {\"compatibilityLevel\":\"BACKWARD\"}")
  println(s"GOT ${result}\r\n")


  //  # Update compatibility requirements globally
  //    $ curl -X PUT -H "Content-Type: application/vnd.schemaregistry.v1+json" \
  //  --data '{"compatibility": "NONE"}' \
  //  http://localhost:8081/config
  //    {"compatibility":"NONE"}
  payload = Source.fromURL(getClass.getResource("/compatibilityNONE.json")).mkString
  result = put(payload,"http://localhost:8081/config")
  println("Update compatibility requirements globally")
  println("EXPECTING {\"compatibility\":\"NONE\"}")
  println(s"GOT ${result}\r\n")


  //  # Update compatibility requirements under the subject "Kafka-value"
  //  $ curl -X PUT -H "Content-Type: application/vnd.schemaregistry.v1+json" \
  //  --data '{"compatibility": "BACKWARD"}' \
  //  http://localhost:8081/config/Kafka-value
  //  {"compatibility":"BACKWARD"}
  payload = Source.fromURL(getClass.getResource("/compatibilityBACKWARD.json")).mkString
  result = put(payload,"http://localhost:8081/config/Kafka-value")
  println("Update compatibility requirements under the subject \"Kafka-value\"")
  println("EXPECTING {\"compatibility\":\"BACKWARD\"}")
  println(s"GOT ${result}\r\n")


  //  # Deletes all schema versions registered under the subject "Kafka-value"
  //  $ curl -X DELETE http://localhost:8081/subjects/Kafka-value
  //    [3]
  result = delete("http://localhost:8081/subjects/Kafka-value")
  println("Deletes all schema versions registered under the subject \"Kafka-value\"")
  println("EXPECTING [3]")
  println(s"GOT ${result}\r\n")

  //  # List all subjects
  //  $ curl -X GET http://localhost:8081/subjects
  //    ["Kafka-key"]
  result = get("http://localhost:8081/subjects")
  println("List all subjects")
  println("EXPECTING [\"Kafka-key\"]")
  println(s"GOT ${result}\r\n")


  private[RegistryApp] def post(data: String, url:String)(implicit contentType:ContentType): String = {
    sendData(data, url, HttpMethods.POST, contentType)
  }

  private[RegistryApp] def put(data: String, url:String)(implicit contentType:ContentType): String = {
    sendData(data, url, HttpMethods.PUT, contentType)
  }

  private[RegistryApp] def sendData(data: String, url:String, method:HttpMethod, contentType:ContentType): String = {
    val responseFuture: Future[HttpResponse] =
      Http(system).singleRequest(
        HttpRequest(
          method,
          url,
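          // Encode explicitly as UTF-8 rather than relying on the platform default charset
          entity = HttpEntity(contentType, data.getBytes(java.nio.charset.StandardCharsets.UTF_8))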
        )
      )
    val html = Await.result(responseFuture.flatMap(x => Unmarshal(x.entity).to[String]), 5 seconds)
    html
  }

  private[RegistryApp] def get(url:String)(implicit contentType:ContentType): String = {
    noBodiedRequest(url, HttpMethods.GET, contentType)
  }

  private[RegistryApp] def delete(url:String)(implicit contentType:ContentType): String = {
    noBodiedRequest(url, HttpMethods.DELETE, contentType)
  }

  private[RegistryApp] def noBodiedRequest(url:String,method:HttpMethod, contentType:ContentType): String = {
    val responseFuture: Future[HttpResponse] = Http(system).singleRequest(HttpRequest(method,url))
    val html = Await.result(responseFuture.flatMap(x => Unmarshal(x.entity).to[String]), 5 seconds)
    html
  }
}

 

 

And this is the output, which I hope is pretty self-explanatory now that we have spent some time with the previous explanations/tests. You can read more about the Kafka Schema Registry REST API here : https://docs.confluent.io/current/schema-registry/docs/api.html

 

Register a new version of a schema under the subject "Kafka-key"
EXPECTING {"id":1}
GOT {"id":1}

Register a new version of a schema under the subject "Kafka-value"
EXPECTING {"id":1}
GOT {"id":1}

List all subjects
EXPECTING ["Kafka-value","Kafka-key"]
GOT ["Kafka-value","Kafka-key"]

Fetch a schema by globally unique id 1
EXPECTING {"schema":"\"string\""}
GOT {"schema":"\"string\""}

List all schema versions registered under the subject "Kafka-value"
EXPECTING [1]
GOT [1]

Fetch version 1 of the schema registered under subject "Kafka-value"
EXPECTING {"subject":"Kafka-value","version":1,"id":1,"schema":"\"string\""}
GOT {"subject":"Kafka-value","version":1,"id":1,"schema":"\"string\""}

Deletes version 1 of the schema registered under subject "Kafka-value"
EXPECTING 1
GOT 1

Register the same schema under the subject "Kafka-value"
EXPECTING {"id":1}
GOT {"id":1}

Deletes the most recently registered schema under subject "Kafka-value"
EXPECTING 2
GOT 2

Register the same schema under the subject "Kafka-value"
EXPECTING {"id":1}
GOT {"id":1}

Fetch the schema again by globally unique id 1
EXPECTING {"schema":"\"string\""}
GOT {"schema":"\"string\""}

Check whether a schema has been registered under subject "Kafka-key"
EXPECTING {"subject":"Kafka-key","version":1,"id":1,"schema":"\"string\""}
GOT {"subject":"Kafka-key","version":1,"id":1,"schema":"\"string\""}

Test compatibility of a schema with the latest schema under subject "Kafka-value"
EXPECTING {"is_compatible":true}
GOT {"is_compatible":true}

Get top level config
EXPECTING {"compatibilityLevel":"BACKWARD"}
GOT {"compatibilityLevel":"BACKWARD"}

Update compatibility requirements globally
EXPECTING {"compatibility":"NONE"}
GOT {"compatibility":"NONE"}

Update compatibility requirements under the subject "Kafka-value"
EXPECTING {"compatibility":"BACKWARD"}
GOT {"compatibility":"BACKWARD"}

Deletes all schema versions registered under the subject "Kafka-value"
EXPECTING [3]
GOT [3]

List all subjects
EXPECTING ["Kafka-key"]
GOT ["Kafka-key"]
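
One thing in the output that can look odd at first is that the schema id stays at 1 while the version under "Kafka-value" goes 1, then 2, then [3]. The following is a toy in-memory model of that behaviour, not the Schema Registry's implementation. It assumes that an id is shared by every registration of the same schema text, and that a version number is not handed out again within a subject after it has been deleted, which is what the output above suggests. The names ToyRegistry, register, deleteVersion, deleteLatest and deleteSubject are all made up for this sketch.

```scala
// Toy model only: class and method names are hypothetical, chosen to mirror the output above.
final class ToyRegistry {
  private var idsBySchema = Map.empty[String, Int]     // schema text -> global id
  private var live = Map.empty[String, Map[Int, Int]]  // subject -> (live version -> id)
  private var lastVersion = Map.empty[String, Int]     // subject -> highest version ever issued

  /** Registers a schema under a subject and returns its global id. */
  def register(subject: String, schema: String): Int = {
    val id = idsBySchema.getOrElse(schema, idsBySchema.size + 1)
    idsBySchema += (schema -> id)
    val current = live.getOrElse(subject, Map.empty[Int, Int])
    // Only issue a new version if this schema is not already live under the subject
    if (!current.values.exists(_ == id)) {
      val next = lastVersion.getOrElse(subject, 0) + 1
      lastVersion += (subject -> next)
      live += (subject -> (current + (next -> id)))
    }
    id
  }

  /** Live versions of a subject, in ascending order. */
  def versions(subject: String): List[Int] =
    live.getOrElse(subject, Map.empty[Int, Int]).keys.toList.sorted

  /** Deletes one version and returns its number. */
  def deleteVersion(subject: String, version: Int): Int = {
    live += (subject -> (live.getOrElse(subject, Map.empty[Int, Int]) - version))
    version
  }

  /** Deletes the highest live version (assumes the subject has at least one). */
  def deleteLatest(subject: String): Int = deleteVersion(subject, versions(subject).max)

  /** Deletes the whole subject and returns the versions that were still live. */
  def deleteSubject(subject: String): List[Int] = {
    val deleted = versions(subject)
    live -= subject
    deleted
  }
}

object ToyRegistryDemo {
  def main(args: Array[String]): Unit = {
    val registry = new ToyRegistry
    val schema = "\"string\""
    println(registry.register("Kafka-value", schema))   // 1
    println(registry.deleteVersion("Kafka-value", 1))   // 1
    println(registry.register("Kafka-value", schema))   // 1 (same id, but now version 2)
    println(registry.deleteLatest("Kafka-value"))       // 2
    println(registry.register("Kafka-value", schema))   // 1 (same id, but now version 3)
    println(registry.deleteSubject("Kafka-value"))      // List(3)
  }
}
```

Run in the same order as the REST calls above, the toy returns the same sequence of ids and version numbers, which is all it is meant to show.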

 

 

 

 

Conclusion

As with most things in the Confluent platform, the Schema Registry is a great bit of kit and easy to use. I urge you all to give it a spin.


MADCAP IDEA 9 : KAFKA STREAMS INTERACTIVE QUERIES

Last Time

 

So last time we came up with a sort of halfway house type post that would pave the way for this one, where we examined several different types of REST frameworks for use with Scala. The requirements were that we would be able to use JSON and that there would be both a client and server side API for the chosen library. Both http4s and Akka Http worked. I decided to go with Akka Http due to being more familiar with it.

So that examination of REST APIs has allowed this post to happen. In this post what we will be doing is looking at

  • How to install Kafka/Zookeeper and get them running on Windows
  • Walking through a KafkaProducer
  • Walking through a Kafka Streams processing node, and the duality of streams
  • Walking through Kafka Streams interactive queries

 

PreAmble

Just as a reminder this is part of my ongoing set of posts which I talk about here :

https://sachabarbs.wordpress.com/2017/05/01/madcap-idea/, where we will be building up to a point where we have a full app using lots of different stuff, such as these

 

  • WebPack
  • React.js
  • React Router
  • TypeScript
  • Babel.js
  • Akka
  • Scala
  • Play (Scala Http Stack)
  • MySql
  • SBT
  • Kafka
  • Kafka Streams

 

Ok so now that we have the introductions out of the way, let’s crack on with what we want to cover in this post.

 

Where is the code?

As usual the code is on GitHub here : https://github.com/sachabarber/MadCapIdea

 

Before we start, what is this post all about?

 

Well I don’t know if you recall, but we are attempting to create an uber simple Uber-type application where there are drivers/clients. A client can put out a new job, and drivers bid for it. At the end of a job they can rate each other; see the job completion/rating/view rating sections of this previous post : https://sachabarbs.wordpress.com/2017/06/27/madcap-idea-part-6-static-screen-design/

 

The ratings will be placed into streams and aggregated into permanent storage, and will be available for querying later to display in the react front end. The rating should be grouped by email, and also searchable via the use of an email.

 

So that is what we are attempting to cover in this post. However, since this is the first time we have had to use Kafka, this post will also talk through what you have to go through to get Kafka set up on Windows. In a subsequent post I will attempt to get EVERYTHING up and working (including the Play front end) in Docker containers, but for now we will assume a local install of Kafka. If nothing else, it’s good to know how to set this up.

 

How to install Kafka/Zookeeper and get them running on Windows

This section will talk you through how to get Kafka and get it working on Windows.

 

Step 1 : Download Kafka

Grab Confluent Platform 3.3.0 Open Source : http://packages.confluent.io/archive/3.3/confluent-oss-3.3.0-2.11.zip

 

Step 2 : Update Dodgy BAT Files

The official Kafka Windows BAT files don’t seem to work in the Confluent Platform 3.3.0 Open Source download. So replace the official [YOUR INSTALL PATH]\confluent-3.3.0\bin\windows BAT files with the ones found here : https://github.com/renukaradhya/confluentplatform/tree/master/bin/windows

 

Step 3 : Make Some Minor Changes To Log Locations etc etc

Kafka/Zookeeper as installed are set up for Linux, so the default paths won’t work on Windows. We need to adjust that a bit, so let’s do that now.

 

  • Modify the [YOUR INSTALL PATH]\confluent-3.3.0\etc\kafka\zookeeper.properties file to change the dataDir to something like dataDir=c:/temp/zookeeper
  • Modify the [YOUR INSTALL PATH]\confluent-3.3.0\etc\kafka\server.properties file to uncomment the line delete.topic.enable=true

 

Step 4 : Running Zookeeper + Kafka + Creating Topics

Now that we have installed everything, it’s just a matter of running stuff up. Sadly before we can run Kafka we need to run Zookeeper, and before Kafka can send messages we need to ensure that the Kafka topics are created. Topics must exist before messages can be sent to them.

 

Mmm that sounds like a fair bit of work. Well it is, so I decided to script this into a little PowerShell script. This script is available within the source code at https://github.com/sachabarber/MadCapIdea/tree/master/PowerShellProject/PowerShellProject where the script is called RunPipeline.ps1

 

So all we need to do is change to the directory that the RunPipeline.ps1 is in, and run it.

Obviously it is set up for my installation folders, so you WILL have to change the variables at the top of the script if you want to run this on your own machine.

 

Here are the contents of the RunPipeline.ps1 file

 

$global:mongoDbInstallationFolder = "C:\Program Files\MongoDB\Server\3.5\bin\"
$global:kafkaWindowsBatFolder = "C:\Apache\confluent-3.3.0\bin\windows\"
$global:kafkaTopics = 
	"rating-submit-topic",
	"rating-output-topic"
$global:ProcessesToKill = @()



function RunPipeLine() 
{
	WriteHeader "STOPPING PREVIOUS SERVICES"
	StopZookeeper
	StopKafka

	Start-Sleep -s 20
	
	WriteHeader "STARTING NEW SERVICE INSTANCES"
	StartZookeeper
	StartKafka
	
	Start-Sleep -s 20

	CreateKafkaTopics
    RunMongo

	WaitForKeyPress

	WriteHeader "KILLING PROCESSES CREATED BY SCRIPT"
	KillProcesses
}

function WriteHeader($text) 
{
	Write-Host "========================================`r`n"
	Write-Host "$text`r`n"
	Write-Host "========================================`r`n"
}


function StopZookeeper() {
	$zookeeperCommandLine = $global:kafkaWindowsBatFolder + "zookeeper-server-stop.bat"
	Write-Host "> Zookeeper Command Line : $zookeeperCommandLine`r`n"
    $global:ProcessesToKill += start-process $zookeeperCommandLine -WindowStyle Normal -PassThru
}

function StopKafka() {
	$kafkaServerCommandLine = $global:kafkaWindowsBatFolder + "kafka-server-stop.bat" 
	Write-Host "> Kafka Server Command Line : $kafkaServerCommandLine`r`n"
    $global:ProcessesToKill += start-process $kafkaServerCommandLine  -WindowStyle Normal -PassThru
}

function StartZookeeper() {
	$zookeeperCommandLine = $global:kafkaWindowsBatFolder + "zookeeper-server-start.bat"
	$arguments = $global:kafkaWindowsBatFolder + "..\..\etc\kafka\zookeeper.properties"
	Write-Host "> Zookeeper Command Line : $zookeeperCommandLine args: $arguments `r`n"
    $global:ProcessesToKill += start-process $zookeeperCommandLine $arguments -WindowStyle Normal -PassThru
}

function StartKafka() {
	$kafkaServerCommandLine = $global:kafkaWindowsBatFolder + "kafka-server-start.bat" 
	$arguments = $global:kafkaWindowsBatFolder + "..\..\etc\kafka\server.properties"
	Write-Host "> Kafka Server Command Line : $kafkaServerCommandLine args: $arguments `r`n"
    $global:ProcessesToKill += start-process $kafkaServerCommandLine $arguments -WindowStyle Normal -PassThru
}

function CreateKafkaTopics() 
{
	Foreach ($topic in $global:kafkaTopics )
	{
		$kafkaCommandLine = $global:kafkaWindowsBatFolder + "kafka-topics.bat"
		$arguments = "--zookeeper localhost:2181 --create  --replication-factor 1 --partitions 1 --topic $topic"
		Write-Host "> Create Kafka Topic Command Line : $kafkaCommandLine args: $arguments `r`n"
		$global:ProcessesToKill += start-process $kafkaCommandLine $arguments -WindowStyle Normal -PassThru
	}
}

function RunMongo() {
	$mongoexe = $global:mongoDbInstallationFolder + "mongod.exe"
	Write-Host "> Mongo Command Line : $mongoexe `r`n" 
	$global:ProcessesToKill += Start-Process -FilePath $mongoexe  -WindowStyle Normal -PassThru
}

function WaitForKeyPress
{
	Write-Host -NoNewLine "Press any key to continue....`r`n"
	[Console]::ReadKey()
}


function KillProcesses() 
{
	Foreach ($processToKill in $global:ProcessesToKill )
	{
		$name = $processToKill | Get-ChildItem -Name
		Write-Host "Killing Process : $name `r`n" 
		$processToKill | Stop-Process -Force
	}
}


# Kick off the entire pipeline
RunPipeLine

 

 

As you can see this script does a bit more than just run up Zookeeper and Kafka. It also creates the topics and runs MongoDB, which is also required by the main Play application (remember we are using Reactive Mongo for the login/registration side of things).

 

So far I have not had many issues with the script. Occasionally, though, when I am trying out new code, I clear out all the Zookeeper/Kafka state accumulated so far, which for me is stored here

 

image

 

It just allows me to start with a clean slate as it were. You should not need to do this that often.

 

Walking through a KafkaProducer

So the Kafka Producer I present here will send a String key and a JSON Ranking object as a payload. Let’s have a quick look at the Ranking object and how it gets turned to and from JSON before we look at the producer code.

 

The Domain Entities

Here is what the domain entities look like. They use the Spray formatters (part of Akka Http) for Marshaller/UnMarshaller JSON support. This is not required by the producer but is required by the REST API, which we will look at later. The producer and Kafka Streams code work with a different serialization abstraction known as a Serde, a term that as far as I know is only used in Kafka. It’s quite simple: it stands for Serializer-Deserializer (Serdes).

 

Anyway here are the domain entities and Spray formatters that allow Akka (but not Kafka Streams, more on this later) to work

 

package Entities

import akka.http.scaladsl.marshallers.sprayjson.SprayJsonSupport._
import spray.json.DefaultJsonProtocol._

case class Ranking(fromEmail: String, toEmail: String, score: Float)
case class HostStoreInfo(host: String, port: Int, storeNames: List[String])

object AkkaHttpEntitiesJsonFormats {
  implicit val RankingFormat = jsonFormat3(Ranking)
  implicit val HostStoreInfoFormat = jsonFormat3(HostStoreInfo)
}

 

Serdes

So as we just described, Kafka Streams cares not for the standard Akka Http/Spray JSON formatters; they are not part of Kafka Streams after all. However Kafka Streams still has some of the same concerns, in that it needs to serialize and de-serialize data (like when it re-partitions, i.e. shuffles, data). So how does it realize that? Well, it uses a weirdly named thing called a “serde”. There are MANY inbuilt “serde” types, but you can of course create your own to represent your “on the wire format”. I am using JSON, so this is what my generic “Serde” implementation looks like. I should point out that I owe a great many things to a chap called Jendrik, whom I made contact with and who has been doing some great stuff with Kafka. His blog has really helped me out : https://www.madewithtea.com/category/kafka-streams.html
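
To make the idea concrete before the real code, a serde is just a matched serializer/deserializer pair over raw bytes. The sketch below is only an illustration that runs with the standard library. ToySerde, ToyRanking and the pipe-delimited wire format are assumptions made up for this example and are not part of Kafka, whose own Serde interface looks different.

```scala
import java.nio.charset.StandardCharsets.UTF_8

// Hypothetical stand-in for the idea of a serde; this is NOT Kafka's Serde interface.
trait ToySerde[T] {
  def serialize(value: T): Array[Byte]
  def deserialize(bytes: Array[Byte]): T
}

final case class ToyRanking(fromEmail: String, toEmail: String, score: Float)

object ToyRankingSerde extends ToySerde[ToyRanking] {
  // Wire format (an assumption for this sketch): fromEmail|toEmail|score as UTF-8 text
  def serialize(value: ToyRanking): Array[Byte] =
    s"${value.fromEmail}|${value.toEmail}|${value.score}".getBytes(UTF_8)

  // Reverses serialize; rejects anything that does not have exactly three fields
  def deserialize(bytes: Array[Byte]): ToyRanking =
    new String(bytes, UTF_8).split('|') match {
      case Array(from, to, score) => ToyRanking(from, to, score.toFloat)
      case other =>
        throw new IllegalArgumentException(s"Expected 3 fields but got ${other.length}")
    }
}

object ToySerdeDemo {
  def main(args: Array[String]): Unit = {
    val original = ToyRanking("jarden@here.com", "sacha@here.com", 1.5f)
    val bytes = ToyRankingSerde.serialize(original)
    // A serde is only useful if the round trip gives back an equal value
    println(ToyRankingSerde.deserialize(bytes) == original)
  }
}
```

The real JSON serde has the same shape, with Jackson doing the encoding and decoding instead of the hand-rolled format used here.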

 

Anyway here is my/his/yours/ours “Serde” code for JSON

 

import java.lang.reflect.{ParameterizedType, Type}
import java.util
import com.fasterxml.jackson.annotation.JsonInclude
import com.fasterxml.jackson.core.JsonParseException
import com.fasterxml.jackson.core.`type`.TypeReference
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.databind.exc.{UnrecognizedPropertyException => UPE}
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import org.apache.kafka.common.serialization.{Deserializer, Serde, Serializer}


package Serialization {

  object Json {

    type ParseException = JsonParseException
    type UnrecognizedPropertyException = UPE

    private val mapper = new ObjectMapper()
    mapper.registerModule(DefaultScalaModule)
    mapper.setSerializationInclusion(JsonInclude.Include.NON_NULL)

    private def typeReference[T: Manifest] = new TypeReference[T] {
      override def getType = typeFromManifest(manifest[T])
    }

    private def typeFromManifest(m: Manifest[_]): Type = {
      if (m.typeArguments.isEmpty) {
        m.runtimeClass
      }
      else new ParameterizedType {
        def getRawType = m.runtimeClass

        def getActualTypeArguments = m.typeArguments.map(typeFromManifest).toArray

        def getOwnerType = null
      }
    }

    object ByteArray {
      def encode(value: Any): Array[Byte] = mapper.writeValueAsBytes(value)

      def decode[T: Manifest](value: Array[Byte]): T =
        mapper.readValue(value, typeReference[T])
    }

  }

  /**
    * JSON serializer for JSON serde
    *
    * @tparam T
    */
  class JSONSerializer[T] extends Serializer[T] {
    override def configure(configs: util.Map[String, _], isKey: Boolean): Unit = ()

    override def serialize(topic: String, data: T): Array[Byte] =
      Json.ByteArray.encode(data)

    override def close(): Unit = ()
  }

  /**
    * JSON deserializer for JSON serde
    *
    * @tparam T
    */
  class JSONDeserializer[T >: Null <: Any : Manifest] extends Deserializer[T] {
    override def configure(configs: util.Map[String, _], isKey: Boolean): Unit = ()

    override def close(): Unit = ()

    override def deserialize(topic: String, data: Array[Byte]): T =
      // A null payload is passed through as null rather than being handed to the JSON parser
      if (data == null) null else Json.ByteArray.decode[T](data)
  }

  /**
    * JSON serde for local state serialization
    *
    * @tparam T
    */
  class JSONSerde[T >: Null <: Any : Manifest] extends Serde[T] {
    override def deserializer(): Deserializer[T] = new JSONDeserializer[T]

    override def configure(configs: util.Map[String, _], isKey: Boolean): Unit = ()

    override def close(): Unit = ()

    override def serializer(): Serializer[T] = new JSONSerializer[T]
  }

}

 

And finally here is what the producer code looks like

 

As you can see it is actually fairly straightforward. It simply produces 10 messages (though you can uncomment the line in the code to make it endless) with a String as the key and a Ranking as the message. The key is who the ranking is destined for. These will then be aggregated by the streams processing node, and stored as a list against a key (i.e. email).

 


package Processing.Ratings {

  import java.util.concurrent.TimeUnit

  import Entities.Ranking
  import Serialization.JSONSerde
  import Topics.RatingsTopics

  import scala.util.Random
  import org.apache.kafka.clients.producer.ProducerRecord
  import org.apache.kafka.clients.producer.KafkaProducer
  import org.apache.kafka.common.serialization.Serdes
  import Utils.Settings
  import org.apache.kafka.clients.producer.ProducerConfig

  object RatingsProducerApp extends App {

   run()

    private def run(): Unit = {

      val jSONSerde = new JSONSerde[Ranking]
      val random = new Random
      val producerProps = Settings.createBasicProducerProperties
      val rankingList = List(
        Ranking("jarden@here.com","sacha@here.com", 1.5f),
        Ranking("miro@here.com","mary@here.com", 1.5f),
        Ranking("anne@here.com","margeret@here.com", 3.5f),
        Ranking("frank@here.com","bert@here.com", 2.5f),
        Ranking("morgan@here.com","ruth@here.com", 1.5f))

      producerProps.put(ProducerConfig.ACKS_CONFIG, "all")

      System.out.println("Connecting to Kafka cluster via bootstrap servers " +
        s"${producerProps.getProperty(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG)}")

      // send a random Ranking from the list every 100 milliseconds
      val rankingProducer = new KafkaProducer[String, Array[Byte]](
        producerProps, Serdes.String.serializer, Serdes.ByteArray.serializer)

      //while (true) {
      for (_ <- 0 until 10) { // 10 messages; "0 to 10" is inclusive and would send 11
        val ranking = rankingList(random.nextInt(rankingList.size))
        val rankingBytes = jSONSerde.serializer().serialize("", ranking)
        System.out.println(s"Writing ranking ${ranking} to input topic ${RatingsTopics.RATING_SUBMIT_TOPIC}")
        rankingProducer.send(new ProducerRecord[String, Array[Byte]](
          RatingsTopics.RATING_SUBMIT_TOPIC, ranking.toEmail, rankingBytes))
        Thread.sleep(100)
      }

      Runtime.getRuntime.addShutdownHook(new Thread(() => {
        rankingProducer.close(10, TimeUnit.SECONDS)
      }))
    }
  }
}

 

 

 

Walking through a Kafka Streams processing node, and the duality of streams

Before we get started I just wanted to include several excerpts taken from the official Kafka docs : http://docs.confluent.io/current/streams/concepts.html#duality-of-streams-and-tables which talk about KStream and KTable objects (which are the stream and table objects inside Kafka Streams)

 

When implementing stream processing use cases in practice, you typically need both streams and also databases. An example use case that is very common in practice is an e-commerce application that enriches an incoming stream of customer transactions with the latest customer information from a database table. In other words, streams are everywhere, but databases are everywhere, too.

Any stream processing technology must therefore provide first-class support for streams and tables. Kafka’s Streams API provides such functionality through its core abstractions for streams and tables, which we will talk about in a minute. Now, an interesting observation is that there is actually a close relationship between streams and tables, the so-called stream-table duality. And Kafka exploits this duality in many ways: for example, to make your applications elastic, to support fault-tolerant stateful processing, or to run interactive queries against your application’s latest processing results. And, beyond its internal usage, the Kafka Streams API also allows developers to exploit this duality in their own applications.

 

A simple form of a table is a collection of key-value pairs, also called a map or associative array. Such a table may look as follows:

 

    ../_images/streams-table-duality-01.jpg

     

    The stream-table duality describes the close relationship between streams and tables.

     

    • Stream as Table: A stream can be considered a changelog of a table, where each data record in the stream captures a state change of the table. A stream is thus a table in disguise, and it can be easily turned into a “real” table by replaying the changelog from beginning to end to reconstruct the table. Similarly, aggregating data records in a stream will return a table. For example, we could compute the total number of pageviews by user from an input stream of pageview events, and the result would be a table, with the table key being the user and the value being the corresponding pageview count.
    • Table as Stream: A table can be considered a snapshot, at a point in time, of the latest value for each key in a stream (a stream’s data records are key-value pairs). A table is thus a stream in disguise, and it can be easily turned into a “real” stream by iterating over each key-value entry in the table.

     

    Let’s illustrate this with an example. Imagine a table that tracks the total number of pageviews by user (first column of diagram below). Over time, whenever a new pageview event is processed, the state of the table is updated accordingly. Here, the state changes between different points in time – and different revisions of the table – can be represented as a changelog stream (second column).

     

    image

     

    Because of the stream-table duality, the same stream can be used to reconstruct the original table (third column):

     

    ../_images/streams-table-duality-03.jpg

     

    The same mechanism is used, for example, to replicate databases via change data capture (CDC) and, within Kafka Streams, to replicate its so-called state stores across machines for fault tolerance. The stream-table duality is such an important concept for stream processing applications in practice that Kafka Streams models it explicitly via the KStream and KTable abstractions, which we describe in the next sections.
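
The duality described in the excerpt can be sketched in a few lines of plain Scala, with no Kafka involved. This is only an illustration. The Change case class, the function names and the sample data are assumptions made up for this example and are not KStream or KTable APIs.

```scala
object DualitySketch {
  // One changelog record: the latest pageview count for a user (hypothetical type)
  final case class Change(user: String, count: Int)

  // Stream as table: replay the changelog in order, keeping the latest value per key
  def toTable(changelog: Seq[Change]): Map[String, Int] =
    changelog.foldLeft(Map.empty[String, Int]) { (table, change) =>
      table + (change.user -> change.count)
    }

  // Table as stream: emit one record for each key-value entry in the table
  def toStream(table: Map[String, Int]): Seq[Change] =
    table.toSeq.map { case (user, count) => Change(user, count) }

  def main(args: Array[String]): Unit = {
    val changelog = Seq(
      Change("alice", 1),
      Change("charlie", 1),
      Change("alice", 3),
      Change("bob", 1))

    val table = toTable(changelog)
    println(table("alice"))                      // 3, the later record for alice wins
    println(toTable(toStream(table)) == table)   // true, the table survives the round trip
  }
}
```

Replaying the changelog is a fold that keeps the last value seen per key, which is why the later record for a user overwrites the earlier one.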

     

     

    I would STRONGLY urge you all to read the section of the official docs above, as it will really help you should you want to get into Kafka Streams.

     

    Anyway, with all that in mind, how does that relate to the use case we are trying to solve? So far we have a publisher that pushes out Rating objects, and as stated ideally we would like to query these across all processor nodes. As such we should now know that this will involve a KStream and some sort of aggregation to an eventual KTable (where a state store will be used).

     

    Probably the easiest thing to do is to start with the code, which looks like this for the main stream processing code for the Rating section of the final app.

      import java.util.concurrent.TimeUnit
      import org.apache.kafka.common.serialization._
      import org.apache.kafka.streams._
      import org.apache.kafka.streams.kstream._
      import Entities.Ranking
      import Serialization.JSONSerde
      import Topics.RatingsTopics
      import Utils.Settings
      import Stores.StateStores
      import org.apache.kafka.streams.state.HostInfo
      import scala.concurrent.ExecutionContext
    
    
      package Processing.Ratings {
    
        class RankingByEmailInitializer extends Initializer[List[Ranking]] {
          override def apply(): List[Ranking] = List[Ranking]()
        }
    
        class RankingByEmailAggregator extends Aggregator[String, Ranking,List[Ranking]] {
          override def apply(aggKey: String, value: Ranking, aggregate: List[Ranking]) = {
            value :: aggregate
          }
        }
    
    
        object RatingStreamProcessingApp extends App {
    
          implicit val ec = ExecutionContext.global
    
          run()
    
          private def run() : Unit = {
            val stringSerde = Serdes.String
            val rankingSerde = new JSONSerde[Ranking]
            val listRankingSerde = new JSONSerde[List[Ranking]]
            val builder: KStreamBuilder = new KStreamBuilder
            val rankings = builder.stream(stringSerde, rankingSerde, RatingsTopics.RATING_SUBMIT_TOPIC)
    
            //aggregate by (user email -> their rankings)
            val rankingTable = rankings.groupByKey(stringSerde,rankingSerde)
              .aggregate(
                new RankingByEmailInitializer(),
                new RankingByEmailAggregator(),
                listRankingSerde,
                StateStores.RANKINGS_BY_EMAIL_STORE
              )
    
            //useful debugging aid, print KTable contents
            rankingTable.toStream.print()
    
            val streams: KafkaStreams = new KafkaStreams(builder, Settings.createRatingStreamsProperties)
            val restEndpoint:HostInfo  = new HostInfo(Settings.restApiDefaultHostName, Settings.restApiDefaultPort)
            System.out.println(s"Connecting to Kafka cluster via bootstrap servers ${Settings.bootStrapServers}")
            System.out.println(s"REST endpoint at http://${restEndpoint.host}:${restEndpoint.port}")
    
            // Always (and unconditionally) clean local state prior to starting the processing topology.
            // We opt for this unconditional call here because this will make it easier for you to 
            // play around with the example when resetting the application for doing a re-run 
            // (via the Application Reset Tool,
            // http://docs.confluent.io/current/streams/developer-guide.html#application-reset-tool).
            //
            // The drawback of cleaning up local state prior is that your app must rebuild its local 
            // state from scratch, which will take time and will require reading all the state-relevant 
            // data from the Kafka cluster over the network.
            // Thus in a production scenario you typically do not want to clean up always as we do 
            // here but rather only when it is truly needed, i.e., only under certain conditions 
            // (e.g., the presence of a command line flag for your app).
            // See `ApplicationResetExample.java` for a production-like example.
            streams.cleanUp();
            streams.start()
            val restService = new RatingRestService(streams, restEndpoint)
            restService.start()
    
            Runtime.getRuntime.addShutdownHook(new Thread(() => {
              streams.close(10, TimeUnit.SECONDS)
              restService.stop
            }))
    
            //return unit
            ()
          }
        }
      }
    

     

     

    Remember the idea is to get a Rating for a user (based on their email address), and store all the Ratings associated with them in some sequence/list such that they can be retrieved in one go based on a key, where the key would be the user's email, and the value would be this list of Rating objects. I think with the formal discussion from the official Kafka docs and my actual Rating requirement, the above should hopefully be pretty clear.
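
    To make the shape of that aggregation concrete without needing a Kafka cluster, here is a minimal plain-Scala sketch of the same idea. It is not the code from the repo: the Ranking fields are simplified stand-ins, and the choice to prepend new items to the list is an assumption made purely for illustration.

```scala
object RankingAggregationSketch {
  // Simplified stand-in for the real Ranking entity (the field names are assumptions)
  final case class Ranking(fromEmail: String, toEmail: String, score: Float)

  // Plays the role of RankingByEmailInitializer: the empty starting value for a key
  def initial: List[Ranking] = Nil

  // Plays the role of RankingByEmailAggregator: fold one new record into the running list
  // (prepending is an assumption, the real aggregator may order things differently)
  def aggregate(key: String, value: Ranking, soFar: List[Ranking]): List[Ranking] =
    value :: soFar

  // Fold a sequence of (email -> ranking) records into an (email -> rankings) table,
  // which is conceptually what groupByKey + aggregate maintains in the state store
  def toTable(records: Seq[(String, Ranking)]): Map[String, List[Ranking]] =
    records.foldLeft(Map.empty[String, List[Ranking]]) { case (table, (email, ranking)) =>
      table.updated(email, aggregate(email, ranking, table.getOrElse(email, initial)))
    }
}
```

    Feeding it two records keyed by the same email gives a single table entry for that email holding both Ranking objects, which is the key/value layout the state store ends up with.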

     

     

    Walking through Kafka Streams interactive queries

    So now that we have gone through how data is produced and transformed (well, actually I did not do too much transformation other than a simple map, but trust me you can), and how we aggregate results from a KStream to a KTable (and its state store), we will move on to see how we can use Kafka Streams interactive queries to query the state stores.

     

    One important concept is that if you used multiple partitions for your original topic, the state may be spread across n-many processing nodes. For this project I have only chosen to use 1 partition, but have written the code to support n-many.

     

    So let's assume that each node could read a different segment of data, or that each node must read from n-many partitions (there is not actually a fixed mapping between nodes and partitions; the 2 must-read chapters on this are elastic-scaling-of-your-application and parallelism-model). We would then need each node to expose a REST API to allow its OWN state store to be read. By reading ALL the state stores we are able to get a total view of ALL the persisted data across ALL the partitions. I urge all of you to read this section of the official docs : http://docs.confluent.io/current/streams/developer-guide.html#querying-remote-state-stores-for-the-entire-application

     

    This diagram has also been shamelessly stolen from the official docs:

     

    ../_images/streams-interactive-queries-api-02.png

    I think this diagram does an excellent job of showing you 3 separate processor nodes, each of which may hold a bit of state. ONLY by assembling ALL the data from these nodes are we able to see the ENTIRE dataset.
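
    As a rough illustration of that "assemble everything" idea, here is a small plain-Scala sketch. The Node type and its localStore map are hypothetical stand-ins for a processor node and its slice of the state store. In the real application each slice would be fetched over that node's REST API rather than read from memory.

```scala
object StateAssemblySketch {
  // Hypothetical stand-in for one processor node and the slice of the store it holds
  final case class Node(host: String, port: Int, localStore: Map[String, List[Int]])

  // Walk every node and merge its local slice into one combined view.
  // A given key normally lives on a single node, so the list concatenation
  // below is only a safety net for this sketch.
  def fullView(nodes: Seq[Node]): Map[String, List[Int]] =
    nodes.foldLeft(Map.empty[String, List[Int]]) { (acc, node) =>
      node.localStore.foldLeft(acc) { case (merged, (key, values)) =>
        merged.updated(key, merged.getOrElse(key, Nil) ++ values)
      }
    }
}
```

    No single Node in this sketch can answer for every key, but fullView over all of them can, which is the point the diagram is making.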

     

    Kafka allows this via metadata about the streams, where we can use the exposed metadata to help us gather the state store data. To do this we first need a MetadataService, which for me is as follows:

     

    package Processing.Ratings
    
    import org.apache.kafka.streams.KafkaStreams
    import org.apache.kafka.streams.state.StreamsMetadata
    import java.util.stream.Collectors
    import Entities.HostStoreInfo
    import org.apache.kafka.common.serialization.Serializer
    import org.apache.kafka.connect.errors.NotFoundException
    import scala.collection.JavaConverters._
    
    
    /**
      * Looks up StreamsMetadata from KafkaStreams
      */
    class MetadataService(val streams: KafkaStreams) {
    
    
       /**
        * Get the metadata for all of the instances of this Kafka Streams application
        *
        * @return List of { @link HostStoreInfo}
        */
      def streamsMetadata() : List[HostStoreInfo] = {
    
        // Get metadata for all of the instances of this Kafka Streams application
        val metadata = streams.allMetadata
        return mapInstancesToHostStoreInfo(metadata)
      }
    
    
      /**
        * Get the metadata for all instances of this Kafka Streams application that currently
        * has the provided store.
        *
        * @param store The store to locate
        * @return List of { @link HostStoreInfo}
        */
      def streamsMetadataForStore(store: String) : List[HostStoreInfo] = {
    
        // Get metadata for all of the instances of this Kafka Streams application hosting the store
        val metadata = streams.allMetadataForStore(store)
        return mapInstancesToHostStoreInfo(metadata)
      }
    
    
      /**
        * Find the metadata for the instance of this Kafka Streams Application that has the given
        * store and would have the given key if it exists.
        *
        * @param store Store to find
        * @param key   The key to find
        * @return { @link HostStoreInfo}
        */
      def streamsMetadataForStoreAndKey[T](store: String, key: T, serializer: Serializer[T]) : HostStoreInfo = {
        // Get metadata for the instances of this Kafka Streams application hosting the store and
        // potentially the value for key
        val metadata = streams.metadataForKey(store, key, serializer)
        if (metadata == null)
          throw new NotFoundException(
            s"No metadata could be found for store : ${store}, and key type : ${key.getClass.getName}")
    
        HostStoreInfo(metadata.host, metadata.port, metadata.stateStoreNames.asScala.toList)
      }
    
    
      def mapInstancesToHostStoreInfo(metadatas : java.util.Collection[StreamsMetadata]) : List[HostStoreInfo] = {
    
        metadatas.stream.map[HostStoreInfo](metadata =>
          HostStoreInfo(
            metadata.host(),
            metadata.port,
            metadata.stateStoreNames.asScala.toList))
          .collect(Collectors.toList())
          .asScala.toList
      }
    
    }
    

     

    This metadata service is used to obtain the state store information, which we can then use to extract the state data we want (it’s a key value store really).
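
    If you are wondering how a key can be mapped to "the node that would hold it", the following is a simplified plain-Scala model of the idea. It is not how Kafka Streams is implemented: the hash function is a simple substitute for Kafka's own partitioner, the partition-to-host assignment is invented for the example, and the third field name of HostStoreInfo is an assumption.

```scala
object KeyRoutingSketch {
  // Stand-in for the HostStoreInfo entity used above (storeNames is an assumed field name)
  final case class HostStoreInfo(host: String, port: Int, storeNames: List[String])

  // Pick a partition from the serialized key bytes.
  // java.util.Arrays.hashCode is used here only as a simple, deterministic substitute hash.
  def partitionFor(key: String, numPartitions: Int): Int = {
    val bytes = key.getBytes("UTF-8")
    (java.util.Arrays.hashCode(bytes) & 0x7fffffff) % numPartitions
  }

  // Assumption for this sketch only: partition i is owned by hosts(i % hosts.size).
  // Returns None instead of throwing when there is nothing to route to.
  def hostFor(key: String, numPartitions: Int, hosts: Vector[HostStoreInfo]): Option[HostStoreInfo] =
    if (hosts.isEmpty || numPartitions <= 0) None
    else Some(hosts(partitionFor(key, numPartitions) % hosts.size))
}
```

    The important property is that the same key always routes to the same host, so a REST endpoint can decide whether to answer locally or forward the request. Returning an Option here is just one design choice; the MetadataService above throws a NotFoundException instead.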

     

    The next thing we need to do is expose a REST API to allow us to get the state. Let's see that now.

     

    package Processing.Ratings
    
    import org.apache.kafka.streams.KafkaStreams
    import org.apache.kafka.streams.state.HostInfo
    import akka.actor.ActorSystem
    import akka.http.scaladsl.Http
    import akka.http.scaladsl.model._
    import akka.http.scaladsl.server.Directives._
    import akka.stream.ActorMaterializer
    import akka.http.scaladsl.marshallers.sprayjson.SprayJsonSupport._
    import spray.json.DefaultJsonProtocol._
    import Entities.AkkaHttpEntitiesJsonFormats._
    import Entities._
    import Stores.StateStores
    import akka.http.scaladsl.marshalling.ToResponseMarshallable
    import org.apache.kafka.common.serialization.Serdes
    import scala.concurrent.{Await, ExecutionContext, Future}
    import akka.http.scaladsl.unmarshalling.Unmarshal
    import spray.json._
    import scala.util.{Failure, Success}
    import org.apache.kafka.streams.state.QueryableStoreTypes
    import scala.concurrent.duration._
    
    
    
    object RestService {
      val DEFAULT_REST_ENDPOINT_HOSTNAME  = "localhost"
    }
    
    
    class RatingRestService(val streams: KafkaStreams, val hostInfo: HostInfo) {
    
      val metadataService = new MetadataService(streams)
      var bindingFuture: Future[Http.ServerBinding] = null
    
      implicit val system = ActorSystem("rating-system")
      implicit val materializer = ActorMaterializer()
      implicit val executionContext = system.dispatcher
    
    
      def start() : Unit = {
        val emailRegexPattern =  """\w+""".r
        val storeNameRegexPattern =  """\w+""".r
    
        val route =
    
    
          path("test") {
            get {
              parameters('email.as[String]) { (email) =>
                complete(HttpEntity(ContentTypes.`text/html(UTF-8)`,
                            s"<h1>${email}</h1>"))
              }
            }
          } ~
          path("ratingByEmail") {
            get {
              parameters('email.as[String]) { (email) =>
                try {
    
                  val host = metadataService.streamsMetadataForStoreAndKey[String](
                    StateStores.RANKINGS_BY_EMAIL_STORE,
                    email,
                    Serdes.String().serializer()
                  )
    
                  var future:Future[List[Ranking]] = null
    
                  //store is hosted on another process, REST Call
                  if(!thisHost(host))
                    future = fetchRemoteRatingByEmail(host, email)
                  else
                    future = fetchLocalRatingByEmail(email)
    
                  val rankings = Await.result(future, 20 seconds)
                  complete(rankings)
                }
                catch {
                  case (ex: Exception) => {
                    val finalList:List[Ranking] = scala.collection.immutable.List[Ranking]()
                    complete(finalList)
                  }
                }
              }
            }
          } ~
          path("instances") {
            get {
              complete(ToResponseMarshallable.apply(metadataService.streamsMetadata))
            }
          }~
          path("instances" / storeNameRegexPattern) { storeName =>
            get {
              complete(ToResponseMarshallable.apply(metadataService.streamsMetadataForStore(storeName)))
            }
          }
    
        bindingFuture = Http().bindAndHandle(route, hostInfo.host, hostInfo.port)
        println(s"Server online at http://${hostInfo.host}:${hostInfo.port}/\n")
    
        Runtime.getRuntime.addShutdownHook(new Thread(() => {
          bindingFuture
            .flatMap(_.unbind()) // trigger unbinding from the port
            .onComplete(_ => system.terminate()) // and shutdown when done
        }))
      }
    
    
      def fetchRemoteRatingByEmail(host:HostStoreInfo, email: String) : Future[List[Ranking]] = {
    
        // target the remote host that owns the key, not this instance's own hostInfo
        val requestPath = s"http://${host.host}:${host.port}/ratingByEmail?email=${email}"
        println(s"Client attempting to fetch from online at ${requestPath}")
    
        val responseFuture: Future[List[Ranking]] = {
          Http().singleRequest(HttpRequest(uri = requestPath))
            .flatMap(response => Unmarshal(response.entity).to[List[Ranking]])
        }
    
        responseFuture
      }
    
      def fetchLocalRatingByEmail(email: String) : Future[List[Ranking]] = {
    
        val ec = ExecutionContext.global
    
        val host = metadataService.streamsMetadataForStoreAndKey[String](
          StateStores.RANKINGS_BY_EMAIL_STORE,
          email,
          Serdes.String().serializer()
        )
    
        val f = StateStores.waitUntilStoreIsQueryable(
          StateStores.RANKINGS_BY_EMAIL_STORE,
          QueryableStoreTypes.keyValueStore[String,List[Ranking]](),
          streams
        ).map(_.get(email))(ec)
    
        val mapped = f.map(ranking => {
          if (ranking == null)
            List[Ranking]()
          else
            ranking
        })
    
        mapped
      }
    
      def stop() : Unit = {
        bindingFuture
          .flatMap(_.unbind()) // trigger unbinding from the port
          .onComplete(_ => system.terminate()) // and shutdown when done
      }
    
      def thisHost(hostStoreInfo: HostStoreInfo) : Boolean = {
        hostStoreInfo.host.equals(hostInfo.host()) &&
          hostStoreInfo.port == hostInfo.port
      }
    }
    

     

    With that final class we are able to run the application and query it using the url http://localhost:8080/ratingByEmail?email=sacha@here.com (the key to the Kafka store here is “sacha@here.com” and the value could either be an empty list or a List[Ranking] serialized as JSON). The results are shown below, after we have run the producer and used Chrome (or any other REST tool of your picking) to fetch them.
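
    One small thing to watch when building that URL in code rather than typing it into a browser: the email goes into a query parameter, so it is safer to percent-encode it. Here is a minimal sketch using only the JDK; the object and method names are made up for this example and are not part of the repo.

```scala
import java.net.URLEncoder

object RatingUrlSketch {
  // Build the ratingByEmail URL, percent-encoding the email so that characters
  // such as '@' or '+' survive the trip as a query parameter value
  def ratingByEmailUrl(host: String, port: Int, email: String): String =
    s"http://$host:$port/ratingByEmail?email=${URLEncoder.encode(email, "UTF-8")}"
}
```

    Calling RatingUrlSketch.ratingByEmailUrl("localhost", 8080, "sacha@here.com") yields http://localhost:8080/ratingByEmail?email=sacha%40here.com, which the server side decodes back to the original email.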

     

    image

     

     

    Conclusion

    I have found the journey to get here an interesting one. The main issue being that the Kafka docs and examples are all written in Java, and some are not even using Java lambdas (Java 1.8), so the translation from that to Scala code (where there are lambdas everywhere) is sometimes trickier than you might think.

     

    The other thing that has caught me out a few times is that the Scala type system is pretty good at inferring the correct types, so you kind of let it get on with its job. But occasionally it doesn’t/can’t infer the type correctly. This may show up at compile time if you are lucky, or at run time. In the case of a runtime issue, I found it fairly hard to see exactly which part of the Kafka Streams API needed to be told a bit more type information.

     

    As a general rule of thumb, if there is an overloaded method that takes a serde and one that doesn’t, ALWAYS use the one that takes a serde and specify the generic type parameters explicitly. The methods that take Serdes are usually ones that involve some sort of shuffling around within partitions, so they need Serdes to serialize and deserialize correctly.

     

    Other than that I am VERY happy with working with Kafka Streams, and once you get into it, it's not that different from working with Apache Spark and RDDs.

     

     

    Next time

    Next time we will be turning our attention back to the web site, where we will expose an endpoint that can be called from the ratings dialog that is launched at the end of a job. This endpoint will take the place of the RatingsProducerApp demonstrated in this app. For that we will be using https://github.com/akka/reactive-kafka. We would also expose a new endpoint to fetch the rating (via email address) from the Kafka stream processor node.

     

    The idea being that when a job is completed a Rating from a driver to a passenger is given, this is sent to the Kafka stream processor node, and the combined ratings are accumulated for users (by email address) and are exposed to be queried. As you can see this post covers the latter part of this requirement already. The only thing we would need to do (as stated above) is replace the RatingsProducerApp demonstrated in this app with a new reactive kafka producer in the main Play application.

    scala environment config options

    This is going to be a slightly weird post in a way, as it is going to go round the houses a bit and will not contain any actual code, but it will talk about possible techniques for managing environment-specific config values in a multi project Scala setup.

    Coming from .NET

    So as many of you know I came from .NET, where we have a simple config model. We have App.Config or Web.Config.

    We have tools at our disposal such as the XmlTransformation MsBuild tasks, which allow us to maintain a set of different App.Config values that will be transformed for our different environments.

    I wrote a post about this here

    https://sachabarbs.wordpress.com/2015/07/07/app-config-transforms-outside-of-web-project/

    Here is a small example of what this might look like

     

    image

    So when I started doing multi project Scala builds using SBT, I might have the following project requirements

    image

    In the above diagram the following is assumed

    • There is a Sacha.Common library that we want to use across multiple projects
    • That Sacha.Rest.Endpoints and Sacha.Play.FrontEnd are both apps which will need some environment specific configuration in them
    • That there is going to be more than 1 environment that we wish to use such as
      • PROD
      • QA
      • DEV

    Now coming from .NET my initial instinct was to put a bunch of folders in the 2 apps, so taking the Sacha.Rest.Endpoints app as an example we may have something like this

    image

    So the idea would be that we would have specific application.conf files for the different environments that we need to target (I am assuming there is some build process which takes care of putting the correct file in place for the correct build within the CI server).

    This is very easy to understand, if we want QA we would end up using the QA version of the application.conf file

    This is a very common way of thinking about this problem in .NET.

    Why Is This Bad?

    But hang on, this is only 1 app. What if we had 100 apps that made up our software in total? That means we need to maintain all these separate config files for all the environments in ALL the separate apps.

    Wow that doesn’t sound so cool anymore.

    Another Approach!

    A colleague and I were talking about this in relation to some Scala code that was being written for a new project, and this is kind of what was being discussed.

    I should point out that this idea was not in fact mine, but my colleague Andy Sprague's, which is not something I credited him for in the 1st draft of this post. Which is bad, sorry Andy.

    Anyway how about this for another idea. How about the Sacha.Common JAR hold just the specific bits of changing config in separate config files such as

    • “Qa.conf”
    • “Prod.conf”
    • “Dev.conf”

    And then the individual apps that already reference the Sacha.Common JAR just include the environment config they need.

    This is entirely possible thanks to the way that the Typesafe config library works, where it is designed to include extra config files. These extra config files in this case just live inside an external JAR, namely Sacha.Common

    Here is what this might look like for a consumer of the Sacha.Common jar

    image

    Where we just include the relevant environment config from Sacha.Common in the application.conf for the current app

    And this is what the Sacha.Common may look like, where it provides the separate environment config files that consumers may use

    image

    This diagram may help to illustrate this further

    image
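
    Since the screenshots may not come through, here is a rough config sketch of the same idea. All of the keys and values are hypothetical and only there to show the mechanics; the only real feature being relied on is the Typesafe config include statement, which can pull in a file found on the classpath (which is where the contents of the Sacha.Common JAR end up).

```
# ---- Qa.conf, shipped inside the Sacha.Common JAR (hypothetical keys) ----
database {
  url = "jdbc:postgresql://qa-db:5432/sacha"
}

# ---- application.conf of a consuming app, e.g. Sacha.Rest.Endpoints ----
# Pull in the shared QA settings from Sacha.Common
include "Qa.conf"

# App specific settings can still live here and refer to the shared values
rest {
  port = 9000
  connection = ${database.url}
}
```

    Switching the app to another environment is then a matter of including "Prod.conf" or "Dev.conf" instead, with the environment values themselves maintained in one place.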

    Why Is This Cool?

    The good thing about this design over the separate config files per environment per application is that we now ONLY need to maintain one set of environment specific settings, which are those in the common JAR, Sacha.Common

    I wish we could do this within the .NET configuration system.

    Hope this helps, I personally think that this is a fairly nice way to manage your configs for multiple applications and multiple environments

     

     

     

     

     

     

    Apache Kafka 0.9 Scala Producer/Consumer

    For my job at the moment, I am spending roughly 50% of my time working on .NET and the other 50% working with Scala. As such a lot of Scala/JVM toys have piqued my interest of late. My latest quest was to try and learn Apache Kafka well enough that I at least understood the core concepts. I have even read a book or two on Apache Kafka now, so I feel I am at least talking partial sense in this article.

    So what is Apache Kafka, exactly?

    Here is what the Apache Kafka folks have to say about their own tool.

    Apache Kafka is publish-subscribe messaging rethought as a distributed commit log.
    Fast
    A single Kafka broker can handle hundreds of megabytes of reads and writes per second from thousands of clients.

    Scalable
    Kafka is designed to allow a single cluster to serve as the central data backbone for a large organization. It can be elastically and transparently expanded without downtime. Data streams are partitioned and spread over a cluster of machines to allow data streams larger than the capability of any single machine and to allow clusters of co-ordinated consumers

    Durable
    Messages are persisted on disk and replicated within the cluster to prevent data loss. Each broker can handle terabytes of messages without performance impact.

    Distributed by Design
    Kafka has a modern cluster-centric design that offers strong durability and fault-tolerance guarantees.

    Taken from http://kafka.apache.org/ as of 11/03/16

    Apache Kafka was designed and built by a team of engineers at LinkedIn, where I am sure you will agree they probably had to deal with quite a bit of data.

     

    I decided to learn a bit more about all this and have written an article on this over at code project :

     

    http://www.codeproject.com/Articles/1085758/Apache-Kafka-Scala-Producer-Consumer-With-Some-RxS

     

    In this article I will talk you through some of the core Apache Kafka concepts, and will also show how to create a Scala Apache Kafka Producer and a Scala Apache Kafka Consumer. I will also sprinkle some RxScala pixie dust on top of the Apache Kafka Consumer code, such that the RX operators can be applied to the incoming Apache Kafka messages.

    CASSANDRA + SPARK 2 OF 2

    Last time I walked you through how to install Cassandra in the simplest manner possible, which was as a single node installation using the DataStax community edition.

    All good stuff. So I have also just written up how you might use the Scala/DataStax Cassandra/Spark connector to retrieve data from Cassandra tables and hydrate it into Spark RDDs.

    These 2 Cassandra articles and the 1st Spark one kind of form a series of articles, which you can find using the series links at the top of the articles.

    Anyway here is the latest installment

    Apache Spark/Cassandra 2 of 2

    Scala : multi project sbt setup

    A while ago I wrote a post about how to use SBT (Scala Build Tool):

    https://sachabarbs.wordpress.com/2015/10/13/sbt-wheres-my-nuget/

    In that post I showed simple usages of SBT. The thing is, that was not really that realistic, so I wanted to have a go at a more real world example. One where we might have multiple projects, say like this:

     

    SachasSBTDemo-App which depends on 2 sub projects

    • SachasSBTDemo-Server
    • SachasSBTDemo-Common

    So how do we go about doing this with SBT?

    There are 5 main steps to do this, which we will look at in turn.

    SBT Directory Structure

    The first thing that we need to do is create a new project folder (if you are from a Visual Studio / .NET background think of this as the solution folder) called “project”

    In here we will create 2 files

    build.properties which just lists the version of SBT we will use. It looks like this

    sbt.version=0.13.8
    

    SachaSBTDemo.scala is what I have called the other file, but you can call it what you like. Here are the contents of that file. This is the main SBT file that governs how it all hangs together. I will be explaining each of these parts as we go.

    import sbt._
    import Keys._
    
    object BuildSettings {
    
    
      val buildOrganization = "sas"
      val buildVersion      = "1.0"
      val buildScalaVersion = "2.11.5"
    
      val buildSettings = Defaults.defaultSettings ++ Seq (
        organization := buildOrganization,
        version      := buildVersion,
        scalaVersion := buildScalaVersion
      )
    }
    
    
    object Dependencies {
      val jacksonjson = "org.codehaus.jackson" % "jackson-core-lgpl" % "1.7.2"
      val scalatest = "org.scalatest" % "scalatest_2.9.0" % "1.4.1" % "test"
    }
    
    
    object SachasSBTDemo extends Build {
    
      import Dependencies._
      import BuildSettings._
    
      // Sub-project specific dependencies
      val commonDeps = Seq (
         jacksonjson,
         scalatest
      )
    
      val serverDeps = Seq (
         scalatest
      )
    
    
      lazy val demoApp = Project (
        "SachasSBTDemo-App",
        file ("SachasSBTDemo-App"),
        settings = buildSettings
      )
      //build these projects when main App project gets built
      .aggregate(common, server)
      .dependsOn(common, server)
    
      lazy val common = Project (
        "common",
        file ("SachasSBTDemo-Common"),
        settings = buildSettings ++ Seq (libraryDependencies ++= commonDeps)
      )
    
      lazy val server = Project (
        "server",
        file ("SachasSBTDemo-Server"),
        settings = buildSettings ++ Seq (libraryDependencies ++= serverDeps)
      ) dependsOn (common)
      
    }
    

     

    Projects

    In order to have separate projects we need to use the Project item from the SBT library JARs. A minimal Project setup will tell SBT where to create the new Project. Here is an example of a Project, where the folder we expect SBT to create will be called “SachasSBTDemo-App”.

    lazy val demoApp = Project (
        "SachasSBTDemo-App",
        file ("SachasSBTDemo-App"),
        settings = buildSettings
      )
    

    Project Dependencies

    We can also specify Project dependencies using “dependsOn”, which takes one or more other projects that this Project depends on.

    That means that the code in this Project can use the code from the Projects it depends on, and that SBT will compile those Projects first whenever we compile this one.

    lazy val demoApp = Project (
        "SachasSBTDemo-App",
        file ("SachasSBTDemo-App"),
        settings = buildSettings
      )
      //build these projects when main App project gets built
      .aggregate(common, server)
      .dependsOn(common, server)
    

    Project Aggregation

    We can also specify that a Project aggregates other projects, using “aggregate”, which takes one or more other projects that this Project aggregates.

    What “aggregate” means is that whenever we apply an action on the aggregating Project we should also see the same action applied to the aggregated Projects.

    lazy val demoApp = Project (
        "SachasSBTDemo-App",
        file ("SachasSBTDemo-App"),
        settings = buildSettings
      )
      //build these projects when main App project gets built
      .aggregate(common, server)
      .dependsOn(common, server)
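
    A note for anyone on a newer SBT: the Build trait used in this post was removed in SBT 1.0, where the same layout is normally expressed in a build.sbt file at the root instead. The following is only a rough, untested sketch of what the equivalent might look like, reusing the folder names and settings from above (library dependencies left out for brevity).

```scala
// build.sbt (sketch only, assumes SBT 1.x)
lazy val commonSettings = Seq(
  organization := "sas",
  version      := "1.0",
  scalaVersion := "2.11.5"
)

lazy val common = (project in file("SachasSBTDemo-Common"))
  .settings(commonSettings)

lazy val server = (project in file("SachasSBTDemo-Server"))
  .settings(commonSettings)
  .dependsOn(common)

lazy val demoApp = (project in file("SachasSBTDemo-App"))
  .settings(commonSettings)
  // running a task on demoApp also runs it on common and server
  .aggregate(common, server)
  // demoApp code can use classes from common and server
  .dependsOn(common, server)
```

    The two calls on demoApp do different jobs: aggregate is about which projects a command fans out to, while dependsOn is about the classpath and compile ordering.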
    

    Library Dependencies

    Just like the simple post I did before, we still need to bring in our JAR files using SBT. But this time we come up with a nicer way to manage them. We simply wrap them all up in a simple object, and then use the object to satisfy the various dependencies of the Projects. Much neater.

    import sbt._
    import Keys._
    
    
    object Dependencies {
      val jacksonjson = "org.codehaus.jackson" % "jackson-core-lgpl" % "1.7.2"
      val scalatest = "org.scalatest" % "scalatest_2.9.0" % "1.4.1" % "test"
    }
    
      // Sub-project specific dependencies
      val serverDeps = Seq (
         scalatest
      )
    
      .....
      .....
      lazy val server = Project (
        "server",
        file ("SachasSBTDemo-Server"),
        //bring in the library dependencies
        settings = buildSettings ++ Seq (libraryDependencies ++= serverDeps)
      ) dependsOn (common)
    

    The Finished Product

    The final product once run through SBT should be something like this if viewed in IntelliJ IDEA:

    image

     

    Or like this on the file system

    image

    If you want to grab my source files, they are available here at GitHub : https://github.com/sachabarber/SBT_MultiProject_Demo