Friday, 17 November 2023

Kafka in depth

 https://www.spiceworks.com/tech/data-management/articles/what-is-kafka/#_002

How Does Kafka Work?

Kafka combines the queuing and publish-subscribe models using a partitioned log model. A log is an ordered sequence of records, and these logs are divided into partitions that are assigned to individual subscribers. This enables greater scalability: several subscribers can read the same topic, each consuming from its own partition. Replayability is another part of the Kafka ecosystem. It lets multiple independent applications read from the same data stream, each working independently and at its own speed.


Applications known as producers send data records (i.e., messages) to a Kafka node (broker), where they are processed by other applications known as consumers. Kafka gathers data from a vast array of data sources and organizes it into “topics.”


For consumers to receive fresh messages, they must subscribe to the topic; the messages themselves are persisted automatically by the broker. Because topics can grow rather large, they are broken down into smaller partitions to improve performance and scalability.
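To make the producer → broker → topic flow concrete, here is a minimal sketch with the official Java client (kafka-clients). The broker address localhost:9092 and the topic name "customer-events" are assumptions for illustration, not values from the article:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class MinimalProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // try-with-resources flushes and closes the producer on exit.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The record is appended to one partition of the assumed "customer-events" topic;
            // records with the same key always land in the same partition.
            producer.send(new ProducerRecord<>("customer-events", "customer-42", "signed-up"));
        }
    }
}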



Kafka vs REST

https://stackoverflow.com/questions/57852689/kafka-msg-vs-rest-calls

HTTP is a synchronous protocol


There is a widespread misconception that HTTP is asynchronous. HTTP is a synchronous protocol, but your client can handle it asynchronously. For example, when you call a service over HTTP, your HTTP client may schedule the call on a background thread (async). However, the HTTP call still waits until either it times out or the response comes back; during all that time the call chain is blocked synchronously. Now if you have hundreds of requests at a time, you can imagine how many HTTP calls are waiting synchronously, and you may run out of sockets.
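To illustrate "synchronous protocol, asynchronous client", here is a sketch using Java 11's built-in HttpClient; the URL is a placeholder. sendAsync() frees the calling thread, but the exchange itself is still request/response, so a connection is held until the reply or the timeout arrives:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class AsyncHttpExample {
    public static void main(String[] args) {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://example.com/orders")) // placeholder URL
                .timeout(Duration.ofSeconds(5))                // a socket is still tied up until reply/timeout
                .build();

        // sendAsync returns immediately with a CompletableFuture, but the underlying
        // exchange is still synchronous request/response on the wire.
        client.sendAsync(request, HttpResponse.BodyHandlers.ofString())
              .thenAccept(response -> System.out.println(response.statusCode()))
              .join(); // wait here only so the demo doesn't exit early
    }
}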


AMQP


In a microservices architecture we often prefer AMQP (Advanced Message Queuing Protocol), which means the service drops the message in a queue and forgets about it. This is a truly asynchronous transport: your service is done once it drops the message in the queue, and interested services will pick it up.


This style is preferred because you can scale without worry, even when other services are down, since they will eventually receive the message/event/data.


So it really depends on your particular case. HTTP calls are easy to implement, but they don't scale well. Message services come with their own challenges, such as message ordering and worker management, but they make the architecture scalable and are the preferred way. For write operations, always prefer a queue; for read operations you can use HTTP, but make sure you are not building a long chain where one service calls another, which calls another.
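A sketch of the queue-first write path using a Kafka producer (the topic name "order-events" and the surrounding class are illustrative assumptions): send() returns immediately and the caller moves on, which is the "drop it in the queue and forget about it" style described above.

import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class FireAndForgetWriter {
    private final Producer<String, String> producer;

    public FireAndForgetWriter(Producer<String, String> producer) {
        this.producer = producer;
    }

    // The service "drops the message in the queue and forgets about it":
    // send() returns immediately and batches the record in the background;
    // the optional callback only reports failure, the caller never blocks.
    public void write(String orderId, String payload) {
        producer.send(new ProducerRecord<>("order-events", orderId, payload),
                (metadata, exception) -> {
                    if (exception != null) {
                        System.err.println("Async write failed: " + exception.getMessage());
                    }
                });
    }
}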



When will Kafka drop an idle consumer?

https://www.reddit.com/r/apachekafka/comments/uy20l6/when_will_kafka_subscriber_gets_dropped_removed/


Kafka consumers are ultimately removed when any offsets they have committed to the __consumer_offsets topic are TTL'd/dropped.


By default that retention is 7 days (the broker setting offsets.retention.minutes); the default has changed a bit over the years.


Kafka protocol 

https://kafka.apache.org/protocol.html

Kafka uses a binary protocol over TCP. The protocol defines all APIs as request/response message pairs. All messages are size delimited and are built up from a small set of primitive types defined in the protocol guide.
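To make "size delimited" concrete: on the wire, every request and response is prefixed with a 4-byte big-endian size field (an INT32 in the protocol spec). A toy sketch of reading one such frame, purely for illustration and not a real client:

import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

public class FrameReader {
    // Reads one size-delimited Kafka frame: a 4-byte big-endian length,
    // followed by exactly that many payload bytes.
    public static byte[] readFrame(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        int size = data.readInt();   // network byte order, as the protocol requires
        byte[] payload = new byte[size];
        data.readFully(payload);     // block until the whole message has arrived
        return payload;
    }
}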


The client initiates a socket connection and then writes a sequence of request messages and reads back the corresponding response message. No handshake is required on connection or disconnection. TCP is happier if you maintain persistent connections used for many requests to amortize the cost of the TCP handshake, but beyond this penalty connecting is pretty cheap.


The client will likely need to maintain a connection to multiple brokers, as data is partitioned and the clients will need to talk to the server that has their data. However it should not generally be necessary to maintain multiple connections to a single broker from a single client instance (i.e. connection pooling).
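In the Java client this multi-broker bookkeeping is automatic; you supply only a seed list. A sketch, with placeholder host names:

import java.util.Properties;

public class ClientConfig {
    public static Properties bootstrap() {
        Properties props = new Properties();
        // Only a seed list: the client fetches cluster metadata from these brokers
        // and then opens connections to whichever brokers lead the partitions it
        // actually reads from or writes to. No connection pooling is needed.
        props.put("bootstrap.servers", "broker1.example.com:9092,broker2.example.com:9092");
        return props;
    }
}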


The server guarantees that on a single TCP connection, requests will be processed in the order they are sent and responses will return in that order as well. The broker's request processing allows only a single in-flight request per connection in order to guarantee this ordering. Note that clients can (and ideally should) use non-blocking IO to implement request pipelining and achieve higher throughput. i.e., clients can send requests even while awaiting responses for preceding requests since the outstanding requests will be buffered in the underlying OS socket buffer. All requests are initiated by the client, and result in a corresponding response message from the server except where noted.
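Client-side pipelining depth is configurable. For example, the Java producer's max.in.flight.requests.per.connection setting (default 5) caps unacknowledged requests per broker connection; a minimal sketch with an assumed broker address:

import java.util.Properties;

public class PipeliningConfig {
    public static Properties producerProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed address
        // Up to 5 requests may be in flight per connection (the default).
        // Setting this to 1 trades throughput for strict per-connection
        // ordering even when a failed send is retried.
        props.put("max.in.flight.requests.per.connection", "5");
        return props;
    }
}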


The server has a configurable maximum limit on request size and any request that exceeds this limit will result in the socket being disconnected.


Kafka consumer implementation:

Using the Kafka SDK:

https://stackoverflow.com/questions/66301419/why-and-when-close-a-consumer

https://www.oreilly.com/library/view/kafka-the-definitive/9781491936153/ch04.html

try {
    while (true) {
        // consuming logic goes here
    }
} finally {
    consumer.close();
}


// Like all consumers, this is an infinite loop. While using the SDK, not only does the socket stay open, but the consumer also sends heartbeats to tell the Kafka broker that it is alive.


try {
    while (true) { // (1)
        ConsumerRecords<String, String> records = consumer.poll(100); // (2)
        for (ConsumerRecord<String, String> record : records) // (3)
        {
            log.debug("topic = %s, partition = %d, offset = %d, customer = %s, country = %s\n",
                record.topic(), record.partition(), record.offset(),
                record.key(), record.value());

            int updatedCount = 1;
            if (custCountryMap.containsKey(record.value())) {
                updatedCount = custCountryMap.get(record.value()) + 1;
            }
            custCountryMap.put(record.value(), updatedCount);

            JSONObject json = new JSONObject(custCountryMap);
            System.out.println(json.toString(4)); // (4)
        }
    }
} finally {
    consumer.close(); // (5)
}

(1) This is indeed an infinite loop. Consumers are usually long-running applications that continuously poll Kafka for more data. We will show later in the chapter how to cleanly exit the loop and close the consumer.
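The clean-exit pattern the book alludes to uses consumer.wakeup(), the one consumer method that is safe to call from another thread; it makes a blocked poll() throw WakeupException (org.apache.kafka.common.errors.WakeupException). A sketch, reconstructed from memory rather than quoted from the book:

// wakeup() forces a blocked poll() to throw WakeupException, letting the loop exit.
Runtime.getRuntime().addShutdownHook(new Thread(consumer::wakeup));

try {
    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(100);
        // ... process records as above ...
    }
} catch (WakeupException e) {
    // expected during shutdown; nothing to do
} finally {
    consumer.close(); // leave the group promptly so partitions are rebalanced
}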


(2) This is the most important line in the chapter. The same way that sharks must keep moving or they die, consumers must keep polling Kafka or they will be considered dead, and the partitions they are consuming will be handed to another consumer in the group to continue consuming. The parameter we pass to poll() is a timeout interval that controls how long poll() will block if data is not available in the consumer buffer. If this is set to 0, poll() will return immediately; otherwise, it will wait for the specified number of milliseconds for data to arrive from the broker.
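Side note: in Kafka clients 2.0 and later, the poll(long) overload shown above is deprecated in favor of poll(Duration), which bounds the total time the call may block:

import java.time.Duration;

// Equivalent call with the newer API; blocks at most ~100 ms waiting for data.
ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));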


(3) poll() returns a list of records. Each record contains the topic and partition the record came from, the offset of the record within the partition, and of course the key and the value of the record. Typically we want to iterate over the list and process the records individually.


(4) Processing usually ends in writing a result to a data store or updating a stored record. Here, the goal is to keep a running count of customers from each country, so we update a hashtable and print the result as JSON. A more realistic example would store the updated result in a data store.


(5) Always close() the consumer before exiting. This will close the network connections and sockets. It will also trigger a rebalance immediately, rather than waiting for the group coordinator to discover that the consumer stopped sending heartbeats and is likely dead, which would take longer and therefore leave a longer period during which consumers can't consume messages from a subset of the partitions.

Kafka vs RabbitMQ:


https://aws.amazon.com/compare/the-difference-between-rabbitmq-and-kafka/#:~:text=RabbitMQ%20and%20Apache%20Kafka%20move,exchange%20of%20continuous%20big%20data.

How do Kafka and RabbitMQ handle messaging differently?

RabbitMQ and Apache Kafka move data from producers to consumers in different ways. RabbitMQ is a general-purpose message broker that prioritizes end-to-end message delivery. Kafka is a distributed event streaming platform that supports the real-time exchange of continuous big data.


