Access to advanced Kafka telemetry exposed by librdkafka statistics

Jose_Pena
Employee



The Kafka target endpoint uses the librdkafka library to provide a Producer client from Replicate to Kafka clusters. librdkafka can emit internal metrics at regular intervals if the Producer is configured accordingly. Today, the Kafka target endpoint does not expose those metrics for analysis.

To make the Producer emit these metrics, the statistics.interval.ms configuration property must be set to a value greater than 0, and a callback (normally registered through the stats_cb property) must be provided to handle and store the statistics produced.
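As a rough illustration of the mechanism described above, here is a minimal sketch using the confluent-kafka Python client, which wraps librdkafka and accepts both statistics.interval.ms and stats_cb as configuration entries. This is not how Replicate configures its producer; the broker address, the interval value, and the helper names (on_stats, build_producer_config) are assumptions made for the example.

```python
import json

collected = []  # in-memory store for parsed statistics dumps (illustrative only)


def on_stats(stats_json_str):
    """Callback that receives one JSON document per statistics interval."""
    collected.append(json.loads(stats_json_str))


def build_producer_config(bootstrap_servers, interval_ms):
    """Return a producer config that enables librdkafka statistics."""
    if interval_ms <= 0:
        raise ValueError("statistics.interval.ms must be > 0 to enable statistics")
    return {
        "bootstrap.servers": bootstrap_servers,  # assumption: placeholder address
        "statistics.interval.ms": interval_ms,
        "stats_cb": on_stats,
    }


conf = build_producer_config("localhost:9092", 5000)

# With confluent-kafka installed, the config would be used roughly like this:
#   from confluent_kafka import Producer
#   producer = Producer(conf)
#   producer.poll(0)  # statistics callbacks are served from poll()/flush()
```

The callback only fires while the application keeps calling poll() or flush(), so a producer loop that never polls would see no statistics even with the interval set.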

The feature request consists of:

  • Exposing the property statistics.interval.ms, for example as an "rdkafkaProperty" internal parameter in the Kafka endpoint connection configuration.
  • Showing the produced statistics either in a designated log component of the logging system or in a dedicated log file.
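The second point could look roughly like the sketch below, which writes each statistics dump as one compact JSON line to a dedicated file. It uses only the Python standard library; the logger name "kafka_stats" and the helper names are hypothetical and say nothing about how Replicate's logging system is organized.

```python
import json
import logging


def make_stats_logger(path):
    """Create a logger that writes only librdkafka statistics to `path`."""
    logger = logging.getLogger("kafka_stats")  # hypothetical component name
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep statistics dumps out of the main log
    handler = logging.FileHandler(path, encoding="utf-8")
    handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
    logger.addHandler(handler)
    return logger


def log_stats(logger, stats_json_str):
    """Write one statistics dump as a single compact JSON line."""
    compact = json.dumps(json.loads(stats_json_str), separators=(",", ":"))
    logger.info(compact)
```

One line per dump keeps the file easy to tail and to load into analysis tools afterwards.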

Access to this advanced telemetry is important at every stage of a use-case implementation, and most of all during performance testing and fine-tuning. The granularity provided by librdkafka statistics is essential for configuration and performance analysis: it makes it possible to anticipate potential throughput or latency issues in production environments and to find root causes when production issues occur.

Telemetry description

The advanced Kafka telemetry exposed by librdkafka has the following levels:

  • Top-level: general statistics considering all brokers
  • Brokers: per broker statistics
  • Topics: per topic statistics
    • Partitions: per partition statistics

As most operations are windowed (they operate on slices of time), the Brokers and Topics levels include Window stats: moving average, smallest and largest values, sum of values, percentile values, etc.
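To make the nesting concrete, the sketch below walks a heavily trimmed statistics dump level by level. The values are invented for illustration, and the key names "brokers" and "topics" are taken from the librdkafka statistics JSON rather than from this post; real dumps carry many more fields.

```python
import json

# Heavily trimmed, hypothetical statistics dump (all values are made up).
SAMPLE = json.loads("""
{
  "tx": 120, "txmsgs": 5000,
  "brokers": {
    "broker1:9092/1": {"state": "UP", "rtt": {"avg": 1500, "p99": 4200}}
  },
  "topics": {
    "orders": {
      "batchsize": {"avg": 16000, "p99": 30000},
      "partitions": {
        "0": {"msgq_cnt": 10, "txmsgs": 2600},
        "1": {"msgq_cnt": 0, "txmsgs": 2400}
      }
    }
  }
}
""")


def describe(stats):
    """Return one line per level: top-level, brokers, topics, partitions."""
    lines = [f"top-level: tx={stats['tx']} txmsgs={stats['txmsgs']}"]
    for name, broker in stats.get("brokers", {}).items():
        lines.append(
            f"broker {name}: state={broker['state']} rtt.p99={broker['rtt']['p99']}"
        )
    for tname, topic in stats.get("topics", {}).items():
        lines.append(f"topic {tname}: batchsize.avg={topic['batchsize']['avg']}")
        for pid, part in topic.get("partitions", {}).items():
            lines.append(
                f"  partition {pid}: msgq_cnt={part['msgq_cnt']} txmsgs={part['txmsgs']}"
            )
    return lines


for line in describe(SAMPLE):
    print(line)
```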

Each level provides valuable telemetry about librdkafka acting as a producer, and therefore about Replicate as a producer. Below are examples of fields that describe producer performance:

Top-level

| Field | Type | Description |
|---|---|---|
| tx | int | Total number of requests sent to Kafka brokers |
| tx_bytes | int | Total number of bytes transmitted to Kafka brokers |
| rx | int | Total number of responses received from Kafka brokers |
| rx_bytes | int | Total number of bytes received from Kafka brokers |
| txmsgs | int | Total number of messages transmitted (produced) to Kafka brokers |
| txmsg_bytes | int | Total number of message bytes (including framing, such as per-Message framing and MessageSet/batch framing) transmitted to Kafka brokers |
| rxmsgs | int | Total number of messages consumed, not including ignored messages (due to offset, etc), from Kafka brokers. |
| rxmsg_bytes | int | Total number of message bytes (including framing) received from Kafka brokers |
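Because the top-level fields are running totals, throughput has to be derived from the difference between two consecutive dumps. The sketch below shows one way to do that, assuming the caller knows the interval between dumps (for example from statistics.interval.ms); the function name and the sample numbers are invented for illustration.

```python
def producer_rates(prev, curr, interval_s):
    """Per-second rates derived from two consecutive top-level dumps."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    fields = ("tx", "tx_bytes", "txmsgs", "txmsg_bytes")
    return {f + "_per_s": (curr[f] - prev[f]) / interval_s for f in fields}


# Hypothetical counters from two dumps taken 5 seconds apart.
prev = {"tx": 100, "tx_bytes": 1_000_000, "txmsgs": 4000, "txmsg_bytes": 900_000}
curr = {"tx": 150, "tx_bytes": 1_500_000, "txmsgs": 6500, "txmsg_bytes": 1_350_000}
rates = producer_rates(prev, curr, interval_s=5.0)
print(rates)
```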

 

Brokers

| Field | Type | Description |
|---|---|---|
| state | string | Broker state (INIT, DOWN, CONNECT, AUTH, APIVERSION_QUERY, AUTH_HANDSHAKE, UP, UPDATE) |
| stateage | int gauge | Time since last broker state change (microseconds) |
| outbuf_cnt | int gauge | Number of requests awaiting transmission to broker |
| outbuf_msg_cnt | int gauge | Number of messages awaiting transmission to broker |
| waitresp_cnt | int gauge | Number of requests in-flight to broker awaiting response |
| waitresp_msg_cnt | int gauge | Number of messages in-flight to broker awaiting response |
| tx | int | Total number of requests sent |
| txbytes | int | Total number of bytes sent |
| txerrs | int | Total number of transmission errors |
| txretries | int | Total number of request retries |
| req_timeouts | int | Total number of requests timed out |
| rx | int | Total number of responses received |
| rxbytes | int | Total number of bytes received |
| rxerrs | int | Total number of receive errors |
| rxcorriderrs | int | Total number of unmatched correlation ids in response (typically for timed out requests) |
| rxpartial | int | Total number of partial MessageSets received. The broker may return partial responses if the full MessageSet could not fit in the remaining Fetch response size. |
| disconnects | int | Number of disconnects (triggered by broker, network, load-balancer, etc.). |
| int_latency | object | Internal producer queue latency in microseconds. See Window stats below |
| outbuf_latency | object | Internal request queue latency in microseconds. This is the time between a request is enqueued on the transmit (outbuf) queue and the time the request is written to the TCP socket. Additional buffering and latency may be incurred by the TCP stack and network. See Window stats below |
| rtt | object | Broker latency / round-trip time in microseconds. See Window stats below |
| throttle | object | Broker throttling time in milliseconds. See Window stats below |
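As an example of how the broker fields might be used during performance testing, the sketch below flags brokers that are not UP, have a backlog of requests, or show a high round-trip time. The thresholds and the function name are arbitrary assumptions, not recommended values.

```python
def broker_warnings(brokers, rtt_p99_limit_us=50_000, outbuf_limit=100):
    """Return human-readable warnings for brokers that look unhealthy."""
    warnings = []
    for name, b in brokers.items():
        if b.get("state") != "UP":
            warnings.append(f"{name}: state is {b.get('state')}")
        if b.get("outbuf_cnt", 0) > outbuf_limit:
            warnings.append(f"{name}: {b['outbuf_cnt']} requests awaiting transmission")
        rtt_p99 = b.get("rtt", {}).get("p99", 0)  # microseconds
        if rtt_p99 > rtt_p99_limit_us:
            warnings.append(f"{name}: rtt p99 {rtt_p99 / 1000:.1f} ms")
    return warnings


# Hypothetical per-broker statistics.
sample_brokers = {
    "a": {"state": "UP", "outbuf_cnt": 2, "rtt": {"p99": 1000}},
    "b": {"state": "DOWN", "outbuf_cnt": 250, "rtt": {"p99": 75000}},
}
for warning in broker_warnings(sample_brokers):
    print(warning)
```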

 

Topics

| Field | Type | Description |
|---|---|---|
| batchsize | object | Batch sizes in bytes. See Window stats below |
| batchcnt | object | Batch message counts. See Window stats below |
| partitions | object | Partitions dict, key is partition id. See partitions below. |

 

Partitions

| Field | Type | Description |
|---|---|---|
| msgq_cnt | int gauge | Number of messages waiting to be produced in first-level queue |
| msgq_bytes | int gauge | Number of bytes in msgq_cnt |
| xmit_msgq_cnt | int gauge | Number of messages ready to be produced in transmit queue |
| xmit_msgq_bytes | int gauge | Number of bytes in xmit_msgq |
| fetchq_cnt | int gauge | Number of pre-fetched messages in fetch queue |
| fetchq_size | int gauge | Bytes in fetchq |
| committed_offset | int gauge | Last committed offset |
| txmsgs | int | Total number of messages transmitted (produced) |
| txbytes | int | Total number of bytes transmitted for txmsgs |
| rxbytes | int | Total number of bytes received for rxmsgs |
| msgs | int | Total number of messages received (consumer, same as rxmsgs), or total number of messages produced (possibly not yet transmitted) (producer). |
| msgs_inflight | int gauge | Current number of messages in-flight to/from broker |
| next_ack_seq | int gauge | Next expected acked sequence (idempotent producer) |
| next_err_seq | int gauge | Next expected errored sequence (idempotent producer) |
| acked_msgid | int | Last acked internal message id (idempotent producer) |
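The partition gauges can be rolled up per topic to see where messages are piling up. The sketch below sums the two producer queues and the in-flight count for each topic; the function name and sample values are hypothetical, and the input is assumed to be the topics object of a statistics dump.

```python
def topic_backlog(topics):
    """Sum queued (msgq_cnt + xmit_msgq_cnt) and in-flight messages per topic."""
    result = {}
    for tname, topic in topics.items():
        queued = 0
        inflight = 0
        for part in topic.get("partitions", {}).values():
            queued += part.get("msgq_cnt", 0) + part.get("xmit_msgq_cnt", 0)
            inflight += part.get("msgs_inflight", 0)
        result[tname] = {"queued": queued, "inflight": inflight}
    return result


# Hypothetical topics object with two partitions.
sample_topics = {
    "orders": {
        "partitions": {
            "0": {"msgq_cnt": 10, "xmit_msgq_cnt": 5, "msgs_inflight": 100},
            "1": {"msgq_cnt": 0, "xmit_msgq_cnt": 2, "msgs_inflight": 50},
        }
    }
}
print(topic_backlog(sample_topics))
```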

 

Window stats

| Field | Type | Description |
|---|---|---|
| min | int gauge | Smallest value |
| max | int gauge | Largest value |
| avg | int gauge | Average value |
| sum | int gauge | Sum of values |
| cnt | int gauge | Number of values sampled |
| stddev | int gauge | Standard deviation (based on histogram) |
| hdrsize | int gauge | Memory size of Hdr Histogram |
| p50 | int gauge | 50th percentile |
| p75 | int gauge | 75th percentile |
| p90 | int gauge | 90th percentile |
| p95 | int gauge | 95th percentile |
| p99 | int gauge | 99th percentile |
| p99_99 | int gauge | 99.99th percentile |
| outofrange | int gauge | Values skipped due to out of histogram range |
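For log output, a window-stats object can be condensed into a single line. The sketch below formats a few of the fields listed above and converts microseconds to milliseconds by default; the function name, the choice of fields, and the sample values are assumptions for illustration.

```python
def summarize_window(name, window, divisor=1000.0, unit="ms"):
    """Format the key percentiles of a window-stats object on one line."""
    if window.get("cnt", 0) == 0:
        return f"{name}: no samples"
    keys = ("min", "p50", "p95", "p99", "max")
    parts = [f"{k}={window[k] / divisor:.1f}{unit}" for k in keys]
    return f"{name}: " + " ".join(parts) + f" (n={window['cnt']})"


# Hypothetical rtt window, values in microseconds.
rtt = {"min": 500, "p50": 1500, "p95": 3000, "p99": 4200, "max": 9000, "cnt": 42}
print(summarize_window("rtt", rtt))
```

For fields already reported in milliseconds, such as throttle, the divisor would be 1.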

 

Telemetry example

Attached is an example of the information returned in every statistics dump at regular intervals.

4 Comments
Ola_Mayer
Employee
 
Status changed to: Open - Collecting Feedback
David_E
Contributor

This would be extremely helpful.  Many of us are pushing Apache Kafka into new territories and developing blind is very difficult and sometimes very time consuming.  To see details of what's happening from Producer clients to broker would be big for us.

Meghann_MacDonald

From now on, please track this idea from the Ideation portal. 

Link to new idea

Meghann


Ideation
Explorer II
 
Status changed to: Closed - Archived