Proactive, synthetic monitoring is critical for complex distributed systems like Apache Kafka®, especially when they are deployed on Kubernetes and the end-user experience is at stake. It is essential for keeping real-time data pipelines healthy. A key benefit for operations teams running Kafka on Kubernetes is infrastructure abstraction: a deployment can be configured once and run anywhere.
When integrated with Confluent Platform, Datadog can help visualize the performance of the Kafka cluster in real time and also correlate the performance of Kafka with the rest of your applications.
While Confluent recommends that customers use Confluent Cloud, which monitors clusters for you, there are situations where you may need to self-host a Confluent Platform cluster on a cloud provider or on premises.
This blog post shows how to get more comprehensive visibility into a Confluent Platform deployment that uses Confluent for Kubernetes (CFK) on Amazon Elastic Kubernetes Service (Amazon EKS), by collecting all Kafka telemetry data in one place and tracking it over time using Datadog.
Preliminary
Confluent for Kubernetes (CFK) is a cloud-native control plane for deploying and managing Confluent in your private cloud environment. It provides a standard and simple interface to customize, deploy, and manage Confluent Platform through a declarative API.
Datadog is a monitoring and analytics tool for IT and DevOps teams that can be used to track performance metrics and monitor events for infrastructure and cloud services. It can monitor services such as servers, databases, cloud infrastructure, system processes, serverless functions, etc. To get started on monitoring Kafka clusters using Datadog, you may refer to this documentation from Datadog.
Kubernetes, or K8s, is an open source platform that automates Linux container operations, eliminating manual procedures involved in deploying and scaling containerized applications. Amazon Elastic Kubernetes Service (EKS) is a managed service that lets you deploy, manage, and scale containerized applications on Kubernetes. Datadog helps you monitor your EKS environments in real time. Because Datadog already integrates with Kubernetes and AWS, it is ready-made to monitor EKS.
Deploy Confluent Platform with CFK on AWS EKS
The Confluent for Kubernetes (CFK) bundle contains Helm charts, templates, and scripts for deploying Confluent Platform to your Kubernetes cluster. You can deploy CFK using one of the following methods:
This blog post assumes you have Confluent Platform deployed on an AWS EKS cluster and running as described here. In order to get started with the AWS EKS cluster deployment, follow the steps in the documentation. Once you have the K8s cluster at your disposal, you can get started on installing CFK and Confluent Platform on the AWS EKS cluster nodes.
Creating Datadog API keys
API keys are unique to your organization. An API key is required by the Datadog agent to submit metrics and events to Datadog. Once you are logged into the Datadog console, navigate to Organization Settings in the Datadog UI and scroll to the API keys section. Create a new key and save it; you will need it later when installing the Datadog agent on the Kubernetes nodes that run Confluent Platform. For the next steps, refer to this documentation: Create API key.
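One way to keep the key out of values.yaml and shell history is to store it in a Kubernetes Secret and point the Helm chart at that Secret. The sketch below is only illustrative: the helper names are hypothetical, the Secret name datadog-secret is an assumption, and the format check assumes that Datadog API keys are 32 hexadecimal characters.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when the argument looks like a Datadog API key.
# Assumption: API keys are 32 hexadecimal characters.
looks_like_dd_api_key() {
  printf '%s\n' "$1" | grep -Eq '^[0-9a-fA-F]{32}$'
}

# Hypothetical helper: validate the key, then store it in a Secret.
# The Secret name "datadog-secret" is an assumption; the data key "api-key"
# is what the Datadog Helm chart reads when datadog.apiKeyExistingSecret is set.
store_dd_api_key() {
  if ! looks_like_dd_api_key "$1"; then
    echo "value does not look like a Datadog API key" >&2
    return 1
  fi
  kubectl create secret generic datadog-secret --from-literal api-key="$1"
}

# Usage (requires access to the cluster):
#   store_dd_api_key "$ddkey"
#   helm install ... --set datadog.apiKeyExistingSecret=datadog-secret
```

Failing early on a malformed value is cheaper than debugging an invalid API key message from the agent after the chart is installed.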
Install Datadog agents
First, the Datadog agent needs to be installed on every node of the K8s cluster to collect metrics, logs, and traces from your Kafka deployment. Before that, ensure that Kafka and ZooKeeper are exposing JMX data; then install and configure the Datadog agent on each of the producers, consumers, and brokers. The agent collects events and metrics from hosts and sends them to Datadog, where you can analyze your monitoring and performance data. It can run on local hosts (Windows, macOS), in containerized environments (Docker, Kubernetes), and in on-premises data centers. You can install and configure it using configuration management tools such as Chef, Puppet, or Ansible.
Configuring the Datadog site name
Datadog’s site name has to be set if you’re not using the default site, datadoghq.com. You can set it in the values.yaml file or, preferably, pass it via the Helm command as shown below.
Note: If the datadog.site variable is not explicitly set, it defaults to the US site datadoghq.com. If you are using one of the other sites (EU, US3, or US1-FED) this will result in an invalid API key message. Use Datadog’s documentation site selector to see appropriate names for the site you’re using.
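As a small illustration of the note above, the mapping from site label to the datadog.site value can be captured in a lookup. This is a sketch rather than an official list: the function name is hypothetical, and it covers only the sites named in this post, so check Datadog’s site selector for any others.

```shell
#!/bin/sh
# Hypothetical helper: print the datadog.site value for a Datadog site label.
# Only the sites mentioned in this post are covered.
dd_site_for() {
  case "$1" in
    US1)     echo "datadoghq.com" ;;      # chart default
    US3)     echo "us3.datadoghq.com" ;;
    EU)      echo "datadoghq.eu" ;;
    US1-FED) echo "ddog-gov.com" ;;
    *)
      echo "unknown Datadog site: $1" >&2
      return 1
      ;;
  esac
}

# Usage:
#   helm install ... --set datadog.site="$(dd_site_for EU)"
```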
To install the chart for Datadog, identify the right release name:
1. Install Helm.
2. Using the Datadog values.yaml configuration file as a reference, create a values.yaml parameterized for your enterprise. Datadog recommends that your values.yaml only contain values that need to be overridden, as it allows a smooth experience when upgrading chart versions. If this is a fresh install, add the Helm Datadog repo:
helm repo add datadog https://helm.datadoghq.com
helm repo update
3. Retrieve your Datadog API key from your agent installation instructions and run:
export ddkey="<DATADOG_API_KEY>"
helm install mbldd datadog/datadog -f values.yaml \
  --set datadog.apiKey="$ddkey" \
  --set targetSystem=linux \
  --set datadog.logs.enabled=true \
  --set datadog.logs.containerCollectAll=true \
  --set agents.image.tagSuffix=jmx \
  --set clusterChecksRunner.image.tagSuffix=jmx \
  --set datadog.site=datadoghq.eu   # only needed if you are not on the default US site
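The same overrides passed with --set in the command above can instead live in values.yaml, which keeps the Helm command short and the configuration reviewable. The following is a minimal sketch: the function name write_dd_values is hypothetical, and the keys simply mirror the --set flags used in this post.

```shell
#!/bin/sh
# Hypothetical helper: write a minimal values.yaml containing only overrides.
# $1: output path (default: values.yaml)
# $2: Datadog site (default: datadoghq.com)
write_dd_values() {
  out="${1:-values.yaml}"
  site="${2:-datadoghq.com}"
  cat > "$out" <<EOF
targetSystem: linux
datadog:
  site: ${site}
  logs:
    enabled: true
    containerCollectAll: true
agents:
  image:
    tagSuffix: jmx
clusterChecksRunner:
  image:
    tagSuffix: jmx
EOF
}

# Usage (the API key is still passed separately so it stays out of the file):
#   write_dd_values values.yaml datadoghq.eu
#   helm install mbldd datadog/datadog -f values.yaml --set datadog.apiKey="$ddkey"
```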
Integrating Confluent Platform with Datadog
Annotations with CP and CFK
Modify your Confluent Platform YAML file to add the Datadog annotations. Add the following annotations to each component-specific custom resource (CR); they are used for Datadog events. For Autodiscovery to work, the identifier that follows "ad.datadoghq.com/" must be the name of the CR, for example the name of the Kafka CR. The annotations apply to Kafka, ZooKeeper, Connect, and Schema Registry. Replace <cp-component> with the respective name.
spec:
  podTemplate:
    annotations:
      ad.datadoghq.com/<cp-component>.check_names: '["confluent_platform"]'
      ad.datadoghq.com/<cp-component>.init_configs: '[{"is_jmx": true, "collect_default_metrics": true, "service_check_prefix": "confluent", "new_gc_metrics": true, "collect_default_jvm_metrics": true}]'
      ad.datadoghq.com/<cp-component>.instances: '[{"host":"%%host%%","port":"7203","max_returned_metrics":300}]'
      ad.datadoghq.com/<cp-component>.logs: '[{"source":"confluent_platform","service":"confluent_platform"}]'
Refer to the complete Confluent Platform yaml in this GitHub repo. After all the annotations are configured correctly in each component Custom Resource, you will now redeploy Confluent Platform on K8s using the following command:
kubectl apply -f $CONFLUENT_HOME/confluent-platform-datadog-cfk.yaml
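Because the same four annotations are repeated for every component, it can help to generate them instead of copying them by hand. The helper below is a hypothetical sketch that prints the annotation lines for one component; the component names in the usage note are placeholders, and the port default of 7203 is taken from the example above.

```shell
#!/bin/sh
# Hypothetical helper: print the four Autodiscovery annotations for one component.
# $1: name that follows "ad.datadoghq.com/" (the CR name, per this post)
# $2: JMX port (default: 7203, as in the example above)
dd_annotations() {
  name="$1"
  port="${2:-7203}"
  prefix="ad.datadoghq.com/${name}"
  printf "%s.check_names: '%s'\n" "$prefix" '["confluent_platform"]'
  printf "%s.init_configs: '%s'\n" "$prefix" '[{"is_jmx": true, "collect_default_metrics": true, "service_check_prefix": "confluent", "new_gc_metrics": true, "collect_default_jvm_metrics": true}]'
  # %%host%% is passed as an argument, not a format string, so it is printed verbatim.
  printf "%s.instances: '%s'\n" "$prefix" "[{\"host\":\"%%host%%\",\"port\":\"${port}\",\"max_returned_metrics\":300}]"
  printf "%s.logs: '%s'\n" "$prefix" '[{"source":"confluent_platform","service":"confluent_platform"}]'
}

# Usage (placeholder names; use the names of your own CRs):
#   for c in kafka zookeeper connect schemaregistry; do dd_annotations "$c"; done
```

The output still has to be indented under spec.podTemplate.annotations in each CR before applying the file.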
Install Confluent Platform plugin on Datadog
Now it's time to integrate Confluent Platform with Datadog. Navigate to the “Integrations” section in the left-hand vertical menu and install the integration from the Datadog Confluent Platform integration tile, as shown in Figure 3.
Click the Install button on the Confluent Platform tile and you will now be presented with a widget that lets you configure the Datadog agents on your Kubernetes nodes where Confluent Platform’s Kafka clusters are located. Figures 4 and 5 demonstrate the overview of Confluent Platform-specific components from which Datadog collects JMX metrics and respective configurations.
Validate Datadog agent installation
When Datadog agents are installed on each of the K8s nodes, they should be displayed when you run the following command:
kubectl get pods -l app.kubernetes.io/component=agent
Desired output:
Open a shell in one of the Datadog agent pods and check the Datadog agent status:
kubectl exec -it <datadog-agent-pod> -- bash
agent status
Look for the jmxfetch section of the agent status output. It should now show the established Confluent Platform integration.
========
JMXFetch
========
  Information
  ==================
    runtime_version : 11.0.16
    version : 0.46.0
  Initialized checks
  ==================
    confluent_platform
      instance_name : confluent_platform-10.92.6.5-7203
      message : <no value>
      metric_count : 115
      service_check_count : 0
      status : OK
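If you want to script this check instead of reading the output by eye, the status text can be filtered with awk. The helper below is a hypothetical sketch that assumes the layout shown above, with the check name on its own line followed later by a "status : ..." line; it is not part of the Datadog agent.

```shell
#!/bin/sh
# Hypothetical helper: read `agent status` output on stdin and succeed only if
# the named check is followed by a "status : OK" line.
# $1: check name, e.g. confluent_platform
jmx_check_ok() {
  awk -v check="$1" '
    { line = $0; gsub(/^[ \t]+|[ \t]+$/, "", line) }   # trim surrounding whitespace
    line == check { in_check = 1; next }               # start of the check block
    in_check && line ~ /^status[ \t]*:/ {
      sub(/^status[ \t]*:[ \t]*/, "", line)
      found = (line == "OK")
      exit                                             # jump to END
    }
    END { exit(found ? 0 : 1) }
  '
}

# Usage (requires access to the cluster):
#   kubectl exec <datadog-agent-pod> -- agent status | jmx_check_ok confluent_platform
```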
Verify Confluent Platform dashboard
You now have a comprehensive dashboard that shows Confluent Platform metrics for producers, consumers, brokers, Connect, ISRs, under-replicated partitions, ksqlDB, and so on. Depending on your business needs, you can explore, slice, and dice the individual widgets.
Conclusion
Monitoring your Confluent Platform clusters running on Kubernetes on AWS supports proactive response, data security, and data gathering, and contributes to an overall healthy data pipeline. Datadog is a widely used SaaS solution for network monitoring, infrastructure management, and application monitoring, and many Confluent customers rely on it. This post walked through the integration of Confluent Platform with Datadog on a K8s platform like EKS, to monitor key metrics, logs, and traces from your Kafka environment. This gives you improved visibility into Kafka health and performance, and lets you create automated alerts tailored to your infrastructure needs.