Flink Monitoring
This article lists the steps for exposing Flink metrics to Prometheus in two different setups: Flink running in a container and Flink running directly on a server.
Flink provides built-in support for Prometheus. You need to configure Flink to expose its metrics in a format that Prometheus can scrape.
1. Scenario where Flink runs in a container
Enable Prometheus metrics in Flink:
Step 1 : Add the following lines to the docker-compose.yaml file of the Flink container.
Lines -
metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter
metrics.reporter.prom.port: 9256
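Publish port 9256 on the Docker container. Any free port can be used.
The port line is optional. If it is omitted, the Prometheus reporter listens on its default port, 9249.
metrics.reporter.prom.port: 9256 # optional, defaults to 9249
Newer Flink versions may reject the metrics.reporter.prom.class key (see the note in scenario 2). In that case, use metrics.reporter.prom.factory.class with PrometheusReporterFactory instead.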
Step 2 : Add the same settings to the taskmanager configuration, using a different port (9257 in this example).
Make sure the jobmanager and taskmanager containers are on the same network.
Step 3 : Run the container
# docker-compose up -d
Step 4 : Add the Flink metrics ports as scrape targets in the prometheus.yaml file
SSH into the Prometheus server.
Open the prometheus.yaml file and add the following lines:
- job_name: 'flink-sit'
  scrape_interval: 5s
  scrape_timeout: 5s
  static_configs:
    - targets: ['13.114.144.179:9257','13.114.144.179:9256']
where,
13.114.144.179 is the public IP of the instance where the Flink containers are running
9257 -> metrics port of the taskmanager
9256 -> metrics port of the jobmanager
Then restart Prometheus and check its status:
# sudo systemctl restart prometheus
# sudo systemctl status prometheus
Step 5 : Add ports 9257 and 9256 to the inbound rules of the Flink server's security group so that the Prometheus server can scrape the metrics.
If a port is not open in the security group, Prometheus shows an error for that target, for example:
context deadline exceeded
Step 6 : Check the Prometheus URL
# http://<pub-ip-prometheus-server>:9090
Go to /targets.
View the metrics by clicking on the endpoint.
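Before looking at the Prometheus UI, it can help to confirm that the Flink reporter itself is serving metrics. The following is a minimal sketch, not part of the original setup: the function names are made up for illustration, and it assumes the reporter answers on the /metrics path (the path Prometheus scrapes by default) at the host and port configured above.

```python
import urllib.request


def parse_metric_names(exposition_text):
    """Return the sorted, unique metric names found in Prometheus text format."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # A sample line looks like `name{label="v"} 1.0` or `name 1.0`.
        name = line.split("{", 1)[0].split(" ", 1)[0]
        names.add(name)
    return sorted(names)


def fetch_metric_names(host, port, timeout=5):
    """Fetch http://host:port/metrics and return the metric names it exposes.

    host and port are assumptions: use the values from your own setup,
    for example ("localhost", 9256) for the jobmanager in this article.
    """
    url = f"http://{host}:{port}/metrics"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_metric_names(response.read().decode("utf-8"))
```

Calling fetch_metric_names("localhost", 9256) on the Flink host should return a non-empty list if the jobmanager reporter is up; an exception at this point suggests a problem on the Flink side rather than in Prometheus.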
Sample docker-compose.yaml file for Flink:
# Example docker-compose.yaml file of flink container
version: '3.4'
services:
  summarization-jobmanager:
    restart: always
    image: flink:latest
    container_name: summarization-jobmanager
    ports:
      - "8088:8081"
      - "9256:9256"
    command: jobmanager
    environment:
      - |
        FLINK_PROPERTIES=
        jobmanager.rpc.address: summarization-jobmanager
        metrics.reporters: prom
        metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter
        metrics.reporter.prom.port: 9256
  summarization-taskmanager:
    restart: always
    image: flink:latest
    container_name: summarization-taskmanager
    depends_on:
      - summarization-jobmanager
    ports:
      - "9257:9257"
    command: taskmanager
    environment:
      - |
        FLINK_PROPERTIES=
        jobmanager.rpc.address: summarization-jobmanager
        taskmanager.numberOfTaskSlots: 6
        taskmanager.memory.process.size: 2000m
        metrics.reporters: prom
        metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter
        metrics.reporter.prom.port: 9257
networks:
  default:
    external:
      name: flink-network
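The ports published in the compose file are the same ports that go into the Prometheus targets list. As a small illustration of that mapping, here is a hedged sketch that renders a scrape job entry from a host and a list of reporter ports. The function name render_scrape_job is hypothetical, and the output mirrors the job layout used in this article rather than every option Prometheus supports.

```python
def render_scrape_job(job_name, host, ports, interval="5s"):
    """Render a Prometheus scrape job entry as YAML text.

    One target is emitted per reporter port, so the jobmanager and
    taskmanager ports published by the containers both become targets.
    """
    targets = ",".join(f"'{host}:{port}'" for port in ports)
    return "\n".join([
        f"- job_name: '{job_name}'",
        f"  scrape_interval: {interval}",
        f"  scrape_timeout: {interval}",
        "  static_configs:",
        f"    - targets: [{targets}]",
    ])
```

For example, render_scrape_job("flink-sit", "13.114.144.179", [9257, 9256]) produces the job entry shown in Step 4, which can be pasted under scrape_configs.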
2. Scenario where Flink runs on a server (EC2 in the case of AWS)
To set up monitoring for this, we followed these steps -
Step 1 : Make the PrometheusReporter jar available on the classpath of the Flink cluster (it comes with the Flink distribution; adjust the version in the file name to match your installation):
# cp /opt/flink/opt/flink-metrics-prometheus-1.7.2.jar /opt/flink/lib
Step 2 : Open the flink-conf.yaml file and add the following lines to export Prometheus metrics
Lines :
# Expose metrics on the configured port to Prometheus reporter
metrics.reporters: prom
metrics.reporter.prom.factory.class: org.apache.flink.metrics.prometheus.PrometheusReporterFactory
taskmanager.network.detailed-metrics: true
metrics.system-resource: true
metrics.system-resource-probing-interval: 5000
metrics.reporter.prom.port: 9250-9251
Note :
My Flink version is 1.19.1, and I was initially using the following lines -
# Expose metrics on the configured port to Prometheus reporter
metrics.enabled: true
metrics.reporters: prom
metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter
metrics.reporter.prom.port: 9259
However, the line
metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter
is not supported in this version and causes an error in the task manager logs.
So I replaced it with -
metrics.reporter.prom.factory.class: org.apache.flink.metrics.prometheus.PrometheusReporterFactory
With this configuration Flink exposes metrics on two ports: one for the taskmanager metrics and one for the jobmanager metrics.
So we can either give a wider range, such as
metrics.reporter.prom.port: 9250-9260
in which case Flink picks two ports from the range 9250-9260 to expose the metrics,
or
we can give exactly the ports Flink should use, such as
metrics.reporter.prom.port: 9250-9251
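With a wide range it is not obvious in advance which ports Flink ended up using. The sketch below, which is only an illustration, expands a range written in the form used above and reports which of those ports currently accept a TCP connection. The function names are hypothetical, and only the two forms shown in this article (a single port or first-last) are handled.

```python
import socket


def expand_port_range(spec):
    """Expand a port setting such as "9250-9260" or "9256" into a list of ints."""
    spec = spec.strip()
    if "-" in spec:
        first, last = (int(part) for part in spec.split("-", 1))
        return list(range(first, last + 1))
    return [int(spec)]


def listening_ports(host, spec, timeout=1.0):
    """Return the ports from the range that accept a TCP connection on host."""
    found = []
    for port in expand_port_range(spec):
        try:
            with socket.create_connection((host, port), timeout=timeout):
                found.append(port)
        except OSError:
            # Closed, filtered, or unreachable: not a usable scrape target.
            pass
    return found
```

Running listening_ports("localhost", "9250-9260") on the Flink server after the restart in the next step gives the ports to put in the Prometheus targets list. An open port only shows that something is listening there, so it is still worth confirming that it serves Flink metrics.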
Step 3 : Restart Flink
# ./stop-cluster.sh
# ./start-cluster.sh
Step 4 : Configure the prometheus.yaml file
- job_name: 'Flink-DS'
  scrape_interval: 5s
  scrape_timeout: 5s
  static_configs:
    - targets: ['52.33.9.91:9250','52.33.9.91:9251']
where,
52.33.9.91 -> IP of the server where Flink is running
9250, 9251 -> ports where the Flink metrics are exposed
Step 5 : Restart prometheus
# sudo systemctl restart prometheus
# sudo systemctl status prometheus
Step 6 : Add ports 9250 and 9251 to the inbound rules of the Flink server's security group so that the Prometheus server can scrape the metrics.
Step 7 : Check the Prometheus URL
# http://<pub-ip-prometheus-server>:9090
Go to /targets.
View the metrics by clicking on the endpoint.
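The same information shown on the /targets page is also available from the Prometheus HTTP API at /api/v1/targets, which is handy for a scripted check. The following is a rough sketch under a few assumptions: the helper names are made up, the base URL is a placeholder for your Prometheus server, and only the fields used here (labels, scrapeUrl, health, lastError) are read from the response.

```python
import json
import urllib.request


def summarize_targets(api_response, job_name):
    """Return (scrape_url, health, last_error) for each active target of a job."""
    rows = []
    for target in api_response["data"]["activeTargets"]:
        if target["labels"].get("job") == job_name:
            rows.append(
                (target["scrapeUrl"], target["health"], target.get("lastError", ""))
            )
    return rows


def fetch_targets(prometheus_url, timeout=5):
    """Fetch the targets document from a Prometheus server.

    prometheus_url is an assumption, for example "http://localhost:9090".
    """
    url = f"{prometheus_url}/api/v1/targets"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

For instance, summarize_targets(fetch_targets("http://localhost:9090"), "Flink-DS") run on the Prometheus server lists each Flink endpoint with its health ("up" or "down") and the last scrape error, if any.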
Note :
If Flink is not reachable on the port assigned in the Flink configuration file,
i.e. metrics.reporter.prom.port (9250 or 9251 here),
then Prometheus shows a "connect: connection refused" error for that target.
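The two errors mentioned in this article usually point to different causes, and a plain TCP probe from the Prometheus server can help tell them apart. This is a minimal sketch with hypothetical function names. The mapping in the comments is a common interpretation, not something guaranteed by Prometheus: a refused connection typically means nothing is listening on the port, while a timeout typically means the traffic is being dropped, for example by a security group.

```python
import socket


def classify_error(exc):
    """Map a connection outcome to a short label."""
    if exc is None:
        return "open"      # something accepted the connection
    if isinstance(exc, ConnectionRefusedError):
        return "refused"   # host reachable, nothing listening on the port
    if isinstance(exc, (socket.timeout, TimeoutError)):
        return "timeout"   # no answer, often a blocked port
    return "error"         # DNS failure, unreachable network, and so on


def probe(host, port, timeout=3.0):
    """Try a TCP connection to host:port and classify the result."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return classify_error(None)
    except OSError as exc:
        return classify_error(exc)
```

Run from the Prometheus server, probe("52.33.9.91", 9250) returning "refused" would match the "connection refused" case above, while "timeout" would match the "context deadline exceeded" case from the security group step.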
These metrics can then be used to build dashboards in Grafana.
Hope you find this useful.
Written by Sonal Kumar