Understanding and using performance metrics in Talend ESB Web Services

TalendSolutionExpert · Jan 22, 2024 9:35:30 PM

This article demonstrates how to monitor the performance of Web Services deployed in Talend Runtime. It explains the statistics generated by Talend Web Services and how to use these metrics, through JMX and Nagios, to improve quality of service.

The Metrics feature of the CXF management module provides aggregate statistics, such as response times, endpoint status, and throughput, for Web Services running in the CXF Bus. Understanding these metrics facilitates the development of high-performance Web Services, and on Production systems it helps you find problems at an early stage and troubleshoot them faster.

Prerequisites

Linux distribution (this example uses Ubuntu 17.10)
Active Internet connection, to download dependencies and installation packages
Talend ESB installed
DemoService (the ESB demo sample in Studio) deployed successfully
JConsole, for monitoring the metrics using JMX
Nagios installed and configured with ESB templates for monitoring (optional)

Monitoring Web Services using Nagios (optional)

Nagios is an open source monitoring solution that allows users to identify infrastructure problems before they affect important business processes. Nagios monitors the entire IT infrastructure to ensure that services, applications, and business processes are working as expected. Talend ESB can also be monitored using Nagios.

Prerequisites for monitoring Talend ESB with Nagios

Install Nagios and Jmx4Perl.
Configure the CXF template files cxf.cfg, cxf-host.cfg, and jmx_commands.cfg, delivered with Talend Runtime in the directory TalendRuntimePath/add-ons/adapters/nagios. For details, see Talend ESB Nagios configuration template files.

Note: The CXF template files, located in the cxf_templates_nagios.zip file attached to this article, are customized for this demonstration. If you use these templates, modify the Critical and Warning status parameters to reflect your business needs.

Host monitoring using Nagios

Monitor host_esb to display the statistics of the host, for example, host status and up-time.

Monitoring Talend Runtime Server using Nagios

Monitoring Web Services using JMX

If you are not using a monitoring solution like Nagios, you can monitor the JMX metrics using JConsole.

If your Talend Runtime is installed on a remote server, enable JMX monitoring by modifying the settings in the container/etc/SERVICE_NAME-wrapper.conf file (for example, Talend-Runtime-wrapper.conf or karaf-wrapper.conf):

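# Uncomment or add the lines below to enable remote JMX.
# Adjust the numbers so that they continue the existing wrapper.java.additional sequence in your file.
wrapper.java.additional.10=-Dcom.sun.management.jmxremote.port=1616
# Authentication and SSL are disabled here for demonstration purposes only.
wrapper.java.additional.11=-Dcom.sun.management.jmxremote.authenticate=false
wrapper.java.additional.12=-Dcom.sun.management.jmxremote.ssl=false
# Replace the placeholder with the host name or IP address of the Runtime server.
wrapper.java.additional.13=-Djava.rmi.server.hostname=HOST_NAME_OR_IP_ADDRESS_OF_RUNTIME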

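Open JConsole, connect to the Talend Runtime process (for a remote server, use HOST_NAME:1616, the JMX port configured above), and navigate to MBeans > Metrics.Server.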

Metrics MBeans for DemoService

Using the Metrics feature in Talend ESB

The Metrics feature in Talend ESB is implemented using the Codahale metrics library, and provides several commonly used metric types such as Meter, Counter, Timer, and Histogram, to output metrics values. Before analyzing these metrics, you need to know some basics about the metric types.

Meter: measures the count and rate of event occurrences. A Meter reports the mean throughput and the one-, five-, and fifteen-minute exponentially weighted moving average throughputs.

MBeans type Meter and its attributes

Average Rates

nMinuteRate: the exponentially weighted moving average rate at which events have occurred over the last n minutes, where n = 1, 5, or 15 (OneMinuteRate, FiveMinuteRate, and FifteenMinuteRate).
MeanRate: the average rate at which events have occurred since the meter was created.

An example from Wikipedia:

Comparison of common averages of values { 1, 2, 2, 3, 4, 7, 9 }
Type	Description	Example	Result
Mean	Sum of values of a data set divided by number of values	(1+2+2+3+4+7+9) / 7	4
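The difference between MeanRate and the nMinuteRate attributes can be illustrated with a small simulation. The following is a minimal sketch, not the library's code: it assumes a 5-second tick interval and the usual exponential smoothing formula, modeled on how the Codahale library is generally described as computing its moving averages. The class and function names (Ewma, simulate) are hypothetical.

```python
import math

TICK_SECONDS = 5  # assumed tick interval between moving-average updates


class Ewma:
    """Exponentially weighted moving average of an event rate (events/sec)."""

    def __init__(self, window_minutes):
        # Smoothing factor applied on each tick for the given window length.
        self.alpha = 1 - math.exp(-TICK_SECONDS / 60 / window_minutes)
        self.rate = None  # no value until the first tick

    def tick(self, events_in_tick):
        instant_rate = events_in_tick / TICK_SECONDS
        if self.rate is None:
            self.rate = instant_rate  # the first tick seeds the average
        else:
            self.rate += self.alpha * (instant_rate - self.rate)


def simulate(load_per_sec, load_seconds, idle_seconds):
    """Return (mean_rate, one_minute_rate, five_minute_rate) at the end of the run."""
    m1, m5 = Ewma(1), Ewma(5)
    total_events = 0
    ticks_loaded = load_seconds // TICK_SECONDS
    ticks_idle = idle_seconds // TICK_SECONDS
    for i in range(ticks_loaded + ticks_idle):
        # Constant load first, then no events at all.
        events = load_per_sec * TICK_SECONDS if i < ticks_loaded else 0
        total_events += events
        m1.tick(events)
        m5.tick(events)
    mean_rate = total_events / (load_seconds + idle_seconds)
    return mean_rate, m1.rate, m5.rate


mean_rate, m1_rate, m5_rate = simulate(load_per_sec=180, load_seconds=60, idle_seconds=60)
print(f"MeanRate={mean_rate:.1f} OneMinuteRate={m1_rate:.1f} FiveMinuteRate={m5_rate:.1f}")
```

Under these assumptions, one minute of load at 180 events/sec followed by one idle minute leaves the mean rate at 90 events/sec, the one-minute average at roughly 66, and the five-minute average at roughly 147. The shorter the window, the faster the reading follows the current load.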

Counter: records increments and decrements. For example, counting the number of DemoService invocations.

MBeans type Counter and its attributes

Timer: provides duration statistics. It aggregates the Min, Mean, and Max durations since the Timer was started. It encapsulates and gathers throughput statistics from Meter, response statistics from Histogram, and statistics from Counter.

MBeans type Timer and its attributes

Histogram: keeps track of a stream of long values, and analyzes their statistical characteristics such as Max, Min, Mean, Median, standard deviation, 75th percentile, and 99th percentile. Statistics generated by Histogram can be retrieved as Snapshots using a Timer Object.
- Mean: is the average response time an application takes to process a request.
- Max: is the maximum response time an application takes to process a request from a user.
- Min: is the minimum response time an application takes to process a request from a user.
- Percentiles: are useful for giving the relative standing, that is, how one particular data value compares to the rest of the data. The Timer Object also provides metrics for 50th, 75th, 95th, and 98th percentile scores.
For example, after a service has been running continuously for a few days, its average response time could be low, for example, 1 to 2 secs. However, the Max response time may rise to, for example, 60 secs, depending on the health of the server, the network, an increase in load, or the number of concurrent users.
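The gap between Mean, percentiles, and Max can be seen on a small data set. The sketch below is illustrative only: the sample values are hypothetical, the helper names (percentile, summarize) are invented for this example, and the nearest-rank percentile used here is a simple textbook method rather than the algorithm the metrics library applies to its snapshots.

```python
import statistics


def percentile(sorted_values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    n = len(sorted_values)
    rank = max(1, (pct * n + 99) // 100)  # integer ceiling of pct * n / 100
    return sorted_values[rank - 1]


def summarize(response_times_ms):
    """Return the basic statistics a histogram would expose for the given samples."""
    values = sorted(response_times_ms)
    return {
        "min": values[0],
        "max": values[-1],
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "p75": percentile(values, 75),
        "p99": percentile(values, 99),
    }


# Hypothetical sample: 99 fast requests and one slow outlier, in milliseconds.
samples = [20.0] * 99 + [6000.0]
stats = summarize(samples)
print(stats)
```

In this sample the single slow request raises the Mean well above the typical response time, while the Median and percentiles stay at the typical value and only Max reveals the outlier. This is why it is worth watching several of these statistics together rather than the Mean alone.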

Choosing which metrics to monitor for Web Services

Web Services performance can be measured in terms of throughput, response times, latency, availability, and many other metrics. High throughput, low latency, low response times, a low error rate, and high availability are some of the characteristics every application should have.

Throughput: is the total number of transactions processed by the server in a given time. The time is calculated from the start of the first sample to the end of the last sample. Throughput is also measured in terms of the number of bytes exchanged per second.
Response Time: is the time the server takes to process a request before it returns a response.
Latency: is the additional delay involved for a processed request to reach the remote client. Latency increases if the network quality decreases.
Faults and Errors: must stay within an acceptable limit, and might increase with network issues and high load.
Availability: measures whether the Endpoint is reachable.
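The definitions of throughput and response time above can be expressed as a short calculation. This is a minimal sketch with hypothetical request timings; the function names are invented for illustration and are not part of any Talend or CXF API.

```python
def throughput_per_sec(samples):
    """samples: list of (start_time_s, end_time_s) pairs, one per request."""
    if not samples:
        return 0.0
    # Elapsed time runs from the start of the first sample to the end of the last.
    first_start = min(start for start, _ in samples)
    last_end = max(end for _, end in samples)
    elapsed = last_end - first_start
    return len(samples) / elapsed if elapsed > 0 else 0.0


def mean_response_time(samples):
    """Average time spent processing a single request, in seconds."""
    return sum(end - start for start, end in samples) / len(samples)


# Hypothetical timings: four requests spread across two seconds.
requests = [(0.0, 0.4), (0.5, 0.9), (1.0, 1.5), (1.6, 2.0)]
rate = throughput_per_sec(requests)
avg_response = mean_response_time(requests)
print(f"throughput={rate:.2f} requests/sec, mean response time={avg_response:.3f} s")
```

Throughput depends on the whole measurement window, while response time is measured per request, so the two can move independently: idle gaps between requests lower throughput without changing response times.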

Measure	MBeans type	Metric attributes	Statistics to monitor
Throughput (events/sec)	type=Metrics.Server, Attribute=Totals	MeanRate, OneMinuteRate, FiveMinuteRate, FifteenMinuteRate	MeanRate measures throughput since the Timer was started, and may not represent the current load. nMinuteRate measures the moving average throughput over the last n minutes, so it is useful for understanding the recent load on the system over the last 1, 5, and 15 minutes.
Response times (millisecs)	type=Metrics.Server, Attribute=Totals	Min, Max, Mean, Percentiles	Observe the deviation between the Min and Max response times. Percentiles can also be monitored if the Mean does not reflect actual load conditions.
In Flight requests (count)	type=Metrics.Server, Attribute=In Flight	Count	Returns the count of pending requests. If the count keeps increasing, either the load on the system is increasing, or the server needs performance tuning.
Availability	type=Bus.Service.Endpoint	State	The Endpoint is reachable if it is in the STARTED state.
Errors (count)	type=Metrics.Server, Attribute=Checked Application Faults, Attribute=Logical Runtime Faults, Attribute=Runtime Faults, Attribute=Unchecked Application Faults	Count	Returns the count of checked application, unchecked application, logical runtime, and runtime faults.
Number of invocations (count)	type=Metrics.Server, Attribute=Totals	Count	Number of requests processed since the service was started.

Performing a Load test

To better understand the performance of your Web Service and the Metric attributes, perform Load tests. Run them with many concurrent users for at least 1 or 2 minutes, depending on your business requirements. JMeter and SoapUI are popular tools for Load testing Web Services.

Test Summary

Server: Talend Runtime is restarted to ensure that the metrics are reset to their default value of 0.

Service: DemoService is deployed in Karaf.

Nagios: Metric refresh interval is 90 secs (open source)

JMeter settings:

Uses Web Service template
Number of samples: 10,000
Number of users: 25
Loop count: 400
Ramp-up period: 1 sec
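The settings above are related by simple arithmetic, sketched below. The rate of 180 transactions/sec is the throughput reported in Test 1 of this article; the variable names are chosen for illustration only.

```python
users = 25
loop_count = 400
observed_rate_per_sec = 180  # throughput reported in Test 1

# Each user (thread) sends one request per loop iteration.
total_samples = users * loop_count

# Approximate time needed to process all samples at the observed rate.
expected_duration_sec = total_samples / observed_rate_per_sec

print(f"samples={total_samples}, expected duration={expected_duration_sec:.1f} s")
```

At that rate the 10,000 samples complete in a little under one minute, which is consistent with all invocations being counted by the time the first readings are taken.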

Goal

Observe the statistics generated by the metrics in multiple steps, and understand how the readings change with and without a Load on the server.
Capture screenshots in both JConsole and Nagios.

Test 1: Statistics generated one minute after triggering the Load test.

Note: The Load test is initiated, and the container is under a load of 10,000 requests.

Throughput

OneMinuteRate, FiveMinuteRate, MeanRate: 180 transactions/sec
Note: Observe the warning message in the Nagios console (high load, 10,000 transactions).

Response Times

Min - 1.73 ms
Max - 304 ms
Mean - 61 ms
99th Percentile - 175 ms

NumOfInvocations

10000
No errors are reported, and the Endpoint state is Green.

Conclusion:

nMinuteRate and the Max response time increased to a new high level.
In production scenarios, the In Flight (pending) requests should be closely monitored.

Test 2: Statistics generated two minutes after triggering Load test.

Note: Load test is complete, and the container is idle.

Throughput

OneMinuteRate: 11 transactions/sec. The warning signal in Nagios disappears.
FiveMinuteRate and MeanRate also dropped, to between 50 and 100 transactions/sec.

Response Times

Min - 1.73 ms
Max - 304 ms
Mean - 61 ms
99th Percentile - 175 ms

NumOfInvocations

10000
No errors are reported, and the Endpoint state is Green.

Conclusion:

The nMinuteRate readings reflect the real time load on the system, and the Max response time remained the same.
The Max and Min response times help you analyze timeouts.
The Count from the InFlight and Error metric attributes can identify increasing load or performance issues.

Frequently asked questions

Question: Can I monitor Talend ESB REST services?

Answer: Yes. CXF templates can be used for both SOAP and REST Data Services.

Question: Can I monitor Talend ESB Camel Routes?

Answer: Yes. Camel templates must be used for Routes using cSOAP and cREST components. For more information, see the Camel metrics templates.

Question: Why is the Nagios console displaying an alert or warning message?

Answer: The CXF template used in this demo is configured to display a critical alert if the number of transactions/sec exceeds 100, and a warning alert if it exceeds 50. As a result, the color for OneMinuteRate fluctuates between green, pink, and yellow.

<Check OneMinuteRate>
 MBean = org.apache.cxf:bus.id=*,type=Metrics.Server,service="$1",port="$0",Attribute=Totals
 Attribute = OneMinuteRate
 Name = OneMinuteRate
 # Thresholds are in number of events per second
 Critical 100
 Warning 50
</Check>
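The effect of these two thresholds can be sketched as a simple classification. This is a simplified illustration that assumes an alert is raised when the value exceeds the threshold; Nagios threshold ranges support more cases than this, and the function name classify is hypothetical.

```python
def classify(one_minute_rate, warning=50, critical=100):
    """Map a OneMinuteRate reading (events/sec) to a Nagios-style status."""
    if one_minute_rate > critical:
        return "CRITICAL"
    if one_minute_rate > warning:
        return "WARNING"
    return "OK"


# Readings taken from the two tests in this article, plus one in between.
for rate in (180, 75, 11):
    print(rate, classify(rate))
```

Changing the warning and critical arguments mirrors editing the Warning and Critical values in the check definition to match your own expected load.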

Question: What other monitoring solutions are supported by Talend ESB?

Answer: Talend supports monitoring over the JMX protocol, using any vendor API that can query JMX in compliance with standard security policies.
