Effective Performance Monitoring

Tuesday. July 11, 2017

Close monitoring of your performance metrics is extremely critical for ensuring that your application, integration services, and infrastructure are available and performing well, so you can keep providing a great customer experience successfully fast.

To have your performance metrics under your control I would recommend to follow the best practises described below.

Collect and analyze your performance metrics

Collect All Performance Metrics

Monitoring and collecting different types of performance data is useful for current bottlenecks identifying and for watching the trends of your application’s performance.

Collecting data is cheap, but not having it when you need it can be expensive, so you should instrument everything, and collect all the useful data you reasonably can. Ilan Rabinovitch

There are following main areas you should closely keep your eye on:

Hardware resources: CPU, memory, disc and network
JVM metrics: GC activity and memory utilization
Business transactions: execution time and throughput
Integration calls: amount and response time
Caches utilization: cache capacity and hits ratio
Errors: types and amount of failures

Performance metrics are usually collected at a regular interval over the time, e.g. once per second or once per minute. Depending on a particular metric’s nature choose a sufficient granularity for data being collected.

And it’s really important to store all your collected data as long as possible. It could be a concern to have all this data stored somewhere, however, having collected data for a year or more makes it much easier to understand long-time trends and make forecasts for the future.

Follow the Trends

Having your application’s performance metrics being collected you have to keep watching the trend of your system’s capacity. It will give you an ability to analyze the current system’s performance and act appropriately with trending improvements or degradations.

Also, it’s important to have automatic reports with weekly and monthly trends. Such reports could be generated with help of your APM tool, such as AppDynamics, Datadog or Dynatrace.

Define SLAs

The Service Level Agreement (SLA) establishes the metrics for evaluating the system performance and provides the definitions for availability and the scalability targets.

SLA for all your performance metrics should be clearly defined and listed somewhere in your Wiki. As a result you might have something like the following:

Type	Metric	SLA
Hardware	Application server CPU utilization	70%
JVM	GC Major collections	0
Execution	Place Order transaction	500 ms.
Integration	Data grid remote call	10 ms.
Cache	Hit ratio	0.8
Error	HTTP 500 errors amount	0

These defined SLAs will help you to react appropriately on performance degradation in a real time.

Setup Alerting

Having SLA defined for all your performance metrics it’s required to setup alert notifications for each particular case. With help of Nagios and Splunk you can have all your systems metrics monitored, and if any of them became out of defined SLA, you’ll receive an alerting email, so you can immediately react to the situation.

Alert messages should be clear and meaningful with a description of possible cause of the occurred issue. Also, you might want them to have a run-book with required actions to be immediately taken.

Conclusion

Efficient monitoring of your performance metrics helps you to have a detailed picture of your system’s internal health. Keeping watching system’s performance trends makes it easy to plan your application’s scaling in accordance with business forecasts, such as holiday season traffic amount.

Having monitoring, reporting and alerting properly configured you’re always able to answer to the following questions:

Is the application available and working fast as it’s expected?
What is the current throughput?
Is it suffering some bottlenecks?