Email Subscription Form

Saturday, April 6, 2019

Logging, Monitoring, and Alerting

This week I'm writing about three things not often associated with testing: logging, monitoring, and alerting.  Perhaps you've taken advantage of logging in your testing, but monitoring and alerting seem like a problem for IT or DevOps.  However, a bug-free application doesn't mean a thing if your users can't get to it because the server crashed!  For this reason, it's important to understand logging, monitoring, and alerting so that we as testers can participate in ensuring the health of our applications.

Logging:

Logging is simply recording information about what happens in an application.  This can be done through writing to a file or a database.  Often developers will include logging statements in their code to help determine what's going on with the application below the UI.  This is especially helpful in applications that make calls to a number of servers or databases.

Recently I tested a notification system that passed a message from a function to a number of different channels.  Logging was so helpful in testing because it enabled me to follow the message through the channels.  If I hadn't had good logging, I wouldn't have had any way to figure out where the bug was when I didn't get a message I was expecting.

Good log messages should be easy to access and easy to search.  You shouldn't have to log on to some obscure remote desktop and sift through tens of thousands of entries with no line breaks.  One helpful tool for logging is Kibana- an open-source tool that lets you search and sort through logs in an easy-to-read format.

Good log messages should also be easy to understand and provide helpful information.  It's so frustrating to find a log message about an error and discover that it says "An unknown error occurred", or "Error TSGB-45667".  Ask your developer if he or she can provide log messages that make it clear what went wrong and where in the code it happened.

Another helpful tactic for logging is to give each event a specific GUID as an identifier.  The GUID will stay associated with everything that happens with the event, so you can follow it as it moves from one area of an application to another. 


Monitoring:

Monitoring means setting up automatic processes to watch the health of your application and the servers that run it.  Good monitoring ensures that any potential problems can be discovered and dealt with before they reach the end user.  For example, if it becomes clear that a server's disk space is reaching maximum capacity, additional servers can be added to handle the load.

Things to monitor include:
  • server response times
  • load on the server
  • server errors, such as 500-level response errors
  • CPU usage
  • memory usage
  • disk space
One way to monitor application health is with a periodic health check or a ping.  A job is set up to make a request to the server every few minutes and record whether the response was positive or negative.  Monitoring can also happen through a tool that watches the number of requests to the server and records whether those requests were successful.  Data points such as response times and CPU usage can also be recorded and examined to see if there are any trends that might indicate that the application is unhealthy.  One example of a tool that monitors application and server health is AppDynamics.  

Alerting:

All the logging and monitoring in the world won't be helpful if no one is watching to see if there are problems!  This is where alerting comes in.  Alerts can be set to notify the appropriate people so that immediate action can be taken when there is a problem.  

Some situations that might call for an alert would be:
  • CPU or memory usage goes above a certain threshold
  • Disk space goes below a certain threshold
  • The number of 500 errors goes above a certain level
  • A health check fails twice in a row
  • Response times are slower than expected
  • Load is higher than normal
There are a number of ways to alert people of problems.  Alerts can be set up that will send emails, text messages, or phone calls.  PagerDuty is one service that provides this alerting functionality.  An important thing to consider, however, is to set off-hours alerts only for serious cases in which users might be affected.  No one wants to be woken up in the middle of the night by an alert that says that the QA servers are down!  However, a problem in the QA environment could indicate an issue that could be seen in the production environment in the future.  So a less invasive alert, such as a message to a team chat room, could be set up for this situation.  

You may be saying to yourself at this point, "But I'm a software tester!  It's not my job to set up logging, monitoring, and alerting for the company."  The health of your application is the responsibility of everyone who works on the application, including you!  While you might not have the clout to purchase server monitoring software, you still have the power to ask questions of your team, such as:
  • How can we troubleshoot user issues?
  • How do we know that we have enough servers to handle our application's load?
  • How will we know if our API is responding correctly?
  • How will we know if a DDoS attack is being attempted on our application?
  • How will we know if our end users are experiencing long wait times?
  • How will we know if we are running out of disk space?
Hopefully these questions will motivate you and your team to set up logging, monitoring, and alerting that will ensure the health and reliability of your application.  

3 comments:

  1. So glad to see you encourage testers to get involved with logging, monitoring and alerting. I would add to that, observability. We testers have good skills for spotting patterns in data and identifying risks, it's another way we can make valuable contributions to our team and product.

    ReplyDelete
    Replies
    1. That is an excellent point, Lisa! Thanks for including it.

      Delete
  2. Monitoring is the new testing!

    ReplyDelete

Why You Should Be Testing in Production

This is a true story; I'm keeping the details vague to protect those involved.  Once there was a software team that was implementing new...