Developers can spend more than a full day of every work week troubleshooting production errors. Engineering teams regularly tell us that their developers spend at least 25% of their time, on average, solving (or trying to solve) production issues, and in many cases it is more than that.
1. Identifying there was an error
The first step in solving any problem is admitting you have one in the first place. Then, you should probably figure out what the problem is, if you don’t already know. Surprisingly (or not), this is actually one of the parts in the debugging process that gives developers the most trouble. How do you first figure out that something isn’t working right?
From the people we spoke with, we learned that, on average, well over half of production errors are reported by end users. Many of those companies rely on error reporting tools and users to provide feedback for up to 80-90% of errors. Relying on your users to tell you that something isn’t working is not good practice. Worse still are the problems that users run into and never report, because those translate directly into low customer satisfaction and customer churn.
The problem with relying on end users to report errors is that, more often than not, their reports are missing critical information about the error. The dialogue, if you’re lucky enough to have one, involves a lot of back and forth between engineering, support, and QA, and retrieving the information needed for troubleshooting takes a tremendous amount of time.
With Growlytics, you can receive all kinds of issue notifications over email, Slack, and other channels, with the details needed to resolve the issue, from customer actions to server logs.
2. Determining the impact of the error
Whether your information is coming from user feedback or showing up in the logs, determining the error rate is essential to recognizing whether you’re dealing with a critical issue. If the error happens infrequently and doesn’t have a large impact on users, dedicating time to reproducing and resolving it may not be worth the cost when there are much more critical errors happening. When relying on logs and end users, your means for understanding error rates and severity are limited.
When looking at exceptions, logged warnings, or errors, the error rate is a key metric for triaging the health of the system as a whole. To identify the most critical issues that require your developers’ attention, you can use Growlytics to sort through new errors and their frequency before zooming in on their root cause. Some of our most successful users are implementing an “Inbox Zero” approach to exceptions, treating them like emails that require handling.
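As a rough illustration of triaging by error rate, here is a minimal Python sketch. It is not how Growlytics computes anything; the `ErrorEvent` fields, the fingerprint format, and the `total_requests` figure are all assumptions made for the example. It groups errors by a fingerprint, then reports how often each one occurs and how many distinct users it touches.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorEvent:
    """One logged error. Field names are hypothetical."""
    fingerprint: str  # e.g. exception type plus the place it was raised
    user_id: str


def triage(events, total_requests):
    """Return one row per distinct error, most frequent first.

    Assumes total_requests is greater than zero.
    """
    counts = Counter(e.fingerprint for e in events)
    users = {}
    for e in events:
        users.setdefault(e.fingerprint, set()).add(e.user_id)

    rows = []
    for fingerprint, count in counts.most_common():
        rows.append({
            "fingerprint": fingerprint,
            "count": count,
            "error_rate": count / total_requests,
            "affected_users": len(users[fingerprint]),
        })
    return rows


# Made-up sample data for illustration only.
events = [
    ErrorEvent("TimeoutError@checkout", "u1"),
    ErrorEvent("TimeoutError@checkout", "u2"),
    ErrorEvent("TimeoutError@checkout", "u1"),
    ErrorEvent("KeyError@profile", "u3"),
]
triage_rows = triage(events, total_requests=200)
for row in triage_rows:
    print(row)
```

Sorting by count is only one possible ordering. A team might just as reasonably rank by affected users, so that an error hitting many customers once outranks one hitting a single customer repeatedly.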
3. Locating the affected code (or, Sifting through log files)
Once the error has been identified, it’s time to find its actual location in the source code and make sure it’s assigned to the right developer. There are a couple of ways to do this, but developers usually spend their time sifting through log files and looking for clues. The developer tasked with resolving the issue may not have access to the logs, or a clear idea of what to look for. Plus, the information they need may not have been written to the logs at all.
Log management tools like Splunk, ELK, or Loggly can help cut through the noise, but they can’t help when the piece of information you’re looking for was never written to the logs to begin with. In that case, the only way to get the information you need is to add logging verbosity and hope that when (if!) the error happens again, you can see what’s going on.
With Growlytics, you can see exactly which line of code is breaking, along with the data that was posted to the affected API and the logs written for that call. With this information, you can cut the effort by almost 80%.
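One way to reduce the "it was never logged" problem is to record the inputs at the point where the exception is caught, so they are in the logs the first time the error occurs. The sketch below uses Python's standard `logging` module; the `apply_discount` and `handle_order` functions and the order fields are hypothetical and exist only to trigger an error.

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
logger = logging.getLogger("orders")


def apply_discount(order):
    # Hypothetical business logic that fails when a field is missing.
    return order["total"] * (1 - order["discount"])


def handle_order(order):
    try:
        return apply_discount(order)
    except Exception:
        # logger.exception logs at ERROR level and appends the traceback.
        # Including the input here means it is available on the first
        # occurrence, not only after verbosity has been raised.
        logger.exception("apply_discount failed for order=%r", order)
        return None


# The "discount" key is missing, so this logs a KeyError with its traceback.
failed_result = handle_order({"id": 42, "total": 100.0})
```

In a real system you would want to be selective about what goes into the message, since request payloads can contain personal or sensitive data.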
4. Reproducing the error
If your logs don’t give you a clear answer as to why the error happened, and it is highly likely that they won’t, trying to reproduce it is the next step. Reproducing an error before you attempt to fix it is good practice regardless. After all, if you don’t first reproduce the problem, how can you be sure that you’ve fixed it when you’re done debugging?
Unfortunately, with vague reports coming from users and unclear logging statements surrounding the error, finding the exact flow of events that caused it takes a lot of time and may even be impossible. If you’re lucky, the problem occurred in a part of the code that you recently worked on and you can easily deduce the steps that led to it. Otherwise, the most common ways of finding the event flow are to try different things and observe the results, send the ticket to QA for more testing, look through the logs, or add more logging statements. More often than not, failure in this step is what causes tickets to be closed with the infamous… “could not reproduce”.
With Growlytics, you can see exactly which actions the customer took in the browser or mobile app and which APIs were executed in that session. With all of this data, you can save more than 80% of the time spent reproducing the issue.
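To make the idea of a recorded event flow concrete, here is a minimal Python sketch of a "breadcrumb" trail: a bounded buffer of recent user actions and API calls that gets attached to an error report. This is an illustrative assumption about how such a trail could be kept, not a description of Growlytics internals, and the `Breadcrumbs` class, `build_error_report` function, and event names are hypothetical.

```python
from collections import deque


class Breadcrumbs:
    """Keeps the most recent events for one session (hypothetical helper)."""

    def __init__(self, max_events=50):
        # A deque with maxlen drops the oldest entry once the buffer is full.
        self._events = deque(maxlen=max_events)

    def record(self, kind, detail):
        self._events.append({"kind": kind, "detail": detail})

    def snapshot(self):
        return list(self._events)


def build_error_report(exc, breadcrumbs):
    """Bundle the error with the steps that led up to it."""
    return {
        "error": f"{type(exc).__name__}: {exc}",
        "steps": breadcrumbs.snapshot(),
    }


trail = Breadcrumbs(max_events=3)
trail.record("click", "Add to cart")
trail.record("api", "POST /cart")
trail.record("click", "Checkout")
trail.record("api", "POST /checkout")  # pushes out the oldest entry

try:
    raise ValueError("payment token missing")
except ValueError as exc:
    error_report = build_error_report(exc, trail)

for step in error_report["steps"]:
    print(step["kind"], step["detail"])
```

With the steps stored next to the error, whoever picks up the ticket can replay the same sequence instead of guessing at it.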
5. Entering “war room mode”
It’s hard to dispute the benefits of working in a “war room-like” set-up. It can bring with it project clarity, stronger communication, and increased productivity. So, maybe you’re thinking about rearranging the furniture in your office to boost innovation and progress, or maybe your application crashed for some unknown reason and you need to figure out why RIGHT NOW.
If you’re lucky, you might only need a war room once or twice a year, when something catastrophic happens. For others, war room situations occur on a weekly, or even daily, basis. With so much time dedicated to resolving issues, there’s hardly any left to advance the product roadmap.
Some of the developers we’ve spoken with recently described gathering for a war room and still not having the information to move forward with a solution. Some described war room situations that lasted for 5 or 6 days. That’s bad. Not only do these situations take time away from the rest of your work, they can also hurt the reputation and revenue of the company.
Root Cause Automation
The most effective way we’ve found to cut down on the time your team spends debugging is to automate the error resolution process. That means automating not only the identification of errors but, more importantly, the root cause analysis, with access to the complete source code and variable state across the entire call stack of every error.
Growlytics created a tool that does just that. Not only does it provide you with the source and full call stack of any exception or error, it reveals the exact variable state at the time of error so teams no longer need to spend endless hours trying to reproduce it.
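To show what "variable state across the call stack" can look like in practice, here is a minimal Python sketch that walks an exception's traceback and snapshots the local variables of each frame. It relies only on standard traceback and frame attributes and is not a description of how Growlytics works; `capture_stack_state`, `compute_ratio`, `process`, and `run_example` are hypothetical names used for illustration.

```python
def capture_stack_state(exc):
    """Return one entry per traceback frame, outermost first."""
    frames = []
    tb = exc.__traceback__
    while tb is not None:
        frame = tb.tb_frame
        frames.append({
            "function": frame.f_code.co_name,
            "line": tb.tb_lineno,
            # repr() turns each value into a string snapshot, so the
            # report does not keep references to live objects.
            "locals": {name: repr(value) for name, value in frame.f_locals.items()},
        })
        tb = tb.tb_next
    return frames


def compute_ratio(numerator, denominator):
    return numerator / denominator


def process(record):
    total = record["total"]
    parts = record["parts"]
    return compute_ratio(total, parts)


def run_example():
    try:
        process({"total": 10, "parts": 0})
    except ZeroDivisionError as exc:
        return capture_stack_state(exc)
    return []


stack_state = run_example()
for entry in stack_state:
    print(entry["function"], entry["line"], entry["locals"])
```

The last entry is the frame where the error was raised, so in this example it shows that `denominator` was `0` when the division failed, which is the kind of detail that otherwise has to be recovered by reproducing the error.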