By Jon Gifford, founder and chief search officer, Loggly
Developers pride themselves on writing code that’s solid and reliable – and passes QA with flying colours.
This is no surprise: Smart designs and great code are what tend to drive recognition—and rewards—within development teams and across the industry.
But in big, distributed systems like ours, a lot of factors are out of the control of developers.
Running tens or hundreds of machines increases the risk that one of them will start misbehaving. The most elegant design in the world can still have edge cases that throw it for a loop. Customers have an uncanny ability to do things in ways no sane developer would anticipate. And those factors can, and do, make even the best code fail.
When code starts its “real life” in production, you need to be ahead of the curve. You need to be able to understand why something is happening when your code fails. That’s where instrumentation comes in. But how do you convince your development team to put appropriate instrumentation into practice?
Believing in appropriate instrumentation requires a mindset change
Engineers are often shielded from the nitty-gritty of the production environment that their code will be living in, either through choice or process. Traditionally, developers move on to the next cool piece of work when the code has passed QA and is production bound.
This can lead to a mindset where the code in production is yesterday’s news. If it’s a little difficult to manage, then that is “not my problem”, or “next release”. But the complexity and release cadence of today’s cloud-based applications have made this boundary a lot more porous than it has ever been in the past, and a new mindset is required.
Specifically, engineers should put as much effort into making sure that the system is easy to manage in production as they do into making sure that it is nicely designed and passes all the unit tests.
In the initial design, development and QA phase, the most important question is “is my code doing what the design says it should be doing?” Once it hits production, an equally important question must be asked: “Is the system handling the real world as well as we thought it would?” And you need to be ready to answer that question very quickly.
Since this is a very open-ended question, one of the best ways to answer it is to use instrumentation to give you visibility into the system. In other words, if you don’t know what questions you’re going to need to answer, make sure you at least have the answers to the questions you do know. The real answer may lie in the gaps, but at least you know where those gaps are.
… typically learned at the school of hard knocks
As often happens, the most zealous converts to the new mindset are usually those developers who have paid the price, losing hours of sleep in wild goose chases after something has gone awry at 2 a.m.
Once you see your code fail in ways you never would have anticipated, have a pet belief destroyed, or find that you simply have no idea why something is misbehaving, you’re much more likely to question your other assumptions about how things should behave. And you (hopefully) become a believer in the idea that logging what you DO know is a really good way to discover what you don’t.
An example: at Loggly, we needed an industrial-strength service for handling some of our data that was easily accessible by a wide range of clients.
Enter Amazon S3. We expected that it would be rock-solid and fast enough for our purposes. Every now and then the second of these expectations has proved to be an issue, because although the 99.9th percentile behavior is just fine, the 99.99th percentile proved a little harder to handle. How did we know that this was the issue? 1 in 10,000 requests is a pretty small needle in a pretty big haystack, after all.
Truth be told, we didn’t even suspect S3 would be problematic, but our general rule of instrumenting everything that went out of process gave us the data we needed to find and explain some very sporadic issues in our system. We still use S3, and we’re very happy with it. But now that we know more about how it behaves in extremis, we can factor that into our design.
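To make the "instrument everything that goes out of process" rule concrete, here is a minimal sketch in Python. It is not Loggly's actual code: the names `timed_call`, `TIMINGS` and `percentile` are hypothetical, and the in-memory list stands in for whatever log pipeline you really use. The idea is simply to time every external call, record the outcome even when the call fails, and then look at the tail percentiles rather than the average.

```python
import math
import time
from contextlib import contextmanager

# Hypothetical in-memory sink; a real system would write these events to its logs.
TIMINGS = []


@contextmanager
def timed_call(service, operation, sink=None):
    """Record the duration and outcome of one out-of-process call."""
    target = TIMINGS if sink is None else sink
    start = time.perf_counter()
    outcome = "ok"
    try:
        yield
    except Exception:
        outcome = "error"
        raise  # never swallow the failure, just record it
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        target.append(
            {"service": service, "op": operation, "outcome": outcome, "ms": elapsed_ms}
        )


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples to summarise")
    ordered = sorted(values)
    # round() trims floating-point noise before taking the ceiling
    rank = max(1, math.ceil(round(pct * len(ordered) / 100.0, 9)))
    return ordered[rank - 1]
```

You would wrap each external request in `with timed_call("s3", "get_object"): ...` (the service and operation labels are whatever you choose) and periodically compare `percentile(ms_values, 99.9)` with `percentile(ms_values, 99.99)`. A rare slow request barely moves the first number but can dominate the second, which is exactly the kind of needle that is otherwise lost in the haystack.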
Reshape the wall between development and Ops with instrumentation
Once your development team starts to change its mindset, it’s easier to give development ownership over the gap between design models and the real world. I’ve spent a long time building distributed systems, and I’ve never felt comfortable just handing them over to Ops without first building a suite of internal monitoring tools.
These tools help diagnose the inevitable 2 a.m. problems and make it easy to stay on top of how things behave when all is well. That second point is easy to brush over, but is essential to understanding what is going on when things do go wrong. If you know that your system latency or throughput has a particular “shape”, you can look at a graph and know that all is well – but you have to invest some time to learn that shape.
Traditional ops monitoring tools (Nagios, Ganglia, etc) are great at what they do, but if the system you’re monitoring has complex internal behavior that isn’t accessible to those tools, you’re setting everyone up for a painful experience when things go wrong. The most knowledgeable people when it comes to understanding a complex system are usually the people who wrote it. If they are walled off from it, then you’re just making things harder for yourself.
There are probably as many definitions of DevOps as there are developers and ops people, but as far as I’m concerned, the approach I outline above is one of the most important. Your developers should want to know how their code is behaving in production – they should “know the shapes.”
Your Ops people should want to know about the internal monitoring, and should be comfortable using it to dig a little deeper than they otherwise could. There should be as few barriers as possible between the two groups. They have different specialisations, but the end goal for both should be a smoothly running, high performance, well understood system.
Educate your team on what machine-based log management can do
Taking responsibility for high quality production instrumentation doesn’t mean that developers should turn into log watchers. They should be thinking about making their logs watchable by a machine, so that:
- Problems become visible before they affect every user
- The logs contain all of the data necessary to identify causes of failure
- The logs contain performance data for easy analytics
Engineers tend to push back on the idea of adding instrumentation because they don’t want to see their pretty logs full of “clutter”. Logs as prose have a long tradition in software engineering, and can often provide insights into the applications that emit them – profanity is a sure sign that your application is heading towards, or is already in, a state that the developer is not happy about. If your engineers think of logs this way, then it’s understandable that anything that doesn’t look like prose is indeed bad, because it makes it harder for them to quickly read and understand what the application is doing.
When the logs are written to be read by machines rather than humans, this “clutter” is actually the best possible way to log because it is natively understandable by those machines. JSON is easier to read than, say, XML, and is trivial for machines to consume.
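As an illustration of what machine-readable "clutter" can look like, here is a small sketch using Python's standard `logging` module. The `JsonFormatter` class, the `make_logger` helper and the convention of passing structured data under a `fields` key are my own assumptions for the example, not a Loggly requirement or a standard API.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Assumed convention: callers pass structured data via extra={"fields": {...}}.
        # Keys in "fields" override the defaults above if they collide.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


def make_logger(name, stream=sys.stderr):
    """Build a logger that writes JSON lines to the given stream."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    logger.propagate = False  # avoid duplicate, non-JSON output from the root logger
    return logger
```

A call such as `make_logger("search").info("search complete", extra={"fields": {"latency_ms": 42.5}})` then emits one JSON object with the human-readable message alongside the numeric latency, so a person can still skim it and a machine can graph it without any parsing rules.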
Your engineers may not be happy about the change, but show them a graph of the latency of their application, generated with a couple of mouse clicks rather than a grep | sed | sort | awk | gnuplot pipeline that always goes wrong the first 3 times, and they start to see the value. Add some alerts, and they’ll realize they don’t have to sit and tail files to see what is going on in the system.
With a better understanding of how a log management solution like Loggly can clean up the picture, they will be more eager to adopt my best practices [link to previous blog]. Eventually, they’ll wonder how they ever survived without it.
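To show why structured logs make this so much easier than a text pipeline, here is a rough sketch of summarising latency from JSON log lines and deciding whether to raise an alert. The field name `latency_ms`, the function names and the threshold are all assumptions for illustration; a log management service would do this for you, and this is only the idea in miniature.

```python
import json


def latency_summary(lines, field="latency_ms"):
    """Summarise a numeric field across JSON log lines, skipping anything unparseable."""
    values = []
    for line in lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate stray prose lines mixed into the log
        if not isinstance(event, dict):
            continue
        value = event.get(field)
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            values.append(float(value))
    if not values:
        return None
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": sum(values) / len(values),
    }


def should_alert(summary, max_allowed_ms):
    """Alert when the slowest observed request exceeds the allowed latency."""
    return summary is not None and summary["max"> max_allowed_ms] if False else (
        summary is not None and summary["max"] > max_allowed_ms
    )
```

Because every event is already a JSON object, there are no regular expressions to get wrong: the same few lines work for any numeric field you decide to log later.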
Treat visibility as the mother of more instrumentation
Once your development team starts down the path of putting the right logs in the right format for the right consumer, you’ll find that positive feedback loops will form. Your team starts to see what’s going on at a level of detail they never could before – and also to see where they can’t see enough. They’ll save hours or even days in debugging and operational troubleshooting by:
- Focusing their time on areas that are obviously problematic rather than experimenting with theories
- Being able to measure performance and understand the magnitude of issues
Eventually, you’ll get to the point where your instrumentation is telling you what’s really going on, as opposed to what your design tells you should be going on. Hopefully there’s not a big divergence, but when there is it’s nice to be able to measure it. And once you do, you’ll never go back.
Look forward, not just back
The value of log data is not just in understanding what has already happened, but in guiding future action. Bringing log metrics into your planning exercises embeds them deeper into how you operate as a development and production organization.
At Loggly, we can easily look at a wide range of metrics for every part of our system – data volumes, flow through each component, indexing rate, search latency… you name it, we measure it. Knowing these values means that we can make plans on how to grow the world’s most popular cloud-based log management service based on real data, not idealized tests that don’t always reflect the complexity of our production service.
The real world will always throw you curve-balls, no matter how hard you try. Some instances will perform worse than other “identical” instances. Some application will be configured differently than all the rest. Some disk drive will fail. Some customer will send 100 times more data than anyone else. The list is endless, and no-one I’ve ever worked with, or heard about, has been smart enough to predict every possible failure mode and have tests for them.
Life as a developer of a cloud-based application can be like standing in front of a fire hose of pain. The only way to survive it is to do everything you can to measure what is actually happening in your system, and use that data to get a deeper understanding of what is actually going on, rather than guessing based on what you think should be going on.