Cloud Tech Sessions Podcast: Ep. 7 Observability Masterclass with Roy Douber: Proactive IT Solutions for 2025

You can’t fix what you can’t see. In this Observability Masterclass, Roy Douber breaks down how observability helps you catch problems before they start. Learn how to build smarter, not harder, with observability strategies that boost performance for AI, cloud, and complex IT environments.

Overview:

In this episode, Roy Douber explores the concept that you can’t fix what you can’t see and emphasizes that observability is crucial for maintaining the smooth operation of IT systems. From avoiding expensive outages to enhancing AI and cloud workloads, Roy provides practical strategies for IT leaders to stay ahead. He clarifies that observability is not just a tool but a vital IT culture of proactive monitoring, enabling teams to ask the right questions in real time and make informed decisions before issues arise.

By adopting a robust observability strategy, businesses can detect early warning signs, prevent costly outages, and minimize downtime, potentially saving millions in lost revenue and safeguarding their brand reputation.

Key Topics Covered:

  • The essence of observability and its critical role in IT infrastructure.
  • How observability can prevent outages and reduce downtime.
  • The impact of observability on code quality and overall business performance.

Key Takeaways:

  • Observability is crucial for identifying and preventing outages before they occur, regardless of company size.
  • A strong observability practice enables faster code releases, reduces time spent on triage, and drives better business outcomes.
  • Implementing observability is not just about tools—it's about cultivating a culture that prioritizes real-time system insights.

Read the Transcript:

Kevin Kohn: Welcome everyone to our Cloud Tech Sessions Observability Masterclass. My name is Kevin Kohn, Vice President of Cloud and Application Engineering here at e360. With me today, I have our observability guru, Roy Douber. Roy, please tell us a little bit about yourself.

Roy Douber: Thanks, Kevin. I've been with e360 for a few years now, working primarily on observability, DevOps, AI, and various solutions for different clients. Before this, I was a lead site reliability engineer at Qualcomm. Thanks for the kind introduction, and let’s jump right in.

Kevin Kohn: Sure thing. I want to expand on your introduction a bit because we brought you to the team to be this amazing observability guru—and you are—but you contribute so much more to the organization. You've done everything from database optimization for creative structures in the cloud space to helping us re-architect data centers.

Your skills have really contributed to a wide range of areas, which I think brings even more value to this discussion. You have an end-to-end view of data centers and much of what goes into making them successful. Wouldn't you agree?

Roy Douber: Yeah, I’d say so. Let’s dive a bit deeper into my background. Throughout my career, I've owned various observability products at different companies. As a site reliability engineer, you get thrown into outages, and the ask is always, "Fix it, please." The stakes are often high, with millions of dollars on the line per hour.

You’ve got angry clients, potentially angry stakeholders, and a service that’s down, affecting either the client or internal stakeholders. During the outage, you work to resolve it, but that’s only the beginning. You need a strong understanding of complex systems, databases, applications, logging, metrics, events, programming languages, and CI/CD processes to address the issue. But even after fixing the outage, there’s more work to be done.

After the outage, the real work starts—you collaborate with the engineering team to develop a better observability solution for the application. That’s been my life for the past eight years.

It’s about implementing, writing, and devising solutions, showcasing the data in a meaningful way for clients or stakeholders, and telling the story. Sometimes the end solution is focused on developers, sometimes on operators, and many times it's focused at the executive level. The essence of observability, in my opinion, is that it’s a culture.

It’s a tool and a mechanism for acquiring data and presenting it in a way that helps drive business revenue. If your application is down, you're losing revenue. But if it's up and hardened, you're driving revenue and creating better client outcomes.

Kevin Kohn: Very well said. All of us who have been in the data center space have been involved in outages more times than we care to remember, right? For those of us who've been around a while, it's happened many, many times. One thing I’ve noticed (no pun intended) is that companies that have properly invested in observability often get advance notice and can either remediate the problem or avoid the outage altogether. Can you speak a little more about that from your experience?

Roy Douber: Sure. The symptoms before an outage are an interesting topic. Sometimes the data is right in front of you, but you just don’t know to look for it. And sometimes it’s not. I see this a lot—some companies are just starting their observability journey, while others have all the data in front of them but don’t know where to look at the right time.

There’s also the issue of identifying small symptoms that can lead to bigger problems. This involves tools and data acquisition. Generally, if you have all the pieces in place, you start to move toward an observability culture. Observability, much like how DevOps evolved from a role into a culture, has grown beyond being just a tool. Today, observability is more of a culture. You need the right people who understand the tools to answer the business’s most critical questions.

Kevin Kohn: That’s a great point. So, on that note, let’s get into what exactly observability is. How would you define it?

Roy Douber: The essence of observability is being able to answer questions about your systems that you didn’t know beforehand you would need to ask.

You should be able to ask what the system is doing in real time, look at your instrumentation, examine the data, and come up with an answer. But observability is also a culture of well-instrumented tooling and technology, with mechanisms to acquire, process, store, visualize, and act on data.

With this, you can answer those questions or devise a path toward finding those answers. You may not have all the data immediately, but observability gives you a way to access the information needed to answer complex questions about your business systems, metrics, or applications within your ecosystem.

Kevin Kohn: I think that’s incredibly powerful, especially for large environments. When you’re managing hundreds or even thousands of servers, you’ve got numerous networking nodes and a huge storage environment to maintain. Things can get complex very quickly. You can see why observability is essential for large environments. Do you think the same justification applies for smaller environments? If so, why?

Roy Douber: To put it simply, if you ever want to grow big, it makes sense to start early.

That said, I struggle with the fact that many companies still see observability as a checkbox item. They just add it and think they're good to go. But I believe the investment should go into training people and allocating time for engineers to develop observability. If a developer spends 80% of their time building a product, roughly 20% should go into observability and improving the environment they’re working in. Otherwise, you end up with an application that becomes dysfunctional.

Or, if key people leave, there’s no knowledge left behind to understand how the system operates, what the problems are, or where potential risks lie. You need to be able to answer these questions for any complex application.

Kevin Kohn: I think that applies no matter how mature your organization is. Would you agree?

Roy Douber: Absolutely. Even small businesses should start early. They don’t have to start with premium tools—open-source tools are fine at the beginning—but the key is to begin early. Build your application so that you can easily instrument metrics, logs, and traces.

Doing this will benefit you as your customer base grows, as your use cases expand, and as you implement new technologies like AI. Having a modular enough piece of software that allows for observability is critical.
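
To make the point about instrumenting metrics, logs, and traces concrete, here is a minimal sketch using only the Python standard library. It is not tied to any vendor or to any tooling mentioned in the episode. The `span` helper, the in-memory `METRICS` and `SPANS` stores, the "checkout" logger name, and the `place_order` function are all hypothetical names chosen for illustration; a real service would more likely export this data to an observability backend.

```python
import logging
import time
import uuid
from collections import defaultdict
from contextlib import contextmanager

# Hypothetical in-memory stores; a real system would export to a backend.
METRICS = defaultdict(list)  # metric name -> list of observed values
SPANS = []                   # finished trace spans, in completion order

logger = logging.getLogger("checkout")  # illustrative service name


@contextmanager
def span(name, trace_id=None):
    """Record one unit of work as a trace span, two metrics, and a log line."""
    trace_id = trace_id or uuid.uuid4().hex
    start = time.perf_counter()
    status = "ok"
    try:
        yield trace_id
    except Exception:
        status = "error"
        raise
    finally:
        duration_ms = (time.perf_counter() - start) * 1000
        METRICS[f"{name}.duration_ms"].append(duration_ms)
        METRICS[f"{name}.{status}"].append(1)
        SPANS.append({"trace_id": trace_id, "name": name,
                      "status": status, "duration_ms": duration_ms})
        logger.info("span=%s trace_id=%s status=%s duration_ms=%.2f",
                    name, trace_id, status, duration_ms)


def place_order(items):
    """Illustrative business function; items are (name, price) pairs."""
    with span("place_order") as trace_id:
        # The nested span shares the trace id, so both steps can be correlated.
        with span("price_items", trace_id):
            return sum(price for _, price in items)
```

Wrapping each unit of work in one helper like this means the three signals stay consistent with each other, and the helper can later be swapped for a real tracing library without touching the business code.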

Kevin Kohn: That leads us to another point. I’ve asked you about more traditional data center architecture, but we all know things have gotten much more complex, especially with the rise of serverless, Kubernetes, and now AI. The world is changing fast. We’re getting more and more ways to serve workloads and provide applications. In the context of these new methods and cloud environments, how does the observability story intersect with all this complexity?

Roy Douber: Great question. Workloads are becoming more and more ephemeral, right? Your Kubernetes pod could disappear at any moment. AI answers vary depending on who asks and how they ask the question. All that data needs to feed back to developers and even clients. Clients care about this too—they often check status pages to ensure the services they’re relying on are up and running.

Serverless is almost entirely ephemeral. The function runs and then it’s gone. So, if something goes wrong with that function, what do you do?

Traditionally, the tooling has been, "I’m going to set this up, install it, and leave it to run in the background." But the use cases have evolved. With ephemeral jobs, you need a solution that supports this dynamic environment, and many businesses don’t have that yet.
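
One way to cope with functions that vanish after each run is to have every invocation emit its own telemetry record before it exits, instead of relying on a long-lived background agent. The sketch below is an illustration under stated assumptions: the handler takes a single `event` argument, and the platform collects whatever is written to standard output. Neither detail comes from the conversation, and `emit_on_exit` and `resize_image` are hypothetical names.

```python
import functools
import json
import sys
import time


def emit_on_exit(sink=sys.stdout):
    """Decorator factory: write one JSON record per invocation before exiting.

    `sink` is any object with a write() method. Standard output stands in for
    whatever log pipeline the platform collects (an assumption of this sketch).
    """
    def decorator(handler):
        @functools.wraps(handler)
        def wrapper(event):
            record = {"function": handler.__name__, "outcome": "ok"}
            start = time.perf_counter()
            try:
                return handler(event)
            except Exception as exc:
                record["outcome"] = "error"
                record["error"] = f"{type(exc).__name__}: {exc}"
                raise
            finally:
                # Runs on success and on failure, so the record is never lost
                # just because the function instance disappears afterwards.
                elapsed_ms = (time.perf_counter() - start) * 1000
                record["duration_ms"] = round(elapsed_ms, 3)
                sink.write(json.dumps(record) + "\n")
                flush = getattr(sink, "flush", None)
                if flush is not None:
                    flush()
        return wrapper
    return decorator


@emit_on_exit()
def resize_image(event):
    """Illustrative handler; the event shape is made up for this sketch."""
    return {"width": event["width"] // 2, "height": event["height"] // 2}
```

Because the record is written in the `finally` block, a failed invocation still leaves behind its duration and error type, which is the information needed to answer "what do you do when that function goes wrong?"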

That’s one of the things we specialize in—identifying gaps in the most critical applications, determining what can or can’t be done with the current toolset, and adding solutions where they make sense.

The observability landscape is huge. A lot is happening, including consolidation within the observability field, and new vendors are constantly emerging. Often, new solutions come out that aren’t quite ready for prime time. This is where expertise comes in—deciding whether a solution is the right fit or needs to be adapted.

For example, if you want to add AI to your product, we’ll help add a solution that lets you track what people are asking when they raise a question. Observability helps you improve inferences, runtime, and AI response time. It’s about determining whether there’s room for improvement by using a different model that delivers a similar result. Without observability, it’s hard to answer those questions. Today, observability is a must-have. You can’t function in a modern environment without it.

Kevin Kohn: No, that's so well put. If I may just dig into something you said earlier—observability allows you to see what’s happening inside your applications and how they’re running. There’s a lot packed into that statement, right? It’s not just about knowing that the app is running, but about whether it’s performing at its best. Are you using the right infrastructure to ensure it’s running as efficiently as possible? There are so many aspects to this. Could you expand on that and touch on the more interesting elements?

Roy Douber: Absolutely. There’s definitely a cost optimization angle to this whole equation. Should your application be running in containers? Should it be serverless? Should it be on a single server? It depends on the type of application and where its bottlenecks are. Is it memory-heavy? Is it bottlenecked by CPU or storage? These are the kinds of questions observability can help answer.

All that data has to be processed, and from there, you can make a proper architectural decision. As you build and re-architect your application over time, you ideally want to have it optimized from the start. But, as we all know, the realities of the world sometimes mean changes need to be made as the application evolves.

Kevin Kohn: Right.

Roy Douber: Some of these AI use cases, for example, require more CPU or memory so that you can answer questions against the model or fine-tune it. Or maybe you need to rebuild the application so that it compiles faster, allowing you to release it more quickly. All of these questions get answered through observability.

Kevin Kohn: The example you just shared is so fluid. With AI, you first have to define the use case, then you start down that path. Maybe the use case works as expected, maybe it doesn’t, and you need to tweak how you're approaching it. So, observability needs to be flexible—it has to ebb and flow with your activity. You can’t be rigidly tied to how the observability tool is set up. It has to be adaptable to the use cases you’re putting in front of it. Would you agree?

Roy Douber: Yes, absolutely. In any application, there’s a CI/CD process, right? And during a blue-green release cycle, for example, you want to see how the application is performing in real time. If you’re using Kubernetes, you could say, “I’m going to release this new version to 5% of my user base. I want to see how that 5% behaves compared to the 95% that are still on the prior version.” You can then slowly roll out the new version to more users based on those observations.
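
The 5% versus 95% comparison described here (often called a canary release) can be sketched as a simple decision function. The `CohortStats` type, the choice of error rate as the signal, and the `max_ratio`, `min_requests`, and `floor` thresholds are assumptions made for illustration; real rollouts usually weigh several signals, such as latency and saturation, as well.

```python
from dataclasses import dataclass


@dataclass
class CohortStats:
    """Request and error counts observed for one group of users."""
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def canary_decision(baseline, canary, max_ratio=1.5, min_requests=500,
                    floor=0.001):
    """Return 'wait', 'rollback', or 'promote' for the canary cohort.

    All thresholds are illustrative defaults, not recommendations.
    """
    if canary.requests < min_requests:
        return "wait"  # not enough traffic yet to judge the new version
    # The floor keeps a near-zero baseline from making any canary error fatal.
    allowed = max(baseline.error_rate * max_ratio, floor)
    return "rollback" if canary.error_rate > allowed else "promote"


# Example: 95% of traffic on the prior version, 5% on the new one.
baseline = CohortStats(requests=95_000, errors=95)
healthy_canary = CohortStats(requests=5_000, errors=5)
failing_canary = CohortStats(requests=5_000, errors=50)
```

With these made-up numbers, the healthy canary matches the baseline error rate and would be promoted, while the failing canary has ten times the baseline rate and would be rolled back.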

Now, if you’re adding AI components and have just switched to the latest model, like ChatGPT 4.0, you want to see how it performs compared to ChatGPT 3.5. What kind of answers are people getting? How are they using it? What kinds of questions are being asked in aggregate?

All of these things can be defined through observability, which gives you the insights you need to make informed decisions. You want to know whether it’s worth paying extra for ChatGPT 4.0. Does it give your user base better answers for the questions embedded within your application?
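
As a rough illustration of how that comparison could be made from collected data, the sketch below aggregates per-response events by model. The event fields (`model`, `latency_ms`, `cost_usd`, `helpful`) are hypothetical, as is the idea that users mark answers as helpful; the model labels used with it should be whatever the application actually records.

```python
from collections import defaultdict
from statistics import mean


def summarize_by_model(events):
    """Aggregate latency, cost, and helpful-rate per model.

    Each event is a dict with the hypothetical keys:
    model (str), latency_ms (number), cost_usd (number), helpful (bool).
    """
    grouped = defaultdict(list)
    for event in events:
        grouped[event["model"]].append(event)

    summary = {}
    for model, rows in grouped.items():
        summary[model] = {
            "responses": len(rows),
            "avg_latency_ms": mean(r["latency_ms"] for r in rows),
            "total_cost_usd": sum(r["cost_usd"] for r in rows),
            "helpful_rate": sum(1 for r in rows if r["helpful"]) / len(rows),
        }
    return summary
```

Putting the cost and the helpful-rate side by side for each model is what turns "is the newer model worth paying extra for?" from a guess into a question the data can answer.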

Kevin Kohn: That’s perfect. Let’s pivot a little now. How does observability improve code quality? After that, I want to talk about how it helps drive revenue. But first, let’s focus on code quality.

Roy Douber: Developers today are facing a lot of challenges. Not only do they have to write software, but they also need to understand cloud technology, various databases, multiple programming languages, and the new AI models that are constantly emerging. This creates a level of confusion, making it difficult to consistently deliver high-quality software.

Having observability in place gives developers confidence in their code. They know they’ll have real-time insights to answer questions, either during development or when the system is live. As users interact with the application in ways developers might not have originally anticipated, observability helps to track that behavior. Without it, a developer’s code confidence plummets, and they have to second-guess how their changes will propagate across environments.

With observability, you get real-time answers on how your environment is performing and how users are interacting with it. This allows you to quickly iterate on the product. A cycle that might have taken a month can now take hours, days, or maybe a week. Developers grow more confident and can push changes to production faster.

I think this is a key driver of revenue that’s often overlooked. You can’t easily put a number on the value of a developer being able to push more code to production confidently. But with all the necessary tools and mechanisms in place, engineers know they’ll be able to fix bugs, push hotfixes, or deal with any fallout very quickly. To me, observability is crucial for maintaining that confidence, which is why it’s not just a checkbox item anymore. It’s absolutely essential.

Kevin Kohn: So, this becomes a “do more with less” story, especially for the developer. We all know how busy we are, how much is demanded of us, and how companies want to squeeze every last bit of productivity out of their valuable staff. Having observability in place optimizes each company's limited resources, allowing them to go as far as they possibly can. Does that summarize it well?

Roy Douber: Yes, exactly. Observability gives developers an extra hand in identifying issues. It’s also incredibly valuable when dealing with the impact of an outage. Those happen, as we’ve talked about, and if you can target the right individuals quickly, especially in a big enterprise where you have an operations center, it helps. There might be an outage affecting multiple servers or network devices, and you see the alerts firing off. If you can get those alerts to the right people fast, it makes a huge difference.
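
Getting alerts to the right people can be as simple as mapping each alert source to an owning team. The sketch below assumes a naming convention in which a hostname prefix identifies the owner; the `OWNERS` map, the team names, and the alert fields are all hypothetical, and real environments more often route on tags or service catalogs.

```python
# Hypothetical ownership map: source-name prefix -> on-call team.
OWNERS = {
    "net-": "network-oncall",
    "db-": "database-oncall",
    "web-": "app-oncall",
}
DEFAULT_TEAM = "operations-center"  # catches anything without a known owner


def route_alerts(alerts):
    """Group alerts by owning team.

    Each alert is a dict with a 'source' key (for example a hostname).
    Returns a dict mapping team name to the list of alerts it should receive.
    """
    routed = {}
    for alert in alerts:
        team = next(
            (owner for prefix, owner in OWNERS.items()
             if alert["source"].startswith(prefix)),
            DEFAULT_TEAM,
        )
        routed.setdefault(team, []).append(alert)
    return routed
```

The fallback team matters as much as the map: an alert with no recognized owner still lands in front of someone instead of being dropped.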

With observability properly instrumented and your architecture in place, it can significantly lower the time spent on triage. It gives engineers an extra hand, and in many cases, the cost of an outage—both in terms of brand reputation and business impact—is far higher than the cost of having good observability tools and well-trained people ready to resolve issues quickly. It’s the difference between resolving an outage in minutes versus hours or even days.

Kevin Kohn: I had to smile when you said that because we’ve all seen recent high-profile outages where companies released code that wasn’t ready, and it bricked machines all over the world. If only they had proper observability in place, that catastrophe could have been avoided. That single event cost billions in lost revenue across airlines and many other industries. It’s a very relevant example of what we're talking about.

Roy Douber: Yeah, for sure. We could talk about the CrowdStrike outage—it’s a classic example of not having proper gates in place when pushing code to production.

Realistically, if you only push to a portion of your user base before releasing the full payload to everyone, you’d see signs indicating whether or not to stop. There was likely a mistake somewhere—either an environment was skipped or testing across various operating systems was missed. Instead of pushing to dev first, they pushed straight to production, which happens sometimes. But the other safeguard would be a slower rollout, where you start with a subsection of your user base and gradually increase. This is standard practice with Kubernetes rollouts and A/B testing. You put version A into production and leave version B running, then observe how people interact with both. That step was likely skipped in this case.

If they had just sent it to a dev environment or even a smaller portion of their user base, they might have avoided the entire catastrophe.
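
The kind of gate described here can be sketched as a loop that widens the rollout only while the observed signals stay healthy. This is a generic illustration, not a description of how any particular vendor ships updates. The stage percentages and the `health_check` hook are hypothetical; in practice the hook would query whatever monitoring data is available for the cohort that already received the release.

```python
def staged_rollout(stages, health_check):
    """Walk through rollout percentages, halting at the first unhealthy stage.

    stages: increasing percentages of the user base, e.g. [1, 5, 25, 100]
        (illustrative values).
    health_check: callable that takes the current percentage and returns True
        if the signals observed at that stage look healthy (hypothetical hook).

    Returns (last_healthy_percentage, completed). last_healthy_percentage is 0
    if even the first stage failed its check.
    """
    last_healthy = 0
    for percent in stages:
        if not health_check(percent):
            # Stop here: the release never reaches the remaining users.
            return last_healthy, False
        last_healthy = percent
    return last_healthy, True
```

For example, if the check starts failing once 25% of users have the release, the function returns `(5, False)`: the rollout stops, and the remaining users never receive the faulty version.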

Kevin Kohn: So, to pivot a bit, what are some misconceptions you’ve run into about observability? What do people not understand about it that you’d like to clear up?

Roy Douber: One of the biggest misconceptions is that many companies still don’t view observability as a must-have. In my experience, though, clients that have even a marginally good observability strategy have cleaner products, better systems, and higher stock prices. If you correlate how well a company runs its observability practice with everything about their application—how it behaves, what happens when a page breaks, how it degrades—you can see the connection.

Observability is strongly correlated with a company’s engineering culture and its ability to resolve issues quickly. I think people still view it as a checkbox item, but I see it as crucial. I encourage companies to instrument the heck out of their mission-critical applications because if you don’t, you’re one outage away from disaster, and it’s always going to happen at the worst possible time.

Kevin Kohn: I think you’re being generous by saying most people see it as a checkbox item. I’d go further and say a lot of companies think it’s optional altogether. I can’t count how many environments I walk into where they don’t even have an observability strategy. They don’t have a plan, they haven’t chosen a vendor, or they have multiple tools that aren’t working together. When it comes time to track down an issue, it’s impossible because the data is scattered across different systems.

Roy Douber: Exactly. It’s common for us to see clients with 30 different tools, and they can’t make sense of the data any of them provide. Or maybe there’s one guy in each department who knows how to use a specific tool really well, but no one else has access to it. One recommendation I always make is to democratize your data—make sure everyone has access to the tools.

And consolidate! If you have 30 tools, you’re probably using 25 too many. Consolidation is key, and you need to be mindful of the gaps that each tool covers.

Kevin Kohn: I like what you were saying about how having too many tools can actually be worse than having fewer tools that everyone understands. Having five or even one tool that the whole team knows is often more effective than having 30 that only a few people understand. Is there anything I missed in that?

Roy Douber: No, I think you captured it. The key is to consolidate and provide access to everyone. You don’t want your teams pointing fingers at each other. Everyone should be looking at the same data from the same vantage point, and working together to solve problems.

Siloed triage is a big problem in a lot of companies. But if you properly architect your observability story, you can build a solution that allows access for all your engineers. That brings a lot of engineering prowess to bear. Maybe a network engineer or storage engineer has experience as an operator and understands certain aspects of an issue that others don’t. Observability allows people from different teams to collaborate and troubleshoot more effectively.

The idea is to have mechanisms in place that allow for better triage stories. People should be able to go down the path of a problem, saying, “This is still an issue I’m seeing based on these symptoms,” or, “I’ve exhausted my options on this path, so I’ll go join another team tackling it from a different angle.” It’s key to have all the data in one place or as close to one place as possible to enable this kind of collaborative triage.

Kevin Kohn: Thank you, Roy! This has been incredibly informative. I think we covered a lot of ground today in terms of observability and its critical role in IT environments. Your insights will really help our listeners understand how important it is, not just for keeping systems running but for driving business success. I appreciate you taking the time to share your expertise.

Roy Douber: Thanks, Kevin. I really enjoyed this. Maybe we can do a part two in the future, covering more specific technologies or misconceptions we didn’t get to today. This was a broad overview, and there’s definitely a lot more we could discuss.

Kevin Kohn: I agree. Thanks for kicking off this observability 101 class with us! Let’s look forward to a 201-level session down the road.

Written By: Kevin Kohn