May 28, 2024
Cardinal Founding Team
A couple of years ago, I found myself wandering through ObservabilityVille in the expo hall at re:Invent. It felt like I had stumbled into a Service Graph theme park. Service Graphs were the new cool (or cool enough) thing to trigger my FOMO (as an Observability guy then at Netflix). For what it's worth, the kool aid I chose to drink was something along the lines of (from my 2022 notes): “Service graphs offer a bird’s-eye view of how services connect and communicate, providing invaluable insights for troubleshooting and performance optimization. PS: Houston, we may finally have an easier consumption model for trace data!”
I got back from re:Invent excited to start working on a Service Graph myself! Over the next few months, two backend engineers, one frontend engineer, and one UX designer worked on what I thought would be a perfect rendition of the Service Graph idea, one that would change everything. And yet, the only use cases that, in my opinion, somewhat stuck were:
Migrations/Blast Radius Identification: If we make this change to Service A, which other services could be affected, either directly or transitively?
New Team Member Orientation: Go look at the Service Graph to see how everything talks to each other!
It didn’t turn out to be the hero feature we were hoping it would be! And in my opinion, here are some reasons why. Service Graphs are:
Unwieldy. A complicated graph depicting hundreds (or, in the case of Netflix, thousands) of services is good for showing the kids the complexity that powers their TV time, but that’s about all it’s good for.
Oversimplified depictions of reality, especially for microservices architectures, where the sheer number of services and their intricate interactions are far more complex than a graph can convey. For example, a service may decide to call another service based on internal cache state, feature flags, or even random choice, and Service Graphs certainly fail to convey the true nature of such interactions.
Lacking context. Context in a telemetry system comes from the time dimension. Is this weird? Has it happened before? If so, when, and what was the magnitude? Those are the obvious next questions, and ones you can’t easily ask when looking at a summarized representation such as a Service Graph.
Service Centric. Traces make Service Graphs possible: they carry actual information about the call graph and, more importantly, convey the business goal of the request! Yet most vendors choose to focus on a physical abstraction like the service (hence the Service Graph), perhaps because the end-user persona they are trying to please is the Service Owner. Sounds logical, I know, but we at Cardinal find this approach to be fundamentally flawed.
So, what are Service Centric Visualizations missing?
I’d like to quote a brilliant example from a blog post written by our ex-colleague Joey Lynch:
Take 3 services A, B and C. “Service A always fails a little bit (or has an ongoing low error rate); [that] never recovers. Service B occasionally fails cataclysmically. It recovers quickly but still experiences a near 100% outage during that period. Finally, service C rarely fails, but when it does fail, it fails for a long time."
Joey shows how, despite having very different failure patterns, all three services have 3 9s of reliability. Suddenly Service Reliability appears to be a weak indicator of your Business’s Reliability, doesn’t it? Rather, the question now changes to: “Can you tell me more about what A, B and C do? And more importantly, what is the blast radius or the quantifiable impact (if any) of these failures on the business workflow(s) they serve for your company/organization?”
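To make the arithmetic concrete, here is a minimal sketch of how three very different failure patterns can all compress to the same 99.9% number. The specific outage durations and counts below are our own illustrative assumptions, not figures from Joey's post:

```python
# Illustrative only: numbers are assumed, not taken from Joey's post.
# Three failure shapes that all land at ~99.9% availability over 30 days.

WINDOW_MIN = 30 * 24 * 60  # 43,200 minutes in a 30-day window

def availability(downtime_min: float) -> float:
    """Fraction of the window the service was up."""
    return 1 - downtime_min / WINDOW_MIN

# A: always fails a little -- a steady 0.1% of requests error out, forever.
a = 1 - 0.001

# B: occasionally fails cataclysmically but recovers fast --
#    six total outages of ~7.2 minutes each.
b = availability(6 * 7.2)

# C: rarely fails, but when it does, it's down for a long time --
#    one 43.2-minute outage.
c = availability(1 * 43.2)

for name, avail in [("A", a), ("B", b), ("C", c)]:
    print(f"{name}: {avail:.4%}")  # each prints 99.9000%
```

Three radically different customer experiences, one indistinguishable reliability number.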
And that precisely is our pitch: Turn Traces into Business Flows and enable service owners to see service telemetry through a business lens.
Enter Business Flows!
We map millions of requests and their resulting call graphs into distinct call paths, and then Chip, our AI Head of Troubleshooting, infers what each call graph means from a business perspective. For each Business Flow, you see the number of calls, errors, and latency over time.
For each flow, you see a single aggregated call graph showing similar metrics over time for each span across that call graph. You can now set SLOs on Flow Metrics, as opposed to Service or Span Metrics, and get alerted when they degrade. Next, Chip finds the services that contribute to the degradation and brings in any other metrics or logs from those services that might help explain the issue. This makes for a much more holistic troubleshooting experience, because we end up removing the artificial boundaries between the so-called three verticals of today.
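One way to picture the first step, collapsing millions of traces into distinct call paths, is to key each trace by a canonical signature of its call graph. The sketch below is our own simplified illustration of the idea; the `Span` model and function names are hypothetical, not Cardinal's actual API:

```python
# Hypothetical sketch: group traces by the shape of their call graph.
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class Span:
    service: str
    operation: str
    children: list = field(default_factory=list)

def path_signature(span: Span) -> str:
    """Canonical string for a call graph: service:op plus sorted child signatures."""
    child_sigs = sorted(path_signature(c) for c in span.children)
    return f"{span.service}:{span.operation}({','.join(child_sigs)})"

def group_traces(roots: list) -> Counter:
    """Collapse many traces into counts per distinct call path."""
    return Counter(path_signature(r) for r in roots)

# Two checkouts that took the same path, and one that skipped the cache.
t1 = Span("api", "checkout", [Span("cache", "get"), Span("payments", "charge")])
t2 = Span("api", "checkout", [Span("cache", "get"), Span("payments", "charge")])
t3 = Span("api", "checkout", [Span("payments", "charge")])

counts = group_traces([t1, t2, t3])
for sig, n in counts.most_common():
    print(n, sig)
```

Each distinct signature becomes a candidate call path; attaching call counts, error counts, and latency to those paths (rather than to individual services) is what lets the flow, not the service, be the unit you alert on.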
Finally, we are seeing our customers/design partners departing from the traditional Service-Centric mindset and transitioning to a Business-Centric mindset, because they can now start with business impact and then move on to the service failures that may be causing it, rather than the other way around.
About Cardinal
We're a team of former Netflix engineers who've spent the last decade building super-fast, dependable systems capable of handling petabytes of data. Cardinal represents the next step in our journey. We're combining tried-and-true Observability/SRE practices with cutting-edge innovations focused on cost-effectiveness and problem-solving efficiency, creating products that stand out in a crowded market.