Datadog Kubernetes Remediation

Compute teams build the infrastructure and workflows that enable software teams to deploy their applications. When deployment issues arise, software engineers often rely on compute engineers for help—an approach that doesn’t scale in complex applications or large organizations.

Kubernetes Remediation helps software engineers resolve common deployment issues on their own, reducing the support burden on compute teams.

Team Composition

1 PM
2 full-time engineers and 1 part-time engineer (full stack)
1 lead product designer (me)

Timeline and Project Context

1 quarter to build from scratch: We had 3 months to design and build a Limited Preview product and collect design partner feedback.
I was unfamiliar with containerization and Kubernetes: I had recently joined this team from another product vertical after the team’s other designer left suddenly. I had to get to know the team’s working dynamic and learn the terminology, infrastructure, and relevant users from scratch.
I was juggling this and another project at the same time: See Kubernetes Autoscaling

The discovery process

Learning from customers, competitors, and our team

Learned about containerization and Kubernetes: I read Kubernetes documentation and watched a lot of containerization videos.

Downloaded existing team knowledge: Some ideation had already happened before I joined the team. I held several meetings with the PM and engineering team to understand what had been done and what decisions had been made.
Competitive analysis: Examined competitors and their strategy, UX approach, and features.
Customer calls and workflow mapping: Worked with the PM to produce an interview script and held meetings with customers to learn how they currently troubleshoot and solve Kubernetes issues.

Our target users

From internal discussions with our team and external user interviews, we landed on these users as our initial target.

Compute engineer: Designs, manages, and optimizes an organization’s system infrastructure. Works closely with software engineers.
Software engineer: Responsible for designing, developing, testing, deploying, and maintaining software applications or services.

Mapping out real workflows used to find root cause and resolve common Kubernetes issues

Competitive analysis

Generating ideas and wireframes

Product inputs and key principles

The product manager and I agreed on the following goals and principles.

Product Requirements: Based on prior user and competitive research, the product manager distilled top priorities into a PRD for the quarter. We chose to focus on one to two core Kubernetes issues as a starting point.
Engineering POCs: Engineering had already built a POC before I joined, so I used it to understand what the existing backend could support and to learn what content the engineers thought would be useful to display.

If we can identify what caused the issue, clearly display the supporting data so users can see it for themselves and build trust in our tool.
If there are ways to solve the problem, explain those clearly within the user experience, or better yet, provide that action directly on the page.
Always provide a rollback (undo) to the previous version, both to stabilize the identified problem and as a fallback if a provided solution does not produce the expected behavior.
Different companies may need different levels of access and permissions.

Showing how a Datadog monitor could link to the Kubernetes Remediation product

Explorations for how to display a list of Kubernetes issues within the Datadog Kubernetes product

Explorations on how to display a Kubernetes problem and solution. There were some complexities in how to display multiple problems and multiple Pods.

Explorations on how to fix an OOMKill without leaving the page.

Explorations on showing an action history for all actions taken for a workload.

Resolve Kubernetes issues in a single click.

Viewing a Kubernetes Remediation item

Applying a fix to an out-of-memory issue and resolving the problem

Afterward

We successfully launched a Limited Preview and started collecting design partner feedback.

I helped create an analytics tracking plan so we could quantitatively monitor user activity and see whether users were troubleshooting and initiating actions from our product.

Impact

Prioritize fiercely and pick battles wisely: Creating a complex product from scratch in only a quarter requires agreeing on where to focus time and energy. Our team became really good at constantly checking in on what was worth design and engineering effort, and what was not. Small delighters such as animations were saved for the next iteration, but design that provided clarity and trust was never sacrificed.
Not all users are created equal: From design partner feedback, we learned that some companies preferred that all engineers have access to actions and controls, whereas others wanted to limit access to a few qualified engineers. Permissions would need to be a key part of our experience in a future iteration.

Reflections
