Re: "the customer’s workloads had been overscheduled"
Most deployments to Kubernetes aren’t done via a web UI; they’re done via the standard Kubernetes command line tools. If you’re using those, you probably aren’t (or shouldn’t be) the type of admin who depends on point-and-drool handholding. As for preventing people from getting into trouble - well, I’ve never met a technology that can stop a determined idiot.
Regarding limiting over-scheduling, it can absolutely be a valid user decision. Particularly in non-prod environments, where you may burn a lot of money if you don’t let workloads contend for resources, and probably don’t care too much about very occasional problems when everything gets busy at the same time.
If the user tried to deploy to production without using the very rich set of primitives Kubernetes has for controlling scheduling, I’d definitely say they bear a significant portion of the responsibility. It’s like massively over-committing a VMware cluster. RTFM, know your workloads, and test properly in a prod-like environment.
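For context, the main primitives here are per-container resource requests and limits. A minimal sketch (the names and numbers are illustrative, not from the incident):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-app        # illustrative name
spec:
  containers:
    - name: app
      image: example/app:latest   # illustrative image
      resources:
        requests:          # what the scheduler uses to place the pod
          cpu: "250m"
          memory: "256Mi"
        limits:            # hard caps enforced at runtime
          cpu: "1"
          memory: "512Mi"
```

Setting requests below limits is exactly how you deliberately over-commit a node: scheduling is done against the sum of requests, so the sum of limits can exceed node capacity - the valid non-prod trade-off described above. Omitting requests entirely is the foot-gun.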
What I do think was bad was that the user’s poor decision was allowed to affect the system-level services. In a managed cluster, that would have made it difficult for the user to debug the problem themselves, and it shouldn’t have taken a day’s debugging by the Azure team to locate this fairly basic problem. That bit Microsoft should definitely shoulder the blame for. Still, at least they’ve fixed it (according to the HN thread).