Microservice architectures result in environments up to 20 times larger than their monolithic counterparts. In such large, interconnected environments, container metrics tell you about infrastructure health but not service health. Even if you have implemented service health checks to react quickly to service failures, a resilient system will still show temporary mushroom cloud effects, in which a large number of services are affected for a short time. How do you find out what really caused the problem, and how do you distinguish cause from effect?
In this session we will perform post-mortem analysis by walking through several failure cases we have observed in a large, real-world e-commerce production environment, and show you how to figure out what actually caused each failure.