You know how it feels. After a new version is released, a service starts behaving in an unexpected way, and it’s up to you to save the day. But where to start? Criteo processes 150 billion requests per day, across more than 4000 front-end servers. As members of the Criteo Performance team, we investigate critical issues in this kind of environment.
In this talk, you will follow our insights, mistakes and false leads during a real-world case. We will cover all the phases of the investigation, from early detection to the actual fix, and we will detail our tricks and tools along the way, including but not limited to:
– Using metrics to detect and assess the issue
– What you can (and cannot) get from a profiler to make a good assumption
– Digging into the CLR data structures with a decompiler, WinDbg and SOS to verify your assumption
– Automating memory dump analysis with ClrMD to build your own tools when WinDbg falls short