You can't test everything but you should monitor it

I want to present an incident that happened at our warehouse and led to an OpenSearch use case for metrics and monitoring. We rent out thousands of photo booths every year, with tens of thousands of bookings. Most of our processes are fully automated, such as the configuration of the photo booths, which are connected to the network in fulfillment, and the download of the photos once a photo booth has been returned.

In the high season, photo booths that are returned via shipping get configured and sent out to the next customer on the same day.

For this to work, the download of several gigabytes of photos must be fast. Normally, only 10 minutes pass between the download and the photo booth being configured on the shelf again.

From one day to the next, downloads stopped finishing in time and took almost 30 minutes instead. This disrupted the business considerably. After several hours of debugging, we found that we had a network issue. After aggregating the data, which took some time, we discovered that the problem had existed since 2020, for two years. We had not noticed it because we had no monitoring for it and because we were sending out far fewer photo booths during Covid-19.

That was the day we decided we needed monitoring for everything. We set up OpenSearch, pushed in the metrics, including the historical ones, and added alerts so that we get notified in time. This way, we recognize problems very early and can take action to resolve them.
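To make the idea concrete, here is a minimal sketch of what recording a download metric and checking it against a time budget could look like. It is not our production code: the index name, the field names, and the helper functions are hypothetical, and the 10-minute limit is simply the turnaround window mentioned above used as an assumed alert threshold. The sketch only builds the newline-delimited JSON body that the OpenSearch `_bulk` endpoint accepts; sending it and defining the actual alert are left out.

```python
import json
from datetime import datetime, timezone

# Hypothetical index name; the threshold is an assumption based on the
# 10-minute turnaround window described in the text.
INDEX_NAME = "photobooth-download-metrics"
MAX_DOWNLOAD_SECONDS = 10 * 60


def make_metric(booth_id, size_bytes, duration_seconds, finished_at):
    """Build one metric document for a finished photo download."""
    return {
        "booth_id": booth_id,
        "size_bytes": size_bytes,
        "duration_seconds": duration_seconds,
        # Throughput in megabytes per second, rounded for readability.
        "throughput_mb_per_s": round(size_bytes / 1_000_000 / duration_seconds, 2),
        "@timestamp": finished_at.isoformat(),
    }


def to_bulk_payload(docs, index_name=INDEX_NAME):
    """Serialize documents as NDJSON: one action line, then one document line."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(doc))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"


def is_too_slow(doc, limit=MAX_DOWNLOAD_SECONDS):
    """Return True when a download exceeded the assumed time budget."""
    return doc["duration_seconds"] > limit


# Example: a 3 GB download that took 30 minutes, as in the incident.
slow = make_metric("booth-0001", 3_000_000_000, 1800,
                   datetime(2022, 6, 1, 12, 0, tzinfo=timezone.utc))
payload = to_bulk_payload([slow])
```

Because historical downloads can be turned into the same document shape, old data can be backfilled through the same path, which is what makes a slow degradation over two years visible at a glance.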

Details

Tuesday, May 7 4:45pm-5:05pm in Asgabat

Track: Community

Speakers

Michael Lehr

Head of Code at KRUU