
The Road To Success - Transparency, Ownership, Innovation


“I've missed more than 9000 shots in my career. I've lost almost 300 games. 26 times, I've been trusted to take the game-winning shot and missed. I've failed over and over and over again in my life. And that is why I succeed.”

These words were said by Michael Jordan, and they are our motto. If we want to be the best dog show software, there will be failures along the way, but with every failure we become better: we learn from our mistakes and improve.

 

For us at the Dogg.dog team, waking up on a show day is not just another day. It is a huge responsibility and an honor to make sure that all results are saved, that dog owners get their critiques online, and that the dog show event management goes according to plan, so that at the end of the event the dog show management tool can provide all the online results and statistics needed.

On the 21st of September, we had a chance to stress-test the new system with 5 different shows across 3 different locations, 400 dogs in total.

The previous system worked and served our users in managing dog shows online for the past 3.5 years. The new version is based on that code but was rewritten from scratch using the latest technology available on the market. This was done to bring offline support (dog show software that keeps working with no internet connection) and 99.9% redundancy and uptime.

In the morning, once the first show started, we began to experience a server overload. Our first impression was a DDoS attack; however, Dogg.Dog was designed to handle such attacks. The overload caused the system to flap, and the field management system of the dog show became unresponsive at times. However, thanks to our unique solution and the several levels at which we handle data, no data was lost.

The Investigation

Timeline:
  • Issue reported/started: 09:30
  • Technical development team engaged to resolve the problem: 09:42
  • Issue resolved: 10:30

The root cause:

The issue was due to a missing signature of a dog judge in the system, which set off a chain reaction of bugs (a scenario that our QA team had not checked).
The bug was cached in Redis, so it also appeared for other users once they tried to complete a dog judgment in their system.
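
One common guard against this kind of spread is to validate a record before it reaches the shared cache. The following is only a minimal sketch under stated assumptions: the `Judgment` fields, the `IncompleteJudgment` error, and the key format are hypothetical, and a plain dict stands in for Redis. It does not describe how Dogg.dog actually stores judgments.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Judgment:
    # Hypothetical record shape, for illustration only.
    dog_id: str
    judge_id: str
    critique: str
    judge_signature: Optional[str]  # None models the missing signature


class IncompleteJudgment(ValueError):
    """Raised when a judgment lacks a field that later steps rely on."""


def validate(judgment: Judgment) -> None:
    # Reject the record before it can reach the shared cache.
    if not judgment.judge_signature:
        raise IncompleteJudgment(
            f"judge {judgment.judge_id} has no signature on file"
        )


def cache_judgment(cache: dict, judgment: Judgment) -> None:
    # `cache` is a plain dict standing in for Redis in this sketch.
    validate(judgment)
    cache[f"judgment:{judgment.dog_id}"] = judgment
```

With a check like this, an incomplete judgment fails for the one user who submitted it instead of being served from the cache to everyone else.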

The server started to send critiques in a loop, at 4,000 critiques per minute. After 20 minutes, with 80,000 critiques in the mail server, the server could not hold any more and crashed. The key points here were:
  • The server recovered within 3 minutes.
  • The critiques were left in the queue and continued to be sent, even after the server restarted.
  • 80,000 critiques is more than double the number of dogs that participate in a World Dog Show or European Dog Show.
  • There was no data loss.
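
The numbers add up: 4,000 critiques per minute for 20 minutes is 4,000 × 20 = 80,000. A runaway loop like this is the kind of thing a rate limiter is meant to cap. Below is a minimal token-bucket sketch; the class name, the `rate` and `capacity` parameters, and the injectable `clock` are assumptions made for illustration, not the throttle that was actually built.

```python
import time


class TokenBucket:
    """Allow about `rate` sends per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: int, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)  # start with a full bucket
        self.clock = clock             # injectable so it can be tested
        self.last = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A mail sender would call `allow()` before each send and leave the critique in the queue when it returns `False`, so a loop can only drain the bucket and cannot flood the mail server.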

User Effect:
Flapping service, delays in sending critiques.

Action Items:
  • Block and throttle requests, even if the events are being sent through the system.
  • Build a queue management service to handle multiple requests.
  • Build a fallback mail service, so that in case of a failure the email is sent through a different provider.
  • Save a dog critique even if the mail was not sent to the dog owner.
  • Add a health check for our WebSocket service.
  • Add another data save logic and check.
  • Fix the missing signature bug.
  • Fix a WebSocket bug that was found during the QA process.
  • Add monitoring tools.
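
Two of the items above, saving the critique regardless of mail delivery and falling back to a different provider, can be sketched together. This is a simplified illustration under assumptions: `deliver_critique`, the `Sender` callable type, and the dict used as a store are hypothetical names, not the services described in this post.

```python
from typing import Callable, Dict, List

# A sender takes (recipient, body) and raises an exception on failure.
Sender = Callable[[str, str], None]


def deliver_critique(
    store: Dict[str, str],
    critique_id: str,
    recipient: str,
    body: str,
    providers: List[Sender],
) -> bool:
    # Persist first, so the critique survives even if no email goes out.
    store[critique_id] = body
    for send in providers:
        try:
            send(recipient, body)
            return True
        except Exception:
            # This provider failed; try the next one in the list.
            continue
    # Every provider failed, but the critique is still saved.
    return False
```

Saving before sending means a mail outage can delay delivery but cannot lose the critique, and the return value tells the caller whether a retry is still needed.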

All action items have been completed, implemented, and tested in production.

And before we end this post, always remember:
“The key to success is failure”
― Michael Jordan
