Kinescope - API Service outage – Incident details

All systems operational

API Service outage

Resolved
Major outage
Reported 1 day ago
Lasted 19 minutes

Affected

API

Operational from 7:04 PM to 7:04 PM, Major outage from 7:04 PM to 7:22 PM, Operational from 7:22 PM to 7:22 PM

Updates
  • Postmortem
    Postmortem

    During a scheduled drive replacement in two of our Moscow data centers, the application cluster responsible for the API service went offline. Redis failed to promote masters on the remaining healthy node, so the API service repeatedly tried to connect to an unavailable Redis master. A manual failover resolved the issue and restored service.

    Timeline & Root Cause Analysis

    • Planned maintenance was performed to replace hard drives in two data centers.

    • As part of the work, the app server in DC1 was shut down.

    • Later, the app server in DC2 was also shut down for the same maintenance.

      • The DC1 app server did not have enough time to fully come back online before the second shutdown.

    • As a result, two app servers in the cluster went down simultaneously.

    • Redis did not switch the master role to the remaining node in the other DC as expected.

    • The API service failed to start because it kept trying to connect to the Redis master located in DC1, which was unavailable.

    • Redis master roles were manually promoted to the servers in DC2.

    • Once the Redis topology was corrected, the API and dashboard services recovered and returned to normal operation.

    Resolution

    • All Redis masters were manually switched to the healthy nodes in DC2.

    • Application services (API, dashboard) successfully started and functioned as expected.
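A minimal sketch of how the corrected topology could be verified after a manual promotion. It assumes each node's `INFO replication` reply has already been collected as text (or `None` if the node is unreachable); the function names and node names are hypothetical and not taken from Kinescope's tooling.

```python
from typing import Dict, List, Optional


def parse_replication_info(info_text: str) -> Dict[str, str]:
    """Parse the key:value lines of a Redis `INFO replication` reply."""
    fields: Dict[str, str] = {}
    for line in info_text.splitlines():
        line = line.strip()
        # Skip blank lines and section headers such as "# Replication".
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key] = value
    return fields


def reachable_masters(replies: Dict[str, Optional[str]]) -> List[str]:
    """Return the names of reachable nodes that report role:master.

    `replies` maps a (hypothetical) node name to its raw reply text,
    or to None when the node could not be reached.
    """
    return sorted(
        node
        for node, text in replies.items()
        if text is not None
        and parse_replication_info(text).get("role") == "master"
    )


def has_single_reachable_master(replies: Dict[str, Optional[str]]) -> bool:
    """True when exactly one reachable node in the group is a master."""
    return len(reachable_masters(replies)) == 1
```

A check like this would report no reachable master while the only live node still considers itself a replica, and exactly one master once the node in DC2 has been promoted.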

    Next Steps / Preventive Actions

    • Corrected the maintenance procedure so that app servers are never taken down simultaneously and Redis failover is validated before each step.

    • Review and improve the Redis automatic failover configuration.

    • Add health checks and monitoring for Redis master availability and app server readiness.

    • Adjust maintenance sequencing to guarantee sufficient startup time between operations.
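The sequencing rule in the preventive actions can be sketched as a gate that refuses to start work on the next server until the previous one reports ready. This is an illustration under assumptions, not the actual maintenance tooling: `do_maintenance` and `is_ready` are hypothetical callables supplied by the operator, and the timeout values are placeholders.

```python
import time
from typing import Callable, Iterable, List


class MaintenanceAborted(RuntimeError):
    """Raised when a server does not become ready within the timeout."""


def rolling_maintenance(
    servers: Iterable[str],
    do_maintenance: Callable[[str], None],
    is_ready: Callable[[str], bool],
    timeout_s: float = 600.0,
    poll_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> List[str]:
    """Service servers one at a time, waiting for readiness in between."""
    completed: List[str] = []
    for server in servers:
        # Hypothetical step: shut down, replace drives, power back on.
        do_maintenance(server)
        deadline = clock() + timeout_s
        # Do not touch the next server until this one is ready again.
        while not is_ready(server):
            if clock() >= deadline:
                raise MaintenanceAborted(
                    f"{server} not ready after {timeout_s}s; "
                    "stopping before the next server"
                )
            sleep(poll_s)
        completed.append(server)
    return completed
```

In practice `is_ready` could combine an app server health probe with a Redis master availability check, so that the run stops with one server down instead of two.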

  • Resolved
    Resolved
  • Detected
    Detected

    We are currently investigating this incident.