Elevated latency on event app

Incident Report for Swapcard

Postmortem

This post-mortem covers the service disruption that affected Swapcard customers on Tuesday, September 10th, 2024, starting at 13:30 UTC. The issue affected several of our apps (on both web and mobile platforms) and was caused by an unexpected traffic surge that arrived during an automatic scale-down phase triggered by fluctuating CPU usage.

The goal of this post-mortem is to share insights from our initial assessment, as published on the Swapcard status page, and to detail the corrective measures we’ve implemented to restore service to normal.

Incident summary

On Tuesday, September 10th, 2024, at 13:30 UTC, we encountered elevated latency across our applications (both web and mobile), resulting in user-facing errors and slow or incomplete application loads. This latency spike was linked to a traffic surge during an automatic scaling phase. Our system adjusts to incoming traffic by scaling up or down based on load; however, an influx of events arrived while a slow scale-down was still in progress, which delayed the following scale-up and caused response times to fluctuate as the system attempted to adapt.
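To illustrate why alternating traffic is hard on a load-based autoscaler, here is a minimal toy simulation. It is only a sketch and does not reproduce Swapcard's actual scaling system: the function name, the per-replica capacity, the two-tick scale-up delay, and the instantaneous scale-down are all assumptions chosen to keep the example short.

```python
import math


def simulate_reactive_autoscaler(load, capacity_per_replica=100,
                                 min_replicas=1, scale_up_delay=2):
    """Toy model (hypothetical): scale-ups take `scale_up_delay` ticks to
    become effective, while scale-downs are applied right away."""
    replicas = min_replicas
    pending = {}   # tick at which a requested scale-up is ready -> target count
    history = []   # (tick, demand, replicas, overloaded)
    for tick, demand in enumerate(load):
        # Capacity requested earlier comes online at the start of this tick.
        if tick in pending:
            replicas = max(replicas, pending.pop(tick))
        overloaded = demand > replicas * capacity_per_replica
        history.append((tick, demand, replicas, overloaded))
        # Decide what the next ticks should look like based on current load.
        desired = max(min_replicas, math.ceil(demand / capacity_per_replica))
        if desired > replicas:
            pending[tick + scale_up_delay] = desired
        elif desired < replicas:
            replicas = desired
    return history


# Assumed traffic shape: low, surge, low, surge, low (requests per tick).
traffic = [100, 400, 400, 100, 100, 400, 400, 100]
for tick, demand, replicas, overloaded in simulate_reactive_autoscaler(traffic):
    status = "OVERLOADED" if overloaded else "ok"
    print(f"tick={tick} demand={demand} replicas={replicas} {status}")
```

In this toy model the extra capacity arrives just as each surge ends and is removed again before the next one, so every surge is served with too few replicas.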

Swapcard's monitoring systems detected the disruption and promptly activated our Incident Response team.

The team took immediate action to triage and mitigate the issue by disabling automatic scaling and forcing an aggressive scale-up. This overscaling limited further impact from the traffic surge while we worked on improvements. In parallel, we launched an investigation to refine how we handle such traffic patterns and to optimize our scaling configuration for this kind of alternating traffic sequence (high, low, high, low).

Mitigation deployment

At 13:35 UTC, our infrastructure team immediately addressed the issue by disabling automatic scaling and manually triggering an aggressive scale-up. This manual override of our usual scaling process took approximately five minutes. The delay was not in the scaling itself but in ensuring that the changes effectively overrode the default behavior and provided a stable foundation to handle the surge in traffic. As the update propagated through our infrastructure, the errors steadily decreased and eventually stopped.

Swapcard’s engineering team continued to monitor system metrics to ensure full recovery. At 3:44 UTC, after further monitoring and detecting no additional issues, we confirmed that the incident had been fully resolved.

Event Outline

Events of September 10th, 2024 (UTC)

(13:30 UTC) | Elevated latency on our applications

(13:31 UTC) | Disruption identified by Swapcard monitoring

(13:35 UTC) | Manual override of our usual scaling process

(13:37 UTC) | Errors decreasing

(13:40 UTC) | Incident mitigated

Affected customers may have been impacted to varying degrees and for a shorter duration than described above.

Forward Planning

Swapcard has begun enhancing the scaling algorithm to prevent this type of incident from recurring, in line with our high standards for deliverability. Some improvements were deployed during the night of September 11-12th to strengthen our system for handling such scenarios.
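This report does not describe the specific algorithm changes. As an illustration only, one common technique for alternating traffic is a scale-down stabilization window: the autoscaler keeps the highest replica count it wanted over the last few evaluations, so capacity is not released between two closely spaced surges. The sketch below assumes hypothetical names and values and is not a description of what Swapcard deployed.

```python
import math
from collections import deque


def plan_with_scale_down_window(load, capacity_per_replica=100,
                                min_replicas=1, window=3):
    """Hypothetical sketch: scale up immediately, but scale down only to the
    highest replica count desired during the last `window` evaluations."""
    recent = deque(maxlen=window)  # most recent desired replica counts
    plan = []
    for demand in load:
        desired = max(min_replicas, math.ceil(demand / capacity_per_replica))
        recent.append(desired)
        # Holding the maximum delays scale-down without delaying scale-up.
        plan.append(max(recent))
    return plan


# Same assumed traffic shape as a surge, a lull, and a second surge.
traffic = [100, 400, 400, 100, 100, 400, 400, 100]
print(plan_with_scale_down_window(traffic))
```

With a window at least as long as the lull between surges, the capacity added for the first surge is still in place when the second one arrives. The trade-off is higher cost, since more replicas stay up during quiet periods.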

Posted Sep 11, 2024 - 14:56 UTC

Resolved

This incident has been resolved.

Posted Sep 10, 2024 - 13:40 UTC

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Sep 10, 2024 - 13:39 UTC

Investigating

We are currently investigating this issue.

Posted Sep 10, 2024 - 13:38 UTC

This incident affected: Event App.