Media capabilities degraded

Incident Report for Atlassian Support

Postmortem

SUMMARY

On May 15, 2023, between 02:36 and 04:08 UTC, Atlassian customers using Bitbucket, Confluence, Jira Align, Jira Service Management, Jira Software, Jira Work Management, Jira Product Discovery, and Atlas products with services hosted in the us-west-1 region were impacted by an incident related to the storing and retrieval of data assets, including media, attachments and build artifacts.

The event was triggered by a network migration of an internal service as part of an initiative to increase security by hardening partitions between network segments. The incident was detected within three minutes by automated monitoring and mitigated by a rollback of the change which put Atlassian systems into a known good state. The total time to resolution was about one hour and 32 minutes.

IMPACT

The impact across products was:

Bitbucket - Bitbucket Pipelines self-hosted builds were failing, access to Git LFS failed and cloud-hosted builds were delayed.
Confluence, Jira Align, Jira Service Management, Jira Software, Jira Work Management, Jira Product Discovery, and Atlas - media capabilities (images, videos, documents, audio) were affected and it was not possible to upload, download or view existing media attachments or files.

The service disruption lasted for one hour and 32 minutes, from 02:36 to 04:08 UTC on May 15, 2023, and affected customers with services hosted in the us-west-1 region.

ROOT CAUSE

The issue was caused by an attempted migration of a service to a new network segment. As part of this migration, a DNS record pointing to the old network segment was not updated, which resulted in failures when the old network stack was removed. While we have a number of testing and preventative processes in place, this specific issue wasn’t identified because moving services across network segments is not a regular activity and is difficult to replicate accurately in a test environment. To mitigate these types of issues, we made this change using blue/green deployment practices, but we failed to run adequate verification steps before decommissioning the old stack.
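The kind of verification step described above can be illustrated with a minimal Python sketch. It is not Atlassian's tooling; the function name, the hostnames, and the 10.1.0.0/16 range are all hypothetical and exist only for illustration. Given a mapping of hostnames to the addresses they currently resolve to, it reports any record that still points into the old network segment.

```python
import ipaddress
from typing import Dict, List


def stale_records(resolved: Dict[str, List[str]], old_segment_cidr: str) -> Dict[str, List[str]]:
    """Return the hostnames (and offending addresses) that still resolve
    into the old network segment.

    `resolved` maps a hostname to the IP addresses it currently resolves to.
    How that mapping is gathered (DNS zone export, resolver queries, etc.)
    is outside this sketch and is an assumption.
    """
    old_net = ipaddress.ip_network(old_segment_cidr)
    stale: Dict[str, List[str]] = {}
    for hostname, addresses in resolved.items():
        # Keep only the addresses that fall inside the old segment.
        hits = [a for a in addresses if ipaddress.ip_address(a) in old_net]
        if hits:
            stale[hostname] = hits
    return stale


if __name__ == "__main__":
    # Hypothetical data: one record was missed during the migration.
    resolved = {
        "media.internal.example": ["10.1.2.3"],  # still in the old segment
        "api.internal.example": ["10.2.0.5"],    # already migrated
    }
    for host, addrs in stale_records(resolved, "10.1.0.0/16").items():
        print(f"{host} still points at the old segment: {addrs}")
```

A check like this would only be useful if it ran before the old stack was removed and blocked the removal whenever the result was non-empty.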

REMEDIAL ACTIONS PLAN & NEXT STEPS

We are prioritizing the following improvement actions to avoid repeating this type of incident:

Reviewing the systems that decommission service stacks and implementing checks that confirm customer traffic is no longer being served before a stack is decommissioned; and
Adding steps to our service network migration process to identify associated DNS records that require attention.
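As an illustration of the first action, a pre-decommission traffic check might look like the following Python sketch. It assumes per-minute request counts for the old stack are available from some metrics source; the names `DrainCheck`, `check_drained`, and the 30-minute quiet window are hypothetical choices, not details from this report.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class DrainCheck:
    safe: bool    # True only if the old stack appears fully drained
    reason: str   # human-readable explanation of the decision


def check_drained(requests_per_minute: Sequence[int], min_quiet_minutes: int = 30) -> DrainCheck:
    """Decide whether an old stack may be decommissioned.

    `requests_per_minute` holds one request count per minute, oldest first.
    The stack counts as drained only if the most recent `min_quiet_minutes`
    samples are all zero.
    """
    if len(requests_per_minute) < min_quiet_minutes:
        return DrainCheck(False, "observation window too short")
    recent = requests_per_minute[-min_quiet_minutes:]
    busy = sum(1 for count in recent if count > 0)
    if busy:
        return DrainCheck(False, f"{busy} of the last {min_quiet_minutes} minutes still saw traffic")
    return DrainCheck(True, "no traffic observed in the window")


if __name__ == "__main__":
    # Hypothetical samples: traffic was still arriving two minutes ago.
    samples = [0] * 28 + [4, 0]
    result = check_drained(samples)
    print(result.safe, "-", result.reason)
```

The sketch fails closed: too little data is treated the same as observed traffic, so the old stack stays up unless there is positive evidence that it is idle.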

We apologize to customers whose services were impacted during this incident; we are taking immediate steps to improve the platform’s performance and availability.

Thanks,

Atlassian Customer Support

Posted May 23, 2023 - 23:54 UTC

Resolved

Between 2023-05-15 02:40 UTC and 2023-05-15 04:09 UTC, we experienced a partial outage for Atlassian Support, Confluence, Jira Work Management, Jira Service Management, Jira Software, Atlassian Bitbucket, and Atlas. The issue has been resolved and the service is operating normally.

Posted May 15, 2023 - 04:51 UTC

Monitoring

We have identified the root cause of the outage and have mitigated the problem. We are now monitoring closely.

Posted May 15, 2023 - 04:12 UTC

Investigating

We are investigating reports of intermittent errors for Atlassian Support, Confluence, Jira Work Management, Jira Service Management, Jira Software, Atlassian Bitbucket, and Atlas Cloud customers. We will provide more details once we identify the root cause.

Posted May 15, 2023 - 03:39 UTC

This incident affected: Downloads Access.