us-south-1 Network outage

Resolved · Full outage

Severity Level: [High] Our engineers and DCOps team worked through the remaining issues affecting several hypervisors and instances, and all services and instances are now reporting healthy again. As such, we consider this incident resolved. The root cause was routine UPS-related maintenance that affected power to our cooling, causing networking and other hardware to power off or be throttled before power was restored. First reports of an issue were 3-25-25 @11:45PM CST. While some networking and services were partially operational during some of this time, the entire incident lasted ~16 hours. In the event you’re unsure as to whether your service is still impacted by this incident, please open a support ticket. We’ll get back to you as soon as possible.

https://support.lambdalabs.com/hc/en-us/requests/new

To be automatically notified when the status of this incident changes, please click on the “Subscribe to updates” button. Thank you for your patience and understanding during this time.

Wed, Mar 26, 2025, 08:37 PM

Affected components

Mar 26, 2025, 05:37 AM to 08:37 PM

Updates

Resolved

Wed, Mar 26, 2025, 08:37 PM

Identified

Severity Level: [High] Power is back at the DC level and network access has been restored; however, the power outage has left some hypervisor hosts unhealthy, and they might require reprovisioning. The total impact on instances is still unclear, but we're going to start rolling hard reboots of several hypervisors to get them healthy again. Some users will experience additional downtime as we reboot hosts that have running, accessible instances on them. In the event you’re unsure as to whether your service is impacted by this incident, please open a support ticket. We’ll get back to you as soon as possible.

https://support.lambdalabs.com/hc/en-us/requests/new

Wed, Mar 26, 2025, 06:34 PM (2 hours earlier)

Identified

Severity Level: [High]

Power is back at the DC level. We have determined that a power reset on the hypervisor nodes resets the networking and brings the host back online. We are doing controlled power resets to bring back hosts and instances.

In the event you’re unsure as to whether your service is impacted by this incident, please open a support ticket. We’ll get back to you as soon as possible.

https://support.lambdalabs.com/hc/en-us/requests/new

Wed, Mar 26, 2025, 05:29 PM (1 hour earlier)

Identified

Severity Level: [High]

In the event you’re unsure as to whether your service is impacted by this incident, please open a support ticket. We’ll get back to you as soon as possible.

https://support.lambdalabs.com/hc/en-us/requests/new

Wed, Mar 26, 2025, 05:14 PM (15 minutes earlier)

Identified

Severity Level: [High]

Power is back at the DC level, and we are beginning to systematically restore systems. There may still be fluctuations as we bring nodes back up; we are working cautiously to avoid overwhelming or impairing the recovery process.

In the event you’re unsure as to whether your service is impacted by this incident, please open a support ticket. We’ll get back to you as soon as possible.

https://support.lambdalabs.com/hc/en-us/requests/new

Wed, Mar 26, 2025, 02:29 PM (2 hours earlier)

Identified

Severity Level: [High] Our DCOps team is still working toward a resolution for this power outage. Users may still be unable to access their instances at this time. In the event you’re unsure as to whether your service is impacted by this incident, please open a support ticket. We’ll get back to you as soon as possible.

https://support.lambdalabs.com/hc/en-us/requests/new

Wed, Mar 26, 2025, 10:57 AM (3 hours earlier)

Identified

Severity Level: [High] We've identified the root cause of the outage as a power-related failure at the datacenter for the us-south-1 region. Users will experience network connectivity issues with instances in that region. We're working with our DCOps team in the region for updates from the datacenter; there is still no ETA for when users can expect to regain access to their instances. In the event you’re unsure as to whether your service is impacted by this incident, please open a support ticket. We’ll get back to you as soon as possible.

https://support.lambdalabs.com/hc/en-us/requests/new

Wed, Mar 26, 2025, 09:53 AM (1 hour earlier)

Identified

Severity Level: [High] Efforts to restore power to the affected devices are still ongoing. Users may still be experiencing network connectivity issues with instances in the us-south-1 region at this time. In the event you’re unsure as to whether your service is impacted by this incident, please open a support ticket. We’ll get back to you as soon as possible.

https://support.lambdalabs.com/hc/en-us/requests/new

Wed, Mar 26, 2025, 07:23 AM (2 hours earlier)

Identified

Severity Level: [High] We've identified the root cause of the outage as a power-related failure in the datacenter for the us-south-1 region. Users will experience network connectivity issues with instances in that region. We're working with our datacenter operations team in the region for updates. In the event you’re unsure as to whether your service is impacted by this incident, please open a support ticket. We’ll get back to you as soon as possible.

https://support.lambdalabs.com/hc/en-us/requests/new

Wed, Mar 26, 2025, 06:06 AM (1 hour earlier)

Investigating

Severity Level: [High] We’re currently investigating a potential networking outage in our us-south-1 region. Users may experience issues connecting to their instances in this region. Our engineers are actively working to identify the root cause of the issue. In the event you’re unsure as to whether your service is impacted by this incident, please open a support ticket. We’ll get back to you as soon as possible.

https://support.lambdalabs.com/hc/en-us/requests/new

Wed, Mar 26, 2025, 05:37 AM (29 minutes earlier)