US-EAST-3 Intermittent network connectivity
Identified · Partial outage

Severity Level: [High]

Our engineers will begin proactive reboots tomorrow, Wednesday, May 14, starting at 11:00 AM EDT. We cannot yet estimate how long these reboots will take, but we will provide an update here once they are complete.

If you’re unsure whether your service is impacted by this incident, please open a support ticket. We’ll respond as soon as possible.

To be automatically notified when the status of this incident changes, please click the “Subscribe to updates” button.

Thank you for your patience and understanding during this time.

Tue, May 13, 2025, 07:24 PM
Updates


Identified

Severity Level: [High]  

Our engineers continue to work with the vendor to obtain the fix; otherwise, we’re in a holding pattern. More information will be provided as it becomes available.

If you’re unsure whether your service is impacted by this incident, please open a support ticket. We’ll respond as soon as possible.

To be automatically notified when the status of this incident changes, please click the “Subscribe to updates” button.

Thank you for your patience and understanding during this time.

Fri, May 9, 2025, 10:19 PM (3 days earlier)

Identified

Severity Level: [High]  

Our engineers have placed the most at-risk hosts into maintenance, which prevents those hosts from running newly launched instances. As a result, instance launches will be re-enabled in our Washington, DC region (us-east-3) momentarily.

That said, the incident is ongoing. We’re still working with the vendor to obtain and apply a fix; until it is applied, unexpected reboots are likely to continue.

If you’re unsure whether your service is impacted by this incident, please open a support ticket. We’ll respond as soon as possible.

To be automatically notified when the status of this incident changes, please click the “Subscribe to updates” button.

Thank you for your patience and understanding during this time.

Thu, May 8, 2025, 09:44 PM (1 day earlier)

Identified

Severity Level: [High]  

We continue to encounter the previously mentioned issue across hosts within our Washington, DC region (us-east-3). When a host encounters the issue, we must reboot it to restore connectivity.

Our engineers continue to work with the vendor on implementing a fix. Until the fix is implemented, more reboots are expected.

If you’re unsure whether your service is impacted by this incident, please open a support ticket. We’ll respond as soon as possible.

To be automatically notified when the status of this incident changes, please click the “Subscribe to updates” button.

Thank you for your patience and understanding during this time.

Wed, May 7, 2025, 10:08 PM (23 hours earlier)

Identified

Severity Level: [High]  

Launches in our Washington, DC region (us-east-3) have been disabled. Users will not be able to launch new instances at this time.

If you’re unsure whether your service is impacted by this incident, please open a support ticket. We’ll respond as soon as possible.

To be automatically notified when the status of this incident changes, please click the “Subscribe to updates” button.

Thank you for your patience and understanding during this time.

Wed, May 7, 2025, 04:44 PM (5 hours earlier)

Identified

Severity Level: [High]

UPDATE: We are seeing the same node-crash behavior, now in a new area of our data hall in this region. Users may experience intermittent downtime as we reboot affected nodes.

If you’re unsure whether your service is impacted by this incident, please open a support ticket. We’ll respond as soon as possible.

To be automatically notified when the status of this incident changes, please click the “Subscribe to updates” button.

Thank you for your patience and understanding during this time.

Wed, May 7, 2025, 04:15 PM (28 minutes earlier)

Identified

Severity Level: [High]  

Our engineers are awaiting a fix from the vendor. The frequency of communications will be reduced going forward, but updates will still be provided as they become available.

If you’re unsure whether your service is impacted by this incident, please open a support ticket. We’ll respond as soon as possible.

To be automatically notified when the status of this incident changes, please click the “Subscribe to updates” button.

Thank you for your patience and understanding during this time.

Tue, May 6, 2025, 06:43 PM (21 hours earlier)

Identified

Severity Level: [High]

Connectivity has been restored to the hosts connected to the pair of switches previously noted. That said, our engineers are still working on a permanent fix.

This incident is currently in a holding pattern. More information will be provided as it becomes available.

If you’re unsure whether your service is impacted by this incident, please open a support ticket. We’ll respond as soon as possible.

To be automatically notified when the status of this incident changes, please click the “Subscribe to updates” button.

Thank you for your patience and understanding during this time.

Tue, May 6, 2025, 12:32 AM (18 hours earlier)

Identified

Severity Level: [High]  

Our engineers have narrowed the issue down to hosts connected to a pair of switches in the region and are actively working toward remediation. Instances running on these hosts may still see intermittent network connectivity.

If you’re unsure whether your service is impacted by this incident, please open a support ticket. We’ll respond as soon as possible.

To be automatically notified when the status of this incident changes, please click the “Subscribe to updates” button.

Thank you for your patience and understanding during this time.

Mon, May 5, 2025, 09:53 PM (2 hours earlier)

Identified

Severity Level: [High]  

Our engineers are currently working diligently to resolve interface errors seen within the Washington, DC region (us-east-3).

If you’re unsure whether your service is impacted by this incident, please open a support ticket. We’ll respond as soon as possible.

To be automatically notified when the status of this incident changes, please click the “Subscribe to updates” button.

Thank you for your patience and understanding during this time.

Mon, May 5, 2025, 06:18 PM (3 hours earlier)

Identified

Severity Level: [High]  

We're still seeing intermittent network connectivity within our Washington, DC region (us-east-3). Our engineers have identified the root cause of the issue and they’re actively working on remediation.

If you’re unsure whether your service is impacted by this incident, please open a support ticket. We’ll respond as soon as possible.

To be automatically notified when the status of this incident changes, please click the “Subscribe to updates” button.

Thank you for your patience and understanding during this time.

Mon, May 5, 2025, 04:55 PM (1 hour earlier)

Monitoring

Severity Level: [High]

UPDATE: The rolling reboots have completed, and we are now monitoring the situation. The earlier incident details follow.

We’ve identified the cause as related to an existing issue the vendor is aware of; however, the patch for it won’t be available anytime soon. We’ve found that the nodes we’ve power cycled have not crashed again. So, to prevent longer hard downtime for the nodes that will eventually be affected, we’re opting to perform a rolling power cycle of the remaining hosts that would otherwise crash. Once the vendor has provided the permanent fix, we’ll schedule another maintenance window, as applying it will require the hosts to be power cycled again.

If you’re unsure whether your service is impacted by this incident, please open a support ticket. We’ll respond as soon as possible.
• https://support.lambdalabs.com/hc/en-us/requests/new

To be automatically notified when the status of this incident changes, please click the “Subscribe to updates” button.

Thank you for your patience and understanding during this time.

Sat, May 3, 2025, 03:17 AM (2 days earlier)

Identified

Severity Level: [High]

UPDATE: The rolling reboots have begun, but we’ll need more time to complete them, so we’re extending the end of the power-cycle window to 10 PM EST. The earlier incident details follow.

We’ve identified the cause as related to an existing issue the vendor is aware of; however, the patch for it won’t be available anytime soon. We’ve found that the nodes we’ve power cycled have not crashed again. So, to prevent longer hard downtime for the nodes that will eventually be affected, we’re opting to perform a rolling power cycle of the remaining hosts that would otherwise crash. Once the vendor has provided the permanent fix, we’ll schedule another maintenance window, as applying it will require the hosts to be power cycled again.

If you’re unsure whether your service is impacted by this incident, please open a support ticket. We’ll respond as soon as possible.
• https://support.lambdalabs.com/hc/en-us/requests/new

To be automatically notified when the status of this incident changes, please click the “Subscribe to updates” button.

Thank you for your patience and understanding during this time.

Sat, May 3, 2025, 12:08 AM (3 hours earlier)

Identified

Severity Level: [High]

We’ve identified the cause as related to an existing issue the vendor is aware of; however, the patch for it won’t be available anytime soon. We’ve found that the nodes we’ve power cycled have not crashed again. So, to prevent longer hard downtime for the nodes that will eventually be affected, we’re opting to perform a rolling power cycle of the remaining hosts that would otherwise crash.

The rolling reboots will begin in ~30 minutes at 6 PM EST, ending on or before 8 PM EST. During this time, users may experience intermittent downtime as nodes reboot, with downtime lasting 15 minutes or less. If your instance has already experienced downtime, you should not experience further downtime.

Once the vendor has provided the permanent fix, we’ll schedule another maintenance window, as applying it will require the hosts to be power cycled again.

If you’re unsure whether your service is impacted by this incident, please open a support ticket. We’ll respond as soon as possible.
• https://support.lambdalabs.com/hc/en-us/requests/new

To be automatically notified when the status of this incident changes, please click the “Subscribe to updates” button.

Thank you for your patience and understanding during this time.

Fri, May 2, 2025, 10:56 PM (1 hour earlier)

Identified

Severity Level: [High]

We’ve identified the cause as related to an existing issue the vendor is aware of; however, the patch for it won’t be available anytime soon. We’ve found that the nodes we’ve power cycled have not crashed again. So, to prevent longer hard downtime for the nodes that will eventually be affected, we’re opting to perform a rolling power cycle of the remaining hosts that would otherwise crash.

The rolling reboots will begin in ~30 minutes at 6 PM EST, ending on or before 8 PM EST. During this time, users may experience intermittent downtime as nodes reboot, with downtime lasting 15 minutes or less. If your instance has already experienced downtime, you should not experience further downtime.

Once the vendor has provided the permanent fix, we’ll schedule another maintenance window, as applying it will require the hosts to be power cycled again.

If you’re unsure whether your service is impacted by this incident, please open a support ticket. We’ll respond as soon as possible.

To be automatically notified when the status of this incident changes, please click the “Subscribe to updates” button.

Thank you for your patience and understanding during this time.

Fri, May 2, 2025, 09:35 PM (1 hour earlier)

Identified

Severity Level: [High]

We’ve identified the cause as related to an existing issue the vendor is aware of; however, the patch for it won’t be available anytime soon. We’ve found that the nodes we’ve power cycled have not crashed again. So, to prevent longer hard downtime for the nodes that will eventually be affected, we’re opting to perform a rolling power cycle of the remaining hosts that would otherwise crash.

The rolling reboots will begin in ~30 minutes at 4 PM EST, ending on or before 6 PM EST. During this time, users may experience intermittent downtime as nodes reboot, with downtime lasting 15 minutes or less. If your instance has already experienced downtime, you should not experience further downtime.

Once the vendor has provided the permanent fix, we’ll schedule another maintenance window, as applying it will require the hosts to be power cycled again.

If you’re unsure whether your service is impacted by this incident, please open a support ticket. We’ll respond as soon as possible.

To be automatically notified when the status of this incident changes, please click the “Subscribe to updates” button.

Thank you for your patience and understanding during this time.

Fri, May 2, 2025, 09:33 PM

Investigating

Severity Level: [High]

We’re currently investigating a potential networking issue in our US-EAST-3 region, which solely serves our GH200 instances. Users may experience intermittent network outages; these occur when a component on the host panics. Our engineers are actively working to identify the root cause and are working with the component’s vendor on remediation or workarounds.

If you’re unsure whether your service is impacted by this incident, please open a support ticket. We’ll respond as soon as possible.

To be automatically notified when the status of this incident changes, please click the “Subscribe to updates” button.

Thank you for your patience and understanding during this time.

Fri, May 2, 2025, 04:17 PM (5 hours earlier)