AWS Outage in US-East-1d Availability Zone Affecting Cloud Instances - RESOLVED

22-Dec-2021 by Inflectra Product News

Our monitoring has detected an outage in Amazon Web Services (AWS) Elastic Compute Cloud (EC2) in the US-East-1d availability zone. This is affecting some of our US cloud customers. We have escalated to AWS support and are awaiting confirmation of the timeline for resolution.

Latest Update

We have finished restarting the remaining EC2 instances, and at this time all Inflectra EC2 services are up and running normally. We apologize for any inconvenience.

Impacted Services

  • Spira and KronoDesk Demo instances hosted for all US trial customers
  • Spira and KronoDesk production instances for certain US customers
  • Data Synchronization services between Spira and other tools such as ADO and Jira

As we learn more from AWS, we will update this article accordingly.

07:35 AM EST Update

[From AWS] We are investigating increased EC2 launch failures and networking connectivity issues for some instances in a single Availability Zone (USE1-AZ4) in the US-EAST-1 Region. Other Availability Zones within the US-EAST-1 Region are not affected by this issue.

08:01 AM EST Update

[From AWS] We can confirm a loss of power within a single data center within a single Availability Zone (USE1-AZ4) in the US-EAST-1 Region. This is affecting availability and connectivity to EC2 instances that are part of the affected data center within the affected Availability Zone. We are also experiencing elevated RunInstances API error rates for launches within the affected Availability Zone. Connectivity and power to other data centers within the affected Availability Zone, or to other Availability Zones within the US-EAST-1 Region, are not affected by this issue, but we would recommend failing away from the affected Availability Zone (USE1-AZ4) if you are able to do so. We continue to work to address the issue and restore power within the affected data center.

08:18 AM EST Update

[From AWS] We continue to make progress in restoring power to the affected data center within the affected Availability Zone (USE1-AZ4) in the US-EAST-1 Region. We have now restored power to the majority of instances and networking devices within the affected data center and are starting to see some early signs of recovery. Customers experiencing connectivity or instance availability issues within the affected Availability Zone should start to see some recovery as power is restored to the affected data center. RunInstances API error rates are returning to normal levels and we are working to recover affected EC2 instances and EBS volumes. While we would expect continued improvement over the coming hour, we would still recommend failing away from the Availability Zone if you are able to do so to mitigate this issue.

We are observing the majority of our affected US EC2 instances returning to normal operations. We will post an update when all services are restored.

08:39 AM EST Update

[From AWS] We have now restored power to all instances and network devices within the affected data center and are seeing recovery for the majority of EC2 instances and EBS volumes within the affected Availability Zone. Network connectivity within the affected Availability Zone has also returned to normal levels. While all services are starting to see meaningful recovery, services which were hosting endpoints within the affected data center - such as single-AZ RDS databases, ElastiCache, etc. - would have seen impact during the event, but are starting to see recovery now. Given the level of recovery, if you have not yet failed away from the affected Availability Zone, you should be starting to see recovery at this stage.

Based on guidance from AWS and our own monitoring, we are seeing the return to normal operation of our affected EC2 instances. We will keep you posted.

09:13 AM EST Update

[From AWS] We have now restored power to all instances and network devices within the affected data center and are seeing recovery for the majority of EC2 instances and EBS volumes within the affected Availability Zone. We continue to make progress in recovering the remaining EC2 instances and EBS volumes within the affected Availability Zone. If you are able to relaunch affected EC2 instances within the affected Availability Zone, that may help to speed up recovery. We have a small number of affected EBS volumes that are still experiencing degraded IO performance that we are working to recover. The majority of AWS services have also recovered, but services which host endpoints within the customer’s VPCs - such as single-AZ RDS databases, ElastiCache, Redshift, etc. - continue to see some impact as we work towards full recovery.

We are continuing to see the restoration of the affected US EC2 instances. At this time there are still 2 EC2 instances that are affected.

09:51 AM EST Update

[From AWS] We have now restored power to all instances and network devices within the affected data center and are seeing recovery for the majority of EC2 instances and EBS volumes within the affected Availability Zone. For the remaining EC2 instances, we are experiencing some network connectivity issues, which is slowing down full recovery. We believe we understand why this is the case and are working on a resolution. Once resolved, we expect to see faster recovery for the remaining EC2 instances and EBS volumes. If you are able to relaunch affected EC2 instances within the affected Availability Zone, that may help to speed up recovery. Note that restarting an instance at this stage will not help as a restart does not change the underlying hardware. We have a small number of affected EBS volumes that are still experiencing degraded IO performance that we are working to recover. The majority of AWS services have also recovered, but services which host endpoints within the customer’s VPCs - such as single-AZ RDS databases, ElastiCache, Redshift, etc. - continue to see some impact as we work towards full recovery.

At this time there are still 2 EC2 instances that are affected. Based on current AWS guidance, we are awaiting their restoration. If we do not see restoration within the next hour, we will consider relaunching the instances from our hourly EBS snapshots.

11:02 AM EST Update

[From AWS] Power continues to be stable within the affected data center within the affected Availability Zone (USE1-AZ4) in the US-EAST-1 Region. We have been working to resolve the connectivity issues that the remaining EC2 instances and EBS volumes are experiencing in the affected data center, which is part of a single Availability Zone (USE1-AZ4) in the US-EAST-1 Region. We have addressed the connectivity issue for the affected EBS volumes, which are now starting to see further recovery. We continue to work on mitigating the networking impact for EC2 instances within the affected data center, and expect to see further recovery there starting in the next 30 minutes. Since the EC2 APIs have been healthy for some time within the affected Availability Zone, the fastest path to recovery now would be to relaunch affected EC2 instances within the affected Availability Zone or other Availability Zones within the region.

At this time there is a single Inflectra EC2 instance that is affected. We are checking with AWS to get a concrete timeline; failing that, we will relaunch the instance from its EBS snapshot in a different availability zone.

11:02 AM EST Update

We have finished restarting the remaining EC2 instances, and at this time all Inflectra EC2 services are up and running normally. We apologize for any inconvenience.