Airtasker outage incident report

By Airtasker

Updated: September 23rd, 2021

On 2nd April 2017 we experienced a 4-hour outage that affected all members. This incident report outlines what caused the outage, the steps we took to get Airtasker back up, and the actions we’re taking to help prevent a similar outage from recurring.

We apologise to everyone affected by this outage.

Summary

On 2nd April 2017, we experienced a 4-hour outage starting at 07:30 UTC. It took 28 minutes to identify the cause within our CloudFormation templates and 5 minutes to make the change we thought was required. It then took a further 180 minutes to identify why that fix was not being applied, 10 minutes to apply the correct fix, and 15 minutes to deploy it, allowing Airtasker to come back online.

The direct cause was a package install that began failing when a new server was being initialised into our load balancer. Specifically, the `apt-get install newrelic-infra -y` call was failing. This halted the creation of each new server. As our load balancer attempted to rotate instances, it was unable to spin up new ones, which eventually led to a situation where no instances were left. At that point (07:30 UTC) we went down. It also stopped us from simply creating and adding instances manually, as they were unable to complete their initialisation.
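
To illustrate the failure mode, here is a minimal sketch of the kind of bootstrap (user data) script that runs when a new instance is initialised. The exact script, flags and ordering are assumptions, but it shows how a single failing package install can halt the whole initialisation when the script stops on the first error.

```bash
#!/bin/bash
# Minimal sketch of an instance bootstrap (user data) script -- not our
# exact script; steps and flags are illustrative assumptions.
set -euo pipefail   # stop on the first failing command

apt-get update -y

# This is the call that began failing. Because the script stops on error,
# nothing after this line runs and the instance never finishes initialising.
apt-get install newrelic-infra -y

# Later bootstrap steps (app deploy, health checks, etc.) never ran once
# the install above started failing.
```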

Timeline

  • 04:48 UTC: A New Relic synthetic test failed
  • 04:50 UTC: The New Relic synthetic test passed
  • 06:30 UTC: Error reports came in via social media
  • 07:18 UTC: Tim notified the developer channel of errors
  • 07:20 UTC: The DevOps team began investigating
  • 07:36 UTC: Airtasker went completely offline as the final server dropped out of the load balancer and no new servers could be added
  • 07:47 UTC: An attempt was made to roll back to a known good AMI
  • 08:01 UTC: The new AMI was also failing
  • 08:22 UTC: The root cause was identified as the New Relic Infrastructure package install during server initialisation
  • 08:25 UTC: A DevOps engineer got on the phone with Rackspace support engineers
  • 08:38 UTC: Airtasker and Rackspace DevOps identified the root cause as a CloudFormation template. Changes were made and a new deployment was started
  • 08:56 UTC: The new deploy failed again
  • 09:22 UTC: Plan B, mirroring the newrelic-infra package in our own S3 bucket, was initiated by 2 additional Airtasker engineers (a sketch of this approach follows the timeline)
  • 09:26 UTC: The same core issue appeared, revealing that the change made to the CloudFormation template was not being picked up somewhere
  • 09:55 UTC: A new AMI was created from scratch to see if that would pull the updated CloudFormation template (stored within Airtasker S3)
  • 10:26 UTC: Plan B was shown to not be possible, as we could not obtain the package independently (as a .deb file) to upload to S3
  • 10:40 UTC: Discussion of how the templates are generated, from the base template through to the Utility, Worker, API and Front End templates
  • 10:48 UTC: An Airtasker engineer was again on the phone with Rackspace support
  • 11:00 UTC: The root cause was identified: the order of the CloudFormation templates was understood and the correct template was updated
  • 11:04 UTC: Servers started to come back online as the load balancer spun up new boxes to match the load
  • 11:06 UTC: RDS DB connections began to return to standard levels
  • 11:08 UTC: Airtasker was back online
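
For context, the Plan B started at 09:22 UTC would have looked roughly like the sketch below: host a copy of the .deb in our own S3 bucket and install it from there, bypassing the external package repository. Bucket and file names are placeholders, and the attempt stalled because we could not obtain the standalone .deb file to upload in the first place.

```bash
# Rough sketch of the Plan B approach (placeholder bucket and file names).

# One-off: upload a locally obtained copy of the package to our own S3 bucket.
aws s3 cp newrelic-infra.deb s3://example-airtasker-packages/newrelic-infra.deb

# In the bootstrap script: fetch the mirrored package and install it with
# dpkg instead of pulling it from the external apt repository.
aws s3 cp s3://example-airtasker-packages/newrelic-infra.deb /tmp/newrelic-infra.deb
dpkg -i /tmp/newrelic-infra.deb || apt-get install -f -y
```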

Technical details

Each time our load balancer adds a new EC2 instance to the stack or replaces an existing EC2 instance, it uses CloudFormation templates. One of the templates called for the New Relic Infrastructure package to be installed. This install started failing sometime after 04:00 UTC, which set off a chain reaction: over the course of a few hours all of our instances disappeared (for various standard operational reasons) and could not be replaced.
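
One common way this plays out (described here as an assumption about our setup, not a confirmed detail) is that the bootstrap script signals CloudFormation whether initialisation succeeded; a failed install reports failure, so the new instance is rolled back and never added behind the load balancer.

```bash
#!/bin/bash
# Hypothetical tail of a bootstrap script using the cfn-signal helper, a
# common CloudFormation pattern; stack/resource names and region variables
# are placeholders and our real templates may differ.

if apt-get install newrelic-infra -y; then
  # Report success so CloudFormation keeps the instance and brings it into service.
  /opt/aws/bin/cfn-signal -e 0 --stack "$STACK_NAME" \
      --resource AppAutoScalingGroup --region "$AWS_REGION"
else
  # Report failure: the instance is treated as failed and never enters service.
  /opt/aws/bin/cfn-signal -e 1 --stack "$STACK_NAME" \
      --resource AppAutoScalingGroup --region "$AWS_REGION"
  exit 1
fi
```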

When our engineers identified the issue, they attempted to remove the offending call from within the CloudFormation template. Unfortunately, our naming conventions led to confusion about exactly which template needed to be updated. This confusion was the primary reason for the length of the outage.

Outcomes

Our AMI needs to be immutable and free of external dependencies. Each time a new instance is stood up, there should be no external requirement that can fail (as happened here).
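
As an illustration of what "no external dependencies at launch" could look like (tooling and file names are assumptions, not our actual pipeline), dependencies such as the monitoring agent would be installed once when the AMI is baked in CI, for example via an image-build tool like Packer, so a freshly launched instance needs no external downloads.

```bash
#!/bin/bash
# Hypothetical AMI "bake" script, run once at image-build time in CI.
# Tooling and package list are illustrative assumptions.
set -euo pipefail

apt-get update -y

# Install the monitoring agent (and any other dependencies) into the image
# itself, so instances launched from this AMI make no external package
# calls while being added to the load balancer.
apt-get install newrelic-infra -y
```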

We also need to rename our CloudFormation templates and better document exactly what each one does. Had their names reflected their roles, we would have resolved the outage in approximately 1.5 hours rather than 4.

A dedicated DevOps task group will plan an approach to hardening our deployment and operational activities to root out any additional external dependencies. This same group will also refactor our templates to ensure their simplicity and robustness moving forward.

Conclusion

We had wrongly assumed that our instances were immutable once their AMI was built in our Continuous Integration stage. This was not the case. We had also made assumptions about which CloudFormation template did what. Both of these assumptions led to the outage. Neither the outage nor its length is acceptable.

We are committed to improving our technology and operational processes to prevent future outages. We appreciate your patience and we apologise for any inconvenience. Thank you for your continued support.
