We recently experienced a major outage, which affected all members. This incident report outlines what caused the outage, the steps we took to get Airtasker back up and the actions we’re taking to help prevent a similar outage from recurring.
We apologise to everyone affected by this outage.
On 2nd April 2017, we experienced a 4-hour outage starting at 07:30 UTC. It took 28 minutes to identify the cause within our Cloud Formation templates and 5 minutes to make the change we thought was required. It then took a further 180 minutes to identify the error that was stopping that fix from being applied, 10 minutes to apply the correct fix and 15 minutes to deploy it, bringing Airtasker back online.
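As a quick check, the phases listed above can be added up. This is plain arithmetic on the figures in this report; the phase labels are paraphrased for illustration only.

```python
# Phase durations in minutes, taken from the summary above.
# The labels are paraphrased and are not official phase names.
phases = {
    "identify suspected cause in templates": 28,
    "make the first change": 5,
    "find why the first fix was not applied": 180,
    "apply the correct fix": 10,
    "deploy the fix": 15,
}

total_minutes = sum(phases.values())
hours, minutes = divmod(total_minutes, 60)
print(f"Total: {total_minutes} minutes ({hours} h {minutes} min)")

# Identify which phase dominated the outage.
longest = max(phases, key=phases.get)
share = phases[longest] / total_minutes
print(f"Longest phase: {longest} ({share:.0%} of total)")
```

The phases sum to 238 minutes, just under the four hours quoted, and roughly three quarters of that time went on working out why the first fix was not being applied.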
The direct cause was a package install that began failing when a new server was being initialised into our load balancer. Specifically, the `apt-get install newrelic-infra -y` call was failing. This halted the creation of each new server. As our load balancer attempted to rotate instances, it was unable to spin up new ones. This eventually led to a situation where there were no instances left. At this point (07:30 UTC) we went down. The same failure also stopped us from simply creating and adding instances manually, as they were unable to complete initialisation.
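The failure mode can be sketched as follows. This is an illustrative model rather than the actual initialisation scripts: the step names, the `run_bootstrap` helper and the stand-in commands are all hypothetical. It shows how one failing step in a strictly sequential bootstrap prevents every later step, including the one that would put the server into service.

```python
import subprocess
import sys


def run_bootstrap(steps):
    """Run (name, command) steps in order and stop at the first failure.

    Returns (succeeded, failed_step_name).
    """
    for name, command in steps:
        result = subprocess.run(command, capture_output=True)
        if result.returncode != 0:
            return False, name  # later steps never run
    return True, None


# Stand-in commands: the Python interpreter exiting with a chosen status.
ok = [sys.executable, "-c", "raise SystemExit(0)"]
fail = [sys.executable, "-c", "raise SystemExit(100)"]

# Hypothetical bootstrap sequence; only the ordering matters here.
steps = [
    ("configure application", ok),
    ("install monitoring agent", fail),
    ("register with load balancer", ok),
]

print(run_bootstrap(steps))
```

In this model the server never reaches the registration step, which mirrors what we saw: every new instance stalled at the same point, whether it was launched automatically or by hand.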
- 04:48UTC: A New Relic synthetic test failed
- 04:50UTC: The New Relic synthetic test passed
- 06:30UTC: Error reports via social media
- 07:18UTC: Tim notified the developer channel of errors
- 07:20UTC: DevOps team began investigation
- 07:36UTC: Airtasker completely offline as the final server dropped off the load balancer and no new servers could be added
- 07:47UTC: An attempt was made to roll back to a known good AMI
- 08:01UTC: The new AMI was also failing
- 08:22UTC: Root cause was identified as the New Relic Infra package install somewhere in the server initialisation
- 08:25UTC: DevOps engineer on phone with Rackspace support engineers
- 08:38UTC: Airtasker and Rackspace DevOps identified the root cause as being a Cloud Formation template. The template was changed and a new deployment was started
- 08:56UTC: New deploy failed again
- 09:22UTC: Two additional Airtasker engineers started Plan B: creating a mirror of the New Relic Infra package on our own S3 storage
- 09:26UTC: The same core issue was identified again. This revealed that the change made to the Cloud Formation template was not being picked up
- 09:55UTC: New AMI from scratch was created to see if that would pull the updated Cloud Formation template (stored within Airtasker S3)
- 10:26UTC: Plan B was ruled out, as we could not obtain the package as a standalone deb file to upload to S3
- 10:40UTC: Discussion of how the templates are generated, from the base through to the Utility, Worker, API and Front End templates
- 10:48UTC: Airtasker engineer again on the phone with Rackspace support
- 11:00UTC: Root cause identified. Once the order of the Cloud Formation templates was understood, the correct template was updated
- 11:04UTC: Servers started to come back online as the load balancer spun up new boxes to match the load
- 11:06UTC: RDS DB Connections began to match standard levels
- 11:08UTC: Airtasker was back online
Each time our load balancer adds a new EC2 instance to the stack or replaces an existing one, it uses Cloud Formation templates. One of the templates called for the New Relic Infrastructure package to be installed. This install started failing sometime after 04:00UTC. This set off a chain reaction in which all of our instances disappeared (for various standard operational reasons) over the course of a few hours and could not be replaced.
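The chain reaction can be pictured with a toy simulation. The fleet size, the rate at which instances are retired and the `simulate_fleet` helper are hypothetical values chosen for illustration, not measurements from our stack. The only point is that routine instance turnover is harmless while replacements succeed and drains the fleet once they stop.

```python
def simulate_fleet(initial_instances, terminations_per_hour, launch_succeeds, hours):
    """Return the number of in-service instances at the end of each hour."""
    in_service = initial_instances
    history = []
    for _ in range(hours):
        # Routine turnover: some instances are retired every hour.
        terminated = min(terminations_per_hour, in_service)
        in_service -= terminated
        if launch_succeeds:
            in_service += terminated  # each retired instance is replaced
        history.append(in_service)
    return history


# Hypothetical fleet of 6 instances losing 2 per hour.
print(simulate_fleet(6, 2, launch_succeeds=True, hours=4))   # [6, 6, 6, 6]
print(simulate_fleet(6, 2, launch_succeeds=False, hours=4))  # [4, 2, 0, 0]
```

With working launches the fleet stays at full size. With failing launches the same turnover empties it within a few hours, which matches the gap between the install first failing and the site going offline.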
When our engineers identified the issue, they attempted to remove the offending call from the Cloud Formation template. Unfortunately, our naming conventions led to confusion as to exactly which template needed to be updated. This confusion was the primary reason for the length of the outage.
Our AMI needs to be both immutable and not have external dependencies. Each time a new instance is stood up, there should be no external requirements that can fail (as happened).
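One way to start rooting out boot-time dependencies is to scan initialisation scripts for commands that reach outside the instance. The sketch below is a minimal example under stated assumptions: the pattern list is hypothetical and far from exhaustive, `find_external_dependencies` is an illustrative helper, and the sample script is invented apart from the install command quoted in this report.

```python
import re

# Hypothetical, non-exhaustive patterns for commands that need the network.
EXTERNAL_PATTERNS = [
    r"\bapt-get\s+(install|update)\b",
    r"\bcurl\b",
    r"\bwget\b",
    r"\bpip3?\s+install\b",
]


def find_external_dependencies(script_text):
    """Return (line_number, line) pairs for lines matching any pattern."""
    findings = []
    for number, line in enumerate(script_text.splitlines(), start=1):
        stripped = line.strip()
        # Skip blank lines and comments (including the shebang).
        if not stripped or stripped.startswith("#"):
            continue
        if any(re.search(pattern, stripped) for pattern in EXTERNAL_PATTERNS):
            findings.append((number, stripped))
    return findings


# Invented boot script containing the failing call from this incident.
boot_script = """\
#!/bin/bash
# hypothetical boot-time script
apt-get install newrelic-infra -y
systemctl start app
"""

print(find_external_dependencies(boot_script))
```

Anything such a scan flags is a candidate for moving into the AMI build, so that a failure breaks the build rather than a live instance launch.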
We also need to rename and better document our Cloud Formation templates. Had their names reflected what each template does, we would have resolved the outage in approximately 1.5 hours rather than 4 hours.
A dedicated DevOps task group will plan an approach to hardening our deployment and operational activities to root out any additional external dependencies. This same group will also refactor our templates to ensure their simplicity and robustness moving forward.
We had wrongly assumed that our instances were immutable once the AMI had been built in our Continuous Integration stage. This was not the case. We had also made assumptions about which Cloud Formation template did what. Both of these led to the outage. Neither the outage nor its length is acceptable.
We are committed to improving our technology and operational processes to prevent future outages. We appreciate your patience and we apologise for any inconvenience. Thank you for your continued support.