Issues loading MapServices

Incident Report for HBK Where

Postmortem

On March 23rd, 2021, at approximately 2:00pm Central Time, we noticed the /mapservices endpoint suddenly refusing to function. This issue appeared after I (Brett) had been doing some administration work in the admin interface to debug a user permissions issue.

Upon investigating the logs, I noticed a “rate exceeded” error coming from AWS Parameter Store, which is used to securely host configuration information for the application. I tried several ways of restarting the function, to no avail. I also confirmed that enhanced throughput was configured in the Parameter Store, which it was.

Most likely, this was caused by spinning up too many fresh Lambda function invocations concurrently during my work in the admin panel (although it could also have been caused by a user). Every time a function boots, it needs to connect to the Parameter Store, so booting too many at once could overwhelm the Parameter Store, causing “rate exceeded” errors.
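To make the reasoning above concrete, here is a rough back-of-the-envelope sketch of how concurrent cold starts translate into Parameter Store request rates. Every number in it (cold start count, calls per boot, time window, rate limit) is a hypothetical placeholder, not a measurement from this incident or an actual AWS quota.

```python
def parameter_store_request_rate(cold_starts: int, calls_per_boot: int,
                                 window_seconds: float) -> float:
    """Average Parameter Store requests per second generated by cold starts.

    Assumes each cold start makes `calls_per_boot` requests and that all
    cold starts happen within `window_seconds`.
    """
    return cold_starts * calls_per_boot / window_seconds


def exceeds_limit(cold_starts: int, calls_per_boot: int,
                  window_seconds: float, limit_per_second: float) -> bool:
    """True if the estimated request rate is above the (assumed) rate limit."""
    rate = parameter_store_request_rate(cold_starts, calls_per_boot, window_seconds)
    return rate > limit_per_second


# Hypothetical figures: 50 cold starts, 20 calls each, all within 5 seconds.
rate = parameter_store_request_rate(50, 20, 5)
print(rate)                             # 200.0 requests per second
print(exceeds_limit(50, 20, 5, 100))    # True for an assumed limit of 100/s
```

The point of the estimate is that the rate scales with both the number of simultaneous boots and the number of calls each boot makes, so reducing either one lowers the pressure on the Parameter Store.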

I discovered the “throttle” button in the AWS Lambda interface and clicked it, assuming that reducing the load on the Parameter Store would let the rate limits reset and the endpoint recover. I then gave the system about an hour to recover on its own while writing a hotfix in case that did not work. Unfortunately, I did not realize that the throttle button throttles traffic on the endpoint to absolute zero; I should have instead lowered the concurrency while the rate limits recovered. I only noticed this because I was unable to deploy the hotfix while the throttle was in place.

Several things need to be done to prevent this situation in the future:

create an alert on “rate exceeded” errors https://github.com/orgs/HBKEngineering/projects/7#card-57622756 https://github.com/orgs/HBKEngineering/projects/7#card-57622698
add synthetic tests to proactively monitor the mapservices endpoint (and the others, for that matter) https://github.com/orgs/HBKEngineering/projects/7#card-57622869
clean up the configuration library - simplify and reduce the number of calls to the Parameter Store needed to fresh boot endpoints https://github.com/orgs/HBKEngineering/projects/7#card-57623033
We could also be throttling or managing API requests on the client-side better, but this would require some significant changes to the frontend https://github.com/orgs/HBKEngineering/projects/7#card-57623359
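As an illustration of the configuration library item above, the following is a minimal sketch of one way to cut down Parameter Store calls per boot: fetch all parameters under a common path in a few paginated requests and cache them for the lifetime of the warm container. This is not the existing configuration library. The `load_config` name, the module-level cache, and the `/app` path used in the usage note are assumptions for illustration. The client is expected to behave like boto3's SSM client, whose `get_parameters_by_path` call is used here.

```python
from typing import Any, Dict, Optional

# Module-level cache: survives across invocations in the same warm container,
# so only a cold start pays for the Parameter Store requests.
_CONFIG_CACHE: Optional[Dict[str, str]] = None


def load_config(ssm_client: Any, path: str) -> Dict[str, str]:
    """Fetch every parameter under `path` once, then serve it from memory."""
    global _CONFIG_CACHE
    if _CONFIG_CACHE is not None:
        return _CONFIG_CACHE

    config: Dict[str, str] = {}
    kwargs = {"Path": path, "Recursive": True, "WithDecryption": True}
    while True:
        page = ssm_client.get_parameters_by_path(**kwargs)
        for param in page["Parameters"]:
            config[param["Name"]] = param["Value"]
        token = page.get("NextToken")
        if not token:
            break
        kwargs["NextToken"] = token  # request the next page of results

    _CONFIG_CACHE = config
    return config
```

In a Lambda handler this might be called as `load_config(boto3.client("ssm"), "/app")`, where `/app` is a hypothetical parameter path. Passing the client in as an argument keeps the function easy to exercise with a stub.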

Should I run into this issue in the future, the solution is not to throttle requests but to reduce concurrency to a more manageable level. It may make sense to keep maximum concurrency under a certain level, but it’s unclear what that number should be and what the effects of setting it too low might be.
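For reference, lowering concurrency instead of throttling could look roughly like the sketch below. It assumes a boto3 Lambda client and uses its `put_function_concurrency` and `delete_function_concurrency` calls. The helper names, the function name `mapservices`, and the limit of 5 are hypothetical, and as noted above the right limit is still unknown.

```python
from typing import Any


def cap_concurrency(lambda_client: Any, function_name: str, limit: int) -> int:
    """Set reserved concurrency to a small positive number.

    A limit of 0 is rejected because it would block every invocation,
    which is the same effect the throttle button had.
    """
    if limit < 1:
        raise ValueError("limit must be at least 1; 0 blocks all invocations")
    response = lambda_client.put_function_concurrency(
        FunctionName=function_name,
        ReservedConcurrentExecutions=limit,
    )
    return response["ReservedConcurrentExecutions"]


def remove_cap(lambda_client: Any, function_name: str) -> None:
    """Remove the reserved concurrency setting once the rate limits recover."""
    lambda_client.delete_function_concurrency(FunctionName=function_name)
```

A call such as `cap_concurrency(boto3.client("lambda"), "mapservices", 5)` would keep the endpoint serving a reduced amount of traffic while the Parameter Store rate limits recover, and `remove_cap` would lift the restriction afterwards.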

Posted Mar 24, 2021 - 12:08 CDT

Resolved

This incident has been resolved.

Posted Mar 24, 2021 - 11:45 CDT

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Mar 23, 2021 - 16:38 CDT

Investigating

We are noticing issues with the backend failing to properly connect to the database. These issues appear to originate with our hosting provider, AWS. We are investigating a temporary fix.

Posted Mar 23, 2021 - 14:35 CDT