Issues loading MapServices
Incident Report for HBK Where
Postmortem

On March 23rd, 2021, at approximately 2:00 pm Central Time, we noticed that the /mapservices endpoint had suddenly stopped functioning. The issue appeared after I (Brett) had been doing some administration work in the admin interface to debug a user permissions issue.

Upon investigating the logs, I noticed a “rate exceeded” error coming from AWS Parameter Store, which is used to securely host configuration information for the application. I tried a variety of hacks to restart the function, to no avail. I also confirmed that the proper enhanced throughput setting was configured in the Parameter Store, which it was.

Most likely, this was caused by spinning up too many fresh Lambda function invocations concurrently during my work in the admin panel (although it could also have been caused by a user). Every time a function boots, it needs to connect to the Parameter Store, so booting too many at once can overwhelm the Parameter Store and cause “rate exceeded” errors.
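One way to soften this failure mode is to fetch each parameter at most once per function container and to back off when the Parameter Store reports throttling. The following is a minimal sketch, not the application's actual code: `get_parameter_cached` and the `fetch` callable are hypothetical names, and matching on the “rate exceeded” message text is an assumption based on the error seen in the logs.

```python
import random
import time

# Module-level cache: survives across invocations that reuse the same container.
_cache = {}


def get_parameter_cached(fetch, name, max_attempts=5, base_delay=0.2, sleep=time.sleep):
    """Return a parameter value, calling `fetch` at most once per container.

    `fetch` is a hypothetical callable taking a parameter name and returning
    its value. With boto3 it could wrap, for example:
        ssm.get_parameter(Name=name, WithDecryption=True)["Parameter"]["Value"]
    """
    if name in _cache:
        return _cache[name]

    for attempt in range(max_attempts):
        try:
            value = fetch(name)
            break
        except Exception as exc:
            # Assumption: throttling is recognisable by its message text.
            throttled = "rate exceeded" in str(exc).lower()
            if not throttled or attempt == max_attempts - 1:
                raise
            # Exponential backoff with full jitter, so that many containers
            # booting at once do not all retry at the same moment.
            sleep(random.uniform(0, base_delay * 2 ** attempt))

    _cache[name] = value
    return value
```

The jitter matters here because the suspected trigger was many functions booting at the same time; retries on a fixed schedule would simply hit the rate limit again together.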

I discovered the “throttle” button in the AWS Lambda interface and clicked it, assuming that reducing the load on the Parameter Store would cause the rate limits to reset and the endpoint to recover. I then gave the system about an hour to recover on its own while I wrote a hotfix in case that did not work. Unfortunately, I did not realize that the throttle button throttled traffic on the endpoint to absolute zero; I should instead have lowered the concurrency while the rate limits recovered. I only noticed this because I was unable to deploy the hotfix while the throttle was in place.

There are a variety of things that need to be done to prevent this situation in the future:

Should I run into this issue in the future, the solution is not to throttle requests but to reduce concurrency to a more manageable level. It may make sense to keep maximum concurrency under a certain level, but it’s unclear what that number should be and what the effects of setting it too low might be.
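For reference, reducing concurrency without cutting traffic to zero could look roughly like the sketch below. It assumes a boto3 Lambda client is passed in; `limit_concurrency` and `remove_concurrency_limit` are hypothetical helper names, and the function name and limit in the usage comment are placeholders, not our real values. The guard against zero is there because a reserved concurrency of 0 is what the throttle button applies.

```python
def limit_concurrency(lambda_client, function_name, limit):
    """Cap a function's reserved concurrency at `limit` without blocking it."""
    if limit < 1:
        # A limit of 0 rejects every invocation, which is what the
        # console's throttle button does.
        raise ValueError("limit must be at least 1; 0 blocks all invocations")
    return lambda_client.put_function_concurrency(
        FunctionName=function_name,
        ReservedConcurrentExecutions=limit,
    )


def remove_concurrency_limit(lambda_client, function_name):
    """Remove the cap once the Parameter Store rate limits have recovered."""
    return lambda_client.delete_function_concurrency(FunctionName=function_name)


# Example usage (placeholder function name and limit):
#   import boto3
#   client = boto3.client("lambda")
#   limit_concurrency(client, "mapservices-function", 5)
#   ...wait for the "rate exceeded" errors to stop...
#   remove_concurrency_limit(client, "mapservices-function")
```

The right limit is still an open question, as noted above; the sketch only makes it harder to set the one value known to take the endpoint down.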

Posted Mar 24, 2021 - 12:08 CDT

Resolved
This incident has been resolved.
Posted Mar 24, 2021 - 11:45 CDT
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Mar 23, 2021 - 16:38 CDT
Investigating
We are noticing issues with the backend failing to properly connect to the database. These issues appear to originate with our hosting provider, AWS. We are investigating a temporary fix.
Posted Mar 23, 2021 - 14:35 CDT