This is a debugging story.

A bit of context:

We run a python web application, running on AWS lambda. It’s backed by an AWS aurora serverless postgres database.

When it comes time to change the database schema, we use a library called alembic. We have a GitHub action configured such that on every deployment, we run a “database migration” lambda function, which tells alembic to ensure that all migrations written have been applied to the database (and, in the correct sequence, run any that have not been applied).

This is supposed to be a fast operation - other lambdas are serving HTTP traffic and querying this database at the same time, and we expect them to have to pause for less than a second while the database is upgraded.

The bug

We ran a migration, as part of a production deployment, that took about 10 minutes to complete. During this time, requests to the application received 504 errors, as the network bridge to the lambda functions gave up on getting responses.

The migration was fairly simple:

alembic.op.create_unique_constraint(
    "unique_informal_name_academic_partner",
    "academic_partners",
    ["informal_name"]
)

which corresponds to the following SQL:

alter table academic_partners
add constraint unique_informal_name_academic_partner
unique (informal_name);

The table in question contains around 250 rows.

Why did this operation cause the database to lock up in this way? What did we have to do to prevent it from happening again?

Outage observations

During the outage, we saw the connection count to the database increase dramatically, as the HTTP-request-serving lambdas timed out. So, we proceeded on the (correct) assumption that the database was closely related to the root cause.

A bunch of other things also happened:

the database tried, and failed, to scale, emitting messages like

Scaling DB cluster from 4 capacity units to 8 capacity units for this reason: Autoscaling.

The DB cluster failed to scale from 4 capacity units to 8 capacity units for this reason: A scaling point wasn’t found.

Our database-migration lambda ran 3 times, presumably due to some non-obvious retry logic. The first time took 10 minutes, the second two happened during that time, and quickly failed with log messages like

[ERROR] OperationalError: (psycopg2.OperationalError) FATAL: remaining connection slots are reserved for non-replication superuser connections

A bunch of other database stats made big swings: freeable memory, read IOPs, swap usage, write/read latency, and commit throughput

It was unclear, initially, how much of this was relevant to the root issue, and how much was noise, or symptoms.

Initial hypothesis: aurora is slow at validating column uniqueness

My first stop was the documentation for postgres, and the behavior of the ALTER TABLE command. What about it might cause the database to “lock up” like this?

The documentation says, among other things:

…most forms of ADD table_constraint require an ACCESS EXCLUSIVE lock…

Normally, this form will cause a scan of the table to verify that all existing rows in the table satisfy the new constraint. But if the NOT VALID option is used, this potentially-lengthy scan is skipped.

Seems like a clue! An ACCESS EXLCLUSIVE lock seems like it would lock things up!

ACCESS EXCLUSIVE: This mode guarantees that the holder is the only transaction accessing the table in any way.

OK, so first hypothesis: We ran an ALTER TABLE command to enforce uniqueness of a column, aurora postgres for some reason took 10 minutes to verify uniqueness on the ~250 rows of existing data, and during this 10 minutes enforced an ACCESS EXLUSIVE lock on the relevant table, locking out our request-serving lambdas. And maybe, if we use a NOT VALID option, we can fix it.

Weird, but straightforward. Was it correct? Could we disprove it?

Replication attempt #1

Because I thought that this was an aurora bug, I thought we’d need to replicate it in aurora. I walked down this road a bit:

Use cloudformation to create a replication environment, with a database configured similarly to our production database
Load the replication environment database with realistic data
Run the database migration
See if it takes 10 minutes

This is a fairly heavy undertaking - I spend about two hours on it, and then started wondering, what will I do if I cannot replicate this problem easily in this environment? Will I have to add more things about our real environment (HTTP-serving lambdas, a github action, etc) one-by-one until I find the relevant interaction?

Is there a cheaper, easier way to replicate this?

Replication attempt #2

So then I thought, let’s just see if we can replicate this easily in one of our pre-production environments. We have two, one called dev and one called staging. Using the AWS CLI, I confirmed that the databases were configured very similarly to production.

So, I decided to impersonate alembic - I got it to tell me the SQL it would run to downgrade the database, then upgrade it again. The experiment:

Run downgrade SQL against dev database
Run upgrade SQL against dev database
If either of these is “slow” (longer than a few seconds), you’ve replicated the bug

The result was negative - the downgrade and the upgrade took less than a second!

So, my initial replication effort had been doomed to failure, and I was correct to abandon it before investing too heavily.

So, what’s the relevant difference? We know the bug does not appear on the dev environment, with a database configured identically to production. Well, the production environment serves HTTP traffic, so there are other connections open to the database - maybe we can simulate that in a pre-production environment, and see if the bug appears.

I figured that this was a hail mary, because I had no theory that would make HTTP traffic relevant to the bug.

Replication attempt #3

I wrote a script that made realistic requests to the staging environment on an infinite loop, and reported the status code and latency of the responses.

I ran it against the staging environment, opened a connection to the staging server, and ran the SQL downgrade migration. Immediately, HTTP traffic ceased as the migration command hung. Bug replicated!

I expected the migration to take 10 minutes and resolve itself, as it had in production. It did not - after an hour I hit ctrl+c to cancel the SQL command, the transaction was cancelled and HTTP traffic immediately recovered.

From this I took it that we got lucky in production with a 10-minute outage - it could have been much worse, and it’s not obvious how we could have fixed it!

Local replication

Now that I’d replicated the bug in a cloud environment, I thought I should make the attempt to replicate it locally. I still thought this was an aurora bug, so I did not expect to be able to.

But, I was wrong - repeating the steps by pointing my fake-traffic script at a local environment, and running the migration commands, replicated the bug locally - HTTP requests would hang for about a minute and a half while the database processed the migration commands.

So, I owed aurora an apology - with a normal, postgres database on my local machine, the bug still existed. Some observations here:

On a “quiet database” (one not being accessed by any request-serving processes), a migration took less than a second
While being accessed by request-serving processes, a migration took around 90 seconds (it was unclear why it was so much faster locally than in a cloud environment, but this was enough to demonstrate a problem)
For some period of time after requests cease, migrations were still slow. I thought of this as the “hangover”. After a while (a few minutes, I didn’t try to get precise here), the “hangover” resolved, and migrations are fast again.
It did not matter if I wrapped the migration commands in transactions
Adding NOT VALID to avoid uniquness checks did not work - it’s unsupported on column uniquness contstraints.

Postgres clues

So, what was happening? Could I get postgres to tell me any more about what it was doing, and when?

My next stop was the postgres logs - looking at those at the end of a round of experimentation, I found several blocks of lines like this:

2022-05-11 09:13:05.509 PDT [7891] LOG: unexpected EOF on client connection with an open transaction

That seemed like a clue - postgres was complaining that several clients had open connections, with open transactions, and it was waiting for them to finish their transactions. Instead, they just rudely closed their connections.

Then I took a look at postgres processes, with the excellent watch command-line utility. watch 'ps aux | grep postgres' reported on any postgres process running on my machine. Postgres is very kindly designed to give useful information when watched in this way, and I quickly noticed that when serving traffic, postgres would spin up several processes noting idle in transaction.

Combining this with my previous observations, I quickly figured out some more things:

Attempting to migrate the database while watching postgres processes showed a new process noting ALTER TABLE waiting
After idling traffic for some time, all the idle in transaction processes disappeared, and new unexpected EOF on client connection with an open transaction logs appeared
The period of time between the last http request, and the idle in transaction processes disappearing, correspond to and explained the “hangover” I’d seen earlier - idle in transaction connections were what the migration was waiting to resolve, before it could proceed.
ctrl-c on my local request-serving application would immediately kill the connection processes and generate logs, confirming that our application was the source of these idle in transaction connections.

So, whatever the underlying bug was, it seemed to be the fault of our web application. Why was it creating transactions that stayed open indefinitely, and how could we get it to stop?

SQLAlchemy Connection handling

We use the sqlalchemy library to handle database connections. Our application in this middle of a migration from using flask to a new architecture fronted by AWS AppSync. This bug was observable only on the “AppSync side” of the application, so I examined how we are creating database connections, and how we’re managing them. The relevant code is roughly:

from sqlalchemy import create_engine
from sqlalchemy.orm import (
    scoped_session, sessionmaker
)


db_engine = create_engine(
    SQLALCHEMY_DATABASE_URI
)
session_factory = sessionmaker(
    bind=db_engine, autoflush=False
)
DBSession = scoped_session(session_factory)

Then, in various parts of the application, when we need a database connection we import DBSession and do things like

results = DBSession().query(...)

Reading the SQLAlchemy documentation for all this (create_engine, sessionmaker, scoped_session), it all seemed to be in line with what is recommended - scoped_session creates a thread-local session context, which does clever internal caching and handles the mechanics of actual database connections.

Experimentally, I tried a few changes to our call to sessionmaker:

Set autoflush to False: no change
set autocommit to True: fixes the bug!

With the autocommit option, we no longer get idle-in-transaction connections. Victory! Except…the docs say

The autocommit flag is not for general use, and if it is used, queries should only be invoked within the span of a Session.begin() / Session.commit() pair. Executing queries outside of a demarcated transaction is a legacy mode of usage, and can in some cases lead to concurrent connection checkouts.

And the option is deprecated. So, if this is the Wrong Way to fix the bug, what’s the Right Way?

I spent an hour or two reading the documentation for session management in SQLAlchemy, and found a relevant section on managing session and transaction scope. The useful advice in our case can be paraphrased as:

If you’re serving HTTP requests, your session and transaction scope should be identical with the scope of the logic serving a single HTTP request. Which is to say, somewhere in your logic, after you’re done with database queries, but before you finish handling the HTTP request, you should ensure that your database session has been “removed”.

The fix

In our case, that looked like a call to DBSession.remove() in our request-routing handler, right before we return a result.

After adding that one-line change, the bug was fixed - we confirmed it in subsequent pre-production environments, and pushed it out to production.

Some notes

I surprised myself several times, and disproved my initial hypothesis.

I took one wrong turn early on, which cost me several hours. The correct way to approach replication is to try it the cheapest, simplest way first, and only move on to heavier tools if necessary. My assumption that there was an aurora bug was unfounded, and it would have saved me effort on the replication-environment false start to instead begin with replication in one of our existing environments.

The underlying issue is partially due to the architectural approach we’re taking - ironically, this would not have ever been a problem if we’d stuck with flask, because the flask-sqlalchemy library we’ve been using there handles this session-removal logic itself as part of the hooks it uses to integrate with the framework. That doesn’t mean that migrating was a mistake, but changing fundamental things creates surprising bugs, seemingly far from the root-cause change.

The Case of the Database Blocking Migration