
SQLAlchemy SSL SYSCALL timeout coping mechanism

I’m using a combination of SQLAlchemy and Postgres. Every once in a while my database cluster replaces a failing node; circle of life, I guess.

I was under the impression that by configuring my engine in the following manner:

engine = create_engine(
    env_config.pg_connection_string,
    echo=False,
    pool_size=env_config.pg_pool_size,
    pool_timeout=1,  # Number of seconds to wait before giving up on getting a connection from the pool.
    pool_recycle=3600,  # Replace connections on CHECKOUT after 1 hour
    connect_args={
        'connect_timeout': 10,  # Maximum wait for connection
        "options": "-c statement_timeout=30s"  # Maximum amount of time set for statements
    },
)

my statements would time out on queries running longer than 30 seconds, and connection attempts would give up after 10 seconds.

What I’m noticing in practice is that when a db node is being replaced in my cluster, it can take 15 minutes (900s) before I get an exception like psycopg2.DatabaseError: SSL SYSCALL error: No route to host. If a db transaction is active while the node is being replaced, it can take up to 16 minutes for the SYSCALL exception to be raised. All new transactions are handled fine, and I guess routed to the right host? But existing sessions / transactions seem to block and halt for up to 16 minutes.

My explanation would be that an SSL SYSCALL error is neither a connection nor a statement issue, so neither of the configured timeouts has any effect. My question remains: ‘How do I stop or time out these SSL SYSCALL errors?’ I would rather fail quickly and retry the same query than spend 15 minutes in a blocking call. I’m not sure where to resolve this; I’m guessing either in my DB layer (Postgres, SQLAlchemy, or the db driver) or in my network layer (CentOS).

Some more digging in my Postgres configuration reveals that the TCP-related settings tcp_keepalives_count and tcp_keepalives_interval are 6 and 10 respectively, which makes me wonder why the connection hasn’t been killed after 60 seconds. Also, is it even possible to receive TCP ACKs when there is no route to the host (the SSL SYSCALL error)?
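
For reference, those values can be checked from a session; a minimal sketch against the engine configured above (note that these GUCs configure keepalives on the server’s end of each connection, not on the client socket that is hanging here):

from sqlalchemy import text

# These settings apply to the server's side of the socket only; the stuck
# client socket is governed by client / kernel settings instead.
keepalive_gucs = (
    "tcp_keepalives_idle",
    "tcp_keepalives_interval",
    "tcp_keepalives_count",
)

with engine.connect() as conn:
    for name in keepalive_gucs:
        value = conn.execute(text(f"SHOW {name}")).scalar()
        print(f"{name} = {value}")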


Answer

Unless someone else has a more fitting explanation, I’m convinced my issue is caused by a combination of the TCP tcp_retries2 setting and open db connections not being shut down gracefully. Whenever my primary db node is replaced it is simply nuked from the cluster, and any established connections to that node are left open / in ESTABLISHED state. With the current default TCP settings it can take up to 15 minutes before such a connection is dropped; I’m not really sure why this manifests as an SSL SYSCALL exception, though.
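
To check this on your own host: on Linux the limit lives in /proc, and the kernel documentation notes that the default of 15 retransmissions yields a hypothetical timeout of roughly 924 seconds, which lines up with the ~15 minutes (900s) observed above. A minimal sketch:

# Read the kernel's retransmission limit for established TCP connections (Linux).
# With the default of 15, unacknowledged data is retransmitted with exponential
# backoff for roughly 15 minutes before the connection is finally declared dead.
with open("/proc/sys/net/ipv4/tcp_retries2") as f:
    print("net.ipv4.tcp_retries2 =", f.read().strip())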

My problem is covered really well in one of the issues on the PgBouncer repo: https://github.com/pgbouncer/pgbouncer/issues/138, which is about TCP connections taking a long time before being marked / considered ‘dead’.

I suggest reading that page to get a better understanding; my assumption is that my issue is likewise caused by the default TCP settings.

Long story short, I consider myself to have two options:

  1. Manually tune the TCP settings on my host; this affects every other TCP-using component on that machine (see the sketch below).
  2. Set up something like PgBouncer so the TCP tuning can be done locally to the service, without affecting anything else on that machine.
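
As a rough sketch of what option 1 could look like on a Linux host, assuming net.ipv4.tcp_retries2 is the relevant knob (the value 6 is purely illustrative, this needs root, and in practice you would persist it via sysctl.conf rather than a script):

# Lower the host-wide retransmission limit so dead peers are detected in tens
# of seconds instead of ~15 minutes. Affects every TCP connection on the machine.
def set_sysctl(name: str, value: str) -> None:
    path = "/proc/sys/" + name.replace(".", "/")
    with open(path, "w") as f:
        f.write(value + "\n")

set_sysctl("net.ipv4.tcp_retries2", "6")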