
Airflow HttpSensor using a default host

I’m trying to poll an endpoint until its Last-Modified header shows that the endpoint has been updated in the last five minutes (the poke interval of the HttpSensor). In the Airflow logs, I see the following:

[log output not preserved]

The relevant log line reads "Using connection to: id: http_default. Host: https://www.httpbin.org/". So when the sensor forms the request, it appends the URL I’m actually interested in to https://www.httpbin.org/, resulting in a 404. This is my sensor definition (fairly straightforward):

[sensor definition not preserved]

where DAILY_URL is the URL I want, and endpoint_is_updated_recently is the function to parse the Last-Modified header to determine if it’s been updated since the last poke.
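For illustration only, a check of this kind could be sketched as below. This is not the original function, whose body is not shown here. The `window` and `now` parameters are assumptions added to make the sketch testable, and the response object is assumed to expose a `headers` mapping, as a `requests` response does.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def endpoint_is_updated_recently(
    response,
    window: timedelta = timedelta(minutes=5),  # assumed freshness window
    now: Optional[datetime] = None,            # injectable clock, for testing
) -> bool:
    """Return True if the Last-Modified header falls within `window` of `now`."""
    header = response.headers.get("Last-Modified")
    if header is None:
        return False
    try:
        last_modified = parsedate_to_datetime(header)
    except (TypeError, ValueError):
        # Header present but not a valid HTTP date.
        return False
    if last_modified.tzinfo is None:
        # Treat a date without zone information as UTC.
        last_modified = last_modified.replace(tzinfo=timezone.utc)
    current = now or datetime.now(timezone.utc)
    return current - last_modified <= window
```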

Does anyone have any ideas why it’s using httpbin.org as the host? That’s not mentioned anywhere in my code, Airflow code, etc. and curl <the URL I want> does work.


Answer

It is a good idea to keep credentials and connection info out of the code. Airflow uses Connections as a central database to store and manage credentials and connection info.

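In your case, HttpSensor uses the http_default connection, which is the default value of its http_conn_id argument. The Host stored in that connection is https://www.httpbin.org/; this value typically comes from the default connections Airflow creates when its metadata database is initialized. That is why this URL is prepended to your DAILY_URL.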
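A minimal sketch of the effect, assuming the request URL is built by plain concatenation of the connection's Host and the endpoint, as described above. `build_url` is a hypothetical helper written for this illustration, not the HTTP provider's actual code, and https://example.com/daily/report is a placeholder standing in for DAILY_URL.

```python
from typing import Optional


def build_url(base_url: Optional[str], endpoint: Optional[str]) -> str:
    """Join a connection Host and an endpoint by simple concatenation."""
    base_url = base_url or ""
    endpoint = endpoint or ""
    # Insert a slash only when neither side already provides one.
    if base_url and not base_url.endswith("/") and endpoint and not endpoint.startswith("/"):
        return base_url + "/" + endpoint
    return base_url + endpoint


# A full URL passed as the endpoint is glued onto the default Host.
bad_url = build_url("https://www.httpbin.org/", "https://example.com/daily/report")

# A relative path combined with the intended Host gives the intended URL.
good_url = build_url("https://example.com", "daily/report")

print(bad_url)
print(good_url)
```

The first result is not a path that httpbin.org serves, which is consistent with the 404 in the question.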

The endpoint argument of HttpSensor is meant to store a path relative to the URL stored in the corresponding connection.
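One way to apply this is to split the full URL into a Host for the connection and a relative path for the endpoint argument. The sketch below uses only the standard library; `split_for_connection` and the example URL are hypothetical names chosen for illustration.

```python
from typing import Tuple
from urllib.parse import urlsplit


def split_for_connection(full_url: str) -> Tuple[str, str]:
    """Split a full URL into (host for the connection, relative endpoint)."""
    parts = urlsplit(full_url)
    host = f"{parts.scheme}://{parts.netloc}"
    # Drop the leading slash so the endpoint is relative to the host.
    endpoint = parts.path.lstrip("/")
    if parts.query:
        endpoint += "?" + parts.query
    return host, endpoint


host, endpoint = split_for_connection("https://example.com/daily/report?day=today")
print(host)
print(endpoint)
```

The host part would then go into the Host field of a dedicated Airflow connection, and the sensor would receive that connection's id through http_conn_id together with the relative path as endpoint.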
