
Airflow HttpSensor using a default host

I’m trying to poll some endpoint to wait until the Last-Modified header shows the endpoint has been updated in the last five minutes (the default poke interval for the HttpSensor). In the Airflow logs, I see the following:

[2020-07-11 22:40:53,794] {http_sensor.py:77} INFO - Poking: https://<the URL I want>
[2020-07-11 22:40:53,802] {logging_mixin.py:112} INFO - [2020-07-11 22:40:53,802] {base_hook.py:87} INFO - Using connection to: id: http_default. Host: https://www.httpbin.org/, Port: None, Schema: None, Login: None, Password: None, extra: None
[2020-07-11 22:40:53,803] {logging_mixin.py:112} INFO - [2020-07-11 22:40:53,803] {http_hook.py:136} INFO - Sending 'GET' to url: https://www.httpbin.org/https://<the URL I want>
[2020-07-11 22:40:53,837] {logging_mixin.py:112} WARNING - /usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py:986: InsecureRequestWarning: Unverified HTTPS request is being made to host 'www.httpbin.org'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
[2020-07-11 22:40:53,841] {logging_mixin.py:112} INFO - [2020-07-11 22:40:53,841] {http_hook.py:150} ERROR - HTTP error: NOT FOUND
[2020-07-11 22:40:53,841] {logging_mixin.py:112} INFO - [2020-07-11 22:40:53,841] {http_hook.py:151} ERROR - <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<title>404 Not Found</title>
<h1>Not Found</h1>
<p>The requested URL was not found on the server.  If you entered the URL manually please check your spelling and try again.</p>

As the logs show, the sensor is using the http_default connection, whose host is https://www.httpbin.org/. When it forms the request, it appends the URL I’m actually interested in to that host, resulting in a 404. This is my sensor definition (fairly straightforward):

    data_is_updated = HttpSensor(
        task_id="data-is-updated",
        endpoint=DAILY_URL,
        response_check=endpoint_is_updated_recently
    )

where DAILY_URL is the URL I want, and endpoint_is_updated_recently is the function to parse the Last-Modified header to determine if it’s been updated since the last poke.
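For context, here is a minimal sketch of what a check like endpoint_is_updated_recently might look like. It is not the code from the question: the five-minute window comes from the description above, while the `max_age` and `now` parameters and the choice to treat a missing header as "not updated" are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def endpoint_is_updated_recently(response, max_age=timedelta(minutes=5), now=None):
    """Return True if the Last-Modified header is no older than max_age."""
    last_modified = response.headers.get("Last-Modified")
    if last_modified is None:
        # Assumption: a response without the header counts as "not updated yet".
        return False

    modified_at = parsedate_to_datetime(last_modified)
    if modified_at.tzinfo is None:
        # Some date strings parse to a naive datetime; treat those as UTC.
        modified_at = modified_at.replace(tzinfo=timezone.utc)

    # `now` is injectable only to make the function easy to test.
    now = now or datetime.now(timezone.utc)
    return now - modified_at <= max_age
```

The sensor calls the function with the response object, so the extra parameters keep their defaults when it runs inside Airflow.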

Does anyone have any ideas why it’s using httpbin.org as the host? That’s not mentioned anywhere in my code, Airflow code, etc. and curl <the URL I want> does work.


Answer

It is a good idea to keep credentials and connection info out of the code. Airflow uses Connections as a central database to store and manage credentials and connection info.

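In your case, HttpSensor uses the http_default connection because no http_conn_id was passed, and the Host of that connection is set to https://www.httpbin.org/ (Airflow creates http_default with this host among its default connections when the metadata database is initialized). That is why this URL is being prepended to your DAILY_URL.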
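The prepending can be illustrated with a small standalone sketch. This is a simplified approximation of how the HTTP hook joins the connection's base URL and the endpoint, not the hook's actual code; `build_url` and the example.com URL are hypothetical names used only for illustration.

```python
def build_url(base_url, endpoint):
    """Join a connection's base URL and an endpoint by plain concatenation."""
    # Insert a single slash only when neither side already provides one.
    if base_url and not base_url.endswith("/") and endpoint and not endpoint.startswith("/"):
        return base_url + "/" + endpoint
    return (base_url or "") + (endpoint or "")


# Host of the http_default connection, as reported in the log.
HOST = "https://www.httpbin.org/"

# A full URL passed as the endpoint is appended to the host as-is,
# which reproduces the malformed URL from the log.
full_url_result = build_url(HOST, "https://example.com/daily.csv")

# A relative path produces a well-formed URL on the connection's host.
relative_result = build_url(HOST, "get")
```

Because the join is plain string concatenation, the endpoint is never treated as an absolute URL, whatever it contains.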

The endpoint argument of HttpSensor is meant to store a path relative to the URL stored in the corresponding connection.
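One way to apply this is to create a dedicated connection whose Host is the base of your URL and pass only the path as endpoint. The sketch below assumes Airflow 1.10.x (matching the http_sensor.py module in the log) and a connection with the hypothetical id daily_data_api that you have created yourself, for example in the Airflow UI under Admin > Connections. `split_url`, `make_sensor`, and the example.com URL are likewise illustrative.

```python
from urllib.parse import urlsplit


def split_url(url):
    """Split a full URL into (host for the Connection, endpoint for the sensor)."""
    parts = urlsplit(url)
    host = f"{parts.scheme}://{parts.netloc}"
    endpoint = parts.path.lstrip("/")
    if parts.query:
        endpoint += "?" + parts.query
    return host, endpoint


def make_sensor(dag, endpoint, response_check, conn_id="daily_data_api"):
    """Build the sensor against a dedicated connection instead of http_default."""
    # Imported here so split_url can be used without Airflow installed.
    # Import path as in Airflow 1.10.x; it differs in later releases.
    from airflow.sensors.http_sensor import HttpSensor

    return HttpSensor(
        task_id="data-is-updated",
        http_conn_id=conn_id,  # connection whose Host holds the base URL
        endpoint=endpoint,  # path relative to that Host
        response_check=response_check,
        dag=dag,
    )


# Hypothetical stand-in for DAILY_URL.
host, endpoint = split_url("https://example.com/data/daily.csv?format=csv")
```

Here `host` is the value to store in the connection's Host field, and `endpoint` is what the sensor receives.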
