
Poetry and buildkit mount=type=cache not working when building over airflow image

I have two example Dockerfiles; one is working and the other is not. The main difference between the two is the base image.

Simple Python base image Dockerfile:

# syntax = docker/dockerfile:experimental
FROM python:3.9-slim-bullseye

RUN apt-get update -qy && apt-get install -qy \
    build-essential tini libsasl2-dev libssl-dev default-libmysqlclient-dev gnutls-bin

RUN pip install poetry==1.1.15
COPY pyproject.toml .
COPY poetry.lock .
RUN poetry config virtualenvs.create false
RUN --mount=type=cache,mode=0777,target=/root/.cache/pypoetry poetry install

Airflow base image Dockerfile:

# syntax = docker/dockerfile:experimental
FROM apache/airflow:2.3.3-python3.9
USER root
RUN apt-get update -qy && apt-get install -qy \
    build-essential tini libsasl2-dev libssl-dev default-libmysqlclient-dev gnutls-bin

USER airflow
RUN pip install poetry==1.1.15
COPY pyproject.toml .
COPY poetry.lock .
RUN poetry config virtualenvs.create false
RUN poetry config cache-dir /opt/airflow/.cache/pypoetry
RUN --mount=type=cache,uid=50000,mode=0777,target=/opt/airflow/.cache/pypoetry poetry install

Before building the image, run poetry lock in the same folder as the pyproject.toml file!

pyproject.toml file:

[tool.poetry]
name = "Airflow-test"
version = "0.1.0"
description = ""
authors = ["Lorem ipsum"]

[tool.poetry.dependencies]
python = "~3.9"
apache-airflow = { version = "2.3.3", extras = ["amazon", "crypto", "celery", "postgres", "hive", "jdbc", "mysql", "ssh", "slack", "statsd"] }
prometheus_client = "^0.8.0"
isodate = "0.6.1"
dacite = "1.6.0"
sqlparse = "^0.3.1"
python3-openid = "^3.1.0"
flask-appbuilder = ">=3.4.3"
alembic = ">=1.7.7"
apache-airflow-providers-google = "^8.1.0"
apache-airflow-providers-databricks = "^3.0.0"
apache-airflow-providers-amazon = "^4.0.0"
pendulum = "^2.1.2"

[tool.poetry.dev-dependencies]

[build-system]
requires = ["poetry-core>=1.0.0"]
build-backend = "poetry.core.masonry.api"

To build the images, this is the command I use:

DOCKER_BUILDKIT=1 docker build --progress=plain -t airflow-test -f Dockerfile . 

For both images, the first time they build, poetry install needs to download all dependencies. The interesting part is that the second time I build, the Python-based image is a lot faster because the dependencies are already cached, but the Airflow-based image tries to download all 200 dependencies once again. From what I know, by specifying --mount=type=cache that directory is stored in BuildKit's build cache so it can be reused the next time the image is built; it also stays out of the final image, which trims the image size.
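As a side note on where that cache actually lives: cache mounts are kept in BuildKit's local build cache on the machine doing the build, not inside the image or a registry. Assuming a reasonably recent Docker with BuildKit, these commands inspect and clear that cache:

docker buildx du        # show build cache usage, including cache mounts
docker builder prune    # clear the build cache (cache mounts included)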

When running the image, how do the dependencies appear? If I run docker run -it --user 50000 --entrypoint /bin/bash image, a simple Python import works on the Airflow image but not on the Python image. When and how do the dependencies get reattached to the image?

If you want to try it out, here is a dummy project that can be cloned locally and played around with: https://github.com/ioangrozea/Docker-dummy


Answer

Maybe this does not answer the question directly, but I think what you are trying to do makes very little sense in the first place, so I would recommend you change the approach completely, especially since what you are trying to achieve is very well described in the official Airflow image documentation, including plenty of examples to follow. What you are trying to achieve will (no matter how hard you try) end up with an image that is more than 200 MB (at least 20%) bigger than what you can get by following the official documentation.

Using poetry to build that image makes very little sense and is not recommended (there is absolutely no need to use poetry in this case).

See the comment here.

While there are some successes with using other tools like poetry or pip-tools, they do not share the same workflow as pip – especially when it comes to constraint vs. requirements management. Installing via Poetry or pip-tools is not currently supported. If you wish to install airflow using those tools you should use the constraints and convert them to appropriate format and workflow that your tool requires.

Poetry and pip have completely different ways of resolving dependencies, and while poetry is a cool tool for managing dependencies of small projects and I really like it, its opinionated choice of treating libraries and applications differently makes it unsuitable for managing dependencies for Airflow, which is both an application to install and a library for developers to build on top of. Poetry's limitations simply do not work for Airflow.
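For reference, the constraints-based installation the quote refers to looks roughly like this (a sketch; the constraint-file URL follows the pattern Airflow publishes per Airflow version and Python minor version, and the extras list is just an example):

pip install "apache-airflow[amazon,celery,postgres]==2.3.3" \
    --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.3.3/constraints-3.9.txt"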

I explained it in more detail in the talk I gave last year; you can watch it if you are interested in the "why".

Then, how to solve your problem? Don't use --mount=type=cache and poetry in this case. Use the multi-segment image of Apache Airflow and the "customising" option rather than "extending" the image. This gives you far bigger savings, because the "build-essential" packages will not be added to your final image (on their own they add ~200 MB to the image size). The only way to get rid of them is to split your image into two segments: one that has "build-essential" and allows you to build Python packages, and a "runtime" one into which you only copy the built Python libraries, as sketched below.
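A minimal multi-stage sketch of that build/runtime split (illustrative only; the official Airflow Dockerfile does this far more thoroughly, and requirements.txt here is just a stand-in for your dependency list):

# Build segment: has build-essential so Python wheels can be compiled
FROM python:3.9-slim-bullseye AS build
RUN apt-get update -qy && apt-get install -qy build-essential
COPY requirements.txt .
RUN pip install --user -r requirements.txt

# Runtime segment: copy only the installed packages, leaving build tools behind
FROM python:3.9-slim-bullseye
COPY --from=build /root/.local /root/.local
ENV PATH=/root/.local/bin:$PATH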

This is exactly the approach that the official Airflow image takes: it is highly optimised for size and speed of rebuilds. While its internals are pretty complex, actually building your highly optimised, completely custom image is as simple as downloading the Airflow Dockerfile and running the right docker buildx build . --build-arg ... --build-arg ... command. The Airflow Dockerfile will do all the optimisations for you, resulting in as small an image as humanly possible, and it also lets you reuse the build cache, especially if you use buildkit, which is a modern, slick and very well optimised way of building images (the Airflow Dockerfile requires buildkit as of Airflow 2.3).
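To make that concrete, the customised build boils down to something like this (a sketch; AIRFLOW_VERSION and ADDITIONAL_PYTHON_DEPS are among the build args documented for the Airflow Dockerfile, and the package list is just an example):

curl -O https://raw.githubusercontent.com/apache/airflow/2.3.3/Dockerfile
docker buildx build . \
    --build-arg AIRFLOW_VERSION="2.3.3" \
    --build-arg ADDITIONAL_PYTHON_DEPS="dacite==1.6.0 isodate==0.6.1" \
    --tag my-custom-airflow:2.3.3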

You can see all the details of how to build the customised image here, where you have plenty of examples and explanations of why it works the way it works and what kind of optimisations you can get. There are examples of how you can add dependencies, Python packages, etc. While this is pretty sophisticated, you seem to be doing a sophisticated thing with your image, which is why I am suggesting you follow that route.

Also, if you are interested in the rest of the reasoning for why this makes sense, you can watch my talk from Airflow Summit 2020. While the talk was given two years ago and some small details have changed, the explanation of how and why we build the image the way we do in Airflow still holds very strongly. It has become a little simpler since the talk was given (i.e. the only thing you need now is the Dockerfile; no full Airflow sources are needed) and you need to use buildkit, but all the rest remains the same.
