To preface: I’m fairly new to Docker, Airflow, and Stack Overflow.
I’ve got an instance of Airflow running in Docker on an Ubuntu (20.04.3) VM.
I’m trying to get openpyxl installed at build time so that I can use it as the engine for pd.read_excel.
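For context, this is the kind of call that fails when the package is missing; a minimal sketch with a made-up file name:

```python
import pandas as pd

# pandas delegates .xlsx parsing to openpyxl when engine="openpyxl";
# if the package is not installed, this line raises ImportError.
df = pd.read_excel("report.xlsx", engine="openpyxl")
print(df.head())
```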
Here’s the Dockerfile with the install command:
```dockerfile
FROM apache/airflow:2.2.4

ENV AIRFLOW_HOME=/opt/airflow

USER root
RUN apt-get update -qq && apt-get install vim -qqq

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Ref: https://airflow.apache.org/docs/docker-stack/recipes.html
SHELL ["/bin/bash", "-o", "pipefail", "-e", "-u", "-x", "-c"]

ARG CLOUD_SDK_VERSION=322.0.0
ENV GCLOUD_HOME=/home/google-cloud-sdk
ENV PATH="${GCLOUD_HOME}/bin/:${PATH}"

RUN DOWNLOAD_URL="https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-sdk-${CLOUD_SDK_VERSION}-linux-x86_64.tar.gz" \
    && TMP_DIR="$(mktemp -d)" \
    && curl -fL "${DOWNLOAD_URL}" --output "${TMP_DIR}/google-cloud-sdk.tar.gz" \
    && mkdir -p "${GCLOUD_HOME}" \
    && tar xzf "${TMP_DIR}/google-cloud-sdk.tar.gz" -C "${GCLOUD_HOME}" --strip-components=1 \
    && "${GCLOUD_HOME}/install.sh" --bash-completion=false --path-update=false --usage-reporting=false --quiet \
    && rm -rf "${TMP_DIR}" \
    && gcloud --version

WORKDIR $AIRFLOW_HOME

USER $AIRFLOW_UID
```
The requirements.txt file looks like this:
```
openpyxl
apache-airflow-providers-google
pyarrow==6.0.1
pandas==1.3.5
requests==2.27.1
```
And the docker-compose.yaml file looks like this:
```yaml
version: '3'

x-airflow-common:
  &airflow-common
  build:
    context: .
    dockerfile: ./Dockerfile
  environment:
    &airflow-common-env
    AIRFLOW__CORE__EXECUTOR: CeleryExecutor
    AIRFLOW__CORE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow
    AIRFLOW__CELERY__RESULT_BACKEND: db+postgresql://airflow:airflow@postgres/airflow
    AIRFLOW__CELERY__BROKER_URL: redis://:@redis:6379/0
    AIRFLOW__CORE__FERNET_KEY: ''
    AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION: 'true'
    AIRFLOW__CORE__LOAD_EXAMPLES: 'false'
    AIRFLOW__API__AUTH_BACKEND: 'airflow.api.auth.backend.basic_auth'
    _PIP_ADDITIONAL_REQUIREMENTS: ${_PIP_ADDITIONAL_REQUIREMENTS:-}
    GOOGLE_APPLICATION_CREDENTIALS: /.google/credentials/google_credentials.json
    AIRFLOW_CONN_GOOGLE_CLOUD_DEFAULT: 'google-cloud-platform://?extra__google_cloud_platform__key_path=/.google/credentials/google_credentials.json'
    GCP_PROJECT_ID: <MYPROJECTID>
    GCP_GCS_BUCKET: <MYBUCKET>
  volumes:
    - ./dags:/opt/airflow/dags
    - ./logs:/opt/airflow/logs
    - ./plugins:/opt/airflow/plugins
    - ~/.google/credentials/:/.google/credentials:ro
  user: "${AIRFLOW_UID:-50000}:0"
  depends_on:
    &airflow-common-depends-on
    redis:
      condition: service_healthy
    postgres:
      condition: service_healthy

services:
  postgres:
    image: postgres:13
    environment:
      POSTGRES_USER: <USER>
      POSTGRES_PASSWORD: <PASSWORD>
      POSTGRES_DB: <DBNAME>
    volumes:
      - postgres-db-volume:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD", "pg_isready", "-U", "airflow"]
      interval: 5s
      retries: 5
    restart: always

  redis:
    image: redis:latest
    expose:
      - 6379
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
      timeout: 30s
      retries: 50
    restart: always

  airflow-webserver:
    <<: *airflow-common
    command: webserver
    ports:
      - 8080:8080
    healthcheck:
      test: ["CMD", "curl", "--fail", "http://localhost:8080/health"]
      interval: 10s
      timeout: 10s
      retries: 5
    restart: always
    depends_on:
      <<: *airflow-common-depends-on
      airflow-init:
        condition: service_completed_successfully

  airflow-scheduler:
    <<: *airflow-common
    command: scheduler
    healthcheck:
      test: ["CMD-SHELL", 'airflow jobs check --job-type SchedulerJob --hostname "$${HOSTNAME}"']
      interval: 10s
      timeout: 10s
      retries: 5
    restart: always
    depends_on:
      <<: *airflow-common-depends-on
      airflow-init:
        condition: service_completed_successfully

  airflow-worker:
    <<: *airflow-common
    command: celery worker
    healthcheck:
      test:
        - "CMD-SHELL"
        - 'celery --app airflow.executors.celery_executor.app inspect ping -d "celery@$${HOSTNAME}"'
      interval: 10s
      timeout: 10s
      retries: 5
    environment:
      <<: *airflow-common-env
      DUMB_INIT_SETSID: "0"
    restart: always
    depends_on:
      <<: *airflow-common-depends-on
      airflow-init:
        condition: service_completed_successfully

  airflow-triggerer:
    <<: *airflow-common
    command: triggerer
    healthcheck:
      test: ["CMD-SHELL", 'airflow jobs check --job-type TriggererJob --hostname "$${HOSTNAME}"']
      interval: 10s
      timeout: 10s
      retries: 5
    restart: always
    depends_on:
      <<: *airflow-common-depends-on
      airflow-init:
        condition: service_completed_successfully

  airflow-init:
    <<: *airflow-common
    entrypoint: /bin/bash
    command:
      - -c
      - |
        function ver() {
          printf "%04d%04d%04d%04d" $${1//./ }
        }
        airflow_version=$$(gosu airflow airflow version)
        airflow_version_comparable=$$(ver $${airflow_version})
        min_airflow_version=2.2.0
        min_airflow_version_comparable=$$(ver $${min_airflow_version})
        if (( airflow_version_comparable < min_airflow_version_comparable )); then
          echo
          echo -e "\033[1;31mERROR!!!: Too old Airflow version $${airflow_version}!\e[0m"
          echo "The minimum Airflow version supported: $${min_airflow_version}. Only use this or higher!"
          echo
          exit 1
        fi
        if [[ -z "${AIRFLOW_UID}" ]]; then
          echo
          echo -e "\033[1;33mWARNING!!!: AIRFLOW_UID not set!\e[0m"
          echo "If you are on Linux, you SHOULD follow the instructions below to set "
          echo "AIRFLOW_UID environment variable, otherwise files will be owned by root."
          echo "For other operating systems you can get rid of the warning with manually created .env file:"
          echo "    See: https://airflow.apache.org/docs/apache-airflow/stable/start/docker.html#setting-the-right-airflow-user"
          echo
        fi
        one_meg=1048576
        mem_available=$$(($$(getconf _PHYS_PAGES) * $$(getconf PAGE_SIZE) / one_meg))
        cpus_available=$$(grep -cE 'cpu[0-9]+' /proc/stat)
        disk_available=$$(df / | tail -1 | awk '{print $$4}')
        warning_resources="false"
        if (( mem_available < 4000 )) ; then
          echo
          echo -e "\033[1;33mWARNING!!!: Not enough memory available for Docker.\e[0m"
          echo "At least 4GB of memory required. You have $$(numfmt --to iec $$((mem_available * one_meg)))"
          echo
          warning_resources="true"
        fi
        if (( cpus_available < 2 )); then
          echo
          echo -e "\033[1;33mWARNING!!!: Not enough CPUS available for Docker.\e[0m"
          echo "At least 2 CPUs recommended. You have $${cpus_available}"
          echo
          warning_resources="true"
        fi
        if (( disk_available < one_meg * 10 )); then
          echo
          echo -e "\033[1;33mWARNING!!!: Not enough Disk space available for Docker.\e[0m"
          echo "At least 10 GBs recommended. You have $$(numfmt --to iec $$((disk_available * 1024 )))"
          echo
          warning_resources="true"
        fi
        if [[ $${warning_resources} == "true" ]]; then
          echo
          echo -e "\033[1;33mWARNING!!!: You have not enough resources to run Airflow (see above)!\e[0m"
          echo "Please follow the instructions to increase amount of resources available:"
          echo "   https://airflow.apache.org/docs/apache-airflow/stable/start/docker.html#before-you-begin"
          echo
        fi
        mkdir -p /sources/logs /sources/dags /sources/plugins
        chown -R "${AIRFLOW_UID}:0" /sources/{logs,dags,plugins}
        exec /entrypoint airflow version
    environment:
      <<: *airflow-common-env
      _AIRFLOW_DB_UPGRADE: 'true'
      _AIRFLOW_WWW_USER_CREATE: 'true'
      _AIRFLOW_WWW_USER_USERNAME: ${_AIRFLOW_WWW_USER_USERNAME:-airflow}
      _AIRFLOW_WWW_USER_PASSWORD: ${_AIRFLOW_WWW_USER_PASSWORD:-airflow}
    user: "0:0"
    volumes:
      - .:/sources

  airflow-cli:
    <<: *airflow-common
    profiles:
      - debug
    environment:
      <<: *airflow-common-env
      CONNECTION_CHECK_MAX_COUNT: "0"
    command:
      - bash
      - -c
      - airflow

  flower:
    <<: *airflow-common
    command: celery flower
    ports:
      - 5555:5555
    healthcheck:
      test: ["CMD", "curl", "--fail", "http://localhost:5555/"]
      interval: 10s
      timeout: 10s
      retries: 5
    restart: always
    depends_on:
      <<: *airflow-common-depends-on
      airflow-init:
        condition: service_completed_successfully

volumes:
  postgres-db-volume:
```
After I’ve run `docker-compose build` and `docker-compose up` and shelled into the running worker container, `pip list` shows that all of the packages in the requirements file have been installed successfully except for openpyxl. The requirements.txt file that is copied into the container on build even includes openpyxl. At this point I’m able to install it manually by executing `pip install openpyxl` in the shell.
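A quick check like the following (a sketch to run via `python` inside the container, not anything shipped with the image) can confirm whether the shell’s interpreter actually sees the package:

```python
import importlib.util
import sys

# Which interpreter is running, and can it locate openpyxl?
print("executable:", sys.executable)
print("openpyxl:", importlib.util.find_spec("openpyxl"))  # None means not importable
```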
- I’ve tried adding a manual install to the Dockerfile (`RUN pip install openpyxl`), both before and after the `RUN pip install --no-cache-dir -r requirements.txt` command.
- I’ve tried running `docker-compose build --no-cache`.
- I’ve tried running `docker system prune -a` and rebuilding the containers from scratch.
It seems like this should be a fairly simple thing to do, since I had no problems getting the other packages in the requirements.txt file installed correctly, so I’m wondering whether it might be something to do with the openpyxl package itself?
Any advice would be much appreciated.
Answer
If I understand your question right, the following line in your docker-compose.yaml can help:
```yaml
_PIP_ADDITIONAL_REQUIREMENTS: ${_PIP_ADDITIONAL_REQUIREMENTS:- openpyxl==3.0.9}
```
BTW, the docs explain the ways to add additional requirements: Building the image (https://airflow.apache.org/docs/docker-stack/build.html).
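Note that _PIP_ADDITIONAL_REQUIREMENTS installs the listed packages when a container starts rather than baking them into the image, so the docs recommend it for quick testing and development rather than production. Since the compose file above already declares the variable with a `${_PIP_ADDITIONAL_REQUIREMENTS:-}` fallback, one way to supply the value without editing the compose file (a sketch; the pinned version is just an example) is a `.env` file next to docker-compose.yaml:

```
# .env file, picked up automatically by docker-compose from the project directory
_PIP_ADDITIONAL_REQUIREMENTS=openpyxl==3.0.9
```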