Local Virtual Environment (virtualenv)

The easiest way to run tests for Airflow is to use a local virtualenv. While Breeze is the recommended way to run tests - because it provides a reproducible environment and is easy to set up - it is not always the best option, as it requires running your tests inside a Docker container. This can make it harder to debug the tests and to use your IDE to run them.

That's why we recommend using local virtualenv for development and testing.

Use system-level package managers (like yum or apt-get for Linux, or Homebrew for macOS) to install the required software packages:

  • Python (One of: 3.8, 3.9, 3.10, 3.11, 3.12)
  • MySQL 5.7+
  • libxml
  • helm (only for helm chart tests)

Refer to the Dockerfile.ci for a comprehensive list of required packages.

Note

  • MySQL 2.2.0 requires pkgconf as a prerequisite; refer here to install pkgconf
  • macOS with ARM architecture requires graphviz for the venv setup; refer here to install graphviz
  • The helm chart tests need helm installed as a prerequisite; refer here to install and set up helm
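
For example, on macOS with Homebrew, these prerequisites could be installed like this (an illustrative sketch; the formula names assume current Homebrew and may differ on your system):

brew install pkgconf graphviz helm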

Note

As of version 2.8, Airflow follows PEP 517/518 and uses a pyproject.toml file to define build dependencies and the build process. This requires relatively modern versions of packaging tools to build Airflow from local sources or sdist packages, because PEP 517 compliant build hooks are used to determine dynamic build dependencies. For pip, this means that at least version 22.1.0 (released in 2022) is needed to build or install Airflow from sources. This does not affect the ability to install Airflow from released wheel packages.

The simplest way to install Airflow in local virtualenv is to use pip:

pip install -e ".[devel,<OTHER EXTRAS>]" # for example: pip install -e ".[devel,google,postgres]"

This will install Airflow in 'editable' mode - sources of Airflow are used directly from your source tree rather than copied to the installation directory. You need to run this command in the virtualenv you want to install Airflow in - and you need to have the virtualenv activated.
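
If you want to double-check that the editable install really points at your source tree, one quick way (illustrative only) is to print the location Python imports Airflow from:

python -c "import airflow; print(airflow.__file__)"  # should print a path inside your airflow source checkout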

While you can use any virtualenv manager, we recommend using Hatch as your development environment front-end; Airflow already uses the Hatch backend, hatchling, as its build backend.

Hatchling is automatically installed when you build Airflow, but since Airflow's build system uses a PEP-compliant pyproject.toml file, you can use any front-end build system that supports PEP 517 and PEP 518. You can also use pip to install Airflow in editable mode.

You can also install extra packages (like [ssh], etc.) via pip install -e ".[devel,EXTRA1,EXTRA2,...]". However, some of them may have additional install and setup requirements for your local system.

For example, if you have trouble installing the mysql client on macOS and get an error such as:

ld: library not found for -lssl

you should set LIBRARY_PATH before running pip install:

export LIBRARY_PATH=$LIBRARY_PATH:/usr/local/opt/openssl/lib/

You are STRONGLY encouraged to also install and use pre-commit hooks for your local virtualenv development environment. Pre-commit hooks can speed up your development cycle a lot.
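
A minimal way to set the hooks up in your virtualenv (assuming the standard pre-commit workflow) looks like this:

pip install pre-commit
pre-commit install             # install the git hook so checks run on every commit
pre-commit run --all-files     # optionally run all hooks once against the whole repository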

The full list of extras is available in pyproject.toml and can be retrieved using hatch.
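
One way to inspect them is via hatch's project metadata command - a sketch only, as the exact invocation may vary with your hatch version:

hatch project metadata   # prints the resolved project metadata, including the optional dependencies (extras)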

Note

Only pip installation is currently officially supported. Make sure you have the latest pip installed.

While there are some successes with using other tools like poetry or pip-tools, they do not share the same workflow as pip - especially when it comes to constraint vs. requirements management. Installing via Poetry or pip-tools is not currently supported.

There are known issues with bazel that might lead to circular dependencies when using it to install Airflow. Please switch to pip if you encounter such problems. The Bazel community is working on fixing the problem in this PR, so newer versions of bazel might handle it.

If you wish to install Airflow using those tools, you should use the constraint files and convert them to the appropriate format and workflow that your tool requires.

Airflow uses hatch as its build and development tool of choice. It is one of the popular build tools and environment managers for Python, maintained by the Python Packaging Authority. It is an optional tool that is only really needed when you want to build packages from sources, but it is also very convenient for managing your Python versions and virtualenvs.

The Airflow project contains some pre-defined virtualenv definitions in pyproject.toml that can be easily used by hatch to create your local venvs. This is not necessary for developing and testing Airflow, but it is a convenient way to manage your local Python versions and virtualenvs.

You can install hatch in various ways (including GUI installers).

Example using pipx:

pipx install hatch

We recommend using pipx, as you can easily manage installed Python apps and later use it to upgrade hatch as needed with:

pipx upgrade hatch

## Using Hatch to manage your Python versions

You can also use hatch to install and manage airflow virtualenvs and development environments. For example, you can install Python 3.10 with this command:

hatch python install 3.10

or install all Python versions that are used in Airflow:

hatch python install all

Airflow has some pre-defined virtualenvs that you can use to develop and test airflow. You can see the list of available envs with:

hatch env show

This is what it shows currently:

Name         Type     Description
default      virtual  Default environment with Python 3.8 for maximum compatibility
airflow-38   virtual  Environment with Python 3.8. No devel installed.
airflow-39   virtual  Environment with Python 3.9. No devel installed.
airflow-310  virtual  Environment with Python 3.10. No devel installed.
airflow-311  virtual  Environment with Python 3.11. No devel installed.
airflow-312  virtual  Environment with Python 3.12. No devel installed.

The default env (if you have not selected one explicitly) is default, and it is a Python 3.8 virtualenv for maximum compatibility. After entering the environment, you can install the devel set of dependencies in it by running:

pip install -e ".[devel]"

The other environments are just bare-bones Python virtualenvs with Airflow core requirements only, without any extras installed and without any tools. They are much faster to create than the default environment, and you can manually install either the appropriate extras or the tools that you need for testing or development. You can create the default environment with:

hatch env create

You can create a specific environment by passing its name to the create command:

hatch env create airflow-310

You can install extras in the environment by running a pip command in it:

hatch -e airflow-310 run -- pip install -e ".[devel,google]"

You can enter the environment by running a shell of your choice (for example zsh), where you can run any commands:

hatch -e airflow-310 shell

Once you are in the environment (usually indicated by an updated prompt), you can just install the extra dependencies you need:

[~/airflow] [airflow-310] pip install -e ".[devel,google]"

You can also see where hatch created the virtualenvs and use it in your IDE or activate it manually:

hatch env find airflow-310

You will get a path similar to:

/Users/jarek/Library/Application Support/hatch/env/virtual/apache-airflow/TReRdyYt/apache-airflow

You will find the python binary and the activate script in the bin sub-folder of this directory, and you can configure your IDE to use this Python virtualenv if you want to use that environment in your IDE.
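
For example, to activate that environment manually in your current shell (a sketch combining the command above with the standard virtualenv activate script):

source "$(hatch env find airflow-310)/bin/activate"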

You can also set the default environment name via the HATCH_ENV environment variable.
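
For example (assuming you want airflow-310 as your default environment):

export HATCH_ENV=airflow-310
hatch shell    # enters airflow-310 without needing -e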

You can clean the env by running:

hatch env prune

More information about hatch can be found in Hatch: Environments

## Using Hatch to build your packages

You can use hatch to build an installable package from the Airflow sources. Such a package will include all metadata configured in pyproject.toml and will be installable with pip.

The packages will include dependencies on the providers that are always installed when Airflow is installed from PyPI. By default, both wheel and sdist packages are built.

hatch build

You can also build only wheel or sdist packages:

hatch build -t wheel
hatch build -t sdist
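
By default, hatch places the built packages in the dist/ directory, so you can inspect the results with, for example:

ls dist/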

One of the great benefits of using the local virtualenv and Breeze is the option to run local debugging in your IDE's graphical interface.

When you run example DAGs, even if you run them using unit tests within the IDE, they are run in a separate container. This makes it a little harder to use with IDE built-in debuggers. Fortunately, IntelliJ/PyCharm provides an effective remote debugging feature (but only in paid versions). See additional details on remote debugging.

You can set up your remote debugging session as follows:

Setup remote debugging

Note that on macOS, you have to use a real IP address of your host rather than the default localhost because on macOS the container runs in a virtual machine with a different IP address.

Make sure to configure source code mapping in the remote debugging configuration to map your local sources to the /opt/airflow location of the sources within the container:

Source code mapping

In Airflow 2.0 we introduced the split of Apache Airflow into separate packages - there is one main apache-airflow package with the core of Airflow, and 70+ packages for all providers (external services and software Airflow can communicate with).

When you install Airflow from sources using an editable install, you can develop both the main version of Airflow and providers together, which is pretty convenient, because you can use the same environment for both.

Running pip install -e . will install Airflow in editable mode, and all provider code will also be available in the same environment. However, most providers need some additional dependencies.

You can install the dependencies of the provider you want to develop by installing Airflow in editable mode with the provider id as an extra (with - instead of .). You can see the list of provider extras in the extras reference.

For example, if you want to develop Google provider, you can install it with:

pip install -e ".[devel,google]"

If a provider's name is composed of several segments, you can use - to separate them. You can also install multiple extra dependencies at a time:

pip install -e ".[devel,apache-beam,dbt-cloud]"

The dependencies for providers are configured in the airflow/providers/PROVIDERS_FOLDER/provider.yaml file - separately for each provider. You can find two types of dependencies there - production runtime dependencies, and sometimes devel-dependencies which are needed to run tests. While the provider.yaml file is the single source of truth for the dependencies, eventually they need to find their way into Airflow's pyproject.toml. This is done by running:

pre-commit run update-providers-dependencies --all-files

This will update pyproject.toml with the dependencies from provider.yaml files and from there it will be used automatically when you install Airflow in editable mode.

If you want to add another dependency to a provider, you should add it to the corresponding provider.yaml, run the command above, and commit the changes to pyproject.toml. Then running pip install -e ".[devel,PROVIDER_EXTRA]" will install the new dependencies. Tools like hatch can also install the dependencies automatically when you create or switch to a development environment.
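
For example, the end-to-end flow for adding a new dependency to the google provider might look like this (a sketch; the provider name is just a placeholder):

# 1. add the dependency to airflow/providers/google/provider.yaml
# 2. regenerate pyproject.toml from the provider.yaml files
pre-commit run update-providers-dependencies --all-files
# 3. review and commit the generated changes
git diff pyproject.toml
# 4. re-install Airflow in editable mode so the new dependency is installed
pip install -e ".[devel,google]"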

Whatever virtualenv solution you use, when you want to make sure you are using the same versions of dependencies as in main, you can install the recommended versions of the dependencies by using constraints-<PYTHON_MAJOR_MINOR_VERSION>.txt files as constraint files. This might be useful to avoid the "works-for-me" syndrome, where you use different versions of dependencies than the ones used in main, in CI tests and by other contributors.

There are different constraint files for different Python versions. For example, this command will install all basic devel requirements and the requirements of the google provider as last successfully tested for Python 3.8:

pip install -e ".[devel,google]" \
  --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-main/constraints-source-providers-3.8.txt"

Make sure to use the latest main for such installation - those constraints are "development constraints" and they are refreshed several times a day to make sure they are up to date with the latest changes in the main branch.

Note that this might not always work as expected, because the constraints are not always updated immediately after the dependencies are updated. Sometimes a very recent change (a few hours old, rarely more than a day) is still running in the canary build, and the constraints will not be updated until the canary build succeeds. Usually what works in this case is running your install command without constraints.

You can upgrade just airflow, without paying attention to the providers' dependencies, by using the 'constraints-no-providers' constraint files. This allows you to keep the provider dependencies you already have installed while upgrading to the latest dependencies supported by pure airflow core.

pip install -e ".[devel]" \
  --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-main/constraints-no-providers-3.8.txt"

These are examples of the development options available with the local virtualenv in your IDE:

  • local debugging;
  • Airflow source view;
  • auto-completion;
  • documentation support;
  • unit tests.

This document describes minimum requirements and instructions for using a standalone version of the local virtualenv.

Running tests is described in Testing documentation.

While most of the tests are typical unit tests that do not require external components, there are a number of Integration tests. You can technically use a local virtualenv to run those tests, but it requires setting up all the necessary dependencies for all the providers you are going to test and also setting up databases - and sometimes other external components (for integration tests).

So, generally it should be easier to use the Breeze development environment (especially for Integration tests).

When analyzing the situation, it is helpful to be able to directly query the database. You can do it using the built-in Airflow command (however, you need a CLI client tool installed for the database you have configured):

airflow db shell

The command will explain what CLI tool is needed for the database you have configured.
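
If you use the default SQLite backend, you can also open the metadata database directly - a sketch, assuming the default AIRFLOW_HOME of ~/airflow and the sqlite3 CLI installed:

sqlite3 ~/airflow/airflow.db ".tables"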


As the next step, it is important to learn about Static code checks that are used to automate code quality checks. Your code must pass the static code checks to get merged.