
How does the airflow-init service communicate with the other services? #1

AndrewLaganaro opened this issue Aug 24, 2022 · 4 comments

AndrewLaganaro commented Aug 24, 2022

Since all the services are completely separate containers, how does what the airflow-init service (container) does affect the other two (webserver and scheduler) if they're separate, with each one creating files on its own file system and so on?

airflow-init creates the user and password, right, but how does it make a difference to the others? There's no clear communication between them, their folders, or their files. Even their volumes point to different folders.

How does airflow-init make a difference, since it seems to create a user in a container that never executes Airflow inside it? (On the others this is done by using "command: airflow webserver", "airflow scheduler", etc.)

So how can the other containers know that a default user was created if they're isolated from airflow-init's container and their folders aren't linked by volumes?

hussein-awala (Owner) commented:

All the Airflow services are connected to the Metadata database, which is the brain of Airflow. The Metadata database is responsible for putting together all the information needed for the whole environment; it stores a lot of information, like the configuration of the Airflow environment's roles and permissions, all metadata for past and present DAGs (serialized DAGs), runs, and tasks, and other information like XCom messages.

Indeed, airflow-init initializes the Metadata database using the command airflow db init (creating the Airflow tables if they don't exist), then it creates a user in the database to use for logging in to the Airflow webserver, so nothing is written to a local file.
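
For reference, the init job in a compose file typically looks something like this (a sketch only; the image tag, usernames, and passwords here are illustrative, not necessarily what this repo uses):

```yaml
# Sketch of the init job pattern (illustrative values, not copied from this repo)
services:
  airflow-init:
    image: apache/airflow:2.3.3        # assumed tag; any Airflow 2.x image works the same way
    env_file: .env                     # same DB connection URL as the other services
    entrypoint: /bin/bash
    command:
      - -c
      - |
        airflow db init
        airflow users create \
          --username admin --password admin \
          --firstname Admin --lastname User \
          --role Admin --email admin@example.com
```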

The scheduler processes the DAG script files every dag_dir_list_interval (5 minutes by default, and you can change it) to serialize them and add them to the Metadata database. Then it schedules runs and tasks based on parallelism and dependency configurations by inserting records into the dag_run and task_instance tables, and the executor (a process in the same container, because we use LocalExecutor) executes the tasks and updates their state in the same tables to inform the scheduler process.
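
If you want the DAG folder re-parsed more or less often, dag_dir_list_interval is just a config value; for example it can be overridden with an environment variable on the scheduler service (a sketch; 300 seconds is simply the default):

```yaml
# Sketch: tuning the DAG folder scan interval (300 s = 5 min is the default)
services:
  airflow-scheduler:
    environment:
      AIRFLOW__SCHEDULER__DAG_DIR_LIST_INTERVAL: "300"
      AIRFLOW__CORE__EXECUTOR: LocalExecutor   # the executor runs inside the scheduler container
```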

Of course the webserver is connected to the same DB, so you can see all the changes to this DB in the Airflow UI: the webserver queries the Metastore every second to update the DAG grid and changes the task colors based on the state stored in the DB.

[Image: airflow_local_executor diagram]

AndrewLaganaro commented Aug 24, 2022

Thanks for your quick answer! Really informative!
I figured out by myself that the DB was being used as a bridge between the three services when looking carefully at the file yesterday. I noticed that the postgres service wouldn't be listed there for no reason, and then, looking at the .env file that holds the connection URL to Postgres and is used as the env file in every other service, everything made sense:

  • All services connect to Postgres when they start, using their internal Docker network and the connection URL in the env file (roughly the wiring sketched below)
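
Roughly the wiring I mean (just a sketch to illustrate; the connection string and service names are my guesses, not copied from the repo):

```yaml
# Sketch: every Airflow service loads the same .env file, which points SQLAlchemy
# at the postgres service over the compose network (illustrative values).
services:
  postgres:
    image: postgres:13
  airflow-webserver:
    env_file: .env
  airflow-scheduler:
    env_file: .env
  airflow-init:
    env_file: .env
# where .env contains something like:
#   AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:airflow@postgres/airflow
#   (older Airflow versions use AIRFLOW__CORE__SQL_ALCHEMY_CONN instead)
```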

But a question still remains: why make a whole container with a whole Airflow image just to create ONE admin user?

I did test this config yesterday and everything ran fine, but I noticed that the init service went down right after finishing what it was supposed to do. Why bring up a container just to drop it later on?

Why wouldn't I do what the init service does (setting the user's env variables, calling the airflow db init command) in the containers that will actually be used and run as Airflow? Wouldn't it waste space and processing to do something just to discard it right after? I don't get the need for this. Why spin up three Airflow apps if one will always be dropped?

(To be clear, I do get the importance of separating the webserver and scheduler; I just don't get the need for the init one if you could do it in the others too.)

Thanks again!

hussein-awala (Owner) commented:

Init containers/steps are very important in DevOps, and there are several reasons to use one in this project:

  • This project is developed to help teams create a testing/development environment. To maintain consistency between the dev and prod environments, we use the same Docker image in both, and since we don't need the init job in the prod environment (we init the database once), we should avoid adding unnecessary processes to the image.
  • Another reason is to init the database before starting the scheduler and webserver services; we achieve that by adding a service_completed_successfully dependency condition (sketched after this list), but if we ran the init in one of those services, the dependency rule would be more complicated.
  • Also, in my init container I have a small script to execute, but that is not always the case: maybe we use a secret manager and need to load some secrets without providing the credentials to the other services.
  • There is a difference between a service and a job, and Docker can do both. The scheduler and webserver are services which should stay running to serve user requests, while init db is a job we run one time before starting the services, so the container is not discarded; it simply finished the job we created it for.
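
To illustrate the second point, the dependency wiring looks roughly like this (a sketch; service names are illustrative):

```yaml
# Sketch: the webserver and scheduler only start once airflow-init has exited with code 0.
services:
  airflow-init:
    command: bash -c "airflow db init"
  airflow-webserver:
    command: airflow webserver
    depends_on:
      airflow-init:
        condition: service_completed_successfully
  airflow-scheduler:
    command: airflow scheduler
    depends_on:
      airflow-init:
        condition: service_completed_successfully
```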

AndrewLaganaro (Author) commented:

Thanks, you made it clear for me. Again, I tested your file here and everything ran fine. I even made some modifications to fit in MinIO and a MySQL instance, which simulate external cloud services: MinIO standing in for AWS S3 and MySQL for an external DB of any sort.

When I finish this study I'll upload it to my GitHub too, and post the link here :) Your file and explanations really helped on this starting path.
