Should all Elasticsearch/ingestion-server related DAGs belong to the same data_refresh pool?
#2420
Replies: 1 comment 4 replies
Thanks for bringing this up for discussion! I've been pondering this a bit. The trouble with pools is that they apply per task, not per DAG. This means that as soon as a task completes, its pool slot opens up and any other task assigned to that pool can run. Paired with the fact that all of our sensors (including the indexing-related ones) are on [...]
This can be mitigated a bit by priority weights: we could give the data refresh a higher priority weight so that the entire DAG completes before tasks from other DAGs are allowed to run. This could still land us in a scenario where a lower-priority DAG (e.g. index settings) is running and a sensor is waiting for the reindex to complete, and then a scheduled data refresh starts in the middle of that operation. Based on priority weights, the data refresh DAG would still take priority, so it would execute from start to finish while the other reindex operation is occurring.
Perhaps we can use an Airflow Variable to record whether there's an ongoing Elasticsearch operation, and have each DAG that needs to perform ES operations check that value before proceeding. That might be the easiest way to manage concurrency across all DAGs, along with priority weight and pool usage.
It seems DAG-level concurrency is definitely a tricky problem to manage!
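As a rough illustration of the Variable-based check suggested above, here is a minimal sketch. It uses a plain dict as a stand-in for Airflow's Variable store so that it runs without Airflow installed; in a real DAG the reads and writes would presumably go through `Variable.get` and `Variable.set`. The key name, the helper names `try_acquire` and `release`, and the DAG ids are all hypothetical and not taken from the Openverse codebase.

```python
# Hypothetical key name for the shared flag.
ES_OPERATION_KEY = "es_operation_in_progress"


def try_acquire(store, dag_id):
    """Claim the flag for dag_id unless another DAG already holds it."""
    holder = store.get(ES_OPERATION_KEY)
    if holder not in (None, dag_id):
        return False  # Another DAG is running an Elasticsearch operation.
    # Caveat: check-then-set is not atomic, so two DAGs could still race here.
    store[ES_OPERATION_KEY] = dag_id
    return True


def release(store, dag_id):
    """Clear the flag, but only if dag_id is the current holder."""
    if store.get(ES_OPERATION_KEY) == dag_id:
        del store[ES_OPERATION_KEY]


# A plain dict stands in for the Airflow Variable store.
store = {}
first = try_acquire(store, "data_refresh")     # flag is free, so this succeeds
second = try_acquire(store, "index_settings")  # refused while data_refresh holds it
release(store, "data_refresh")
third = try_acquire(store, "index_settings")   # succeeds once the flag is cleared
```

Because the check and the write are separate steps, this only narrows the window for overlapping operations; it would likely still need to be combined with pool and priority weight settings, as the comment suggests.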
We have two (soon to be three) DAGs that can trigger an indexing operation in production Elasticsearch:
The production `data_refresh` pool is set to 1 to prevent multiple data refreshes from happening at a single time.
We've implemented relatively complex concurrency guards in these DAGs to prevent them from happening at the same time as other indexing operations. Would we be able to simplify this by treating anything that can cause an indexing operation as part of the same pool? Because DAGs can be queued to the pool and will wait until a slot is available, I don't think we'd need to worry about things like the filtered index creation triggered by the data refresh DAG overlapping.
One thing we would need to consider, however, is how to correctly prioritise DAGs in the pool. I think the order should go something like:
However, I'm not sure if Airflow makes it possible to consider different weights per run. For example, manual runs of the filtered index creation DAG should not precede runs triggered by the data refresh DAG. If there isn't a way to ensure that the data refresh-related runs of that DAG are always the next thing to run (even before another data refresh), then this might not be possible.
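A toy model can make the prioritisation question more concrete. Airflow pool slots are held by individual task instances rather than by whole DAG runs, so a one-slot pool can still interleave tasks from different DAGs. The sketch below is a simplified simulation, not Airflow's scheduler: each task holds the slot for one tick, each DAG is a linear chain of tasks, and the highest priority weight among queued tasks wins. The function name, DAG ids, task names, and weights are illustrative assumptions.

```python
import heapq
from itertools import count


def run_single_slot_pool(dags, arrivals):
    """Return the order in which a one-slot pool runs tasks.

    dags: dag_id -> (priority_weight, [task names, run in order])
    arrivals: dag_id -> tick at which the DAG's first task is queued
    """
    queue = []  # heap entries: (-priority_weight, queue order, dag_id, task index)
    seq = count()
    order = []
    remaining = sum(len(tasks) for _, tasks in dags.values())
    tick = 0
    while remaining:
        # Queue the first task of every DAG that starts on this tick.
        for dag_id, start in arrivals.items():
            if start == tick:
                heapq.heappush(queue, (-dags[dag_id][0], next(seq), dag_id, 0))
        if queue:
            # The single slot goes to the highest-weight queued task.
            _, _, dag_id, index = heapq.heappop(queue)
            weight, tasks = dags[dag_id]
            order.append(f"{dag_id}.{tasks[index]}")
            remaining -= 1
            # The slot is released; the DAG's next task must queue again.
            if index + 1 < len(tasks):
                heapq.heappush(queue, (-weight, next(seq), dag_id, index + 1))
        tick += 1
    return order


dags = {
    "data_refresh": (10, ["reindex", "wait", "promote"]),
    "index_settings": (1, ["start_reindex", "wait_for_reindex", "promote"]),
}
# The lower-priority DAG starts first; the data refresh arrives one tick later.
preempted = run_single_slot_pool(dags, {"index_settings": 0, "data_refresh": 1})

# With equal weights, the two DAGs simply take turns on the slot.
equal = {dag_id: (1, tasks) for dag_id, (_, tasks) in dags.items()}
interleaved = run_single_slot_pool(equal, {"index_settings": 0, "data_refresh": 0})
```

In the first run, the data refresh takes the slot as soon as the other DAG's first task finishes and then runs to completion before `index_settings` resumes. That suggests a shared pool plus static weights would not, on its own, stop an operation from starting in the middle of another DAG's run.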
cc @WordPress/openverse-catalog