Merge pull request #21 from datacoves/mp_migration
Mp update getting started
Showing 24 changed files with 412 additions and 102 deletions.
@@ -1,20 +1,25 @@
# Configuring Airflow
You don't need Airflow to begin using Datacoves, but at some point you will want to schedule your dbt jobs. The following steps will help you get started using Airflow. Keep in mind this is the basic setup; you can find additional Airflow information in the how-tos and reference sections.
1. To complete the initial configuration of Airflow, you will need to make changes to your project. This includes creating the dbt profile for Airflow to use, as well as the Airflow DAG files that will schedule your dbt runs.

[Initial Airflow Setup](how-tos/airflow/initial-setup)

2. Airflow will authenticate to your data warehouse using a service connection. The credentials defined here will be used by dbt when your jobs run.

[Setup Service Connection](how-tos/datacoves/how_to_service_connections.md)
3. Datacoves uses a specific [folder structure](explanation/best-practices/datacoves/folder-structure.md) for Airflow. You will need to add some folders and files to your repository for Airflow to function as expected.

[Update Repository](getting-started/Admin/configure-repository.md)

4. When Airflow jobs run you may want to receive notifications. We have a few ways to send notifications in Datacoves. Choose the option that makes sense for your use case.

- **Email:** [Setup Email Integration](how-tos/airflow/send-emails)
- **MS Teams:** [Setup MS Teams Integration](how-tos/airflow/send-ms-teams-notifications)
- **Slack:** [Setup Slack Integration](how-tos/airflow/send-slack-notifications)
## Getting Started Next Steps
Once Airflow is configured, you can begin scheduling your dbt jobs by [creating Airflow DAGs](getting-started/Admin/creating-airflow-dags.md)!
@@ -0,0 +1,93 @@
# Update Repository for Airflow
Now that you have configured your Airflow settings, you must ensure that your repository has the correct folder structure so Airflow can pick up the DAGs you create. You will need to add folders to your project repository to match the folder defaults we just configured for Airflow. These folders are `orchestrate/dags` and, optionally, `orchestrate/dags_yml_definitions`.
**Step 1:** Add a folder named `orchestrate`, and a folder named `dags` inside `orchestrate`. `orchestrate/dags` is where you will place your DAGs, as defined earlier in the Airflow settings by the `Python DAGs path` field.
**Step 2:** **Only if using Git Sync.** If you have not already done so, create a branch named `airflow_development` from `main`. This branch was defined as the sync branch earlier in the Airflow settings with the `Git branch name` field. Best practice is to keep this branch up to date with `main`.
**Step 3 (optional):** If you would like to make use of the [dbt-coves](https://github.com/datacoves/dbt-coves?tab=readme-ov-file#airflow-dags-generation-arguments) `dbt-coves generate airflow-dags` command, create the `dags_yml_definitions` folder inside your newly created `orchestrate` folder. This will leave you with two folders inside `orchestrate`: `orchestrate/dags` and `orchestrate/dags_yml_definitions`.
**Step 4 (optional):** To use the `dbt-coves generate airflow-dags` command, you must also create a config file for dbt-coves. Please follow the [generate DAGs from yml](how-tos/airflow/generate-dags-from-yml.md) docs.
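If you prefer to script steps 1 and 3 rather than create the folders by hand, a few lines of Python will do it. This is a minimal sketch; the script name is arbitrary, and it assumes you run it from the root of your repository.

```python
# create_airflow_folders.py - minimal sketch; run from the root of your repository.
from pathlib import Path

# Required: where Airflow will look for your Python DAGs (the `Python DAGs path` field).
Path("orchestrate/dags").mkdir(parents=True, exist_ok=True)

# Optional: only needed if you plan to use `dbt-coves generate airflow-dags`.
Path("orchestrate/dags_yml_definitions").mkdir(parents=True, exist_ok=True)
```

Keep in mind that git does not track empty folders, so add a placeholder file (for example `.gitkeep`) or your first DAG before committing.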
## Create a profiles.yml
When you created a service connection, [environment variables](reference/vscode/datacoves-env-vars.md#warehouse-environment-variables) for your warehouse credentials were created for use in your `profiles.yml` file; using them allows you to safely commit that file with git. The available environment variables will vary based on your data warehouse. We have made this simple to set up; complete the following steps.
To create your `profiles.yml`:
**Step 1:** Create the `automate` folder at the root of your project.

**Step 2:** Create the `dbt` folder inside the `automate` folder.

**Step 3:** Create the `profiles.yml` file inside your `automate/dbt` folder, ie) `automate/dbt/profiles.yml`.

**Step 4:** Copy the configuration for your warehouse below into your `profiles.yml`.
### Snowflake
```yaml
default:
  target: default_target
  outputs:
    default_target:
      type: snowflake
      threads: 8
      client_session_keep_alive: true

      account: "{{ env_var('DATACOVES__MAIN__ACCOUNT') }}"
      database: "{{ env_var('DATACOVES__MAIN__DATABASE') }}"
      schema: "{{ env_var('DATACOVES__MAIN__SCHEMA') }}"
      user: "{{ env_var('DATACOVES__MAIN__USER') }}"
      password: "{{ env_var('DATACOVES__MAIN__PASSWORD') }}"
      role: "{{ env_var('DATACOVES__MAIN__ROLE') }}"
      warehouse: "{{ env_var('DATACOVES__MAIN__WAREHOUSE') }}"
```
### Redshift
```yaml
company-name:
  target: dev
  outputs:
    dev:
      type: redshift
      host: "{{ env_var('DATACOVES__MAIN__HOST') }}"
      user: "{{ env_var('DATACOVES__MAIN__USER') }}"
      password: "{{ env_var('DATACOVES__MAIN__PASSWORD') }}"
      dbname: "{{ env_var('DATACOVES__MAIN__DATABASE') }}"
      schema: analytics
      port: 5439
```
### BigQuery
```yaml
my-bigquery-db:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: service-account
      project: GCP_PROJECT_ID
      dataset: "{{ env_var('DATACOVES__MAIN__DATASET') }}"
      threads: 4 # Must be a value of 1 or greater
      keyfile: "{{ env_var('DATACOVES__MAIN__KEYFILE_JSON') }}"
```
### Databricks
```yaml
your_profile_name:
  target: dev
  outputs:
    dev:
      type: databricks
      catalog: [optional catalog name if you are using Unity Catalog]
      schema: "{{ env_var('DATACOVES__MAIN__SCHEMA') }}" # Required
      host: "{{ env_var('DATACOVES__MAIN__HOST') }}" # Required
      http_path: "{{ env_var('DATACOVES__MAIN__HTTP_PATH') }}" # Required
      token: "{{ env_var('DATACOVES__MAIN__TOKEN') }}" # Required: Personal Access Token (PAT) if using token-based authentication
      threads: 4
```
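If a job later fails because dbt cannot find its credentials, a quick sanity check is to confirm that the warehouse environment variables are present. This is a minimal sketch, assuming the Snowflake variable names shown above; the names will differ for other warehouses.

```python
# check_env_vars.py - minimal sketch to confirm warehouse env vars are set.
# Variable names assume the Snowflake example above; adjust for your warehouse.
import os

REQUIRED_VARS = [
    "DATACOVES__MAIN__ACCOUNT",
    "DATACOVES__MAIN__DATABASE",
    "DATACOVES__MAIN__SCHEMA",
    "DATACOVES__MAIN__USER",
    "DATACOVES__MAIN__PASSWORD",
    "DATACOVES__MAIN__ROLE",
    "DATACOVES__MAIN__WAREHOUSE",
]

missing = [name for name in REQUIRED_VARS if not os.environ.get(name)]
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
else:
    print("All warehouse environment variables are set.")
```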
## Getting Started Next Steps
You will want to set up notifications. Select the option that works best for your organization.
- **Email:** [Setup Email Integration](how-tos/airflow/send-emails)
- **MS Teams:** [Setup MS Teams Integration](how-tos/airflow/send-ms-teams-notifications)
- **Slack:** [Setup Slack Integration](how-tos/airflow/send-slack-notifications)
@@ -1,18 +1,25 @@
# Creating Airflow DAGs

## Pre-Requisites
By now you should have:
- [Configured Airflow](getting-started/Admin/configure-airflow.md) in Datacoves
- [Updated your repo](getting-started/Admin/configure-repository.md) to include the `automate/dbt/profiles.yml` file and the `orchestrate/dags` folder
- [Set up notifications](how-tos/airflow/send-emails.md) for Airflow

See the recommended [folder structure](explanation/best-practices/datacoves/folder-structure.md) if you have not completed these steps.

## Where to create your DAGs
With Airflow fully configured, we can turn our attention to creating DAGs! Airflow uses DAGs to run dbt as well as other orchestration tasks. Below are the important things to know when creating DAGs and running dbt with Airflow.

During the Airflow configuration step you added the `orchestrate` folder and the `dags` folder to your repository. This is where you will store your Airflow DAGs, ie) you will be writing your Python files in `orchestrate/dags`.

## DAG 101 in Datacoves
1. If you are eager to see Airflow and dbt in action within Datacoves, here is the simplest way to run dbt with Airflow.

[Run dbt](how-tos/airflow/run-dbt)

2. You have two options when it comes to writing DAGs in Datacoves. You can write them in Python and place them in the `orchestrate/dags` directory, or you can generate your DAGs with `dbt-coves` from a YML definition.

[Generate DAGs from yml definitions](how-tos/airflow/generate-dags-from-yml) - this is simpler for users not accustomed to using Python.

3. You may also wish to use external libraries in your DAGs, such as Pandas. To do that effectively, you can create custom Python scripts in a separate directory such as `orchestrate/python_scripts` and use the `DatacovesBashOperator` to handle the behind-the-scenes work as well as run your custom script (see the sketch after this list). **You will need to contact us beforehand to pre-configure any Python libraries you need.**

[External Python DAG](how-tos/airflow/external-python-dag)
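For orientation, here is a minimal sketch of what a file in `orchestrate/dags` might look like, combining the ideas above: a dbt task and a custom-script task. The `DatacovesDbtOperator` usage mirrors the sample DAG in the dynamic scheduling how-to below; the `DatacovesBashOperator` call, the script path `orchestrate/python_scripts/my_script.py`, the file name, and the placeholder email are illustrative assumptions, so check the linked how-tos for the options your account supports.

```python
# orchestrate/dags/sample_getting_started_dag.py - minimal sketch only;
# adapt names, schedule, and emails to your project.
from pendulum import datetime

from airflow.decorators import dag
from operators.datacoves.dbt import DatacovesDbtOperator    # usage shown in these docs
from operators.datacoves.bash import DatacovesBashOperator  # assumed to accept bash_command


@dag(
    default_args={
        "start_date": datetime(2024, 1, 1),
        "owner": "Your Name",
        "email": "[email protected]",  # Replace with your notification email
        "email_on_failure": True,
    },
    catchup=False,
    schedule_interval="0 1 * * *",  # Daily at 1:00 UTC; replace with your schedule
    tags=["getting_started"],
    description="Sample getting-started DAG (sketch)",
)
def sample_getting_started_dag():
    # Run dbt using the service connection credentials configured earlier.
    run_dbt = DatacovesDbtOperator(
        task_id="run_dbt",
        bash_command="dbt run -s personal_loans",
    )

    # Hypothetical custom script task; assumes the script exists and that any
    # extra Python libraries were pre-configured by Datacoves.
    run_script = DatacovesBashOperator(
        task_id="run_custom_script",
        bash_command="python orchestrate/python_scripts/my_script.py",
    )

    run_dbt >> run_script


# Invoke DAG
dag = sample_getting_started_dag()
```

If the custom-script task is not needed, the dbt task alone is the simplest possible DAG.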
@@ -0,0 +1,80 @@
# How to Dynamically Set the Schedule Interval
The default schedule for DAG development is `paused`. However, there may be scenarios where this default configuration doesn't align with your requirements. For instance, you might forget to add or adjust the schedule interval before deploying to production, leading to unintended behavior.

To mitigate such risks, a practical approach is to configure the schedule dynamically according to the environment (development or production). This can be done by implementing a function named `get_schedule` that determines the appropriate schedule based on the current environment, ensuring that DAGs operate correctly across different stages of deployment.

Here is how to achieve this:
**Step 1:** Create a `get_schedule.py` file inside `orchestrate/python_scripts` (this matches the import path used in the example DAG below).
**Step 2:** Paste the following code.
Note: Find your environment slug [here](reference/admin-menu/environments.md)
```python
# get_schedule.py
import os
from typing import Union

DEV_ENVIRONMENT_SLUG = "dev123"  # Replace with your environment slug


def get_schedule(default_input: Union[str, None]) -> Union[str, None]:
    """
    Sets the DAG's schedule based on the current environment. This allows you to
    default to no schedule in dev and to the given default schedule in production.

    This function checks the Datacoves slug in the 'DATACOVES__ENVIRONMENT_SLUG'
    environment variable to determine if it is running in a specific environment
    (e.g., 'dev123'). If it is running in the 'dev123' environment, no schedule
    should be used, so the function returns None. For all other environments, the
    function returns the given 'default_input' as the schedule.

    Parameters:
    - default_input (Union[str, None]): The default schedule to return if not
      running in the dev environment.

    Returns:
    - Union[str, None]: The default schedule if the environment is not 'dev123';
      otherwise None, indicating that no schedule should be used in the dev
      environment.
    """
    env_slug = os.environ.get("DATACOVES__ENVIRONMENT_SLUG", "").lower()
    if env_slug == DEV_ENVIRONMENT_SLUG:
        return None
    else:
        return default_input
```
**Step 3:** In your DAG, import the `get_schedule` function using `from orchestrate.python_scripts.get_schedule import get_schedule` and pass in your desired schedule.

ie) If your desired schedule is `'0 1 * * *'`, then you will set `schedule_interval=get_schedule('0 1 * * *')`, as seen in the example below.
```python
from airflow.decorators import dag
from operators.datacoves.bash import DatacovesBashOperator
from operators.datacoves.dbt import DatacovesDbtOperator
from pendulum import datetime

from orchestrate.python_scripts.get_schedule import get_schedule


@dag(
    default_args={
        "start_date": datetime(2022, 10, 10),
        "owner": "Noel Gomez",
        "email": "[email protected]",
        "email_on_failure": True,
    },
    catchup=False,
    tags=["version_8"],
    description="Datacoves Sample dag",
    # This is a regular CRON schedule. Helpful resources:
    # https://cron-ai.vercel.app/
    # https://crontab.guru/
    schedule_interval=get_schedule('0 1 * * *'),  # Replace with desired schedule
)
def datacoves_sample_dag():
    # Calling dbt commands
    dbt_task = DatacovesDbtOperator(
        task_id="run_dbt_task",
        bash_command="dbt debug",
    )


# Invoke DAG
dag = datacoves_sample_dag()
```
@@ -3,17 +3,19 @@
You have the option to write out your DAGs in Python, or you can write them using yml and then have dbt-coves generate the Python DAG for you.
## Configure config.yml
This configuration is for the `dbt-coves generate airflow-dags` command, which generates the DAGs from your yml files. Visit the [dbt-coves docs](https://github.com/datacoves/dbt-coves?tab=readme-ov-file#settings) for the full dbt-coves configuration settings.

dbt-coves will read settings from `<dbt_project_path>/.dbt_coves/config.yml`. We must create these files in order for dbt-coves to function.

**Step 1:** Create the `.dbt-coves` folder at the root of your dbt project (where the `dbt_project.yml` file is located). Then create a file called `config.yml` inside of `.dbt-coves`.

>[!NOTE]Datacoves' recommended dbt project location is `transform/`, eg) `transform/.dbt-coves/config.yml`. This will require some minor refactoring and ensuring that the `dbt project path` in your environment settings reflects this.

**Step 2:** Set the following paths. We use environment variables such as `DATACOVES__AIRFLOW_DAGS_YML_PATH` that are pre-configured for you; for more information on these variables see [Datacoves Environment Variables](reference/vscode/datacoves-env-vars.md).
- `yml_path`: This is where dbt-coves will look for the yml files used to generate your Python DAGs.
- `dags_path`: This is where dbt-coves will place your generated Python DAGs.

Place the following in your `config.yml` file:
```yml
generate:
...
```
@@ -30,31 +32,34 @@ generate:
## Create the yml file for your Airflow DAG

dbt-coves will look for your yml inside your `orchestrate/dags_yml_definitions` folder to generate your Python DAGs. Please create these folders if you have not already done so.

>[!NOTE]When you create a DAG with YAML, the name of the file will be the name of the DAG.
eg) `yml_dbt_dag.yml` generates a dag named `yml_dbt_dag`

Let's create our first DAG using YAML.

**Step 1:** Create a new file named `my_first_yml.yml` in your `orchestrate/dags_yml_definitions` folder.

**Step 2:** Add the following YAML to your file, and be sure to change the placeholder values called out in the comments (owner and failure-notification email).
```yml
description: "Sample DAG for dbt build"
schedule_interval: "0 0 1 */12 *"
tags:
  - version_2
default_args:
  start_date: 2023-01-01
  owner: Noel Gomez # Replace this with your name
  email: [email protected] # Replace with the email of the recipient for failures
  email_on_failure: true
catchup: false

nodes:
  run_dbt:
    type: task
    operator: operators.datacoves.dbt.DatacovesDbtOperator
    bash_command: "dbt run -s personal_loans"
```
>[!TIP]In the examples we make use of the Datacoves Operators, which handle things like copying and running dbt deps. For more information on what these operators handle, see [Datacoves Operators](reference/airflow/datacoves-operator.md)