Merge pull request #21 from datacoves/mp_migration

Mp update getting started

noel authored May 4, 2024
2 parents bc8aa9d + 8f361ed commit f65dfb4
Showing 24 changed files with 412 additions and 102 deletions.
2 changes: 2 additions & 0 deletions docs/_sidebar.md
@@ -18,6 +18,7 @@
- [Generate DAGs from yml](/how-tos/airflow/generate-dags-from-yml.md)
- [Calling External Python Scripts](/how-tos/airflow/external-python-dag.md)
- [Use Variables and Connections](/how-tos/airflow/use-variables-and-connections.md)
- [Dynamically Set Schedule](/how-tos/airflow/dynamically-set-schedule.md)
- [Run Airbyte sync jobs](/how-tos/airflow/run-airbyte-sync-jobs.md)
- [Run Fivetran sync jobs](/how-tos/airflow/run-fivetran-sync-jobs.md)
- [Add Dag Documentation](how-tos/airflow/create-dag-level-docs.md)
@@ -44,6 +45,7 @@
- [Configure Projects](how-tos/datacoves/how_to_projects.md)
- [Configure Service Connections](how-tos/datacoves/how_to_service_connections.md)
- [Manage Users](/how-tos/datacoves/how_to_manage_users.md)
- [Update Repository](getting-started/Admin/configure-repository.md)
- [Superset](/how-tos/superset/)
- [Add a Database](how-tos/superset/how_to_database.md)
- [Add a Data Set](how-tos/superset/how_to_data_set.md)
11 changes: 8 additions & 3 deletions docs/getting-started/Admin/configure-airflow.md
@@ -1,20 +1,25 @@
# Configuring Airflow
You don't need Airflow to begin using Datacoves, but at some point you will want to schedule your dbt jobs. The following steps will help you get started using Airflow. Keep in mind this is the basic setup; you can find additional Airflow information in the how-tos and reference sections.

1. Start with the initial configuration of Airflow in your Datacoves environment. You may need to make changes to your repository to have the Datacoves default dbt profiles path and Airflow DAG path.
1. To complete the initial configuration of Airflow, you will need to make changes to your project. This includes creating the dbt profile for Airflow to use as well as the Airflow DAG files that will schedule your dbt runs.

[Initial Airflow Setup](how-tos/airflow/initial-setup)

2. Airflow will authenticate to your data warehouse using a service connection. The credentials defined here will be used by dbt when your jobs run.

[Setup Service Connection](how-tos/datacoves/how_to_service_connections.md)

3. When Airflow jobs run you may want to receive notifications. We have a few ways to send notifications in Datacoves.
3. Datacoves uses a specific [folder structure](explanation/best-practices/datacoves/folder-structure.md) for Airflow. You will need to add some folders and files to your repository for Airflow to function as expected.

[Update Repository](getting-started/Admin/configure-repository.md)

4. When Airflow jobs run you may want to receive notifications. We have a few ways to send notifications in Datacoves. Choose the option that makes sense for your use case.

- **Email:** [Setup Email Integration](how-tos/airflow/send-emails)

- **MS Teams:** [Setup MS Teams Integration](how-tos/airflow/send-ms-teams-notifications)

- **Slack:** [Setup Slack Integration](how-tos/airflow/send-slack-notifications)

Once Airflow is configured, you can begin scheduling your dbt jobs by creating Airflow DAGs!
## Getting Started Next Steps
Once Airflow is configured, you can begin scheduling your dbt jobs by [creating Airflow DAGs](getting-started/Admin/creating-airflow-dags.md)!
93 changes: 93 additions & 0 deletions docs/getting-started/Admin/configure-repository.md
@@ -0,0 +1,93 @@
# Update Repository for Airflow

Now that you have configured your Airflow settings, you must ensure that your repository has the correct folder structure so Airflow can pick up the DAGs you create. You will need to add folders to your project repository to match the folder defaults you just configured for Airflow. These folders are `orchestrate/dags` and, optionally, `orchestrate/dags_yml_definitions`.

**Step 1:** Add a folder named `orchestrate` and, inside `orchestrate`, a folder named `dags`. `orchestrate/dags` is where you will place your DAGs, as defined earlier in your Airflow settings with the `Python DAGs path` field. (A scripted alternative for creating these folders is sketched after these steps.)

**Step 2:** **ONLY if using Git Sync.** If you have not already done so, create a branch named `airflow_development` from `main`. This branch was defined as the sync branch earlier in your Airflow settings with the `Git branch name` field. Best practice is to keep this branch up to date with `main`.

**Step 3:** **This step is optional.** If you would like to make use of the [dbt-coves](https://github.com/datacoves/dbt-coves?tab=readme-ov-file#airflow-dags-generation-arguments) `dbt-coves generate airflow-dags` command, create the `dags_yml_definitions` folder inside of your newly created `orchestrate` folder. This will leave you with two folders inside `orchestrate`: `orchestrate/dags` and `orchestrate/dags_yml_definitions`.

**Step 4:** **This step is optional.** If you would like to make use of the dbt-coves `dbt-coves generate airflow-dags` command, you must also create a config file for dbt-coves. Please follow the [generate DAGs from yml](how-tos/airflow/generate-dags-from-yml.md) docs.
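
If you prefer to script the folder creation rather than create the folders by hand, a minimal sketch like the one below works from the repository root. The folder names come from the steps above; the `.gitkeep` placeholder files are an assumption, included only because git does not track empty folders.

```python
# Sketch: create the Airflow folders described above from the repository root.
from pathlib import Path

folders = [
    "orchestrate/dags",                  # required: where your Python DAGs live
    "orchestrate/dags_yml_definitions",  # optional: only if using dbt-coves generate airflow-dags
]

for folder in folders:
    path = Path(folder)
    path.mkdir(parents=True, exist_ok=True)
    # Git does not track empty folders, so drop a placeholder file in each one.
    (path / ".gitkeep").touch()
```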

## Create a profiles.yml

When you created a service connection, [environment variables](reference/vscode/datacoves-env-vars.md#warehouse-environment-variables) for your warehouse credentials were created for use in your `profiles.yml` file, allowing you to safely commit that file with git. The available environment variables will vary based on your data warehouse. We have made it simple to set this up; just complete the following steps.
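
If you want to confirm that these variables are available in your workspace before referencing them in `profiles.yml`, a quick check like the hedged sketch below can help. The variable names shown are illustrative, taken from the Snowflake example further down; yours will depend on your warehouse and connection.

```python
# Sketch: confirm the service-connection environment variables exist in your workspace.
import os

# Example variable names for a Snowflake connection; adjust for your warehouse.
expected = [
    "DATACOVES__MAIN__ACCOUNT",
    "DATACOVES__MAIN__USER",
    "DATACOVES__MAIN__PASSWORD",
]

missing = [name for name in expected if not os.environ.get(name)]
if missing:
    print("Missing variables:", ", ".join(missing))
else:
    print("All expected variables are set.")
```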

To create your `profiles.yml`:

**Step 1:** Create the `automate` folder at the root of your project

**Step 2:** Create the `dbt` folder inside the `automate` folder

**Step 3:** Create the `profiles.yml` inside of your `automate/dbt` folder. ie) `automate/dbt/profiles.yml`

**Step 4:** Copy the following configuration into your `profiles.yml`

### Snowflake
```yaml
default:
  target: default_target
  outputs:
    default_target:
      type: snowflake
      threads: 8
      client_session_keep_alive: true

      account: "{{ env_var('DATACOVES__MAIN__ACCOUNT') }}"
      database: "{{ env_var('DATACOVES__MAIN__DATABASE') }}"
      schema: "{{ env_var('DATACOVES__MAIN__SCHEMA') }}"
      user: "{{ env_var('DATACOVES__MAIN__USER') }}"
      password: "{{ env_var('DATACOVES__MAIN__PASSWORD') }}"
      role: "{{ env_var('DATACOVES__MAIN__ROLE') }}"
      warehouse: "{{ env_var('DATACOVES__MAIN__WAREHOUSE') }}"
```
### Redshift
```yaml
company-name:
  target: dev
  outputs:
    dev:
      type: redshift
      host: "{{ env_var('DATACOVES__MAIN__HOST') }}"
      user: "{{ env_var('DATACOVES__MAIN__USER') }}"
      password: "{{ env_var('DATACOVES__MAIN__PASSWORD') }}"
      dbname: "{{ env_var('DATACOVES__MAIN__DATABASE') }}"
      schema: analytics
      port: 5439
```
### BigQuery
```yaml
my-bigquery-db:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: service-account
      project: GCP_PROJECT_ID
      dataset: "{{ env_var('DATACOVES__MAIN__DATASET') }}"
      threads: 4 # Must be a value of 1 or greater
      keyfile: "{{ env_var('DATACOVES__MAIN__KEYFILE_JSON') }}"
```
### Databricks
```yaml
your_profile_name:
  target: dev
  outputs:
    dev:
      type: databricks
      catalog: [optional catalog name if you are using Unity Catalog]
      schema: "{{ env_var('DATACOVES__MAIN__SCHEMA') }}" # Required
      host: "{{ env_var('DATACOVES__MAIN__HOST') }}" # Required
      http_path: "{{ env_var('DATACOVES__MAIN__HTTP_PATH') }}" # Required
      token: "{{ env_var('DATACOVES__MAIN__TOKEN') }}" # Required Personal Access Token (PAT) if using token-based authentication
      threads: 4
```
## Getting Started Next Steps
You will want to set up notifications. Select the option that works best for your organization.
- **Email:** [Setup Email Integration](how-tos/airflow/send-emails)
- **MS Teams:** [Setup MS Teams Integration](how-tos/airflow/send-ms-teams-notifications)
- **Slack:** [Setup Slack Integration](how-tos/airflow/send-slack-notifications)
25 changes: 16 additions & 9 deletions docs/getting-started/Admin/creating-airflow-dags.md
@@ -1,18 +1,25 @@
# Creating Airflow DAGs
Now that Airflow is configured we can turn our attention to creating DAGs which is what airflow uses to run dbt as well as other orchestration tasks. Below are the important things to know when creating DAGs and running dbt with Airflow.

1. In the initial Airflow setup you added the `orchestrate` folder and the `dags` folder to your repository. Here you will store your airflow DAGs. ie) `orchestrate/dags`
## Pre-Requisites
By now you should have:
- [Configured Airflow](getting-started/Admin/configure-airflow.md) in Datacoves
- [Updated your repo](getting-started/Admin/configure-repository.md) to include `automate/dbt/profiles.yml` and `orchestrate/dags` folders
- [Set up notifications](how-tos/airflow/send-emails.md) for Airflow

See the recommended [folder structure](explanation/best-practices/datacoves/folder-structure.md) if you have not completed this step.

2. You have 2 options when it comes to writing DAGs in Datacoves. You can write them out using Python and place them in the `orchestrate/dags` directory, or you can generate your DAGs with `dbt-coves` from a YML definition.

[Generate DAGs from yml definitions](how-tos/airflow/generate-dags-from-yml) this is simpler for users not accustomed to using Python
## Where to create your DAGs
Airflow is now fully configured, so we can turn our attention to creating DAGs! Airflow uses DAGs to run dbt as well as other orchestration tasks. Below are the important things to know when creating DAGs and running dbt with Airflow.

During the Airflow configuration step you added the `orchestrate` folder and the `dags` folder to your repository. This is where you will store your Airflow DAGs. ie) You will be writing your Python files in `orchestrate/dags`

3. Here is the simplest way to run dbt with Airflow.
## DAG 101 in Datacoves
1. If you are eager to see Airflow and dbt in action within Datacoves, here is the simplest way to run dbt with Airflow.

[Run dbt](how-tos/airflow/run-dbt)

4. You may also wish to use external libraries in your DAGs such as Pandas. In order to do that effectively, you can create custom Python scripts in a separate directory such as `orchestrate/python_scripts` and use the `DatacovesBashOperator` to handle all the behind the scenes work as well as run your custom script.
2. You have 2 options when it comes to writing DAGs in Datacoves. You can write them out using Python and place them in the `orchestrate/dags` directory, or you can generate your DAGs with `dbt-coves` from a YML definition.

[Generate DAGs from yml definitions](how-tos/airflow/generate-dags-from-yml). This is simpler for users not accustomed to using Python.

3. You may also wish to use external libraries in your DAGs, such as Pandas. In order to do that effectively, you can create custom Python scripts in a separate directory such as `orchestrate/python_scripts` and use the `DatacovesBashOperator` to handle all the behind-the-scenes work as well as run your custom script (a hedged example is sketched after this list). **You will need to contact us beforehand to pre-configure any Python libraries you need.**

[External Python DAG](how-tos/airflow/external-python-dag)
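
As a hedged illustration of item 3, the sketch below shows what such a DAG could look like. The script path `orchestrate/python_scripts/my_script.py` and the DAG name are hypothetical, and the operator arguments are assumptions based on the examples in these docs; see the [External Python DAG](how-tos/airflow/external-python-dag) how-to for the supported pattern.

```python
from airflow.decorators import dag
from operators.datacoves.bash import DatacovesBashOperator
from pendulum import datetime


@dag(
    default_args={"start_date": datetime(2024, 1, 1)},
    catchup=False,
    schedule_interval=None,  # trigger manually while testing
    tags=["sample"],
    description="Hypothetical DAG that runs a custom Python script",
)
def custom_script_dag():
    # The Datacoves operator handles the behind-the-scenes setup and then
    # runs the command below against your repository.
    run_script = DatacovesBashOperator(
        task_id="run_custom_script",
        bash_command="python orchestrate/python_scripts/my_script.py",  # hypothetical script
    )


dag = custom_script_dag()
```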
80 changes: 80 additions & 0 deletions docs/how-tos/airflow/dynamically-set-schedule.md
@@ -0,0 +1,80 @@
# How to Dynamically Set the Schedule Interval

The default state for DAGs during development is `paused`. However, there may be scenarios where this default configuration doesn't align with your requirements. For instance, you might forget to add or adjust the schedule interval before deploying to production, leading to unintended behaviors.

To mitigate such risks, a practical approach is to dynamically configure the schedule according to the environment — development or production. This can be done by implementing a function named `get_schedule`. This function will determine the appropriate schedule based on the current environment, ensuring that DAGs operate correctly across different stages of deployment.

Here is how to achieve this:

**Step 1:** Create a `get_schedule.py` file inside of `orchestrate/python_scripts`

**Step 2:** Paste the following code:
Note: Find your environment slug [here](reference/admin-menu/environments.md)
```python
# get_schedule.py
import os
from typing import Union

DEV_ENVIRONMENT_SLUG = "dev123"  # Replace with your environment slug


def get_schedule(default_input: Union[str, None]) -> Union[str, None]:
    """
    Sets the DAG's schedule based on the current environment. This allows you to
    default to no schedule in dev and to the given default schedule in prod.

    This function checks the Datacoves slug in the 'DATACOVES__ENVIRONMENT_SLUG' variable
    to determine if the DAG is running in a specific environment (e.g., 'dev123'). If the
    DAG is running in the 'dev123' environment, no schedule should be used, and the
    function returns None. For all other environments, the function returns the given
    'default_input' as the schedule.

    Parameters:
    - default_input (Union[str, None]): The default schedule to return if the DAG is not
      running in the dev environment.

    Returns:
    - Union[str, None]: The default schedule if the environment is not 'dev123'; otherwise,
      None, indicating that no schedule should be used in the dev environment.
    """
    env_slug = os.environ.get("DATACOVES__ENVIRONMENT_SLUG", "").lower()
    if env_slug == DEV_ENVIRONMENT_SLUG:
        return None
    else:
        return default_input
```
**Step 3:** In your DAG, import the `get_schedule` function using `from orchestrate.python_scripts.get_schedule import get_schedule` and pass in your desired schedule.

ie) If your desired schedule is `'0 1 * * *'` then you will set `schedule_interval=get_schedule('0 1 * * *')` as seen in the example below.
```python
from airflow.decorators import dag
from operators.datacoves.bash import DatacovesBashOperator
from operators.datacoves.dbt import DatacovesDbtOperator
from pendulum import datetime

from orchestrate.python_scripts.get_schedule import get_schedule


@dag(
    default_args={
        "start_date": datetime(2022, 10, 10),
        "owner": "Noel Gomez",
        "email": "[email protected]",
        "email_on_failure": True,
    },
    catchup=False,
    tags=["version_8"],
    description="Datacoves Sample dag",
    # This is a regular CRON schedule. Helpful resources
    # https://cron-ai.vercel.app/
    # https://crontab.guru/
    schedule_interval=get_schedule("0 1 * * *"),  # Replace with desired schedule
)
def datacoves_sample_dag():
    # Calling dbt commands
    dbt_task = DatacovesDbtOperator(
        task_id="run_dbt_task",
        bash_command="dbt debug",
    )


# Invoke DAG
dag = datacoves_sample_dag()
```
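
As a quick sanity check, you can exercise `get_schedule` outside of Airflow. This hedged sketch assumes you run it from the repository root and that your dev slug is `dev123`; the `prd456` slug is made up for illustration.

```python
# Sketch: verify get_schedule behavior locally (run from the repository root).
import os

from orchestrate.python_scripts.get_schedule import get_schedule

os.environ["DATACOVES__ENVIRONMENT_SLUG"] = "dev123"   # simulate the dev environment
assert get_schedule("0 1 * * *") is None               # dev: DAG gets no schedule

os.environ["DATACOVES__ENVIRONMENT_SLUG"] = "prd456"   # simulate any other environment
assert get_schedule("0 1 * * *") == "0 1 * * *"        # elsewhere: default schedule is used

print("get_schedule behaves as expected")
```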
33 changes: 19 additions & 14 deletions docs/how-tos/airflow/generate-dags-from-yml.md
@@ -3,17 +3,19 @@
You have the option to write out your DAGs in Python, or you can write them using yml and then have dbt-coves generate the Python DAG for you.

## Configure config.yml
>[!NOTE]This configuration is for the `dbt-coves generate airflow-dags` command which generates the DAGs from your yml files. Visit the [dbt-coves docs](https://github.com/datacoves/dbt-coves?tab=readme-ov-file#settings) for the full dbt-coves configuration settings.
This configuration is for the `dbt-coves generate airflow-dags` command which generates the DAGs from your yml files. Visit the [dbt-coves docs](https://github.com/datacoves/dbt-coves?tab=readme-ov-file#settings) for the full dbt-coves configuration settings.

dbt-coves will read settings from `<dbt_project_path>/.dbt_coves/config.yml`. First, create your `.dbt-coves` directory at the root of your dbt project (where the dbt_project.yml file is located). Then create a file called `config.yml`. Datacoves' recommended dbt project location is `transform/` so that's where you would create this file. eg) `transform/.dbt-coves/config.yml`.
dbt-coves will read settings from `<dbt_project_path>/.dbt_coves/config.yml`. You must create this file in order for dbt-coves to function.

- `yml_path`: This is where dbt-coves will look for the yml files to generate your Python DAGs.
- `dags_path`: This is where dbt-coves will place your generated python DAGs.
**Step 1:** Create the `.dbt-coves` folder at the root of your dbt project (where the dbt_project.yml file is located). Then create a file called `config.yml` inside of `.dbt-coves`.

### Place the following in your `config.yml file`:
>[!NOTE]Datacoves' recommended dbt project location is `transform/` eg) `transform/.dbt-coves/config.yml`. This may require some minor refactoring and ensuring that the `dbt project path` in your environment settings is updated accordingly.
>[!TIP]We use environment variables such as `DATACOVES__AIRFLOW_DAGS_YML_PATH` that are pre-configured for you. For more information on these variables see [Datacoves Environment Variables](reference/vscode/datacoves-env-vars.md)
**Step 2:** Configure the `yml_path` and `dags_path` settings below. We use environment variables such as `DATACOVES__AIRFLOW_DAGS_YML_PATH` that are pre-configured for you; for more information on these variables, see [Datacoves Environment Variables](reference/vscode/datacoves-env-vars.md)
- `yml_path`: This is where dbt-coves will look for the yml files to generate your Python DAGs.
- `dags_path`: This is where dbt-coves will place your generated python DAGs.

Place the following in your `config.yml` file:
```yml
generate:
...
```
@@ -30,31 +32,34 @@ generate:
## Create the yml file for your Airflow DAG

Inside your `orchestrate` folder, create a folder named `dag_yml_definitions`. dbt-coves will look for your yml in this folder to generate your Python DAGs.

eg) `orchestrate/dag_yml_definitions`
dbt-coves will look for your yml inside your `orchestrate/dags_yml_definitions` folder to generate your Python DAGs. Please create this folder if you have not already done so.

>[!NOTE]The name of the file will be the name of the DAG.**
>[!NOTE]When you create a DAG with YAML, the name of the file will be the name of the DAG.
eg) `yml_dbt_dag.yml` generates a dag named `yml_dbt_dag`

Let's create our first DAG using YAML.

**Step 1**: Create a new file named `my_first_yml.yml` in your `orchestrate/dags_yml_definitions` folder.

**Step 2:** Add the following YAML to your file, and be sure to change the owner and email values.

```yml
description: "Sample DAG for dbt build"
schedule_interval: "0 0 1 */12 *"
tags:
  - version_2
default_args:
  start_date: 2023-01-01
  owner: Noel Gomez # Replace this with your name
  email: [email protected] # Replace with the email of the recipient for failures
  email_on_failure: true
catchup: false

nodes:
  run_dbt:
    type: task
    operator: operators.datacoves.dbt.DatacovesDbtOperator
    bash_command: "dbt run -s personal_loans"
```
>[!TIP]In the examples we make use of the Datacoves Operators which handle things like copying and running dbt deps. For more information on what these operators handle, see [Datacoves Operators](reference/airflow/datacoves-operator.md)
