adding marts discussion #438

Open

alex-pavlopoulos wants to merge 11 commits into ae-style-guide from adding-marts-section

alex-pavlopoulos commented Aug 21, 2024

Opening a new PR for the marts section as discussed here

Currently this is based on the suggested-edit-v2 branch, but it will need to be rebased onto the ae-style-guide branch once the edits branch has been merged.

alex-pavlopoulos added 2 commits

August 21, 2024 11:47


          Making suggested edits

627608a


          adding marts discussion

alex-pavlopoulos requested review from a team as code owners

August 21, 2024 12:02

Contributor

SoumayaMauthoorMOJ commented Aug 27, 2024

@alex-pavlopoulos would you mind adding a sub-section called "data modelling" and explaining what this is? I don't think this is clearly explained anywhere in the handbook, and it can be confusing since "data model" can have multiple meanings depending on the context.

Base automatically changed from suggested-edits-v1 to ae-style-guide

September 18, 2024 09:14

alex-pavlopoulos and others added 9 commits

September 26, 2024 12:03


          quick changes

49a3de3


          some updates

fd5733e


          base model update

f4e546e


          reading through and updating

388d834


          continuing edit

cdb2398


          first complete edit

cc3a405


          small changes

ea1c6a3


          adding links

6154c6d


          second edit of section 1

fabc1e8

benwaterfield reviewed

Contributor

benwaterfield left a comment

Looks good to me, thanks Alex!

source/documentation/tools/create-a-derived-table/project-structure.md

-                    │   ├── court_intermediate
+                    │   ├── courts_int
+                    │   │
+                    │   ├── courst_datamarts

Contributor

benwaterfield Nov 14, 2024

Suggested change

      
-                  │   ├── courst_datamarts
+                  │   ├── courts_datamarts

source/documentation/tools/create-a-derived-table/project-structure.md

-              - **Folders.** Folder structure is extremely important in dbt. Not only do we need a consistent structure to find our way around the codebase, as with any software project, but our folder structure is also one of the key interfaces for understanding the knowledge graph encoded in our project (alongside the DAG and the data output into our warehouse). It should reflect how the data flows, step-by-step, from a wide variety of source-conformed models into fewer, richer business-conformed models. Moreover, we can use our folder structure as a means of selection in dbt [selector syntax](https://docs.getdbt.com/reference/node-selection/syntax). For example, with the above structure, if we got fresh xhibit data loaded and wanted to run all the models that build on our xhibit data, we can easily run `dbt build --select staging/xhibit_stg+` and we’re all set for building more up-to-date reports on payments.
-                - ✅ **Subdirectories based on the source system**. Our internal transactional database is one system, the data we get from Stripe's API is another, and lastly the events from our Snowplow instrumentation. We've found this to be the best grouping for most companies, as source systems tend to share similar loading methods and properties between tables, and this allows us to operate on those similar sets easily.
+              - **Folders.** Folder structure is extremely important in dbt. Not only do we need a consistent structure to find our way around the codebase, as with any software project, but our folder structure is also one of the key interfaces for understanding the knowledge graph encoded in our project (alongside the DAG and the data output into our warehouse). It should reflect how the data flows, step-by-step, from a wide variety of source-conformed models into fewer, richer business-conformed models. Moreover, we can use our folder structure as a means of selection in dbt [selector syntax](https://docs.getdbt.com/reference/node-selection/syntax). For example, with the above structure, if we got fresh xhibit data loaded and wanted to run all the models that build on our xhibit data, we can easily run `dbt build --select staging/xhibit_stg+`, and we’re all set for building more up-to-date reports on Crown Courts. Additionally, with create-a-derived-table the project structure is how we manage database access. So ensuring you follow this guidance will ensure only those who need to get to see your data
+                - ✅ **Subdirectories based on business grouping.** dbt recommends against this practice, however crate-a-derived-table has been built in a way that necessitates domains as subdirectories so that we can control access through [data engineering database access](https://github.com/moj-analytical-services/data-engineering-database-access/tree/main/database_access/create_a_derived_table). This is a key deviation from dbt guidance.

Contributor

benwaterfield Nov 14, 2024

We could drop this bullet now that we have the staging domain and are conforming with dbt's guidance.
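
For anyone reading this thread who is unsure what the folder-based selection in the **Folders** bullet does, here is a minimal sketch of the idea in Python. It is not dbt's implementation: it only models "pick every model under a folder, then add everything downstream of it", which is the behaviour the trailing `+` in the quoted command is meant to give. All model names, the `courts/` and `finance/` parent folders, and the `select_with_descendants` helper are hypothetical and chosen for illustration only.

```python
from collections import deque

# Hypothetical project: model name -> (folder, list of parent models).
# Folder and model names are illustrative assumptions, not the real project.
MODELS = {
    "xhibit_stg__cases": ("staging/xhibit_stg", []),
    "xhibit_stg__hearings": ("staging/xhibit_stg", []),
    "finance_stg__ledger": ("staging/finance_stg", []),
    "courts_int__case_hearings": (
        "courts/courts_int",
        ["xhibit_stg__cases", "xhibit_stg__hearings"],
    ),
    "courts_datamarts__crown_court_summary": (
        "courts/courts_datamarts",
        ["courts_int__case_hearings"],
    ),
    "finance_int__forecast_pivot": ("finance/finance_int", ["finance_stg__ledger"]),
}


def select_with_descendants(models, folder):
    """Return the models in `folder` plus every model downstream of them."""
    # Invert the parent lists so we can walk from a model to its children.
    children = {name: [] for name in models}
    for name, (_, parents) in models.items():
        for parent in parents:
            children[parent].append(name)

    # Start from every model that lives in the folder (or a subfolder of it).
    selected = {
        name
        for name, (path, _) in models.items()
        if path == folder or path.startswith(folder + "/")
    }

    # Breadth-first walk over the DAG to collect all descendants.
    queue = deque(selected)
    while queue:
        current = queue.popleft()
        for child in children[current]:
            if child not in selected:
                selected.add(child)
                queue.append(child)
    return sorted(selected)


if __name__ == "__main__":
    for model in select_with_descendants(MODELS, "staging/xhibit_stg"):
        print(model)
```

In this toy project, selecting `staging/xhibit_stg` picks up the two xhibit staging models and the courts models built on them, while the finance models are left alone. That is the property the guidance relies on: a consistent folder layout lets one selector rebuild everything that depends on a freshly loaded source.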

source/documentation/tools/create-a-derived-table/project-structure.md

               ### Intermediate: Models
-              Below is the lone intermediate model from our small example project. This represents an excellent use case per our principles above, serving a clear single purpose: grouping and pivoting a staging model to different grain. It utilizes a bit of Jinja to make the model DRY-er (striving to be DRY applies to the code we write inside a single model in addition to transformations across the codebase), but don’t be intimidated if you’re not quite comfortable with [Jinja](/docs/build/jinja-macros) yet. Looking at the name of the <Term id="cte">CTE</Term>, `pivot_and_aggregate_payments_to_order_grain` we get a very clear idea of what’s happening inside this block. By descriptively labeling the transformations happening inside our CTEs within model, just as we do with our files and folders, even a stakeholder who doesn’t know SQL would be able to grasp the purpose of this section, if not the code. As you begin to write more complex transformations moving out of the staging layer, keep this idea in mind. In the same way our models connect into a DAG and tell the story of our transformations on a macro scale, CTEs can do this on a smaller scale inside our model files.
+              Below is a slightly more comlicated model taken from a finance project. This represents an excellent use case per our principles above, serving a clear single purpose: pivoting a staging model to different grain. It utilises a bit of Jinja in the form of a macro to make the model DRY-er (striving to be DRY applies to the code we write inside a single model in addition to transformations across the codebase), but don’t be intimidated if you’re not quite comfortable with [Jinja](/docs/build/jinja-macros) yet. Looking at the name of the <Term id="cte">CTE</Term>, `finance_int__int_hyperion_forecast_pivot` we get a clear idea of what’s happening inside this block. By descriptively labeling the transformations happening inside our CTEs within model, just as we do with our files and folders, even a stakeholder who doesn’t know SQL would be able to grasp the purpose of this section, if not the code. As you begin to write more complex transformations moving out of the staging layer, keep this idea in mind. In the same way our models connect into a DAG and tell the story of our transformations on a macro scale, CTEs can do this on a smaller scale inside our model files.

Contributor

benwaterfield Nov 14, 2024

Typo: "comlicated" should be "complicated".

source/documentation/tools/create-a-derived-table/project-structure.md


		--------------
		✅ Group by domain or area of concern. On create-a-derived-table datamarts will be organised into domains, these align with genral business areas (criminal courts, finance, prisons, people, etc.). Within each domain the subfolder will represent the database name and should reflect a business area concept (crown court for criminal courts or recruitment for people). We are no longer interested in strictly source aligned conecepts.

Contributor

benwaterfield Nov 14, 2024

Typo: "conecepts" should be "concepts".

source/documentation/tools/create-a-derived-table/project-structure.md


		![data-flow-diagram excalidraw](https://github.com/user-attachments/assets/3dfc54d7-e304-4e48-b06d-c0dddad40503)

		In the above diagram you can see the flow of data through the Data Modelling and Engineering function. This follows a medalion rating system, that corresponds to the level of cleaning, transforming and testing that has been implemented on the data.

Contributor

benwaterfield Nov 14, 2024

Typo: "medalion" should be "medallion".
