Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Removal of drop table if exists "no_schema"."$scratch" #149

Open
Alain-Barrette opened this issue Mar 6, 2023 · 30 comments
Open

Removal of drop table if exists "no_schema"."$scratch" #149

Alain-Barrette opened this issue Mar 6, 2023 · 30 comments
Assignees

Comments

@Alain-Barrette
Copy link

Describe the enhancement requested

Deployment of our code take a lot of time. We are currently deploying 1,025 object in 16minutes. Lots of this is due to

/* {"app": "dbt", "dbt_version": "1.4.1", "profile_name": "cbq", "target_name": "qa", "node_id": "model.cbq.DeclarationAnnuelleAMFLegacyApplication3JeuxDonneesPolicesEnVigueurFinAnnee"} */
drop table if exists "no_schema"."$scratch"."DeclarationAnnuelleAMFLegacyApplication3JeuxDonneesPolicesEnVigueurFinAnnee"

This drop take between 1 and 22 seconds here. It depends on the load of the server and the Thread level we are using.

image

Since there will never be an object name "no_schema"."$scratch".%, I don't see the need for this.
Is it possible to have this removed ?

Justification for this enhancement

Speed increase during deployment.

@ravjotbrar ravjotbrar self-assigned this Mar 6, 2023
@fabrice-etanchaud
Copy link

fabrice-etanchaud commented Mar 7, 2023

Bonjour @Alain-Barrette ,
(sorry I have to proceed in my poor english)
it seems to be an issue in the rendering of your model as a relation.
could you please give me its configuration parameters ?
I mean :

  • materialization type
  • database
  • schema
  • datalake
  • root_path

Can you confirm that you want your model to be located as :
$scratch.DeclarationAnnuelleAMFLegacyApplication3JeuxDonneesPolicesEnVigueurFinAnnee ?
I will try to understand why the adapter emit a relation like this :
"no_schema"."$scratch"."DeclarationAnnuelleAMFLegacyApplication3JeuxDonneesPolicesEnVigueurFinAnnee"
it 's as if you configured your model with :

  • datalake : no_schema
  • root_path : $scratch

no_schema is only meaningful for root_path or schema configurations,
as a way to locate your model straight under the space or s3/hadoop/nfs root location.

$scratch is the default location for persistent materializations.

Are you on S3 ? I can remember long time table deletions issues.

Looking forward,
Cordialement (de la part d'un Cousin, apparemment ?)
Fabrice

@Alain-Barrette
Copy link
Author

Hi @fabrice-etanchaud ,

No problems continuing in french if you like. In the meantime, I am adding the main data engineer that is working with dbt here.

Alexandre Cote will be reviewing this to answer you. We may indeed have some bad configuration in this project.

Thanks

@ravjotbrar
Copy link
Contributor

Hi @Alain-Barrette and @fabrice-etanchaud , If I understand the situation correctly, this drop table happens when QBC is trying to create views?

I see in the relevant create_or_replace_view macro we have a call to drop the table with the same name as the view:

  -- setup: in case the model was materialized before, drop the table
  {{ adapter.drop_relation(target_table) }}

@fabrice-etanchaud do you know the reasoning for why this was added?

@Alain-Barrette
Copy link
Author

dbt_project.zip

This could probably help.

@fabrice-etanchaud
Copy link

fabrice-etanchaud commented Mar 8, 2023

Hi @Alain-Barrette ,

Yes that's what I was suspecting :

  cbq:
    ##########################################################################
    +materialized: view
    +database: no_schema      # dremio_space
    +schema: no_schema        # dremio_space_folder
    +datalake: no_schema      # object_storage_source
    +root_path: $scratch      # object_storage_path

no_schema is only meaningful for root_path or schema configurations,
as a way to locate your model straight under the space or s3/hadoop/nfs root location.
$scratch is the default location for persistent materializations.

As it seems all your models are materialized as views (I know nothin about your custom model config in the properties or sql files), datalake and root_path will not be used at all.
If you want to explicitly configure a database (resp. a datalake) at the project level, you have to mention a space name (resp. a source name).

Could you please check that the "DeclarationAnnuelleAMFLegacyApplication3JeuxDonneesPolicesEnVigueurFinAnnee" model is in a subdirectory mentioned in your dbt_project.yml or its twin_strategy is 'allow' ? It seems to me that it's not the case, and that dbt use the invalid project level configuration to build it's twin table 's relation as "no_schema"."$scratch"."DeclarationAnnuelleAMFLegacyApplication3JeuxDonneesPolicesEnVigueurFinAnnee".
Or you may have configured it as a table ?
Please tell me !

Hi @ravjotbrar :

The create_or_replace_view macro does not seem to be used anywhere (please correct me if i am wrong. should maybe be removed from code).
In fact there is even an issue in this macro, because the twin table relation would always be built with default (profile) datalake/root_path) configuration.

Correct code is in the apply_twin_strategy macro.

I wish dremio had a shared catalog for views and tables, the database/schema and datalake/root_path double configuration add complexity to the adapter. Maybe is there a simpler way to go ?

@alexcotecbq
Copy link

Removing datalake and root_path from the dbt_project.yml file does not seem to avoid the drop table instruction.

Note that there is no functional impact from this issue, only a possible optimization.

@fabrice-etanchaud
Copy link

Removing datalake and root_path from the dbt_project.yml file does not seem to avoid the drop table instruction.

Note that there is no functional impact from this issue, only a possible optimization.

I can't explain this

@Alain-Barrette
Copy link
Author

This change result in a scratch drop..
image
image

@fabrice-etanchaud
Copy link

fabrice-etanchaud commented Mar 8, 2023

That's better !

Please, take into account this :

no_schema is only meaningful for root_path or schema configurations,
as a way to locate your model straight under the space or s3/hadoop/nfs root location.

So you shouldn't configure your project's default database to 'no_schema' (unless your space is named like this ?)

What is the materialized configuration of your "PlatiniumCrc..." model ?
I guess it's a view, and that you did not set any twin_strategy configuration, so defaulting to clone, and that the drop table command is emitted by the twin_strategy, that prevents a model from being materialized once as a view and once as a table.

Are you still experiencing long drop times ?

@Alain-Barrette
Copy link
Author

The drop time depends on the load of dremio. At the moment we are testing with a small number of object. To reduce test time. Will do further testing when this issue is resolved

@Alain-Barrette
Copy link
Author

Would it be easier to diagnose this by booking a meet between @fabrice-etanchaud and @alexcotecbq

@alexcotecbq
Copy link

@fabrice-etanchaud : we can even talk in French if you want :-)

@fabrice-etanchaud
Copy link

Why not ?
But the diagnosis is done, since it's expected behavior, provided that you confirm that :

What is the materialized configuration of your "PlatiniumCrc..." model ?
I guess it's a view, and that you did not set any twin_strategy configuration, so defaulting to clone, and that the drop table command is emitted by the twin_strategy, that prevents a model from being materialized once as a view and once as a table.

@alexcotecbq
Copy link

alexcotecbq commented Mar 8, 2023

We don't have any documentation on the twin_strategy so no configuration. If you have an URL for this, I'll take it.

@fabrice-etanchaud
Copy link

fabrice-etanchaud commented Mar 8, 2023

Dremio team wrote a wiki : https://github.com/dremio/dbt-dremio/wiki/Using-Materializations-with-Dremio#optional-twin-strategy-configuration

I am currently on my office time. I created dbt-dremio on my spare time three years ago, it's now maintained by dremio officially ! Don't tell my manager I still have a double life :-),
I would be available after 6 PM CET(GMT+1).
How could we proceed ?

@alexcotecbq
Copy link

OK, we were recently looking at the Dremio and DBT websites so I wasn't aware of the option.
It seems to work with allow value for twin_strategy.
Thanks @fabrice-etanchaud

@fabrice-etanchaud
Copy link

Yes, twin_strategy is not an original dbt configuration, it's specific to dremio.

@Alain-Barrette
Copy link
Author

Situation resolved with the new configuration tested by @alexcotecbq
We went from
deploying 1025 object in 16m39s.
to
1025 view models in 0 hours 7 minutes and 30.95 seconds

Which is much better. Thanks to all
@fabrice-etanchaud @ravjotbrar

@Alain-Barrette
Copy link
Author

models:
##########################################################################
cbq:
##########################################################################
+materialized: view
+twin_strategy: allow
# +database: no_schema # dremio_space
# +schema: no_schema # dremio_space_folder
# +datalake: no_schema # object_storage_source
# +root_path: $scratch # object_storage_path
##########################################################################

@jlarue26
Copy link
Contributor

jlarue26 commented Mar 8, 2023

Yes, many thanks @fabrice-etanchaud and @ravjotbrar !!!

@fabrice-etanchaud
Copy link

Glad to hear it worked as expected !
I am so thankful to the dremio team for bringing the adapter to the next level, an enterprise grade adapter !

@fabrice-etanchaud
Copy link

Did you try to tell dbt to process more than 4 models at the same time (with regards to your DAG) ?

https://docs.getdbt.com/reference/dbt-jinja-functions/target

@Alain-Barrette
Copy link
Author

testing was done with thread 8 and 16. No performance difference with both level. Will do further testing later when we hit the 1,500 objects mark in our project.

@ravjotbrar
Copy link
Contributor

It seems like we might need to consider removing "clone" as the default twin_strategy option and make it "allow" instead.

@fabrice-etanchaud
Copy link

fabrice-etanchaud commented Mar 8, 2023

in a standard db engine, views and tables share the same namespace, and in dbt, when a given model changes from table to view, or from view to table, the previous table or view is of course replaced by the new view or table.
That's not true anymore in dremio : when you change the materialized configuration of a model, the previous table or view remains because they don't live in the same namespace (space or source). It 's confusing. That's why i introduced the twin strategy.
I really like to have for each table a kind of proxy view in a space, in order to have all the sql objects at the same place, like in a regular db engine "schema".
Finally, for non auto promoted format datasets, creating a proxy view is the only way to apply the format and emulate auto promotion...

@alexcotecbq
Copy link

@fabrice-etanchaud : By the way, is it possible to use a sys table with the adapter? No CREATE VDS instruction is passed to Dremio in my case. Here is the code in the .sql file:

select distinct role_name as roleName from {{ source("dremio_internal_sys", "membership") }} order by role_name

And here is the code in the .yml file:

sources:
  - name: dremio_internal_sys
    database: sys
    schema: no_schema
    tables:
      - name: membership

models:
  - name: DistinctRoles
    config:
      materialized: view
      alias: listRole

My colleague @Alain-Barrette would like to include administration views in Dremio via DBT.

@fabrice-etanchaud
Copy link

fabrice-etanchaud commented Mar 10, 2023

Hi @alexcotecbq , great idea ! And what about open sourcing it if you can !
About the issue you mention, this should work. I cannot figure out why this model is not selected by dbt.
Are you sure the files are located in a place taken into account by dbt ?

@alexcotecbq
Copy link

alexcotecbq commented Mar 10, 2023

Yes I am sure because the path is TravailEquipe.EquipeData and DBT message is:

There are 2 unused configuration paths:
- models.cbq.TravailEquipe.CroixBleue
- models.cbq.TravailEquipe.CanAssistance

@alexcotecbq
Copy link

alexcotecbq commented Mar 10, 2023

@fabrice-etanchaud : c'est un code 18 finalement. vive le vendredi!

@fabrice-etanchaud
Copy link

Merci @alexcotecbq, je ne connaissais pas l'expression ! Bonne "fin de semaine" !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

5 participants