Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

poll .GetBatch() instead of using operation.result() #841

Closed
wants to merge 9 commits into from

Conversation

wazi55
Copy link
Contributor

@wazi55 wazi55 commented Jul 21, 2023

resolves: #734
docs dbt-labs/docs.getdbt.com/#

Problem

Serverless Spark polling within the dbt-bigquery adapter spends an extra 1.5 minute waiting for cluster to be torn down instead of just waiting for the job to finish.

Solution

Updated operation.result to getBatch instead. added a randomly generated batch_id to allow dataproc to reference getBatch operation from the batch_id used for createBatch.

Checklist

  • I have read the contributing guide and understand what's expected of me
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • This PR has no interface changes (e.g. macros, cli, logs, json artifacts, config files, adapter interface, etc) or this PR has already received feedback and approval from Product or DX

@wazi55 wazi55 requested a review from a team as a code owner July 21, 2023 12:51
@wazi55 wazi55 requested a review from McKnight-42 July 21, 2023 12:51
@cla-bot cla-bot bot added the cla:yes label Jul 21, 2023
@wazi55
Copy link
Contributor Author

wazi55 commented Jul 21, 2023

wazi-macbookpro:dbt-bigquery wazi$ python -m pytest tests/unit/test_bigquery_adapter.py

======================================================= test session starts =======================================================
platform darwin -- Python 3.10.9, pytest-7.4.0, pluggy-1.2.0
rootdir: /Users/wazi/Downloads/git/dbt-bigquery
configfile: pytest.ini
plugins: anyio-3.5.0, csv-3.0.0, logbook-1.2.0, flaky-3.7.0, dotenv-0.5.2, xdist-3.3.1
collected 52 items                                                                                                                

tests/unit/test_bigquery_adapter.py ....................................................                                    [100%]

======================================================== warnings summary =========================================================
../../../../../usr/local/anaconda3/lib/python3.10/site-packages/jupyter_client/connect.py:27
  /usr/local/anaconda3/lib/python3.10/site-packages/jupyter_client/connect.py:27: DeprecationWarning: Jupyter is migrating its paths to use standard platformdirs
  given by the platformdirs library.  To remove this warning and
  see the appropriate new directories, set the environment variable
  `JUPYTER_PLATFORM_DIRS=1` and then run `jupyter --paths`.
  The use of platformdirs will be the default in `jupyter_core` v6

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
================================================= 52 passed, 1 warning in 15.77s ==================================================
wazi-macbookpro:dbt-bigquery wazi$ 

@dataders dataders changed the title re:PR https://github.com/dbt-labs/dbt-bigquery/pull/840/files poll .GetBatch() instead of using operation.result() Jul 21, 2023
@dataders
Copy link
Contributor

@wazi55 we still need a changelog entry on this. here's how to add a changelog entry

Comment on lines 131 to 135
operation = self.job_client.create_batch(request=request) # type: ignore
# this takes quite a while, waiting on GCP response to resolve
# (not a google-api-core issue, more likely a dataproc serverless issue)
response = operation.result(polling=self.result_polling_policy)

state = "PENDING"
Copy link
Contributor

@dataders dataders Jul 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the linting checks are failing with the following message

python_submissions.py:127:9: F841 local variable 'operation' is assigned to but never used

I suggest either:

  1. removing it, or
  2. (preferred) use it to set the initial state variable

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took the easier path 🔢


state = "State.PENDING"
while state not in ["State.SUCCEEDED", "State.FAILED", "State.CANCELLED"]:
time.sleep(2)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@colin-rogers-dbt i'm a polling filthy casual, is there a more canonical way for polling besides .sleep(2)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really, we need to have some sort of polling interval

response = operation.result(polling=self.result_polling_policy)

state = "State.PENDING"
while state not in ["State.SUCCEEDED", "State.FAILED", "State.CANCELLED"]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pretty sure this is an enum defined some where in the bigquery python client. We should rely on that and not hardcoded strings.

@colin-rogers-dbt
Copy link
Contributor

Thanks for getting this started I'm going to take this one over to address tests/missing functionality as part of #929

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants