diff --git a/docs/accessing-data/connecting.mdx b/docs/accessing-data/connecting.mdx
index 392d6b008..2224d3384 100644
--- a/docs/accessing-data/connecting.mdx
+++ b/docs/accessing-data/connecting.mdx
@@ -3,7 +3,7 @@ title: "Connecting"
 sidebar_position: 10
 ---
 
-BigQuery offers multiple connection methods to the Hubble dataset. This guide details three common methods:
+BigQuery offers multiple connection methods to Hubble. This guide details three common methods:
 
 - [BigQuery UI](#bigquery-ui) - analysts that need to perform ad hoc analysis using SQL
 - [BigQuery SDK](#bigquery-sdk) - developers that need to integrate data into applications
@@ -11,7 +11,7 @@ BigQuery offers multiple connection methods to the Hubble dataset. This guide de
 
 ## Prerequisites
 
-To access the Hubble dataset, you will need a Google Cloud Project with billing and the BigQuery API enabled. For more information, please follow the instructions provided by [Google Cloud](https://cloud.google.com/bigquery/docs/quickstarts/query-public-dataset-console).
+To access Hubble, you will need a Google Cloud Project with billing and the BigQuery API enabled. For more information, please follow the instructions provided by [Google Cloud](https://cloud.google.com/bigquery/docs/quickstarts/query-public-dataset-console).
 
 Google does provide a BigQuery Sandbox for free that allows users to explore datasets in a limited capacity.
@@ -58,6 +58,8 @@ Install the client library locally, and configure your environment to use your G
 python3 --version
 # if you do not have pip, install it
 python -m pip install --upgrade pip
+
+# install bigquery client library
 pip install --upgrade google-cloud-bigquery
 gcloud config set project PROJECT_ID
 ```
diff --git a/docs/accessing-data/optimizing-queries.mdx b/docs/accessing-data/optimizing-queries.mdx
index 07306f9af..5f2c6fc83 100644
--- a/docs/accessing-data/optimizing-queries.mdx
+++ b/docs/accessing-data/optimizing-queries.mdx
@@ -76,6 +76,8 @@ order by `month`
 
 **Performance Summary**
 
+By pruning partitions and aggregating on a clustered field, query processing costs drop by a factor of 8.
+
 | | Bytes Processed | Cost |
 | ---------------- | --------------- | ------ |
 | Original Query | 408.1 GB | $2.041 |
@@ -127,6 +129,8 @@ where batch_run_date >= '2023-05-01'
 
 **Performance Summary**
 
+Hubble stores wide tables. Query performance is greatly improved by selecting only the data you need. This principle is critical when exploring the operations and transactions tables, which are the largest tables in Hubble.
+
 | | Bytes Processed | Cost |
 | -------------- | --------------- | ------ |
 | Original Query | 769.45 GB | $3.847 |
@@ -156,6 +160,33 @@ If you need to estimate costs before running a query, there are several options
 
 ### BigQuery Console
 
+The BigQuery Console comes with a built-in query validator. It verifies query syntax and provides an estimate of the number of bytes processed. The validator can be found in the upper right-hand corner of the Query Editor, next to the green checkmark.
+
+To calculate the query cost, convert the number of bytes processed into terabytes, and multiply the result by $5:
+
+`(estimated bytes read / 1TB) * $5`
+
+Paste the following query into the Editor to view the estimated bytes processed.
+
+
+
+```sql
+select timestamp_trunc(closed_at, month) as month,
+  sum(tx_set_operation_count) as total_operations
+from `crypto-stellar.crypto_stellar.history_ledgers`
+where batch_run_date >= '2023-01-01T00:00:00'
+  and batch_run_date < '2023-06-01T00:00:00'
+  and closed_at >= '2023-01-01T00:00:00'
+  and closed_at < '2023-06-01T00:00:00'
+group by month
+```
+
+
+
+The validator estimates that 51.95 MB of data will be read.
+
+0.00005195 TB * $5 ≈ $0.00026. _That’s a cheap query!_
+
 ### dryRun Config Parameter
 
 If you are submitting a query through a [BigQuery client library](https://cloud.google.com/bigquery/docs/reference/libraries), you can perform a dry run to estimate the total bytes processed before submitting the query job.
diff --git a/docs/accessing-data/overview.mdx b/docs/accessing-data/overview.mdx
index dbc715ee8..fbc99b0c4 100644
--- a/docs/accessing-data/overview.mdx
+++ b/docs/accessing-data/overview.mdx
@@ -7,7 +7,7 @@ sidebar_position: 0
 
 Hubble is an open-source, publicly available dataset that provides a complete historical record of the Stellar network. Similar to Horizon, it ingests and presents the data produced by the Stellar network in a format that is easier to consume than the performance-oriented data representations used by Stellar Core. The dataset is hosted on BigQuery–meaning it is suitable for large, analytic workloads, historical data retrieval and complex data aggregation. **Hubble should not be used for real-time data retrieval and cannot submit transactions to the network.** For real time use cases, we recommend [running an API server](/docs/run-api-server).
 
-This guide describes when to use Hubble and how to connect. For more information regarding underlying data structures, queries and examples, please refer to [Viewing Metadata](/docs/accessing-data/viewing-metadata) and [Optimizing Queries](/docs/accessing-data/optimizing-queries).
+This guide describes when to use Hubble and how to connect. To view the underlying data structures, queries and examples, use the [Viewing Metadata](/docs/accessing-data/viewing-metadata) and [Optimizing Queries](/docs/accessing-data/optimizing-queries) tutorials.
 
 ## Why Use Hubble?
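
Note on the cost formula added to `optimizing-queries.mdx`: the arithmetic `(estimated bytes read / 1TB) * $5` can be checked with a minimal sketch like the one below. It assumes the $5 per TB rate quoted in the added text and a decimal terabyte (10^12 bytes), which is what the 51.95 MB to 0.00005195 TB conversion implies. The helper name `estimate_query_cost` is hypothetical and is not part of any BigQuery library.

```python
# Hypothetical helper for illustration; not part of the BigQuery client library.

BYTES_PER_TB = 10**12    # assumption: decimal terabyte (51.95 MB -> 0.00005195 TB)
PRICE_PER_TB_USD = 5.0   # rate quoted in the added docs text


def estimate_query_cost(bytes_processed: float,
                        price_per_tb: float = PRICE_PER_TB_USD) -> float:
    """Return the estimated cost in USD for a query that reads bytes_processed bytes."""
    if bytes_processed < 0:
        raise ValueError("bytes_processed must be non-negative")
    # Convert bytes to terabytes, then multiply by the per-terabyte price.
    return bytes_processed / BYTES_PER_TB * price_per_tb


if __name__ == "__main__":
    # 51.95 MB is the validator estimate quoted for the history_ledgers query.
    validator_estimate_bytes = 51.95e6
    print(f"${estimate_query_cost(validator_estimate_bytes):.8f}")
```

Under the same assumptions, the 408.1 GB and 769.45 GB figures in the Performance Summary tables come out at roughly $2.04 and $3.85, in line with the $2.041 and $3.847 listed there.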