
all: implement Spanner connector #59

Merged (80 commits), Sep 23, 2023
Conversation

odeke-em (Collaborator)

Allows us to build SQL from filters and required columns.

Updates #58
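
For intuition, here is a minimal sketch of the idea (the helper name and filter representation below are illustrative, not the connector's actual API):

```python
# Illustrative only: a tiny SQL builder from required columns and
# pushed-down filters, in the spirit of what buildScan does.
from typing import Iterable, Sequence, Tuple

def build_scan_sql(table: str, columns: Sequence[str],
                   filters: Iterable[Tuple[str, str, str]]) -> str:
    cols = ", ".join(columns) if columns else "*"
    sql = f"SELECT {cols} FROM {table}"
    clauses = [f"{col} {op} {value}" for col, op, value in filters]
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql

# build_scan_sql("exchange_rates",
#                ["created_at", "value", "base_cur"],
#                [("value", ">", "3720"), ("base_cur", "=", "'USD'")])
# -> "SELECT created_at, value, base_cur FROM exchange_rates
#     WHERE value > 3720 AND base_cur = 'USD'"
```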

@odeke-em force-pushed the implement-buildScan branch 4 times, most recently from 208e0ee to f44ba06 on August 25, 2023 17:37
@odeke-em force-pushed the implement-buildScan branch 5 times, most recently from 79dfb48 to a669e9c on September 5, 2023 17:22
@halio-g (Collaborator) commented Sep 6, 2023

/gcbrun

3 similar comments followed (two more on Sep 6, one on Sep 12, 2023).

@halio-g (Collaborator) left a review comment:

Can we add an integration test (with a Spark session) to test it? The last time I tested this range in a real Dataproc job, "2022-12-31" resulted in 19357, but really it should be 1672444800000000.

Sorry, I misunderstood your question; it should be 19357, since Spark receives the number of days since 1970-01-01 rather than microseconds.

@odeke-em changed the title from "ScanBuilder: implement buildScan to construct SQL" to "all: implement Spanner connector" on Sep 19, 2023
@odeke-em (Collaborator, Author)

> Can we add an integration test (with a Spark session) to test it? The last time I tested this range in a real Dataproc job, "2022-12-31" resulted in 19357, but really it should be 1672444800000000.
>
> Sorry, I misunderstood your question; it should be 19357, since Spark receives the number of days since 1970-01-01 rather than microseconds.

No worries, over the weekend I reverted to returning days since the epoch.
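
For the record, the two numbers above are consistent with each other: 19357 is the number of days from 1970-01-01 to 2022-12-31, and 1672444800000000 is that same midnight-UTC instant expressed in microseconds. A quick check:

```python
from datetime import date

EPOCH = date(1970, 1, 1)
d = date(2022, 12, 31)

days = (d - EPOCH).days          # 19357: days since the epoch, what Spark's DATE expects
micros = days * 86_400 * 10**6   # 1672444800000000: midnight UTC in microseconds
print(days, micros)
```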

@odeke-em (Collaborator, Author)

@davidrabinowitz @halio-g at the current commit, these are the results from running this Python program:

```python
from pyspark.sql import SparkSession

# Placeholder values, not part of the original snippet; substitute your own.
PROJECT_ID = "my-project"
SPANNER_INSTANCE = "my-instance"
SPANNER_DATABASE = "my-database"

def main():
    table = "exchange_rates"
    spark = SparkSession.builder.appName("Query Spanner").getOrCreate()
    df = spark.read.format('cloud-spanner') \
                .option("projectId", PROJECT_ID) \
                .option("instanceId", SPANNER_INSTANCE) \
                .option("databaseId", SPANNER_DATABASE) \
                .option("enableDataBoost", "true") \
                .option("table", table) \
                .load(table)
    df.printSchema()
    df.select("created_at", "value", "base_cur").filter((df["value"] > 3720) & (df["base_cur"] == "USD")).show()

if __name__ == "__main__":
    main()
```

which produces

```shell
root
 |-- id: string (nullable = false)
 |-- base_cur: string (nullable = false)
 |-- end_cur: string (nullable = false)
 |-- value: double (nullable = false)
 |-- data_src: string (nullable = false)
 |-- created_at: timestamp (nullable = false)
 |-- published_at: timestamp (nullable = true)

+--------------------+-----------+--------+
|          created_at|      value|base_cur|
+--------------------+-----------+--------+
|2023-08-30 20:56:...|3724.365901|     USD|
|2023-09-09 04:40:...|3738.384066|     USD|
|2023-09-07 05:44:...|3735.182945|     USD|
|2023-08-25 17:00:...|3730.606064|     USD|
|2023-09-15 10:08:...|3724.849835|     USD|
|2023-08-21 18:24:...|3727.353643|     USD|
|2023-08-20 07:52:...|3724.922281|     USD|
|2023-08-23 23:00:...|3738.342818|     USD|
|2023-08-14 10:00:...|3749.635375|     USD|
|2023-09-01 05:48:...|3730.472718|     USD|
|2023-09-15 21:40:...|3730.465981|     USD|
|2023-09-11 16:40:...|3723.400239|     USD|
|2023-09-13 13:16:...|3723.537745|     USD|
|2023-08-19 14:00:...|3724.922281|     USD|
|2023-09-06 02:56:...|3735.927727|     USD|
```
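
For reference, given the pruned columns and filters above, the connector would presumably push a query of roughly this shape down to Spanner: `SELECT created_at, value, base_cur FROM exchange_rates WHERE value > 3720 AND base_cur = 'USD'` (the exact SQL it emits is not shown in this thread).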

@halio-g (Collaborator) commented Sep 19, 2023

@odeke-em (Collaborator, Author)

@halio-g the presubmit check now passes; please take a look again!

@odeke-em dismissed davidrabinowitz's stale review on September 19, 2023 19:24

Addressed almost all the comments, please re-review.

Resolved review threads: README.md, examples/SpannerSpark.py (outdated)
```diff
@@ -12,6 +12,7 @@ CREATE TABLE games (
   winner STRING(36),
   created TIMESTAMP,
   finished TIMESTAMP,
+  max_date DATE,
 ) PRIMARY KEY(gameUUID);

 CREATE TABLE players (
```
A collaborator left a review comment:

This can be done in a follow-up PR.

Please create tables by type, for example:

```sql
CREATE TABLE IntTable (
  not_null_index INT64 NOT NULL,
  int_column INT64,
) PRIMARY KEY(not_null_index);

CREATE TABLE StringTable (
  not_null_index STRING(10) NOT NULL,
  max_string_col STRING(MAX),
  string_100_col STRING(100),
) PRIMARY KEY(not_null_index);

CREATE TABLE TimestampTable (
  not_null_index TIMESTAMP NOT NULL,
  timestamp_col TIMESTAMP,
) PRIMARY KEY(not_null_index);
```

Or create a single table with all the types, for example:

```sql
CREATE TABLE IntegrationTestTable (
  not_null_index INT64 NOT NULL,
  max_string_col STRING(MAX),
  string_100_col STRING(100),
  int_column INT64,
  timestamp_col TIMESTAMP,
) PRIMARY KEY(not_null_index);

CREATE TABLE StringIndexTable (
  string_index STRING(20) NOT NULL,
) PRIMARY KEY(string_index);
```
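
As a sketch of the requested Spark-session integration test against such a table (the table name and assertions below assume the IntegrationTestTable DDL above and the connector options from the example program earlier in this thread; none of this is taken from the actual test suite):

```python
from pyspark.sql import SparkSession

# Placeholder configuration, as in the example program above.
PROJECT_ID = "my-project"
SPANNER_INSTANCE = "my-instance"
SPANNER_DATABASE = "my-database"

def test_reads_typed_columns():
    spark = SparkSession.builder.appName("spanner-it").getOrCreate()
    df = (spark.read.format("cloud-spanner")
          .option("projectId", PROJECT_ID)
          .option("instanceId", SPANNER_INSTANCE)
          .option("databaseId", SPANNER_DATABASE)
          .option("table", "IntegrationTestTable")
          .load())
    # Check that Spanner types survive the trip into Spark's schema.
    fields = {f.name: f.dataType.simpleString() for f in df.schema.fields}
    assert fields["int_column"] == "bigint"        # INT64 -> LongType
    assert fields["timestamp_col"] == "timestamp"  # TIMESTAMP -> TimestampType
```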

Keep that functionality in SpannerScanner; while here, also add filters to SpannerScanner to allow filtering. This change implements a column selector to reduce the amount of data returned; the results were confirmed by a live query

```python
df.printSchema()
(df.select("created_at", "value", "base_cur")
   .filter((df["value"] > 3720) & (df["base_cur"] == "USD"))
   .show())
```

which produced

```shell
root
 |-- id: string (nullable = false)
 |-- base_cur: string (nullable = false)
 |-- end_cur: string (nullable = false)
 |-- value: double (nullable = false)
 |-- data_src: string (nullable = false)
 |-- created_at: timestamp (nullable = false)
 |-- published_at: timestamp (nullable = true)

+--------------------+-----------+--------+
|          created_at|      value|base_cur|
+--------------------+-----------+--------+
|2023-08-30 20:56:...|3724.365901|     USD|
|2023-09-09 04:40:...|3738.384066|     USD|
|2023-09-07 05:44:...|3735.182945|     USD|
|2023-08-25 17:00:...|3730.606064|     USD|
|2023-09-15 10:08:...|3724.849835|     USD|
|2023-08-21 18:24:...|3727.353643|     USD|
|2023-08-20 07:52:...|3724.922281|     USD|
|2023-08-23 23:00:...|3738.342818|     USD|
|2023-08-14 10:00:...|3749.635375|     USD|
|2023-09-01 05:48:...|3730.472718|     USD|
|2023-09-15 21:40:...|3730.465981|     USD|
|2023-09-11 16:40:...|3723.400239|     USD|
|2023-09-13 13:16:...|3723.537745|     USD|
|2023-08-19 14:00:...|3724.922281|     USD|
|2023-09-06 02:56:...|3735.927727|     USD|
|2023-09-11 13:04:...|3723.400239|     USD|
|2023-09-14 17:56:...|3724.849835|     USD|
|2023-09-07 20:44:...|3739.570874|     USD|
|2023-08-22 07:00:...|3729.171921|     USD|
|2023-08-22 08:20:...|3729.171921|     USD|
+--------------------+-----------+--------+
```
Helps identify the connector to Google Cloud so that usage metrics, health checks, quota updates, and optimizations can later be made trivially.
Allows Data Boost to be disabled; it is on by default given the point of this connector. However, there is something to be said for compatibility, so that by default most users who haven't enabled Data Boost can still use the connector; that is to be discussed later.

Fixes #68
While here, also fixed up the package path to be fully:

    package com.google.cloud.spark.spanner;

instead of the erroneous:

    package com.google.cloud.spark;
…ataboost

Hao reasoned that Data Boost should be opt-in because it is expensive for customers; that is a good reason to require it to be explicitly enabled by the customer.
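
A usage sketch of the opt-in (the option name is taken from the example program earlier in this thread; `spark` and the placeholder config are assumed from there as well):

```python
# Data Boost is opt-in because it incurs extra cost for customers;
# enable it explicitly, or omit the option to run without it.
df = (spark.read.format("cloud-spanner")
      .option("projectId", PROJECT_ID)
      .option("instanceId", SPANNER_INSTANCE)
      .option("databaseId", SPANNER_DATABASE)
      .option("enableDataBoost", "true")
      .option("table", "exchange_rates")
      .load())
```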
This fixes a long-standing shutdown failure due to unclosed Spanner objects.