
[hailctl] update to dataproc 2.2 and Spark 3.5.0 #14158

Merged
merged 5 commits into main from dataproc-2.2 on Apr 11, 2024

Conversation

danking
Contributor

@danking danking commented Jan 12, 2024

Fixes #13971

CHANGELOG: Hail now supports and primarily tests against Dataproc 2.2.5, Spark 3.5.0, and Java 11. We strongly recommend updating to Spark 3.5.0 and Java 11. You should also update your GCS connector after installing Hail: curl https://broad.io/install-gcs-connector | python3. Do not try to update before installing Hail 0.2.131.

https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-release-2.2

@danking danking added the WIP label Jan 12, 2024
@danking danking changed the title from "Dataproc 2.2" to "[hailctl] update to dataproc 2.2" Jan 12, 2024

implicitly[BinaryRegistry[
  DenseMatrix[Double],
  Vector[Double],
  OpMulMatrix.type,
  DenseVector[Double],
]].register(
-  DenseMatrix.implOpMulMatrix_DMD_DVD_eq_DVD
+  HasOps.impl_OpMulMatrix_DMD_DVD_eq_DVD
)

@danking danking removed the WIP label Jan 12, 2024
@zyd14

zyd14 commented Jan 19, 2024

My team is pretty excited about Hail being released with support for Spark 3.5. One thing I noticed is that the plan looks like it restricts to Spark 3.5.0 exactly - would it be possible to allow some wiggle room for patch releases? Spark has begun releasing upgrades much more often than in the past, so pinning to 3.5.0 would prevent access to bug fixes, feature enhancements, etc.

@danking
Contributor Author

danking commented Feb 1, 2024

@zyd14 that's just a warning message printed during compilation. The requirements.txt file accepts any 3.5.x version.
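In requirements.txt terms that wiggle room is a constraint along the lines of `pyspark>=3.5.0,<3.6`. As an illustration only (a hypothetical helper, not Hail's actual code), the accepted range can be checked like this:

```python
def is_supported_spark(version: str) -> bool:
    # Accept any Spark 3.5.x patch release, i.e. roughly the
    # constraint 'pyspark>=3.5.0,<3.6' in requirements.txt terms.
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) == (3, 5)

print(is_supported_spark("3.5.1"))  # True: patch releases are allowed
print(is_supported_spark("3.4.2"))  # False: older minor versions are not
```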

@danking
Contributor Author

danking commented Feb 2, 2024

I had to pin nbconvert<7.14 due to a new bug in nbconvert.

jupyter/nbconvert#2092

@@ -232,7 +232,7 @@ async def async_map(
 ) -> AsyncGenerator[int, None]:
     """Asyncio compatible version of :meth:`.map`."""
     if not iterables:
-        return (x for x in range(0))
+        return the_empty_async_generator()
Contributor Author

The linter didn't like that I used a plain (synchronous) empty generator where an async generator was expected.
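The helper's definition isn't shown in the diff; a minimal version of `the_empty_async_generator`, assuming it only needs to yield nothing (the `consume` wrapper below is hypothetical, for demonstration), could look like this:

```python
import asyncio
from typing import AsyncGenerator


async def the_empty_async_generator() -> AsyncGenerator[int, None]:
    # An `async def` containing a `yield` anywhere in its body is an
    # async generator function; the unreachable `yield` below is what
    # makes this an (empty) async generator rather than a coroutine.
    return
    yield  # unreachable, but required


async def consume() -> list:
    # `async for` works here; it would fail on `(x for x in range(0))`.
    return [x async for x in the_empty_async_generator()]


print(asyncio.run(consume()))  # -> []
```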

@danking
Contributor Author

danking commented Feb 2, 2024

Since the dataproc tests only run on main commits (not on every PR commit, due to cost), I submitted a dev deploy to test the latest commit to this branch against dataproc: https://ci.hail.is/batches/8119055

hailctl dev deploy -b danking/hail:dataproc-2.2 -s test_dataproc-37 -s test_dataproc-38

@danking danking changed the title from "[hailctl] update to dataproc 2.2" to "[hailctl] update to dataproc 2.2 and Spark 3.5.0" Feb 2, 2024
@danking
Contributor Author

danking commented Feb 5, 2024

Previous dev deploy passed. Rebased on main and submitted a new dev deploy: https://ci.hail.is/batches/8120683

@ehigham
Member

ehigham commented Feb 5, 2024

@patrick-schultz note changes to build.gradle - is this compatible with your work to use mill?

ehigham
ehigham previously approved these changes Feb 5, 2024
@danking danking added the WIP label Feb 5, 2024
case javaVersion(major, _, _) =>
  if (major.toInt != 11)
-    fatal(s"Hail requires Java 8 or 11, found $versionString")
+    warn(s"Hail is tested against Java 11, found $versionString")
Contributor Author

I don't fully grok what we were trying to accomplish here.

You can't possibly reach these lines if the JVM version is lower than the bytecode target, so JVMs <= 7 would never execute this. AFAICT, Spark now supports all LTS versions of Java (currently 8, 11, and 17), so I see no strong reason to prohibit any version. Instead, I simply warn if you are using a JVM version other than the one we test with.
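The relaxed check can be sketched in Python (a hypothetical mirror of the Scala match; `check_java_version` and the regex are illustrative, not Hail's actual code):

```python
import re
from typing import Optional

JAVA_VERSION = re.compile(r"(\d+)\.(\d+)\.(\d+)")


def check_java_version(version_string: str) -> Optional[str]:
    # Mirrors the Scala logic: only warn (never fail) when the major
    # version differs from the one Hail is tested against.
    m = JAVA_VERSION.match(version_string)
    if m is None:
        return f"Unable to parse Java version: {version_string}"
    if int(m.group(1)) != 11:
        return f"Hail is tested against Java 11, found {version_string}"
    return None


print(check_java_version("11.0.12"))  # None: no warning needed
print(check_java_version("17.0.2"))   # warning string, but not fatal
```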

Contributor Author

In particular:

Spark runs on Java 8/11/17, Scala 2.12/2.13, Python 3.8+, and R 3.5+. Java 8 prior to version 8u371 support is deprecated as of Spark 3.5.0.

https://spark.apache.org/docs/latest/#downloading

@danking danking dismissed ehigham’s stale review February 5, 2024 22:28

I updated batch/Dockerfile.worker and docker/Dockerfile.base. The former controls which JVM is active at QoB execution time and the latter controls which JVM is active at compilation time. Note that in build.gradle we still target JVM 8 byte code, so Hail should work with other versions.

@danking danking removed the WIP label Feb 6, 2024
@danking
Contributor Author

danking commented Feb 6, 2024

@ehigham ready for re-review now with Java 11 changes.

@danking
Contributor Author

danking commented Feb 6, 2024

And here's a dev deploy that runs the dataproc tests. Don't approve until these tests pass! We don't run them on ordinary PRs because they're expensive and slow. We do run them on main commits. For this PR, the chance of having broken dataproc is high enough that we should ensure the tests pass before merging into main.

https://ci.hail.is/batches/8121061

@danking
Contributor Author

danking commented Feb 7, 2024

Delay merging until broadinstitute/install-gcs-connector#6 is merged. Without that PR, users will not have access to a version of the GCS Hadoop connector that does not use tons of memory in JVM 11.

@danking
Contributor Author

danking commented Feb 7, 2024

Rebased and dev deploy kicked off: https://ci.hail.is/batches/8122588

@danking
Contributor Author

danking commented Feb 7, 2024

broadinstitute/install-gcs-connector#6 is merged.

Comment on lines +79 to 86
// WARNING WARNING WARNING
// Before changing the breeze version review:
// - https://hail.zulipchat.com/#narrow/stream/123011-Hail-Query-Dev/topic/new.20spark.20ndarray.20failures/near/41645
// - https://github.com/hail-is/hail/pull/11555
val core = ivy"org.scalanlp::breeze:1.1"
Collaborator

It appears we've actually been using breeze 1.2, because spark-mllib pulls it in:

❯ mill ivyDepsTree --withCompile --withRuntime --whatDependsOn org.scalanlp:breeze_2.12                                                        (base)
[17/17] ivyDepsTree
└─ org.scalanlp:breeze_2.12:1.2
   ├─ org.apache.spark:spark-mllib-local_2.12:3.3.0
   │  ├─ org.apache.spark:spark-graphx_2.12:3.3.0
   │  │  └─ org.apache.spark:spark-mllib_2.12:3.3.0
   │  └─ org.apache.spark:spark-mllib_2.12:3.3.0
   ├─ org.apache.spark:spark-mllib_2.12:3.3.0
   └─ org.scalanlp:breeze-natives_2.12:1.1 org.scalanlp:breeze_2.12:1.1 -> 1.2

Contributor Author

Does the compileIvyDeps line not exclude them? I want to let Spark use their thing and relocate breeze for our own internal purposes.

Collaborator

Ah, I missed that, that does let us use breeze 1.1. And I now realize even without that, because this is a compile only dependency, we would build with breeze 1.2 but run with 1.1.

@danking
Contributor Author

danking commented Feb 22, 2024

@patrick-schultz @ehigham I'm abdicating responsibility for this. I've too much to wrap up in the next five work days. It looks like there's a mill issue currently. Otherwise I think it should be ready to merge.

@danking
Contributor Author

danking commented Feb 23, 2024

OK, I'm not sure how to fix this but the work is to explain to the GCS Hadoop Connector which credentials we want it to use. See the failure here: https://batch.hail.is/batches/8136069/jobs/49 . It uses CI's credentials instead of the test credentials. We use core-site.xml to do this in Spark <3.5, but the GCS connector is different in Spark 3.5 and it uses different configuration parameters. My most recent change did not successfully configure it.

Daniel G can help you a bit with credentials in Batch if that's necessary but the real work is to figure out how to tell the GCS Hadoop Connector to use the /gsa-key/key.json file.
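For reference, pointing the connector at that keyfile programmatically might look like the sketch below. The `fs.gs.auth.*` property names are the ones the connector's 3.x documentation describes for JSON-keyfile auth (the older 2.x name, `google.cloud.auth.service.account.json.keyfile`, appears in the core-site.xml snippet elsewhere in this thread); the helper itself is hypothetical:

```python
def gcs_keyfile_hadoop_conf(keyfile: str = "/gsa-key/key.json") -> dict:
    # Hypothetical helper: build the Hadoop configuration entries that
    # tell the GCS connector 3.x to authenticate with a JSON keyfile.
    return {
        "fs.gs.auth.type": "SERVICE_ACCOUNT_JSON_KEYFILE",
        "fs.gs.auth.service.account.json.keyfile": keyfile,
    }


for name, value in gcs_keyfile_hadoop_conf().items():
    print(f"{name}={value}")
```

In Spark, each entry would be applied with a `spark.hadoop.` prefix on the config key.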

@danking
Contributor Author

danking commented Feb 23, 2024

Matt S fortuitously asked a question that led me to https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/v3.0.0/gcs/INSTALL.md , so I'm trying that now. That might be the last necessary fix.

@danking danking removed the WIP label Feb 23, 2024
@danking
Contributor Author

danking commented Feb 27, 2024

Just the local backend is broken now. I might've just fixed that.

@danking
Contributor Author

danking commented Feb 28, 2024

OK, I won't be able to fix this. @ehigham @patrick-schultz @daniel-goldstein some combo of you three can probably figure it out. The local backend tests that hit requester-pays buckets are failing with the new Spark. New Spark needs a new GCS Hadoop connector (see the Dockerfiles). The new GCS Hadoop connector has brand-new configuration parameters. Somehow I managed to make the normal Spark backend work correctly, but the local backend (which still, afaik, uses Spark & Hadoop for filesystems) is still trying to pick up CI's credentials instead of the test account's credentials.

E           hail.utils.java.FatalError: GoogleJsonResponseException: 403 Forbidden
E           GET https://storage.googleapis.com/storage/v1/b/hail-test-requester-pays-fds32/o/zero-to-nine?fields=bucket,name,timeCreated,updated,generation,metageneration,size,contentType,contentEncoding,md5Hash,crc32c,metadata&userProject=hail-vdc
E           {
E             "code": 403,
E             "errors": [
E               {
E                 "domain": "global",
E                 "message": "[email protected] does not have serviceusage.services.use access to the Google Cloud project. Permission 'serviceusage.services.use' denied on resource (or it may not exist).",
E                 "reason": "forbidden"
E               }
E             ],
E             "message": "[email protected] does not have serviceusage.services.use access to the Google Cloud project. Permission 'serviceusage.services.use' denied on resource (or it may not exist)."
E           }
E           
E           Java stack trace:
E           java.io.IOException: Error accessing gs://hail-test-requester-pays-fds32/zero-to-nine
E           	at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.getObject(GoogleCloudStorageImpl.java:1986)
E           	at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.getItemInfo(GoogleCloudStorageImpl.java:1882)
E           	at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystemImpl.getFileInfoInternal(GoogleCloudStorageFileSystemImpl.java:861)
E           	at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystemImpl.getFileInfo(GoogleCloudStorageFileSystemImpl.java:833)
E           	at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem.getFileStatus(GoogleHadoopFileSystem.java:724)
E           	at org.apache.hadoop.fs.Globber.getFileStatus(Globber.java:115)
E           	at org.apache.hadoop.fs.Globber.doGlob(Globber.java:349)
E           	at org.apache.hadoop.fs.Globber.glob(Globber.java:202)
E           	at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:2142)
E           	at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem.globStatus(GoogleHadoopFileSystem.java:759)
E           	at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem.globStatus(GoogleHadoopFileSystem.java:1277)
E           	at is.hail.io.fs.HadoopFS.glob(HadoopFS.scala:162)
E           	at is.hail.io.fs.HadoopFS.glob(HadoopFS.scala:85)
E           	at is.hail.io.fs.FS.glob(FS.scala:402)
E           	at is.hail.io.fs.FS.glob$(FS.scala:402)
E           	at is.hail.io.fs.HadoopFS.glob(HadoopFS.scala:85)
E           	at is.hail.io.fs.HadoopFS.$anonfun$globAll$1(HadoopFS.scala:154)
E           	at is.hail.io.fs.HadoopFS.$anonfun$globAll$1$adapted(HadoopFS.scala:153)

<property>
  <name>google.cloud.auth.service.account.json.keyfile</name>
  <value>/gsa-key/key.json</value>
</property>

Contributor

These properties look wrong based on this example

Collaborator

Is this still wrong?

Member

I don't know what to say - seems to work in ci?

Member

This is what INSTALL.md recommends

Contributor

shrug guess it's fine then

@ehigham ehigham force-pushed the dataproc-2.2 branch 2 times, most recently from b0b96c5 to b863f72 Compare March 20, 2024 01:57
Dan King and others added 4 commits April 11, 2024 11:22
Collaborator

@patrick-schultz patrick-schultz left a comment

I'm good with this, just one minor request, and a resolution on @daniel-goldstein 's question above.

Also, need to update the hail version in the changelog text

hail/Makefile (resolved)
Collaborator

@patrick-schultz patrick-schultz left a comment

Can you just change the hail version in the changelog text before we merge it?

@ehigham
Member

ehigham commented Apr 11, 2024

> I'm good with this, just one minor request, and a resolution on @daniel-goldstein 's question above.
>
> Also, need to update the hail version in the changelog text

Sorry, missed that bit about the changelog

@hail-ci-robot hail-ci-robot merged commit bd0156d into hail-is:main Apr 11, 2024
2 checks passed

Successfully merging this pull request may close these issues.

[query] Support Spark 3.5.x
6 participants