
Slow sql performance #42

Open

talalryz opened this issue Apr 29, 2021 · 2 comments

Comments

@talalryz

Running,

spark.read.table('database.table').limit(10).show()

is a lot faster than running,

spark.sql('SELECT * from database.table limit 10')

Intuitively, we would expect both of these operations to have similar run times. Looking a bit deeper,
it seems that spark.sql forces a scan of the entire table, whereas spark.read.table.limit does not. The problem extends to filtering by partition columns as well, e.g.,
spark.sql('SELECT * from database.table where partition_col=<value>') also forces a full table scan, while using spark.read.table.filter does not.

Is there something I could be missing, e.g. a Spark configuration that could be causing this, or is this a known issue?
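
For reference, a minimal way to compare what the two APIs actually plan (reusing the database.table name from above; this is just a sketch, not a fix) is to print the physical plans:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
table = "database.table"  # placeholder table name from the example above

# DataFrame API: the limit is known to the optimizer before the scan is planned
spark.read.table(table).limit(10).explain(True)

# SQL API: if a full FileScan over all files appears here but not above,
# the limit is not being pushed down for this source
spark.sql(f"SELECT * FROM {table} LIMIT 10").explain(True)

If the SQL plan shows a scan over all files while the DataFrame plan does not, that matches the behaviour described above.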


pancodia commented Oct 14, 2022

I see the same slow performance when using PySpark on EMR with the Glue metastore, but in my case there is no difference between the two query methods.

I also found that if I filter by partition columns, the amount of data scanned is reduced and the query speeds up.
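
For reference, a minimal sketch of that partition-filter comparison (partition_col and its value are placeholders; look for PartitionFilters in the FileScan node of each plan):

# hypothetical partition column and value, purely for illustration
partition_col, value = "ds", "2022-10-14"

spark.read.table("database.table").filter(f"{partition_col} = '{value}'").explain(True)
spark.sql(f"SELECT * FROM database.table WHERE {partition_col} = '{value}'").explain(True)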


heetu commented Feb 13, 2023

Hi @talalryz and @pancodia, did you reach any conclusion?
