
Slow sql performance #42

Open

talalryz opened this issue Apr 29, 2021 · 2 comments

Comments

@talalryz

Running,

spark.read.table('database.table').limit(10).show()

is a lot faster than running,

spark.sql('SELECT * from database.table limit 10')

Intuitively, we would expect both of these operations to have similar run times. Looking a bit deeper,
it seems that spark.sql forces a scan of the entire table, whereas spark.read.table.limit does not. The problem extends to filtering by partition columns as well, e.g.,
spark.sql('SELECT * from database.table where partition_col=<value>') also forces a full table scan, while using spark.read.table.filter does not.

Is there something I could be missing, e.g. a Spark configuration that could be causing this, or is this a known issue?
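
For reference, a minimal way to compare what the two APIs actually plan (reusing the database.table name from above; this is just a sketch, not a fix) is to print the physical plans:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
table = "database.table"  # placeholder table name from the example above

# DataFrame API: the limit is known to the optimizer before the scan is planned
spark.read.table(table).limit(10).explain(True)

# SQL API: if a full FileScan over all files appears here but not above,
# the limit is not being pushed down for this source
spark.sql(f"SELECT * FROM {table} LIMIT 10").explain(True)

If the SQL plan shows a scan over all files while the DataFrame plan does not, that matches the behaviour described above.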


pancodia commented Oct 14, 2022

I see the same slow performance when using PySpark on EMR with the Glue metastore, but in my case there is no difference between the two query methods.

I also found that if I filter by partition columns, the amount of data scanned is reduced and the query speeds up.
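
For reference, a minimal sketch of that partition-filter comparison (partition_col and its value are placeholders; look for PartitionFilters in the FileScan node of each plan):

# hypothetical partition column and value, purely for illustration
partition_col, value = "ds", "2022-10-14"

spark.read.table("database.table").filter(f"{partition_col} = '{value}'").explain(True)
spark.sql(f"SELECT * FROM database.table WHERE {partition_col} = '{value}'").explain(True)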


heetu commented Feb 13, 2023

Hi @talalryz and @pancodia, did you reach any conclusion?
