Running

```python
spark.read.table('database.table').limit(10).show()
```

is a lot faster than running

```python
spark.sql('SELECT * from database.table limit 10')
```
Intuitively, we would expect both operations to have similar run times. Looking a bit deeper, it seems that `spark.sql` forces a scan of the entire table, whereas `spark.read.table(...).limit(...)` does not. The problem extends to filtering by partition columns as well: `spark.sql('SELECT * from database.table where partition_col=<value>')` also forces a full table scan, while `spark.read.table(...).filter(...)` does not.

Is there something I could be missing, e.g., a Spark configuration that could be causing this, or is this a known issue?
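For concreteness, the two query plans can be compared with `explain()`. Below is a minimal sketch of that comparison; `database.table`, `partition_col`, and the literal `'value'` are placeholders for a real table and partition value, and the exact plan output will vary by Spark version and data source:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# LIMIT comparison: the DataFrame path should plan a LocalLimit/GlobalLimit
# over a bounded scan, while the SQL path (in the environment described
# above) plans a scan over every file.
spark.read.table('database.table').limit(10).explain(True)
spark.sql('SELECT * from database.table limit 10').explain(True)

# Partition-filter comparison: the DataFrame path should show the predicate
# pushed into the scan node's PartitionFilters, while the SQL path does not.
spark.read.table('database.table').filter("partition_col = 'value'").explain(True)
spark.sql("SELECT * from database.table where partition_col = 'value'").explain(True)
```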