Spark SQL

SQL interface for structured data processing

Using SQL with DataFrames

Spark SQL lets you query a DataFrame with SQL once it is registered as a temporary view. The view is scoped to the current SparkSession, and spark.sql() returns the query result as a new DataFrame.

# Register DataFrame as a temporary view
df.createOrReplaceTempView("people")

# Run SQL query
results = spark.sql("""
  SELECT name, age 
  FROM people 
  WHERE age > 30 
  ORDER BY age DESC
""")

results.show()

DataFrame API vs SQL

DataFrame API

# Wrap the chain in parentheses so it can span several lines in Python
(
    df.select("name", "age")
    .filter(df.age > 30)
    .orderBy(df.age.desc())
)

SQL Equivalent

SELECT name, age
FROM people
WHERE age > 30
ORDER BY age DESC