Structured data processing with Spark
DataFrames are distributed collections of data organized into named columns, similar to tables in relational databases.
```python
from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession, the entry point for DataFrame operations
spark = SparkSession.builder.appName("DataFrameExample").getOrCreate()

# From a list of tuples, supplying column names explicitly
data = [("Alice", 34), ("Bob", 45), ("Charlie", 28)]
df = spark.createDataFrame(data, ["Name", "Age"])

# From a CSV file, reading the header row and inferring column types
df = spark.read.csv("data.csv", header=True, inferSchema=True)
```
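After loading, it is worth confirming that the columns and inferred types match what you expect. A quick check might look like this:

```python
# Print the inferred schema and preview the first few rows
df.printSchema()
df.show(5)
```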
| Operation | Example |
|---|---|
| Select columns | `df.select("Name", "Age")` |
| Filter rows | `df.filter(df.Age > 30)` |
| Group by | `df.groupBy("Department").avg("Salary")` |
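These transformations are lazy: nothing executes until an action such as `show()` or `collect()` is called. A minimal sketch tying the three operations together; the `employees` DataFrame and its `Department` and `Salary` columns are sample data invented here for illustration, since the earlier `df` has only `Name` and `Age`:

```python
# Hypothetical sample data with Department and Salary columns
employees = spark.createDataFrame(
    [("Alice", 34, "Engineering", 95000),
     ("Bob", 45, "Sales", 70000),
     ("Charlie", 28, "Engineering", 82000)],
    ["Name", "Age", "Department", "Salary"],
)

employees.select("Name", "Age").show()                 # project two columns
employees.filter(employees.Age > 30).show()            # keep rows where Age > 30
employees.groupBy("Department").avg("Salary").show()   # mean salary per department
```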