PySpark DataFrames

Structured data processing with Spark

DataFrame Basics

DataFrames are distributed collections of data organized into named columns, similar to tables in a relational database.

Creating DataFrames

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DataFrameExample").getOrCreate()

# From a list of tuples
data = [("Alice", 34), ("Bob", 45), ("Charlie", 28)]
df = spark.createDataFrame(data, ["Name", "Age"])

# From a CSV file: header=True reads the first row as column names,
# and inferSchema=True makes Spark scan the file to guess column types
df = spark.read.csv("data.csv", header=True, inferSchema=True)
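Because inferSchema requires an extra pass over the file, for larger inputs it is common to declare the schema up front instead. A minimal sketch, reusing the data.csv path and the Name/Age columns from the example above:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Declare column names and types explicitly instead of inferring them
schema = StructType([
    StructField("Name", StringType(), True),   # True = column is nullable
    StructField("Age", IntegerType(), True),
])
df = spark.read.csv("data.csv", header=True, schema=schema)

# Quick sanity checks: print the first rows and the schema in use
df.show()
df.printSchema()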

Common Operations

Operation        Example
Select columns   df.select("Name", "Age")
Filter rows      df.filter(df.Age > 30)
Group by         df.groupBy("Department").avg("Salary")
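Each of these operations returns a new DataFrame, so they chain naturally, and they are lazy: nothing executes until an action such as show() or count() is called. A short sketch combining them, using a hypothetical employees DataFrame with Name, Department, and Salary columns:

from pyspark.sql import functions as F

employees = spark.createDataFrame(
    [("Alice", "Engineering", 95000),
     ("Bob", "Sales", 70000),
     ("Charlie", "Engineering", 88000)],
    ["Name", "Department", "Salary"],
)

# Filter first, then aggregate: average salary per department
# for employees earning more than 60,000
result = (
    employees
    .filter(F.col("Salary") > 60000)
    .groupBy("Department")
    .agg(F.avg("Salary").alias("AvgSalary"))
)
result.show()  # the action that actually triggers the computation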