Understanding PySpark's core operations
Transformations are lazy operations that create a new RDD/DataFrame from an existing one. Spark does not execute them immediately; it only records the lineage of operations and defers computation until an action is called (see the runnable sketch after these snippets).
# map
rdd.map(lambda x: x*2)
# filter
rdd.filter(lambda x: x > 10)
# groupByKey (requires an RDD of key-value pairs)
rdd.groupByKey()
# join (also operates on key-value pair RDDs, matching on key)
rdd1.join(rdd2)
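To make the laziness concrete, here is a minimal, self-contained sketch using the RDD API. It assumes a local SparkContext; the app name and sample data are illustrative, not from any particular dataset. Every line below returns immediately, because Spark only records the lineage.

from pyspark import SparkContext

sc = SparkContext("local[*]", "transformations-demo")  # assumed local context

nums = sc.parallelize([5, 12, 8, 20])
doubled = nums.map(lambda x: x * 2)       # new RDD; nothing computed yet
large = doubled.filter(lambda x: x > 20)  # still nothing computed

pairs = sc.parallelize([("a", 1), ("a", 2), ("b", 3)])
grouped = pairs.groupByKey()              # lazy: groups values per key

lookup = sc.parallelize([("a", "x"), ("b", "y")])
joined = pairs.join(lookup)               # lazy: inner join on matching keys

# No Spark job has run so far; an action (next section) triggers the pipeline.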
Actions force the recorded transformations to execute; they either return a result to the driver program or write output to external storage.
# Collect all data to the driver (caution: the full dataset must fit in driver memory)
data = rdd.collect()
# Count elements
count = rdd.count()
# Save to a directory of part files (fails if "output" already exists)
rdd.saveAsTextFile("output")
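Putting the two halves together, here is a minimal end-to-end sketch, again assuming a local SparkContext; the sample data are illustrative, and the "output" directory name is reused from the snippet above. The transformations queue up instantly, and only the actions launch Spark jobs.

from pyspark import SparkContext

sc = SparkContext("local[*]", "actions-demo")  # assumed local context

rdd = sc.parallelize(range(1, 101))
evens = rdd.filter(lambda x: x % 2 == 0)  # transformation: lazy, no job yet

count = evens.count()    # action: runs the job, returns 50
data = evens.collect()   # action: brings all 50 elements back to the driver

# Action: writes one part file per partition under ./output
evens.map(str).saveAsTextFile("output")

sc.stop()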