Understanding PySpark's core operations
Transformations are lazy operations that create a new RDD/DataFrame from an existing one. Spark does not execute them immediately; it only records the lineage of operations and defers computation until an action is called (see the runnable sketch after these snippets).
# map
rdd.map(lambda x: x*2)
# filter
rdd.filter(lambda x: x > 10)
# groupByKey (requires an RDD of key-value pairs)
rdd.groupByKey()
# join (also operates on key-value pair RDDs, matching on key)
rdd1.join(rdd2)
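To make the laziness concrete, here is a minimal, self-contained sketch using the RDD API. It assumes a local SparkContext; the app name and sample data are illustrative, not from any particular dataset. Every line below returns immediately, because Spark only records the lineage.

from pyspark import SparkContext

sc = SparkContext("local[*]", "transformations-demo")  # assumed local context

nums = sc.parallelize([5, 12, 8, 20])
doubled = nums.map(lambda x: x * 2)       # new RDD; nothing computed yet
large = doubled.filter(lambda x: x > 20)  # still nothing computed

pairs = sc.parallelize([("a", 1), ("a", 2), ("b", 3)])
grouped = pairs.groupByKey()              # lazy: groups values per key

lookup = sc.parallelize([("a", "x"), ("b", "y")])
joined = pairs.join(lookup)               # lazy: inner join on matching keys

# No Spark job has run so far; an action (next section) triggers the pipeline.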
Actions force the recorded transformations to execute; they either return a result to the driver program or write output to external storage.
# Collect all data to the driver (caution: the full dataset must fit in driver memory)
data = rdd.collect()
# Count elements
count = rdd.count()
# Save to a directory of part files (fails if "output" already exists)
rdd.saveAsTextFile("output")
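Putting the two halves together, here is a minimal end-to-end sketch, again assuming a local SparkContext; the sample data are illustrative, and the "output" directory name is reused from the snippet above. The transformations queue up instantly, and only the actions launch Spark jobs.

from pyspark import SparkContext

sc = SparkContext("local[*]", "actions-demo")  # assumed local context

rdd = sc.parallelize(range(1, 101))
evens = rdd.filter(lambda x: x % 2 == 0)  # transformation: lazy, no job yet

count = evens.count()    # action: runs the job, returns 50
data = evens.collect()   # action: brings all 50 elements back to the driver

# Action: writes one part file per partition under ./output
evens.map(str).saveAsTextFile("output")

sc.stop()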