Understanding Resilient Distributed Datasets
RDD (Resilient Distributed Dataset) is the fundamental data structure of Spark. It represents an immutable, partitioned collection of elements that can be operated on in parallel. "Resilient" refers to fault tolerance: Spark records the lineage of transformations used to build each RDD, so lost partitions can be recomputed rather than replicated.
from pyspark import SparkContext
sc = SparkContext("local", "RDD Example")
# Create RDD from a list
data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(data)
# Create RDD from text file
text_rdd = sc.textFile("data.txt")
# Transformation (lazy: builds a new RDD, nothing is computed yet)
squared_rdd = rdd.map(lambda x: x*x)
# Action (triggers execution of the pending transformations)
result = squared_rdd.collect() # [1, 4, 9, 16, 25]
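To make the transformation/action distinction more concrete, here is a small sketch that continues from the rdd created above. The specific lambdas, the partition layout shown in the comments, and the final values are illustrative assumptions, not taken from the original example.

# Transformations chain lazily; no work happens until an action is called
even_rdd = rdd.filter(lambda x: x % 2 == 0)   # keep even numbers: [2, 4]
doubled_rdd = even_rdd.map(lambda x: x * 2)   # double each element: [4, 8]
# Actions run the whole chain at once
total = doubled_rdd.reduce(lambda a, b: a + b) # 4 + 8 = 12
count = doubled_rdd.count()                    # 2
# Inspect how the data is split across partitions
print(rdd.getNumPartitions())
print(rdd.glom().collect())  # one inner list per partition; layout depends on the master and partition count

Because the chain only runs when reduce or count is called, Spark can pipeline the filter and map over each partition instead of materializing intermediate results.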