RDD Basics

Understanding Resilient Distributed Datasets

What is an RDD?

An RDD (Resilient Distributed Dataset) is the fundamental data structure of Spark. It is an immutable, partitioned collection of elements that can be operated on in parallel. "Resilient" refers to fault tolerance: each RDD records the lineage of transformations used to build it, so a lost partition can be recomputed rather than restored from a replica.

Creating RDDs

from pyspark import SparkContext

sc = SparkContext("local", "RDD Example")

# Create RDD from a list
data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(data)

# Create RDD from a text file (one element per line)
text_rdd = sc.textFile("data.txt")

RDD Operations

# Transformation: defines a new RDD lazily; nothing executes yet
squared_rdd = rdd.map(lambda x: x*x)

# Action: triggers the computation and returns results to the driver
result = squared_rdd.collect()  # [1, 4, 9, 16, 25]