PySpark Streaming

Real-time data processing with Spark

Streaming Basics

Spark Streaming provides a high-level abstraction called a DStream (Discretized Stream), which represents a continuous stream of data as a sequence of RDDs.

Streaming Architecture

Incoming data is divided into small batches (micro-batches) at a fixed batch interval; each batch is then processed by the Spark engine as an ordinary batch job.
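The micro-batch idea can be illustrated in plain Python (this is a toy sketch of the concept, not Spark code; the batch size stands in for the batch interval):

```python
from itertools import islice

def micro_batches(stream, batch_size):
    """Chop an (unbounded) iterator into fixed-size batches,
    mimicking how Spark Streaming groups records that arrive
    within one batch interval."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# Each batch is handed to the engine as one small job
for batch in micro_batches(range(7), 3):
    print(batch)
# Prints [0, 1, 2], then [3, 4, 5], then [6]
```

In real Spark Streaming the batching is driven by wall-clock time rather than record count, but the processing model per batch is the same.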

Basic Streaming Example

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Create a SparkContext with at least 2 local threads (one for the
# receiver, one for processing), then a StreamingContext with a
# batch interval of 1 second
sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 1)

# Create a DStream that reads lines of text from a TCP source
# (e.g. a server started with `nc -lk 9999`)
lines = ssc.socketTextStream("localhost", 9999)

# Split each line into words, then count occurrences of each
# distinct word within each batch
words = lines.flatMap(lambda line: line.split(" "))
word_counts = words.countByValue()

# Print the first ten results of each batch to the console
word_counts.pprint()

# Start the computation and block until it is stopped
ssc.start()
ssc.awaitTermination()
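On each micro-batch, the flatMap/countByValue pipeline above behaves like the following plain-Python equivalent (a sketch of the per-batch semantics, not Spark code; the sample lines are made up):

```python
from collections import Counter

# One micro-batch of lines, as would arrive on the socket
batch = ["to be or", "not to be"]

# flatMap: split each line into words and flatten the results
words = [word for line in batch for word in line.split(" ")]

# countByValue: count occurrences of each distinct word
word_counts = Counter(words)
# word_counts == {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

The real DStream version runs this logic on every batch interval and emits a new (word, count) result per batch, rather than accumulating counts across batches.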