Real-time data processing with Spark Streaming
Spark Streaming provides a high-level abstraction called a DStream (Discretized Stream), which represents a continuous stream of data.
Incoming data is divided into small batches (micro-batches), each of which the Spark engine processes as an RDD.
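The minimal word-count example below reads lines of text from a TCP socket and counts the words seen in each one-second batch: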
from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext

# Use at least two local threads: one runs the socket receiver,
# the other processes the received data
spark = SparkSession.builder.master("local[2]").appName("NetworkWordCount").getOrCreate()
# Create a StreamingContext with a batch interval of 1 second
ssc = StreamingContext(spark.sparkContext, 1)
# Create a DStream from TCP source
lines = ssc.socketTextStream("localhost", 9999)
# Split each line into words
words = lines.flatMap(lambda line: line.split(" "))
# Count occurrences of each word within the current batch
word_counts = words.countByValue()
# Print the first few counts of each batch to the console
word_counts.pprint()
# Start the computation
ssc.start()
ssc.awaitTermination()
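To try this locally, start a netcat server in another terminal with nc -lk 9999 and type some lines of text; the word counts for each one-second batch appear in the console.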
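Because each transformation above runs independently per micro-batch, the counts reset every second. To aggregate across batches, DStreams also offer window operations; the sketch below is a variant of the same pipeline, where the 30-second window and 10-second slide are illustrative values chosen for this example:

# Count words over a sliding window: the last 30 seconds of data,
# recomputed every 10 seconds (both must be multiples of the batch interval)
windowed_counts = lines.flatMap(lambda line: line.split(" ")) \
    .window(30, 10) \
    .countByValue()
windowed_counts.pprint()

Like all DStream transformations, this must be defined before ssc.start() is called.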