Apache Spark

Apache Spark is an open-source, distributed, general-purpose cluster-computing framework. It provides a unified engine for large-scale data processing that supports batch processing, near-real-time streaming, machine learning, and SQL queries.

Key Architecture

In-Memory Computing: Spark's primary advantage over older systems such as Hadoop MapReduce is that it can keep intermediate data in memory, cached across cluster nodes, instead of writing it to disk between processing stages. This greatly reduces disk I/O and speeds up iterative algorithms, which reread the same data many times.
Unified APIs: Provides consistent interfaces across languages (Python, Scala, Java, R) and abstractions:
- DataFrames & Datasets: Structured, tabular representations of data, evaluated lazily and optimized by the Catalyst optimizer. The typed Dataset API is available only in Scala and Java.
- Spark SQL: Direct querying of structured data using SQL.
- Structured Streaming: Stream processing built on the Spark SQL engine, providing end-to-end exactly-once fault-tolerance guarantees when used with replayable sources and idempotent sinks.
Ecosystem Libraries: Includes native support for machine learning (MLlib) and graph processing (GraphX).

Spark is widely deployed in pipelines to ingest, clean, transform, and model data at petabyte scale.

Part of the Data & AI Terms glossary.