Hadoop vs. Spark vs. Kafka — How to Structure Modern Big Data Architecture?

neoqode
15 min read · Oct 13, 2022


There is a lot of discussion in the big data world around Apache Kafka, Hadoop, and Spark. Which tool should you use for your project? What are the benefits of each? In this article, we will compare three popular frameworks from the Apache Software Foundation (Hadoop, Spark, and Kafka) and help you decide which tools to use for your next big data project!

Implications of Modern Data on Data Engineering

The proliferation of data and the rise of new data sources has had a profound impact on the field of data engineering.

Until recently, most data sources within a company’s software were tied to business functions such as sales, marketing, production, and finance. These systems primarily recorded transactions that were saved in a central data warehouse. Nowadays, on top of this transactional data, systems capture substantial amounts of interaction data. A transaction is a business fact, and interaction data provides the context for each transaction.

Let’s look at a simple transaction made in an eCommerce store. Together with information on who bought what, systems need to track and store the whole path that led to the purchase: how much time was spent on each page, marketing history such as social media impressions and click-throughs, which device was used to make the purchase, the buyer’s location, and so on. This data is not collected through one individual system but rather through a set of technologies and solutions (including smartphones and IoT tech). This makes modern data complex and voluminous. It is unquestionable, though, that this big data holds unprecedented value for contemporary businesses.

Data engineering is now responsible for collecting, storing, cleaning, and preparing data for analysis. In addition, data engineers need to be able to work with a variety of tools to build complex processing flows that power modern applications.

Modern big data architecture — modern data sources and analytics systems

With so much data being generated every day, it’s becoming increasingly difficult to manage using traditional methods. This has led to the development of multiple frameworks and technologies to help with the management and processing of big data.

If you want to dig deeper into various features of designing data-intensive applications, head over to our recent article: Future According to Designing Data-Intensive Applications.

Features of a Modern Data Infrastructure

A modern data infrastructure should be able to handle the following:

  • Variety: Ingesting and outputting different types of data (structured, semi-structured, and unstructured) from multiple sources.
  • Velocity: Fast ingestion and processing of data in real time.
  • Volume: Scalable storage and processing of large amounts of data.
  • Cheap raw storage: Ability to store data affordably in its original form.
  • Flexible processing: Ability to run a wide variety of processing engines on the same data.
  • Support for real-time analytics: Ability to support low-latency analytics on live data.
  • Support for modern applications: Ability to power new types of applications that require fast, flexible data processing like BI tools, machine learning systems, log analysis, and more.

As you can see, the requirements for modern data infrastructure are quite complex. It needs to be able to handle different types of data from multiple sources processed with low latency and high throughput. Furthermore, it needs to be able to do all of this while being scalable and affordable.

So, how do you go about building such an infrastructure?

Big Data Frameworks

There are many different technologies that you can use to build a modern data infrastructure. In this article, we will focus on three of the most popular frameworks from the Apache Software Foundation: Apache Hadoop, Apache Spark, and Apache Kafka.

Apache Hadoop as a Data Processing Engine


Apache Hadoop is an open-source, big data processing framework. It is designed to handle large amounts of data (from gigabytes to petabytes) in a distributed manner across a cluster of commodity servers. It’s a cost-effective and highly scalable data collection and processing technology that stores and processes structured, semi-structured, and unstructured data (e.g., social media impressions, online clickstream records, web server logs, manufacturing sensor data, etc.).

The genius behind Hadoop is that it can take an immeasurably large data set and break it down into smaller pieces, which are then sent to different servers or nodes in a network that together create a Hadoop cluster. These machines operate on their assigned big data analytics task simultaneously, with the end result being sent over to the end user as one cohesive information unit. By abstracting away the complexities of distributed computing, Hadoop allows users to directly access the system’s functionality through an easy-to-use API.

Hadoop Ecosystem

Hadoop has several different layers that together constitute the Hadoop ecosystem.

HDFS — Hadoop Distributed File System

HDFS makes up the storage layer of the ecosystem as Hadoop’s native file system. It is a distributed file system designed to store large data sets across a cluster of commodity servers. The Hadoop distributed file system is scalable, fault-tolerant, and provides high-throughput access to data. Even though Hadoop is usually used for distributed data storage, management, and analysis, HDFS itself involves no queries when pulling data; it is therefore closer to a raw data store or data lake than to a database.
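For a sense of what working with HDFS looks like from application code, here is a minimal, hedged sketch using the pyarrow client; the NameNode address, the paths, and the availability of libhdfs on the client machine are all illustrative assumptions rather than details from this article.

```python
# Minimal HDFS read/write sketch with pyarrow (NameNode address and paths
# are illustrative assumptions; libhdfs must be available on the client).
from pyarrow import fs

hdfs = fs.HadoopFileSystem("namenode", 8020)

# Write a small file into HDFS.
with hdfs.open_output_stream("/data/raw/events/sample.txt") as out:
    out.write(b"user=42,action=click,page=/checkout\n")

# Read it back.
with hdfs.open_input_stream("/data/raw/events/sample.txt") as f:
    print(f.read().decode())

# List the directory, similar to `hdfs dfs -ls /data/raw/events`.
for info in hdfs.get_file_info(fs.FileSelector("/data/raw/events")):
    print(info.path, info.size)
```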

YARN — Yet Another Resource Negotiator

YARN is the resource management layer that enables applications to run on a Hadoop cluster. It is responsible for managing resources (CPU, memory, storage) across the cluster and scheduling applications to run on those resources.

Hadoop MapReduce

MapReduce is a programming model for processing large data sets that can be parallelized across a Hadoop cluster. It consists of two phases: the map phase, where data is transformed, and the reduce phase, where the transformed data is aggregated. This approach is appropriate for solutions where you want to extract insights from big data volumes rather than obtain real-time analytics results.
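To make the two phases concrete, below is the classic word-count example written as plain Python scripts for Hadoop Streaming, which lets any executable that reads stdin and writes stdout act as the mapper and reducer. The file names and the data set are illustrative assumptions; the scripts are a minimal sketch rather than production code.

```python
# mapper.py -- the map phase: emit "word<TAB>1" for every word read from stdin.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
# reducer.py -- the reduce phase: Hadoop Streaming delivers mapper output
# sorted by key, so identical words arrive as consecutive lines and can be
# summed in a single pass.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t")
    if word == current_word:
        count += int(value)
    else:
        if current_word is not None:
            print(f"{current_word}\t{count}")
        current_word, count = word, int(value)

if current_word is not None:
    print(f"{current_word}\t{count}")
```

Both scripts would be submitted to the cluster with the Hadoop Streaming jar, which handles splitting the input across nodes, shuffling and sorting the mapper output, and collecting the reducer output back into HDFS.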

Hadoop Common (Hadoop Core)

Hadoop Common is a set of utilities and libraries that are required by other Hadoop layers.

Apache Hadoop Use Cases


There are many different use cases for Apache Hadoop. Generally, with MapReduce, you can execute massive, delay-tolerant computation operations at a relatively low cost. It’s best for archived data that may be analyzed later. Here are a few of the most popular use cases:

  • Web log analysis: Hadoop can be used to process web server logs to generate reports on website usage and activity.
  • Clickstream analysis: Hadoop can be used to process clickstream data to understand user behavior on a website.
  • Big data analytics: Hadoop can be used to run analytics on large data sets to generate insights. Users may access Hadoop MapReduce through a SQL-like interface with Apache Hive to leverage analytics at scale and run queries against data of any size (see the sketch after this list).
  • Calculating tasks that are not time-sensitive: Because Hadoop is not a real-time processing engine, it can be used to calculate tasks that are not time-sensitive. For example, you could use Hadoop to calculate the average conversion rates of a product over time from various landing pages.
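As an illustration of that SQL-like access, here is a minimal, hedged sketch that queries a Hive table from Python with the PyHive package; the server address, database, and table and column names are made-up assumptions.

```python
# Querying Hive from Python with PyHive (host, database, and table names
# are illustrative assumptions). Hive compiles the query into distributed
# jobs over data stored in HDFS.
from pyhive import hive

conn = hive.connect(host="hive-server", port=10000, database="analytics")
cursor = conn.cursor()

cursor.execute("""
    SELECT landing_page, COUNT(*) AS visits
    FROM clickstream_events
    GROUP BY landing_page
    ORDER BY visits DESC
    LIMIT 10
""")

for landing_page, visits in cursor.fetchall():
    print(landing_page, visits)
```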

Advantages of Hadoop

There are many advantages of using Apache Hadoop, including the following:

  • Flexibility: Hadoop can process structured, semi-structured, and unstructured data.
  • Scalability: Hadoop scales horizontally; a cluster can easily be expanded to process more data by adding additional nodes.
  • Fault tolerance and data protection: Hadoop is designed to handle failures gracefully by replicating information to prevent data loss and continue running even if nodes in the cluster fail.
  • Cost-effective: Hadoop can run on cheap commodity hardware, or against inexpensive cloud object storage such as Amazon S3, which makes it a cost-effective solution for big data processing. When you target in-memory processing or network storage, however, the maintenance cost of the solution will be significantly higher.

Limitations of Hadoop

Despite its many advantages, Hadoop has a few limitations that should be considered before using it for big data processing:

  • High throughput, but with high latency: Although able to process massive amounts of data, Hadoop is designed for batch processing, which means that it can take a long time to process large data sets.
  • Not suitable for real-time data processing: Hadoop is not designed for real-time processing of data, which means that it is not ideal for applications that require low latency, such as financial applications.
  • Complexity: Hadoop is a complex system with many components that must be configured and tuned for optimal performance. While Hadoop management is a complex task for larger applications, there are many GUIs available that can help to simplify programming for MapReduce.

Apache Spark as a Batch Processing and Streaming Mechanism

Apache Spark is an open-source, general-purpose distributed processing system used for big data workloads that provides high-level APIs in Java, Scala, Python, and R. It was designed to replace MapReduce and improve upon its shortcomings, such as slow batch processing times and lack of support for interactive and real-time data analysis. This tool uses in-memory caching and optimized query execution to provide fast analytic queries against data of any size.

It is one of the few data processing frameworks that unifies data processing with artificial intelligence: users can execute huge-scale data transformations and analyses and then feed the results into state-of-the-art machine learning algorithms and graph processing applications within the same engine.

Spark Ecosystem

Spark is not just a data processing tool but an ecosystem that contains many different tools and libraries. The most important ones are the following:

Spark Core

Spark Core is the heart of the Spark platform. It contains the basic functionality of Spark, including distributed data processing, task scheduling and dispatching, memory management, fault recovery, and interaction with storage systems.
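As a taste of Spark Core’s programming model, here is a minimal sketch using the low-level RDD API; the data is made up, and the example simply distributes a small collection and aggregates it in parallel.

```python
# Spark Core sketch: distribute a collection as an RDD, then aggregate it.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-core-sketch").getOrCreate()
sc = spark.sparkContext

# (category, amount) pairs distributed across the cluster.
orders = sc.parallelize([("books", 12.0), ("games", 59.9), ("books", 7.5)])

# Sum the amounts per category in parallel.
revenue_per_category = orders.reduceByKey(lambda a, b: a + b)
print(revenue_per_category.collect())

spark.stop()
```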

Spark SQL

This module allows for structured data processing. It contains a relational query processor that supports SQL and HiveQL.
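A minimal sketch of the Spark SQL module: register a DataFrame as a temporary view and query it with plain SQL (the table and column names are illustrative).

```python
# Spark SQL sketch: query a DataFrame through a temporary view.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-sketch").getOrCreate()

purchases = spark.createDataFrame(
    [("alice", "books", 12.0), ("bob", "games", 59.9), ("alice", "games", 19.9)],
    ["user", "category", "amount"],
)
purchases.createOrReplaceTempView("purchases")

spark.sql("""
    SELECT category, SUM(amount) AS revenue
    FROM purchases
    GROUP BY category
""").show()

spark.stop()
```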

Spark Streaming and Structured Streaming

These modules allow Spark to process streaming data. Spark Streaming (the original DStream API) processes live data streams in micro-batches, while Structured Streaming, built on the Spark SQL engine, handles stream processing at a higher level of abstraction and with lower end-to-end latency.
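Here is a minimal Structured Streaming sketch using Spark’s built-in rate source, which generates a timestamped row per tick and is handy for trying the API without any external system; the window size and console sink are illustrative choices.

```python
# Structured Streaming sketch: count events per 10-second window from the
# built-in "rate" source and print the running result to the console.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("structured-streaming-sketch").getOrCreate()

events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

counts = events.groupBy(F.window("timestamp", "10 seconds")).count()

query = (
    counts.writeStream
    .outputMode("complete")
    .format("console")
    .start()
)
query.awaitTermination()
```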

MLlib

This is Spark’s machine learning library. It contains many common machine learning algorithms that can be applied to large data sets.
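To show the shape of an MLlib workflow, here is a hedged sketch that fits a logistic regression model on a tiny, made-up churn data set; the column names and values are illustrative assumptions, not a real model.

```python
# MLlib sketch: assemble feature columns and fit a logistic regression model
# on a tiny, made-up churn data set.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

data = spark.createDataFrame(
    [(5, 120.0, 0.0), (1, 15.0, 1.0), (8, 300.0, 0.0), (0, 5.0, 1.0)],
    ["logins_last_month", "minutes_on_site", "churned"],
)

features = VectorAssembler(
    inputCols=["logins_last_month", "minutes_on_site"], outputCol="features"
).transform(data)

model = LogisticRegression(featuresCol="features", labelCol="churned").fit(features)
model.transform(features).select("churned", "prediction", "probability").show()

spark.stop()
```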

GraphX

This is Spark’s graph processing library that enables the analysis of scalable, graph-structured data.
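GraphX itself exposes a Scala/JVM API; from Python, the closest widely used equivalent is the separate GraphFrames package, which builds a similar graph abstraction on top of DataFrames. The sketch below assumes GraphFrames is installed alongside Spark and uses a tiny, made-up graph.

```python
# Graph processing sketch with GraphFrames (an extra package, assumed to be
# installed): compute how many hops each vertex is from vertex "c".
from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.appName("graph-sketch").getOrCreate()

vertices = spark.createDataFrame(
    [("a", "Alice"), ("b", "Bob"), ("c", "Cara")], ["id", "name"]
)
edges = spark.createDataFrame(
    [("a", "b", "follows"), ("b", "c", "follows")], ["src", "dst", "relationship"]
)

graph = GraphFrame(vertices, edges)
graph.shortestPaths(landmarks=["c"]).show(truncate=False)
```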

Apache Spark Use Cases

Generally, Spark is the best solution when time is of the essence. Apache Spark can be used for a wide variety of data processing workloads, including:

  • Real-time processing and insight: Spark can process data in near real time. For example, you could use Spark Streaming to read live tweets and perform sentiment analysis on them.
  • Machine learning: You can use Spark MLlib to train machine learning models on large data sets and then deploy those models in your applications. It has prebuilt machine learning algorithms for tasks like regression, classification, clustering, collaborative filtering, and pattern mining. For example, you could use Spark MLlib to build a model that predicts customer churn based on their activity data.
  • Graph processing: You can use Spark GraphX to process graph-structured data, such as social networks or road networks. For example, you could use GraphX to instantly find the shortest path between two nodes in a graph.

Advantages of Spark

There are many advantages of using Apache Spark, including the following:

  • Flexibility: Apache Spark can be used for batch processing, streaming, interactive analytics, machine learning, and SQL. All these processes can be seamlessly combined in one application.
  • Speed: Apache Spark is much faster than MapReduce for most workloads as it uses RAM instead of reading and writing intermediate data to disk storage.
  • Developer friendly: Apache Spark has a simple API and wide language support that makes it easy to learn and use.

Limitations of Spark

There are also some limitations and disadvantages to using Apache Spark, including the following:

  • Complexity: While the API is simple, the underlying architecture is complex. This complexity can make it difficult to debug applications and tune performance.
  • Costly infrastructure: Apache Spark relies on RAM for its in-memory computations, and memory is considerably more expensive than disk, so large clusters used for real-time data processing can be costly to run.
  • Close-to-real-time: Apache Spark is not designed for true real-time processing, as it processes data in micro-batches with latencies that rarely drop below roughly 100 milliseconds. For genuinely real-time, event-at-a-time processing, you need to turn to other frameworks like Apache Flink.

Apache Kafka as a Distributed Streaming Platform

Apache Kafka is an open-source, distributed streaming platform that allows developers to create applications that continuously produce and consume data streams. In other words, it enables the creation of applications that react to events as they happen, in real time.

Kafka is a distributed publish-subscribe system in which data flows one way, from producers to consumers. It may be scaled across several servers or data centers and is designed to handle very large volumes of data. Records are replicated and partitioned in such a way that the application can serve a massive number of users simultaneously without any noticeable lag in performance.

Kafka Ecosystem

The base Kafka ecosystem is made up of the following components:

  • Kafka Brokers: These are the servers that make up a Kafka cluster. Each broker hosts a number of topic partitions, and each partition can be replicated across several brokers.
  • Kafka Topics: Topics are named, ordered logs of events to which records are published.
  • Kafka Producers: Producers are processes that publish records to one or more Kafka topics.
  • Kafka Consumers: Consumers are processes that subscribe to one or more Kafka topics and read the records published to those topics (see the producer/consumer sketch after this list).
  • Zookeeper: Zookeeper is a distributed coordination service that is used by Kafka to manage its brokers.
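Here is a minimal, hedged producer/consumer sketch using the kafka-python client; the broker address, topic name, and consumer group are illustrative assumptions.

```python
# Publish and consume JSON records with kafka-python (broker address, topic,
# and group id are illustrative assumptions).
import json
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("page-views", {"user": "42", "page": "/checkout"})
producer.flush()

consumer = KafkaConsumer(
    "page-views",
    bootstrap_servers="localhost:9092",
    group_id="analytics",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for record in consumer:
    print(record.topic, record.partition, record.offset, record.value)
```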

Apache Kafka Use Cases

Kafka can be used for a wide variety of use cases like real-time streaming data pipelines and real-time streaming applications, including the following:

  • Messaging: You can use Kafka as a message broker to publish and subscribe to messages.
  • Website Activity Tracking: You can use Kafka to track website activity in near-real-time, such as user clicks and page views.
  • Metrics: You can use Kafka to collect system metrics from multiple servers and monitor them in near-real-time.
  • Log Aggregation: You can use Kafka to aggregate application logs from multiple servers and monitor them in near-real-time.

Advantages of Kafka

There are many advantages of using Kafka, including the following:

  • Scalability: Kafka is horizontally scalable to support a growing number of users and use cases, meaning that it can handle an increasing amount of data by adding more nodes and partitions to the system.
  • Fault-tolerant: Kafka is fault-tolerant as it replicates data across multiple servers and can automatically recover from node failures.
  • Low latency and high throughput: Kafka has low latency, meaning that messages are processed quickly; it is also designed to offer high throughput when processing data.

Limitations of Kafka

There are also some limitations and disadvantages to using Kafka, including the following:

  • Tweaking messages: It can be difficult to change the format of messages after they have been published, and having brokers transform messages in flight hurts performance (it defeats optimizations such as zero-copy transfer). Kafka works best with a fixed message format.
  • Retention of data and reading directly from Kafka is expensive: If you want to keep data for a long time on Kafka clusters, it will be costly. For data retention, it is best to store the data in cheaper cloud storage data lakes.
  • Unnecessary consuming and computing: Consumers of a Kafka system read the whole record to access a specific field, whereas columnar storage on a data lake allows you to directly access specific fields (using, for instance, Apache Parquet).

If you need a step-by-step guide on the real-life application of Kafka, head over to our blog posts, where we documented the process of building a fast, reliable, and fraud-safe game based on Kafka architecture: Part 1 and Part 2.

Comparing Kafka, Hadoop, and Spark

The key features of Kafka, Hadoop, and Spark can be compared as follows:

  • Hadoop: distributed storage (HDFS) plus batch processing (MapReduce); high throughput but high latency; best suited to archived data and delay-tolerant analytics on affordable, persistent storage.
  • Spark: in-memory batch and stream processing with SQL, machine learning, and graph libraries; near-real-time micro-batches with latencies down to roughly 100 milliseconds; it has no storage layer of its own and processes data held in HDFS, Kafka, and other stores.
  • Kafka: a distributed streaming platform and publish-subscribe messaging system; low latency and high throughput for ingesting and routing event streams; it retains event logs, but long-term retention on the cluster is costly.

Choosing between Kafka, Hadoop, and Spark is not actually the question, as each of these frameworks has its own purpose. The question is how to structure your big data architecture in order to take advantage of the strengths of each tool.

Why are Kafka, Hadoop, and Spark the Go-To Tools for Big Data Processing and Data Analysis?

Kafka, Hadoop, and Spark are the most popular big data processing and data analysis tools because they address the key challenges of big data.

These three tools can be used together to build a complete big data architecture that can handle any type of data, whether it’s structured, unstructured, or streaming, and in mass amounts.

Data analytics pipeline

They offer low latency and high throughput, and they are built to scale easily. When set up and used properly, they are also an economical choice for your infrastructure.

To take advantage of the strengths of each tool, it’s important to understand how they work together as part of big data architecture.


How to Create a Modern Big Data Pipeline with Hadoop, Spark, and Kafka?

In order to create a modern big data pipeline, let’s quickly revise the role that each tool plays:

  • Hadoop is the foundation of your big data architecture. It’s responsible for storing and processing your data.
  • Spark is an in-memory processing engine that can perform real-time stream processing or batch processing on data stored in Hadoop.
  • Kafka is a message broker that can be used to ingest streaming data into Hadoop or process streaming data in real time with Spark.

And now, let’s take a look at how they work together to build a complete big data pipeline. The following diagram shows a typical big data pipeline that uses Hadoop, Spark, and Kafka:

Big data architecture with Kafka, Spark, Hadoop, and Hive for modern applications

As you can see, data is first ingested into Kafka from a variety of sources. The messaging system is the beginning of the big data pipeline, and Apache Kafka acts as the publish-subscribe input mechanism. Kafka then streams the data into other tools for further processing. Apache Spark’s streaming APIs allow for real-time data ingestion, while Hadoop stores the data in HDFS and can process it with MapReduce within the architecture. Spark can then be used to perform real-time stream processing or batch processing on the data stored in Hadoop. Apache Hadoop thus provides the robust foundation of the big data pipeline on which Spark and Kafka operate, with HDFS supplying persistent data storage. The processed data is then used by other applications like data visualization systems, ML tools, search engines, BI tools, etc.

Big data architecture based on Kafka, Hadoop, Spark and other frameworks and DBs
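To ground the diagram, here is a hedged sketch of the central hop in such a pipeline: a Spark Structured Streaming job that consumes a Kafka topic and persists the events as Parquet files on HDFS. The broker address, topic, and HDFS paths are illustrative assumptions, and the spark-sql-kafka connector package is assumed to be on the classpath.

```python
# Pipeline sketch: consume a Kafka topic with Structured Streaming and store
# the events on HDFS as Parquet (addresses, topic, and paths are illustrative).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "page-views")
    .load()
)

# Kafka delivers keys and values as binary; decode the value and keep the
# record timestamp so downstream jobs can work with event time.
events = raw.select(
    F.col("timestamp"),
    F.col("value").cast("string").alias("event_json"),
)

query = (
    events.writeStream
    .format("parquet")
    .option("path", "hdfs://namenode:8020/data/page_views")
    .option("checkpointLocation", "hdfs://namenode:8020/checkpoints/page_views")
    .start()
)
query.awaitTermination()
```

From there, downstream Spark batch jobs, Hive queries, BI tools, or ML pipelines can read the persisted Parquet data, which is exactly the hand-off the diagram above illustrates.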

Lambda Architecture

Lambda architecture is a data-processing architecture designed to handle massive quantities of data by taking advantage of both batch (historical) and stream-processing methods. The architecture also provides a serving layer to query the data. Kafka is the input source in this architecture; Hadoop runs at the batch processing layer, providing persistent data storage and performing the heavy computations behind batch queries; and Spark handles real-time data processing at the speed layer. The serving layer is usually built on a NoSQL database that offers low-latency query access to the results.

Real-time stream processing and batch processing in Lambda Architecture
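A condensed, hedged sketch of the two processing paths follows; the paths, topic, and sink choices are illustrative assumptions, and the serving layer (normally a NoSQL store) is only hinted at.

```python
# Lambda architecture sketch: a batch layer recomputing views from the full
# history on HDFS, and a speed layer maintaining a low-latency view from Kafka.
# Paths, topic names, and sinks are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lambda-sketch").getOrCreate()

# Batch layer: recompute the complete view from the master data set on HDFS
# (assumed to contain a "page" column).
history = spark.read.parquet("hdfs://namenode:8020/data/page_views")
batch_view = history.groupBy("page").count()
batch_view.write.mode("overwrite").parquet("hdfs://namenode:8020/views/page_counts")

# Speed layer: keep an incremental view over recent events arriving via Kafka.
recent = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "page-views")
    .load()
    .select(F.col("value").cast("string").alias("page"))
)
speed_view = recent.groupBy("page").count()
(speed_view.writeStream
    .outputMode("complete")
    .format("memory")            # stand-in for a real serving store
    .queryName("recent_page_counts")
    .start())

# Serving layer: queries merge the precomputed batch view with the
# real-time view, typically out of a low-latency NoSQL database.
```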

Kappa Architecture

Kappa architecture is a variation of the Lambda architecture that only uses stream-processing methods. Kappa architecture discards the batch layer in favor of speed and simplicity. Data is ingested only once and processed as it arrives, with no need to store it first. Again, Kafka works as the pub-sub messaging system, Spark Streaming runs at the real-time streaming layer (as it is able to handle both batch and stream processing), and Hadoop provides the surrounding ecosystem, such as long-term storage and cluster resources, for the other two.

Real-time stream processing in Kappa Architecture
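Because the Kafka log is the single source of truth in a Kappa setup, reprocessing historical data simply means replaying the topic from its earliest retained offset with a fresh streaming job. A hedged sketch, with the broker, topic, and checkpoint location as illustrative assumptions:

```python
# Kappa-style reprocessing sketch: rebuild a view by replaying the Kafka topic
# from the earliest retained offset with a new streaming job.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("kappa-replay").getOrCreate()

replayed = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "page-views")
    .option("startingOffsets", "earliest")   # replay the full retained history
    .load()
    .select(F.col("value").cast("string").alias("page"))
)

(replayed.groupBy("page").count()
    .writeStream
    .outputMode("complete")
    .format("console")
    .option("checkpointLocation", "hdfs://namenode:8020/checkpoints/replay-v2")
    .start()
    .awaitTermination())
```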

Why Is It Important to Set up a Scalable and Reliable Big Data Pipeline?

As the demand for data grows, it becomes increasingly important to have a scalable and reliable big data pipeline. A well-designed big data pipeline can help you:

  • Collect and process large amounts of data quickly and efficiently.
  • Store data safely and securely.
  • Analyze data in real-time to make better business decisions.
  • Visualize data to gain insights into trends and patterns.
  • Build a complete big data architecture that can be used by other applications like machine learning solutions, search engines, BI tools, etc.

Setting up a scalable and reliable big data pipeline is essential for organizations that want to take advantage of big data. It is important to note that big data is not exclusive to the largest internet companies; with a scalable and reliable big data pipeline in place, you can be sure that your organization will be able to leverage large amounts of data quickly and efficiently and create real business value.

If you want to learn more about how to set up a scalable and reliable big data pipeline, contact us today. We would be happy to discuss your specific needs and requirements.
