Apache Flume Tutorial
Apache Flume is a reliable, distributed tool designed to collect streaming data from several sources and move it into HDFS.
Advantages of Apache Flume
- Apache Flume lets us store streaming data from various sources into a centralized store such as HDFS.
- Apache Flume is reliable and fault tolerant: events are buffered in channels until they are safely delivered to the destination.
Features of Apache Flume
- Apache Flume collects data from several web servers and stores it in a centralized store such as HDFS.
- Apache Flume supports several kinds of sources and destinations.
Architecture of Apache Flume
Flume's architecture is based on streaming data flows. It uses a simple, extensible data model that supports online analytic applications.
Components of Apache Flume
The Flume agent receives data from clients or other agents and forwards it to its next destination, which is either a sink or another agent. A Flume agent consists of three main components: source, channel, and sink.
- A source receives data from the various data generators and transfers it to one or more channels in the form of Flume events.
- A channel receives events from a source and buffers them until they are consumed by a sink.
- A sink consumes events from one or more channels and delivers them to the destination, which may be another agent or a centralized store such as HDFS.
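These three components are wired together in an agent's configuration file. Below is a minimal sketch of such a configuration; the agent name `a1`, the netcat source, and the HDFS path are illustrative assumptions, not details from the text above:

```properties
# a1 is the agent name; it has one source, one channel, and one sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# source: receive events from a network port (netcat source)
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# channel: buffer events in memory until the sink consumes them
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# sink: deliver events to HDFS (the path below is an example)
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://localhost:9000/flume/events

# bind the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```

An agent defined this way is typically started with `bin/flume-ng agent --conf conf --conf-file example.conf --name a1`.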
Big Data, as we know, is a collection of large datasets that cannot be processed using traditional computing techniques. Big Data, when analyzed, gives valuable results. Hadoop is an open-source framework that allows us to store and process Big Data in a distributed environment across clusters of computers using simple programming models.
Streaming / Log Data
Generally, most of the data that is to be analyzed is produced by various data sources such as application servers, social networking sites, cloud servers, and enterprise servers. This data comes in the form of log files and events.
Log file − In general, a log file is a file that records the events/actions that occur in a system. For example, web servers record every request made to the server in their log files.
By harvesting such log data, we can −
- assess application performance and locate various software and hardware failures.
- understand user behavior and derive better business insights.
The traditional method of transferring data into the HDFS system is to use the put command. Let us see how to use the put command.
HDFS put Command
The main challenge in handling the log data is in moving these logs produced by multiple servers to the Hadoop environment.
Hadoop File System Shell provides commands to insert data into Hadoop and read from it. You can insert data into Hadoop using the put command as shown below.
$ hadoop fs -put <path of the required file> <path in HDFS where to save the file>
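For instance, to copy a single local log file into an HDFS directory (both paths below are illustrative examples, and running this requires a working Hadoop installation):

```
$ hadoop fs -put /var/log/httpd/access.log /user/hadoop/logs/
```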
Problem with put Command
We can use the put command of Hadoop to transfer data from these sources to HDFS. But, it suffers from the following drawbacks −
- Using the put command, we can transfer only one file at a time, while the data generators produce data at a much higher rate. Since analysis of older data is less accurate, we need a solution that transfers data in real time.
- If we use the put command, the data must be packaged and ready for upload. Since web servers generate data continuously, this is a very difficult task.
What we need here is a solution that can overcome the drawbacks of the put command and transfer the “streaming data” from data generators to centralized stores (especially HDFS) with less delay.
Problem with HDFS
In HDFS, the file exists as a directory entry, and its length is reported as zero until the file is closed. For example, if a source is writing data into HDFS and the network is interrupted in the middle of the operation (without closing the file), then the data written to the file will be lost.
Therefore we need a reliable, configurable, and maintainable system to transfer the log data into HDFS.
Note − In a POSIX file system, whenever a file is being accessed (say, a write operation is in progress), other programs can still read it (at least the portion already saved). This is because the file exists on disk before it is closed.
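This POSIX behavior can be seen with a short shell sketch (the file name is an arbitrary example): a writer keeps the file open on a descriptor while a second program reads the portion already saved.

```shell
# open the file for writing on descriptor 3 and keep it open
# (this plays the role of a long-running writer process)
exec 3> /tmp/posix_demo.log
echo "event 1" >&3
echo "event 2" >&3

# the file is not closed yet, but another program (cat) can
# already read the saved portion -- unlike an unclosed HDFS file
cat /tmp/posix_demo.log

# close the writer's descriptor
exec 3>&-
```

Here `cat` prints both events even though the writer's descriptor is still open at that point.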
To send streaming data (log files, events, etc.) from various sources to HDFS, we have the following tools available at our disposal −
Scribe is a popular tool used to aggregate and stream log data. It is designed to scale to a very large number of nodes and to be robust to network and node failures.
Kafka was developed by the Apache Software Foundation. It is an open-source message broker that can handle feeds with high throughput and low latency.
Apache Flume is a tool/service/data-ingestion mechanism for collecting, aggregating, and transporting large amounts of streaming data, such as log data and events, from various web servers to a centralized data store.
It is a highly reliable, distributed, and configurable tool that is principally designed to transfer streaming data from various sources to HDFS.
In this tutorial, we will discuss in detail how to use Flume with some examples.