Hadoop Ecosystem

Hadoop came onto the stage roughly 13 years ago, promising an integrated data storage and processing infrastructure. With Hadoop, you can create clusters consisting of multiple computers (servers) called nodes.

In a Hadoop cluster, every node communicates with the others to share its state and information. This communication ensures the availability and consistency of the cluster.

We are talking about a cluster, i.e. a group of machines. How can you manage all the members of a group and get them working together toward one purpose? There must be a control mechanism, right? Hadoop follows the master/slave paradigm of distributed systems to manage the cluster. In this architecture, each node has one role within the cluster: master or slave.

You set one of your machines as the master node, which is called the NameNode, and the others become slave nodes, known as DataNodes. The NameNode is the centerpiece of an HDFS file system: it keeps the directory tree of all files in the file system and tracks where across the cluster the file data is kept, but it does not store the data of these files itself. A DataNode stores the data in the Hadoop file system (HDFS). A functional file system has more than one DataNode, with data replicated across them.
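
To make this division of labor concrete, here is a minimal sketch of talking to HDFS from Python over WebHDFS. It assumes the `hdfs` package (`pip install hdfs`) and a NameNode whose web endpoint is reachable at `http://namenode:9870`; the host, user, and paths are placeholders, not part of the original article.

```python
from hdfs import InsecureClient

# The client talks to the NameNode; the NameNode answers metadata queries
# and redirects reads/writes to the DataNodes that actually hold the blocks.
client = InsecureClient('http://namenode:9870', user='hadoop')  # placeholder host/user

# Listing a directory is pure metadata, served by the NameNode.
print(client.list('/user/hadoop'))

# Writing streams the bytes out to DataNodes; HDFS replicates the blocks for us.
client.write('/user/hadoop/hello.txt', data=b'hello hdfs', overwrite=True)

# Reading is redirected to whichever DataNode holds a replica of the block.
with client.read('/user/hadoop/hello.txt') as reader:
    print(reader.read())
```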


Hadoop can run with only one master, but running your cluster like that can leave you in a very problematic state. Hadoop has a "single point of failure" problem: if you lose your NameNode, your cluster goes out of service. But no worries! There is a way to prevent this from happening: run your cluster with two NameNodes, one active and one standby, and have all the DataNode information reported to both of them.

The Hadoop base framework is composed of the following modules:

  • Hadoop Common — contains libraries and utilities needed by other Hadoop modules.
  • Hadoop Distributed File System (HDFS) — a distributed file-system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.
  • Hadoop YARN — (introduced in 2012) a platform responsible for managing computing resources in clusters and using them for scheduling users’ applications.
  • Hadoop MapReduce — an implementation of the MapReduce programming model for large-scale data processing.

The term Hadoop is often used for both the base modules and sub-modules and also for the ecosystem, i.e. the collection of additional software packages that can be installed on top of or alongside Hadoop, such as Apache Pig, Apache Hive, Apache HBase, Apache Phoenix, Apache Spark, Apache ZooKeeper, Apache Impala, Apache Flume, Apache Sqoop, Apache Oozie, and Apache Storm.

[Diagram: the stack can vary depending on project requirements.]

There are two key elements in the Hadoop framework: HDFS and MapReduce. We can define HDFS as storage spread across a number of linked storage devices, and MapReduce is the main approach for manipulating the data residing in HDFS. MapReduce does exactly what its name suggests: mapping and reducing the data. Mapping reads the input and extracts the required data as key/value pairs, and reducing aggregates or filters those pairs according to the given rules and parameters.
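
As an illustration, here is a minimal word-count sketch in the Hadoop Streaming style: the mapper emits (word, 1) pairs and the reducer sums them per word. The sample input and the local driver at the bottom are just for demonstration; on a real cluster the two phases would run as separate scripts launched by the hadoop-streaming jar, reading stdin and writing stdout.

```python
from itertools import groupby

def mapper(lines):
    """Map phase: emit (word, 1) for every word in the input."""
    for line in lines:
        for word in line.strip().split():
            yield word.lower(), 1

def reducer(pairs):
    """Reduce phase: pairs arrive grouped by key; sum the counts per word."""
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == '__main__':
    # Local demo on a tiny in-memory "file".
    sample = ['the quick brown fox', 'the lazy dog']
    for word, count in reducer(mapper(sample)):
        print(f'{word}\t{count}')
```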

Hadoop is a highly capable infrastructure, but what makes it so popular are the abilities it has gained from the software developed around or on top of it. I have already listed most of these modules above: Hive, Spark, ZooKeeper, Flume, Impala, Sqoop, and so on. Each of them has a different use from the others, and all of them do their job really well. After you spend some time thinking about the architectural design of your data-management pipeline, you will be able to decide which one(s) you should use in your project. It is really hard to pick the best one for your needs; each module follows a different development logic, and they differ in ability and purpose.

In this article I am not going to explain any of these tools in detail, but I will give you the basic idea of each, because they are important for understanding the logic behind using Hadoop.

Reminder: HDFS is a distributed file system, and it runs on top of an operating system. It basically provides a reliable storage infrastructure; how you use that infrastructure is completely up to you. Hadoop is just a tool, not an answer in itself.

Apache Sqoop

Sqoop is a tool designed to transfer data between Hadoop and relational databases or mainframes. You can use Sqoop to import data from a relational database management system (RDBMS) such as MySQL or Oracle or a mainframe into the Hadoop Distributed File System (HDFS), transform the data in Hadoop MapReduce, and then export the data back into an RDBMS.
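
For illustration, here is a minimal sketch of driving a Sqoop import from Python. It assumes the `sqoop` binary is on the PATH; the JDBC URL, credentials file, table name, and target directory are placeholders, not values from the original article.

```python
import subprocess

# Import the 'employees' table from MySQL into HDFS as delimited files.
# Every connection detail below is a placeholder for your own environment.
subprocess.run(
    [
        'sqoop', 'import',
        '--connect', 'jdbc:mysql://dbhost:3306/company',
        '--username', 'etl_user',
        '--password-file', '/user/hadoop/.sqoop_password',  # avoid plain-text passwords
        '--table', 'employees',
        '--target-dir', '/user/hadoop/employees',
        '--num-mappers', '4',  # parallel map tasks performing the copy
    ],
    check=True,
)
```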

Apache Hive

Hive is data warehouse software that facilitates reading, writing, and managing large datasets residing in distributed storage, queried using SQL syntax. It provides a mechanism to impose structure on a variety of data formats.
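
A minimal sketch of querying Hive from Python, assuming the PyHive package and a HiveServer2 instance reachable at hive-host:10000; the table and column names are made up for the example.

```python
from pyhive import hive

# Connect to HiveServer2 (host, port, and user are placeholders).
conn = hive.connect(host='hive-host', port=10000, username='hadoop')
cursor = conn.cursor()

# Hive imposes a table structure over files already sitting in HDFS and
# lets us query them with SQL; the query compiles down to batch jobs.
cursor.execute("""
    SELECT page, COUNT(*) AS hits
    FROM web_logs
    GROUP BY page
    ORDER BY hits DESC
    LIMIT 10
""")
for page, hits in cursor.fetchall():
    print(page, hits)
```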

Apache Spark

Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools, including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.
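
A minimal PySpark sketch of the same word-count idea, assuming the pyspark package; the input path is a placeholder and could just as well be a local file.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Build (or reuse) a SparkSession; on a cluster the master URL is set by the launcher.
spark = SparkSession.builder.appName("word-count-example").getOrCreate()

# Read text files (local path or an hdfs:// URI), split lines into words,
# then count occurrences with the DataFrame API.
lines = spark.read.text("hdfs:///user/hadoop/input")  # placeholder path
words = lines.select(F.explode(F.split(F.col("value"), r"\s+")).alias("word"))
counts = words.groupBy("word").count().orderBy(F.desc("count"))

counts.show(10)
spark.stop()
```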

Apache ZooKeeper

ZooKeeper is a high-performance coordination service for distributed applications. It exposes common services such as naming, configuration management, synchronization, and group services.
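
A minimal sketch of two of those services (configuration management and group membership) from Python, assuming the kazoo client library and a ZooKeeper ensemble on localhost:2181; the znode paths are placeholders.

```python
from kazoo.client import KazooClient

# Connect to the ensemble (host list is a placeholder).
zk = KazooClient(hosts='127.0.0.1:2181')
zk.start()

# Configuration management: store a small piece of shared config in a znode.
zk.ensure_path('/myapp/config')
zk.create('/myapp/config/batch_size', b'128', makepath=True)

# Group membership: each worker registers an ephemeral, sequential node that
# disappears when its session dies, so the group always reflects live members.
zk.create('/myapp/workers/worker-', b'', ephemeral=True, sequence=True, makepath=True)
print(zk.get_children('/myapp/workers'))

zk.stop()
```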

Apache Storm

Apache Storm is another free and open-source distributed real-time computation system. It makes it easy to reliably process unbounded streams of data. In many ways it is similar to Spark, but you should think carefully about which one you need and do your research well before making any decision.

Apache Impala

Remember Hive? Impala is very similar and seems to have the same usage as Hive, but the story is different than it appears. Impala provides low latency and high concurrency for BI/analytic queries on Hadoop (something not delivered by batch frameworks such as Apache Hive). Impala also scales linearly, even in multitenant environments.

Apache HBase

HBase is an open-source, distributed database modeled after Google's Bigtable. It runs on top of Hadoop (HDFS).
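
A minimal sketch using the happybase Python client against the HBase Thrift gateway; the host, table, column family, and row keys are placeholders, and the table is assumed to already exist.

```python
import happybase

# Connect via the HBase Thrift server (host is a placeholder).
connection = happybase.Connection('hbase-host')
table = connection.table('metrics')  # assumes a table with column family 'd'

# HBase is a sparse, column-oriented store: we write cells keyed by
# (row key, column family:qualifier) rather than fixed relational columns.
table.put(b'sensor-42|2024-01-01', {b'd:temperature': b'21.5', b'd:humidity': b'40'})

# Point lookups by row key are cheap; scans iterate over ranges of row keys.
print(table.row(b'sensor-42|2024-01-01'))
for key, data in table.scan(row_prefix=b'sensor-42|'):
    print(key, data)

connection.close()
```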

Apache Phoenix

Apache Phoenix enables OLTP and operational analytics in Hadoop for low-latency applications by combining ACID transaction capabilities with the flexibility of NoSQL databases, since it uses HBase as its backing store.
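
A minimal sketch using the phoenixdb Python driver against the Phoenix Query Server; the URL, table, and columns are placeholders invented for the example.

```python
import phoenixdb

# Connect to the Phoenix Query Server (URL is a placeholder).
conn = phoenixdb.connect('http://phoenix-host:8765/', autocommit=True)
cursor = conn.cursor()

# Phoenix exposes HBase tables through SQL; UPSERT covers both insert and
# update, which is what gives it its OLTP flavour on top of HBase.
cursor.execute(
    "CREATE TABLE IF NOT EXISTS users (id BIGINT PRIMARY KEY, username VARCHAR)"
)
cursor.execute("UPSERT INTO users VALUES (?, ?)", (1, 'admin'))
cursor.execute("SELECT id, username FROM users WHERE id = ?", (1,))
print(cursor.fetchone())

conn.close()
```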

I tried to outline the architecture of Hadoop and the technologies that emerged around it. I will keep sharing my experiences and opinions on each of the related software packages listed above.