Apache Hadoop is a collection of open-source software utilities that facilitate using a network of many computers to solve problems involving massive amounts of data and computation.
In simple terms, instead of using one large computer to store and process the data, Hadoop clusters multiple computers together so that massive datasets can be analyzed in parallel, and therefore more quickly.
It is an open source framework that is used to efficiently store and process large datasets ranging in size from gigabytes to petabytes. The core of Apache Hadoop consists of a storage part, known as the Hadoop Distributed File System (HDFS), and a processing part based on the MapReduce programming model. Hadoop splits files into large blocks and distributes them across nodes in a cluster.
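The block-splitting idea can be sketched in a few lines. This is a conceptual illustration only, not the real HDFS implementation: the block size, node names, and round-robin placement below are all simplified assumptions (real HDFS uses 128 MB blocks by default and rack-aware replica placement).

```python
BLOCK_SIZE = 4          # bytes; tiny for demonstration (HDFS defaults to 128 MB)
REPLICATION = 3         # HDFS's default replication factor
NODES = ["node1", "node2", "node3", "node4"]  # hypothetical cluster nodes

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Split a byte string into fixed-size blocks, as HDFS splits files."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, nodes=NODES, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes (round-robin sketch)."""
    placement = {}
    for i, _ in enumerate(blocks):
        placement[i] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement

blocks = split_into_blocks(b"hello hadoop!")
print(len(blocks))               # 4 blocks of up to 4 bytes each
print(place_blocks(blocks)[0])   # ['node1', 'node2', 'node3']
```

Because every block lives on several nodes, the loss of any single machine never loses data, which is the property the modules below build on.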
Hadoop consists of four main modules:
1. Hadoop Common – contains common Java libraries and utilities needed by other Hadoop modules.
2. Hadoop Distributed File System (HDFS) – a distributed file-system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster. It runs on standard or low-end hardware. HDFS provides better data throughput than traditional file systems, in addition to high fault tolerance and native support of large datasets.
3. Hadoop YARN – manages the computing resources in a cluster and uses them to schedule users’ applications. It manages and monitors cluster nodes and resource usage, and schedules jobs and tasks.
4. Hadoop MapReduce – an implementation of the MapReduce programming model for large-scale data processing. It is a framework that lets programs perform parallel computation over data. The map task takes input data and converts it into a dataset expressed as key-value pairs. The output of the map tasks is consumed by reduce tasks, which aggregate it to produce the desired result.
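The map-then-reduce flow described in item 4 can be shown with the classic word-count example. This is a minimal in-memory sketch of the programming model, not Hadoop's actual API: the map step emits (key, value) pairs, and the reduce step aggregates all values for each key.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Reduce: sum the counts emitted for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

counts = reduce_phase(map_phase(["to be or not to be"]))
print(counts)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

In a real Hadoop job the map and reduce functions run on many nodes at once, but the contract between them — key-value pairs in, aggregated values out — is exactly this.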
The term Hadoop is often used for both base modules and sub-modules and also the ecosystem, or collection of additional software packages that can be installed on top of or alongside Hadoop, such as Apache Pig, Apache Hive, Apache HBase, Apache Phoenix, Apache Spark, Apache ZooKeeper, Cloudera Impala, Apache Flume, Apache Sqoop, Apache Oozie, and Apache Storm.
The Hadoop framework itself is mostly written in the Java programming language, with some native code in C and command line utilities written as shell scripts. Though MapReduce Java code is common, any programming language can be used with Hadoop Streaming to implement the map and reduce parts of the user’s program.
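With Hadoop Streaming, the mapper and reducer are ordinary scripts that read lines on stdin and write tab-separated key/value lines on stdout. The sketch below follows that convention but wraps the logic in functions over line iterators so it is easy to test standalone; in a real streaming job each function would be its own script fed by `sys.stdin`.

```python
def streaming_mapper(lines):
    """Emit 'word<TAB>1' for each word, the usual streaming convention."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def streaming_reducer(sorted_lines):
    """Sum counts per key; Hadoop delivers the mapper output sorted by key."""
    current, total = None, 0
    for line in sorted_lines:
        key, value = line.split("\t")
        if key != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = key, 0
        total += int(value)
    if current is not None:
        yield f"{current}\t{total}"

# Simulate the framework: run the mapper, sort its output, run the reducer.
mapped = sorted(streaming_mapper(["hadoop streaming hadoop"]))
print(list(streaming_reducer(mapped)))  # ['hadoop\t2', 'streaming\t1']
```

Note that the reducer only works because the framework sorts the mapper output by key first — that guarantee is what allows it to stream through its input with constant memory.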
Let’s talk about the components.
Apache HBase is the Hadoop database, a distributed, scalable, big data store. It is a column-oriented non-relational database management system that runs on top of the Hadoop Distributed File System (HDFS). Use Apache HBase when you need random, real-time read/write access to your Big Data. The project’s goal is the hosting of very large tables — billions of rows × millions of columns — atop clusters of commodity hardware.
Unlike relational database systems, HBase does not support a structured query language like SQL; in fact, HBase isn’t a relational data store at all.
An HBase system is designed to scale linearly. It comprises a set of standard tables with rows and columns, much like a traditional database.
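Conceptually, an HBase table maps a row key to column families, each family to qualifiers, and each cell to a set of timestamped versions. The sketch below illustrates that data model only — it is not the HBase client API, and the table and column names are invented for the example.

```python
import time

class SketchTable:
    """Toy model of HBase's row key -> family -> qualifier -> versions layout."""

    def __init__(self, families):
        # As in HBase, column families are declared when the table is created.
        self.families = set(families)
        self.rows = {}

    def put(self, row, family, qualifier, value, ts=None):
        """Random write: store a versioned cell under row/family:qualifier."""
        assert family in self.families, "column family must be declared up front"
        cell = (self.rows.setdefault(row, {})
                         .setdefault(family, {})
                         .setdefault(qualifier, {}))
        cell[ts if ts is not None else time.time_ns()] = value

    def get(self, row, family, qualifier):
        """Random read: return the newest version of a cell, or None."""
        versions = self.rows.get(row, {}).get(family, {}).get(qualifier, {})
        return versions[max(versions)] if versions else None

t = SketchTable(families=["info"])
t.put("row-1", "info", "name", "Ada", ts=1)
t.put("row-1", "info", "name", "Ada L.", ts=2)
print(t.get("row-1", "info", "name"))  # 'Ada L.' — the newest version wins
```

Keeping versions per cell rather than overwriting in place is part of what makes HBase's random, real-time writes cheap: a put is an append, not an update of existing storage.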
HBase relies on ZooKeeper for high-performance coordination. ZooKeeper is built into HBase, but if you’re running a production cluster, it’s suggested that you have a dedicated ZooKeeper cluster that’s integrated with your HBase cluster.
HBase works well with Hive, a query engine for batch processing of big data, to enable fault-tolerant big data applications.
HCatalog, Hive’s table and storage management layer, provides a consistent interface between Apache Hive, Apache Pig, and MapReduce.
Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file-system. The framework takes care of scheduling tasks, monitoring them and re-executes the failed tasks.
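The framework-side steps just described — independent input chunks, map output partitioned across reducers, sorted, then grouped and reduced — can be simulated in a few lines. This is a sketch under simplifying assumptions (two "reducers" modeled as plain lists, hash partitioning by key), not Hadoop's scheduler.

```python
from itertools import groupby

NUM_REDUCERS = 2

def run_job(chunks):
    """Simulate split -> map -> partition -> sort -> reduce for word count."""
    # Map: each input chunk is processed independently, as the framework does.
    partitions = [[] for _ in range(NUM_REDUCERS)]
    for chunk in chunks:
        for word in chunk.split():
            # Partition: hashing the key decides which reducer gets the pair.
            partitions[hash(word) % NUM_REDUCERS].append((word, 1))
    result = {}
    for part in partitions:
        part.sort()  # the framework sorts map output before reducing
        for key, pairs in groupby(part, key=lambda kv: kv[0]):
            result[key] = sum(v for _, v in pairs)
    return result

print(run_job(["a b a", "b c"]))  # {'a': 2, 'b': 2, 'c': 1} (key order may vary)
```

Because every pair for a given key hashes to the same partition, each reducer sees all the values for its keys and no others — which is why reducers can run in parallel without coordinating.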
YARN (Yet Another Resource Negotiator)
The fundamental idea of YARN is to split the functionalities of resource management and job scheduling/monitoring into separate daemons. YARN took over cluster resource management from MapReduce, leaving MapReduce streamlined to do only what it does best: data processing.
Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating those programs. It is a high-level platform for creating programs that run on Apache Hadoop. Using Pig, we can perform all of the common data manipulation operations in Hadoop. The language for this platform is called Pig Latin. A major advantage of the language is its rich set of built-in operators, together with the ability for programmers to develop their own functions for reading, writing, and processing data.
Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data query and analysis.
Hive provides a SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop.
Traditional SQL queries must be implemented in the MapReduce Java API to execute SQL applications and queries over distributed data. Hive provides the necessary SQL abstraction to integrate SQL-like queries (HiveQL) into the underlying Java without the need to implement queries in the low-level Java API. It supports analysis of large datasets stored in Hadoop’s HDFS and compatible file systems such as Amazon S3 filesystem and Alluxio.
Hive allows users to read, write, and manage petabytes of data using SQL. What makes Hive unique is the ability to query large datasets, leveraging Apache Tez or MapReduce, with a SQL-like interface.
Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
Sqoop is a command-line interface application. You can use Sqoop to import data from a relational database management system (RDBMS) such as MySQL or Oracle into the Hadoop Distributed File System (HDFS), transform the data in Hadoop MapReduce, and then export the data back into an RDBMS.
Apache Mahout is a project of the Apache Software Foundation to produce free implementations of distributed or otherwise scalable machine learning algorithms focused primarily on linear algebra.
Avro provides a convenient way to represent complex data structures within a Hadoop MapReduce job. Avro data can be used as both input to and output from a MapReduce job, as well as the intermediate format.
Avro is a row-oriented remote procedure call and data serialization framework developed within Apache’s Hadoop project. It uses JSON for defining data types and protocols, and serializes data in a compact binary format. Its primary use is in Apache Hadoop, where it can provide both a serialization format for persistent data, and a wire format for communication between Hadoop nodes, and from client programs to the Hadoop services. Avro uses a schema to structure the data that is being encoded. It has two different types of schema languages; one for human editing (Avro IDL) and another which is more machine-readable based on JSON.
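An Avro schema is itself a JSON document. The record and field names below are invented for illustration, but the structure (`"type": "record"` with a `"fields"` list, and a union type like `["null", "int"]` for optional fields) is standard Avro schema syntax; here it is just built and printed with the standard library, since actually serializing data would require the `avro` package.

```python
import json

user_schema = {
    "type": "record",
    "name": "User",
    "namespace": "example.avro",           # hypothetical namespace
    "fields": [
        {"name": "name", "type": "string"},
        # A union with "null" makes the field optional.
        {"name": "favorite_number", "type": ["null", "int"], "default": None},
    ],
}
print(json.dumps(user_schema, indent=2))
```

Because the schema travels with the data (or is agreed on out of band), Avro's binary encoding can omit field names entirely, which is what keeps it compact on the wire.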
The Apache Ambari project is aimed at making Hadoop management simpler by developing software for provisioning, managing, and monitoring Apache Hadoop clusters. Ambari provides an intuitive, easy-to-use Hadoop management web UI backed by its RESTful APIs.
Apache Chukwa is an open source data collection system for monitoring large distributed systems.
Apache Chukwa is built on top of the Hadoop Distributed File System (HDFS) and Map/Reduce framework and inherits Hadoop’s scalability and robustness. Apache Chukwa also includes a flexible and powerful toolkit for displaying, monitoring and analyzing results to make the best use of the collected data.
ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
All of these kinds of services are used in some form or another by distributed applications. Each time they are implemented, a lot of work goes into fixing the bugs and race conditions that inevitably arise. Because these services are so difficult to implement, applications usually skimp on them at first, which makes them brittle in the presence of change and difficult to manage. Even when done correctly, different implementations of these services lead to management complexity when the applications are deployed.
ZooKeeper is an open-source project that handles maintaining configuration information, naming, distributed synchronization, and group services for various distributed applications. It implements these protocols on the cluster so that applications need not implement them on their own.
Apache Flume is used for moving massive quantities of streaming data into HDFS. A typical example is collecting log data from web servers’ log files and aggregating it in HDFS for analysis.
In a broad sense, it is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows.