Hadoop is no longer just a teenie-weenie stack of code. The future, friends, is indeed 'big'!
Over the years, Hadoop has grown to be of prime importance, so useful that a humongous collection of projects now orbits around it. The Hadoop community evolves rapidly, and new tools are added every now and then to extend what the core platform can do. Here are ten worth knowing.
1. Hadoop
-Java-based Hadoop coordinates the worker nodes in executing a function on the data each node stores locally, then aggregates and reports the results: map and reduce. (A minimal word-count sketch follows this list.)
-While programmers concentrate on writing code for analysing data, Hadoop handles the rest, providing a thin abstraction over local data storage.
-Designed to work around failures of individual machines.
-The code is available here.
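To make map and reduce concrete, here is a minimal sketch of the classic word-count job against Hadoop's mapreduce API; the class names are our own, and the input and output paths come from the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map step: runs where the data lives, emitting (word, 1) pairs.
  public static class TokenMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce step: aggregates the per-node results into final counts.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Hadoop ships this JAR to the nodes holding the data, runs TokenMapper where each block lives, and funnels the per-word tallies through SumReducer into the final counts.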
2. Hive
-Regularises the process of extracting bits from all of the files stored in HDFS.
-Offers an SQL-like language, HiveQL, for pulling snippets out of the files.
-Turns data in standard formats into a structured stash for querying; a JDBC sketch follows below.
-The code is available here.
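For a taste of that SQL-like access, here is a minimal sketch that queries Hive through its JDBC driver; the HiveServer2 address, credentials, and the page_views table are assumptions made for the example.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
  public static void main(String[] args) throws Exception {
    // Register Hive's JDBC driver, then connect to HiveServer2.
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://localhost:10000/default", "hive", "");
         Statement stmt = conn.createStatement();
         // HiveQL reads like ordinary SQL; Hive compiles it into work
         // that runs over the files in HDFS.
         ResultSet rs = stmt.executeQuery(
             "SELECT url, COUNT(*) AS hits FROM page_views "
             + "GROUP BY url ORDER BY hits DESC LIMIT 10")) {
      while (rs.next()) {
        System.out.println(rs.getString("url") + "\t" + rs.getLong("hits"));
      }
    }
  }
}
```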
3. Sqoop
-A command-line tool that moves large tables full of information out of relational databases and into Hadoop, for Hive or HBase.
-Controls the mapping between the tables and the data storage layer, translating the tables into a configurable combination for HDFS, HBase, or Hive. (A small example follows below.)
-The code is available here.
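Sqoop is normally driven from the shell, but the same import can be kicked off from Java through its runTool entry point. A rough sketch, assuming Sqoop 1.x and a MySQL JDBC driver on the classpath, with a hypothetical orders table:

```java
import org.apache.sqoop.Sqoop;

public class OrdersImport {
  public static void main(String[] args) {
    String[] sqoopArgs = {
      "import",
      "--connect", "jdbc:mysql://dbhost/shop",  // source database (hypothetical)
      "--username", "etl",
      "--table", "orders",                      // table to copy over
      "--hive-import"                           // land the data as a Hive table
    };
    int exitCode = Sqoop.runTool(sqoopArgs);    // equivalent to 'sqoop import ...'
    System.exit(exitCode);
  }
}
```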
4. Pig
-Running code written in its own language, called Pig Latin, Pig steers users toward algorithms that are easy to run in parallel across the cluster.
-Built-in functions cover jobs such as averaging data, working with dates, or finding differences between strings; an embedding sketch follows below.
-The code is available here.
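To show how Pig Latin reads, here is a small sketch that embeds Pig in Java through PigServer, running in local mode; the visits.txt file and its column layout are made up for the example.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class VisitStats {
  public static void main(String[] args) throws Exception {
    PigServer pig = new PigServer(ExecType.LOCAL);
    // Each registerQuery line is a Pig Latin statement; Pig compiles
    // the whole script into jobs that run in parallel across a cluster.
    pig.registerQuery(
        "visits = LOAD 'visits.txt' AS (user:chararray, duration:int);");
    pig.registerQuery("grouped = GROUP visits BY user;");
    pig.registerQuery(
        "avg_time = FOREACH grouped GENERATE group, AVG(visits.duration);");
    pig.store("avg_time", "avg_time_out");  // storing triggers execution
  }
}
```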
5. Avro
-Avro bundles the data together with a schema for understanding it.
-Comes with a JSON data structure explaining how the data can be parsed; see the sketch below.
-The code is available here.
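Here is a minimal sketch of that bundling using Avro's Java API; the User schema is invented for the example. Because the schema travels inside the file, any reader can parse the records without outside knowledge.

```java
import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroWrite {
  public static void main(String[] args) throws Exception {
    // The JSON schema that travels with the data.
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
        + "{\"name\":\"name\",\"type\":\"string\"},"
        + "{\"name\":\"age\",\"type\":\"int\"}]}");

    GenericRecord user = new GenericData.Record(schema);
    user.put("name", "Ada");
    user.put("age", 36);

    // DataFileWriter embeds the schema alongside the records.
    try (DataFileWriter<GenericRecord> writer =
             new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
      writer.create(schema, new File("users.avro"));
      writer.append(user);
    }
  }
}
```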
6. Oozie
-Starts the multiple Hadoop jobs that make up a single logical job, each in the right sequence.
-Manages the workflow, specified as a DAG (directed acyclic graph) of actions; a client sketch follows below.
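The workflow itself is defined in an XML file on HDFS; submitting it through the Oozie Java client looks roughly like this, with the server URL, application path, and property values all assumed for the example.

```java
import java.util.Properties;

import org.apache.oozie.client.OozieClient;

public class RunWorkflow {
  public static void main(String[] args) throws Exception {
    OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");

    // The DAG of actions lives in workflow.xml under this path; Oozie
    // walks the graph, launching each Hadoop job in order.
    Properties conf = oozie.createConfiguration();
    conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode/apps/etl");
    conf.setProperty("nameNode", "hdfs://namenode:8020");
    conf.setProperty("jobTracker", "jobtracker:8032");

    String jobId = oozie.run(conf);  // submit and start the workflow
    System.out.println("Workflow job submitted: " + jobId);
  }
}
```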
7. GIS tools
-The GIS (Geographic Information Systems) tools for Hadoop are Java-based tools for understanding and processing geographic information inside Hadoop jobs; a simplified sketch follows below.
-The code is available here.
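The real tools lean on a full geometry library; as a simplified stand-in for the kind of spatial filtering they enable, here is a mapper that keeps only records falling inside a bounding box. The box and the id,lat,lon input layout are assumptions.

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class InBoxMapper extends Mapper<LongWritable, Text, Text, Text> {
  // Rough bounding box around London (hypothetical area of interest).
  private static final double MIN_LAT = 51.3, MAX_LAT = 51.7;
  private static final double MIN_LON = -0.5, MAX_LON = 0.3;

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] fields = value.toString().split(",");  // expected: id,lat,lon
    if (fields.length < 3) return;
    double lat = Double.parseDouble(fields[1]);
    double lon = Double.parseDouble(fields[2]);
    if (lat >= MIN_LAT && lat <= MAX_LAT && lon >= MIN_LON && lon <= MAX_LON) {
      context.write(new Text(fields[0]), value);  // record is inside the box
    }
  }
}
```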
8. Flume
-Flume dispatches 'agents' to gather information to be stored in HDFS.
-These agents are triggered by events and can be chained together; a sample agent configuration follows below.
-The code is available here.
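A minimal sketch of what a single agent looks like in Flume's properties-file configuration; the agent name, the tailed log file, and the HDFS path are all hypothetical.

```properties
# One agent with one source, one channel, one sink.
agent1.sources = logsource
agent1.channels = memchannel
agent1.sinks = hdfssink

# Source: an event fires for each new line the command emits.
agent1.sources.logsource.type = exec
agent1.sources.logsource.command = tail -F /var/log/app.log
agent1.sources.logsource.channels = memchannel

# Channel: buffers events between source and sink.
agent1.channels.memchannel.type = memory

# Sink: the agent hands each event off to files in HDFS.
agent1.sinks.hdfssink.type = hdfs
agent1.sinks.hdfssink.hdfs.path = hdfs://namenode/flume/events
agent1.sinks.hdfssink.channel = memchannel
```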
9. SQL on Hadoop
-All of these offer a faster path to answers, for instance an ad-hoc query of data spread across a huge cluster. You could write a new Hadoop job for the same task, but that can be time-consuming; with SQL, the question can be asked in a simpler, more familiar language.
-Some of them include: HAWQ, Impala, Drill, Stinger, and Tajo.
10. Clouds
-Companies like Amazon are adding another layer of abstraction by accepting just the JAR file filled with software routines; everything else, from spinning up the machines to running the job, is then done by the cloud.