last hacked on Jul 22, 2017

# Hadoop Basics

## Why Hadoop over RDBMS

Both Hadoop and an RDBMS have their uses. An RDBMS is useful for smaller amounts of data that are updated and deleted often. Hadoop is good for large amounts of data with no updates or deletes. Hadoop takes advantage of cheaper commodity hardware to scale with more data, so you don't need to buy larger computers.

## MapReduce

MapReduce is used to concurrently process data using several computers and is often written in Java, Ruby, or Python. The two steps are map and reduce.

1. The map step extracts the key and value from the data and returns them as an output. Then the distinct keys have a set of values.

    ```
    (key1, value1)
    (key1, value2)
    (key2, value1)
    (key2, value2)
    (key2, value3)
    ```

    The `MapReduce` job will transform this into a key and a collection of values.

    ```
    (key1, [value1, value2])
    (key2, [value1, value2, value3])
    ```

2. This is the input for the reduce step. Next, the many values are reduced to one value per key.

    ```
    (key1, final_value1)
    (key2, final_value2)
    ```

One optimization is defining a combiner class. The input for the reduce step might look like this:

```
(key1, [value1, value2])
(key2, [value1, value2, value3])
(key1, [value3, value4, value5])
```

The output would be:

```
(key1, final_value1)
(key2, final_value2)
(key1, final_value3)
```

The combiner would make it become:

```
(key1, new_final_value1)
(key2, final_value2)
```

This can help optimize the runtime because all the values don't necessarily have to be collected for each key before starting the reduce step.

### Java Map

This is the first step of MapReduce.

1. Get the dependencies `hadoop-mapreduce-client-core` and `hadoop-common` with an up-to-date version of Maven.
2. Create a class inheriting from `Mapper`.
3. Override the `map(T key, Text value, Context context)` method.
4. Extract the key and value.
5. If the data is valid, finish by calling the `context.write()` method to set the key and value.

### Java Reduce

The second step of MapReduce.

1. Create a class extending `Reducer`.
2. Override the `reduce(Text key, Iterable<T> values, Context context)` method.
3. Loop through the `Iterable<T>`.
4. Finish by calling the `context.write()` method to set the final value for that key.

### Java Job

The application running the map and reduce is called a job. Little code is required for a job.

1. Create a `Job` object and set its jar and name.
2. Set the input files and output file.
3. Set the `Mapper` and `Reducer` classes.
4. Set the key and value classes.
5. Exit on completion using `job.waitForCompletion(true)`.

## Install Hadoop

1. Visit [Apache Hadoop Releases](http://hadoop.apache.org/releases.html) and download a binary file.
2. Place it in a directory and unzip it.
3. Make sure Java is installed and `JAVA_HOME` is set to Java's location.
4. Set `HADOOP_HOME` and `PATH`. Point `HADOOP_HOME` to the directory you unzipped.

    ```
    % export HADOOP_HOME=your_path/hadoop-x.y.z
    % export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
    ```

5. Check if Hadoop was installed correctly with this command:

    ```
    % hadoop version
    ```
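The map, group-by-key, and reduce steps described above can be sketched in plain Java, without Hadoop, using word count as the example. All class and method names here are illustrative, not part of the Hadoop API; this is just a minimal model of the data flow.

```java
import java.util.*;

public class MapReduceSketch {
    // Map step: emit a (word, 1) pair for every word in every line.
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines)
            for (String word : line.split("\\s+"))
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
        return pairs;
    }

    // Shuffle step: group the values by key, e.g. (key1, [value1, value2]).
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        return grouped;
    }

    // Reduce step: collapse each key's many values into one final value.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            result.put(e.getKey(), sum);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("the quick fox", "the lazy dog");
        Map<String, Integer> counts = reduce(shuffle(map(lines)));
        System.out.println(counts); // {dog=1, fox=1, lazy=1, quick=1, the=2}
    }
}
```

A combiner would simply run the same `reduce` logic on each partial group before the final reduce, shrinking the data shuffled between machines.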
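The Java Map, Java Reduce, and Java Job steps above come together in the classic word-count program against the Hadoop MapReduce API. This is a sketch, not code from this project: the class names and argument layout are my own, and it needs the `hadoop-mapreduce-client-core` and `hadoop-common` dependencies on the classpath to compile and run.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map step: extract each word as the key, with a count of 1 as the value.
    public static class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {       // only write valid data
                    word.set(token);
                    context.write(word, ONE); // set the key and value
                }
            }
        }
    }

    // Reduce step: loop through the Iterable and sum the counts for each key.
    public static class WordReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : values) {
                sum += count.get();
            }
            context.write(key, new IntWritable(sum)); // final value for the key
        }
    }

    // Job: wire the mapper and reducer together, then run to completion.
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(WordMapper.class);
        job.setReducerClass(WordReducer.class);
        job.setCombinerClass(WordReducer.class); // optional combiner optimization
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The reducer doubles as the combiner here because summing is associative; a combiner only works as a drop-in when partial reduce outputs can be reduced again.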

