Three Famous Google Papers

  1. MapReduce: Simplified Data Processing on Large Clusters (2004)
  2. The Google File System (2003)
  3. Bigtable: A Distributed Storage System for Structured Data (2006)

Google MapReduce: The paper that started Big Data as a Subject

  1. Read the MapReduce paper above. (However, in Section 3, read only Section 3.1. The remaining subsections of Section 3 can be skipped on a first reading.)
  2. Study Figure 1 (Execution overview) until you're convinced you understand it. Note the nodes: there is one master node (called the NameNode, i.e., name server, in the case of Hadoop), three mapper nodes, and two reducer nodes. The "intermediate files on local disks" reside on the mapper nodes.
  3. The input files are divided into 5 "splits" (Google terminology), or "tasks" (Hadoop terminology). There are 2 output files, because there are 2 reducers. An input "split" (or "task") is often tuned to be 64 MB in size.
  4. A MapReduce (or Hadoop) job is packaged as a Java program in a jar file. Inside that Java program there is a call to the MapReduce library; the call takes several functions and values as parameters, and these functions and values are distributed to the mapper and reducer nodes of the cluster. (A minimal driver sketch is given below.)
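    For concreteness, here is a minimal sketch of such a driver program, assuming the standard org.apache.hadoop.mapreduce API. It uses Hadoop's bundled TokenCounterMapper and IntSumReducer classes; the class name WordCountDriver and the command-line input/output paths are illustrative, and a hand-written mapper and reducer for the same job are sketched after item 6 below.

      // Minimal Hadoop driver: the "call to MapReduce" inside the jar.
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
      import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
      import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

      public class WordCountDriver {
          public static void main(String[] args) throws Exception {
              Job job = Job.getInstance(new Configuration(), "word count");
              job.setJarByClass(WordCountDriver.class);        // which jar to ship to the cluster
              job.setMapperClass(TokenCounterMapper.class);    // user-defined map function
              job.setCombinerClass(IntSumReducer.class);       // optional combiner ("mini-reducer")
              job.setReducerClass(IntSumReducer.class);        // user-defined reduce function
              job.setNumReduceTasks(1);                        // R, the number of reduce tasks
              job.setOutputKeyClass(Text.class);               // k2 = word
              job.setOutputValueClass(IntWritable.class);      // v2 = count
              FileInputFormat.addInputPath(job, new Path(args[0]));    // input splits come from here
              FileOutputFormat.setOutputPath(job, new Path(args[1]));  // one output file per reducer
              System.exit(job.waitForCompletion(true) ? 0 : 1);
          }
      }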
  5. Next, we provide an overview of the contents of the MapReduce paper. Note that there are a user-defined map function, an optional combiner, and a partitioner (all executed on the mapper nodes), and a reduce function (executed on the reducer nodes); a type-level sketch in Java is given after this sub-list:
    1. map: (k1, v1) → list(k2, v2)
    2. Within each split or task, the emitted pairs are then grouped by key (k2), so each key now corresponds to a list of values:
      list(k2, v2) → list(k2, list(v2))
    3. combiner (optional): list(k2, list(v2)) → list(k2, v3)
    4. R: user-specified integer: the number of reducer nodes to use
    5. Ordering Guarantee: items are processed on the reducer node in sorted order according to key.
    6. partitioner: (k2, list(v2)) → sent to reducer node with id = Hash(k2) mod R
    7. reducer: list(k2, list(v2)) → list(k2, v4)
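    The following framework-free Java sketch simply restates these signatures as types. The interface names are illustrative (they are not the Hadoop API); they are meant only to make the (k1, v1), (k2, v2), v3, v4 bookkeeping explicit.

      import java.util.List;
      import java.util.Map;

      // map: (k1, v1) -> list(k2, v2), executed on the mapper nodes
      interface MapFunction<K1, V1, K2, V2> {
          List<Map.Entry<K2, V2>> map(K1 key, V1 value);
      }

      // combiner (optional): (k2, list(v2)) -> (k2, v3), executed on the mapper nodes
      interface CombineFunction<K2, V2, V3> {
          V3 combine(K2 key, List<V2> values);
      }

      // partitioner: k2 -> id of the reducer node that receives this key,
      // typically Hash(k2) mod R
      interface PartitionFunction<K2> {
          int partition(K2 key, int numReducers);
      }

      // reduce: (k2, list of values for k2) -> (k2, v4), executed on the reducer
      // nodes, with keys presented in sorted order
      interface ReduceFunction<K2, V, V4> {
          V4 reduce(K2 key, List<V> values);
      }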
  6. Finally, an example (implementation of the 'wc' (word count) program; a Hadoop implementation of the map and reduce functions is sketched after this example):
    (Assume that the input file is a series of documents, with each document being split into words. If each document is less than 64 MB, then each split or task will correspond to exactly one document.)
    1. map: (document_id, document) → list(word2, 1)
    2. Within each split or task, the items are then grouped by word2, and so the key (word2) now corresponds to a list of values:
      list(word2, 1) → list(word2, list(1, ...))
    3. combiner: list(word2, list(1, ...)) → list(word2, n)
    4. R: user-specified integer: the number of reducer nodes to use
      (We set R=1 in this example.)
    5. Ordering Guarantee: items are processed on the reducer node in sorted order according to key.
    6. partitioner: (word2, n) → sent to reducer node with id = Hash(word2) mod R
    7. reducer: list(word2, list(n')) → list(word2, n'')
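    A sketch of the corresponding user-defined map and reduce functions in the Hadoop API is given below. Note one practical difference from the description above: with Hadoop's default TextInputFormat, the map input key is a byte offset into the file and the value is one line of text, rather than a (document_id, document) pair. The driver sketched after item 4 would register these classes with job.setMapperClass(WordCountMapper.class) and job.setReducerClass(WordCountReducer.class).

      import java.io.IOException;
      import java.util.StringTokenizer;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Mapper;
      import org.apache.hadoop.mapreduce.Reducer;

      // map: (byte_offset, line_of_text) -> list(word, 1)
      class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
          private static final IntWritable ONE = new IntWritable(1);
          private final Text word = new Text();

          @Override
          public void map(LongWritable key, Text value, Context context)
                  throws IOException, InterruptedException {
              StringTokenizer itr = new StringTokenizer(value.toString());
              while (itr.hasMoreTokens()) {
                  word.set(itr.nextToken());
                  context.write(word, ONE);       // emit (word, 1)
              }
          }
      }

      // reduce: (word, list of partial counts) -> (word, total count)
      class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
          private final IntWritable result = new IntWritable();

          @Override
          public void reduce(Text key, Iterable<IntWritable> values, Context context)
                  throws IOException, InterruptedException {
              int sum = 0;
              for (IntWritable v : values) {
                  sum += v.get();
              }
              result.set(sum);
              context.write(key, result);         // emit (word, total_count)
          }
      }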
  7. Remark: In the special case that the reduce function is associative and commutative, the combiner (on the mapper node) is often set to execute the same function as the reducer; it is then sometimes called a "mini-reducer". In the 'wc' example, that function takes the sum of the integers in the list.
  8. Order inversion for relative frequencies: to compute f(wj | wi) = count(wi, wj) / count(wi), the mapper also emits a special key (wi, *) carrying the marginal count count(wi), and the sort order and partitioner are chosen so that the reducer sees (wi, *) before any (wi, wj) key; the denominator is then known before the individual pair counts arrive. (See Section 3.3.) A sketch is given below.
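    A compact sketch of the pattern in the Hadoop API, assuming whitespace-separated alphanumeric tokens (so that the special "wi *" key sorts before every "wi wj" key under the default Text ordering); the class names are illustrative. The driver would additionally call job.setPartitionerClass(LeftWordPartitioner.class). Note that RelativeFrequencyReducer must not be reused as a combiner, since it emits frequencies rather than partial counts.

      import java.io.IOException;
      import org.apache.hadoop.io.DoubleWritable;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Mapper;
      import org.apache.hadoop.mapreduce.Partitioner;
      import org.apache.hadoop.mapreduce.Reducer;

      // For each consecutive word pair (wi, wj), emit a pair count and a marginal count.
      class PairMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
          private static final IntWritable ONE = new IntWritable(1);

          @Override
          public void map(LongWritable key, Text value, Context context)
                  throws IOException, InterruptedException {
              String[] words = value.toString().trim().split("\\s+");
              for (int i = 0; i + 1 < words.length; i++) {
                  context.write(new Text(words[i] + " " + words[i + 1]), ONE);  // count(wi, wj)
                  context.write(new Text(words[i] + " *"), ONE);                // count(wi)
              }
          }
      }

      // Send every key with the same left word wi to the same reducer.
      class LeftWordPartitioner extends Partitioner<Text, IntWritable> {
          @Override
          public int getPartition(Text key, IntWritable value, int numReducers) {
              String leftWord = key.toString().split(" ")[0];
              return (leftWord.hashCode() & Integer.MAX_VALUE) % numReducers;
          }
      }

      // Because "wi *" sorts before "wi wj", the marginal is known before any pair count.
      class RelativeFrequencyReducer
              extends Reducer<Text, IntWritable, Text, DoubleWritable> {
          private double marginal = 0.0;

          @Override
          public void reduce(Text key, Iterable<IntWritable> values, Context context)
                  throws IOException, InterruptedException {
              int sum = 0;
              for (IntWritable v : values) sum += v.get();
              if (key.toString().endsWith(" *")) {
                  marginal = sum;                                          // denominator count(wi)
              } else {
                  context.write(key, new DoubleWritable(sum / marginal));  // f(wj | wi)
              }
          }
      }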
  9. Trick: Secondary sorting: In Hadoop, the key-value pairs are presented to the reducer node in sorted order according to key. However, if there are many items with the same key, the values for that key are not necessarily sorted. If we also need this secondary sorting, the "value-to-key conversion" trick is to convert (k, v) to ((k, v), v) and define an appropriate sort order for the new composite key (k, v), while partitioning and grouping on k alone. A sketch is given below.
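    A minimal, framework-free sketch of the value-to-key conversion, assuming a string natural key and a numeric value (all names are illustrative): the value is promoted into a composite key, the sort order compares k first and then v, and partitioning (and, in Hadoop, grouping) looks only at k. In a real Hadoop job this corresponds to a custom WritableComparable key registered together with Job.setSortComparatorClass, Job.setGroupingComparatorClass, and Job.setPartitionerClass.

      import java.util.ArrayList;
      import java.util.Comparator;
      import java.util.List;

      class CompositeKey {
          final String naturalKey;   // the original key k
          final long value;          // the value v, promoted into the key

          CompositeKey(String naturalKey, long value) {
              this.naturalKey = naturalKey;
              this.value = value;
          }

          // Sort comparator: order by k first, then by v (this is the secondary sort).
          static final Comparator<CompositeKey> SORT =
              Comparator.comparing((CompositeKey c) -> c.naturalKey)
                        .thenComparingLong(c -> c.value);

          // Partitioning (and grouping) use only the natural key k, so all values
          // for the same k still reach the same reduce call, now in sorted order.
          int partition(int numReducers) {
              return (naturalKey.hashCode() & Integer.MAX_VALUE) % numReducers;
          }

          public static void main(String[] args) {
              List<CompositeKey> keys = new ArrayList<>(List.of(
                  new CompositeKey("apple", 9),
                  new CompositeKey("apple", 2),
                  new CompositeKey("banana", 5)));
              keys.sort(SORT);   // yields ("apple", 2), ("apple", 9), ("banana", 5)
              for (CompositeKey k : keys) {
                  System.out.println(k.naturalKey + "\t" + k.value);
              }
          }
      }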
  10. Joins (relational databases): In the textbook "Data-Intensive Text Processing with MapReduce", see (a sketch of the memory-backed join is given after this list):
    1. Section 3.5.1 (reduce-side join);
    2. Section 3.5.2 (map-side join (merge join for sorted keys)); and
    3. Section 3.5.3 (memory-backed join (hash join)).
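    A minimal, framework-free sketch of the memory-backed (hash) join, with two illustrative relations users(user_id, name) and orders(user_id, amount): the smaller relation is held in an in-memory hash map, and each record of the larger relation is probed against it. In a Hadoop job, the small relation would typically be shipped to every mapper (e.g., via the distributed cache) and the probe would happen inside the map function, so no shuffle or reduce phase is needed.

      import java.util.ArrayList;
      import java.util.HashMap;
      import java.util.List;
      import java.util.Map;

      class MemoryBackedJoin {
          // Join users(user_id, name) with orders(user_id, amount) on user_id.
          // The users map plays the role of the memory-backed (smaller) relation.
          static List<String> join(Map<String, String> userIdToName, List<String[]> orders) {
              List<String> joined = new ArrayList<>();
              for (String[] order : orders) {        // order = {user_id, amount}
                  String name = userIdToName.get(order[0]);
                  if (name != null) {                // inner join: drop unmatched orders
                      joined.add(order[0] + "\t" + name + "\t" + order[1]);
                  }
              }
              return joined;
          }

          public static void main(String[] args) {
              Map<String, String> users = new HashMap<>();
              users.put("u1", "alice");
              users.put("u2", "bob");
              List<String[]> orders = List.of(
                  new String[] {"u1", "30"},
                  new String[] {"u2", "15"},
                  new String[] {"u1", "7"});
              join(users, orders).forEach(System.out::println);
          }
      }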