Web Resources for CS6240/CS4240: (Large Scale) Parallel Data Processing)

Instructor: Gene Cooperman

Spring, 2020
PLACE:  024 East Village
HOURS: Tuesday, 11:45 am - 1:25 pm ; Thursday, 2:50 pm - 4:30 pm 
SYLLABUS (+ office hours): http://course.ccs.neu.edu/cs6240/parent/syllabus.pdf

(Note: The syllabus, above, is the original syllabus. The mid-term was canceled due to the university shut-down, and we are adding two, three, or four quizzes, depending on the time left in the course after interruptions. This will affect the grade weightings, with the final exam being at most 40% of the course grade, and with 10% for each quiz.)

Below, the course web page is now organized in three sections:

Course HW/QUIZ/FINAL, quizzes, final, in chronological order, with hints and added material for assignments and quizzes
Weighting for course grades (for the online version of the course)
Course Resources and Policies
Comments on my Lectures and on the readings in the text by Jimmy Lin and Chris Dyer

A. CS6240/CS4240: (Large Scale) Parallel Data Processing: HW/QUIZ/FINAL

HW1: Due at "end of day" on Friday, Jan. 31
HW2: Due at "end of day" on Friday, Feb. 14
HW3: Due at "end of day" on Friday, Feb. 28

Spark/Scala on-line resources to help with HW3

(Note that HW3 uses Spark and Scala. Here is a good guide to Scala, and a good guide to Spark. We will begin to discuss Spark/Scala during the week of Feb. 18.)
Here are three tutorials for Scala: from scala.lang.org and scala.lang.org: A Scala Tutorial for Java Programmers and tutorialspoint.

And here are some example Scala programs (from spark.apache.org; Please also look at the Demo-spark for HW3 for still a different example of wc.scala.)

And now that you know Scala and you know that RDDs are like Hadoop files in the HDFS, here is an RDD Programming Guide (repeated from "a good guide to Scala", above).

Quick "cheat sheet" for RDDs (again, from the "good guide to Scala", above)
The detailed RDD API (full documentation, general API)
See also the specific families of APIs that this document refers to:

PairRDDFunctions (RDD functions for key-value pairs)
SequenceFileRDDFunctions (RDD functions that treat each item as an element of a sequence, but incompatible with functions for key-value pairs; you can convert from pair to sequence and back, but use only one model for your analysis)

And RDD Lineage, and Logical Execution Plans (Scala transformations versus actions ==> lazy evaluation of transformations, evaluated only when needed for actions ==> toDebugString is available to show the logical execution plan.)

HW4: Due at "end of day" on Thurs., Mar. 26

     (On Spark & Hadoop; Please review Chapter 5 of text and notes for comments on lectures and text for PageRank.)
     There was a question in class about the syntax for chaining mapreduce jobs in Hadoop, for HW4. My favorite link with an example on chaining: map->reduce->map->reduce->... is here: https://netjs.blogspot.com/2018/07/chaining-mapreduce-job-in-hadoop.html
     This is the older link that tries to be very general. (But for HW4, the chaining link above is probably simpler to follow.): https://knpcode.com/hadoop/mapreduce/how-to-chain-mapreduce-job-in-hadoop/

Two-Phase Locking/Two-Phase Commit: Spark/Scala class lectures and on-line resources

     (For Two-Phase Locking/Commit, also see the section below, with more on locks.)

Mid-term: Canceled due to university shutdown (in-class review was held on Mar. 10, but the mid-term was scheduled for the first day of canceled classes)

QUIZ 1: Tuesday, Mar. 24: basic MapReduce and Joins (Hadoop, but not Scala)
    (Each quiz is 10% of the course grade.)
QUIZ 2: (10% of course grade; PageRank; Tues., Mar. 31)
QUIZ 3 (15 minutes, 10% of course grade): Tues., Apr. 7: Locks -- shared and exclusive locks

     (but two-phase locking, two-phase commit, wait-for graphs will not be covered here.
     They will be covered on the last exam, the double-quiz exam below.)

QUIZ 4 (15 minutes, 10% of course grade): Sun., Apr. 12: on CAP and HBase
HW5: Due at "end of day" on Sun., Apr. 12
COURSE LECTURES: CAP, NoSQL, and HBase ("Bigtable")
COURSE LECTURES: Remaining lectures on locks

DOUBLE-QUIZ EXAM (30 minutes): Tues., Apr. 14 -- last day of classes, 15% of course grade:

     single 30-minute quiz (in lieu of a final exam);
     Emphasis on material after Spring break;
     Topics: Material of QUIZ-2, QUIZ-3, QUIZ-4, and especially including two-phase locking, two-phase commit, wait-for graphs.
     (See discussion on locks in course lecture notes on this page.)

Also, see the supplementary material (supplementing the textbook and the course lectures).

B. New weighting of course grades (changed for online version of course)

Course Grading weight:
     QUIZ-1 .. QUIZ-4: 10% each (40%)
     double-quiz (final): 15%
     5 homeworks: 45%

Undergrads:
Will drop lowest hw grade
Will drop the lowest of the four regular quiz grades (QUIZ-1 through QUIZ-4)
Double-quiz final: There will be a sub-question (part of one question) that the undergrads don't have to answer on the double-quiz/final (and they automatically get credit for that question).

C. Course Resources and Policies

The course syllabus is available.
(Since in-classroom lectures were canceled the day before the midterm, a new course weighting will be devised.
This includes the original 30% for homeworks or more; 10% for each quiz; and the remainder for a final exam. The weighting of the final exam will be no more than the original 40%.)
The syllabus lists several texts; the first half of the course especially uses Data-Intensive Text Processing with MapReduce (by J. Lin and C. Dyer);
The homework directory is also available (and now contains HW1.docx, due end of Friday, Jan. 31).
In addition, you will probably want to look at the MapReduce-Demo before beginning your first homework. For example, HW1 mentions pom.xml, and it is available in the Demo directory.
You will also need AWS accounts. See below for "Information on AWS accounts".
See below for "Information on submitting homeworks".

All information can be found either online when you login to your Khoury Linux account (cd /course/cs6240); or else from the web, where it is visible here: CS 6240 course directory.

The syllabus lists six textbooks. Note especially the book: Data-Intensive Text Processing with MapReduce (by Jimmy Lin and Chris Dyer); or alt. When not otherwise indicated, this will be the default textbook. \vskip0.2\answer

For students taking the CS4240 version of the course, you may drop your lowest homework grade (or simply choose to skip one assignment). On the midterm and final exam, at least one of the harder questions for each exam will be labeled "CS6240 only". For those in CS4240, that question will not be graded, and you should not attempt it.

* Information on AWS accounts

Information on getting a free AWS account is also available.
Further information on setting up the account is also available (taken from a link in HW1.docx).
Finally, you must configure your account: Go to your "AWS Educate" account and got to "Account details", and then select "AWS CLI". Then click the "show" button. You can now configure the account with your tokens, etc.
HINT ON USING AWS from one of the TAs: "One or more helpful tips that I can provide is: (a) to use m4.large machines; and (b) to run the aws configure steps before every execution of the 'make aws' command." (Please tell us if you discover a shortcut from configuring every time.)
ANOTHER HINT ON USING AWS: When you run 'make aws' on AWS itself, the MapReduce job may take about 5 minutes. You must wait until the job completes. If you try to examine the status before the job completes, you will simply see an error indication.

* Information on submitting homeworks

In this version of the course, we will use shell scripts on the Khoury Linux computers (login.ccs.neu.edu) to submit the homework.
Make sure you have a Khoury computer account. They are available at: https://my.ccs.neu.edu/account/apply

If you did not develop your homework answer on Khoury Linux, then copy your files to Khoury Linux and login to it as in the following example:
scp -pr myhw1 login.ccs.neu.edu
ssh login.ccs.neu.edu
If you are on Windows, then consider using a Linux virtual machine or the Windows Subsystem for Linux (WSL). Alternatively, you can use native Windows commands such as putty (choose "putty.exe" for 64-bit) and WinSCP. The putty application is for 'ssh', and the WinSCP application is for 'scp'.
Copy the files that you will submit into a single directory, and then create a single tar file. An easy way to do this is to use 'make distro' from the Makefile in the MapReduce-Demo. Modify the 'distro' target to save the files that you wish to submit. (See below under "Deliverables" for what to do if your output file is more than 1 MB.)
In order to double-check what you submit, consider looking at the contents of your tar file with the Linux command: tar -tvf MR-Demo.tar.gz

Next, you will submit the files. If this is hw1, and you named your tar file HW1.tar.gz, then you would do:
/course/cs6240/homework/submit-cs6240-hw1 HW1.tar.gz

* Deliverables for homework submissions

Generally, homework will be due at the end of a Friday. (I interpret "end of Friday" generously. I will usually set the submit script to stop accepting submissions on my Saturday morning. You must submit on time, regardless of any timestamp on the tar file.
Submit the report as a PDF file (not txt, doc, docx, tex etc!) inside your submit directory that you tar. To simplify the grading process, please name this file yourFirstName_yourLastName_HW#.tar.gz. Here yourFirstName and yourLastName are your first and last name; the # character should be replaced by the HW number. So, if you are Amy Smith submitting the solution for HW 7, your solution file should be named Amy_Smith_HW7.pdf: Make sure your tar file includes the project, log and output files as described above (1 PDF file for the report).
IMPORTANT: If your output file is more than 1 MB, then please create a new output-head file that is just the first 1,000 line. Submit only the output-head file in that case. An easy Linux command to create a file with the first 1,000 lines is:
head -1000 output > output-head

D. Comments on my Lectures and on the readings in the text by Jimmy Lin and Chris Dyer

Chapter 2: MapReduce Basics

The text has a good exposition here. But the original Google paper for MapReduce has an even better exposition. For those who want to dig deeper, please see the original Google MapReduce paper.

For the Primitive MapReduce Algorithm:

Please see: this web page, with a link to the original Google MapReduce paper, and with some explanations to help in reading the MapReduce paper.

Chapter 3: Replicated Join (same as Section 3.5.3: Map-Side Join)

A replicated join, or map-side join, is the idea of loading into the memory of each Mapper an entire table (usually the smaller of the tables). This is described in Section 3.5.3.

In order to implement this in Hadoop, you should use API hook described in Section 3.5.1.

Chapter 5: Comments on PageRank

The pseudo-code for PageRank (simplified version, one iteration only) is displayed in Figure 5.8 of the primary text.

A critical concept for understanding PageRank is that the rank of a page can be thought of either as the "importance of the page", or the "mass of the page". In Physics, we know that mass is conserved. We walso want to conserve the total PageRank across all pages. Each iteration of Figure 5.8 will redistribute the PageRank, but the total PageRank across all pages will remain the same.

On each iteration, the Map function "sends its PageRank" across all outgoing edges. So, line 3 of the figure divides the PageRank of node n by the total number of outgoing edges, and each edges passes along that much of the PageRank. in the Reduce function, each node receives a PageRank from each incoming edge. Its new PageRank will be the sum of the PageRanks that ti gets from each incoming edge.

The PageRank algorithm begins by assigning a PageRank of 1 to each page. After each iteration (after each MapReduce job), the PageRank changes. When the PageRanks stop changing (or change by only a small mount), then we can stop the PageRank algorithm. This si the "stopping condition" or "termination condition".

There are three additional features in a full PageRank algorithm:

1. Hyperjumps ( α > 0 ): With probability α, a node will not send its PageRank across the outgoing edges. Instead it will do a "hyper-jump", in which it transfers its total PageRank to another node, chosen at random. With probability 1-α, the node will pass its PageRank across the outgoing edges. This is why we have two terms in equation 5.1 of Section 5.3 (PageRank) of the textbook.
2. Termination (stopping condition): There are many ways to decide when the PageRank algorithm should stop iterating. One is to compare (in the Reducer) the old PageRank and new PageRank of a node. If the percentage change is less than ε, then we say that it didn't change. If no page changes (or if less than k pages change), then we stop (we terminate the algorithm). (In Hadoop, you will probably want to use "counters" to decide termination. Search in Section 5.2 for "counters" in Section 5.2, and then look it up in the Hadoop API.)
3. Loss of mass, and redistribution: Redistribution: The web crawl will stop at some point, and stop downloading new web pages. So, we have "edge pages" (or "dangling nodes"). Since those edge pages have no outgoing edges, we would lose mass. There are several ways discussed in Section 5.3 to identify the missing mass that was lost, and then redistribute it.

COURSE LECTURES: Two-Phase Locking/Two-Phase Commit

Serializability:
A parallel job could be implemented by a serial schedule (e.g., 'make pseudo' for Hadoop). There are many serial schedules by which one could do the parallel operations. If all of the serial schedules get the same answer, then the parallel job is serializable. If at least one of the schedules produces a different answer, then the parallel job is not serializable. We want all parallel jobs to be serializable -- perhaps by adding extra locks (shared and exclusive).
Advisory versus mandatory locks
Advisory locks will generally require the programmer to use acquire/release calls in their code. There will be an acquire/release pair around each resource. The lock may be exclusive or shared. It is a programmer error if there is no acquire/release around the use of the resource. There also exist mandatory locks, enforced by the operating system. Mandatory locks are more expensive, since they require the application to make a call to the operating system (e.g., requiring more than 1 millisecond on a single computer).
Two-phase locking
Phase 1: Everyone acquires resources (exclusive or shared lock)
Detect deadlock (using the "wait-for graph" idea from distributed deadlocks).
If deadlock is detected, use random back-off: back-off (release resources), and restart after a random delay.
Phase 2: Else, everyone does the work and releases resources.
Two-phase commit
Similar, but now detecting an inconsistent transaction. So, the recipe is now:
Phase 1: Everyone performs the transaction temporarily. There is then a check if the result is consistent. Each client reports back if any inconsistency is detected.
Phase 2: If any agent reported an inconsistency, then the transaction is aborted, and we try again. If no inconsistency is reported, then we commit the transaction. (We make the transaction permanent.)

Example: Joe and Jane have a bank account with $100.
Phase 1: Joe and Jane each want to withdraw $50. So, the bank sends a message to Joe and Jane: "$100 in account."
Joe and Jane each send back a message to the bank: "I withdrew $50. Now, $50 left in account." (Joe and Jane need to wait now to see if the transaction will be committed.)
Now, Joe, Jane, and the bank each report whether the transaction was consistent. Joe and Jane say "Yes, it's all consistent."
The bank says, "No, I had $100. Two times $50 was withdrawn. But I was told there is still $50 in the account. It's not consistent!"
Phase 2: Two agents reported "Yes (consistent)". One agent reported "No (inconsistent)". So the transaction is aborted. It never becomes permanent.
Result: Joe or Jane might try a random back-off (essentially, the two-phase locking). Or maybe Joe or Jane abandons the transaction, and follows a different strategy.

Note that the use of mandatory locks is easiest, since only one process can make a call to the operating system at one time. But it costs 1 millisecond (about the same time as executing 10 million assembly instructions).

If that is inefficient, the next thing to try is two-phase locking. On one computer, this requires the implementation of acquire/release to use an atomic instruction on a memory location to check if someone else has a lock, and if not, then to claim the lock for this process. If there are many transactions, then two-phase locking becomes expensive because we will often detect deadlock, and have to do a random back-off.

If two-phase locking becomes expensive, then we can implement the more efficient two-phase commit. We never do the actual acquire in Phase 1. We just announce a plan. But this is more difficult to implement, since it often requires a central coordinator process that examines the transaction plan, and decides whether to accept it or abort.

CAP, NoSQL, and HBase ("Bigtable")

Slides from Prof. Mirek Riedewald on CAP "theorem" and HBase (the open source re-implementation of Google Bigtable): http://www.ccs.neu.edu/home/mirek/classes/CS6240-stable/CAP-HBase-Hive.pdf (but please ignore the material on Hive; Hive will not be in the course)
Original Bigtable paper of Google (especially, see page 2 of the paper)
HBase: The Definitive Grant (by Lars George; O'Reilly series) -- This is only supplementary, and you will not be held responsible for this on an exam.