Spring, 2020
PLACE: 024 East Village
HOURS: Tuesday, 11:45 am - 1:25 pm; Thursday, 2:50 pm - 4:30 pm
SYLLABUS (+ office hours): http://course.ccs.neu.edu/cs6240/parent/syllabus.pdf
(Note: The syllabus above is the original syllabus. The mid-term was canceled due to the university shut-down, and we are adding two, three, or four quizzes, depending on the time left in the course after interruptions. This will affect the grade weightings, with the final exam being at most 40% of the course grade and with 10% for each quiz.)
HW1: Due at "end of day" on Friday, Jan. 31
HW2: Due at "end of day" on Friday, Feb. 14
HW3: Due at "end of day" on Friday, Feb. 28
Spark/Scala on-line resources to help with HW3:
- (Note that HW3 uses Spark and Scala. Here is a good guide to Scala, and a good guide to Spark. We will begin to discuss Spark/Scala during the week of Feb. 18.)
- Here are three tutorials for Scala: two from scala-lang.org (including A Scala Tutorial for Java Programmers), and one from tutorialspoint.
- And here are some example Scala programs (from spark.apache.org; please also look at the Demo-spark for HW3 for yet another example of wc.scala).
- And now that you know Scala, and you know that RDDs are like Hadoop files in HDFS, here is an RDD Programming Guide (repeated from "a good guide to Spark", above).
- Quick "cheat sheet" for RDDs (again, from the "good guide to Scala", above)
- The detailed RDD API (full documentation, general API)
See also the specific families of APIs that this document refers to:
- PairRDDFunctions (RDD functions for key-value pairs)
- SequenceFileRDDFunctions (RDD functions that treat each item as an element of a sequence, but incompatible with functions for key-value pairs; you can convert from pair to sequence and back, but use only one model for your analysis)
- And RDD Lineage and Logical Execution Plans (Scala transformations versus actions ==> lazy evaluation of transformations, evaluated only when needed for actions ==> toDebugString is available to show the logical execution plan). See the short sketch below.
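To make the transformation-versus-action distinction concrete, here is a minimal word-count sketch you could try in spark-shell. It is not part of any homework: the file name "input.txt" is made up, and `sc` is the SparkContext that spark-shell provides.

    // Word count, one transformation per line; nothing executes until the action below.
    val lines  = sc.textFile("input.txt")                    // transformation
    val words  = lines.flatMap(line => line.split("\\s+"))   // transformation
    val pairs  = words.map(word => (word, 1))                // transformation: now a pair RDD
    val counts = pairs.reduceByKey(_ + _)                    // transformation (a PairRDDFunctions method)

    println(counts.toDebugString)   // prints the lineage / logical execution plan
    val top = counts.take(5)        // action: only here does Spark actually run the job
    top.foreach(println)

Until the take(5) action runs, Spark only records the lineage; reduceByKey is one of the PairRDDFunctions mentioned above.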
(On Spark & Hadoop; Please review Chapter 5 of text and notes for comments on lectures and text for PageRank.)
There was a question in class about the syntax for chaining mapreduce jobs in Hadoop, for HW4. My favorite link with an example on chaining: map->reduce->map->reduce->... is here: https://netjs.blogspot.com/2018/07/chaining-mapreduce-job-in-hadoop.html
This is the older link that tries to be very general. (But for HW4, the chaining link above is probably simpler to follow.): https://knpcode.com/hadoop/mapreduce/how-to-chain-mapreduce-job-in-hadoop/
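The linked examples are in Java. Since HW3 introduces Scala, here is a rough sketch of the same idea written as a Scala driver over the standard Hadoop API; the mapper/reducer class names (commented out) and the paths are placeholders, not anything taken from the homework.

    // Chaining two MapReduce jobs: job 2 reads the output directory of job 1.
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.mapreduce.Job
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

    object ChainDriver {
      def main(args: Array[String]): Unit = {
        val conf = new Configuration()

        val job1 = Job.getInstance(conf, "pass 1")
        job1.setJarByClass(getClass)
        // job1.setMapperClass(classOf[FirstMapper]); job1.setReducerClass(classOf[FirstReducer])
        FileInputFormat.addInputPath(job1, new Path("input"))
        FileOutputFormat.setOutputPath(job1, new Path("pass1-out"))
        if (!job1.waitForCompletion(true)) sys.exit(1)   // block: job 2 needs job 1's output

        val job2 = Job.getInstance(conf, "pass 2")
        job2.setJarByClass(getClass)
        // job2.setMapperClass(classOf[SecondMapper]); job2.setReducerClass(classOf[SecondReducer])
        FileInputFormat.addInputPath(job2, new Path("pass1-out"))   // chained input
        FileOutputFormat.setOutputPath(job2, new Path("final-out"))
        sys.exit(if (job2.waitForCompletion(true)) 0 else 1)
      }
    }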
Two-Phase Locking/Two-Phase Commit: Spark/Scala class lectures and on-line resources
(For Two-Phase Locking/Commit, also see the section below, with more on locks.)
Mid-term: Canceled due to university shutdown (in-class review was held on Mar. 10, but the mid-term was scheduled for the first day of canceled classes)
QUIZ 1: Tuesday, Mar. 24: basic MapReduce and Joins (Hadoop, but not Scala)
(Each quiz is 10% of the course grade.)
QUIZ 2 (10% of course grade): Tues., Mar. 31: PageRank
QUIZ 3 (15 minutes, 10% of course grade): Tues., Apr. 7: Locks -- shared and exclusive locks
(Two-phase locking, two-phase commit, and wait-for graphs will not be covered here; they will be covered on the last exam, the double-quiz exam below.)
QUIZ 4 (15 minutes, 10% of course grade): Sun., Apr. 12: CAP and HBase
HW5: Due at "end of day" on Sun., Apr. 12
COURSE LECTURES: CAP, NoSQL, and HBase ("Bigtable")
COURSE LECTURES: Remaining lectures on locks
DOUBLE-QUIZ EXAM (30 minutes): Tues., Apr. 14 -- last day of classes, 15% of course grade:
single 30-minute quiz (in lieu of a final exam);
Emphasis on material after Spring break;
Topics: Material of QUIZ-2, QUIZ-3, QUIZ-4, and especially including two-phase locking, two-phase commit, wait-for graphs.
(See discussion on locks in course lecture notes on this page.)
Also, see the supplementary material (supplementing the textbook and the course lectures).
Course Grading weight:
QUIZ-1 .. QUIZ-4: 10% each (40%)
double-quiz (final): 15%
5 homeworks: 45%
Undergrads:
Will drop lowest hw grade
Will drop the lowest of the four regular quiz grades (QUIZ-1 through QUIZ-4)
Double-quiz final: There will be a sub-question (part of one question)
that the undergrads don't have to answer on the
double-quiz/final (and they automatically get credit for that question).
The course syllabus is available.
(Since in-classroom lectures were canceled the day before the midterm, a new course weighting will be devised.
This includes the original 30% (or more) for homeworks; 10% for each quiz; and the remainder for a final exam. The weighting of the final exam will be no more than the original 40%.)
The syllabus lists several texts; the first half of the course especially uses Data-Intensive Text Processing with MapReduce (by J. Lin and C. Dyer).
The homework directory is also available (and now contains HW1.docx, due end of Friday, Jan. 31).
In addition, you will probably want to look at the MapReduce-Demo before beginning your first homework. For example, HW1 mentions pom.xml, and it is available in the Demo directory.
You will also need AWS accounts. See below for "Information on AWS accounts".
See below for "Information on submitting homeworks".
All information can be found either online when you login to your Khoury Linux account (cd /course/cs6240); or else from the web, where it is visible here: CS 6240 course directory.
The syllabus lists six textbooks. Note especially the book Data-Intensive Text Processing with MapReduce (by Jimmy Lin and Chris Dyer). When not otherwise indicated, this will be the default textbook.
Students taking the CS4240 version of the course may drop their lowest homework grade (or simply choose to skip one assignment). On the midterm and final exam, at least one of the harder questions on each exam will be labeled "CS6240 only". For those in CS4240, that question will not be graded, and you should not attempt it.
Information on getting a free AWS account is also available.
Further information on setting up the account is also available (taken from a link in HW1.docx).
Finally, you must configure your account: Go to your "AWS Educate" account, go to "Account details", and then select "AWS CLI". Then click the "show" button. You can now configure the account with your tokens, etc.
HINT ON USING AWS from one of the TAs: "One or more helpful tips that I can provide: (a) use m4.large machines; and (b) run the aws configure steps before every execution of the 'make aws' command." (Please tell us if you discover a shortcut that avoids configuring every time.)
ANOTHER HINT ON USING AWS: When you run 'make aws' on AWS itself, the MapReduce job may take about 5 minutes. You must wait until the job completes. If you try to examine the status before the job completes, you will simply see an error indication.
In this version of the course, we will use shell scripts on the Khoury Linux computers (login.ccs.neu.edu) to submit the homework.
Make sure you have a Khoury computer account. They are available at: https://my.ccs.neu.edu/account/apply
- If you did not develop your homework answer on Khoury Linux, then copy your files to Khoury Linux and login to it as in the following example:
scp -pr myhw1 login.ccs.neu.edu:
ssh login.ccs.neu.edu
If you are on Windows, then consider using a Linux virtual machine or the Windows Subsystem for Linux (WSL). Alternatively, you can use native Windows commands such as putty (choose "putty.exe" for 64-bit) and WinSCP. The putty application is for 'ssh', and the WinSCP application is for 'scp'.
- Copy the files that you will submit into a single directory, and then create a single tar file. An easy way to do this is to use 'make distro' from the Makefile in the MapReduce-Demo. Modify the 'distro' target to save the files that you wish to submit. (See below under "Deliverables" for what to do if your output file is more than 1 MB.)
- In order to double-check what you submit, consider looking at the contents of your tar file with the Linux command: tar -tvf MR-Demo.tar.gz
- Next, you will submit the files. If this is hw1, and you named your tar file HW1.tar.gz, then you would do:
/course/cs6240/homework/submit-cs6240-hw1 HW1.tar.gz
* Deliverables for homework submissions
- Generally, homework will be due at the end of a Friday. (I interpret "end of Friday" generously. I will usually set the submit script to stop accepting submissions on my Saturday morning.) You must submit on time, regardless of any timestamp on the tar file.
- Submit the report as a PDF file (not txt, doc, docx, tex, etc.!) inside the directory that you tar. To simplify the grading process, please name this file yourFirstName_yourLastName_HW#.pdf, where yourFirstName and yourLastName are your first and last name, and the # character is replaced by the HW number. So, if you are Amy Smith submitting the solution for HW 7, your report file should be named Amy_Smith_HW7.pdf. Make sure your tar file includes the project, log, and output files as described above (1 PDF file for the report).
IMPORTANT: If your output file is more than 1 MB, then please create a new output-head file that contains just the first 1,000 lines. Submit only the output-head file in that case. An easy Linux command to create a file with the first 1,000 lines is:
head -1000 output > output-head
In order to implement this in Hadoop, you should use the API hook described in Section 3.5.1.
A critical concept for understanding PageRank is that the rank of a page can be thought of either as the "importance of the page" or the "mass of the page". In Physics, we know that mass is conserved. We also want to conserve the total PageRank across all pages. Each iteration of Figure 5.8 will redistribute the PageRank, but the total PageRank across all pages will remain the same.
On each iteration, the Map function "sends its PageRank" across all outgoing edges. So, line 3 of the figure divides the PageRank of node n by the total number of outgoing edges, and each edge passes along that much of the PageRank. In the Reduce function, each node receives a PageRank from each incoming edge. Its new PageRank will be the sum of the PageRanks that it gets from each incoming edge.
The PageRank algorithm begins by assigning a PageRank of 1 to each page. After each iteration (after each MapReduce job), the PageRank changes. When the PageRanks stop changing (or change by only a small amount), then we can stop the PageRank algorithm. This is the "stopping condition" or "termination condition".
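Here is a tiny plain-Scala sketch of one simplified iteration (e.g., for the Scala REPL), just to show the "conservation of mass" bookkeeping. The three-node graph is made up, and dangling nodes and the random-jump factor are ignored.

    // One simplified PageRank iteration, mirroring the map/reduce structure of Figure 5.8.
    val graph = Map(                                   // made-up graph: node -> outgoing edges
      "A" -> List("B", "C"),
      "B" -> List("C"),
      "C" -> List("A")
    )
    var ranks = graph.keys.map(n => n -> 1.0).toMap    // start every page at rank 1 (total mass = 3.0)

    // "Map" step: each node sends rank / outDegree along every outgoing edge.
    val contributions = graph.toList.flatMap { case (node, outEdges) =>
      val share = ranks(node) / outEdges.size
      outEdges.map(dest => dest -> share)
    }

    // "Reduce" step: each node's new rank is the sum of what it received.
    // (A real job must also carry the graph structure and keep nodes with no in-links.)
    ranks = contributions.groupBy(_._1).map { case (node, cs) => node -> cs.map(_._2).sum }

    println(ranks)              // e.g., A -> 1.0, B -> 0.5, C -> 1.5
    println(ranks.values.sum)   // still 3.0: total PageRank is conserved

The individual ranks change on every iteration, but the sum stays fixed, which is exactly the "mass conservation" property described above.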
There are three additional features in a full PageRank algorithm; see Chapter 5 of the text for those.
Example: Joe and Jane have a bank account with $100.
Phase 1: Joe and Jane each want to withdraw $50. So, the bank sends a message to Joe and Jane: "$100 in account."
Joe and Jane each send back a message to the bank: "I withdrew $50. Now, $50 left in account." (Joe and Jane need to wait now to see if the transaction will be committed.)
Now, Joe, Jane, and the bank each report whether the transaction was consistent. Joe and Jane say "Yes, it's all consistent."
The bank says, "No, I had $100. Two times $50 was withdrawn. But I was told there is still $50 in the account. It's not consistent!"
Phase 2: Two agents reported "Yes (consistent)". One agent reported "No (inconsistent)". So the transaction is aborted. It never becomes permanent.
Result: Joe or Jane might try again after a random back-off (essentially as in two-phase locking). Or maybe Joe or Jane abandons the transaction and follows a different strategy.
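As a rough sketch of the Phase-2 decision rule in this bank example (commit only if every participant reported "consistent"), here it is in Scala; the Vote class and the participant names are just for illustration, not part of any assignment.

    // Toy Phase-2 decision: the transaction commits only if *every* vote was "consistent".
    case class Vote(participant: String, consistent: Boolean)

    def decide(votes: Seq[Vote]): String =
      if (votes.forall(_.consistent)) "COMMIT" else "ABORT"

    val votes = Seq(
      Vote("Joe",  consistent = true),
      Vote("Jane", consistent = true),
      Vote("Bank", consistent = false)   // $100 - $50 - $50 cannot leave $50 in the account
    )
    println(decide(votes))               // prints ABORT: the transaction never becomes permanent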
Note that the use of mandatory locks is easiest, since only one process can make a call to the operating system at one time. But it costs 1 millisecond (about the same time as executing 10 million assembly instructions).
If that is inefficient, the next thing to try is two-phase locking. On one computer, this requires the implementation of acquire/release to use an atomic instruction on a memory location to check if someone else has a lock, and if not, then to claim the lock for this process. If there are many transactions, then two-phase locking becomes expensive because we will often detect deadlock, and have to do a random back-off.
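As a rough sketch of that acquire/release idea in Scala, using java.util.concurrent.atomic.AtomicBoolean as the "atomic instruction on a memory location" (the SpinLock and LockDemo names are made up):

    import java.util.concurrent.atomic.AtomicBoolean

    // A lock on one machine: acquire = atomic compare-and-set on a memory flag.
    class SpinLock {
      private val held = new AtomicBoolean(false)

      // Succeeds only if we atomically flip the flag from false to true,
      // i.e., nobody else currently holds the lock.
      def tryAcquire(): Boolean = held.compareAndSet(false, true)

      def release(): Unit = held.set(false)
    }

    object LockDemo {
      def main(args: Array[String]): Unit = {
        val lock = new SpinLock
        if (lock.tryAcquire()) {
          try println("got the lock; read/write the shared data here")
          finally lock.release()
        } else {
          println("someone else holds the lock: back off for a random time and retry")
        }
      }
    }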
If two-phase locking becomes expensive, then we can implement the more efficient two-phase commit. We never do the actual acquire in Phase 1. We just announce a plan. But this is more difficult to implement, since it often requires a central coordinator process that examines the transaction plan, and decides whether to accept it or abort.