Objectives | Summary | Details | Submission | Grading

Assignment 4: Multi-Threading

DUE: Aug 5, 2018, midnight.

Objectives

Summary

This week we're working with threads. One step of creating a search engine is to index all the items we might want to search. This gets around the problem of having to search through everything each time we query. Here's an example: If I want to get a list of movies that are a particular genre, if I don't have an index, the only thing I can do is load up a list of movies and then search through them (possibly sorting via genre first, then binary search). But, if I've indexed the list of movies by genre, then I already have that list. Then, searching through that list of comedies becomes much more efficient.

How does this all fit together?

In the bigger scheme of things, we're putting pieces together to build a search engine. A search engine has a few components: a file crawler, a file processor, an indexer, and a query processor. The file crawler starts in one place, and traverses to find other files. On the web, this means starting at one page and following all the links on the page; in a file system, it means starting in one directory and traversing through the directory structure until all files have been foun. The file processor takes each of those files and finds significant words or concepts in the file. The indexer creates a data structure that (essentially) populates a data structure that makes it easy to find documents that have similar topics, or find documents that contain a certain word. The query processor lets us work with that data structure to get the results based on what we ask for.

In this homework, I'm providing you with a working movie indexer. We have a bunch of movie data, and you're going to build the pieces together to index the titles of these movies. Once the movies have been indexed, when we query the index, we will list out all of the movie titles that contain that word.

The trick is that we have about 5 million movies, videos and TV shows that we want to index. We can't just load them all into memory and search through them! We have to index them.

What does it mean to index them? Well, we start with a bunch of files (about 1000), and each file contains about 10,000 records. We open each file, get the words in the title of the movie from a row, and save which file and which row that record is in. Then, we want to find the movie record that has a title with a given word, we look up that word in our index, which has filename and row number of a movie with that word, and we open that file, go to that row, and print it out.

So what are you building in this assignment? First, you're going to write the part that does the file processing. This should be straightforward; you've done this a bunch already. But, we have so much data to deal with, we're going to throw a few threads at it at a time. Because your virtualbox is set up to have a single core, you won't see much improvement at first. But, you likely have multiple cores available to it. In the very last step, you'll attempt to run your code with multiple cores, which you will hopefully see an improvement.

Finally, if you haven't already, you will be required to use the INTERFACES provided for this assignment. In fact, if you use the provided interfaces, it should be very straightforward, and much more complicated if you try to break the abstractions.

What does our system look like?

The following picture is an overview of the system components:

You'll primarily be working with the FileParser (in orange), as well as the MainSystem.

The overall MovieIndex and DocSet look something like this:

We have a unique DocSet for every unique word in every movie title. The DocSet maintains a list of DocId's (which document contains the movie with this word) and a list of rows in that document (which row the specific movie is in).

When we ask the index for a set SearchResults for a given word, a list of docId-rowId pairs is returned. Then, we use that information to go look up the specific row in the file and return that row to the customer.

This is the overall flow:



Step 1: Write the FileParser

Download the code for this assignment. Spend some time familiarizing yourself, specifically with the header files that describe the interface to the provided system.

The most relevant files:

Then, open the FileParser code and implement the specified pieces. Specifically, what you'll be doing is going through all the files that have been crawled (that is, identified and put in the docId map), parse them, and pass them to the movieIndex to be indexed.

Run the tests and your code, making sure you're satisfied with it.

Finishing Step 1

When you're done with this step:

An example invocation:

adrienne@virtualbox$ ./main data_small/
Crawling directory tree starting at: data_small/
Crawled 24 files.
Parsing and indexing files...


Took 432313 microseconds to parse 24 files.

22197 entries in the index.
Enter a term to search for, or q to quit: seattle


Getting docs for movieset term: "seattle"
tt3277988|tvEpisode|Seattle|Seattle|0|2013|\N|42|Reality-TV  (data_small/xcm, row 878)
tt3281696|tvEpisode|Denver Broncos vs. Seattle Seahawks|Denver Broncos vs. Seattle Seahawks|0|1997|\N|\N|Sport  (data_small/xco, row 418)
tt3281716|tvEpisode|Seattle Seahawks vs. Dallas Cowboys|Seattle Seahawks vs. Dallas Cowboys|0|1980|\N|\N|Sport  (data_small/xco, row 427)
tt3281748|tvEpisode|Seattle Seahawks vs. Los Angeles Raiders|Seattle Seahawks vs. Los Angeles Raiders|0|1992|\N|\N|Sport  (data_small/xco, row 440)
tt3211900|tvEpisode|Wasted in Seattle|Wasted in Seattle|0|2013|\N|44|Crime,Documentary  (data_small/xbj, row 927)
tt3245458|tvEpisode|Seattle Seahawks: Road to 2008|Seattle Seahawks: Road to 2008|0|2008|\N|23|\N  (data_small/xby, row 532)
tt3247806|tvEpisode|St. Louis Cardinals vs. Seattle Seahawks|St. Louis Cardinals vs. Seattle Seahawks|0|1976|\N|\N|Sport  (data_small/xbz, row 561)
Enter a term to search for, or q to quit:
    

Step 2: Build a multithreaded indexer

Now that we have an indexer, we're going to multi-thread the FileParser.

Changes to make:

Finishing Step 2

When you're done with this step:

Experiment

Now that your program works, let's do some experimentation. We spoke in class about how a computer with a single processor won't have much benefit when using multiple threads. We can measure the difference though.

Further, since we're running a virtual machine, we can tell it how many cores to use.

If your machine has multiple cores, it will show on this screen. Crank up the cores, see how performance improves!

Submission

Submit your assignment by pushing your code to Github. Tag it with the tag a4-final.

Tag your last commit, and submit the tag on Blackboard by the due date, midnight (Pacific Time).

In addition, include your notes on how long it takes to parse the files (for both a single thread and multithreaded), and if there is a difference if you were able to increase the number of cores.