DS 2500

Goodreads Predictor¶

Background¶

Goodreads is a social cataloging website that brings book lovers together. With a database of millions of books and reviews, users can track their current reads and search for new ones based on their past activity. In an increasingly connected world, Goodreads lets users browse books at their convenience. Rather than relying on a friend's recommendation or a trending title, users receive recommendations within a self-curated network (Kaufman).

Problem¶

How can we accurately predict a user's next good read? Better yet, can we predict the next most-read title across many users? Machine learning techniques can help us explore these questions.

Proposal¶

To construct such a predictor, we can map book titles to their ratings and compare those ratings across many users to find evidence of book popularity. The more frequently a book receives a high rating, the more popular we consider it. The specific machine learning method is still undecided, but clustering books by their ratings seems like a good start.
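As a rough illustration of what "clustering books by their ratings" could look like, the sketch below converts each book's star counts into shares and groups the books with a minimal two-cluster k-means. The star counts are copied from the `books.head()` output in this notebook. Everything else is an assumption made for illustration: the choice of k-means, the choice of two clusters, the helper names `rating_shares` and `two_means`, and the deterministic seeding from the lowest- and highest-rated books. The project has not committed to any of these.

```python
import numpy as np
import pandas as pd

# Star-rating counts for the five sample books shown by books.head().
books = pd.DataFrame({
    "original_title": ["The Hunger Games",
                       "Harry Potter and the Philosopher's Stone",
                       "Twilight", "To Kill a Mockingbird", "The Great Gatsby"],
    "ratings_1": [66715, 75504, 456191, 60427, 86236],
    "ratings_2": [127936, 101676, 436802, 117415, 197621],
    "ratings_3": [560092, 455024, 793319, 446835, 606158],
    "ratings_4": [1481305, 1156318, 875073, 1001952, 936012],
    "ratings_5": [2706317, 3011543, 1355439, 1714267, 947718],
})
star_cols = [f"ratings_{s}" for s in range(1, 6)]


def rating_shares(df):
    """Turn star counts into each book's share of 1- to 5-star ratings."""
    counts = df[star_cols].to_numpy(dtype=float)
    return counts / counts.sum(axis=1, keepdims=True)


def two_means(points, n_iter=20):
    """Minimal 2-cluster k-means (Lloyd's algorithm), hypothetical helper."""
    mean_stars = points @ np.arange(1, 6)
    # Assumption: seed cluster 0 with the lowest-rated book and
    # cluster 1 with the highest-rated book so the result is reproducible.
    centroids = points[[mean_stars.argmin(), mean_stars.argmax()]].copy()
    for _ in range(n_iter):
        # Squared Euclidean distance from every book to every centroid.
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centroids; keep the old one if a cluster is empty.
        new_centroids = np.array([
            points[labels == k].mean(axis=0) if (labels == k).any() else centroids[k]
            for k in range(2)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels


books["cluster"] = two_means(rating_shares(books))
print(books[["original_title", "cluster"]])
```

With only five books this is a toy run, but it shows the shape of the approach: books whose ratings are spread more evenly across the star levels end up apart from books whose ratings are concentrated at 4 and 5 stars.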

Limitations¶

Of course, there are some limitations. For example, not all users commit to reading just one genre, so we must consider the various factors that go into determining the next popular book. Such factors include, but are not limited to, the ratings a user gave to previously read books and those books' genres. It is also unclear how to define popularity: if a book has a large number of 5-star ratings, is that enough to call it popular?
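The question about 5-star counts can be made concrete with two of the books from the sample data. The sketch below compares three candidate measures: the raw 5-star count, the share of ratings that are 5 stars, and the mean rating. The counts come from the `books.head()` output in this notebook; the three measures and the `summary` table are illustrative assumptions, not a definition of popularity the project has adopted.

```python
import pandas as pd

# Star counts for two sample books, taken from books.head().
books = pd.DataFrame({
    "original_title": ["Twilight", "The Great Gatsby"],
    "ratings_1": [456191, 86236],
    "ratings_2": [436802, 197621],
    "ratings_3": [793319, 606158],
    "ratings_4": [875073, 936012],
    "ratings_5": [1355439, 947718],
}).set_index("original_title")

star_cols = [f"ratings_{s}" for s in range(1, 6)]
total = books[star_cols].sum(axis=1)

# Mean rating = sum(stars * count) / total number of ratings.
weighted = sum(s * books[f"ratings_{s}"] for s in range(1, 6))

summary = pd.DataFrame({
    "five_star_count": books["ratings_5"],
    "five_star_share": books["ratings_5"] / total,
    "mean_rating": weighted / total,
})
print(summary.round(3))
```

Here Twilight has more 5-star ratings than The Great Gatsby, yet its mean rating is lower (about 3.57 versus 3.89), so the answer depends on which measure we choose.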

Dataset¶

We will use the Goodreads API as the primary source of our data.

From the books dataset, we focus on:

  • book_id: assigned book identification codes
  • original_title: title of book
  • ratings_count: total number of ratings a book has
  • ratings_1: number of 1-star ratings a book has
  • ratings_2: number of 2-star ratings a book has
  • ratings_3: number of 3-star ratings a book has
  • ratings_4: number of 4-star ratings a book has
  • ratings_5: number of 5-star ratings a book has

From the ratings dataset, we focus on:

  • user_id: unique user identification codes
  • book_id: assigned book identification codes
  • rating: book rating on scale of 1-5
In [1]:
import pandas as pd
In [12]:
books = pd.read_csv('books_sample.csv')
books.head()
Out[12]:
book_id goodreads_book_id best_book_id work_id books_count isbn isbn13 authors original_publication_year original_title ... ratings_count work_ratings_count work_text_reviews_count ratings_1 ratings_2 ratings_3 ratings_4 ratings_5 image_url small_image_url
0 1 2767052 2767052 2792775 272 439023483 9.780439e+12 Suzanne Collins 2008.0 The Hunger Games ... 4780653 4942365 155254 66715 127936 560092 1481305 2706317 https://images.gr-assets.com/books/1447303603m... https://images.gr-assets.com/books/1447303603s...
1 2 3 3 4640799 491 439554934 9.780440e+12 J.K. Rowling, Mary GrandPré 1997.0 Harry Potter and the Philosopher's Stone ... 4602479 4800065 75867 75504 101676 455024 1156318 3011543 https://images.gr-assets.com/books/1474154022m... https://images.gr-assets.com/books/1474154022s...
2 3 41865 41865 3212258 226 316015849 9.780316e+12 Stephenie Meyer 2005.0 Twilight ... 3866839 3916824 95009 456191 436802 793319 875073 1355439 https://images.gr-assets.com/books/1361039443m... https://images.gr-assets.com/books/1361039443s...
3 4 2657 2657 3275794 487 61120081 9.780061e+12 Harper Lee 1960.0 To Kill a Mockingbird ... 3198671 3340896 72586 60427 117415 446835 1001952 1714267 https://images.gr-assets.com/books/1361975680m... https://images.gr-assets.com/books/1361975680s...
4 5 4671 4671 245494 1356 743273567 9.780743e+12 F. Scott Fitzgerald 1925.0 The Great Gatsby ... 2683664 2773745 51992 86236 197621 606158 936012 947718 https://images.gr-assets.com/books/1490528560m... https://images.gr-assets.com/books/1490528560s...

5 rows × 23 columns

In [13]:
ratings = pd.read_csv('book_ratings.txt')
ratings.head()
Out[13]:
   user_id  book_id  rating
0        1      258       5
1        2     4081       4
2        2      260       5
3        2     9296       5
4        2     2318       3
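To compare ratings across users, the ratings table can be aggregated per book and per user. The sketch below does this with `groupby` on the five rows shown above. The threshold `HIGH_RATING = 4` for what counts as a "high" rating and the column names `n_ratings`, `mean_rating`, and `high_share` are assumptions introduced for illustration.

```python
import pandas as pd

# The five rows shown by ratings.head().
ratings = pd.DataFrame({
    "user_id": [1, 2, 2, 2, 2],
    "book_id": [258, 4081, 260, 9296, 2318],
    "rating": [5, 4, 5, 5, 3],
})

HIGH_RATING = 4  # assumption: ratings of 4 or 5 count as "high"

# Per-book summary: how often the book is rated, and how well.
per_book = ratings.groupby("book_id")["rating"].agg(
    n_ratings="count",
    mean_rating="mean",
    high_share=lambda r: (r >= HIGH_RATING).mean(),
)

# Per-user summary: how many books each user rated and their average rating.
per_user = ratings.groupby("user_id")["rating"].agg(
    n_ratings="count",
    mean_rating="mean",
)

print(per_book)
print(per_user)
```

In this five-row sample every book has a single rating, so the per-book numbers only become meaningful once the full ratings file is loaded. The per-user table already shows the kind of reading history (number of books rated, average rating) mentioned under Limitations.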

Notes¶

  • The number of users to compare is still undecided, since Goodreads has millions of them, but including more users may improve accuracy.
  • The CSV files loaded here contain only a sample of 100 books and ratings; the project will use a larger dataset of 1,000+.