DS 2500
Goodreads is a social cataloging website that brings book lovers together. With a database of millions of books and reviews, users can track their current reads and search for new ones based on their past activity. In a more technologically connected world, Goodreads allows for book browsing at a user's convenience. Rather than relying on recommendations from a friend or a trending title, users receive book recommendations within a self-curated network (Kaufman).
How can we accurately predict the next good read for a user? Better yet, is there a way to predict the next most-read title across many users? Machine learning techniques can help us explore these questions.
To construct such a predictor, we can map book titles to their respective ratings and compare those ratings across numerous users to find evidence of book popularity. The more frequently a book receives a high rating, the more popular the book is considered. The specific machine learning tool to use is still unclear; however, clustering books by their ratings seems like a good start.
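As a first pass at "how frequently a book receives a high rating", the sketch below counts high ratings per book with pandas. It is only an illustration under stated assumptions: the ratings table is assumed to have `user_id`, `book_id`, and `rating` columns, the cutoff of 4 stars for a "high" rating is our own choice, the helper name `high_rating_summary` is hypothetical, and the toy rows are made up.

```python
import pandas as pd


def high_rating_summary(ratings: pd.DataFrame, threshold: int = 4) -> pd.DataFrame:
    """Count how often each book is rated at or above `threshold` (assumed cutoff)."""
    flagged = ratings.assign(is_high=ratings["rating"] >= threshold)
    summary = flagged.groupby("book_id").agg(
        n_ratings=("rating", "size"),    # how many users rated the book
        n_high=("is_high", "sum"),       # how many of those ratings were high
        mean_rating=("rating", "mean"),  # average star rating
    )
    summary["share_high"] = summary["n_high"] / summary["n_ratings"]
    # Books with the most high ratings come first; ties are broken by mean rating.
    return summary.sort_values(["n_high", "mean_rating"], ascending=False)


# Made-up ratings, for illustration only (not taken from the Goodreads data).
toy_ratings = pd.DataFrame(
    {
        "user_id": [1, 1, 2, 2, 3, 3],
        "book_id": [10, 20, 10, 20, 10, 30],
        "rating": [5, 3, 4, 2, 5, 4],
    }
)

summary = high_rating_summary(toy_ratings)
print(summary)
```

Ranking by the raw count of high ratings favors widely read books, while `share_high` favors books that are consistently liked by the few who read them. Which of the two better matches "popular" is one of the open questions of this project.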
Of course, there exist some limitations, such as the fact that not all users commit themselves to reading just one genre. Therefore, we must consider the various factors that go into determining the next popular book. Such factors include, but are not limited to, the ratings of books that a user has previously read and their respective genres. Additionally, if a book has a great number of 5-star ratings, do we define this book as popular?
We will use the Goodreads API as the primary source of our data.
Our focus in the books dataset includes:
Our focus in the ratings dataset includes:
import pandas as pd
# Load the sample of book metadata and preview the first five rows
books = pd.read_csv('books_sample.csv')
books.head()
| | book_id | goodreads_book_id | best_book_id | work_id | books_count | isbn | isbn13 | authors | original_publication_year | original_title | ... | ratings_count | work_ratings_count | work_text_reviews_count | ratings_1 | ratings_2 | ratings_3 | ratings_4 | ratings_5 | image_url | small_image_url |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2767052 | 2767052 | 2792775 | 272 | 439023483 | 9.780439e+12 | Suzanne Collins | 2008.0 | The Hunger Games | ... | 4780653 | 4942365 | 155254 | 66715 | 127936 | 560092 | 1481305 | 2706317 | https://images.gr-assets.com/books/1447303603m... | https://images.gr-assets.com/books/1447303603s... |
| 1 | 2 | 3 | 3 | 4640799 | 491 | 439554934 | 9.780440e+12 | J.K. Rowling, Mary GrandPré | 1997.0 | Harry Potter and the Philosopher's Stone | ... | 4602479 | 4800065 | 75867 | 75504 | 101676 | 455024 | 1156318 | 3011543 | https://images.gr-assets.com/books/1474154022m... | https://images.gr-assets.com/books/1474154022s... |
| 2 | 3 | 41865 | 41865 | 3212258 | 226 | 316015849 | 9.780316e+12 | Stephenie Meyer | 2005.0 | Twilight | ... | 3866839 | 3916824 | 95009 | 456191 | 436802 | 793319 | 875073 | 1355439 | https://images.gr-assets.com/books/1361039443m... | https://images.gr-assets.com/books/1361039443s... |
| 3 | 4 | 2657 | 2657 | 3275794 | 487 | 61120081 | 9.780061e+12 | Harper Lee | 1960.0 | To Kill a Mockingbird | ... | 3198671 | 3340896 | 72586 | 60427 | 117415 | 446835 | 1001952 | 1714267 | https://images.gr-assets.com/books/1361975680m... | https://images.gr-assets.com/books/1361975680s... |
| 4 | 5 | 4671 | 4671 | 245494 | 1356 | 743273567 | 9.780743e+12 | F. Scott Fitzgerald | 1925.0 | The Great Gatsby | ... | 2683664 | 2773745 | 51992 | 86236 | 197621 | 606158 | 936012 | 947718 | https://images.gr-assets.com/books/1490528560m... | https://images.gr-assets.com/books/1490528560s... |
5 rows × 23 columns
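The earlier question of whether a large number of 5-star ratings makes a book popular can be probed with the `ratings_1` through `ratings_5` columns. The sketch below re-enters the five preview rows shown above by hand (so it runs without the CSV) and compares the raw 5-star count with the 5-star share and the mean star rating. The derived column names `share_5_star` and `mean_rating` are our own, and treating either one as a popularity measure is an assumption, not a settled definition.

```python
import pandas as pd

star_cols = [f"ratings_{k}" for k in range(1, 6)]

# Star counts copied from the books.head() preview above.
sample = pd.DataFrame(
    {
        "original_title": [
            "The Hunger Games",
            "Harry Potter and the Philosopher's Stone",
            "Twilight",
            "To Kill a Mockingbird",
            "The Great Gatsby",
        ],
        "ratings_1": [66715, 75504, 456191, 60427, 86236],
        "ratings_2": [127936, 101676, 436802, 117415, 197621],
        "ratings_3": [560092, 455024, 793319, 446835, 606158],
        "ratings_4": [1481305, 1156318, 875073, 1001952, 936012],
        "ratings_5": [2706317, 3011543, 1355439, 1714267, 947718],
    }
).set_index("original_title")

# Total number of star ratings per book.
total = sample[star_cols].sum(axis=1)

# Fraction of all ratings that are 5 stars.
sample["share_5_star"] = sample["ratings_5"] / total

# Mean star rating: weight each count column by its star value (1..5).
weights = pd.Series(range(1, 6), index=star_cols)
sample["mean_rating"] = sample[star_cols].mul(weights, axis=1).sum(axis=1) / total

print(sample[["ratings_5", "share_5_star", "mean_rating"]].round(3))
```

In these five rows, Twilight has more 5-star ratings than The Great Gatsby in absolute terms but a lower mean rating, which suggests that a raw 5-star count alone may not be a reliable definition of popularity.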
# Load the per-user ratings (comma-separated despite the .txt extension) and preview them
ratings = pd.read_csv('book_ratings.txt')
ratings.head()
| | user_id | book_id | rating |
|---|---|---|---|
| 0 | 1 | 258 | 5 |
| 1 | 2 | 4081 | 4 |
| 2 | 2 | 260 | 5 |
| 3 | 2 | 9296 | 5 |
| 4 | 2 | 2318 | 3 |
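To map book titles to their ratings, the two tables need to be joined. The sketch below shows one way this might be done with `DataFrame.merge`. It assumes that `book_id` in the ratings table refers to `book_id` in the books table (rather than `goodreads_book_id`), which should be verified against the data. The titles come from the books preview, but the rating rows are hypothetical, since none of the previewed ratings point at the five previewed books.

```python
import pandas as pd

# Titles taken from the books preview; only two columns are kept for brevity.
books_small = pd.DataFrame(
    {
        "book_id": [1, 2, 3],
        "original_title": [
            "The Hunger Games",
            "Harry Potter and the Philosopher's Stone",
            "Twilight",
        ],
    }
)

# Hypothetical ratings, for illustration only.
ratings_small = pd.DataFrame(
    {
        "user_id": [1, 1, 2, 3, 3],
        "book_id": [1, 3, 1, 2, 258],
        "rating": [5, 2, 4, 5, 5],
    }
)

# Left join keeps every rating; validate= raises if book_id is duplicated in books.
merged = ratings_small.merge(
    books_small, on="book_id", how="left", validate="many_to_one"
)

# Ratings whose book is missing from the books sample end up with a null title.
unmatched = int(merged["original_title"].isna().sum())

# Mean rating and number of ratings per title, for the ratings that matched.
per_title = (
    merged.dropna(subset=["original_title"])
    .groupby("original_title")["rating"]
    .agg(["mean", "count"])
)

print(f"ratings without a matching book: {unmatched}")
print(per_title)
```

A left join is used here so that ratings for books outside `books_sample.csv` are counted rather than silently dropped. How many ratings go unmatched tells us how much of the ratings file the book sample actually covers.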