Publishers have to vet almost a million books every year and determine which ones will be the most popular and make them the most money. Bookstores, whether chain stores or independently owned, also have to choose which books to stock.
Goodreads is a platform for finding and rating books that is used by millions of people worldwide. Its database records information such as book length, ratings, multiple genre classifications, and book descriptions. The goal of this project is to predict a book's popularity (its average rating) before it is published, based on the characteristics of the book available in the data.
We will use a Goodreads dataset from Kaggle. The first few rows show the fields in the dataset:
import pandas as pd
df_books = pd.read_csv('book_data.csv')
df_books.head()
| | book_authors | book_desc | book_edition | book_format | book_isbn | book_pages | book_rating | book_rating_count | book_review_count | book_title | genres | image_url |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Suzanne Collins | Winning will make you famous. Losing means cer... | NaN | Hardcover | 9.78044E+12 | 374 pages | 4.33 | 5519135 | 160706 | The Hunger Games | Young Adult\|Fiction\|Science Fiction\|Dystopia\|F... | https://images.gr-assets.com/books/1447303603l... |
1 | J.K. Rowling\|Mary GrandPré | There is a door at the end of a silent corrido... | US Edition | Paperback | 9.78044E+12 | 870 pages | 4.48 | 2041594 | 33264 | Harry Potter and the Order of the Phoenix | Fantasy\|Young Adult\|Fiction | https://images.gr-assets.com/books/1255614970l... |
2 | Harper Lee | The unforgettable novel of a childhood in a sl... | 50th Anniversary | Paperback | 9.78006E+12 | 324 pages | 4.27 | 3745197 | 79450 | To Kill a Mockingbird | Classics\|Fiction\|Historical\|Historical Fiction... | https://images.gr-assets.com/books/1361975680l... |
3 | Jane Austen\|Anna Quindlen\|Mrs. Oliphant\|George... | «È cosa ormai risaputa che a uno scapolo in po... | Modern Library Classics, USA / CAN | Paperback | 9.78068E+12 | 279 pages | 4.25 | 2453620 | 54322 | Pride and Prejudice | Classics\|Fiction\|Romance | https://images.gr-assets.com/books/1320399351l... |
4 | Stephenie Meyer | About three things I was absolutely positive.F... | NaN | Paperback | 9.78032E+12 | 498 pages | 3.58 | 4281268 | 97991 | Twilight | Young Adult\|Fantasy\|Romance\|Paranormal\|Vampire... | https://images.gr-assets.com/books/1361039443l... |
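The preview shows that `book_pages` is stored as text such as "870 pages" and that `genres` packs several labels into one pipe-delimited string. Before either can serve as a feature, they probably need to be parsed. The following is a minimal sketch on a two-row stand-in built from values in the preview; the new column names `num_pages` and `genre_list` are hypothetical, and the real data may contain missing or differently formatted values that need extra handling.

```python
import pandas as pd

# Two-row stand-in for df_books, using values copied from the preview.
sample = pd.DataFrame({
    "book_title": [
        "Harry Potter and the Order of the Phoenix",
        "Pride and Prejudice",
    ],
    "book_pages": ["870 pages", "279 pages"],
    "genres": ["Fantasy|Young Adult|Fiction", "Classics|Fiction|Romance"],
})

# Pull the leading digits out of strings such as "870 pages" and
# convert them to numbers. "num_pages" is a hypothetical column name.
sample["num_pages"] = pd.to_numeric(
    sample["book_pages"].str.extract(r"(\d+)", expand=False)
)

# Split the pipe-delimited genre string into a list of labels per book.
# "genre_list" is a hypothetical column name.
sample["genre_list"] = sample["genres"].str.split("|")

print(sample[["book_title", "num_pages", "genre_list"]])
```

The same two steps should carry over to the full `df_books` frame, assuming its columns follow the format seen in the preview.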
One key feature of a book is the quality of the writing itself, which is hard to measure and is not included in the data. This may limit the accuracy of our predictions, since we don't have the whole picture.
We treat this as a classification problem: the average rating is grouped into ranges (1-2, 2-3, 3-4, 4-5), and we want to predict which range a book's popularity falls into from its features, many of which are nominal.
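One way to turn the average rating into those four classes is `pandas.cut`. The sketch below uses the five `book_rating` values from the preview; the edge handling is an assumption, since the text does not say which class a rating of exactly 2, 3, or 4 belongs to. Here each boundary value goes to the lower range, and the name `rating_class` is hypothetical.

```python
import pandas as pd

# The five book_rating values shown in the preview.
ratings = pd.Series([4.33, 4.48, 4.27, 4.25, 3.58], name="book_rating")

# Class boundaries and labels for the ranges 1-2, 2-3, 3-4, 4-5.
bins = [1, 2, 3, 4, 5]
labels = ["1-2", "2-3", "3-4", "4-5"]

# Assumption: intervals are closed on the right, so a rating of
# exactly 4.0 is labeled "3-4". include_lowest=True keeps a rating
# of exactly 1.0 inside the first class.
rating_class = pd.cut(ratings, bins=bins, labels=labels, include_lowest=True)

print(rating_class.tolist())
```

Ratings outside the 1 to 5 range come back as `NaN` from `pd.cut`, so such rows would need to be dropped or handled separately before training a classifier.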