Book Popularity Prediction

Motivation:

Problem

Publishers have to vet almost a million books every year and decide which ones will be the most popular and earn them the most money. Bookstores, whether chain stores or independently owned, also have to choose which books to keep in stock.

Solution

Goodreads is a platform for finding and rating books, used by millions of people worldwide. Its database records information such as book length, ratings, genre classifications, and book descriptions. The goal of this project is to predict a book's popularity, measured by its average rating, before it is published, using the characteristics of the book available in the data.

Dataset

We will use a Goodreads dataset from Kaggle. Fields in the dataset:

  • book_authors: author(s) of the book, pipe-separated when there are several
  • book_desc: a short description of the plot of the book
  • book_edition: regular print or special edition (e.g., 50th anniversary)
  • book_format: hardcover, paperback, etc.
  • book_isbn: unique book identification code
  • book_pages: number of pages, stored as text (e.g., "374 pages")
  • book_rating: average book rating on a scale of 1 to 5
  • book_rating_count: number of ratings the book has
  • book_review_count: number of written reviews the book has
  • book_title: title of the book
  • genres: all of the relevant genres of the book, pipe-separated
  • image_url: URL of the book's image on Goodreads
In [2]:
import pandas as pd
df_books = pd.read_csv('book_data.csv')
df_books.head()
Out[2]:
  | book_authors | book_desc | book_edition | book_format | book_isbn | book_pages | book_rating | book_rating_count | book_review_count | book_title | genres | image_url
0 | Suzanne Collins | Winning will make you famous. Losing means cer... | NaN | Hardcover | 9.78044E+12 | 374 pages | 4.33 | 5519135 | 160706 | The Hunger Games | Young Adult|Fiction|Science Fiction|Dystopia|F... | https://images.gr-assets.com/books/1447303603l...
1 | J.K. Rowling|Mary GrandPré | There is a door at the end of a silent corrido... | US Edition | Paperback | 9.78044E+12 | 870 pages | 4.48 | 2041594 | 33264 | Harry Potter and the Order of the Phoenix | Fantasy|Young Adult|Fiction | https://images.gr-assets.com/books/1255614970l...
2 | Harper Lee | The unforgettable novel of a childhood in a sl... | 50th Anniversary | Paperback | 9.78006E+12 | 324 pages | 4.27 | 3745197 | 79450 | To Kill a Mockingbird | Classics|Fiction|Historical|Historical Fiction... | https://images.gr-assets.com/books/1361975680l...
3 | Jane Austen|Anna Quindlen|Mrs. Oliphant|George... | «È cosa ormai risaputa che a uno scapolo in po... | Modern Library Classics, USA / CAN | Paperback | 9.78068E+12 | 279 pages | 4.25 | 2453620 | 54322 | Pride and Prejudice | Classics|Fiction|Romance | https://images.gr-assets.com/books/1320399351l...
4 | Stephenie Meyer | About three things I was absolutely positive.F... | NaN | Paperback | 9.78032E+12 | 498 pages | 3.58 | 4281268 | 97991 | Twilight | Young Adult|Fantasy|Romance|Paranormal|Vampire... | https://images.gr-assets.com/books/1361039443l...
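
The preview shows two columns that would need parsing before modeling: book_pages is stored as text such as "374 pages", and genres packs several values into one pipe-separated string. Below is a minimal sketch of one way to clean them, run on a small hand-built frame rather than book_data.csv. The helper name clean_books and the new column names n_pages and genre_list are assumptions made for illustration; they are not part of the dataset.

```python
import pandas as pd

def clean_books(df):
    """Return a copy with numeric page counts and list-valued genres.

    Hypothetical helper: the names n_pages and genre_list are illustrative.
    """
    out = df.copy()
    # "374 pages" -> 374; rows with no digits (or missing) become <NA>
    pages = out["book_pages"].str.extract(r"(\d+)", expand=False)
    out["n_pages"] = pd.to_numeric(pages, errors="coerce").astype("Int64")
    # "Fantasy|Young Adult" -> ["Fantasy", "Young Adult"]; missing -> []
    out["genre_list"] = out["genres"].fillna("").apply(
        lambda s: [g for g in s.split("|") if g]
    )
    return out

# Small stand-in for the real data, including one row with missing values
sample = pd.DataFrame({
    "book_pages": ["374 pages", "870 pages", None],
    "genres": ["Young Adult|Fiction", "Fantasy|Young Adult|Fiction", None],
})
cleaned = clean_books(sample)
print(cleaned[["n_pages", "genre_list"]])
```

The nullable Int64 dtype is used so that books without a page count stay missing instead of forcing the whole column to floats.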

Potential Problems

One key feature of a book is the quality of the writing itself, which is hard to measure and is not included in the data. Because we are missing this part of the picture, our predictions may be less accurate.

Methods:

We treat this as a classification problem: the target, popularity (average rating), is grouped into four classes (1-2, 2-3, 3-4, 4-5), and we want to predict which class a book falls into. Most of the available features are nominal.
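
As a sketch of how the four rating classes could be built, the snippet below bins book_rating with pandas.cut. The bin edges, the class labels, and the choice to place boundary values (for example exactly 4.0) in the lower class are assumptions; the text above does not say which side a boundary falls on.

```python
import pandas as pd

# Ratings from the five preview rows, plus 4.0 to show a boundary value
ratings = pd.Series([4.33, 4.48, 4.27, 4.25, 3.58, 4.0], name="book_rating")

edges = [1, 2, 3, 4, 5]
labels = ["1-2", "2-3", "3-4", "4-5"]

# Intervals are right-closed: (1, 2], (2, 3], (3, 4], (4, 5].
# include_lowest=True makes the first interval also contain 1.0.
rating_class = pd.cut(ratings, bins=edges, labels=labels, include_lowest=True)

# Count how many books land in each class
print(rating_class.value_counts().sort_index())
```

Values outside the 1 to 5 range, including missing ratings, come out of pandas.cut as NaN and would need to be handled before training a classifier.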