Rationale

I have collated Goodreads review data with Amazon sales rankings. Now I would like to see what type of insights can be drawn from the interaction of these variables.

Explore the data

My plotting library here is Altair. It interoperates with my website framework, fastpages, to render interactive charts in the reader's browser. Unfortunately, this can put a heavy load on the browser, so by default Altair refuses to plot a dataset with more than 5,000 rows.

I have other options for creating a huge graph of 500,000 books, but for now I'll take slices of the full dataframe to get an idea of the shape and texture of the data.
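One way to take such a slice is a reproducible random sample capped at the row limit. This is only a minimal sketch on made-up data: `MAX_ROWS` and `cap_rows` are hypothetical names I'm introducing for illustration, not part of pandas or Altair, and the 5,000 figure assumes Altair's default limit is in effect.

```python
import pandas as pd

MAX_ROWS = 5000  # assumed: Altair's default row limit


def cap_rows(df: pd.DataFrame, max_rows: int = MAX_ROWS, seed: int = 0) -> pd.DataFrame:
    """Return df unchanged if it fits under max_rows, else a reproducible random sample."""
    if len(df) <= max_rows:
        return df
    return df.sample(n=max_rows, random_state=seed)


# Toy stand-in for the full dataframe: 12,000 fake rows.
toy_df = pd.DataFrame({
    "rank": range(1, 12001),
    "text_reviews_count": range(12000, 0, -1),
})

sample_df = cap_rows(toy_df)
print(len(sample_df))
```

A random sample shows the overall shape of the data, while a sorted slice (like the one in the next section) shows one extreme of it. Both fit under the limit.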

import pandas as pd
import altair as alt
full_df = pd.read_csv('../../records/books-with-salesrank.csv', low_memory=False).drop(columns='Unnamed: 0')

Highly-reviewed books

Because there are so many books in here, I'm going to sort them by text_reviews_count, and take the most-reviewed slice to start. That way I should be able to recognize some of the titles.

sorted_df = full_df.sort_values(by='text_reviews_count', ascending=False)

# Keep the 5,000 most-reviewed books and renumber the rows 0..4999
plotting_df = sorted_df.head(5000).reset_index(drop=True)

The following code sets up a chart in Altair. As I said, I'm using this library because it works out of the box with fastpages, but it's actually quite a nice tool. It's declarative rather than imperative: you tell it which variables to encode in which visual channels, and it figures out the details.

plotting_chart = alt.Chart(plotting_df).mark_point().encode(
    x=alt.X('rank', scale=alt.Scale(type='log', reverse=True)),
    y=alt.Y('text_reviews_count', scale=alt.Scale(type='log')),
    color=alt.Color('average_rating', scale=alt.Scale(scheme='yellowgreenblue')),
    tooltip=['title_gr', 'author_name', 'publication_year', 'category', 'rank']
).properties(
    width=640,
    height=640
).interactive()

# Draw a rule at the mean of each variable to divide the chart into quadrants
x_line = alt.Chart(pd.DataFrame({'x': [int(plotting_df['rank'].mean())]})).mark_rule().encode(x='x')
y_line = alt.Chart(pd.DataFrame({'y': [int(plotting_df['text_reviews_count'].mean())]})).mark_rule().encode(y='y')

plotting_chart + x_line + y_line
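The same two means can also be used to count how many books land in each quadrant. This is a rough sketch on made-up data, not output from the real dataset. It assumes a smaller rank number means better sales, and the `quadrant_counts` name is hypothetical.

```python
import pandas as pd

# Made-up data, for illustration only.
toy_df = pd.DataFrame({
    "rank": [10, 200, 3000, 40000],
    "text_reviews_count": [5000, 40, 800, 20],
})

rank_mean = toy_df["rank"].mean()
reviews_mean = toy_df["text_reviews_count"].mean()

# Assumption: a smaller rank number means the book sells better.
better_rank = toy_df["rank"] < rank_mean
more_reviews = toy_df["text_reviews_count"] > reviews_mean

quadrant_counts = {
    "better rank, more reviews": int((better_rank & more_reviews).sum()),
    "better rank, fewer reviews": int((better_rank & ~more_reviews).sum()),
    "worse rank, more reviews": int((~better_rank & more_reviews).sum()),
    "worse rank, fewer reviews": int((~better_rank & ~more_reviews).sum()),
}
print(quadrant_counts)
```

Because both axes in the chart are logarithmic, an arithmetic mean sits far from the visual center of the point cloud, so the quadrants can be quite lopsided.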