I have collected different types of data on books, most recently Amazon sales rank, and explored direct plots of those variables in Sales data exploration. Now I want to use minimum distortion embedding, or MDE, to project those multi-dimensional variables into a 2D embedding space, and plot that.

MDE

Doing the embedding requires a subset of the data that is purely numerical, so let's grab those columns and convert them to an ndarray. We'll use the same slice of the dataset as in the previous post, for consistency. We can scale it up to the full dataset later.

MDE also needs the input to be free of NaN values, so we can use .dropna() to drop the rows that contain them. Only a fraction of the dataset has missing values, so hopefully dropping those rows won't affect the result too much.
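As a minimal sketch of that preparation step, here is the same select / dropna / to_numpy sequence on a tiny made-up table. The values and the toy_ names are invented for illustration and are not taken from the book dataset.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration; not rows from sorted_corner_df
toy_df = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'num_pages': [320, np.nan, 210, 450],
    'average_rating': [4.1, 3.9, np.nan, 4.4],
    'rank': [1500, 200, 9800, 40],
})

# Keep only the numeric columns, then drop any row with a missing value
toy_numbers = toy_df[['num_pages', 'average_rating', 'rank']].dropna()

# Rows B and C each contain a NaN, so only A and D survive
toy_array = toy_numbers.to_numpy()

print(len(toy_df), '->', len(toy_numbers))
print(toy_array.shape, np.isnan(toy_array).any())
```

Note that dropna() keeps the original index labels on the surviving rows, which is handy for matching them back to the full table later.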

import pandas as pd
import altair as alt

sorted_corner_df = pd.read_csv('../../records/sorted_corner_df.csv')

numbers_df = sorted_corner_df[['publication_year', 'num_pages', 'average_rating', 'text_reviews_count', 'rank']].dropna()

len(numbers_df)
4283
numbers_array = numbers_df.to_numpy()
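
One thing to keep in mind before embedding: as far as I understand it, preserve_neighbors picks each item's nearest neighbors using Euclidean distance on the columns as given, so a wide-ranging column like rank can dominate a narrow one like average_rating. The sketch below is not what this post does. It only shows, on invented numbers, how the columns could be standardized first if that turns out to matter. The names toy_array and scaled are placeholders.

```python
import numpy as np

# Invented values standing in for [average_rating, rank]; not real data
toy_array = np.array([
    [3.8, 120.0],
    [3.9, 2000.0],
    [4.6, 300.0],
    [4.5, 22000.0],
])

# Standardize each column to zero mean and unit variance
scaled = (toy_array - toy_array.mean(axis=0)) / toy_array.std(axis=0)

# Distances from row 0 on the raw values: rank swamps the rating difference,
# so row 2 (close in rank, far in rating) comes out as the nearest neighbor
raw_dist = np.linalg.norm(toy_array - toy_array[0], axis=1)

# Distances from row 0 after scaling: both columns contribute,
# and row 1 (close in rating, moderately close in rank) is nearest instead
scaled_dist = np.linalg.norm(scaled - scaled[0], axis=1)

print(np.argsort(raw_dist))
print(np.argsort(scaled_dist))
```

If scaling were applied to the real data, the scaled array would be passed to the embedding in place of numbers_array, and the resulting shape would likely differ from the one shown in this post.
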
import pymde

%matplotlib inline
mde = pymde.preserve_neighbors(numbers_array, verbose=True)

embedding = mde.embed(verbose=True)
Oct 18 04:27:41 PM: Computing 15-nearest neighbors, with max_distance=None
Oct 18 04:27:50 PM: Exact nearest neighbors by brute force 
Oct 18 04:27:50 PM: Computing quadratic initialization.
Oct 18 04:27:51 PM: Fitting a centered embedding into R^2, for a graph with 4283 items and 81243 edges.
Oct 18 04:27:51 PM: `embed` method parameters: eps=1.0e-05, max_iter=300, memory_size=10
Oct 18 04:27:51 PM: iteration 000 | distortion 0.227899 | residual norm 0.0647687 | step length 0.593024 | percent change 0.0415001
Oct 18 04:27:51 PM: iteration 030 | distortion 0.092099 | residual norm 0.00182861 | step length 1 | percent change 1.06161
Oct 18 04:27:51 PM: iteration 060 | distortion 0.081070 | residual norm 0.000390266 | step length 1 | percent change 0.500224
Oct 18 04:27:52 PM: iteration 090 | distortion 0.079912 | residual norm 0.000300041 | step length 1 | percent change 0.308907
Oct 18 04:27:52 PM: iteration 120 | distortion 0.079098 | residual norm 0.000215961 | step length 1 | percent change 0.315328
Oct 18 04:27:52 PM: iteration 150 | distortion 0.078769 | residual norm 0.00022695 | step length 1 | percent change 0.0179414
Oct 18 04:27:52 PM: iteration 180 | distortion 0.078665 | residual norm 7.00269e-05 | step length 1 | percent change 0.073022
Oct 18 04:27:52 PM: iteration 210 | distortion 0.078616 | residual norm 6.49021e-05 | step length 1 | percent change 0.0910793
Oct 18 04:27:53 PM: iteration 240 | distortion 0.078571 | residual norm 5.87788e-05 | step length 1 | percent change 0.102136
Oct 18 04:27:53 PM: iteration 270 | distortion 0.078543 | residual norm 5.88444e-05 | step length 1 | percent change 0.0373851
Oct 18 04:27:53 PM: iteration 299 | distortion 0.078523 | residual norm 7.04834e-05 | step length 1 | percent change 0.0188343
Oct 18 04:27:53 PM: Finished fitting in 2.162 seconds and 300 iterations.
Oct 18 04:27:53 PM: average distortion 0.0785 | residual norm 7.0e-05
pymde.plot(embedding, color_by=numbers_df['rank'])
<AxesSubplot:>

A weird shape! Finally!

These data points should be the same ones as in the blue-purple graph from my last post (average_rating vs rank), but positioned by the embedding, which tries to keep books that are neighbors across all five numerical features close together. Let's compare the two.

The following chart uses a shared brush selection across two graphs with different encoded axes. That means if you select points in one graph, the same points should light up in the other.

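# Attach the 2D coordinates to the rows that were embedded (same row order as numbers_array)
numbers_df[['mde_x', 'mde_y']] = embedding.tolist()

# Merge on the shared columns to bring back titles, authors and the other non-numeric fields
embed_df = numbers_df.merge(sorted_corner_df)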
# One interval brush shared by both charts
brush = alt.selection_interval()

# Base chart: encodings common to both views; x and y are set per view below
chart1 = alt.Chart(embed_df).mark_point(filled=True).encode(
    size=alt.Size('ratings_count', scale=alt.Scale(type='log', range=[.075, 75])),
    color=alt.condition(
        brush,
        alt.Color('text_reviews_count', type='quantitative',
                  scale=alt.Scale(type='log', scheme='bluepurple')),
        alt.value('lightgray')),
    opacity=alt.OpacityValue(0.6),
    tooltip=['title_gr', 'author_name', 'publication_year', 'category', 'rank',
             'text_reviews_count', 'average_rating', 'ratings_count']
).properties(
    width=420,
    height=340
).add_selection(
    brush
)

# Top: MDE embedding. Bottom: average_rating vs rank, with the rank axis reversed.
mde_view = chart1.encode(x='mde_x', y='mde_y')
rank_view = chart1.encode(
    x=alt.X('rank', scale=alt.Scale(domain=[-10, 23000], reverse=True)),
    y=alt.Y('average_rating', scale=alt.Scale(type='linear', domain=[3.5, 4.7]))
)

mde_view & rank_view