I have collected several kinds of data on books, most recently Amazon sales rank, and explored direct plots of those variables in Sales data exploration. Now I want to use minimum-distortion embedding (MDE) to project those multi-dimensional variables into a 2D embedding space and plot the result.

MDE

Doing the embedding requires a subset of the data that is purely numerical, so let's grab those columns and convert them to an ndarray. We'll use the same slice of the dataset as in the previous post, for consistency. We can scale it up to the full dataset later.

MDE also wants no NaN values, so we can use .dropna() to drop the rows that have them. Only a fraction of the dataset has missing values, so hopefully dropping those rows won't affect the result too much.
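Before dropping rows, it can help to check how much data .dropna() will actually remove. Here is a minimal sketch on made-up rows; the values are invented for illustration, and only the column names follow the real dataset.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for the book data (values are hypothetical).
toy_df = pd.DataFrame({
    'publication_year': [2001, 2010, np.nan, 1995, 2015],
    'num_pages': [320, np.nan, 210, 450, 180],
    'average_rating': [4.1, 3.9, 4.3, 4.0, 4.4],
    'rank': [150, 22000, 900, 5000, 12000],
})

# Count missing values per column, then drop every row with at least one NaN.
missing_per_column = toy_df.isna().sum()
complete = toy_df.dropna()
fraction_dropped = 1 - len(complete) / len(toy_df)

print(missing_per_column)
print(f'dropped {fraction_dropped:.0%} of rows')
```

Running the same three lines against the real slice would show which columns are responsible for the missing rows.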

import pandas as pd
import altair as alt

sorted_corner_df = pd.read_csv('../../records/sorted_corner_df.csv')

numbers_df = sorted_corner_df[['publication_year', 'num_pages', 'average_rating', 'text_reviews_count', 'rank']].dropna()

len(numbers_df)
4283
numbers_array = numbers_df.to_numpy()
import pymde

%matplotlib inline
mde = pymde.preserve_neighbors(numbers_array, verbose=True)

embedding = mde.embed(verbose=True)
Sep 16 09:41:57 AM: Computing 15-nearest neighbors, with max_distance=None
Sep 16 09:42:06 AM: Exact nearest neighbors by brute force 
Sep 16 09:42:07 AM: Computing quadratic initialization.
Sep 16 09:42:07 AM: Fitting a centered embedding into R^2, for a graph with 4283 items and 81252 edges.
Sep 16 09:42:07 AM: `embed` method parameters: eps=1.0e-05, max_iter=300, memory_size=10
Sep 16 09:42:07 AM: iteration 000 | distortion 0.229112 | residual norm 0.0507309 | step length 0.631308 | percent change 0.034604
Sep 16 09:42:07 AM: iteration 030 | distortion 0.094106 | residual norm 0.00126619 | step length 1 | percent change 1.55099
Sep 16 09:42:08 AM: iteration 060 | distortion 0.081850 | residual norm 0.00057349 | step length 1 | percent change 0.259603
Sep 16 09:42:08 AM: iteration 090 | distortion 0.080357 | residual norm 0.000341796 | step length 1 | percent change 0.0393842
Sep 16 09:42:08 AM: iteration 120 | distortion 0.079737 | residual norm 0.00017189 | step length 1 | percent change 0.266014
Sep 16 09:42:08 AM: iteration 150 | distortion 0.079471 | residual norm 0.000146057 | step length 1 | percent change 0.132361
Sep 16 09:42:09 AM: iteration 180 | distortion 0.079352 | residual norm 0.000133967 | step length 1 | percent change 0.15208
Sep 16 09:42:09 AM: iteration 210 | distortion 0.079294 | residual norm 0.000101026 | step length 1 | percent change 0.0602279
Sep 16 09:42:09 AM: iteration 240 | distortion 0.079239 | residual norm 6.6527e-05 | step length 1 | percent change 0.191316
Sep 16 09:42:09 AM: iteration 270 | distortion 0.079207 | residual norm 5.19311e-05 | step length 1 | percent change 0.011526
Sep 16 09:42:10 AM: iteration 299 | distortion 0.079185 | residual norm 4.66989e-05 | step length 1 | percent change 0.0680014
Sep 16 09:42:10 AM: Finished fitting in 2.412 seconds and 300 iterations.
Sep 16 09:42:10 AM: average distortion 0.0792 | residual norm 4.7e-05
pymde.plot(embedding, color_by=numbers_df['rank'])
<AxesSubplot:>

A weird shape! Finally!

These should be the same data points as in the blue-purple graph from my last post (average_rating vs rank), but positioned by the embedding, which takes all five numerical features into account at once. Let's compare the two.
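One thing to keep in mind when reading this shape: judging by the log above, the embedding starts from a nearest-neighbor graph, and as far as I understand those neighbors are found from plain distances between rows of the array as given. That means a column with a huge range, like rank, can dominate a column with a small range, like average_rating. The run above used the raw values. The sketch below is not PyMDE's code, just a brute-force illustration on made-up rows of how standardizing the columns can change who counts as a neighbor. The helper names `standardize` and `nearest_neighbor` are my own.

```python
import numpy as np

def standardize(features):
    """Rescale each column to zero mean and unit variance."""
    features = np.asarray(features, dtype=float)
    std = features.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for constant columns
    return (features - features.mean(axis=0)) / std

def nearest_neighbor(features):
    """Index of each row's closest other row, by brute-force Euclidean distance."""
    diffs = features[:, None, :] - features[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dists, np.inf)  # a row is not its own neighbor
    return dists.argmin(axis=1)

# Made-up rows: [average_rating, rank]
toy = np.array([
    [4.6, 1000.0],
    [3.6, 1200.0],
    [4.6, 3000.0],
    [3.6, 22000.0],
])

raw_neighbors = nearest_neighbor(toy)
scaled_neighbors = nearest_neighbor(standardize(toy))

# Row 0's neighbor on raw values is the row with the closest rank;
# after scaling it is the row with the same rating.
print(raw_neighbors[0], scaled_neighbors[0])
```

If standardizing turns out to be worthwhile, the scaled array could be passed to pymde.preserve_neighbors in place of numbers_array. I have not tried that on this dataset, so treat it as an idea to test rather than a result.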

The following chart uses a shared brush selection across two graphs with different encoded axes. That means if you select points in one graph, you should see the same points light up in the other.

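# Attach the 2D embedding coordinates to the numeric rows.
numbers_df[['mde_x', 'mde_y']] = embedding.tolist()

# Bring back the non-numeric columns (title, author, ...) used in the tooltip.
embed_df = numbers_df.merge(sorted_corner_df)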
brush = alt.selection_interval()

chart = alt.Chart(embed_df).mark_point(filled=True).encode(
    size=alt.Size(
        'ratings_count',
        scale=alt.Scale(type='log', range=[.075, 75]),
    ),
    color=alt.condition(
        brush,
        alt.Color(
            'text_reviews_count',
            type='quantitative',
            scale=alt.Scale(type='log', scheme='bluepurple'),
        ),
        alt.value('lightgray'),
    ),
    opacity=alt.OpacityValue(0.6),
    tooltip=['title_gr', 'author_name', 'publication_year', 'category', 'rank', 'text_reviews_count', 'average_rating', 'ratings_count'],
).properties(
    width=420,
    height=340,
).add_selection(
    brush
)

mde_chart = chart.encode(x='mde_x', y='mde_y')
rank_chart = chart.encode(
    x=alt.X('rank', scale=alt.Scale(domain=[-10, 23000], reverse=True)),
    y=alt.Y('average_rating', scale=alt.Scale(type='linear', domain=[3.5, 4.7])),
)

mde_chart & rank_chart
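
One caveat on the merge step used to build embed_df: calling .merge() without an `on=` argument joins on every column name the two frames share, which here means the five numerical columns. If two books happened to agree on all five values, each would match the other and the merge would produce extra rows. I have not checked whether that happens in this slice. Since the numeric frame keeps the index of the rows it came from, joining on the index is one way to sidestep the issue. A minimal sketch with made-up data (the frame names `full_df` and `coords` are hypothetical):

```python
import pandas as pd

# Made-up stand-in for the full dataset; Book A and Book B share both numeric values.
full_df = pd.DataFrame({
    'title_gr': ['Book A', 'Book B', 'Book C', 'Book D'],
    'average_rating': [4.1, 4.1, 3.9, float('nan')],
    'rank': [150.0, 150.0, 900.0, 42.0],
})

# Numeric rows without NaNs, plus invented embedding coordinates.
coords = full_df[['average_rating', 'rank']].dropna().assign(
    mde_x=[0.1, 0.3, 0.5],
    mde_y=[0.2, 0.4, 0.6],
)

# Value-based merge: the duplicated numeric rows match each other twice.
merged_on_values = coords.merge(full_df)

# Index-based join: exactly one output row per embedded book.
joined_on_index = full_df.join(coords[['mde_x', 'mde_y']], how='inner')

print(len(merged_on_values), len(joined_on_index))
```

A quick sanity check on the real data would be to compare len(embed_df) with len(numbers_df); if they match, the value-based merge did no harm.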