Rationale

I want another piece of information for the books in my dataset: their sales rank. I know it's possible to get this number from Amazon, because I have used professional bookseller software that displays it. However, that software requires a licensing agreement with Amazon, and it can only show sales rank data for one book at a time. So I need a different way to get it for a lot of books at once.

A starting point is the sales rank data available from Jianmo Ni's project here:

In this notebook I extract those rankings and map them to the books in my previous Goodreads-based data. The process is pretty similar to that notebook's, actually. Feel free to skip this and go straight to Sales data exploration to see pretty graphs.

I might be able to pull the metadata for all the books into one dataframe, but it's 4GB of JSON so I may run out of memory. Otherwise I'll have to chunk the JSON again.

from pathlib import Path
import pandas as pd

# LOCALPATH should point at wherever meta_Books.json was downloaded
LOCALPATH = Path('../../records')
meta_json = LOCALPATH/'meta_Books.json'

# chunksize makes read_json return an iterator of DataFrames instead of one big frame
meta_df = pd.read_json(meta_json, lines=True, chunksize=200)

# Peek at the first chunk
test_df = next(meta_df)

test_df[['asin','title','rank','category']]
asin title rank category
0 0000092878 Biology Gods Living Creation Third Edition 10 ... 1,349,781 in Books ( []
1 000047715X Mksap 16 Audio Companion: Medical Knowledge Se... 1,702,625 in Books ( [Books, New, Used & Rental Textbooks, Medicine...
2 0000004545 Flex! Discography of North American Punk, Hard... 6,291,012 in Books ( [Books, Arts & Photography, Music]
3 0000013765 Heavenly Highway Hymns: Shaped-Note Hymnal 2,384,057 in Books ( [Books, Arts & Photography, Music]
4 0000000116 Georgina Goodman Nelson Womens Size 8.5 Purple... 11,735,726 in Books ( []
... ... ... ... ...
195 0001839225 Spring Story (Brambly Hedge) 270,015 in Books ( [Books, Children's Books, Animals]
196 0001942212 Little Grey Rabbit and the Weasels (The Little... 2,367,441 in Books ( [Books, Children's Books]
197 0001942123 Little Grey Rabbit's paint-box (The Little Gre... 3,608,059 in Books ( []
198 0001982869 The Monster Book of Witches, Vampires, Spooks ... 2,670,790 in Books ( [Books, Children's Books, Humor]
199 0001983806 Pookie Believes in Santa Claus 2,047,631 in Books ( [Books, Children's Books]

200 rows × 4 columns

from tqdm import tqdm

big_meta_df = pd.read_json(meta_json, lines=True, chunksize=100000)

# Stream the 4GB file in 100k-row chunks, keeping only the useful columns
for i, chunk in tqdm(enumerate(big_meta_df)):
    filtered_chunk = chunk[['asin','title','rank','category']]
    filtered_chunk.to_csv(f'../../records/amazon-books{str(i).zfill(4)}.csv')
30it [03:03,  6.10s/it]

I have exported just the useful fields into local CSV files for easy access. Now I want to transform the rank and category fields into useful datatypes: rank should be a number, and category should be the last, most specific of the listed genres. ASINs supposedly correspond to ISBN-10 numbers for print books, so they should work decently as identifiers out of the box.
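To spot-check that ASIN/ISBN-10 claim: an ISBN-10 is ten characters whose digits (with X standing for 10 in the last position) satisfy a weighted checksum mod 11. A quick validator sketch, using sample ASINs from the data above (`looks_like_isbn10` is my own helper, not part of the dataset tooling):

```python
def looks_like_isbn10(asin):
    """Heuristic check: is this ASIN a valid ISBN-10 (weighted checksum mod 11)?"""
    if len(asin) != 10:
        return False
    total = 0
    for i, ch in enumerate(asin):
        if ch in '0123456789':
            val = int(ch)
        elif ch in 'Xx' and i == 9:
            val = 10  # 'X' is only valid as the final check digit
        else:
            return False
        total += (10 - i) * val
    return total % 11 == 0

print(looks_like_isbn10('0000092878'))  # True: a print-book ASIN from the data
print(looks_like_isbn10('B01HJBPTUI'))  # False: Kindle-style ASINs aren't ISBNs
```

This suggests the print-book rows will join cleanly on ISBN, while the B0-prefixed Kindle ASINs need separate handling.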

First I'll write a couple of small functions to turn the rank and category text strings into the proper types, then apply them through pandas and write a combined file to disk. Hopefully it will be small enough to keep in memory now that we've cut a lot of useless data and JSON syntax.

import string

def rank_to_int(rank):
    # Parse e.g. '1,349,781 in Books (' into 1349781; None if unparseable.
    # (An earlier version was a bare try/except written by the devil with
    # my hands. Would not recommend.)
    if isinstance(rank, list):
        rank = rank[0] if rank else None
    if not isinstance(rank, str) or not rank:
        return None

    num_str = rank.split(' ')[0]
    no_punc_str = num_str.translate(str.maketrans('', '', string.punctuation))

    try:
        return int(no_punc_str)
    except ValueError:
        return None

def category_to_genre(category):
    # The CSV round-trip stringified the category lists, so parse them back.
    # Stripping spaces is crude but makes the split reliable; it's why genres
    # come out mashed together, e.g. 'Medicine&HealthSciences'.
    if isinstance(category, str):
        category = category[1:-1].replace("'", '').replace('"', '').replace(' ', '').split(',')
    if len(category) > 0 and len(category[-1]) > 0:
        return category[-1]
    return None

I'm sure that there is a prettier way to do this. But I'm hoping to only have to do it once more here, so I can brute force it a little bit. Next time I find myself doing this, though, it's time to figure out the smart-guy method.
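For the record, the smart-guy method for the rank column is probably pandas' vectorized string methods instead of a Python-level `.apply`. A sketch of what that might look like, on sample strings from the data (not a drop-in replacement, just the idea):

```python
import pandas as pd

# Vectorized rank parsing: first token, commas stripped, coerced to a number.
# errors='coerce' turns anything unparseable (including missing values) into NaN.
s = pd.Series(['1,349,781 in Books (', None, '6,291,012 in Books ('])
ranks = pd.to_numeric(s.str.split(' ').str[0].str.replace(',', '', regex=False),
                      errors='coerce')
print(ranks.tolist())  # [1349781.0, nan, 6291012.0]
```

The column comes back as float because NaN forces the upcast, which matches how the rank column looks in the final dataframe anyway.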

# Collect the chunks in a list and concatenate once at the end, instead of
# growing total_df inside the loop (which re-copies everything each pass)
chunks = []
for i in tqdm(range(30)):
    df = pd.read_csv(f'../../records/amazon-books{str(i).zfill(4)}.csv')
    df['category'] = df['category'].apply(category_to_genre)
    df['rank'] = df['rank'].apply(rank_to_int)
    chunks.append(df)

total_df = pd.concat(chunks, ignore_index=True)
100%|██████████| 30/30 [00:23<00:00,  1.29it/s]
final_df = total_df.drop(columns="Unnamed: 0")
 
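(That "Unnamed: 0" column is just the index that pandas wrote into the intermediate CSVs; writing them with `index=False` would avoid the drop entirely. A tiny sketch with a made-up one-row frame:)

```python
import pandas as pd
from io import StringIO

df = pd.DataFrame({'asin': ['0000092878'], 'rank': [1349781]})

buf = StringIO()
df.to_csv(buf, index=False)  # skip writing the index column
buf.seek(0)

cols = pd.read_csv(buf).columns.tolist()
print(cols)  # ['asin', 'rank'] -- no 'Unnamed: 0'
```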

Collate the data

What does the data look like, now that we can fit it into memory?

final_df
asin title rank category
0 0000092878 Biology Gods Living Creation Third Edition 10 ... 1349781.0 None
1 000047715X Mksap 16 Audio Companion: Medical Knowledge Se... 1702625.0 Medicine&HealthSciences
2 0000004545 Flex! Discography of North American Punk, Hard... 6291012.0 Music
3 0000013765 Heavenly Highway Hymns: Shaped-Note Hymnal 2384057.0 Music
4 0000000116 Georgina Goodman Nelson Womens Size 8.5 Purple... 11735726.0 None
... ... ... ... ...
2934944 B01HJBPTUI Made Men: A Thriller (Law of Retaliation Book ... 1436585.0 Thrillers&Suspense
2934945 B01HJC63ZM Raptor&#39;s Desire (A Planet Desire novelette... 1060621.0 Erotica
2934946 B01HJEB422 LG K4 Case,LG Optimus Zone 3 Case,LG Spree Cas... NaN None
2934947 B01HJDS76Y Magickal Incantations 645067.0 NewAge&Spirituality
2934948 B01HJFHYMA White Away EURO 2016 Match Football Soccer Adu... 5339602.0 None

2934949 rows × 4 columns

I only have one continuous variable to work with here: the rank. Categorical and identifying fields are great for labels, but not so much for graphing. I'd like to get more data.

Fortunately, the asin field can be used to map books from Amazon to Goodreads data. Let's grab that data from before.

pd.set_option("display.max_columns", None)


goodreads_df = pd.read_csv('../../records/cleaned_goodreads_books.csv')
goodreads_df.columns
Index(['Unnamed: 0', 'Unnamed: 0.1', 'Unnamed: 0.1.1', 'isbn',
       'text_reviews_count', 'series', 'language_code', 'popular_shelves',
       'asin', 'average_rating', 'similar_books', 'description', 'format',
       'link', 'authors', 'publisher', 'num_pages', 'isbn13',
       'publication_month', 'publication_year', 'url', 'image_url', 'book_id',
       'ratings_count', 'work_id', 'title', 'top_genre', 'author_name'],
      dtype='object')
goodreads_df = goodreads_df[['isbn', 'asin', 'title', 'author_name', 'top_genre', 'publisher', 'publication_year', 'format', 'num_pages', 'average_rating', 'ratings_count', 'text_reviews_count', 'description'] ]

Let's map the Amazon sales rank values into the Goodreads dataframe as new columns. I have to handle two groups, since some Goodreads records carry an ASIN instead of an ISBN.

isbns = goodreads_df.loc[goodreads_df['isbn'].isin(final_df['asin'])]
asins = goodreads_df.loc[goodreads_df['asin'].isin(final_df['asin'])]
len(isbns), len(asins)
(532698, 513)
# Join the Amazon columns onto each group through its matching identifier
asins_full_df = asins.join(final_df.set_index('asin'), on='asin', lsuffix='_gr', rsuffix='_az')
isbns_full_df = isbns.join(final_df.set_index('asin'), on='isbn', lsuffix='_gr', rsuffix='_az')
full_df = pd.concat([isbns_full_df, asins_full_df])
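As a toy illustration of the two-key join mechanics (hypothetical mini-tables, using identifier strings borrowed from the real rows): each Goodreads row is matched through whichever identifier appears in the Amazon table, and the two groups are stacked:

```python
import pandas as pd

# Hypothetical miniature versions of the two tables
gr = pd.DataFrame({'isbn': ['0312853122', None],
                   'asin': [None, 'B0007E28P4'],
                   'title': ['A', 'B']})
az = pd.DataFrame({'asin': ['0312853122', 'B0007E28P4'],
                   'rank': [100, 200]})

# Row 0 matches through its ISBN, row 1 through its ASIN
by_isbn = gr[gr['isbn'].isin(az['asin'])].join(az.set_index('asin'), on='isbn')
by_asin = gr[gr['asin'].isin(az['asin'])].join(az.set_index('asin'), on='asin')
combined = pd.concat([by_isbn, by_asin])
print(combined[['title', 'rank']])
```

One caveat visible in the real output below: a book that matches through both keys (or appears twice in Goodreads) shows up as a duplicate row after the concat.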

Final dataset

full_df
isbn asin title_gr author_name top_genre publisher publication_year format num_pages average_rating ratings_count text_reviews_count description title_az rank category
0 0312853122 NaN W.C. Fields: A Life on Film Ronald J. Fields p St. Martin's Press 1984.0 Paperback 256.0 4.00 3.0 1.0 NaN W. C. Fields: A Life on Film 3974312.0 Biographies&amp;Memoirs
3 0850308712 NaN Runic Astrology: Starcraft and Timekeeping in ... Nigel Pennick runes NaN NaN NaN NaN 3.40 15.0 5.0 NaN Runic Astrology: Starcraft and Timekeeping in ... 2163024.0 NewAge&Spirituality
5 0425040887 NaN The Wanting of Levine Michael Halberstam read-gave-away Berkley Publishing Group 1979.0 Paperback NaN 4.38 12.0 4.0 NaN The Wanting of Levine 2453377.0 Literature&amp;Fiction
6 1934876569 NaN All's Fairy in Love and War (Avalon: Web of Ma... Rachel Roberts fantasy Seven Seas 2009.0 Paperback 216.0 4.22 98.0 6.0 To Kara's astonishment, she discovers that a p... Avalon: Web of Magic Book 8: All's Fairy in Lo... 1176771.0 ScienceFiction&amp;Fantasy
7 0922915113 NaN The Devil's Notebook Anton Szandor LaVey occult Feral House 2000.0 Paperback 147.0 3.81 986.0 39.0 Wisdom, humor, and dark observations by the fo... The Devil's Notebook 29944.0 World
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1204491 NaN B0007E28P4 The Ballad Mongers: Rise of the Modern Folk Song Oscar Brand music Minerva Press/Funk & Wagnalls/Reader's Digest ... 1967.0 Paperback 256.0 3.60 2.0 1.0 The folk song performer & impresario presents ... The Ballad Mongers: Rise of the Modern Folk Song 3051315.0 None
1207326 NaN B00005X8YJ The Best of Science Fiction Groff Conklin science-fiction NaN 1946.0 Hardcover NaN 4.05 1.0 1.0 NaN The Best of Science Fiction 3313873.0 None
1207326 NaN B00005X8YJ The Best of Science Fiction Groff Conklin science-fiction NaN 1946.0 Hardcover NaN 4.05 1.0 1.0 NaN The Best of Science Fiction 3313873.0 None
1208053 NaN B003UW4N76 He Who Knows Himself Knows His Lord Muhammad Sa'id al-Jamal ar-Rifa'i ash-Shadhuli islam Sidi Muhammad Press 2007.0 Paperback 157.0 4.62 8.0 3.0 Presented by Shaykh Muhammad Sa'id al-Jamal al... He Who Knows Himself, Knows His Lord By Shaykh... 3574865.0 None
1208115 NaN 0996644725 Destroyer of Light (Hades & Persephone #2) Rachel Alexander mythology Rachel Alexander 2016.0 Paperback 492.0 4.42 263.0 23.0 The marriage of Hades and Persephone blossoms ... Destroyer of Light 902309.0 Fantasy

533227 rows × 16 columns

Half a million datapoints. It's not as comprehensive as either dataset by itself, but it has a consistent level of metadata for each book. I think that will help in finding good features in the data.

Saving to disk. I explore this next in Sales data exploration.

full_df.to_csv('../../records/books-with-salesrank.csv')