How can I access a huge dataset?

Some of my data, specifically the goodreads dataset and the Amazon sales ranks, are in huge JSON files that need to be converted to an appropriate ontology for a book. Loading this all into memory at once will crash my computer. I have to grab the JSON file with the chunksize attribute, which will make a generator that only loads a certain amount of rows into memory at once.

import pandas as pd

pd.set_option("display.max_columns", None)
books_json = pd.read_json(LOCALPATH + 'goodreads/downloads/goodreads_books.json', lines=True, chunksize=200)

Just check that the data looks good in the first chunk:

for chunk in books_json:
    break

chunk.head()
isbn text_reviews_count series country_code language_code popular_shelves asin is_ebook average_rating kindle_asin similar_books description format link authors publisher num_pages publication_day isbn13 publication_month edition_information publication_year url image_url book_id ratings_count work_id title title_without_series
600 11 [935729] US en-GB [{'count': '6', 'name': 'currently-reading'}, ... B01GQONJZG true 4.05 B01GQONJZG [] Heat sizzles during the winter...\nJohn Prince... https://www.goodreads.com/book/show/30821261-o... [{'author_id': '263660', 'role': ''}] https://www.goodreads.com/book/show/30821261-o... https://s.gr-assets.com/assets/nophoto/book/11... 30821261 34 51413481 One Cold Night (ARe Fearless #3) One Cold Night (ARe Fearless #3)
601 9780985779 14 [449075] US en-US [{'count': '497', 'name': 'to-read'}, {'count'... false 3.77 B008LC1OK4 [27826531, 18888727, 25822644, 22911249, 15835... Empty-nester Carol Andrews is thrilled when da... Mass Market Paperback https://www.goodreads.com/book/show/15837689-m... [{'author_id': '3001335', 'role': ''}] Baby Boomer Mysteries Press 237 9780985779900 7 2012 https://www.goodreads.com/book/show/15837689-m... https://images.gr-assets.com/books/1355459200m... 15837689 32 21576715 Marriage Can Be Murder (Baby Boomer Mystery, #3) Marriage Can Be Murder (Baby Boomer Mystery, #3)
602 1 [] US [{'count': '1', 'name': 'to-read'}] true 2.00 [] https://www.goodreads.com/book/show/19048323-d... [{'author_id': '2770657', 'role': ''}] 9783455501629 https://www.goodreads.com/book/show/19048323-d... https://s.gr-assets.com/assets/nophoto/book/11... 19048323 2 27073087 Der Gedanke lenkt den Körper: Körpersprache - ... Der Gedanke lenkt den Körper: Körpersprache - ...
603 7 [542966] US fin [{'count': '11777', 'name': 'to-read'}, {'coun... false 3.54 [24044142, 18664985, 16281371, 17131859, 52261... Alue X:n salaisuus raottuu\nEtelarajaan, halli... Paperback https://www.goodreads.com/book/show/25799070-h... [{'author_id': '33919', 'role': ''}, {'author_... Like 382 31 9789520112899 9 2015 https://www.goodreads.com/book/show/25799070-h... https://images.gr-assets.com/books/1446633523m... 25799070 55 25384114 Hallinta (Eteläraja-trilogia, #2) Hallinta (Eteläraja-trilogia, #2)
604 2 [] US [{'count': '64', 'name': 'to-read'}, {'count':... B00CY2I1XK true 3.77 B00CY2I1XK [] During the course of gaining a divorce from th... https://www.goodreads.com/book/show/19048320-h... [{'author_id': '3459410', 'role': ''}] https://www.goodreads.com/book/show/19048320-h... https://s.gr-assets.com/assets/nophoto/book/11... 19048320 20 25192239 His Brother's Widow His Brother's Widow

I can move through it in chunks and save each piece as a separate .csv file, then process that data more easily. And in the process, I might as well filter out any books I definitely don't want -- like ebooks.

chunk['is_ebook'].value_counts()
false    142
true      58
Name: is_ebook, dtype: int64
from tqdm import tqdm

I want to get the biggest size of chunk that my memory can easily allow, to save time on processing.

books_json = pd.read_json('/run/media/mage/INDESTRUCTIBLESLIME/Replaceable/datasets/goodreads/downloads/goodreads_books.json', lines=True, chunksize=100000)
for i, chunk in tqdm(enumerate(books_json)):
    filtered_chunk = chunk[chunk['is_ebook'] == 'false']
    filtered_chunk.to_csv(f'../../records/goodreads_books_{str(i).zfill(4)}.csv')
#     print(i, len(filtered_chunk))
24it [07:43, 19.33s/it]

Over TWO MILLION BOOKS in this dataset! How exciting. But a lot of that data may be dirty, or not useful to us. Let's do something about that.

It's good practice to look at a sample chunk and clean it up filtering out any columns or rows that we don't want. Then run that process across all the chunks. Hopefully the final product will be small enough to hold in memory at once, and we can start doing transforms.

df = pd.read_csv('../../records/goodreads_books_0007.csv')

Pandas has some functions for exploratory data analysis, but for a quick statistical overview we cna use the pandas_profiling library. Hopefully this will create widgets that transfer to the blog version...

from pandas_profiling import ProfileReport
profile = ProfileReport(df, title="Pandas Profiling Report")
profile.to_widgets()

Takeaways from the profile:

Columns to remove:

  • with more than 50% missing values:
      edition_information
      kindle_asin
      `
  • with constant values:
      country_code
      is_ebook
      `

The asin values are mostly missing, but we're going to need those to compare to Amazon data later so keep that column.

Lots of missing values for isbn and isbn13, maybe they will cancel each other out?

The average_rating is around 3.8, with a standard deviation around 0.53. A 4.57 rating is 95th percentile, whcih might be important to know later. Almost no books have 0, 1, or 2 ratings, so the 5th percentile is right at 3.00.

Formats: still have audiobooks in here. Probably better to clean them out actively and leave weird one-off data in there, rtather than restrict only to known book formats.

The publication_month might be useful, but not publication_day because too mny values are missing. Unfortunately publication_year is missing 15% of it's

I am not really qualified to comment on th popularity of non-english books, but the lang field is missing a lot of values, so I'll keep all those books until I find a compelling reason to leave them out. But should probably filter out all the non-English-lang rows..

Only about half of books have a similar_books measurement, but that seems like ti will be really useful so I'll keep them all.

The num_pages distribution looks good, just skewed by a few multi-volume sets. That's fine for our purposes.

It looks like title_without_series actually includes the series name anyway, so let's just drop that entirely.

After that much cleanup, the dataset should be a little smaller and then we can really explore the whole terrain.

drop_idx = ['edition_information','kindle_asin', 'country_code', 'is_ebook', 'publication_day', 'title_without_series']
dropped_df = df.drop(drop_idx, axis=1)
len(dropped_df)
71081
langs = [o for o in df['language_code'].value_counts().index if o not in ['eng','en-US','en-GB','en-CA']]
eng_df = dropped_df[~dropped_df['language_code'].isin(langs)]
len(eng_df)
53984
formats = ['Audio CD', 'Audiobook', 'Audible Audio', 'Audio', 'audio cd', 'Audio Cassette', 'MP3 CD', 'CD-ROM', 'Podiobook', 'Audio Play', 'MP3', 'audio']
paper_df = eng_df[~eng_df['format'].isin(formats)]
len(paper_df)
51176
def clean_df(df):
    dropped_df = df.drop(drop_idx, axis=1)
    eng_df = dropped_df[~dropped_df['language_code'].isin(langs)]
    paper_df = eng_df[~eng_df['format'].isin(formats)]
    return(paper_df)
cleaned_df = clean_df(df)
len(cleaned_df)
51176

How can I collate the data into an accessible format?

I do want all the data in one dataframe. Let's save it as a CSV, which will probably be large but not nearly as bad as the JSON that we started with.

total_df = pd.DataFrame()
for i in tqdm(range(24)):
    df = pd.read_csv(f'../../records/goodreads_books_{str(i).zfill(4)}.csv')
    cleaned_df = clean_df(df)
#     print(len(cleaned_df), len(total_df))
    total_df = pd.concat([total_df,cleaned_df], ignore_index=True)
100%|██████████| 24/24 [01:12<00:00,  3.01s/it]
total_df.tail()
Unnamed: 0 isbn text_reviews_count series language_code popular_shelves asin average_rating similar_books description format link authors publisher num_pages isbn13 publication_month publication_year url image_url book_id ratings_count work_id title
1215978 2360645 0689852959 1.0 [] NaN [{'count': '22', 'name': 'to-read'}, {'count':... NaN 4.36 [] One of the most popular series ever published ... Paperback https://www.goodreads.com/book/show/331839.Jac... [{'author_id': '10681', 'role': ''}, {'author_... Aladdin 176.0 9780689852954 9.0 2002.0 https://www.goodreads.com/book/show/331839.Jac... https://s.gr-assets.com/assets/nophoto/book/11... 331839 18.0 25313618.0 Jacqueline Kennedy Onassis: Friend of the Arts
1215979 2360647 0373126476 9.0 [] NaN [{'count': '78', 'name': 'to-read'}, {'count':... NaN 3.42 ['2200344', '695337', '10333421', '1934240', '... Blackmailed into marriage to save her family, ... Paperback https://www.goodreads.com/book/show/2685097-th... [{'author_id': '319441', 'role': ''}] Harlequin 192.0 9780373126477 7.0 2007.0 https://www.goodreads.com/book/show/2685097-th... https://s.gr-assets.com/assets/nophoto/book/11... 2685097 112.0 2710420.0 The Spaniard's Blackmailed Bride
1215980 2360651 178092870X 2.0 [] eng [{'count': '702', 'name': 'to-read'}, {'count'... NaN 3.50 ['12064253', '25017213', '571796', '27306126',... Sir Arthur Conan Doyle is brought back to life... Paperback https://www.goodreads.com/book/show/26168430-s... [{'author_id': '2448', 'role': ''}, {'author_i... MX Publishing 148.0 9781780928708 8.0 2015.0 https://www.goodreads.com/book/show/26168430-s... https://images.gr-assets.com/books/1440592011m... 26168430 6.0 46130263.0 Sherlock Holmes and the July Crisis
1215981 2360652 0765197456 6.0 [] NaN [{'count': '37', 'name': 'to-read'}, {'count':... NaN 4.00 [] Gathers poems by William Blake, Emily Bronte, ... Hardcover https://www.goodreads.com/book/show/2342551.Th... [{'author_id': '82312', 'role': 'Editor'}] Smithmark Publishers 96.0 9780765197450 8.0 1996.0 https://www.goodreads.com/book/show/2342551.Th... https://s.gr-assets.com/assets/nophoto/book/11... 2342551 36.0 2349247.0 The Children's Classic Poetry Collection
1215982 2360653 162378140X 17.0 ['658195'] eng [{'count': '56', 'name': 'to-read'}, {'count':... NaN 4.37 ['23562786', '13548289', '26094541', '20570173... Volume One contains: "Claimed," "Tainted," and... Paperback https://www.goodreads.com/book/show/22017381-1... [{'author_id': '7789809', 'role': ''}] Guerrilla Wordfare 306.0 9781623781408 4.0 2014.0 https://www.goodreads.com/book/show/22017381-1... https://images.gr-assets.com/books/1398621236m... 22017381 70.0 41332799.0 101 Nights: Volume One (101 Nights, #1-3)
total_df.to_csv('../../records/cleaned_goodreads_books.csv')

Explore

We're down to 1.2 million books, in a 4GB CSV file instead of a 9GB JSON! :sweat_smile: Here I restart my notebook environment and load the total_df to clear out some memory, though.

import pandas as pd

pd.set_option("display.max_columns", None)
total_df = pd.read_csv('../../records/cleaned_goodreads_books.csv')
total_df.tail()
Unnamed: 0 Unnamed: 0.1 isbn text_reviews_count series language_code popular_shelves asin average_rating similar_books description format link authors publisher num_pages isbn13 publication_month publication_year url image_url book_id ratings_count work_id title
1215978 1215978 2360645 0689852959 1.0 [] NaN [{'count': '22', 'name': 'to-read'}, {'count':... NaN 4.36 [] One of the most popular series ever published ... Paperback https://www.goodreads.com/book/show/331839.Jac... [{'author_id': '10681', 'role': ''}, {'author_... Aladdin 176.0 9780689852954 9.0 2002.0 https://www.goodreads.com/book/show/331839.Jac... https://s.gr-assets.com/assets/nophoto/book/11... 331839 18.0 25313618.0 Jacqueline Kennedy Onassis: Friend of the Arts
1215979 1215979 2360647 0373126476 9.0 [] NaN [{'count': '78', 'name': 'to-read'}, {'count':... NaN 3.42 ['2200344', '695337', '10333421', '1934240', '... Blackmailed into marriage to save her family, ... Paperback https://www.goodreads.com/book/show/2685097-th... [{'author_id': '319441', 'role': ''}] Harlequin 192.0 9780373126477 7.0 2007.0 https://www.goodreads.com/book/show/2685097-th... https://s.gr-assets.com/assets/nophoto/book/11... 2685097 112.0 2710420.0 The Spaniard's Blackmailed Bride
1215980 1215980 2360651 178092870X 2.0 [] eng [{'count': '702', 'name': 'to-read'}, {'count'... NaN 3.50 ['12064253', '25017213', '571796', '27306126',... Sir Arthur Conan Doyle is brought back to life... Paperback https://www.goodreads.com/book/show/26168430-s... [{'author_id': '2448', 'role': ''}, {'author_i... MX Publishing 148.0 9781780928708 8.0 2015.0 https://www.goodreads.com/book/show/26168430-s... https://images.gr-assets.com/books/1440592011m... 26168430 6.0 46130263.0 Sherlock Holmes and the July Crisis
1215981 1215981 2360652 0765197456 6.0 [] NaN [{'count': '37', 'name': 'to-read'}, {'count':... NaN 4.00 [] Gathers poems by William Blake, Emily Bronte, ... Hardcover https://www.goodreads.com/book/show/2342551.Th... [{'author_id': '82312', 'role': 'Editor'}] Smithmark Publishers 96.0 9780765197450 8.0 1996.0 https://www.goodreads.com/book/show/2342551.Th... https://s.gr-assets.com/assets/nophoto/book/11... 2342551 36.0 2349247.0 The Children's Classic Poetry Collection
1215982 1215982 2360653 162378140X 17.0 ['658195'] eng [{'count': '56', 'name': 'to-read'}, {'count':... NaN 4.37 ['23562786', '13548289', '26094541', '20570173... Volume One contains: "Claimed," "Tainted," and... Paperback https://www.goodreads.com/book/show/22017381-1... [{'author_id': '7789809', 'role': ''}] Guerrilla Wordfare 306.0 9781623781408 4.0 2014.0 https://www.goodreads.com/book/show/22017381-1... https://images.gr-assets.com/books/1398621236m... 22017381 70.0 41332799.0 101 Nights: Volume One (101 Nights, #1-3)

Find top genre for each book

The popular_shelves column has a list of which "shelves", or tags really, have been associated with this book, and how many times. We can use this to extract a categorical top_genre feature.

def get_genre(shelf_dict):
    if len(shelf_dict) > 0:
        top = [o for o in shelf_dict if o['name'] not in ['to-read', 'currently-reading', 'kindle', 'non-fiction', 'nonfiction']]
        if len(top):
            return(top[0]['name'])
        else:
            return('NONFIC')
    return('UNK')

Make sure to wrap the function in a generator here, so that it will lazy-load each row rather than grab the entire popular_shelves column at once

g = (get_genre(eval(o)) for o in total_df['popular_shelves'])
total_df['top_genre'] = pd.Series(g, dtype='unicode')
total_df.tail()
Unnamed: 0 Unnamed: 0.1 isbn text_reviews_count series language_code popular_shelves asin average_rating similar_books description format link authors publisher num_pages isbn13 publication_month publication_year url image_url book_id ratings_count work_id title top_genre
1215978 1215978 2360645 0689852959 1.0 [] NaN [{'count': '22', 'name': 'to-read'}, {'count':... NaN 4.36 [] One of the most popular series ever published ... Paperback https://www.goodreads.com/book/show/331839.Jac... [{'author_id': '10681', 'role': ''}, {'author_... Aladdin 176.0 9780689852954 9.0 2002.0 https://www.goodreads.com/book/show/331839.Jac... https://s.gr-assets.com/assets/nophoto/book/11... 331839 18.0 25313618.0 Jacqueline Kennedy Onassis: Friend of the Arts biography
1215979 1215979 2360647 0373126476 9.0 [] NaN [{'count': '78', 'name': 'to-read'}, {'count':... NaN 3.42 ['2200344', '695337', '10333421', '1934240', '... Blackmailed into marriage to save her family, ... Paperback https://www.goodreads.com/book/show/2685097-th... [{'author_id': '319441', 'role': ''}] Harlequin 192.0 9780373126477 7.0 2007.0 https://www.goodreads.com/book/show/2685097-th... https://s.gr-assets.com/assets/nophoto/book/11... 2685097 112.0 2710420.0 The Spaniard's Blackmailed Bride harlequin
1215980 1215980 2360651 178092870X 2.0 [] eng [{'count': '702', 'name': 'to-read'}, {'count'... NaN 3.50 ['12064253', '25017213', '571796', '27306126',... Sir Arthur Conan Doyle is brought back to life... Paperback https://www.goodreads.com/book/show/26168430-s... [{'author_id': '2448', 'role': ''}, {'author_i... MX Publishing 148.0 9781780928708 8.0 2015.0 https://www.goodreads.com/book/show/26168430-s... https://images.gr-assets.com/books/1440592011m... 26168430 6.0 46130263.0 Sherlock Holmes and the July Crisis mystery
1215981 1215981 2360652 0765197456 6.0 [] NaN [{'count': '37', 'name': 'to-read'}, {'count':... NaN 4.00 [] Gathers poems by William Blake, Emily Bronte, ... Hardcover https://www.goodreads.com/book/show/2342551.Th... [{'author_id': '82312', 'role': 'Editor'}] Smithmark Publishers 96.0 9780765197450 8.0 1996.0 https://www.goodreads.com/book/show/2342551.Th... https://s.gr-assets.com/assets/nophoto/book/11... 2342551 36.0 2349247.0 The Children's Classic Poetry Collection poetry
1215982 1215982 2360653 162378140X 17.0 ['658195'] eng [{'count': '56', 'name': 'to-read'}, {'count':... NaN 4.37 ['23562786', '13548289', '26094541', '20570173... Volume One contains: "Claimed," "Tainted," and... Paperback https://www.goodreads.com/book/show/22017381-1... [{'author_id': '7789809', 'role': ''}] Guerrilla Wordfare 306.0 9781623781408 4.0 2014.0 https://www.goodreads.com/book/show/22017381-1... https://images.gr-assets.com/books/1398621236m... 22017381 70.0 41332799.0 101 Nights: Volume One (101 Nights, #1-3) erotica

Getting bonus indexes from pandas lol. I know there's a smart way to load things without this problem but I forgot to do it. To save time from loading again I'll just drop those columns.

total_df.drop(['Unnamed: 0', 'Unnamed: 0.1'], axis=1).tail()
isbn text_reviews_count series language_code popular_shelves asin average_rating similar_books description format link authors publisher num_pages isbn13 publication_month publication_year url image_url book_id ratings_count work_id title top_genre
1215978 0689852959 1.0 [] NaN [{'count': '22', 'name': 'to-read'}, {'count':... NaN 4.36 [] One of the most popular series ever published ... Paperback https://www.goodreads.com/book/show/331839.Jac... [{'author_id': '10681', 'role': ''}, {'author_... Aladdin 176.0 9780689852954 9.0 2002.0 https://www.goodreads.com/book/show/331839.Jac... https://s.gr-assets.com/assets/nophoto/book/11... 331839 18.0 25313618.0 Jacqueline Kennedy Onassis: Friend of the Arts biography
1215979 0373126476 9.0 [] NaN [{'count': '78', 'name': 'to-read'}, {'count':... NaN 3.42 ['2200344', '695337', '10333421', '1934240', '... Blackmailed into marriage to save her family, ... Paperback https://www.goodreads.com/book/show/2685097-th... [{'author_id': '319441', 'role': ''}] Harlequin 192.0 9780373126477 7.0 2007.0 https://www.goodreads.com/book/show/2685097-th... https://s.gr-assets.com/assets/nophoto/book/11... 2685097 112.0 2710420.0 The Spaniard's Blackmailed Bride harlequin
1215980 178092870X 2.0 [] eng [{'count': '702', 'name': 'to-read'}, {'count'... NaN 3.50 ['12064253', '25017213', '571796', '27306126',... Sir Arthur Conan Doyle is brought back to life... Paperback https://www.goodreads.com/book/show/26168430-s... [{'author_id': '2448', 'role': ''}, {'author_i... MX Publishing 148.0 9781780928708 8.0 2015.0 https://www.goodreads.com/book/show/26168430-s... https://images.gr-assets.com/books/1440592011m... 26168430 6.0 46130263.0 Sherlock Holmes and the July Crisis mystery
1215981 0765197456 6.0 [] NaN [{'count': '37', 'name': 'to-read'}, {'count':... NaN 4.00 [] Gathers poems by William Blake, Emily Bronte, ... Hardcover https://www.goodreads.com/book/show/2342551.Th... [{'author_id': '82312', 'role': 'Editor'}] Smithmark Publishers 96.0 9780765197450 8.0 1996.0 https://www.goodreads.com/book/show/2342551.Th... https://s.gr-assets.com/assets/nophoto/book/11... 2342551 36.0 2349247.0 The Children's Classic Poetry Collection poetry
1215982 162378140X 17.0 ['658195'] eng [{'count': '56', 'name': 'to-read'}, {'count':... NaN 4.37 ['23562786', '13548289', '26094541', '20570173... Volume One contains: "Claimed," "Tainted," and... Paperback https://www.goodreads.com/book/show/22017381-1... [{'author_id': '7789809', 'role': ''}] Guerrilla Wordfare 306.0 9781623781408 4.0 2014.0 https://www.goodreads.com/book/show/22017381-1... https://images.gr-assets.com/books/1398621236m... 22017381 70.0 41332799.0 101 Nights: Volume One (101 Nights, #1-3) erotica

Get author data

I want the author for each work. Their names are associated with their author ids in a different JSON file, we'll have to copy the data in somehow

author_df = pd.read_json('/run/media/mage/INDESTRUCTIBLESLIME/Replaceable/datasets/goodreads/downloads/goodreads_book_authors.json', lines=True)
author_df = author_df.set_index('author_id')
author_df
average_rating text_reviews_count name ratings_count
author_id
604031 3.98 7 Ronald J. Fields 49
626222 4.08 28716 Anita Diamant 546796
10333 3.92 5075 Barbara Hambly 122118
9212 3.68 36262 Jennifer Weiner 888522
149918 3.82 96 Nigel Pennick 1740
... ... ... ... ...
197551 4.36 4 Patty Furbush 11
3988103 4.33 3 Jim Schlinkman 6
13464507 4.00 2 Rich Jolly 18
7427847 3.31 1 sr@ mwrGn 13
5401342 3.70 11 Barry S. Brown 43

829529 rows × 4 columns

total_df['authors']
0                      [{'author_id': '604031', 'role': ''}]
1                       [{'author_id': '10333', 'role': ''}]
2                        [{'author_id': '9212', 'role': ''}]
3                      [{'author_id': '149918', 'role': ''}]
4                     [{'author_id': '3041852', 'role': ''}]
                                 ...                        
1215978    [{'author_id': '10681', 'role': ''}, {'author_...
1215979                [{'author_id': '319441', 'role': ''}]
1215980    [{'author_id': '2448', 'role': ''}, {'author_i...
1215981           [{'author_id': '82312', 'role': 'Editor'}]
1215982               [{'author_id': '7789809', 'role': ''}]
Name: authors, Length: 1215983, dtype: object
author_df.loc[604031]
average_rating                    3.98
text_reviews_count                   7
name                  Ronald J. Fields
ratings_count                       49
Name: 604031, dtype: object
def get_author(a_list):
    try:
        top = int(eval(a_list)[0]['author_id'])
        return(author_df.loc[top]['name'])
    except IndexError as e:
        print(e)
        return('UNK')
from tqdm import tqdm
g = [get_author(a_list) for a_list in tqdm(total_df['authors'])]
total_df['author_name'] = g
  0%|          | 0/1215983 [04:12<?, ?it/s]
total_df.tail()
Unnamed: 0 Unnamed: 0.1 isbn text_reviews_count series language_code popular_shelves asin average_rating similar_books description format link authors publisher num_pages isbn13 publication_month publication_year url image_url book_id ratings_count work_id title top_genre author_name
1215978 1215978 2360645 0689852959 1.0 [] NaN [{'count': '22', 'name': 'to-read'}, {'count':... NaN 4.36 [] One of the most popular series ever published ... Paperback https://www.goodreads.com/book/show/331839.Jac... [{'author_id': '10681', 'role': ''}, {'author_... Aladdin 176.0 9780689852954 9.0 2002.0 https://www.goodreads.com/book/show/331839.Jac... https://s.gr-assets.com/assets/nophoto/book/11... 331839 18.0 25313618.0 Jacqueline Kennedy Onassis: Friend of the Arts biography Beatrice Gormley
1215979 1215979 2360647 0373126476 9.0 [] NaN [{'count': '78', 'name': 'to-read'}, {'count':... NaN 3.42 ['2200344', '695337', '10333421', '1934240', '... Blackmailed into marriage to save her family, ... Paperback https://www.goodreads.com/book/show/2685097-th... [{'author_id': '319441', 'role': ''}] Harlequin 192.0 9780373126477 7.0 2007.0 https://www.goodreads.com/book/show/2685097-th... https://s.gr-assets.com/assets/nophoto/book/11... 2685097 112.0 2710420.0 The Spaniard's Blackmailed Bride harlequin Trish Morey
1215980 1215980 2360651 178092870X 2.0 [] eng [{'count': '702', 'name': 'to-read'}, {'count'... NaN 3.50 ['12064253', '25017213', '571796', '27306126',... Sir Arthur Conan Doyle is brought back to life... Paperback https://www.goodreads.com/book/show/26168430-s... [{'author_id': '2448', 'role': ''}, {'author_i... MX Publishing 148.0 9781780928708 8.0 2015.0 https://www.goodreads.com/book/show/26168430-s... https://images.gr-assets.com/books/1440592011m... 26168430 6.0 46130263.0 Sherlock Holmes and the July Crisis mystery Arthur Conan Doyle
1215981 1215981 2360652 0765197456 6.0 [] NaN [{'count': '37', 'name': 'to-read'}, {'count':... NaN 4.00 [] Gathers poems by William Blake, Emily Bronte, ... Hardcover https://www.goodreads.com/book/show/2342551.Th... [{'author_id': '82312', 'role': 'Editor'}] Smithmark Publishers 96.0 9780765197450 8.0 1996.0 https://www.goodreads.com/book/show/2342551.Th... https://s.gr-assets.com/assets/nophoto/book/11... 2342551 36.0 2349247.0 The Children's Classic Poetry Collection poetry Nicola Baxter
1215982 1215982 2360653 162378140X 17.0 ['658195'] eng [{'count': '56', 'name': 'to-read'}, {'count':... NaN 4.37 ['23562786', '13548289', '26094541', '20570173... Volume One contains: "Claimed," "Tainted," and... Paperback https://www.goodreads.com/book/show/22017381-1... [{'author_id': '7789809', 'role': ''}] Guerrilla Wordfare 306.0 9781623781408 4.0 2014.0 https://www.goodreads.com/book/show/22017381-1... https://images.gr-assets.com/books/1398621236m... 22017381 70.0 41332799.0 101 Nights: Volume One (101 Nights, #1-3) erotica S.E. Reign
total_df.keys()
Index(['Unnamed: 0', 'Unnamed: 0.1', 'isbn', 'text_reviews_count', 'series',
       'language_code', 'popular_shelves', 'asin', 'average_rating',
       'similar_books', 'description', 'format', 'link', 'authors',
       'publisher', 'num_pages', 'isbn13', 'publication_month',
       'publication_year', 'url', 'image_url', 'book_id', 'ratings_count',
       'work_id', 'title', 'top_genre', 'author_name'],
      dtype='object')
total_df.to_csv('../../records/cleaned_goodreads_books.csv')

Extract a subset of data for graphing

Not all of the columns will cluster nicely for graphing. Here we can extract a smaller dataset, with just continuous and categorical variables that are useful for exploratory data analysis.

graph_df = total_df[['title',
                    'author_name',
                    'top_genre',
                    'publisher',
                    'publication_year',
                    'format',
                    'num_pages',
                    'average_rating',
                    'ratings_count',
                    'text_reviews_count',
                    'description',
                   ]]
graph_df['publication_year'] = [g if g < 2021 else 1950 for g in graph_df['publication_year']]
/home/mage/.local/lib/python3.7/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.
graph_df.tail()
title author_name top_genre publisher publication_year format num_pages average_rating ratings_count text_reviews_count description
1215978 Jacqueline Kennedy Onassis: Friend of the Arts Beatrice Gormley biography Aladdin 2002.0 Paperback 176.0 4.36 18.0 1.0 One of the most popular series ever published ...
1215979 The Spaniard's Blackmailed Bride Trish Morey harlequin Harlequin 2007.0 Paperback 192.0 3.42 112.0 9.0 Blackmailed into marriage to save her family, ...
1215980 Sherlock Holmes and the July Crisis Arthur Conan Doyle mystery MX Publishing 2015.0 Paperback 148.0 3.50 6.0 2.0 Sir Arthur Conan Doyle is brought back to life...
1215981 The Children's Classic Poetry Collection Nicola Baxter poetry Smithmark Publishers 1996.0 Hardcover 96.0 4.00 36.0 6.0 Gathers poems by William Blake, Emily Bronte, ...
1215982 101 Nights: Volume One (101 Nights, #1-3) S.E. Reign erotica Guerrilla Wordfare 2014.0 Paperback 306.0 4.37 70.0 17.0 Volume One contains: "Claimed," "Tainted," and...
graph_df.to_csv('../../records/to_graph.csv')