Still working on the Invisible Canon. Today I want to find a superset of canons for fiction: not genre fiction, but "literature". Of course, the Stanford Literary Lab has already done work on this. In their Pamphlet 8: Between Canon and Corpus they demonstrate that overlaying different "top 100 books of the 20th century" lists leads to significant overlap, and can be used to triangulate a sort of "most voted for" list. They use six different lists and end up with roughly 400 unique books.

I can use this "found canon" technique to craft a data feature, something like "number of times this work has been canonized". That's useful in its own right, but it could also be the target for a collaborative filtering system.
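A minimal sketch of that feature, assuming I end up with a table where each row is a work and each column is one canon list (the list names and memberships below are made up for illustration):

```python
import pandas as pd

# Toy one-hot table: rows are works, columns are canon lists (invented values).
membership = pd.DataFrame(
    {"list_a": [1, 1, 0], "list_b": [1, 0, 0], "list_c": [1, 1, 1]},
    index=["Beloved", "Ubik", "Loving"],
)

# The feature is just the row sum: how many lists include each work.
canon_count = membership.sum(axis=1).rename("canon_count")
print(canon_count.sort_values(ascending=False))
```

The same row sum works however many list columns get added later, which is the appeal of keeping one column per list.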

I need to extract that information from a bunch of HTML tables and PDF reports, because academics. And then map it to the features from my previous distillation. One problem that will arise is that Goodreads data is organized by edition, while the canon lists are organized by work. So the first problem is to reshape the data. Pandas provides MultiIndex, a way of stacking data in multiple dimensions. Let's try that, starting by indexing on work_id.

import pandas as pd

pd.set_option("display.max_columns", None)
total_df = pd.read_csv('../../records/cleaned_goodreads_books.csv')
total_df = total_df.drop(['Unnamed: 0', 'Unnamed: 0.1', 'Unnamed: 0.1.1'], axis=1)
total_df.tail()
isbn text_reviews_count series language_code popular_shelves asin average_rating similar_books description format link authors publisher num_pages isbn13 publication_month publication_year url image_url book_id ratings_count work_id title top_genre author_name
1215978 0689852959 1.0 [] NaN [{'count': '22', 'name': 'to-read'}, {'count':... NaN 4.36 [] One of the most popular series ever published ... Paperback https://www.goodreads.com/book/show/331839.Jac... [{'author_id': '10681', 'role': ''}, {'author_... Aladdin 176.0 9780689852954 9.0 2002.0 https://www.goodreads.com/book/show/331839.Jac... https://s.gr-assets.com/assets/nophoto/book/11... 331839 18.0 25313618.0 Jacqueline Kennedy Onassis: Friend of the Arts biography Beatrice Gormley
1215979 0373126476 9.0 [] NaN [{'count': '78', 'name': 'to-read'}, {'count':... NaN 3.42 ['2200344', '695337', '10333421', '1934240', '... Blackmailed into marriage to save her family, ... Paperback https://www.goodreads.com/book/show/2685097-th... [{'author_id': '319441', 'role': ''}] Harlequin 192.0 9780373126477 7.0 2007.0 https://www.goodreads.com/book/show/2685097-th... https://s.gr-assets.com/assets/nophoto/book/11... 2685097 112.0 2710420.0 The Spaniard's Blackmailed Bride harlequin Trish Morey
1215980 178092870X 2.0 [] eng [{'count': '702', 'name': 'to-read'}, {'count'... NaN 3.50 ['12064253', '25017213', '571796', '27306126',... Sir Arthur Conan Doyle is brought back to life... Paperback https://www.goodreads.com/book/show/26168430-s... [{'author_id': '2448', 'role': ''}, {'author_i... MX Publishing 148.0 9781780928708 8.0 2015.0 https://www.goodreads.com/book/show/26168430-s... https://images.gr-assets.com/books/1440592011m... 26168430 6.0 46130263.0 Sherlock Holmes and the July Crisis mystery Arthur Conan Doyle
1215981 0765197456 6.0 [] NaN [{'count': '37', 'name': 'to-read'}, {'count':... NaN 4.00 [] Gathers poems by William Blake, Emily Bronte, ... Hardcover https://www.goodreads.com/book/show/2342551.Th... [{'author_id': '82312', 'role': 'Editor'}] Smithmark Publishers 96.0 9780765197450 8.0 1996.0 https://www.goodreads.com/book/show/2342551.Th... https://s.gr-assets.com/assets/nophoto/book/11... 2342551 36.0 2349247.0 The Children's Classic Poetry Collection poetry Nicola Baxter
1215982 162378140X 17.0 ['658195'] eng [{'count': '56', 'name': 'to-read'}, {'count':... NaN 4.37 ['23562786', '13548289', '26094541', '20570173... Volume One contains: "Claimed," "Tainted," and... Paperback https://www.goodreads.com/book/show/22017381-1... [{'author_id': '7789809', 'role': ''}] Guerrilla Wordfare 306.0 9781623781408 4.0 2014.0 https://www.goodreads.com/book/show/22017381-1... https://images.gr-assets.com/books/1398621236m... 22017381 70.0 41332799.0 101 Nights: Volume One (101 Nights, #1-3) erotica S.E. Reign
total_df = total_df.set_index(['work_id'])
total_df
isbn text_reviews_count series language_code popular_shelves asin average_rating similar_books description format link authors publisher num_pages isbn13 publication_month publication_year url image_url book_id ratings_count title top_genre author_name
work_id
5400751.0 0312853122 1.0 [] NaN [{'count': '3', 'name': 'to-read'}, {'count': ... NaN 4.00 [] NaN Paperback https://www.goodreads.com/book/show/5333265-w-... [{'author_id': '604031', 'role': ''}] St. Martin's Press 256.0 9780312853129 9.0 1984.0 https://www.goodreads.com/book/show/5333265-w-... https://images.gr-assets.com/books/1310220028m... 5333265 3.0 W.C. Fields: A Life on Film p Ronald J. Fields
8948723.0 NaN 7.0 ['189911'] eng [{'count': '58', 'name': 'to-read'}, {'count':... B00071IKUY 4.03 ['19997', '828466', '1569323', '425389', '1176... Omnibus book club edition containing the Ladie... Hardcover https://www.goodreads.com/book/show/7327624-th... [{'author_id': '10333', 'role': ''}] Nelson Doubleday, Inc. 600.0 NaN NaN 1987.0 https://www.goodreads.com/book/show/7327624-th... https://images.gr-assets.com/books/1304100136m... 7327624 140.0 The Unschooled Wizard (Sun Wolf and Starhawk, ... fantasy Barbara Hambly
6243154.0 0743294297 3282.0 [] eng [{'count': '7615', 'name': 'to-read'}, {'count... NaN 3.49 ['6604176', '6054190', '2285777', '82641', '75... Addie Downs and Valerie Adler were eight when ... Hardcover https://www.goodreads.com/book/show/6066819-be... [{'author_id': '9212', 'role': ''}] Atria Books 368.0 9780743294294 7.0 2009.0 https://www.goodreads.com/book/show/6066819-be... https://s.gr-assets.com/assets/nophoto/book/11... 6066819 51184.0 Best Friends Forever chick-lit Jennifer Weiner
278577.0 0850308712 5.0 [] NaN [{'count': '32', 'name': 'to-read'}, {'count':... NaN 3.40 [] NaN NaN https://www.goodreads.com/book/show/287140.Run... [{'author_id': '149918', 'role': ''}] NaN NaN 9780850308716 NaN NaN https://www.goodreads.com/book/show/287140.Run... https://images.gr-assets.com/books/1413219371m... 287140 15.0 Runic Astrology: Starcraft and Timekeeping in ... runes Nigel Pennick
278578.0 1599150603 7.0 [] NaN [{'count': '56', 'name': 'to-read'}, {'count':... NaN 4.13 [] Relates in vigorous prose the tale of Aeneas, ... Paperback https://www.goodreads.com/book/show/287141.The... [{'author_id': '3041852', 'role': ''}] Yesterday's Classics 162.0 9781599150604 9.0 2006.0 https://www.goodreads.com/book/show/287141.The... https://s.gr-assets.com/assets/nophoto/book/11... 287141 46.0 The Aeneid for Boys and Girls history Alfred J. Church
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
25313618.0 0689852959 1.0 [] NaN [{'count': '22', 'name': 'to-read'}, {'count':... NaN 4.36 [] One of the most popular series ever published ... Paperback https://www.goodreads.com/book/show/331839.Jac... [{'author_id': '10681', 'role': ''}, {'author_... Aladdin 176.0 9780689852954 9.0 2002.0 https://www.goodreads.com/book/show/331839.Jac... https://s.gr-assets.com/assets/nophoto/book/11... 331839 18.0 Jacqueline Kennedy Onassis: Friend of the Arts biography Beatrice Gormley
2710420.0 0373126476 9.0 [] NaN [{'count': '78', 'name': 'to-read'}, {'count':... NaN 3.42 ['2200344', '695337', '10333421', '1934240', '... Blackmailed into marriage to save her family, ... Paperback https://www.goodreads.com/book/show/2685097-th... [{'author_id': '319441', 'role': ''}] Harlequin 192.0 9780373126477 7.0 2007.0 https://www.goodreads.com/book/show/2685097-th... https://s.gr-assets.com/assets/nophoto/book/11... 2685097 112.0 The Spaniard's Blackmailed Bride harlequin Trish Morey
46130263.0 178092870X 2.0 [] eng [{'count': '702', 'name': 'to-read'}, {'count'... NaN 3.50 ['12064253', '25017213', '571796', '27306126',... Sir Arthur Conan Doyle is brought back to life... Paperback https://www.goodreads.com/book/show/26168430-s... [{'author_id': '2448', 'role': ''}, {'author_i... MX Publishing 148.0 9781780928708 8.0 2015.0 https://www.goodreads.com/book/show/26168430-s... https://images.gr-assets.com/books/1440592011m... 26168430 6.0 Sherlock Holmes and the July Crisis mystery Arthur Conan Doyle
2349247.0 0765197456 6.0 [] NaN [{'count': '37', 'name': 'to-read'}, {'count':... NaN 4.00 [] Gathers poems by William Blake, Emily Bronte, ... Hardcover https://www.goodreads.com/book/show/2342551.Th... [{'author_id': '82312', 'role': 'Editor'}] Smithmark Publishers 96.0 9780765197450 8.0 1996.0 https://www.goodreads.com/book/show/2342551.Th... https://s.gr-assets.com/assets/nophoto/book/11... 2342551 36.0 The Children's Classic Poetry Collection poetry Nicola Baxter
41332799.0 162378140X 17.0 ['658195'] eng [{'count': '56', 'name': 'to-read'}, {'count':... NaN 4.37 ['23562786', '13548289', '26094541', '20570173... Volume One contains: "Claimed," "Tainted," and... Paperback https://www.goodreads.com/book/show/22017381-1... [{'author_id': '7789809', 'role': ''}] Guerrilla Wordfare 306.0 9781623781408 4.0 2014.0 https://www.goodreads.com/book/show/22017381-1... https://images.gr-assets.com/books/1398621236m... 22017381 70.0 101 Nights: Volume One (101 Nights, #1-3) erotica S.E. Reign

1215983 rows × 24 columns

total_df.loc[total_df.index.duplicated()]
isbn text_reviews_count series language_code popular_shelves asin average_rating similar_books description format link authors publisher num_pages isbn13 publication_month publication_year url image_url book_id ratings_count title top_genre author_name
work_id
3349802.0 174114244X 9.0 [] NaN [{'count': '19688', 'name': 'to-read'}, {'coun... NaN 3.79 ['8359929', '723742', '297130', '7570244', '39... From the moment Ross's fiancee Aimee was kille... NaN https://www.goodreads.com/book/show/820229.Sec... [{'author_id': '7128', 'role': ''}] NaN NaN 9781741142440 NaN NaN https://www.goodreads.com/book/show/820229.Sec... https://images.gr-assets.com/books/1293769966m... 820229 82.0 Second Glance fiction Jodi Picoult
3349802.0 0340897260 46.0 [] en-GB [{'count': '19688', 'name': 'to-read'}, {'coun... NaN 3.79 ['8359929', '723742', '297130', '7570244', '39... From the moment Ross's fiancee Aimee was kille... Paperback https://www.goodreads.com/book/show/820226.Sec... [{'author_id': '7128', 'role': ''}] Hodder 483.0 9780340897263 NaN 2008.0 https://www.goodreads.com/book/show/820226.Sec... https://images.gr-assets.com/books/1363397305m... 820226 334.0 Second Glance fiction Jodi Picoult
3349802.0 0340897279 4.0 [] eng [{'count': '19688', 'name': 'to-read'}, {'coun... NaN 3.79 ['8359929', '723742', '297130', '7570244', '39... From the moment Ross's fiancee Aimee was kille... Mass Market Paperback https://www.goodreads.com/book/show/820227.Sec... [{'author_id': '7128', 'role': ''}] Hodder 420.0 9780340897270 NaN 2007.0 https://www.goodreads.com/book/show/820227.Sec... https://images.gr-assets.com/books/1288638236m... 820227 24.0 Second Glance fiction Jodi Picoult
206370.0 0684801302 16.0 [] NaN [{'count': '1654', 'name': 'to-read'}, {'count... NaN 4.13 ['25343', '256004', '426682', '160909', '13422... An award-winning research psychologist who has... Hardcover https://www.goodreads.com/book/show/213189.The... [{'author_id': '14734208', 'role': ''}, {'auth... Simon & Schuster 240.0 9780684801308 2.0 1997.0 https://www.goodreads.com/book/show/213189.The... https://s.gr-assets.com/assets/nophoto/book/11... 213189 70.0 The Heart of Parenting: Raising an Emotionally... parenting John M. Gottman
752200.0 080500291X 1.0 ['191162'] NaN [{'count': '8205', 'name': 'to-read'}, {'count... NaN 4.14 ['7926', '42337', '7904', '7932', '377889', '3... Saturdays can make dreams come true when the M... NaN https://www.goodreads.com/book/show/8037412-th... [{'author_id': '3420', 'role': ''}] NaN NaN 9780805002911 NaN NaN https://www.goodreads.com/book/show/8037412-th... https://images.gr-assets.com/books/1412898312m... 8037412 6.0 The Saturdays childrens Elizabeth Enright
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2206102.0 0373126794 13.0 [] eng [{'count': '231', 'name': 'to-read'}, {'count'... NaN 3.52 ['2685097', '1866878', '2597992', '6282598', '... Revenge, passion and an arranged marriage...Du... Paperback https://www.goodreads.com/book/show/2200344.Th... [{'author_id': '621880', 'role': ''}] Harlequin 192.0 9780373126798 11.0 2007.0 https://www.goodreads.com/book/show/2200344.Th... https://s.gr-assets.com/assets/nophoto/book/11... 2200344 223.0 The Spanish Duke's Virgin Bride harlequin Chantelle Shaw
16154954.0 NaN 16.0 [] eng [{'count': '773', 'name': 'to-read'}, {'count'... NaN 4.07 ['978053', '425481', '361551', '255045', '2581... "The Short Happy Life of Francis Macomber" is ... NaN https://www.goodreads.com/book/show/7195902-th... [{'author_id': '1455', 'role': ''}] NaN NaN NaN NaN NaN https://www.goodreads.com/book/show/7195902-th... https://images.gr-assets.com/books/1329637203m... 7195902 290.0 The Short Happy Life of Francis Macomber short-stories Ernest Hemingway
7198840.0 0749927577 1.0 [] NaN [{'count': '28', 'name': 'to-read'}, {'count':... NaN 4.32 [] Anne Jirsch is a psychic with an extraordinary... Paperback https://www.goodreads.com/book/show/2233591.In... [{'author_id': '1009480', 'role': ''}, {'autho... Piatkus Books 302.0 9780749927578 2.0 2007.0 https://www.goodreads.com/book/show/2233591.In... https://images.gr-assets.com/books/1328821191m... 2233591 12.0 Instant Intuition: A Psychic's Guide to Findin... books-i-owe Anne Jirsch
3155241.0 0263853357 1.0 [] eng [{'count': '178', 'name': 'to-read'}, {'count'... NaN 3.32 ['2200344', '6408413', '2789084', '2494238', '... A marriage is forever, not just for convenienc... Paperback https://www.goodreads.com/book/show/6818068-co... [{'author_id': '4990', 'role': ''}] Mills & Boon 192.0 9780263853353 7.0 2007.0 https://www.goodreads.com/book/show/6818068-co... https://images.gr-assets.com/books/1504917893m... 6818068 3.0 Contracted: A Wife for the Bedroom harlequin Carol Marinelli
2710420.0 0373126476 9.0 [] NaN [{'count': '78', 'name': 'to-read'}, {'count':... NaN 3.42 ['2200344', '695337', '10333421', '1934240', '... Blackmailed into marriage to save her family, ... Paperback https://www.goodreads.com/book/show/2685097-th... [{'author_id': '319441', 'role': ''}] Harlequin 192.0 9780373126477 7.0 2007.0 https://www.goodreads.com/book/show/2685097-th... https://s.gr-assets.com/assets/nophoto/book/11... 2685097 112.0 The Spaniard's Blackmailed Bride harlequin Trish Morey

288092 rows × 24 columns

len(total_df.index.unique())
927891

It worked! Now I have a list of 927,891 individual works. Time to import the different canons, and associate their data to the books!
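For reference, here is a rough sketch of how the editions could be collapsed into one row per work. It assumes a two-level index of (work_id, book_id) and some aggregation rules I picked for illustration (first title, summed ratings, edition count). The column names mirror the Goodreads dump; the IDs are invented.

```python
import pandas as pd

# Toy edition-level rows: real column names, invented IDs.
editions = pd.DataFrame(
    {
        "work_id": [10, 10, 10, 20],
        "book_id": [1, 2, 3, 4],
        "title": ["Second Glance", "Second Glance", "Second Glance", "The Saturdays"],
        "ratings_count": [82, 334, 24, 6],
    }
)

# A two-level index keeps every edition but groups them under their work.
by_work = editions.set_index(["work_id", "book_id"]).sort_index()

# Collapse to one row per work: first title, total ratings, number of editions.
works = by_work.groupby(level="work_id").agg(
    title=("title", "first"),
    ratings_count=("ratings_count", "sum"),
    n_editions=("title", "count"),
)
print(works)
```

Which aggregation is right for each column is a judgment call; summing ratings_count treats every edition's ratings as distinct, which may or may not hold for this dataset.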

Gather the canons

I can build here on the great work of others. As mentioned above, the Stanford Literary Lab published six canons (PDF) that they concatenated into their 20th Century Fiction corpus.

I've also found the incredible The Greatest Books project by Shane Sherman, which does a similar thing but with 196 lists! Unfortunately he only distributes the rankings post-weighting, whereas I would like each list to be a separate column so I can compare how many times each book gets mentioned. I have sent him a request for the lists, but I may have to scrape his website myself...

For now, I should gather the lists that didn't make it onto Mr. Sherman's list. Those would be the bestsellers and the reader rankings, for the most part. He explicitly prefers prestige over popularity. So there are three lists from the LitLab pamphlet that I suspect didn't make it into the Greatest Books list, and a few from other places around the web, especially Penguin's 100 must-read classic books, as chosen by our readers. I err on the side of popularity, myself, but the Goodreads ratings should do plenty to balance that effect.

%ls ../../records
bkunde.csv                      goodreads_books_0016.csv
cleaned_goodreads_books.csv     goodreads_books_0017.csv
goodbooks-genre-pop-time.html*  goodreads_books_0018.csv
goodreads_books_0000.csv        goodreads_books_0019.csv
goodreads_books_0001.csv        goodreads_books_0020.csv
goodreads_books_0002.csv        goodreads_books_0021.csv
goodreads_books_0003.csv        goodreads_books_0022.csv
goodreads_books_0004.csv        goodreads_books_0023.csv
goodreads_books_0005.csv        goodreads-classics.csv
goodreads_books_0006.csv        library-journal.csv
goodreads_books_0007.csv        modern-library-readers-list.csv
goodreads_books_0008.csv        modern-library.tsv
goodreads_books_0009.csv        penguin-readers.csv
goodreads_books_0010.csv        postcolonial-studies.csv
goodreads_books_0011.csv        pub-weekly.csv
goodreads_books_0012.csv        to_graph.csv*
goodreads_books_0013.csv        ucsd-goodreads-genre-pop-time.html*
goodreads_books_0014.csv        wikipedia-bestselling-books.csv
goodreads_books_0015.csv
bestsellers = pd.read_csv('../../records/wikipedia-bestselling-books.csv')
library = pd.read_csv('../../records/library-journal.csv')
penguin = pd.read_csv('../../records/penguin-readers.csv')
ml_readers = pd.read_csv('../../records/modern-library-readers-list.csv')
pw_readers = pd.read_csv('../../records/pub-weekly.csv') 
psa = pd.read_csv('../../records/postcolonial-studies.csv') 
for i in [bestsellers, library, penguin, ml_readers, pw_readers, psa]:
    print(len(i))
167
150
100
100
83
100

See how these lists are all different lengths? That's why we can't just average out the different rankings. We need each book to have a one-hot encoding of the list, either 0 if it's not on or 1 if it is. So we have to add columns to our dataframe, one for each list. That really means I need a bunch of small lists, not one megalist like The Greatest Books provides. I'll have to scrape them.

import requests
from bs4 import BeautifulSoup
url = "https://thegreatestbooks.org/lists/28"
r = requests.get(url)
htm = BeautifulSoup(r.text, 'html.parser')
h4s = htm.find_all('h4')
[a.get_text() for a in h4s[0].find_all('a')]
['Don Quixote', 'Miguel de Cervantes']
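def get_list_from_htm(htm):
    # The first <h2> holds the list name; each <h4> holds one book entry.
    title = htm.find_all('h2')[0].get_text()
    h4s = htm.find_all('h4')
    # The first two links in an entry are the book title and the author.
    books = [[a.get_text() for a in o.find_all('a')][:2] for o in h4s if len(o) > 1]
    return title, books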
get_list_from_htm(htm)
('TIME Magazine All Time 100 Novels  by TIME Magazine',
 [['The Adventures of Augie March', 'Saul Bellow'],
  ["All the King's Men", 'Robert Penn Warren'],
  ['American Pastoral', 'Philip Roth'],
  ['Animal Farm', 'George Orwell'],
  ['Appointment in Samarra', "John O'Hara"],
  ["Are You There God? It's Me, Margaret", 'Judy Blume'],
  ['The Assistant', 'Bernard Malamud'],
  ['Atonement', 'Ian McEwan'],
  ['Beloved', 'Toni Morrison'],
  ['The Berlin Stories', 'Christopher Isherwood'],
  ['The Big Sleep', 'Raymond Chandler'],
  ['The Blind Assassin', 'Margaret Atwood'],
  ['Blood Meridian', 'Cormac McCarthy'],
  ['Brideshead Revisited', 'Evelyn Waugh'],
  ['The Bridge of San Luis Rey', 'Thornton Wilder'],
  ['Call It Sleep', 'Henry Roth'],
  ['Catch-22 ', 'Joseph Heller'],
  ['The Catcher in the Rye', 'J. D. Salinger'],
  ['A Clockwork Orange', 'Anthony Burgess'],
  ['The Confessions of Nat Turner', 'William Styron'],
  ['The Corrections', 'Jonathan Franzen'],
  ['The Crying of Lot 49 ', 'Thomas Pynchon'],
  ['A Dance to the Music of Time', 'Anthony Powell'],
  ['The Day of the Locust ', 'Nathanael West'],
  ['Death Comes for the Archbishop', 'Willa Cather'],
  ['A Death in the Family ', 'James Agee'],
  ['The Death of the Heart ', 'Elizabeth Bowen'],
  ['Deliverance ', 'James Dickey'],
  ['Dog Soldiers ', 'Robert Stone'],
  ['Falconer', 'John Cheever'],
  ["The French Lieutenant's Woman ", 'John Fowles'],
  ['The Golden Notebook', 'Doris Lessing'],
  ['Go Tell it on the Mountain ', 'James Baldwin'],
  ['Gone With the Wind ', 'Margaret Mitchell'],
  ['The Grapes of Wrath ', 'John Steinbeck'],
  ["Gravity's Rainbow ", 'Thomas Pynchon'],
  ['The Great Gatsby ', 'F. Scott Fitzgerald'],
  ['A Handful of Dust ', 'Evelyn Waugh'],
  ['The Heart Is A Lonely Hunter', 'Carson McCullers'],
  ['The Heart of the Matter ', 'Graham Greene'],
  ['Herzog ', 'Saul Bellow'],
  ['Housekeeping', 'Marilynne Robinson'],
  ['A House for Mr. Biswas ', 'V. S. Naipaul'],
  ['I, Claudius ', 'Robert Graves'],
  ['Infinite Jest ', 'David Foster Wallace'],
  ['Invisible Man ', 'Ralph Ellison'],
  ['Light in August ', 'William Faulkner'],
  ['The Lion, The Witch and the Wardrobe ', 'C. S. Lewis'],
  ['Lolita ', 'Vladimir Nabokov'],
  ['Lord of the Flies ', 'William Golding'],
  ['The Lord of the Rings', 'J. R. R. Tolkien'],
  ['Loving', 'Henry Green'],
  ['Lucky Jim ', 'Kingsley Amis'],
  ['The Man Who Loved Children ', 'Christina Stead'],
  ["Midnight's Children ", 'Salman Rushdie'],
  ['Money ', 'Martin Amis'],
  ['The Moviegoer', 'Walker Percy'],
  ['Mrs. Dalloway ', 'Virginia Woolf'],
  ['Naked Lunch', 'William S. Burroughs'],
  ['Native Son ', 'Richard Wright'],
  ['Neuromancer ', 'William Gibson'],
  ['Never Let Me Go ', 'Kazuo Ishiguro'],
  ['Nineteen Eighty Four', 'George Orwell'],
  ['On the Road', 'Jack Kerouac'],
  ["One Flew Over the Cuckoo's Nest ", 'Ken Kesey'],
  ['The Painted Bird ', 'Jerzy Kosinski'],
  ['Pale Fire ', 'Vladimir Nabokov'],
  ['A Passage to India ', 'E. M. Forster'],
  ['Play It As It Lays ', 'Joan Didion'],
  ["Portnoy's Complaint ", 'Philip Roth'],
  ['Possession ', 'A. S. Byatt'],
  ['The Power and the Glory', 'Graham Greene'],
  ['The Prime of Miss Jean Brodie ', 'Muriel Spark'],
  ['Rabbit, Run ', 'John Updike'],
  ['Ragtime', 'E. L. Doctorow'],
  ['The Recognitions ', 'William Gaddis'],
  ['Red Harvest ', 'Dashiell Hammett'],
  ['Revolutionary Road', 'Richard Yates'],
  ['The Sheltering Sky ', 'Paul Bowles'],
  ['Slaughterhouse-Five ', 'Kurt Vonnegut'],
  ['Snow Crash ', 'Neal Stephenson'],
  ['The Sot-Weed Factor ', 'John Barth'],
  ['The Sound and the Fury', 'William Faulkner'],
  ['The Sportswriter ', 'Richard Ford'],
  ['The Spy Who Came in From the Cold ', 'John le Carré'],
  ['The Sun Also Rises ', 'Ernest Hemingway'],
  ['Their Eyes Were Watching God ', 'Zora Neale Hurston'],
  ['Things Fall Apart ', 'Chinua Achebe'],
  ['To Kill a Mockingbird ', 'Harper Lee'],
  ['To the Lighthouse ', 'Virginia Woolf'],
  ['Tropic of Cancer ', 'Henry Miller'],
  ['Ubik', 'Philip K. Dick'],
  ['Under the Net ', 'Iris Murdoch'],
  ['Under the Volcano ', 'Malcolm Lowry'],
  ['Watchmen ', 'Alan Moore'],
  ['White Noise ', 'Don DeLillo'],
  ['White Teeth ', 'Zadie Smith'],
  ['Wide Sargasso Sea', 'Jean Rhys'],
  ['An American Tragedy', 'Theodore Dreiser'],
  ['At Swim Two-Birds', "Flann O'Brien"]])

Hey, that worked! The website has good semantic HTML, so it will be easy to extrapolate the same process to the rest of the lists.

I want to make a dataframe of these lists, organized by title. That way I can sort it into the larger Goodreads dataset, or extract information from that one to this one, either way.
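One wrinkle with organizing by title: the scraped titles carry trailing spaces ('Catch-22 '), and the local lists use different capitalization and curly apostrophes, so exact string matches against Goodreads will miss. A rough sketch of a matching key follows. The normalize_title helper is hypothetical, its rules are assumptions about what needs cleaning, and the work_id values are invented.

```python
import unicodedata

import pandas as pd


def normalize_title(title: str) -> str:
    """Hypothetical matching key: NFKC-normalize, straighten apostrophes,
    collapse whitespace, and casefold."""
    text = unicodedata.normalize("NFKC", title)
    text = text.replace("\u2019", "'")
    return " ".join(text.split()).casefold()


# Canon titles as they appear in the lists; Goodreads rows with invented work_ids.
canon = pd.DataFrame({"Title": ["Catch-22 ", "Cat\u2019s Eye", "The Palace Of The Peacock"]})
goodreads = pd.DataFrame(
    {"title": ["Catch-22", "Cat's Eye", "The Palace of the Peacock"], "work_id": [1, 2, 3]}
)

canon["key"] = canon["Title"].map(normalize_title)
goodreads["key"] = goodreads["title"].map(normalize_title)

# A left merge keeps every canon entry, matched or not.
matched = canon.merge(goodreads[["key", "work_id"]], on="key", how="left")
print(matched)
```

This will not catch subtitles, translated titles, or same-titled books by different authors, so the author column is still needed as a tiebreaker.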

I don't care about relative rankings within a given list, just the cumulative amount of rankings across lists. So I can use a one-hot encoding: 0 if a book is not on a list, 1 if it is. There's probably a nifty method for this, but I've never done it before, so I'm just going to implement it manually right now.
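For what it's worth, one candidate for the "nifty method" is pd.crosstab on a long table with one row per (title, list) mention. This is only a sketch of that alternative, not what the rest of this post does, and the titles and list names are placeholders:

```python
import pandas as pd

# Long format: one row per (title, list) mention. Placeholder data.
mentions = pd.DataFrame(
    {
        "Title": ["Beloved", "Ubik", "Beloved", "Lolita", "Lolita"],
        "List": ["TIME", "TIME", "Penguin", "Penguin", "TIME"],
    }
)

# crosstab counts co-occurrences; clip(upper=1) guards against a title
# that appears twice within the same list.
onehot = pd.crosstab(mentions["Title"], mentions["List"]).clip(upper=1)
print(onehot)
```

The trade-off is that the data has to be reshaped into long form first, whereas the manual approach works directly on the per-list dataframes.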

First I'll try it on the data I already have. Then I'll start scraping the website.

local_lists = [bestsellers, library, penguin, ml_readers, pw_readers, psa]
for i in local_lists:
    print(i.keys())
Index(['Book', 'Author(s)', 'Original language', 'First published',
       'Approximate sales', 'Genre', 'Author'],
      dtype='object')
Index(['\nK', 'L', 'M', 'R', 'App.', 'Points', 'Rank', 'Author', 'Title',
       'Date'],
      dtype='object')
Index(['Title', ' Author', ' Year', 'Author'], dtype='object')
Index(['Book', ' Author', ' Date', ' Rank', 'Author'], dtype='object')
Index(['Book', ' Author', ' Date', 'Author'], dtype='object')
Index(['Title', ' Author', ' Date', ' Rank', 'Author'], dtype='object')

I don't care about date or genre or any of that; I can reconstruct it later. Either 'Book' or 'Title' exists in each one, along with 'Author' or in one case 'Author(s)', and that's all I can get from the Greatest Books website, so that's what I will work with here.
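The header inconsistencies (' Author' with a leading space, 'Book' versus 'Title') could also be ironed out up front. A small sketch, where title_author is a hypothetical helper and the renaming rules are assumptions based on the headers printed above:

```python
import pandas as pd


def title_author(df: pd.DataFrame) -> pd.DataFrame:
    """Return just Title and Author columns, whatever the source file called them."""
    df = df.rename(columns=lambda c: c.strip())  # ' Author' -> 'Author'
    df = df.rename(columns={"Book": "Title", "Author(s)": "Author"})
    # Renaming can produce duplicate column names; keep the first of each.
    df = df.loc[:, ~df.columns.duplicated()]
    return df[["Title", "Author"]]


raw = pd.DataFrame({"Book": ["Rebecca"], " Author": ["Daphne Du Maurier"], " Date": [1938]})
print(title_author(raw))
```

With every list reduced to the same two columns, the later steps no longer need per-file special cases.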

I didn't pair names with these lists, so I'll just zip up a little list of titles real quick:

list_names = ['Wikipedia Bestselling', 'Library Journal', 'Penguin Readers', 'Modern Library Readers', 'PW Bestsellers', 'Postcolonial Studies']
frames = []
for name, df in zip(list_names, local_lists):
    print(name)
    # Each file names its title and author columns a little differently.
    bk = 'Book' if 'Book' in df.keys() else 'Title'
    au = 'Author(s)' if 'Author(s)' in df.keys() else 'Author' if 'Author' in df.keys() else ' Author'
    df['Author'] = df[au]
    new_df = df[[bk, 'Author']].set_index(bk)
    new_df[name] = 1  # membership flag for this list
    frames.append(new_df)
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job.
titles_df = pd.concat(frames)
Wikipedia Bestselling
Library Journal
Penguin Readers
Modern Library Readers
PW Bestsellers
Postcolonial Studies
titles_df
Author Wikipedia Bestselling Library Journal Penguin Readers Modern Library Readers PW Bestsellers Postcolonial Studies
The Hobbit J. R. R. Tolkien 1.0 NaN NaN NaN NaN NaN
Harry Potter and the Philosopher's Stone J. K. Rowling 1.0 NaN NaN NaN NaN NaN
The Little Prince Antoine de Saint-Exupéry 1.0 NaN NaN NaN NaN NaN
Dream of the Red Chamber Cao Xueqin 1.0 NaN NaN NaN NaN NaN
And Then There Were None Agatha Christie 1.0 NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ...
Nervous Conditions Tsitsi Dangarembga NaN NaN NaN NaN NaN 1.0
The Palace Of The Peacock Wilson Harris NaN NaN NaN NaN NaN 1.0
Rebecca Daphne Du Maurier NaN NaN NaN NaN NaN 1.0
The Autobiography Of My Mother Jamaica Kincaid NaN NaN NaN NaN NaN 1.0
Cat’s Eye Margaret Atwood NaN NaN NaN NaN NaN 1.0

700 rows × 7 columns

onehot = titles_df.drop(columns='Author').fillna(0).groupby(level=0, sort=False).sum()

onehot
Wikipedia Bestselling Library Journal Penguin Readers Modern Library Readers PW Bestsellers Postcolonial Studies
The Hobbit 1.0 0.0 0.0 0.0 0.0 0.0
Harry Potter and the Philosopher's Stone 1.0 0.0 0.0 0.0 0.0 0.0
The Little Prince 1.0 0.0 0.0 0.0 0.0 0.0
Dream of the Red Chamber 1.0 0.0 0.0 0.0 0.0 0.0
And Then There Were None 1.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ...
Goodbye To Berlin 0.0 0.0 0.0 0.0 0.0 1.0
Nervous Conditions 0.0 0.0 0.0 0.0 0.0 1.0
The Palace Of The Peacock 0.0 0.0 0.0 0.0 0.0 1.0
The Autobiography Of My Mother 0.0 0.0 0.0 0.0 0.0 1.0
Cat’s Eye 0.0 0.0 0.0 0.0 0.0 1.0

603 rows × 6 columns

Brilliant! That's a one-hot encoding for each list, sorted by item. Now I just encapsulate that logic into a function and run it across each new list as I scrape them from the internet!

full_titles_df = titles_df
def books_df_from_list(list_from_htm):
    name, books = list_from_htm
    new_df = pd.DataFrame(data=books, columns=['Title','Author'])
    new_df[name] = 1
    return(new_df)

I need a list of all the fiction lists, not the nonfiction ones. I can scrape the URLs from the site's list index page.

url = "https://thegreatestbooks.org/lists/details"
r = requests.get(url)
htm = BeautifulSoup(r.text, 'html.parser')
hrefs = [o.get('href') for o in htm.find_all('a') if o.get('href') is not None]
links = [o for o in hrefs if '/lists/' in o]
links = [o for o in links if 'http' not in o]
links[:5]
['/lists/28', '/lists/122', '/lists/114', '/lists/120', '/lists/44']
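The substring filter above can let through non-list paths and repeated links. A slightly stricter sketch, assuming every list page has a path of the form /lists/<number> (which matches the sample output but is not something I have verified for the whole site); list_paths is a hypothetical helper:

```python
import re


def list_paths(hrefs):
    """Keep site-relative paths shaped like /lists/<number>, de-duplicated in first-seen order."""
    pattern = re.compile(r"^/lists/\d+$")
    return list(dict.fromkeys(h for h in hrefs if pattern.match(h)))


sample = [
    "/lists/28",
    "/lists/details",
    "/lists/122",
    "/lists/28",
    "https://example.com/lists/5",
    "/about",
]
print(list_paths(sample))
```

De-duplicating here matters because a list scraped twice would be appended twice and inflate its column when the duplicates are summed.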

Now we put it all together, pull each list and append it to the df, then merge the duplicates for a great big one-hot-encoded best Books list...

from tqdm import tqdm
for link in tqdm(links):
    url = f"https://thegreatestbooks.org{link}"
    try:
        r = requests.get(url)
        htm = BeautifulSoup(r.text, 'html.parser')
        bks = get_list_from_htm(htm)
        new_df = books_df_from_list(bks)
        # DataFrame.append was removed in pandas 2.0; pd.concat does the same job.
        full_titles_df = pd.concat([full_titles_df, new_df.set_index('Title')])
    except IndexError as e:
        # Pages without an <h2> are not book lists; report the URL and move on.
        print(e, url)
full_titles_df
Author Wikipedia Bestselling Library Journal Penguin Readers Modern Library Readers PW Bestsellers Postcolonial Studies Top 100 Works in World Literature by Norwegian Book Clubs, with the Norwegian Nobel Institute Biblioteca by Argentina The 25 Favorite Books of 100 Francophone Writers by Telerama For The Love of Books by For The Love of Books The Top 10: The Greatest Books of All Time by The Top 10 (Book) The 100 Best Books of World Literature by ABC.es The Ideal Library by Book El Pais Favorite Books of 100 Spanish Authors by El Pais Pour une Bibliothèque Idéale by Raymond Queneau 1001 Books You Must Read Before You Die by The Book Koen Book Distributors Top 100 Books of the Past Century by themodernnovel.com The Celebrity Reading List by Gardiner Public Library Finest Works of Fiction by Martin Seymour-Smith and Editors The 100 Best Non-Fiction Books of the Century by National Review Great Books of the Western World by Great Books Foundation 100 Life-Changing Books by National Book Award The New York Public Library's Books of the Century by New York Public Library The New Lifetime Reading Plan by The New Lifetime Reading Plan Great Books by The Learning Channel The 100 Greatest British Novels by BBC The 50 Best Books of the Century by Intercollegiate Studies Institute Världsbiblioteket (The World Library) by Tidningen Boken Recommended Books by Academy of Achievement The Greatest 20th Century Novels by Waterstone "Best Foreign Work of Fiction" by Transfuge The 16 Greatest Books of All Time by NYU Local ZEIT-Bibliothek der 100 Bücher by Die Zeit The Bigger Read List by English PEN 100 Novels That Shaped Our World by BBC "Our Readable Century", The Best Books of the 20th Century by January Magazine 100 Books to Read in a Lifetime by Amazon.com (USA) 100 Books to Read in a Lifetime by Amazon.com (UK) The 100 Greatest Books Ever Written by Easton Press 48 Good Books by University of Buffalo 25 acclaimed international writers choose 25 of the best books from the last 
25 years by Wasafiri Magazine The Modern Library | 100 Best Novels by Modern Library The 21st Century's 12 Greatest Novels by BBC Top 100 World Literature Titles by Perfection Learning A Premature Attempt at the 21st Century Canon by Vulture The 100 Favorite Novels of Librarians by Bookman.com Third World Novels… The Top 10 by New Internationalist 110 Best Books: The Perfect Library by The Telegraph The Great American Read by PBS Best German Novels of the Twentieth Century by Wikipedia The New Vanguard by New York Times Select 100 by University of Wisconsin-Milwaukee The Modern Library | 100 Best Nonfiction by The Modern Library The Millions: The Best Fiction of the Millennium by The Millions 100 Best Books by Montana State University 100 Best Novels in English Since 1900 by Counterpunch Radcliffe's 100 Best Novels by Radcliffe Publishing Course The 75 Best Books of the Past 75 Years by Parade Magazine Harvard Book Store Staff's Favorite 100 Books by Harvard Book Store Man Booker Prize by Man Booker Prize PEN/Faulkner Award for Fiction by PEN/Faulkner Pulitzer Prize for Fiction by Pulitzer Prize National Book Award - Nonfiction by National Book Foundation Pulitzer Prize for Biography or Autobiography by Pulitzer Prize James Tait Black Memorial Prize by Wikipedia National Book Critics Circle Award - Fiction by National Book Critics Circle National Book Critics Circle Award - Nonfiction by National Book Critics Circle Best Books Ever by bookdepository.com Pulitzer Prize for History by Pulitzer Prize National Book Award - Fiction by National Book Foundation Pulitzer Prize for Non-Fiction by Pulitzer Prize How to Read Literature Like a Professor: A Reading List by Thomas C. Foster 100 Essential Books by Bravo! Magazine The Greatest Novel of All Time by William Faulkner W. 
Somerset Maugham’s Ten Greatest Novels of All Time by Great Novelists and Their Novels The 100 Greatest Novels by greatbooksguide.com The College Board: 101 Great Books Recommended for College-Bound Readers by http://www.uhlibrary.net/pdf/college_board_recommended_books.pdf The Great Books Reader by Book The Best Classics by The Times In Which These Are the 100 Greatest Novels by ThisRecording.com How to Read and Why by Harold Bloom The Novel 100: A Ranking of the Greatest Novels of All Time by The Novel 100 50 Greatest Books of All Time by Globe and Mail The 100 Greatest Novels of All Time: The List by The Observer Books That Changed the World: The 50 Most Influential Books in Human History by Book Masterpieces of World Literature by Frank N. Magill Great Books by Anthony O'Hear TIME Magazine All Time 100 Novels by TIME Magazine The Telegraph’s 100 Novels Everyone Should Read by Telegraph 50 Books That Changed the World by Open Education Database The best books in Spanish for the last 25 years by El Pais Top 10 British, Irish or Commonwealth Novels from 1980 to 2005 by The Observer The 100 best books of the 21st century by The Guardian The 100 Best Books in the World by AbeBooks.de (in German) What Is the Best Work of American Fiction of the Last 25 Years? by New York Times 50 Books to Read Before You Die by Complex The 50 Best Nonfiction Books of the Past 25 Years by Slate D. G. Myers’ 50 Greatest English Language Novels by D. G. 
Myers Modern classics: 11 novels that belong in the classroom by Today.com The 100 Best Books of the Decade(2000) by Times 50 Books to (Re-)Read at 50 by nextavenue The Best Books of the 2000s by The Onion AV Club Extreme Classics: The 100 Greatest Adventure Books of All Time by National Geographic Adventure Magazine The Best Southern Novels of All Time by Oxford American Books of the Decade by The Guardian The 80 Books Every Man Should Read by Esquire Paste Magazine's Best Books of the Decade(2000-2009) by Paste Magazine The 50 Books Everyone Needs to Read, 1963-2013 by Flavor Wire The New Classics - 100 Best Reads from 1983 to 2008 by Entertainment Weekly The 10 Best of the Decade(2000) by Entertainment Weekly From Zero to Well-Read in 100 Books by Jeff O'Neal at Bookriot.com 50 Books to Read Before You Die by Barnes and Noble The Book of Great Books: A Guide to 100 World Classics by Book The 100 Greatest American Novels, 1893 – 1993 by Jeff O'Neal at Bookriot.com Donald Barthelme’s Reading List by Believer Mag 100 Best Novels Written in English by The Guardian Books That Changed the World by Book Costa Book Award - Best Novel by Costa Coffee Le Monde's 100 Books of the Century by Le Monde Entertainment Weekly's Top 100 Novels by Entertainment Weekly The Dream of the Great American Novel by Book The Graphic Canon by Book Greatest Prose Works of the 20th Century by Vladimir Nabokov 20th Century's Greatest Hits: 100 English-Language Books of Fiction by Larry McCaffery Robert McCrum's top 10 books of the twentieth century by The Guardian 100 Major Works of Modern Creative Nonfiction by About.com Waterstone's Books of the Century by LibraryThing The Best Fiction Books of the 2010s by Time 50 Memorable Books from 50 Years of Books to Remember by The New York Public Library The 100 Best Nonfiction Books of All Time by The Guardian The Best Southern Nonfiction of All Time by Oxford American The 100 Most Influential Books Ever Written by Martin Seymour-Smith 25 Books to 
Read Before you Die: 21st Century by Powell's Books Top 10 Fiction Books of the Decade(2010) by Entertainment Weekly The 40 Best Novels of the 2010s by Paste Magazine 100 Most Influential Books of the Century by Boston Public Library
The Hobbit J. R. R. Tolkien 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Harry Potter and the Philosopher's Stone J. K. Rowling 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
The Little Prince Antoine de Saint-Exupéry 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Dream of the Red Chamber Cao Xueqin 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
And Then There Were None Agatha Christie 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
A Room of One's Own Virginia Woolf NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.0
Native Son Richard Wright NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.0
Syntactic Structures Noam Chomsky NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.0
The Feminine Mystique Betty Friedan NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.0
Mein Kampf Adolf Hitler NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.0

11225 rows × 137 columns

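# sum(level=0) was removed in pandas 2.0; grouping on the first index level (the title) is the equivalent.
# sort=False keeps the titles in their order of first appearance.
full_onehot = full_titles_df.fillna(0).groupby(level=0, sort=False).sum()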

full_onehot
Wikipedia Bestselling Library Journal Penguin Readers Modern Library Readers PW Bestsellers Postcolonial Studies Top 100 Works in World Literature by Norwegian Book Clubs, with the Norwegian Nobel Institute Biblioteca by Argentina The 25 Favorite Books of 100 Francophone Writers by Telerama For The Love of Books by For The Love of Books The Top 10: The Greatest Books of All Time by The Top 10 (Book) The 100 Best Books of World Literature by ABC.es The Ideal Library by Book El Pais Favorite Books of 100 Spanish Authors by El Pais Pour une Bibliothèque Idéale by Raymond Queneau 1001 Books You Must Read Before You Die by The Book Koen Book Distributors Top 100 Books of the Past Century by themodernnovel.com The Celebrity Reading List by Gardiner Public Library Finest Works of Fiction by Martin Seymour-Smith and Editors The 100 Best Non-Fiction Books of the Century by National Review Great Books of the Western World by Great Books Foundation 100 Life-Changing Books by National Book Award The New York Public Library's Books of the Century by New York Public Library The New Lifetime Reading Plan by The New Lifetime Reading Plan Great Books by The Learning Channel The 100 Greatest British Novels by BBC The 50 Best Books of the Century by Intercollegiate Studies Institute Världsbiblioteket (The World Library) by Tidningen Boken Recommended Books by Academy of Achievement The Greatest 20th Century Novels by Waterstone "Best Foreign Work of Fiction" by Transfuge The 16 Greatest Books of All Time by NYU Local ZEIT-Bibliothek der 100 Bücher by Die Zeit The Bigger Read List by English PEN 100 Novels That Shaped Our World by BBC "Our Readable Century", The Best Books of the 20th Century by January Magazine 100 Books to Read in a Lifetime by Amazon.com (USA) 100 Books to Read in a Lifetime by Amazon.com (UK) The 100 Greatest Books Ever Written by Easton Press 48 Good Books by University of Buffalo 25 acclaimed international writers choose 25 of the best books from the last 25 
years by Wasafiri Magazine The Modern Library | 100 Best Novels by Modern Library The 21st Century's 12 Greatest Novels by BBC Top 100 World Literature Titles by Perfection Learning A Premature Attempt at the 21st Century Canon by Vulture The 100 Favorite Novels of Librarians by Bookman.com Third World Novels… The Top 10 by New Internationalist 110 Best Books: The Perfect Library by The Telegraph The Great American Read by PBS Best German Novels of the Twentieth Century by Wikipedia The New Vanguard by New York Times Select 100 by University of Wisconsin-Milwaukee The Modern Library | 100 Best Nonfiction by The Modern Library The Millions: The Best Fiction of the Millennium by The Millions 100 Best Books by Montana State University 100 Best Novels in English Since 1900 by Counterpunch Radcliffe's 100 Best Novels by Radcliffe Publishing Course The 75 Best Books of the Past 75 Years by Parade Magazine Harvard Book Store Staff's Favorite 100 Books by Harvard Book Store Man Booker Prize by Man Booker Prize PEN/Faulkner Award for Fiction by PEN/Faulkner Pulitzer Prize for Fiction by Pulitzer Prize National Book Award - Nonfiction by National Book Foundation Pulitzer Prize for Biography or Autobiography by Pulitzer Prize James Tait Black Memorial Prize by Wikipedia National Book Critics Circle Award - Fiction by National Book Critics Circle National Book Critics Circle Award - Nonfiction by National Book Critics Circle Best Books Ever by bookdepository.com Pulitzer Prize for History by Pulitzer Prize National Book Award - Fiction by National Book Foundation Pulitzer Prize for Non-Fiction by Pulitzer Prize How to Read Literature Like a Professor: A Reading List by Thomas C. Foster 100 Essential Books by Bravo! Magazine The Greatest Novel of All Time by William Faulkner W. 
Somerset Maugham’s Ten Greatest Novels of All Time by Great Novelists and Their Novels The 100 Greatest Novels by greatbooksguide.com The College Board: 101 Great Books Recommended for College-Bound Readers by http://www.uhlibrary.net/pdf/college_board_recommended_books.pdf The Great Books Reader by Book The Best Classics by The Times In Which These Are the 100 Greatest Novels by ThisRecording.com How to Read and Why by Harold Bloom The Novel 100: A Ranking of the Greatest Novels of All Time by The Novel 100 50 Greatest Books of All Time by Globe and Mail The 100 Greatest Novels of All Time: The List by The Observer Books That Changed the World: The 50 Most Influential Books in Human History by Book Masterpieces of World Literature by Frank N. Magill Great Books by Anthony O'Hear TIME Magazine All Time 100 Novels by TIME Magazine The Telegraph’s 100 Novels Everyone Should Read by Telegraph 50 Books That Changed the World by Open Education Database The best books in Spanish for the last 25 years by El Pais Top 10 British, Irish or Commonwealth Novels from 1980 to 2005 by The Observer The 100 best books of the 21st century by The Guardian The 100 Best Books in the World by AbeBooks.de (in German) What Is the Best Work of American Fiction of the Last 25 Years? by New York Times 50 Books to Read Before You Die by Complex The 50 Best Nonfiction Books of the Past 25 Years by Slate D. G. Myers’ 50 Greatest English Language Novels by D. G. 
Myers Modern classics: 11 novels that belong in the classroom by Today.com The 100 Best Books of the Decade(2000) by Times 50 Books to (Re-)Read at 50 by nextavenue The Best Books of the 2000s by The Onion AV Club Extreme Classics: The 100 Greatest Adventure Books of All Time by National Geographic Adventure Magazine The Best Southern Novels of All Time by Oxford American Books of the Decade by The Guardian The 80 Books Every Man Should Read by Esquire Paste Magazine's Best Books of the Decade(2000-2009) by Paste Magazine The 50 Books Everyone Needs to Read, 1963-2013 by Flavor Wire The New Classics - 100 Best Reads from 1983 to 2008 by Entertainment Weekly The 10 Best of the Decade(2000) by Entertainment Weekly From Zero to Well-Read in 100 Books by Jeff O'Neal at Bookriot.com 50 Books to Read Before You Die by Barnes and Noble The Book of Great Books: A Guide to 100 World Classics by Book The 100 Greatest American Novels, 1893 – 1993 by Jeff O'Neal at Bookriot.com Donald Barthelme’s Reading List by Believer Mag 100 Best Novels Written in English by The Guardian Books That Changed the World by Book Costa Book Award - Best Novel by Costa Coffee Le Monde's 100 Books of the Century by Le Monde Entertainment Weekly's Top 100 Novels by Entertainment Weekly The Dream of the Great American Novel by Book The Graphic Canon by Book Greatest Prose Works of the 20th Century by Vladimir Nabokov 20th Century's Greatest Hits: 100 English-Language Books of Fiction by Larry McCaffery Robert McCrum's top 10 books of the twentieth century by The Guardian 100 Major Works of Modern Creative Nonfiction by About.com Waterstone's Books of the Century by LibraryThing The Best Fiction Books of the 2010s by Time 50 Memorable Books from 50 Years of Books to Remember by The New York Public Library The 100 Best Nonfiction Books of All Time by The Guardian The Best Southern Nonfiction of All Time by Oxford American The 100 Most Influential Books Ever Written by Martin Seymour-Smith 25 Books to 
Read Before you Die: 21st Century by Powell's Books Top 10 Fiction Books of the Decade(2010) by Entertainment Weekly The 40 Best Novels of the 2010s by Paste Magazine 100 Most Influential Books of the Century by Boston Public Library
The Hobbit 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Harry Potter and the Philosopher's Stone 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
The Little Prince 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Dream of the Red Chamber 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
And Then There Were None 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Decline of the West 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
The History of the Standard Oil Company 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
Theory of Games and Economic Behavior 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
AA Big Book 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
Behaviorism 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0

4414 rows × 136 columns
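For anyone following along without the data, here is a minimal sketch of what that collapse does, on a made-up three-row frame. The names `toy_votes` and `toy_onehot` and the columns `List A` / `List B` are placeholders of mine, not part of the real dataset, and I'm assuming the same layout as above: a (title, author) index with one column per list. It uses `groupby(level=0).sum()`, which should give the same result as the level-0 sum above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for full_titles_df: one row per (title, author) "vote",
# one column per list, 1.0 where the list includes the book, NaN elsewhere.
index = pd.MultiIndex.from_tuples(
    [
        ("Ulysses", "James Joyce"),
        ("Ulysses", "James Joyce"),
        ("Lolita", "Vladimir Nabokov"),
    ],
    names=["title", "author"],
)
toy_votes = pd.DataFrame(
    [[1.0, np.nan], [np.nan, 1.0], [1.0, np.nan]],
    index=index,
    columns=["List A", "List B"],
)

# Fill the gaps with 0, then add up all rows that share a title.
toy_onehot = toy_votes.fillna(0).groupby(level=0, sort=False).sum()
print(toy_onehot)
```

Because this is a sum, a title that turns up twice under the same list (two different books with the same name, say) would get a 2 in that cell rather than a 1, so the result is only strictly one-hot if that never happens.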

full_onehot.to_csv('../../records/greatest-book-lists-onehot.csv')

And there we have it! 136 book lists, 11,225 votes, and 4,414 unique titles. Now we can do all kinds of math on these features: build recommender systems, draw network graphs, and so on. But for now: a quick ranking by count, then sweet sleep...

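# Pair each title with the number of lists it appears on.
counts = zip(full_onehot.index, full_onehot.sum(axis=1))
# Most-listed first; sorted() is stable, so ties keep their original order.
counts_df = pd.DataFrame(sorted(counts, key=lambda x: x[1], reverse=True))
counts_df.columns = ['Title', 'Listed count']
# Start the ranking at 1 instead of 0.
counts_df.index = counts_df.index + 1
counts_df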
Title Listed count
1 Ulysses 51.0
2 The Great Gatsby 50.0
3 One Hundred Years of Solitude 44.0
4 Lolita 43.0
5 Nineteen Eighty Four 42.0
... ... ...
4410 Decline of the West 1.0
4411 The History of the Standard Oil Company 1.0
4412 Theory of Games and Economic Behavior 1.0
4413 AA Big Book 1.0
4414 Behaviorism 1.0

4414 rows × 2 columns

That looks like the classics, all right... :shrug:

I'll save this too, and upload it to my website so you can peruse it yourself. Find it here (CSV).

counts_df.to_csv('../assets/2021-07-27-found-canon.csv')
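As a pointer toward the network-graph idea mentioned earlier, here is a rough sketch of one way to get from a title-by-list matrix to edge weights: multiply the matrix by its own transpose to count how many titles each pair of lists shares. The frame below is invented toy data, and the names `toy_onehot`, `membership` and `list_overlap` are mine; I haven't run this on the real 4,414 × 136 matrix.

```python
import pandas as pd

# Toy title-by-list matrix, shaped like full_onehot (rows: titles, columns: lists).
toy_onehot = pd.DataFrame(
    {
        "List A": [1.0, 1.0, 0.0],
        "List B": [1.0, 0.0, 1.0],
        "List C": [1.0, 1.0, 0.0],
    },
    index=["Ulysses", "Lolita", "Native Son"],
)

# Clip to 0/1 so a title counted twice on one list doesn't inflate the overlap.
membership = (toy_onehot > 0).astype(int)

# (lists x titles) @ (titles x lists): cell [i, j] is the number of titles
# that list i and list j have in common; the diagonal is each list's length.
list_overlap = membership.T @ membership
print(list_overlap)
```

Flipping the product around (`membership @ membership.T`) would give the title-by-title version instead: how many lists two books share, which is closer to what a recommender would want.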