What requirements do I have for Invisible Canon?

All canons are biased, so let me lay out my biases now.

One requirement I've already discussed:

  • 5,000 books maximum

This is to fit on the shelves, but also to define a reasonable outer limit on the number of books that could be considered "canonical". Though the number of books keeps growing, even accelerating, there's a limit to how much any person can be asked to read.

If you read one book a week, it would take about 96 years to read the whole five thousand. A goal to aspire to.
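A quick sketch of that arithmetic, in case you want to try a different pace. The function and constant names are mine, and the only inputs are the shelf limit and a steady weekly reading rate.

```python
# Back-of-envelope reading time for a fixed-size canon.
CANON_SIZE = 5000       # the maximum number of books stated above
WEEKS_PER_YEAR = 52


def years_to_read(books, books_per_week=1.0):
    """Years needed to read `books` at a steady weekly pace."""
    return books / (books_per_week * WEEKS_PER_YEAR)


print(round(years_to_read(CANON_SIZE), 1))      # one book a week
print(round(years_to_read(CANON_SIZE, 2), 1))   # two books a week
```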

Another requirement, hinted above, is:

  • these books have to fit into a certain number of shelves, divided into categories

We have about half our shelves given to novels, and the other half to nonfiction of various genres. I would like to keep it that way, for secret reasons.

Our ontology is nonstandard, but matches closely enough to the model in the above image. The proportion of books in each category could be changed. In fact I would like to change it, to more accurately reflect our sales data. I'll return to that.

Beyond that, I have a general preference for

  • popular, i.e. marketable, books.

This is not a library, after all. A library values rare and well-preserved books. A bookstore values books that sell quickly and often, and are easy to replace.

For the latter reason, another preference is for

  • books that are in print (and ideally stocked by my distributor)

Finally, I would like to avoid certain types of books, which have proven to not be worth carrying for various reasons: they are inflammatory, controversial, or ephemeral, or they are printed in such numbers that they are massively discounted at big-box stores. The bookstore has limited space, time and money for books -- in fact we all do -- so we want to carry a mostly consistent selection, rather than ever-shuffling flavors of the month. Framed as a positive criterion, let's say:

  • the classics, old, modern and future.

What books will become classics?

In fact, another dimension on which we can think of books is something like classicness. I read somewhere that 6,000 books were published in the year 1920. Among them was James Joyce's Ulysses, a modern classic and one of the most canonical listed in Pamphlet 17. How many other books do we remember from that year? A hundred or so are listed on Wikipedia. Of those I recognize maybe ten.

A one in a thousand chance of a book becoming a classic? Not bad odds, you might think. But in 2019, four million books were published. Will we remember even 4,000 of them in a hundred years? I doubt it.

If the odds of creating a classic haven't scaled with the acceleration of printing, perhaps they have scaled with the growth of the literate population. Let's do a little code (finally!) and see if that seems reasonable.

Data below comes from Our World in Data

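# Literacy rates, in percent of the population
pop_literate_pc = {'1920': 31.62, '2016': 86.25}
pop_illiterate_pc = {'1920': 68.38, '2016': 13.75}

# The 1920 total is a sum of regional populations, in millions.
# The 2016 expression was cut off in this export, so the value
# below is copied from the printed output of the original cell.
total_pop = {'1920': (7.38 + 95.63 + 114.87 + 47.21 + 164.86 + 1005) * 1000000,
             '2016': 7464380000.0}
# total_pop -> {'1920': 1434950000.0, '2016': 7464380000.0}

total_literate_pop = {k: total_pop[k] * pop_literate_pc[k] * .01 for k in total_pop}
# total_literate_pop -> {'1920': 453731190.0, '2016': 6438027750.0}

literate_pop_diff = total_literate_pop['2016'] / total_literate_pop['1920']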

There were about 14 times as many literate people in 2016 as there were in 1920. That includes both writers and readers, so perhaps it will scale to our imaginary classicness metric.

Given my ballpark estimate of 1 major work and 10 classics from 1920, that would suggest that 14 major works and 140 classics were published in 2016. Is that reasonable? Perhaps still a high estimate, but looking through Goodreads' 'Most popular books published in 2016' I recognize quite a few as big sellers (though ironically I've only read one of the top 40, because I spend my time doing things like this instead of reading).

If this theory holds, what is the relation between the number of literate people and the number of classics printed?

total_literate_pop['2016'] / 140

Maybe one in every 46 million people writes a classic book? Well, perhaps it's not fair to use the number of readers instead of the number of people trying to write. If we assume that a writer publishes one book in a year -- a reasonable assumption, given publishing house timelines -- how many books were published in 2016? We can divide that by our hypothetical 140 classics.

Wikipedia's latest figures come from a wide range of years, but they should average out to roughly the 2010s. They add up to a total of 2,200,000!

2200000 / 140

About one in sixteen thousand books becomes a classic? And about one in a hundred and sixty thousand books becomes a major work.
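For anyone who wants to vary the guesses, here is the whole chain of estimates in one place. It is only a sketch: the 1920 counts are the ballpark figures from the text, the variable names are mine, and scaling by literate population is the hypothesis being tested, not an established result.

```python
# Figures quoted in the text above. The 1920 counts are ballpark guesses.
literate_pop = {'1920': 453731190.0, '2016': 6438027750.0}
classics_1920 = 10
major_works_1920 = 1
books_published = 2200000   # Wikipedia total cited above

# Scale the 1920 counts by the growth in literate population.
scale = round(literate_pop['2016'] / literate_pop['1920'])   # ratio is about 14.2
classics_2016 = classics_1920 * scale
major_works_2016 = major_works_1920 * scale

# Express each estimate as "one in N".
people_per_classic = literate_pop['2016'] / classics_2016
books_per_classic = books_published / classics_2016
books_per_major_work = books_published / major_works_2016

print(f"scale factor: {scale}")
print(f"hypothetical classics in 2016: {classics_2016}")
print(f"literate people per classic: {people_per_classic:,.0f}")
print(f"books per classic: {books_per_classic:,.0f}")
print(f"books per major work: {books_per_major_work:,.0f}")
```

Changing `classics_1920` to five or twenty moves every later figure in proportion, which shows how much the result rests on that one guess.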

Sounds believable to me...

What is the point of this "classicness" discussion?

Classicness is a different variable than prestige or popularity. It's something like prominence: you can't tell which mountain is the biggest when you're in the mountains, but you can see it from a distance. At distance, though, details are lost. It's hard to see who was an almost-classic of a hundred years ago. And as time passes, even the classics are blurred, and only the really great works stand out. Books from hundreds or thousands of years ago, if they're still in print today, have a higher average quality than books of recent eras. They had to be good, to be copied or preserved through so much time.

Initial conditions, structural biases and random chance all play a part in what becomes classic, of course. Some books that should be classics simply never got the reach that they deserved. One could argue for a prescriptive canon, that proactively rights these wrongs, and restores the true greats to their rightful position; but there are many better qualified than I to make those claims. I will attempt to make a descriptive analysis of the data, with my biases stated plainly above, and leave it to them to challenge my findings.

Whatever intrinsic or extrinsic forces cause books to become classic, the fact is that some do. Some books reach through time and across half the world to touch people and inspire them. I want to know why, what factors affect it, and extrapolate that to all the books in the archive. A human eye may not be able to identify a future classic, but perhaps a computer can.


The criteria, summed:

  1. Five thousand books
  2. Divided into categories
  3. Marketable books
  4. Still in print
  5. With classic potential

I want a set of books that you could theoretically buy and fill your house with. Small enough to read over a lifetime, but big enough to ramble around in. Books that reward rereads, books that allow you to participate in the Great Conversation. Representative of diverse voices, across time as well as space. A concentrated seed of wisdom that could also be a time capsule of our era.

If I plant this seed -- if I buy all these books and stock them in the neighborhood store -- a forest will grow. Each book sold is another seed sown nearby.

I hope that people who look back at this canon will think it worthy of talking about, even if it is only the amateur science of an uneducated bookseller. Many other canons exist, but I don't think they show the true balance of what people are reading in 2021. I hope to restore that balance.

What categories of book should I canonize, and in what proportions?

The store has a certain number of shelves, divided into a custom ontology of sections with different numbers of shelves per section. The number of shelves per section was originally set just by how many of each kind of book we could acquire (starting a used-and-new bookstore from scratch is a bit of a scramble). We also focused on local authors at the beginning, but shifted away from that as it was clear their sales couldn't be the main profit center of the store.

In general, we have shifted the shelf ratios to match the profit generation of each section. This may be subject to feedback loop effects, and we're still limited by available used stock. So though I would like to use pretty much the same set of categories, I want to change their relative sizes.

I have aggregated some data from my store into the file sections.csv. We can explore it a bit here to get a feel for the ontology.

import pandas as pd
sections = pd.read_csv('../assets/sections.csv')
sections.head()
          section  sold_num  on_hand_num  samples
0         Fiction      2932         2222  Kids Book, A Promised Land, A Gentleman in Mos...
1  Sci-Fi/Fantasy       919          602  Dune, The Three-Body Problem, Kindred
2         Beliefs       915          465  The Four Agreements: A Practical Guide to Pers...
3       Southwest       814          486  60 Hikes Within 60 Miles: Albuquerque: Includi...
4            Kids       761          358  $2.00 Book, Stinky Monster Pooems, Happy Monst...
%matplotlib inline

import numpy as np
from numpy.polynomial.polynomial import polyfit
import matplotlib.pyplot as plt

x = sections['on_hand_num']
y = sections['sold_num']
fig, ax = plt.subplots(figsize=(15,10))

plt.plot(x, y, '.')

# Label each dot with its section name
sections[['on_hand_num', 'sold_num', 'section']].apply(lambda x: ax.text(*x), axis=1)

# Linear regression to find the middle
b, m = polyfit(x, y, 1)
plt.plot(x, b + m * x, '-')

# Save to disk

Can you tell that we have a whole wall of Fiction books and an aisle each of Sci-Fi and Mystery?

This chart shows the magnitude of each category, but the size disparity makes it hard to read. Let's run the same plot but put the axes on a log scale.

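fig, ax = plt.subplots(figsize=(15,10))
plt.plot(x, y, '.')

# Put both axes on a log scale so the small sections spread out
ax.set_xscale('log')
ax.set_yscale('log')

# Label each dot with its section name
sections[['on_hand_num', 'sold_num', 'section']].apply(lambda row: ax.text(*row), axis=1)

# Linear regression to find the middle
# (the fit is done in linear units, so the line bends on log axes)
b, m = polyfit(x, y, 1)
plt.plot(x, b + m * x, '-')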

# Save to disk

This graph shows which categories are doing better and worse. The x axis is books on shelf, and the y axis is books sold. So categories above the line are selling better than would be expected given their size, and categories below are selling worse.
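"Above the line" and "below the line" can also be read off as numbers: fit the line, then subtract its prediction from each section's actual sales. The sketch below does this by hand for the five sample rows printed earlier. That is an assumption for illustration only; the real chart uses every section in sections.csv, so the signs here will not necessarily match the full plot.

```python
# (on_hand_num, sold_num) for the five sample rows shown earlier.
rows = {
    'Fiction': (2222, 2932),
    'Sci-Fi/Fantasy': (602, 919),
    'Beliefs': (465, 915),
    'Southwest': (486, 814),
    'Kids': (358, 761),
}

xs = [on_hand for on_hand, _ in rows.values()]
ys = [sold for _, sold in rows.values()]
mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)

# Ordinary least squares for sold = b + m * on_hand
m = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
     / sum((x - mean_x) ** 2 for x in xs))
b = mean_y - m * mean_x

# Positive residual: sold more than the line predicts for that shelf size.
residuals = {name: sold - (b + m * on_hand)
             for name, (on_hand, sold) in rows.items()}

for name, r in sorted(residuals.items(), key=lambda kv: kv[1], reverse=True):
    side = 'above' if r > 0 else 'below'
    print(f"{name:15s} {r:8.1f}  ({side} the line)")
```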

This doesn't necessarily mean that underperforming categories should be downsized, though! The ideal balance for every category would be to sell more than we keep on hand, but that can be accomplished through different means. Location is one factor: Machines, Buildings, Ancients and Games are all single shelves located in hard-to-reach places; perhaps they could be relocated. Some categories don't work as well in "used" versions: Family, Food, and Money are all categories where people want the newest research, not old advice and fads that have ended. Many of these sections would likely sell better if they had more new books. A few, especially History, Memoir and Biography, are legit overstocked. To even get any new books into them we'll have to cull some, and we can also reduce their overall size to expand other categories.

In contrast, categories that sell much better than their size mostly already have New books incorporated into them. Sci-fi, Beliefs, Southwest, Flora, Brain and Fiction are categories where we have made special effort to carry a consistent selection of new books. We have also done this with People, Memoir, and Self, though they don't seem to be selling better than average. They may have started off slowly, though; especially Self, which is one of those sections where books age quickly and people want the latest stuff.

An interesting counterpart to Classicness, here: which categories have the most classic books? One would guess Fiction, Philosophy, Poetry, History, Drama? And Beliefs, of course: what's more canon than religion? These are the type of book we started grouping in a shelf called "Classics" (based mainly off whether the book itself says "classic" on the cover lol), which might provide signal on what types of book are currently canonized by publishers.

Classic books are those which get better with age. But there are also Anticlassic genres, where the newest information is probably the best: Family, Machines, Money, Games, Food, Science, and the sciencey categories Brain, Health, Fauna, Flora and Earth. Here we should expect to see fewer old books in the classics, and generally a smaller collection of canon-worthy books. Average age of a book in the category might be a useful feature to extract.
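As a rough idea of how the average-age feature might be computed, here is a sketch. The publication years are invented for illustration, and the reference year and function name are my own assumptions; the real version would run over the full book table.

```python
from collections import defaultdict

REFERENCE_YEAR = 2021   # assumption: the year the canon is assembled

# Hypothetical (section, publication_year) pairs, made up for this example.
books = [
    ('Poetry', 1855), ('Poetry', 1922), ('Poetry', 2014),
    ('Money', 2015), ('Money', 2018), ('Money', 2020),
]


def average_age_by_section(rows, reference_year=REFERENCE_YEAR):
    """Mean age in years of the books in each section."""
    ages = defaultdict(list)
    for section, year in rows:
        ages[section].append(reference_year - year)
    return {section: sum(a) / len(a) for section, a in ages.items()}


avg_age = average_age_by_section(books)
for section, age in sorted(avg_age.items(), key=lambda kv: kv[1]):
    print(f"{section:8s} {age:6.1f} years")
```

A low average age would hint at an "anticlassic" section, though a median might be more robust if a section holds one very old outlier.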

In general, the fact that the categories follow a clear trend tells me that the number of books sold is strongly influenced by the number of books stocked. Which frees up the constraint a bit! I can change the sizes of various categories (both to fit the data, and to fit my preference as a curator) without worrying about a catastrophic shift in sales. And remember: "sales" here are working as a proxy for the public's desire for books, not necessarily any commercial intentions on my part.
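One simple way to "fit the data" is to compare each section's share of stock with its share of sales, and see which way the shelf space would move if the two were matched. This sketch uses only the five sample rows printed earlier, so the shares are relative to those five sections rather than the whole store, and the variable names are mine.

```python
# sold_num and on_hand_num for the five sample rows shown earlier.
sold = {'Fiction': 2932, 'Sci-Fi/Fantasy': 919, 'Beliefs': 915,
        'Southwest': 814, 'Kids': 761}
on_hand = {'Fiction': 2222, 'Sci-Fi/Fantasy': 602, 'Beliefs': 465,
           'Southwest': 486, 'Kids': 358}

total_sold = sum(sold.values())
total_stock = sum(on_hand.values())

shares = {}
for section in sold:
    stock_share = on_hand[section] / total_stock
    sales_share = sold[section] / total_sold
    # Stock this section would hold if shelf space followed sales.
    target_stock = round(sales_share * total_stock)
    shares[section] = (stock_share, sales_share, target_stock)

for section, (stock_share, sales_share, target) in shares.items():
    print(f"{section:15s} stock {stock_share:5.1%}  sales {sales_share:5.1%}  "
          f"on hand {on_hand[section]:5d} -> {target:5d}")
```

Matching shares exactly is only one possible rule; curatorial preference and the supply of used stock would pull the numbers away from it.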

The next step will be to gather as much data as I can from the internet, and create a giant table of correspondences (probably by ISBN/EAN, the existing categorical ID for books). Then explore it for usable features.
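A minimal sketch of what joining sources by ISBN could look like. The two source dictionaries and their field names are hypothetical, invented for the example. The check-digit rule is the standard one for ISBN-13 (digits weighted alternately 1 and 3 must sum to a multiple of 10), and it is used here to drop mistyped keys before merging.

```python
def normalize_isbn(raw):
    """Strip hyphens and spaces so differently formatted ISBNs match."""
    return raw.replace('-', '').replace(' ', '')


def is_valid_isbn13(raw):
    """Check length, digits, and the ISBN-13 check digit."""
    digits = normalize_isbn(raw)
    if len(digits) != 13 or not digits.isdigit():
        return False
    total = sum(int(d) * (1 if i % 2 == 0 else 3)
                for i, d in enumerate(digits))
    return total % 10 == 0


def merge_by_isbn(*sources):
    """Combine per-ISBN records from several sources into one table."""
    table = {}
    for source in sources:
        for raw, fields in source.items():
            if not is_valid_isbn13(raw):
                continue   # skip keys that fail the checksum
            table.setdefault(normalize_isbn(raw), {}).update(fields)
    return table


# Hypothetical sources; the second key in web_ratings has a bad check digit.
store_sales = {'978-0-306-40615-7': {'sold_num': 3}}
web_ratings = {'9780306406157': {'avg_rating': 4.1},
               '9780306406158': {'avg_rating': 2.0}}

table = merge_by_isbn(store_sales, web_ratings)
print(table)
```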