I'm building a canon!



What is this project, and who does it benefit?

This is a personal project by internet user @deepfates to make a "canon" of about 5,000 books, selected from all possible books using data available on the internet.

Though I work at a bookstore, this project is not a commercial endeavor. I want to explore the multidimensional space of all existing books: a latent space that connects both the interior concepts and the exterior context of all books, and which implies the existence of all possible books that might yet be imagined.

This is not a new concept, of course. Writers and readers have studied this space forever, and changed it by their observation. Scholars have developed a discourse about which authors should be propagated and a meta-discourse about how that decision should be made. Poets and novelists have gestured to it through metaphor, or explicated it through genre tropes.

Personally I was inspired by Terry Pratchett's L-space formulation:

Books, everywhere, affect all other books -- including books in the future, and books never written. Pratchett calls these invisible writings.

My goal is to explore L-space, looking for hidden information that distinguishes some books -- the Canon -- from others -- the Archive -- and to use that information to build a corpus of books across a broad cross-section of genres and topics. Once selected, these books can work as training data to predict canonicity in the future.

This will be my Invisible Canon. In these notes I will provide prose, code and data to justify and document my process.


Who are you to choose the Canon?

As Odysseus would say, I'm nobody! I'm a person living in the third decade of the twenty-first century. I have no college degree, no institutional backing, no special knowledge. I just love books.

I sell books for a living, at a small bookstore in the desert southwest USA. My store has room for about 5,000 curated books, since we must leave space for the chaotic flow of used books as well. I do have a little insider information: my inventory and sales data (I will show some peeks of it, but the data itself is not open-source). The marketplace provides some information about which books are desirable just by the number of times people will buy them.

I want to act as an arbiter between the implicit preferences of this market and the explicit preferences contained in sales data and book reviews around the globe, and choose books that my customers will buy, recommend and seek out. It's my duty, as a curator, to supply what people want to read. I don't think any of the available canons represent that, so I will make one.

What is a "canon", anyway?

It's a group of books selected from the totality of all books, with the intention to elevate and/or recognize their status. This is distinct from the "published", which includes all books ever made public in some sense, and the "archive", which is the subset of the published that has been preserved and is available to study.

Too, a canon is not simply a 'corpus'. A corpus is a portion of the archive selected for a specific research purpose; a canon is a portion that is meant to represent some societal value.

Here I must defer to the Stanford Literary Lab, specifically these papers in their Pamphlets series:

  • 8 Between Canon and Corpus pdf

  • 11 Canon/Archive pdf

  • 17 Popularity/Prestige pdf

In these articles, the digital humanities scholars of the Literary Lab have empirically measured the canon and archive (of literature) and found trends that correlate with their deep domain knowledge. I intend to follow recklessly in the direction they headed, and explore new terrain.

The thesis developed in these papers is, first, that the canon can be taken as a superset of all canons put forth by individuals or groups, and second, that you can project features of these books into a "space" that correlates with their canonicity.
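As a rough sketch of the first claim, here is how several canons could be pooled into one. The list names and titles are hypothetical placeholders, not real data, and counting how many lists mention each book is my own addition rather than something taken from the pamphlets.

```python
from collections import Counter

# Hypothetical canons put forth by different people or groups.
# The names and titles are placeholders for illustration only.
canons = {
    "list_one": {"Title A", "Title B", "Title C"},
    "list_two": {"Title A", "Title C", "Title D"},
    "list_three": {"Title A", "Title B"},
}

# The pooled canon: every book that appears on at least one list.
combined_canon = set().union(*canons.values())

# How many lists mention each book (an assumed, very crude signal).
mentions = Counter(title for titles in canons.values() for title in titles)

for title, count in mentions.most_common():
    print(f"{title}: on {count} of {len(canons)} lists")
```

Books that show up on many lists are the obvious candidates; the long tail of single mentions is where the pooled canon shades into the archive.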

The authors test several different features, but prefer a framing of Popularity vs Prestige: market success vs academic success, roughly. This diagram, from Pamphlet 17, shows the use:

[Figure: a scatter plot of genres, with popularity on the x axis and prestige on the y axis]

Popularity here is measured by the number of reviews on Goodreads, and prestige by the number of "Primary Subject Author" citations in the MLA database. To the right, something is more marketable, and upward, more well-regarded by academia. Up and to the right, therefore, is the direction of canon.
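To make "up and to the right" concrete, here is a minimal sketch that places a few books on those two axes and ranks them along the diagonal. The counts are invented for illustration, and both the log scaling and the simple sum in `canon_score` are my own assumptions, not the Literary Lab's method.

```python
import math

# Hypothetical counts for illustration only; these are not real
# Goodreads review counts or MLA citation counts.
books = {
    "Book A": {"reviews": 120_000, "citations": 900},
    "Book B": {"reviews": 250_000, "citations": 12},
    "Book C": {"reviews": 800, "citations": 450},
    "Book D": {"reviews": 300, "citations": 3},
}

def log_scale(count):
    """Compress raw counts so a few huge values don't dominate (assumed choice)."""
    return math.log10(count + 1)

def canon_score(reviews, citations):
    """Position along the up-and-right diagonal: popularity plus prestige."""
    return log_scale(reviews) + log_scale(citations)

# Rank books from most to least "up and to the right".
ranked = sorted(books, key=lambda title: canon_score(**books[title]), reverse=True)

for title in ranked:
    x = log_scale(books[title]["reviews"])
    y = log_scale(books[title]["citations"])
    print(f"{title}: popularity={x:.2f}, prestige={y:.2f}")
```

A plain sum treats the two axes as equally important. Weighting one over the other would tilt the diagonal toward the market or toward academia, which is exactly the kind of choice a curator has to make.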

This tracks with my own anecdotal evidence: the four most popular categories, Fiction, Sci-Fi, Kids, and Mystery/Thriller, are among the bestselling categories in my store.

This brings up a good point: my audience is specific, and so my curation must be as well.
