I'm building a canon!
This is a personal project by internet user @deepfates to make a "canon" of about 5,000 books, selected from all possible books using data available on the internet.
Though I work at a bookstore, this project is not a commercial endeavor. I want to explore the multidimensional space of all existing books: a latent space that connects both the interior concepts and the exterior context of all books, and which implies the existence of all possible books that might yet be imagined.
This is not a new concept, of course. Writers and readers have studied this space forever, and changed it by their observation. Scholars have developed a discourse about which authors should be propagated and a meta-discourse about how that decision should be made. Poets and novelists have gestured to it through metaphor, or explicated it through genre tropes.
Personally I was inspired by Terry Pratchett's L-space formulation:
"The study of invisible writings was a new discipline made available by the discovery of the bi-directional nature of Library-Space. The thaumic mathematics are complex, but boil down to the fact that all books, everywhere, affect all other books."
Tweeted by tech wiz (@deepfates), June 7, 2021
Books, everywhere, affect all other books -- including books in the future, and books never written. Pratchett calls these invisible writings.
My goal is to explore L-space, looking for hidden information that distinguishes some books -- the Canon -- from others -- the Archive -- and use that to make a corpus of books across a broad cross-section of genres and topics. Once selected, these books can work as training data to predict canonicity in the future.
This will be my Invisible Canon. In these notes I will provide prose, code and data to justify and document my process.
As Odysseus would say, I'm nobody! I'm a person living in the third decade of the twenty-first century. I have no college degree, no institutional backing, no special knowledge. I just love books.
I sell books for a living, at a small bookstore in the desert southwest USA. My store has room for about 5,000 curated books, since we must leave space for the chaotic flow of used books as well. I do have a little insider information: my inventory and sales data (I will show some peeks of it, but the data itself is not open-source). The marketplace provides some information about which books are desirable just by the number of times people buy them.
I want to act as an arbiter between the implicit preferences of this market and the explicit preferences contained in sales data and book reviews around the globe, and choose books that my customers will buy, recommend and seek out. It's my duty, as a curator, to supply what people want to read. I don't think any of the available canons represent that, so I will make one.
A canon is a group of books selected from the totality of all books, with the intention to elevate and/or recognize their status. This is distinct from the "published", which includes all books ever made public in some sense, and the "archive", which is the subset of the published that has been preserved and is available to study.
A canon is also not simply a "corpus". A corpus is a portion of the archive selected for a specific research purpose; a canon is a portion that is meant to represent some societal value.
Here I must defer to the Stanford Literary Lab, specifically these papers in their Pamphlets series:
In these articles, the digital humanities scholars of the Literary Lab have empirically measured the canon and archive (of literature) and found trends that correlate with their deep domain knowledge. I intend to follow recklessly in the direction they headed, and explore new terrain.
The thesis developed in these papers is, first, that the canon can be taken as a superset of all canons put forth by individuals or groups, and second, that you can project features of these books into a "space" that correlates with their canonicity.
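The first half of that thesis is easy to sketch in Python. This is only an illustration under my own assumptions: the list names and titles are placeholders, not data from the Pamphlets, and `merge_canons` is a hypothetical helper. It merges several canon lists into one set and counts how many lists name each title.

```python
from collections import Counter

# Hypothetical canon lists; names and titles are placeholders, not real data.
canon_lists = {
    "list_a": {"Book A", "Book B", "Book C"},
    "list_b": {"Book A", "Book C", "Book D"},
    "list_c": {"Book A", "Book D"},
}

def merge_canons(lists):
    """Return the union of all lists, plus how many lists name each title."""
    counts = Counter(title for titles in lists.values() for title in titles)
    return set(counts), counts

canon, votes = merge_canons(canon_lists)
```

The vote count is a crude extra signal: a title that shows up on every list is probably more central than one that shows up once.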
The authors test several different features, but prefer a framing of Popularity vs Prestige: market success vs academic success, roughly. This diagram, from Pamphlet 17, shows the use:
Popularity here is measured by number of reviews on Goodreads, and prestige by number of "Primary Subject Author" citations in the MLA database. To the right, a book is more marketable; upward, it is more well-regarded by academia. Up and to the right, therefore, is the direction of canon.
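Here is a minimal sketch of that two-axis framing in Python. The counts are made up for illustration, and the log scaling and the simple sum used as a score are my own assumptions, not the Literary Lab's method. The names `position` and `canon_score` are hypothetical.

```python
import math

# Hypothetical books with made-up counts, for illustration only.
books = [
    {"title": "Book A", "reviews": 500_000, "citations": 900},
    {"title": "Book B", "reviews": 800_000, "citations": 5},
    {"title": "Book C", "reviews": 300, "citations": 400},
    {"title": "Book D", "reviews": 50, "citations": 0},
]

def position(book):
    """Map a book to (popularity, prestige) on log axes; log1p handles zero counts."""
    return math.log1p(book["reviews"]), math.log1p(book["citations"])

def canon_score(book):
    """Assumed score: how far 'up and to the right' a book sits, as a plain sum."""
    popularity, prestige = position(book)
    return popularity + prestige

# Books furthest up and to the right come first.
ranked = sorted(books, key=canon_score, reverse=True)
```

Log axes keep a handful of runaway bestsellers from flattening everything else. A plain sum treats the two axes as interchangeable, which is a simplification I would expect to revisit.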
This tracks with my own anecdotal evidence: the four most popular categories, Fiction, Sci-Fi, Kids, and Mystery/Thriller, are among the bestselling categories in my store.
This brings up a good point: my audience is specific, and so my curation must be as well.