A Book of the Web
Is there any crucial difference between publishing a text in print and on-line, besides reaching a different group of people and allowing the text a different lifespan? In both cases, the text has a chance to be considered worth preserving and to end up in all sorts of libraries. The on-line environment has created its own hybrid form between the text and the library, and this hybrid is key to understanding how digital text produces difference.
Historically, we have treated texts as discrete units, distinguished by material properties such as cover, binding, and script. These characteristics establish them as a book, a magazine, a diary, sheet music, and so on. One book differs from another book, books differ from magazines, printed matter differs from handwritten manuscripts. Each volume is a self-contained whole, further distinguished by descriptors such as title, author, date, publisher, and the classification codes that allow it to be located and referred to. The demarcation of a publication as a container of text works as a frame or boundary that organises the way it can be located and read. Researching a particular subject, the reader is carried along the classification schemes under which volumes are organised, along references inside texts that point to yet other volumes, and along the tables of contents and subject indexes appended to texts that point to places within the volume itself.
So while their material properties separate texts into distinct objects, bibliographic information provides each object with a unique identifier, a unique address in the world of print culture. Such identifiable objects are further replicated and distributed across containers that we call libraries, where they are to be accessed.
The on-line environment, however, intervenes in this condition. It establishes shortcuts. Through search engines, digital texts can be queried for any text sequence, regardless of their distinct materiality and bibliographic specificity. This changes the way texts function as a library, and the way the library's main object, the book, is to be rethought.
(1) Rather than operating as distinct entities, multiple texts are simultaneously accessible through full-text search, as if they were one long text with its portions spread across the web, including texts that had never been considered candidates for library collections.
(2) The unique identifier at hand for these text portions is not the bibliographic information, but the URL.
(3) The text extends as far as the web-crawlers of a given search engine are set to reach, refashioning the library into a store of indexed data (see the sketch below).
These are some of the lines along which on-line texts appear to produce difference. The first contrasts the distinct printed publication to the machine-readable text, the second the bibliographic information to the URL, and the third the library to the search engine.
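To make the third line concrete, here is a minimal sketch in Python of how a crawler's configuration, rather than any bibliographic boundary, sets the extent of the "book". The seed URL, allowed domains, and crawl depth are hypothetical values chosen for illustration; they stand in for the agenda of a crawler's maintainer, not for any real engine's settings.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collects href attributes from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)

def crawl(seeds, allowed_domains, max_depth):
    """The corpus is whatever these settings let the crawler reach:
    seeds, allowed_domains and max_depth encode the maintainer's
    agenda, not any property of the texts themselves."""
    seen = set(seeds)
    frontier = deque((url, 0) for url in seeds)
    pages = {}  # URL -> raw text: the "pages" of the book of the web
    while frontier:
        url, depth = frontier.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except (OSError, ValueError):
            continue  # unreachable pages simply fall outside the "book"
        pages[url] = html
        if depth >= max_depth:
            continue
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            nxt = urljoin(url, href)
            if urlparse(nxt).netloc in allowed_domains and nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return pages

# Hypothetical configuration: a different seed, domain list or depth
# would produce a different "book" from the same web.
# corpus = crawl(["https://example.org/"], {"example.org"}, max_depth=2)
```

Two maintainers with two configurations thus produce two different "books" from the same web.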
The introduction of full-text search has created an environment in which all machine-readable on-line documents within reach are effectively treated as one single document. For any text sequence to be locatable, it doesn't matter in which file format it appears, nor whether its interface is a database-powered website or a mere directory listing. As long as text can be extracted from a document, the document is a container of text sequences and is itself a sequence in a "book" of the web.
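A minimal sketch of that environment, continuing in Python with a hypothetical corpus keyed by URL (the same shape the crawl sketch above returns): once plain text has been extracted from whatever format it arrived in, a single inverted index answers queries across all documents at once, as if they were one long text.

```python
import re
from collections import defaultdict

def build_index(pages):
    """Map each word to the set of URLs whose extracted text contains it.
    The file format no longer matters: by this point everything is
    plain text keyed by URL."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(url)
    return index

def search(index, query):
    """Return the URLs containing every word of the query: one lookup
    across the whole corpus, as if it were a single document."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return set()
    results = index[words[0]].copy()
    for word in words[1:]:
        results &= index[word]
    return results

# Hypothetical corpus: a book chapter, a forum comment and a sensor log
# end up in the same "book", each page addressed by a URL.
pages = {
    "https://example.org/chapter-1.html": "The library organises volumes...",
    "https://example.org/forum/post/42": "has anyone read that library essay?",
    "https://example.org/sensor/log.txt": "2015-06-16 21:04 temperature 23.1",
}
index = build_index(pages)
print(search(index, "library"))  # both the chapter and the forum comment
```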
Even though this is hardly news after almost two decades of Google Search's rule, little seems to have changed with respect to the forms and genres of writing. Loyal to the standard forms of publishing, most writing still adheres to the principle of coherence, based on units such as book chapters, journal papers, and newspaper articles that are designed to be read from beginning to end.
Still, the scope of textual forms appearing in search results, and thus the corpus of texts into which they are brought, is radically diversified: it may include discussion board comments, product reviews, private e-mails, weather information, spam, and other kinds of content that used to be omitted from library collections. Rather than being published in the traditional sense, these texts are produced onto digital networks by mere typing, copying, or OCR-ing, or are generated by machines and by sensors tracking movement, temperature, and so on.
Even though portions of these texts may come with human or nonhuman authors attached, authors have relatively little control over the discourses their writing gets embedded in. This is also where the ambiguity of copyright manifests itself. Crawling bots pre-read the internet, with all its attached devices, according to the agendas of their maintainers, and the decisions about which indexed texts are served in search results, how, and to whom, are made in the code of a library.
Libraries in this sense are not restricted to digitised versions of physical public or private libraries as we know them from history. Commercial search engines, intelligence agencies, and virtually all forms of on-line text collections can be thought of as libraries.
Acquisition policies figure here on the same level as crawling bots, dragnet/surveillance algorithms, and the arbitrary motivations of users, all of which actuate the selection and embedding of texts into structures that regulate their retrievability and, through access control, produce certain kinds of communities or groups of readers. The author's intention of partaking in this or that discourse is confronted by the discourse-conditioning operations of retrieval algorithms. Hence, Google structures discourse through its Google Search differently from how the Internet Archive does with its Wayback Machine, and from how GCHQ does with its dragnet programme.
They are all libraries, each containing a single "book" whose pages are URLs with timestamps and geostamps in the form of IP addresses. Google, GCHQ, JSTOR, Elsevier – each maintains its own searchable corpus of texts. The decisions about who is to be admitted, to which sections, and under which conditions are informed by a mix of copyright laws, corporate agendas, management hierarchies, and national security issues. The particular set of such conditions at work in a given library also redefines the notion of publishing and of the publication, and in turn the notion of the public.
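As a minimal sketch of how such admission conditions might sit in a library's code, continuing in Python: the section names, reader attributes, and URLs below are hypothetical, but they illustrate how one and the same corpus yields a different set of readable pages, and so a different public, depending on who is asking.

```python
# Hypothetical admission rules: each section of the corpus is guarded
# by a predicate over the reader's attributes. The corpus is one; what
# is "published" differs per reader.
CONDITIONS = {
    "open-access": lambda reader: True,
    "subscription": lambda reader: "paid-licence" in reader["licences"],
    "classified": lambda reader: reader.get("clearance") == "top-secret",
}

corpus = {
    "https://example.org/paper-1": "open-access",
    "https://example.org/paper-2": "subscription",
    "https://example.org/intercepts/7": "classified",
}

def admitted_pages(reader):
    """The 'public' of this library is whoever these predicates admit."""
    return [url for url, section in corpus.items()
            if CONDITIONS[section](reader)]

student = {"licences": []}
university = {"licences": ["paid-licence"]}
analyst = {"licences": [], "clearance": "top-secret"}

print(admitted_pages(student))     # open-access only
print(admitted_pages(university))  # open-access + subscription
print(admitted_pages(analyst))     # open-access + classified
```

Changing a single predicate redraws the boundary of the public without touching a single text.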
Corporate journal repositories exploit publicly funded research by renting it only to the libraries that can afford it; intelligence agencies are set to extract texts from any moving target, basically any networked device, supposedly in the public interest yet away from the public eye; publicly funded libraries are prevented by outdated copyright laws and bureaucracy from providing digitised content on-line; search engines create a sense of giving access to the entire public record on-line, while only a few know what is excluded and how search results are ordered.
It is within and against this milieu that libraries such as the Internet Archive, Wikileaks, Aaaaarg, UbuWeb, Monoskop, Memory of the World, Nettime, TheNextLayer and others gain their political agency. The counter-techniques available to them for negotiating the publics of publishing include self-archiving, open access, book liberation, leaking, whistle-blowing, open source search algorithms, and so on.
Digitising texts and posting them on-line are interventions in the procedures that make search possible. Operating on-line collections of texts is as much the work of organising texts within libraries as of placing them within the books of the web.
Dušan Barok
Originally written 15-16 June 2015 in Prague, Brno and Vienna for a talk given at the Technopolitics seminar in Vienna on 16 June 2015. Revised 29 December 2015 in Bergen. I want to thank Femke Snelting for helpful comments and editing.