A bag but is language nothing of words

From Mondothèque

Revision as of 20:30, 14 December 2015 by Michael Murtaugh

In text indexing and other machine reading applications the term "bag of words" is frequently used to underscore how processing algorithms often represent text using a data structure (word histograms or weighted vectors) where the original order of the words in sentence form is stripped away. While "bag of words" might well serve as a cautionary reminder to programmers of the essential violence perpetrated on a text and a call to critically question the efficacy of methods based on subsequent transformations, the expression's use seems in practice more like a badge of pride or a schoolyard taunt that would go: Hey language: you're nothin' but a big BAG-OF-WORDS.
Michael Murtaugh

DRAFT DOCUMENT

Bag of words

In text indexing and other machine reading applications (such as Google's core business of search) the term "bag of words" is frequently used to underscore how processing algorithms often represent text using a reductive data structure where the original order of the words in sentence form is stripped away. In a recent blog post, Michael Erasmus explains the technique in the context of "tf-idf": http://michaelerasm.us/tf-idf-in-10-minutes/:

First, let's just define what I mean with document. For our purposes, a document can be thought of all the words in a piece of text, broken down by how frequently each word appears in the text.

Say for example, you had a very simple document such as this quote:

Just the fact that some geniuses were laughed at does not imply that all who are laughed at are geniuses. They laughed at Columbus, they laughed at Fulton, they laughed at the Wright brothers. But they also laughed at Bozo the Clown - Carl Sagan

This structure is also often referred to as a Bag of Words. Although we care about how many times a word appear in a document, we ignore the order in which words appear.
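As a minimal sketch of the structure Erasmus describes, the quote above can be reduced to a bag of words in a few lines of Python. The tokenization rule (lowercasing, keeping only runs of letters) and the function name `bag_of_words` are assumptions made for illustration; they are not taken from the blog post.

```python
import random
import re
from collections import Counter

QUOTE = (
    "Just the fact that some geniuses were laughed at does not imply that "
    "all who are laughed at are geniuses. They laughed at Columbus, they "
    "laughed at Fulton, they laughed at the Wright brothers. But they also "
    "laughed at Bozo the Clown"
)


def bag_of_words(text):
    """Return a word histogram: each word mapped to how often it appears."""
    # Assumed tokenization: lowercase, then keep runs of letters only.
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words)


bag = bag_of_words(QUOTE)

# Shuffle the words: the sentence is destroyed, the bag is unchanged.
shuffled = QUOTE.split()
random.shuffle(shuffled)
scrambled_bag = bag_of_words(" ".join(shuffled))

if __name__ == "__main__":
    for word, count in bag.most_common(4):
        print(word, count)
    print(bag == scrambled_bag)
```

Both "laughed" and "at" are counted six times, and the final comparison holds: to the bag, Sagan's sentence and any scrambled version of it are the same document.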

While "bag of words" might well serve as a cautionary reminder to programmers of the essential violence perpetrated on a text and a call to critically question the efficacy of methods based on subsequent transformations, the expression's use seems in practice more like a badge of pride or a schoolyard taunt that would go: Hey language: you're nothin' but a big BAG-OF-WORDS. In this way BOW celebrates the apparently perfunctory step of "breaking" a text into a purer form amenable to computation, of stripping language of its silly redundant repetitions and foolishly contrived stylistic phrasings to reveal its cleaner inner essence.

Liebers's Standard Telegraphic Code Book (1896)

After July, 1904, all combinations of letters that do not exceed ten will pass as one cipher word, provided that it is pronounceable, or that it is taken from the following languages: English, French, German, Dutch, Spanish, Portuguese or Latin -- International Telegraphic Conference, July 1903

How We Think: Digital Media and Contemporary Technogenesis

Katherine Hayles devotes a chapter, entitled "Technogenesis in Action: Telegraph Code Books and the Place of the Human", to telegraph code books in her 2012 book How We Think: Digital Media and Contemporary Technogenesis:

[...] my focus in this chapter is on the inscription technology that grew parasitically alongside the monopolistic pricing strategies of telegraph companies: telegraph code books. Constructed under the bywords “economy,” “secrecy,” and “simplicity,” telegraph code books matched phrases and words with code letters or numbers. The idea was to use a single code word instead of an entire phrase, thus saving money by serving as an information compression technology. Generally economy won out over secrecy, but in specialized cases, secrecy was also important.

On the shift to the "machine-centric":

The interaction between code and language shows a steady movement away from a human-centric view of code toward a machine-centric view, thus anticipating the development of full-fledged machine codes with the digital computer.

On the relation to a *universal language*:

Along with the invention of telegraphic codes comes a paradox that John Guillory has noted: code can be used both to clarify and occlude (2010b). Among the sedimented structures in the technological unconscious is the dream of a universal language. Uniting the world in networks of communication that flashed faster than ever before, telegraphy was particularly suited to the idea that intercultural communication could become almost effortless. In this utopian vision, the effects of continuous reciprocal causality expand to global proportions capable of radically transforming the conditions of human life. That these dreams were never realized seems, in retrospect, inevitable.

On the embodiment of receiving the codes:

Once learned and practiced routinely, however, sound receiving became as easy as listening to natural-language speech; one decoded automatically, going directly from sounds to word impressions. A woman who worked on Morse code receiving as part of the massive effort at Bletchley Park to decrypt German Enigma transmissions during World War II reported that after her intense experiences there, she heard Morse code everywhere—in traffic noise, bird songs, and other ambient sounds—with her mind automatically forming the words to which the sounds putatively corresponded. Although no scientific data exist on the changes sound receiving made in neural functioning, we may reasonably infer that it brought about long-lasting changes in brain activation patterns, as this anecdote suggests.

On the so-called "mutilations" of messages:

If bodily capacities enabled the “miraculous” feat of sound receiving, bodily limitations often disrupted and garbled messages. David Kahn (1967) reports that “a telegraph company’s records showed that fully half its errors stemmed from the loss of a dot in transmission, and another quarter by the insidious false spacing of signals” (839). (Kahn uses the conventional “dot” here, but telegraphers preferred “dit” rather than “dot” and “dah” rather than “dash,” because the sounds were more distinctive and because the “dit dah” combination more closely resembled the alternating patterns of the telegraph sounder.) Kahn’s point is illustrated in Charles Lewes’s “Freaks of the Telegraph” (1881), in which he complained of the many ways in which telegrams could go wrong. He pointed out, for example, that in Morse code bad (dah dit dit dit [b] dit dah [a] dah dit dit [d]) differs from dead (dah dit dit [d] dit [e] dit dah [a] dah dit dit [d]) only by a space between the d and e in dead (i.e., _. . . . _ _ . . versus _. . . . _ _. .). This could lead to such confounding transformations as “Mother was bad but now recovered” into “Mother was dead but now recovered.” Of course, in this case a telegraph operator (short of believing in zombies) would likely notice something was amiss and ask for confirmation of the message—or else attempt to correct it himself.
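Lewes's example can be checked with a short sketch. The four letter codes below are the ones spelled out in the passage (dit written as "." and dah as "-"); the function names and the modelling of "false spacing" as the loss of every letter gap are assumptions made for illustration.

```python
# Morse codes for the four letters in Lewes's example, as given in the passage.
MORSE = {"a": ".-", "b": "-...", "d": "-..", "e": "."}


def encode(word):
    """Encode a word letter by letter, with a space marking each letter gap."""
    return " ".join(MORSE[letter] for letter in word)


def drop_spacing(signal):
    """Model the 'false spacing' error by discarding every letter gap."""
    return signal.replace(" ", "")


bad = encode("bad")    # "-... .- -.."
dead = encode("dead")  # "-.. . .- -.."

if __name__ == "__main__":
    print(bad)
    print(dead)
    print(drop_spacing(bad) == drop_spacing(dead))
```

Once the gaps are removed both words collapse to the same run of dits and dahs, so the difference between "bad" and "dead" is carried entirely by the operator's timing.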

What telegraph code books do is remind us of the relation of language in general to economy. Whether these are economies of memory, attention, costs paid to a telecommunications company, or of computer processing time or storage space, encoding knowledge is a form of shorthand and always involves an interplay with what we then expect to perform or "get out" of the resulting encoding.
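The economy at stake in a code book can be sketched as a simple lookup table. The entries below are invented for illustration and are not taken from Lieber's or any other historical code book; the per-word count stands in for the pricing that Hayles describes, where a single code word replaces an entire phrase.

```python
# Hypothetical code book: these phrase/code pairs are invented for
# illustration and do not come from any historical code book.
CODE_BOOK = {
    "ship the goods at once": "ABACO",
    "payment has been received": "BELUR",
    "await further instructions": "CIDAM",
}
DECODE_BOOK = {code: phrase for phrase, code in CODE_BOOK.items()}


def encode(phrases):
    """Replace each whole phrase with its single code word."""
    return [CODE_BOOK[phrase] for phrase in phrases]


def decode(codes):
    """Look each code word up again; this only works with the same book."""
    return [DECODE_BOOK[code] for code in codes]


def billed_words(parts):
    """Count words the way a per-word tariff would."""
    return sum(len(part.split()) for part in parts)


message = ["payment has been received", "ship the goods at once"]
telegram = encode(message)

if __name__ == "__main__":
    print(telegram)
    print(billed_words(message), "words reduced to", billed_words(telegram))
    print(decode(telegram) == message)
```

The saving depends entirely on sender and receiver holding the same book: the shorthand is only as expressive as the phrases someone anticipated needing.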

Google Tap

Google Tap was one of Google's April Fools' "pranks" (fake product announcements are made each April 1st, reportedly the product of Google's famous "20%" time for "side" projects) [1]. It claimed to be developed by "Reed Morse", great-grandson of Samuel Morse, the developer of the telegraph.

What's notable about Google's (mock) interface of telegraphy is that, although they cite people's frustrations with modern devices having "too many buttons", they end up presenting an interface with two buttons when the telegraphic interface had just one, and was routinely operated "blind" while the operator read the message and perhaps made notations on paper. While made in jest, the misunderstanding is telling: the performative interface of the telegraph's single button is translated to one where the essential and initial form of the message is symbolic, consisting of the two binary symbols "dot" and "dash".

VODER

[Images: the VODER; schematic circuit of the VODER; VODER World's Fair pamphlet]

At the 1940 New York World's Fair, the VODER speaking machine system was demonstrated in sensational fashion. The system was developed by Homer Dudley, an engineer at AT&T Bell Labs.

https://www.youtube.com/watch?v=0rAyrmm7vv0

(video)

It sounds far more human than any text-to-speech system of today. Why? Because of the way it's performed. Rather than starting from written language broken into an approximate translation of phonetic fragments and then applying a slew of statistical and other techniques in an attempt to bring back some sense of the natural expression of a human voice, the VODER system merely offers its user a palette of sounds and leaves it to the operator to perform them.

She saw me

Who saw you?
She saw me

Whom did she see?
She saw me

Did she see you or hear you?
She saw me

Extracting Patterns and Relations from the World Wide Web

Sergey Brin, In Proceedings of the WebDB Workshop at EDBT 1998 http://www-db.stanford.edu/~sergey/extract.ps

The World Wide Web provides a vast source of information of almost all types, ranging from DNA databases to resumes to lists of favorite restaurants. However, this information is often scattered among many web servers and hosts, using many different formats. If these chunks of information could be extracted from the World Wide Web and integrated into a structured form, they would form an unprecedented source of information. It would include the largest international directory of people, the largest and most diverse databases of products, the greatest bibliography of academic works, and many other useful resources.

2.1 The Problem

Here we define our problem more formally: Let D be a large database of unstructured information such as the World Wide Web

Data mining pre-Google

A traditional algorithm could not compute the large itemsets in the lifetime of the universe. [...] Yet many data sets are difficult to mine because they have many frequently occurring items, complex relationships between the items, and a large number of items per basket. In this paper we experiment with word usage in documents on the World Wide Web (see Section 4.2 for details about this data set). This data set is fundamentally different from a supermarket data set. Each document has roughly 150 distinct words on average, as compared to roughly 10 items for cash register transactions. We restrict ourselves to a subset of about 24 million documents from the web. This set of documents contains over 14 million distinct words, with tens of thousands of them occurring above a reasonable support threshold. Very many sets of these words are highly correlated and occur often.[1]
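The "basket" vocabulary in this passage treats each document as a set of distinct words, which is a bag of words with the counts thrown away as well. The sketch below counts itemsets naively on a toy corpus invented for illustration; it is not the sampling method of the paper, and the function names and threshold are assumptions.

```python
from collections import Counter
from itertools import combinations

# Toy corpus, invented for illustration; each document is one "basket".
DOCUMENTS = [
    "the telegraph code book",
    "the code of the telegraph",
    "a bag of words",
    "the bag and the code book",
]


def baskets(documents):
    """Reduce each document to its set of distinct words."""
    return [set(doc.split()) for doc in documents]


def frequent_itemsets(documents, size, min_support):
    """Count in how many baskets each word set of the given size occurs,
    and keep only the sets that reach the support threshold."""
    support = Counter()
    for basket in baskets(documents):
        for itemset in combinations(sorted(basket), size):
            support[itemset] += 1
    return {s: n for s, n in support.items() if n >= min_support}


pairs = frequent_itemsets(DOCUMENTS, size=2, min_support=3)

if __name__ == "__main__":
    print(pairs)
```

Here only the pair ("code", "the") appears in three of the four documents. The naive count stops being feasible quickly: a single document with 150 distinct words already contributes 11,175 candidate pairs and 551,300 candidate triples.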

Raw data

Tim Berners-Lee and the urge to "liberate your documents"

So, we're at the stage now where we have to do this -- the people who think it's a great idea. And all the people -- and I think there's a lot of people at TED who do things because -- even though there's not an immediate return on the investment because it will only really pay off when everybody else has done it -- they'll do it because they're the sort of person who just does things which would be good if everybody else did them. OK, so it's called linked data. I want you to make it. I want you to demand it. [2]

Notes

Parallel shifts: in telegraphy and telephony, a shift occurs from language as something performed by a human body to language captured in code and occurring at a machine scale. In document processing, a similar shift occurs from language as writing to language as symbolic sets of information, to be treated with statistical methods for extracting knowledge in the form of relationships.

The interest in the "machinic" (minimal human intervention) involves, at first glance, the "machinic" in the traditional sense of automating labour: replacing the human work of categorizing with an automated process, and in this way opening up the process to a larger quantity of pages and a range of "esoteric" topics which could not be handled with traditional editorial processes. This "machinic" shift is a business model that learns to extract the value of web surfers' behaviour; this process is then echoed in Google's book digitization, which similarly "extracts" / exploits the value of the collection librarian (on top of the work of the author, the typesetter, the publisher).

The computer scientist's view of textual content as "unstructured", be it in a webpage or the pages of a scanned text, underscores / reflects a neglect of the processes and labor of writing, editing, design, layout, typesetting, and eventually publishing, collecting and cataloging. (cf here [3]?)

In other words, by "unstructured" it is meant: unstructured in relation to the machine -- that is, not explicitly structured in a format directly amenable to use by automated means. "Structuring" then is a process by which structure is made explicit through the use of standards of markup (such as HTML/XML). In this way, the computer scientist is viewing a text through the eyes of their reading algorithm, and in the process (voluntarily) blinding themselves to the work practices which have produced, and maintain, the given textual resources, choosing to view them as instead somehow "freely given" and available to exploit as a "raw material".
  1. Dynamic Data Mining: Exploring Large Rule Spaces by Sampling; Sergey Brin and Lawrence Page, 1998; p. 2 http://ilpubs.stanford.edu:8090/424/
  2. Tim Berners-Lee: The next web, TED Talk, February 2009 http://www.ted.com/talks/tim_berners_lee_on_the_next_web/transcript?language=en
  3. http://informationobservatory.info/2015/10/27/google-books-fair-use-or-anti-democratic-preemption/#more-279