First Monday

Fending Off Automated Mass Electronic Mail: or, How to Distinguish Yourself from a Computer
by Tibor Beke

Is there a way for e-mail addresses to be openly available on the Net, if their owners so desire, while preventing the current practice of harvesting e-mail addresses by the thousands for unsolicited e-mail advertisements? I propose a solution, and examine the phenomenology of mass e-mail in a somewhat broader context.

Unsolicited commercial mass e-mail, nicknamed "spam", received scant attention in the media till the summer of 1997. The phenomenon in fact started with the appearance of business-related messages in unmoderated Usenet newsgroups (ca. 1990), usually cross-posted to a vast number of discussion groups having nothing to do with the goods offered. In the first stage of development, the reaction of netizens was essentially (the electronic equivalent of) lynching. The second stage witnessed the introduction of ad hoc software tools, filters and "cancellation daemons". They proved to be ineffective, since the definition of "unsolicited" is in the eye of the beholder: it cannot be defined in a programming language and blocked algorithmically, while it can be (and is) generated and distributed by automated routines. The third stage was prompted by the spread of intrusive direct e-mail; it saw lawsuits and (somewhat spurious) arguments invoking the First Amendment in the United States as well as the case of junk fax, outlawed in some places in 1991. The fourth stage is now; bills are being introduced in various states in the United States, as well as in the U. S. Congress, aimed specifically at unsolicited commercial e-mail [ 1 ].

Why is There an Issue? What is Spam?

Loud and ill-formatted e-mail advertisements (often hundreds of lines in length) make multiple appearances, every week, in hundreds of thousands of mailboxes. Their content varies from merely inane to borderline illegal; more often than not, they sport fraudulent return addresses, forged e-mail headers and empty promises of such incidents not being repeated. Joel McNamara, publisher of the informal journal Popular Cryptography, found the following spam keywords to be statistically significant:

friend, huge savings, act now, mlm, free!, works!, here's how, remove@, for more information, "remove" in the subject, order now, now revealed, chain letter, 100% guaranteed, money back, to order, Jesus Christ, send to, for detailed information, free copy, special offer, cost?, To be removed, absolutely free, 1-800, no risk, entrepreneur, added bonus, extra income, don't delay, send check, money-order, dear sir, under 18, 1-900, Visa

See http://www.abuse.net/ for an account of what a tiny ISP is receiving - hourly. Spam is remarkable for its total mass as well as its repetitiousness: during the month of June, 1997, a solicitation to download 30 million e-mail addresses for $149 was sent out to thousands of Internet users at least ten times. There appears to be a tie for the most common thread to these messages: promoting promotion on the Internet, and "make money fast" (followed by abundant messages on sex and weight loss programs). Why has this sort of advertising become prevalent in a medium where, surveys indicate, the levels of income and education are higher than the median? It is a question worth addressing (see postscript).

There are some who argue that the self-regulation of commercial e-mail marketers is ineffectual [ 2 ].

"Remove lists" are ineffective because the genie is already out of the bottle: many, many collections of e-mail addresses were sold, re-sold, and re-re-sold, together with software for future e-mail address extraction. There is no single central database that netizens can "opt out of". Conversely, many collections contain multiple addresses of the same individuals. In fact, this redundancy is one of the most pervasive and irksome features of spam.

This paper is chiefly concerned with mass e-mailing from the viewpoint of software engineering, rather than legislative cures. I doubt that a purely legal deterrent of spam is the ideal one, for the following reasons.

The same goes for the concept of "prior business contact" (as a prerequisite for the legality of e-mail solicitations) in this protean world of "Click here to sign up", "unsubscribe", "subscribe" and commercial services allowing user customization. What exactly is a prior contact? Each and every case where legal interpretation is required will present legal costs - to be borne by the public or by ISPs.

Neither chain mail fraud nor junk fax advertising ceased, despite decades of regulation of the former, and the costs and dangers associated with junk e-mail still remain substantially less than those of the other two. E-mail headers are easy to forge, and the motivation to spam remains constant: "COME VISIT OUR WEBSITE!!!!" - e-mail meets push technology. Spam is likely to be generated as long as it is, with however slight a margin, a worthwhile enterprise. Every day (to paraphrase Barnum) suckers are drawn into the Web of unsolicited mass e-mail, where "sucker" applies both to businesses that are novices to electronic commerce and to new potential recipients.

Still, observe that it is a constellation of economic and technological factors that makes unsolicited mass e-mail feasible.

  1. Sending e-mail is exceedingly cheap. Currently, it costs a fraction of a cent per item.

  2. Any user of the Internet has a unique ID - a finite sequence of letters, or bits - to serve as an address. Currently, e-mail addresses are the most widespread example; but a URL string, or a personal address and PGP key combination, falls into the same category.

  3. The full specification of the ID can be gathered in a fast, automated way without the owner's consent; more specifically, even if the owner's aim is only to disclose the address for personal messages.

In the case of e-mail, automated extraction can involve a simple-minded program that systematically goes through Usenet newsgroups, Web pages, archives of discussion lists, public user listings etc. seeking strings of the form "uuuu@xxxx.yyy.zz" appearing in specific locations. Many of those strings will be invalid addresses; but these can be weeded out (algorithmically) following the first bulk mailing, since invalid addresses generate automated error messages of a specific form.
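How little machinery such extraction needs can be seen in a minimal sketch. The pattern below is a deliberately loose assumption, not the syntax of any real harvesting tool, and the names `harvest` and `weed_out` are hypothetical; the sample text reuses the placeholder address from this paper and a Message-Id string that merely looks like an address.

```python
import re

# Deliberately loose pattern for strings shaped like "user@host.domain".
# Real address syntax is more permissive; this is only an illustration.
ADDRESS = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def harvest(text):
    """Return the distinct address-like strings in text, in order of appearance."""
    found = []
    for candidate in ADDRESS.findall(text):
        if candidate not in found:
            found.append(candidate)
    return found


def weed_out(candidates, bounced):
    """Drop every candidate whose first mailing came back with an error message."""
    return [c for c in candidates if c not in bounced]


page = (
    "Comments welcome at userid@host.domain.id.\n"
    "Message-Id: <199712160023.TAA26999@transcom.capcon.net>\n"
)
candidates = harvest(page)
valid = weed_out(candidates, bounced={"199712160023.TAA26999@transcom.capcon.net"})
```

Both strings are picked up on the first pass; the one that is not a mailbox disappears once its bounce has been recorded. Nothing in the sketch needs to understand the text around the address.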

Can one of these three factors be eliminated?

The first observation on the economics of e-mail is, if anything, a blessing not to be fought.

The second observation on the nature of addresses will not change; it is the very essence of the digital realm. The relation between the user and his/her digital ID currently is well-nigh arbitrary. This could plausibly change; for example, a bit string derived from personal characteristics - such as patterns in the retina - would yield a unique correspondence. This does not enter the argument, though.

To some extent, the third observation is a blessing too; and its automated misuse can be outwitted.

Suppose I am a student of mathematics eager to receive feedback on my articles posted on the Internet; I wish to make myself available for human, but not for automated contact. I would post a paragraph on my home page such as

"My email-address is userid@host.domain.id. Take the name of the continent where I live (it has seven letters) and reverse it. Now take every other letter, starting with the first occurrence of 'c'. That's my personal access code."

Or I may include this as my "signature" when submitting to public discussion lists.

Dropping into my mailbox uninvited will then take some moments more than it otherwise would - but that seems a small entrance fee for uninvited guests. I may as well include my reply code on business cards or on certain e-mail I send out; and of course, my e-mail address book can store access codes of my digitally corresponding colleagues and friends.

Automated extraction will spot "userid@host.domain.id". It may just as well spot the string "personal access code" or the paragraph where it occurs. To parse, interpret and execute the instructions of the paragraph in an automated way is, however, totally hopeless at present. Save for toy examples involving limited vocabulary, limited contexts and limited grammar, there is not even a working prototype of such an algorithm.
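The asymmetry can be made concrete with a small sketch. Once a human reader has understood the recipe and supplied the one fact it asks for, the remaining steps are mechanical. The function name `access_code` is hypothetical, and "America" is only the reader's inference from the seven-letter hint, not something the sample paragraph states.

```python
def access_code(continent):
    """Apply the recipe: reverse the name, then take every other letter from the first 'c'."""
    reversed_name = continent.lower()[::-1]
    start = reversed_name.index("c")  # raises ValueError if the name has no 'c'
    return reversed_name[start::2]


# Assumption: the reader infers "America" from the seven-letter hint.
code = access_code("America")
```

Under that assumption the code comes out as "crm". The hard part, which this sketch does not attempt, is getting from the English paragraph to these three lines without a human in the loop.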

It is quite easy to identify (algorithmically, that is) the words above as belonging to the vocabulary of English. Decomposing the displayed text into constituent noun and verb phrases is much harder. For starters, the whole syntax of spoken English is not understood in a purely formal fashion by computational linguists. The example was quite simplistic, of course, but any humanly meaningful statement will do. Finally, the recipe can make allusions to current events, facts of nature and society limited only by human imagination. It will not be algorithmically decodable short of constructing an "electronic homunculus" that matches the (constantly changing!) cognitive world of humans. There is, at present, not even a scientific guess of when that becomes possible. If a mass e-mail marketer should achieve it, more glory to mass e-mail marketing.

The comparison with encryption is amusing and ironic. Schemes such as RSA depend on the fact that certain arithmetic operations take unfeasibly long in the absence of a piece of information (the private key) but become routine in its possession. The reliability of RSA rests on the non-existence of (in a precise sense) fast algorithms to decompose an integer into prime factors. Whether such algorithms exist is a purely mathematical question; it is unsolved.

The scheme suggested above for getting the personal attention of an individual on the Internet relies on concocting a task that is meaningful to an English-speaking adult, but tremendously hard to implement algorithmically. Its reliability winds up being equivalent, essentially, to the practical existence or non-existence of a machine that passes one half of the Turing test; of an algorithm that is - in a purely operational sense of parsing and interpreting text - indistinguishable from a human.

Whether that is possible at all is not without its philosophical overtones. Progress in this direction of artificial intelligence has stalled since the mid-sixties. Recent achievements in computer chess exploit hardware; the actual software incorporates no cognitive modeling. Automated search, indexing and "data mining" engines, as well as what are sometimes called "intelligent agents", rely on various forms of pattern matching and building statistical correlation tables. Neither contributes to the logicist-structuralist side of AI.

Why can't a bulk e-mailer figure out my personal code just as easily as my would-be colleague? It can, but doing so is slow and expensive: it involves human action. Let us take 120,000 addresses (a smallish denomination today) and assume one human minute spent between looking at an address "userid@host.domain.id" and finding the corresponding personal access code (say, by searching the main Web page of domain.id first, then finding the personal Web page - or using any yellow pages service - and following the personal instructions therein). Building such a list would take two thousand man-hours, making it a considerably valuable asset. It would not be "hogged", bartered or lent lightly, making law enforcement actually possible (compare with the 30-million-addresses-for-$149 deal!).
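The arithmetic behind this estimate can be checked in a few lines. The address count, the one-minute figure and the $149 offer come from the text; the hourly wage is a purely hypothetical number, included only to put the two per-address costs side by side.

```python
ADDRESSES = 120_000          # list size used in the text
MINUTES_PER_ADDRESS = 1      # human time per access code, as assumed in the text
HOURLY_WAGE = 10.00          # hypothetical wage in dollars, for illustration only

person_hours = ADDRESSES * MINUTES_PER_ADDRESS / 60
labor_cost = person_hours * HOURLY_WAGE

# Per-address cost of a hand-built list versus the bulk offer quoted earlier.
cost_per_address_manual = labor_cost / ADDRESSES
cost_per_address_harvested = 149 / 30_000_000
```

Here `person_hours` is 2,000, matching the figure above. Even at a modest assumed wage, a hand-collected address costs several orders of magnitude more than one from the 30-million-address offer.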

And changing personal access codes is easy. Suppose I am bothered by intrusions into my personal mailbox; I can just think up a new rule for getting my access code. Anyone using the old one will get a polite, automated apology in response, pointing out the new recipe. For individuals I am in extended correspondence with, the process can be automated: send a special command-message instructing the remote mailbox to replace instances of my old access code with the new. Obviously, authentication is an issue here - but not an unsolvable one, thanks to public key cryptosystems.
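A minimal sketch of such a code change, under stated assumptions: the class `Mailbox`, its method names and the return strings are all hypothetical, and the authenticated update of correspondents' address books is left out entirely.

```python
class Mailbox:
    """Toy mailbox that remembers retired access codes."""

    def __init__(self, code, recipe):
        self.code = code          # current personal access code
        self.recipe = recipe      # human-readable rule for deriving it
        self.retired = set()      # codes that used to be valid

    def rotate(self, new_code, new_recipe):
        """Retire the current code and switch to a new one."""
        self.retired.add(self.code)
        self.code, self.recipe = new_code, new_recipe

    def receive(self, code):
        """Classify an incoming message by the access code it carries."""
        if code == self.code:
            return "deliver"
        if code in self.retired:
            # The sender once knew a valid code: apologize and state the new recipe.
            return "apology: my access code has changed. " + self.recipe
        return "bulk"


box = Mailbox("crm", "Reverse the name of my continent; take every other letter from 'c'.")
box.rotate("xyzzy", "See the last line of my home page.")
```

After the rotation, a message carrying the old code gets the apology with the new recipe, while a message carrying no recognizable code is treated as bulk.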

The scheme proposed, therefore, adds an identifier to the standard e-mail header, called the access code herein, one value of which the user can set to define the personal mailbox. Only letters bearing the personal access code will enter the personal mailbox. By contrast, some current legislative proposals require the advertisers to identify their products by a specific string. If they fail to do so, spam will get mixed in as it does today.

If I wish to keep my mailbox open for multiple copies of invitations to send $5 to various post boxes for different kinds of products or services, I may just enable the default access_code == junk (or "bulk") option on my e-mail reader, or have bulk mail deposited into a specified folder. Or, I may opt for an ISP that refuses to accept junk mail as a matter of policy.

Often, I wish to authorize remote agents - such as mailing lists, or specific promotion agencies - to send me e-mail. The solution is that every such list comes with its own access identifier: "chat_about_cat_food_at_u_middletown" or "last_min_deals_from_delta" that, upon subscribing to the list, I instruct my mail reader to accept - or to accept and deposit into a specific folder, and so on. The ultimate ideal is that e-mail with a missing access code will not get processed on the Internet. To accommodate the current version of mail headers, however, a default mechanism has to be in effect for a while.
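The routing rules of the last few paragraphs might be sketched as follows. The header name `X-Access-Code`, the folder names and the function `route` are assumptions made for illustration; the paper does not fix a header syntax. Treating a missing code as bulk is one reading of the transitional default.

```python
from email.message import EmailMessage

PERSONAL_CODE = "crm"  # hypothetical personal access code
LIST_FOLDERS = {       # access identifiers of lists the user subscribed to
    "chat_about_cat_food_at_u_middletown": "lists/cat-food",
    "last_min_deals_from_delta": "lists/deals",
}


def route(msg, accept_bulk=False):
    """Pick a destination for msg based on its (hypothetical) X-Access-Code header."""
    code = str(msg.get("X-Access-Code", "")).strip()
    if code == PERSONAL_CODE:
        return "inbox"
    if code in LIST_FOLDERS:
        return LIST_FOLDERS[code]
    # Explicit "junk"/"bulk", an unknown code, or no code at all:
    # treated as bulk during the transition period.
    return "bulk" if accept_bulk else "refuse"


def make_message(code=None):
    """Build an otherwise empty message carrying the given access code."""
    msg = EmailMessage()
    if code is not None:
        msg["X-Access-Code"] = code
    return msg
```

For example, `route(make_message("crm"))` yields the inbox, while a message without any code is refused unless the user has opted to keep a bulk folder with `accept_bulk=True`.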

All in all, the mandatory categorizing of e-mail - including the "personal code" as suggested above - combines four features: a convenience for handling one's own incoming mail; a system for prioritizing and routing traffic on the Internet; an economic imperative preventing the misuse of bulk e-mail, to back up and extend the (no doubt necessary) legal imperative; and a non-sensitive technological standard that works across national boundaries.

Postscript

The essence of the idea has been conveyed, but for those interested in finer details, let me attach six specific questions with answers.

Q1: When it comes to being subjected to marketing, aren't users at the mercy of their ISP anyway?

A1: Under the suggested scheme, the ability to devise a "personal access code" resides with the end user, and possibly on the end user's private workstation. In particular, it would be just as time-consuming for an ISP to collect access codes and to address the personal segment of its users' mailboxes as it is for a hypothetical bulk e-mail business from the outside. Insofar as access codes travel in the same packet of data as the message, or insofar as they reside in a storage area accessible by the user's ISP, they are subject to being "snooped" as well as to commercial (mis)use (e.g., reselling by the ISP to third parties). The former is an inevitable feature of all packet-switched communications; the latter belongs in the terms of contract between the user and the ISP. In this respect, it is worth mentioning that mass misuse of personal access by an ISP against its own users is likely to be noticed very quickly. AOL provides an ironic example of such a "bulletin board alliance" of users to double-check their service provider.

Q2: Can all this be implemented here and now?

A2: Suppose one has an e-mail account that (a) is not the automated recipient of any list servers and (b) can be configured in a flexible enough way (as most Unix shell accounts at universities can, though not all accounts offered by ISPs). One can then implement the scheme of this paper right away: a message not containing a specific string in (say) the subject line will trigger an automated response that explains the situation and points out (to the imagined human sender) what to do. Most spam comes with fictitious headers, not containing a single valid return address; replying to them triggers an error message, so provisions have to be made for avoiding mail loops. Unfortunately, in the absence of mailers that conveniently keep "personal access codes" together with e-mail aliases, this amounts to quite a rude gesture against fellow humans. A solution will not come before a re-creation of tools for mass communication on the Internet, which presently lends itself to abuse and suboptimality (such as proliferating error messages) far too easily. In this respect, see the succinct list at ftp://koobera.math.uic.edu/www/docs/mailabuse.html of the mathematician D. J. Bernstein (made rather more famous as plaintiff against the U. S. Commerce Department in a case of ITAR export restrictions on encryption software).
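A minimal sketch of such a here-and-now filter, with loop avoidance, is given below. The access string, the function names and the reply text are hypothetical. The `Precedence` and `X-Loop` headers are widespread autoresponder conventions rather than part of this proposal, and the bounce heuristics are deliberately crude.

```python
from email.message import EmailMessage

ACCESS_STRING = "[crm]"                # hypothetical personal access string
MY_ADDRESS = "userid@host.domain.id"   # placeholder address from the text


def looks_automated(msg):
    """Heuristics for mail that must never receive an automated reply."""
    sender = str(msg.get("From", "")).lower()
    precedence = str(msg.get("Precedence", "")).lower()
    return (
        not sender                                  # no return address at all
        or "mailer-daemon" in sender                # bounce from a mail system
        or "postmaster" in sender
        or precedence in ("bulk", "junk", "list")   # bulk or list traffic
        or msg.get("X-Loop") is not None            # already passed an autoresponder
    )


def handle(msg):
    """Return (action, reply); reply is None unless action is 'autoreply'."""
    subject = str(msg.get("Subject", ""))
    if ACCESS_STRING in subject:
        return "deliver", None
    if looks_automated(msg):
        return "discard", None
    reply = EmailMessage()
    reply["From"] = MY_ADDRESS
    reply["To"] = str(msg["From"])
    reply["Subject"] = "Re: " + subject
    reply["Precedence"] = "junk"   # asks other autoresponders to stay silent
    reply["X-Loop"] = MY_ADDRESS   # lets this filter recognize its own replies
    reply.set_content(
        "Your message was not delivered to my personal mailbox.\n"
        "Please follow the instructions on my home page and resend it\n"
        "with the access string in the subject line.\n"
    )
    return "autoreply", reply
```

If the explanatory reply ever comes back, for instance through a forged return address that points at this mailbox, its own `Precedence` and `X-Loop` headers cause it to be discarded rather than answered again.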

Q3: If so basic, why is this feature lacking in today's mailers?

A3: The functionality described in the body of this paper was entirely within the scope of software engineering even at the time of birth of the Internet: observe X-Priority tags; the ability for users to sort e-mail manually or by arbitrary keywords; human-reconstructible forgery of one's own return address on Usenet; or the fact that every list server, in effect, comes with a unique identifier on the Internet. Yet these features never coalesced into a coherent whole, and the need for individuals to block themselves from algorithms never arose. The basic design of the sendmail program dates from the 1970s; it was constructed mainly as a technical exercise by a like-minded community of programmers sheltered in university and governmental research environments. Witness the laughable ease of forging e-mail headers!

Q4: Can we thus eliminate at least some useless traffic on the wires?

A4: The valid issue of distribution of burden was skipped above. To wit, if a message marked "access_code == junk" travels all the way from the mass marketer to the end user, only to be refused by the user's mailbox, the cost and network strain associated with the message are borne by the intermediate stations as well. Note, however, that this need happen only once: the return message can trigger the elimination of the user's address from the mass marketer's database. (To better motivate the mass marketer's doing so, a fee can be imposed on transmitting refused junk mail.) If end users decide to accept such mail after all, they can sign up again directly with the marketer.
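The marketer's side of this exchange might look like the following sketch. The function name `apply_refusals`, the sample addresses and the fee amount are hypothetical; the paper specifies neither a refusal format nor a fee schedule.

```python
def apply_refusals(mailing_list, refused, fee_per_refusal=0.0):
    """Remove refusing addresses from the list and total the fee they incurred.

    mailing_list: addresses the marketer mailed, in order
    refused: set of addresses whose mailboxes returned a refusal
    fee_per_refusal: hypothetical charge for each refused delivery
    """
    remaining = [a for a in mailing_list if a not in refused]
    refusals_on_list = len(mailing_list) - len(remaining)
    return remaining, fee_per_refusal * refusals_on_list


# Hypothetical addresses and fee, for illustration only.
remaining, fee = apply_refusals(
    ["a@x.org", "b@y.org", "c@z.org"], refused={"b@y.org"}, fee_per_refusal=0.05
)
```

Because the refusing address is dropped after the first attempt, the intermediate stations carry the unwanted message only once.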

Q5: What's the one-sentence difference between this suggestion and the rest that is going on against spam?

A5: The option to demarcate a section of their input channel as "humans only - no autodialled calls, please" is given into the hands of users; the demarcation is enforced by a scientific problem that is very, very hard.

The alternatives offered today are: (a) the possibility for demarcation is given to system administrators (who may or may not install filtering software), enforced by how good filtering is (the case of viruses and anti-viral software comes to mind); (b) demarcation is up to the marketers, enforced by fines, law and lawyers; and (c) (today's default, and also the content of the Tauzin bill in the U. S. Congress) demarcation is up to the marketers, enforced by their goodwill.

Issues of commerce and "regulating the Net" are often mixed into the debate. Note that they are in fact red herrings. The problem of spam is strictly that currently no distinction is made between unique (person-to-person) and bulk (automatic broadcast to undifferentiated mass recipients) communication, as far as the end recipient's mailbox is concerned. Put another way, pestering an individual user by automated means is orders of magnitude simpler than automatically avoiding being pestered. My point is that this imbalance must be addressed by means intrinsic to the Internet.

Q6: What's the big deal anyway? I haven't received spam in ... days.

A6: You will. Of the many uses the Internet can be put to, delivering direct entertainment (news, advertisement ...) promises to be one of the most profitable. There exists a long pre-history and refined methodology of this industry, originating mainly from the United States. There is a tremendous incentive for operating system and Web browser designers to come up with software that allows the user to become a passive consumer of information. It is wise to put in a low-level technical breakpoint to allow choice.

Seen another way, the means for communicating to 30 million people on this planet for a mere investment of a few thousand dollars did not exist until a few years ago. At 28.8 kbps, dispatching bulk e-mail takes considerable time, and one has to allow for the many addresses no longer valid. Still, improvements in hardware and broadening of bandwidth increase, if anything, the relative advantage of the spammer.

Spammer? Let me include here an e-mail message that I received, as thousands of others have, I believe, twice within the last month. I present it verbatim, with headers, for review purposes, without the intention of humiliating its author or infringing on his copyright.

-------begin-quote--------------------------------------------------

>From lavelle@potomac.net Mon Dec 15 19:26:29 1997
Received: from MIT.EDU (SOUTH-STATION-ANNEX.MIT.EDU [18.72.1.2])
        by math.mit.edu (8.8.7/8.8.7) with SMTP id TAA09031
        for ; Mon, 15 Dec 1997 19:26:28 -0500 (EST)
From: lavelle@potomac.net
Received: from transcom.capcon.net by MIT.EDU with SMTP id AA21450; Mon, 15
        Dec 97 19:26:28 EST
Received: from potomac.net (a2p10.capcon.net [209.100.0.121]) by
        transcom.capcon.net
        (8.8.8/8.8.7/1.26) with SMTP id TAA26999;
        Mon, 15 Dec 1997 19:23:51 -0500 (EST)
Date: Mon, 15 Dec 1997 19:23:51 -0500 (EST)
Message-Id: <199712160023.TAA26999@transcom.capcon.net>
To: lavelle@potomac.net
Subject: THE UNIVERSE
Status: RO

                                                THE UNIVERSE

E=MC2 . The equation for the atom bomb. It says that matter and energy
are the same thing. So then what is that? Matter, look at a brick. Its in
a three dimensional form. Its made of electrons, protons and neutrons
(atoms) and they are moving, so the brick is moving. Energy, sunlight. Its
in a three dimensional form. It comes to us from the sun, therefore it is
moving. 3D and moving Both matter and energy are 3D and moving. I
outproduce Einstein. We already know all matter has gravity. The bending
of light shows that energy has gravity also. So matter and energy are
3D moving with gravity. The universe is made of matter, energy, time and
space. That just stated is the matter and energy part. Time and space.
Take everything in the universe and stop it. Does time progress? No.
Therefore time is the motion and the understanding of all the motion is
the understanding of all of time. Space ends. Space does not go on
forever, it ends. Space is in a three dimensional form. Space moves but
does not have gravity. Space moves like this.   O   /\   +   \/   O

And that is the understanding of all of time.
O This is what was first, in the beginning.
/\ This is the old kings and queens.
+ This is democracy.
\/ This is socialism.
O This is when the Lord Jesus Christ returns.
And that is the understanding of the universe. Glory be to the Father the Son and the Holy Ghost. Revelation chapter 10 & 11; 15-19. It is very important the people receive this information. You may tell someone about this. Thank You Robert Lavelle

--------end-quote--------------------------------------------------

The lure that the Internet exerts on fringe groups (it is not so much political as cultural extremes that come to mind) and on fringe economic behavior (untargeted bulk mail, multi-level, pyramid and chain schemes, deceptive advertising) is disproportionately larger than that on organizations well-established in society and traditional media. The Internet is inexpensive, de facto anonymous, involves little to lose (for behavior that is already "marginalized") and yields a visibility that, while it does not equal, approaches that granted to the "mainstream" in the digital realm. Strategic, targeted advertisement results in little consumer anger, and in fact seeks to limit its own disruption. Untargeted advertisements on cheap universal channels are, ironically, only worthwhile for advertisers who aim at the least common denominator - causing the present uproar. That the ability to mass e-mail does not preclude seeming dementia is but another motivation for the mild technological improvement in Internet privacy I've suggested.

About the Author

Tibor Beke was born and raised in (and remains a citizen of) Hungary. He received a B.A. in mathematics, summa cum laude, from Princeton University in 1993 and an M.A. in mathematics from the Massachusetts Institute of Technology in 1995. He expects to receive his Ph.D. in mathematics from MIT in June, 1998. His research involves topos theory, which he describes as "somewhere on the intersection of algebraic topology, algebraic geometry and mathematical logic". Tibor is an occasional contributor to Hungarian weeklies on digital issues and was among the first Hungarian Internet activists. In addition, he was a contributor to the National Information Strategy for Hungary, a white paper on informatics commissioned by a conglomerate of IT, telecommunications, and computer firms in Hungary.
E-mail: tbeke@math.mit.edu

Notes

1. See http://www.cauce.org/why.html or the August 1997 issue of Wired for details and further pointers to the Smith, Murkowski, Torricelli and Tauzin bills in the U. S. Congress.

2. Both http://www.cauce.org/stories.html and http://spam.abuse.net/spam/ contain horror stories, some of which involve the cheeky reselling of e-mail addresses of those signed up not to receive unsolicited mail.



Copyright © 1998, First Monday