Patterns of online behaviour in the United Kingdom and Japan: Insights based on asynchronous online conversations
First Monday




Abstract
This comparative study examines aspects of online behaviour exhibited by participants in online discussion groups in the United Kingdom and Japan. The primary data consists of message board threads gathered from U.K. and Japanese sites with the analysis focusing on hyperlinks contained in the forum messages as well as dates and times of posting extracted from the message heads. A ‘reading’ of hyperlinks is undertaken through consulting N–gram frequencies obtained from each data set, juxtaposed and compared with the help of a coefficient of difference and the chi–square test. Contrasts in Internet surfing patterns, information–gathering preferences, and references to video, audio, pictorial and sexual content are examined; post times are used to compare daily and weekly patterns of posting activity between the two countries. This study also provides an overview of the uses that N–grams have in natural language processing and argues for their analytical potential in sociolinguistic and CMC–related research.

Contents

1. Introduction
2. Methodological considerations
3. The data: Sources and processing
4. Results and discussion
5. Conclusions and tasks for future research

 


 

1. Introduction

Internet forums, or message boards, are a comparatively early form of computer–mediated communication (CMC). More recent forms such as blogs or social networks like Facebook and Twitter have spurred a broad and exciting wave of CMC research [1], but asynchronous multiparty conversation in the shape of message board threads remains a popular medium for online discussion around the world. The data that this communicative milieu potentially has to offer retains its scholarly value for a wide range of disciplines such as discourse analysis, language research or marketing and political research.

However, even as the literature on computer–mediated conversation has steadily grown, as late as 2007 there was still no publicly available corpus of Internet forum interactions [2], although “various studies have investigated forum language and some have constructed corpora for this purpose” [3], which is the case in the present study as well. In response to this lack, Claridge (2007) suggests building a corpus of message board texts while observing basic principles of internationality (inclusion of forums from different English–speaking countries), interactiveness (inclusion of longer conversations as the basic text units in order to make it possible to study interaction) and broad topicality (covering a broad range of conversational topics including those with potential for controversy–laden debate).

Creating large–scale country–specific or international corpora of message board conversations, including multilingual parallel corpora, is certainly an inviting prospect. Numerous topics on Internet forums are shared by many countries and languages (dating, love, war, suicide, practical advice, etc.), and this fact can be exploited for inter–linguistic and inter–cultural investigations. Also, where comparative research across online communities or languages is considered, a very important methodological point needs to be appreciated: whereas word or phrasal frequencies will normally be produced per number of words in traditional corpus studies, item frequencies can be normalised using number of posts as a common denominator when the data comes from Internet forums. The posted message in written electronic conversation can be regarded as roughly equivalent to a speech–turn in normal conversation, but with the added analytical advantage of being strictly discrete (even in quasi–synchronous implementations like IRC), i.e., there is no possibility of overlapping, interruptions, backchannels and other continuity–related problems.

From this comparative foundation, the present study looks into aspects of online behaviour of message board users in Japan and the U.K. by focusing on two particular kinds of meta–linguistic data extracted from discussion threads — hyperlinks and post–times — not least because they allow for relatively straightforward international comparisons which are largely independent of the language used.

The language of hyperlinks is more or less common around the world. Internet domain conventions and file extensions are universal, and English words appear frequently in the URLs of Web pages in other languages. Even if we encounter semantic content in a Web page’s address which is not in English, it can easily be detected using the N–gram processing techniques outlined in Section 3 of this paper and incorporated in the analysis. All this makes exchanged hyperlinks a very useful and convenient facet of online conversation to focus on when comparing multilingual data, and this paper attempts to show that much can be gleaned about the behaviour of Internet users in different online communities even just from such a partial data source.

Post–times, an invariable part of recorded information on Internet forums around the world, vary even less according to country and language; the biggest possible discrepancy here will be the ordering of dates (e.g., year/month/day in some countries versus the day/month/year format used in others), the representation of hours (a 24–hour clock versus a 12–hour clock) and perhaps whether seconds are recorded as well. These are, of course, no more than trifles and the daily, weekly, monthly, etc. rhythms of posting activity can be easily recorded and compared among online communities; whenever these virtual communities represent more or less clearly definable social groups, we can venture conclusions about the latter as well.
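
As a rough illustration of how such formatting differences can be ironed out, the sketch below parses timestamps written in two conventions and tallies posts by hour and by day of the week. The two format strings, the labels "ymd_24h" and "dmy_12h", and the sample timestamps are hypothetical and are not taken from the forum sites used in this study; time zones are ignored.

```python
from collections import Counter
from datetime import datetime

# Hypothetical timestamp conventions; real boards may differ.
FORMATS = {
    "ymd_24h": "%Y/%m/%d %H:%M:%S",   # year/month/day, 24-hour clock, seconds
    "dmy_12h": "%d-%m-%Y, %I:%M %p",  # day-month-year, 12-hour clock, no seconds
}

def parse_post_time(stamp, style):
    """Convert a raw timestamp string into a datetime object."""
    return datetime.strptime(stamp, FORMATS[style])

# Invented sample: (raw timestamp, convention it is written in).
raw_posts = [
    ("2008/03/14 23:05:41", "ymd_24h"),
    ("14-03-2008, 11:05 PM", "dmy_12h"),
    ("15-03-2008, 09:30 AM", "dmy_12h"),
]

times = [parse_post_time(stamp, style) for stamp, style in raw_posts]

# Once parsed, the original notation no longer matters.
posts_by_hour = Counter(t.hour for t in times)          # keys 0-23
posts_by_weekday = Counter(t.weekday() for t in times)  # 0 = Monday ... 6 = Sunday
```

Counting by hour and weekday in this way yields the daily and weekly profiles that can then be normalised and compared across communities.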

The target countries for this study are the U.K. and Japan and this choice stems from several considerations. Although situated on diametrically opposite ends of the world (on European maps, at least), they are both island nations of comparable size and have similarly high levels of Internet penetration, at 73 and 70 percent respectively [4]. Also, the U.K. has a clearly defined national domain on the Internet, which makes it a more suitable English–speaking country (as opposed to the United States) to compare to a non–English speaking country like Japan. Last but not least, it is of course desirable that the researcher be well familiar with the languages participating in the study.

 

++++++++++

2. Methodological considerations

2.1. N–grams and sociolinguistic research

For the analysis in Sections 4.1 and 4.2, this paper uses calculated frequencies of strings of letters and symbols. In natural language processing (NLP), sub–sequences of n letters or n words from a given text are called “N–grams”. Thus, a two–letter sequence can be called a bigram, a digram or a 2–gram, and a three–letter sequence a trigram; it is common to refer to sequences longer than three by number, so slightly confusing terms like ‘pentagram’ are not in common use. Importantly for corpus research, if a certain string appears multiple times in a large body of data, it will usually have a discernible meaning of some sort, typically representing (a part of) a word or phrase. In this study, N–gram frequencies are calculated on the basis of continuous strings of symbols with no spaces (i.e., hyperlinks), so for instance, if our data includes the string “WhatareNgrams?”, the 3–gram “are” or the 4–gram “what” are likely to occur many times in the whole corpus as they represent linguistically meaningful units. The 4–gram “reNg”, by contrast, will almost certainly be very rare. At the same time, the 3–gram “hat”, although a word in itself, is also a substring of longer words like “what” or “that”, and we have to be careful in determining the correct frequencies of such ambiguous strings. Normally, this can be done by relatively straightforward subtraction of frequencies: the counts of the longer strings are subtracted from the count of the shorter one.
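
The counting and subtraction described above can be sketched as follows. This is a minimal illustration rather than the processing pipeline actually used in the study: the function name ngram_counts and the four sample strings are invented, and the strings are lower–cased in advance to keep the example simple.

```python
from collections import Counter

def ngram_counts(strings, n):
    """Count every overlapping n-gram of length n across unsegmented strings."""
    counts = Counter()
    for s in strings:
        for i in range(len(s) - n + 1):
            counts[s[i:i + n]] += 1
    return counts

# Hypothetical space-free strings standing in for hyperlinks.
sample = ["whatarengrams", "thatisahat", "whatisthat", "redhat"]

trigrams = ngram_counts(sample, 3)
fourgrams = ngram_counts(sample, 4)

# "hat" also occurs inside "what" and "that"; subtracting the counts of
# those longer strings leaves the occurrences where "hat" stands apart from them.
hat_total = trigrams["hat"]
hat_outside_longer_words = hat_total - fourgrams["what"] - fourgrams["that"]
```

In this toy sample “hat” is found six times in total, of which four fall inside “what” or “that”, leaving two occurrences to be attributed to “hat” itself.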

The rest of this section will examine (somewhat at length) the most typical uses of N–grams and discuss their underexploited potential in sociolinguistic and CMC research.

In scientific literature, N–grams most often appear in the context of language modelling. An N–gram model predicts the next element in a sequence of letters or words, based on the previous n–1 elements. This is done by taking into account the probabilities of different letters or words appearing after a given sequence. Obviously, such an approach cannot account for the complexity of human language but it greatly simplifies the technical problems of modelling it. The use of N–grams in this regard can be traced back to Claude Shannon’s 1948 paper in the Bell System Technical Journal (cf., Shannon, 1948). Shannon regards natural language as a “stochastic process which produces a discrete sequence of symbols chosen from a finite set” [5] concluding that “a sufficiently complex stochastic process will give a satisfactory representation” of natural language, although what “satisfactory” means is certainly a subject open to discussion.
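
To make the idea concrete, here is a minimal character–level sketch of such a model, in which each symbol is predicted from a fixed–length history of the symbols preceding it (two characters in this example). The function names train and predict and the toy text are assumptions made for illustration; practical language models additionally apply smoothing, which is omitted here.

```python
from collections import Counter, defaultdict

def train(text, n):
    """Map each history of n-1 symbols to a Counter of the symbols that follow it."""
    model = defaultdict(Counter)
    for i in range(len(text) - n + 1):
        history = text[i:i + n - 1]
        next_symbol = text[i + n - 1]
        model[history][next_symbol] += 1
    return model

def predict(model, history):
    """Return the most frequent continuation and its relative frequency, or None."""
    followers = model.get(history)
    if not followers:
        return None  # history never observed in the training text
    total = sum(followers.values())
    symbol, count = followers.most_common(1)[0]
    return symbol, count / total

# Invented toy text; a real model would be estimated from a large corpus.
toy_text = "the then them that"
model = train(toy_text, 3)          # trigram model: two-symbol histories
best, probability = predict(model, "th")
```

In the toy text the history “th” is followed by “e” three times and by “a” once, so the model proposes “e” with an estimated probability of 0.75.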

Typical modern applications of N–grams include areas such as approximate string matching, text prediction, speech recognition, language recognition, parsing, and machine translation. Oakes (1998) describes a technique using bigrams to measure the similarity between a given pair of words through the use of a simple coefficient. If a and b are the total number of 2–grams in words A and B respectively and c is the number of common 2–grams, then the coefficient can be calculated according to the formula 2c/(a+b). For example, the words pediatric and paediatric are divided into overlapping 2–grams, giving pe–ed–di–ia–at–tr–ri–ic and pa–ae–ed–di–ia–at–tr–ri–ic respectively. The words broken up in this way have seven common bigrams (ed, di, ia, at, tr, ri and ic) and a similarity coefficient of 0.82. Calculations based on N–grams such as this find an application in approximate string matching; they can be used to detect misspelled words and replace them with correct spellings from a dictionary database by comparing a suitable numerical measure for possible alternatives and selecting the best candidate.
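
The calculation in this example can be reproduced with a few lines of code. The sketch below follows the formula 2c/(a+b) as given above; the function names are invented, and counting repeated bigrams as a multiset is an assumption, since the description does not say how duplicates within a word should be treated.

```python
from collections import Counter

def bigrams(word):
    """Split a word into its overlapping two-letter sequences."""
    return [word[i:i + 2] for i in range(len(word) - 1)]

def bigram_similarity(word_a, word_b):
    """Compute 2c / (a + b), where c is the number of shared bigrams."""
    grams_a, grams_b = bigrams(word_a), bigrams(word_b)
    # Multiset intersection: a bigram repeated in both words counts each time.
    shared = sum((Counter(grams_a) & Counter(grams_b)).values())
    return 2 * shared / (len(grams_a) + len(grams_b))

score = bigram_similarity("pediatric", "paediatric")
rounded = round(score, 2)   # 14 / 17, i.e. 0.82 to two decimal places
```

A spelling corrector built on this measure would compute the score between a suspect word and each dictionary entry and propose the entry with the highest value.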

Speech recognition systems typically start with a speech corpus and a text corpus which serve as training material. N–grams obtained from both are estimated and fed into the language model to be used, with the most popular statistical methods being “N–gram models, which attempt to capture the syntactic and semantic constraints by estimating the frequencies of sequences of n words.” [6] Typically, sequences of two or three words are used and their probabilities are estimated from a large corpus (Dagan, 2000). The importance of N–grams in this field is summarized by McAllester and Schapire who write about N–gram models that “these are essentially simple models of the language that do not capture any notions of grammar, meaning, and so on. In spite of the intuitive weakness of these models, they have proved very effective in supporting speech recognition, more effective than models that intuitively seem more sophisticated, and are today used in most if not all standard speech–recognition systems.” [7]

N–grams are also used in statistical machine translation. Statistical translation is an example of an empirical approach to tackling the problem of automated translation and emerged as a “fairly new paradigm to challenge and enrich established methodologies” (Somers, 2003). The starting point is usually a parallel corpus, i.e., collections of texts and their translations. Thus, statistical machine translation depends on a bilingual corpus, but the translation procedure depends on statistical modelling of the word order of the target language and of source–target word equivalences. The former is where an N–gram model comes in especially handy. If we are to have a language model which involves computing the probability of a word given all of the words that precede it in a sentence, we must know, at any point in the sentence, “the probability of an object word, S_j, given a history, S_1 S_2 … S_{j-1}. Because there are so many histories, we cannot simply treat each of these probabilities as a separate parameter… In an N–gram model, two histories are equivalent if they agree in their final n–1 words.” (Brown, et al., 1990) Statistical machine translation certainly has its limitations, but using N–gram models in this field has been surprisingly successful, all the more so because this approach has a “complete lack of linguistic knowledge” (Somers, 2003).

All of the above applications mainly fall in the domains of natural language processing (NLP), information theory and computational linguistics. The overwhelming majority of literature referring to N–grams and their frequencies comes from these fields, in the context of practically–oriented tasks which aid humans in carrying out laborious text processing. The usefulness of N–grams in these applications stems from their practical advantages in that “it is easy to formulate probabilistic models for them, they are very easy to extract from a corpus, and above all, they have proved to provide useful probability estimations for alternative readings of the input.” [8]

At the same time, however, and as the present study will attempt to demonstrate, N–grams can be efficiently used not only when the problem we pose is how texts can be manipulated or translated, but also when we want to learn about people’s use of language and even their behaviour when the latter is evidenced from textual data. N–gram data is yet to be utilised in sociolinguistic research to a satisfactory degree, even though it can be a very useful prism through which to interrogate large corpora and learn about their underlying linguistic registers and human subjects. One obvious advantage is that N–gram frequencies can be sufficiently informative even when dealing with ‘raw’ or unannotated texts and corpora, which is a boost to research in a very dynamic environment like online conversation. CMC in its written forms around the world is strongly characterized by linguistic and graphic innovation; the use of N–grams as a basic unit of analysis circumvents to a large extent the problem of having to decide what counts as a linguistic unit and what does not.

The underuse of N–grams in linguistic and sociolinguistic corpus research is evidenced by the scant number of studies in these fields which make use of them. A rare example is J. Milton and R. Freeman’s 1996 paper about lexical variation in the writing of Chinese learners of English (cf., Milton and Freeman, 1996). They analyse N–gram distributions in the writing of Hong Kong students, divided into several groups according to the evaluation grades of their compositions. They then compare N–gram frequencies from the Chinese students’ compositions to N–gram frequencies from a separate corpus consisting of scripts of native–speaker students who received the highest grades in a University of Cambridge ‘General Studies’ examination held in the U.K. The study discovers that the number of common (i.e., coinciding) collocations that learners of English use is much greater and has a far greater density than that of native speakers, and that there is a systematic relationship between the number of different collocations and the grade received by students, i.e., phrasal variety grows proportionally with grades. These findings are not surprising but they do show how N–grams can be used to provide objective evidence for differences existing between related textual data samples. Incidentally, Milton and Freeman use the same working definition of the term N–gram as the one adopted in this study, namely “any string of co–occurring words or symbols (including punctuation marks)”. A more recent N–gram–based study is Nishina (2007). In his comparison of text corpora belonging to the different genres of English academic, newspaper and literary texts, he uses 4–grams among other statistical and vocabulary measures to identify differences.

Japanese is a language which benefits from N–gram–based analytical methodology to an even greater degree than English, because it does not separate words with spaces. Consequently, whenever a linguistic analysis of large texts is undertaken, word segmentation is required, which is an error–prone task, especially when the data comes from CMC environments with their numerous linguistic and graphic idiosyncrasies. N–gram frequencies can efficiently be used to extract units of meaning from a corpus without having to define where a word stops and the next one begins. In spite of these added advantages, the discrepancy between the use of N–grams in disciplines like NLP and information retrieval on the one hand, and language studies on the other is apparent in Japanese academic circles as well. For the last few years, papers submitted to the annual conference of the Japanese Association of Natural Language Processing [9] have included at least one title which explicitly uses N–grams (primarily in the fields of information retrieval and indexing). In contrast, among papers and reports submitted to the Linguistic Society of Japan [10], only one (from 2007) mentions N–grams and that is in reference to automatic language detection of Web pages. No articles published by the Society for Japanese Linguistics [11] or the Japanese Association of Sociolinguistic Sciences [12] contain any mention of N–grams.
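
A small sketch may help to show how recurring units surface from unsegmented Japanese text without any word segmentation. The three sentences below are invented for illustration (roughly “I am going to Tokyo”, “Tokyo is big” and “Let’s meet in Tokyo tomorrow”) and the function name is an assumption; the study itself applies the technique to hyperlinks rather than to running text.

```python
from collections import Counter

def char_ngrams(text, n):
    """Return all overlapping character n-grams of an unsegmented string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# Invented sentences; no spaces mark the word boundaries.
sentences = ["東京に行きます", "東京は広いです", "明日東京で会いましょう"]

bigram_counts = Counter(
    gram for sentence in sentences for gram in char_ngrams(sentence, 2)
)

# The most frequent bigram is a candidate unit of meaning.
top_bigram, top_count = bigram_counts.most_common(1)[0]
```

Here the bigram 東京 (“Tokyo”) appears three times while every other bigram appears once, so it emerges as a unit even though the text was never split into words.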

In 2006, Google released a very large data set containing English word N–grams and their observed frequency counts through the Linguistic Data Consortium (LDC). Functionally, this kind of information is similar to frequency lists of words such as Leech, et al. (2001): they do not represent analyses per se but are rather designed to assist other researchers. The difference is, of course, that frequency word lists are obtained from tagged corpora and keep track of word–class, semantic and genre information; what Google’s N–grams provide is an enormous data sample. Google’s Machine Translation Team (GMTT) “processed 1,024,908,267,229 words of running text” and “published the counts for all 1,176,470,663 five–word sequences that appear at least 40 times” (cf., Franz and Brants, 2006). Even after discarding words that appear under 200 times, the data contains 13,588,391 unique words. GMTT’s stated motivation for publishing this vast array of N–grams and their frequencies was that the research community can benefit from access to such massive amounts of data, that this will promote research in the promising direction of large–scale and data–driven approaches, and allow all research groups, large or small, to play together. By today’s standards, this data is indeed massive:

 

Number of tokens:      1,024,908,267,229
Number of sentences:      95,119,665,584
Number of unigrams:           13,588,391
Number of bigrams:           314,843,401
Number of trigrams:          977,069,902
Number of fourgrams:       1,313,818,354
Number of fivegrams:       1,176,470,663
Source: LDC (http://www.ldc.upenn.edu) [13]

 

The possible spheres of application of such data as listed by Google’s Machine Translation Team are “statistical machine translation, speech recognition, spelling correction, entity detection, information extraction” (cf., Franz and Brants, 2006). Research on the human side of language or the underlying behaviour of its users is, however, not envisioned. On the other hand, huge though it may be, this data set is certainly not the best tool to throw light on specific sociolinguistic problems. Texts on the Web are too heterogeneous, with large numbers of them not even human–generated. This is why sociolinguistic research will usually be better served by N–gram frequency information obtained from smaller collections of texts which represent concrete linguistic or social domains. Furthermore, simple frequencies will seldom be sufficiently informative; the analytical possibilities are vastly increased where corpus comparison is undertaken, since comparing frequencies across different data samples provides the perspective needed to evaluate them and therefore gain greater insight into their significance.

A decisive advantage of N–grams is that they can be used to approach any kind of structured text; they will work equally well with conventional texts, numeric data and, importantly for CMC research, meta–linguistic information. Meta–linguistic information on message boards could be defined as all the recorded information within a given post/message besides the actual text (in the traditional sense of the word) of the message, which would include hyperlinks, users’ nicknames, smileys and other graphic inventions, post–times, poster signatures, etc. Crystal (2005) writes of the need to formulate a new branch of linguistic research which he calls ‘Internet linguistics’, defining it as the “synchronic analysis of language in all areas of Internet activity, including e–mail, the various kinds of chatroom and games interaction, instant messaging, Web pages, and associated areas of CMC, such as SMS messaging (texting)”. Including, or rather putting greater emphasis on, meta–linguistic information supplementing online utterances in computer–mediated conversation will allow us to similarly formulate a branch of Internet sociolinguistics, which will no doubt continue to increase in importance as CMC becomes ever more integral a part of human existence.

2.2. Hyperlinks in CMC research

In this study, cited URLs were extracted from a systematically collected body of Internet discussion texts. In theory, one could use software to crawl through an Internet forum site and only retrieve hyperlinks, quickly obtaining a large data sample. However, using a clearly defined corpus of forum messages enables the researcher to normalise frequency counts in the manner described above (i.e., per number of posts) and thus make objective comparisons between different data sets. Furthermore, the use of N–grams allows for efficient semi–automatic ‘reading’ and estimation of semantic information contained within the hyperlinks.
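
The extraction and normalisation steps mentioned here might look roughly like the following. This is a sketch under stated assumptions, not the procedure actually used: the regular expression is a deliberately simple one, the four messages and their example domains are invented, and the function names are hypothetical.

```python
import re

# Simplistic pattern: a scheme followed by any run of non-whitespace characters.
URL_PATTERN = re.compile(r"https?://\S+")

def extract_links(posts):
    """Collect every hyperlink found in a list of message texts."""
    return [url for post in posts for url in URL_PATTERN.findall(post)]

def per_thousand_posts(count, number_of_posts):
    """Normalise a raw count by corpus size, expressed per 1,000 posts."""
    return 1000 * count / number_of_posts

# Invented messages standing in for a forum corpus.
posts = [
    "Have a look at http://www.example.co.uk/news/story.html",
    "No link in this one.",
    "Pictures: http://img.example.jp/cat.jpg and http://img.example.jp/dog.jpg",
    "I agree with the post above.",
]

links = extract_links(posts)
link_rate = per_thousand_posts(len(links), len(posts))

# Frequency of one substring of interest within the extracted links.
jpg_count = sum(link.count(".jpg") for link in links)
jpg_rate = per_thousand_posts(jpg_count, len(posts))
```

Because both rates use the number of posts as the denominator, figures obtained from corpora of different sizes can be set side by side.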

In a comprehensive review of hyperlink studies in Internet research, Park and Thelwall (2003) distinguish two main approaches to hyperlink analysis, namely hyperlink network analysis (HNA) and Webometrics. The former is “based upon the assumption that hyperlinks may be the formalized bridge between hyperlinking and hyperlinked Web sites’ authors, serving as social symbols or signs” and regards hyperlinks between Web sites (or Web pages) as “social and communicational ties”, while the latter “has tended to apply much simpler techniques combined with a more in–depth investigation into the validity of hypotheses about possible interpretations of the results”; an example of such a simple measure would be the “Web impact factor”, i.e., the number of pages linking to a given Web site divided by the number of pages that the site contains (cf., Ingwersen, 1998). Park and Thelwall contrast content–based with hyperlink–based methods of Web analysis and state that, “the relative advantage of hyperlink analysis is that it is able to examine the way in which Web sites form a certain kind of relations with others via hyperlinks” (Park and Thelwall, 2003).

Insofar as the present study uses hyperlink substrings with identifiable semantic content which is then numerically evaluated in terms of N–gram frequencies, as opposed to counting whole links or measuring Web site interconnectedness in some way, its analytical approach can be regarded as belonging to neither of the above two, although some elements such as looking at top–level domain codes may coincide. Rather, it can be seen as a kind of reading of hyperlinks as text and quantifying all possible units of meaning contained in them, thus making empirical comparison across data samples fast and efficient. The N–gram–based frequential hyperlink analysis proposed here can be said to incorporate a kind of combination of the two methods of Web analysis along the lines of Park and Thelwall’s distinction, i.e., it constitutes a content analysis of hyperlinks. Furthermore, the context in which the majority of links are produced in the special case of message boards is not so much Web site–to–Web site or Web site author–to–author as in much of the prior research on Internet linkage; rather, the main function of hyperlinks here can be better described as Internet user–to–information.

In a previous study using hyperlinks to make international comparisons, which is also an example of the above–mentioned HNA approach, Barnett and Sung (2005) explore the relationship between national culture and the structure of the Internet, “operationalised as the international hyperlink network”. Specifically, they examine the relationship between national culture, measured with Hofstede’s (1991) dimensions, and the centrality and overall structure of the Internet flow using network analysis; their data consist of international hyperlinks and the numbers of bilateral inter–domain hyperlinks among 47 nations (including the two compared here) obtained through the Alta Vista search engine. Some of their findings which are relevant to this study will be mentioned below.

2.3. Temporal dimensions of CMC

Internet conversation systems like message boards and chatrooms will usually give no direct clues as to the location of posters, unless they decide to disclose it in their user profiles or otherwise provide information in their messages. Explicit indication of posters’ originating IP addresses is a rare feature on Internet forums. Some 2Channel boards (cf., Section 3 for a description of the forum sites used in this study) will at most provide an “ID” attached to every message, referred to as a ‘tripcode’, which is a string of symbols encrypted on the basis of the user’s IP address and time of posting. This is done to prevent the so-called jisaku–jien (lit. ‘playing one’s own script’) which typically involves a single person posting multiple messages as if they originated from multiple posters (cf., Suzuki, 2003) thus representing a type of behaviour related to the so–called sock–puppeting or trolling. On the other hand, posters on forum sites which do require registration like Urban75 (cf., Section 3) will often humorously or otherwise avoid disclosing their actual location; examples of mock–locations include ‘high in a tower block’, ‘on the far edge of reason’, ‘a purple tambourine’ and ‘sailing a redemption ark’. In any case, there is no reliable way of determining people’s locations, much less a method that can be used for automatic data extraction.

One thing the observer can ascertain with accuracy, however, is the exact time a message was posted. Data collected from message boards and social networking sites should be seen as a powerful tool for exploring the ‘geography’ of social time. The temporal characteristics of Internet forum use which the present study uncovers are given extra significance by the fact that we are dealing with asynchronous online conversation, i.e., certain patterns have prevailed even though individuals are not constrained by their fellow posters’ simultaneous online availability or lack thereof at any given time.

The relative ease with which temporal data from records of online conversation can be obtained and analysed should encourage the investigation of time patterns of Internet use within various Web communities. However, surveys of this kind are relatively few at the present time. For previous research similarly looking into temporal characteristics of CMC see, for example, Burr and Spennemann (2004) and Golder, et al. (2006), analysing information obtained from online forums at an Australian university and from Facebook respectively. Burr and Spennemann (2004) draw on a richer set of temporal data than the kind used here, including not only forum posts but forum views as well. They determine patterns of annual, sessional, daily and hourly user behaviours over a period of four years and analyse them from technical (e.g., network traffic) and pedagogical standpoints in the context of distance learning. In contrast to Burr and Spennemann’s analysis, much less is known about subjects’ identities in this study and the discussion is solely based on a comparison of two (presumably) non–overlapping online communities. Golder, et al. (2006) similarly deal with college students and analyse a very large data sample: “the fully–anonymized headers of 362 million messages exchanged by 4.2 million users of Facebook, an online social network of college students [at the time], during a 26 month interval”. They find a strong weekly temporal pattern to college students’ Facebook use, a grouping of students with similar temporal patterns by school, and a seasonal variation in the proportion of messages sent within a school.

In terms of their organisational function in written online conversation, post–times are important for determining the order of speech turns. It is therefore not particularly significant exactly when a message in the discussion thread was sent but rather whether it comes before or after any other given message. However, temporal posting patterns will clearly be subject to off–line social habits and constraints. At the same time, because of the complex nature of these external social factors and the impossibility of keeping track of the exact demographic constitution of online communities (which would in theory give us deeper insight into the relationship between time patterns and social constraints), we have to limit ourselves mostly to discovering persistent differences in temporal rhythms of Internet use among discernible online populations while being cautious interpreters of those findings, at least until we can draw on a much wider body of research in this area.

 

++++++++++

3. The data: Sources and processing

Three Internet forum sites were used for this study: Urban75 Forums and Digital Spy Forums, both U.K.–based, and 2Channel (ni–channeru) — a vast repository of Japanese message boards.

2Channel is “the world’s largest single Internet bulletin board forum and the most widely known single free–access Japanese Internet forum, with over five million people accessing it each month.” (Kaigo and Watanabe, 2007) It is also characterised by a high level of anonymity, which means that “risqué or taboo subjects that are usually not discussed in normal face–to–face communication in Japan are popular topics.” (Kaigo and Watanabe, 2007) 2Channel threads have a flat structure, i.e., hundreds of messages can be displayed on a single Web page, making data collection a simple matter of copy–and–paste operations.

The two U.K. boards were selected for their comparative popularity (in terms of text traffic), availability of numerous topics of a general (i.e., not overly specialised) nature, (mostly) U.K. participation and a type of bulletin–board system software (vBulletin [14]) which allows for comparative ease of data collection and processing (although requiring the use of automated extraction, unlike 2Channel). The size and structure of the data are outlined in Table 1.

 

Table 1: Message boards data: Composition.
                      Posts      Threads   Links    Links (per 1,000 posts)
Digital Spy Forums    54,930     20        2,745
URBAN75 Forums        56,246     30        6,887
U.K. total            111,177    50        9,622    86.55
2Channel              100,149    112       8,494    84.81

 

Popular threads were selected, the indicator being a message count as large as possible for each site. The goal was to include topics of a general kind, so as to limit exchanges with little or no discussion such as image–posting threads, or overly–specialised conversations such as professional discussions or word games. One technical feature of 2Channel is that once a thread reaches a length of 1,000 posts it is automatically archived thus setting an effective upper limit on the message count; consequently the data sample described above contains a larger number of Japanese threads to achieve the targeted number of posts (around 100,000 in each language).

Other than the reasons stated in the introduction, U.K.–based data was preferred as a source of English–language online discussion so as to increase the probability that most posters reside in the same country and time zone thus enabling the researcher to attempt international comparisons of online behaviour. In the data below, the abbreviation “2Ch” will indicate N–gram frequencies obtained from the Japanese forums corpus, while frequencies from Urban75 and Digital Spy combined will be designated as “UKf”.

Among corpus linguistics studies, Leech and Fallon (1992) is methodologically closely related to the present article. The two authors used word frequencies, the chi–squared test and a simple numeric measure (which I have named the ‘Leech–Fallon’ coefficient for my purposes) to compare two collections of texts, the Brown (American English) and LOB (British English) corpora, both dating from 1961, and to establish a set of cultural differences between the two largest English–speaking nations. A more recent related study is Oakes and Farrow (2007), in which the chi–squared test is used to find the vocabulary most typical of seven different ICAME [15] corpora, each representing the English used in a particular country. Both pieces of research deal with annotated corpora and vocabulary differences rather than ‘raw’ texts and N–gram data.

In preparation for the analysis in sections 4.1 and 4.2, all hyperlinks were extracted from the data and processed to obtain N–gram frequencies (with 1 ≤ N ≤ 8). The frequencies of all resulting N–grams, calculated per 1,000 posts, were then compared between the U.K. and Japanese corpora in a manner similar to Leech and Fallon (1992), using a measure of statistical significance (chi–square) and a simple coefficient that I will refer to here as the Leech–Fallon coefficient [16] (LF henceforth). LF has mainly been useful to the author in ordering and viewing the N–gram data from different angles, but it will nevertheless be included in some of the tables below. The data layout employed after initial processing therefore has the following structure:

 

Table 2: N–gram data sample.
Note: ρ = significance level reached by the observed difference according to Pearson’s chi–square test; omission in the tables below implies lack of statistical significance;
Χ² = value of the chi–square statistic;
LF = (N–gram(UKf) − N–gram(2Ch)) / (N–gram(UKf) + N–gram(2Ch));
2Ch = frequency of N–gram in the Japanese corpus (per thousand posts); and,
UKf = frequency of N–gram in the U.K. corpus (per thousand posts).
| ρ | Χ² | LF | N–gram | 2Ch | UKf |
| .001 | 30.573 | -0.64 | ya | 5.5 | 1.2 |
| .001 | 20.554 | -0.69 | yah | 3.3 | 0.6 |
| .001 | 20.554 | -0.69 | yaho | 3.3 | 0.6 |
| .001 | 20.554 | -0.69 | yahoo | 3.3 | 0.6 |
| .001 | 20.554 | -0.69 | yahoo.c | 3.3 | 0.6 |
| .001 | 20.554 | -0.69 | yahoo.co | 3.3 | 0.6 |
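
As a rough illustration of the LF and Χ² columns, the following sketch computes the LF coefficient as defined in the note above and a textbook Pearson chi–square statistic for a 2×2 table. The function names are hypothetical, and the paper does not state exactly how the counts entering its chi–square test were tabulated, so `pearson_chi_square` is a generic version that should not be expected to reproduce the Χ² column. The critical values used in `significance_level` are the standard ones for one degree of freedom; the ρ labels in the tables appear to be consistent with them, but that is an assumption.

```python
def lf_coefficient(ukf: float, ch2: float) -> float:
    """Difference of the two frequencies divided by their sum (cf. note 16)."""
    total = ukf + ch2
    if total == 0:
        raise ValueError("N-gram is absent from both corpora")
    return (ukf - ch2) / total


def pearson_chi_square(hits_a: int, size_a: int, hits_b: int, size_b: int) -> float:
    """Pearson chi-square (no continuity correction) for a 2x2 table.

    Rows are the two corpora; columns are units with / without the N-gram.
    How units are counted (posts, links, ...) is an assumption left to the caller.
    """
    observed = [
        [hits_a, size_a - hits_a],
        [hits_b, size_b - hits_b],
    ]
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    if min(row_totals) == 0 or min(col_totals) == 0:
        raise ValueError("chi-square is undefined when a row or column total is zero")
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2


# Standard critical values of chi-square with one degree of freedom.
CRITICAL_VALUES = ((0.001, 10.828), (0.01, 6.635), (0.05, 3.841))


def significance_level(chi2: float):
    """Strictest conventional level passed by chi2, or None if below .05."""
    for level, critical in CRITICAL_VALUES:
        if chi2 >= critical:
            return level
    return None
```

For the yahoo row of Table 2, `lf_coefficient(0.6, 3.3)` rounds to the tabulated value of -0.69; a negative LF therefore marks an N–gram that is more frequent in 2Ch, and a positive one an N–gram more frequent in UKf.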

 

Notice the overlapping character of the N–grams; the frequencies sometimes have to be compared vertically to obtain correct results. In general, an N–gram which is a subset of a longer N–gram will have a frequency at least as high as that of the longer one. In the example above, it is clear that all instances of yahoo in the data are part of longer strings like yahoo.com or yahoo.co.jp, since the frequencies of the subset N–gram and its longer derivatives are the same. Also, wherever the N–gram column contains a string of more than eight symbols (i.e., N>8), this means that correct frequencies have been determined using similar vertical comparisons of related subsets. N–grams with frequencies of less than 0.1 per 1,000 posts have not been included in the discussion below.
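
The extraction step and the ‘vertical’ comparison described above might be sketched as follows. This is a minimal illustration under stated assumptions, not the author’s actual tooling: the function names are hypothetical, links are lower–cased before counting, and every occurrence of an N–gram is counted (the paper does not say whether repeated occurrences within one link were counted once or several times).

```python
from collections import Counter


def char_ngrams(text: str, max_n: int = 8):
    """Yield every character N-gram of length 1..max_n contained in text."""
    for n in range(1, max_n + 1):
        for start in range(len(text) - n + 1):
            yield text[start:start + n]


def ngram_rates(links, post_count: int, max_n: int = 8) -> dict:
    """Occurrences of each N-gram across all links, per 1,000 posts."""
    counts = Counter()
    for link in links:
        counts.update(char_ngrams(link.lower(), max_n))
    return {gram: 1000 * count / post_count for gram, count in counts.items()}


def standalone_rate(rates: dict, short: str, longer: str) -> float:
    """Rate of `short` not accounted for by the longer N-gram that contains it."""
    return rates.get(short, 0.0) - rates.get(longer, 0.0)
```

If `standalone_rate(rates, "yahoo", "yahoo.co")` comes out as zero, every instance of yahoo is part of the longer string, which is the situation shown in Table 2. Strings longer than eight symbols never appear as keys in `rates`, which is why their frequencies have to be inferred from such comparisons.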

For the analysis in Section 4.3, the time and date of posting of each message were extracted, the frequency of each hour and day of the week was normalised per 100,000 posts and these were then compared. Times from 00 to 59 minutes past the hour have been counted together and assigned to the preceding full hour, so that, for example, 16:32 falls into the group of hour 16 in the charts below. Midnight is presented as 24:00 in the general hour–chart and as 00:00 in the weekday–specific hour–charts.
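
The binning and normalisation just described could look roughly like this. The sketch assumes that post times have already been parsed into `datetime` objects in each forum’s local time; the function names are hypothetical.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def hourly_profile(timestamps, scale: int = 100_000) -> dict:
    """Posts per clock hour (minutes 00-59 pooled), normalised to `scale` posts."""
    if not timestamps:
        raise ValueError("no timestamps given")
    counts = Counter(ts.hour for ts in timestamps)
    total = len(timestamps)
    return {hour: scale * counts.get(hour, 0) / total for hour in range(24)}


def weekday_profile(timestamps, scale: int = 100_000) -> dict:
    """Posts per day of the week, normalised to `scale` posts."""
    if not timestamps:
        raise ValueError("no timestamps given")
    counts = Counter(ts.weekday() for ts in timestamps)  # Monday == 0
    total = len(timestamps)
    return {name: scale * counts.get(index, 0) / total
            for index, name in enumerate(WEEKDAYS)}
```

Normalising both corpora to the same notional total makes the two curves directly comparable even though the samples differ in size.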

Even though it is difficult — or in the case of 2Channel, impossible — to ascertain every poster’s nationality, I will refer to the subjects of this paper as U.K./Japanese posters. Also, a phrase like “posters in Japan” should not necessarily be taken to mean “people posting from within the geographic boundaries of Japan”, but rather “users who post on Japanese Internet forum sites in Japanese”.

The analytical part of this paper which follows is strictly data–driven and inevitably contains a large number of tables illustrating the empirical basis for each point made.

 

++++++++++

4. Results and discussion

4.1. Hyperlinks: Domain analysis

The relative frequencies of hyperlinks in the two data sets show remarkable similarity, although the U.K. forum sites, each much smaller than 2Channel, diverge considerably in this respect:

 

| Corpus | Links per 1,000 posts [17] |
| UKf | 86.55 |
| 2Ch | 84.81 |

 

An interesting detail to note would be the way links are posted on message boards, i.e., how often they are cited in their full form:

 

| N–gram | 2Ch | UKf |
| ttp:// | 84.8 | 83.2 |
| http:// | 72.9 | 83.2 |

 

Practically all hyperlinks on 2Channel begin with the protocol identifier http://, but posters omit the first letter, h, about 14 percent of the time, as evident from the table above. This is done, supposedly, to avoid a page of advertisements that follows a direct click on a link; ‘dropping the h’ forces users to copy the hyperlink and paste it into their browser’s navigation toolbar thus avoiding the site’s referral system. In UKf, we find that about four percent of the links are given with the protocol identifier omitted altogether — a practice virtually unobserved in the Japanese data. No ftp:// links were found in either data sample.
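
To make the arithmetic behind these percentages explicit, here is a small sketch. Because the N–gram ttp:// is also contained in every full http:// link, the share of ‘h–dropped’ links is the difference between the two table rows relative to the larger one. The classifier and its labels are hypothetical; anything that starts with neither prefix is lumped together as "other".

```python
from collections import Counter


def prefix_class(link: str) -> str:
    """Classify a posted link by how its protocol identifier is written."""
    lowered = link.strip().lower()
    if lowered.startswith("http://"):
        return "http://"
    if lowered.startswith("ttp://"):
        return "ttp://"   # leading 'h' dropped
    return "other"        # protocol omitted, or anything else


def prefix_shares(links) -> dict:
    """Fraction of links falling into each prefix class."""
    counts = Counter(prefix_class(link) for link in links)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def share_without(total_rate: float, rate_with: float) -> float:
    """Share of links lacking a feature, given two per-1,000-post rates."""
    return (total_rate - rate_with) / total_rate


# Rates taken from the table above and from Table 1.
dropped_h_2ch = share_without(84.8, 72.9)      # ttp:// rate vs. http:// rate
no_protocol_ukf = share_without(86.55, 83.2)   # all links vs. http:// links
```

With the tabulated rates, `dropped_h_2ch` rounds to 14 percent and `no_protocol_ukf` to four percent, matching the figures quoted in the text.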

Although their frequency in UKf and 2Channel is roughly the same, the external/internal composition of hyperlinks on each particular site is very different. The number of links pointing to pages within each forum site (per 1,000 posts) is as follows:

 

Intra–site hyperlinks.
| urban75.net | 3.1 |
| digitalspy.co.uk | 1.1 |
| 2ch.net | 21.9 |

 

The observed differences here are great and this is testimony to the size of the world’s largest message board site. This kind of data could in principle serve as a good measure, in combination with other indicators perhaps, of the character of online communities: here, it suggests that Urban75 is more ‘tight–knit’ as an online community than Digital Spy.
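
The figures above were read off N–gram frequencies. An alternative, host–based way of counting intra–site links is sketched below for comparison; it is not the procedure used in this paper. The helper names are hypothetical, and the sketch assumes that a dropped h or a missing protocol should be repaired before the URL is parsed.

```python
from urllib.parse import urlsplit


def link_host(link: str) -> str:
    """Host part of a forum hyperlink, tolerating 'ttp://' or a missing protocol."""
    cleaned = link.strip().lower()
    if cleaned.startswith("ttp://"):
        cleaned = "h" + cleaned
    elif "://" not in cleaned:
        cleaned = "http://" + cleaned
    return urlsplit(cleaned).hostname or ""


def intra_site_rate(links, site_domain: str, post_count: int) -> float:
    """Links whose host is the site's own domain or a subdomain, per 1,000 posts."""
    inside = 0
    for link in links:
        host = link_host(link)
        if host == site_domain or host.endswith("." + site_domain):
            inside += 1
    return 1000 * inside / post_count
```

Matching on the parsed host rather than on a raw substring avoids counting unrelated domains that merely contain the site name.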

4.1.1. Country–code domains

The frequencies (per 1,000 posts) of hyperlinks pointing to the country domains of the U.K. and Japan are:

 

| Domain | 2Ch | UKf |
| .jp | 31.0 | 0.2 |
| .uk | 0.2 | 21.1 |

 

These distributions possess a statistically significant diagonal difference (i.e., .jp representation in 2Ch versus .uk representation in UKf) with ρ≤0.001; consequently, Japanese links can be said to stay within their virtual national space more often — around 37 percent of the time as opposed to 24 percent in the U.K. case. This observation will become more significant in combination with others below.
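
The two percentages can be recovered from the tabulated frequencies, assuming that the denominators are the overall link rates per 1,000 posts given in Table 1:

```latex
\[
\frac{f_{\text{.jp}}(\text{2Ch})}{f_{\text{links}}(\text{2Ch})} = \frac{31.0}{84.81} \approx 0.37,
\qquad
\frac{f_{\text{.uk}}(\text{UKf})}{f_{\text{links}}(\text{UKf})} = \frac{21.1}{86.55} \approx 0.24
\]
```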

4.1.2. Top–level domains

The most common Internet top–level domains are featured in our hyperlink data with the following frequencies (.net and .co.uk frequencies originating from the URLs of the message board sites themselves, i.e., urban75.net, 2ch.net and digitalspy.co.uk, have been subtracted):

 

Table 3: Top–level domain distributions in hyperlinks.
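| ρ | Χ² | LF | N–gram | 2Ch | UKf |
| | 2.666 | -0.17 | .net | 3.8 | 2.7 |
| .001 | 17.927 | 0.44 | .org | 3.1 | 7.9 |
| .001 | 50.918 | 0.31 | .com | 23.4 | 44.1 |
| .001 | 160.165 | 0.98 | .co.uk | 0.2 | 17.7 |
| .001 | 138.018 | -1 | .co.jp | 12.9 | |
| .001 | 54.335 | 0.27 | .co(m) TOTAL | 36.5 | 61.8 |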

 

Both .org and .com are significantly more prevalent in UKf, while .net is distributed more or less evenly. We also notice that while the home–country commercial domain is featured in hyperlinks more often in the U.K. data, if we examine .co.jp and .co.uk as a proportion of the total number of commercial (i.e., .co(m)) links in 2Ch and UKf respectively, the former is larger at around 35 percent, as opposed to 28 percent for the latter.
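
Using the rounded values in Table 3, the proportions of home–country commercial links work out as follows; the small gap between the second figure and the “28 percent” quoted above presumably reflects rounding in the tabulated frequencies:

```latex
\[
\frac{f_{\text{.co.jp}}(\text{2Ch})}{f_{\text{.co(m)}}(\text{2Ch})} = \frac{12.9}{36.5} \approx 0.353,
\qquad
\frac{f_{\text{.co.uk}}(\text{UKf})}{f_{\text{.co(m)}}(\text{UKf})} = \frac{17.7}{61.8} \approx 0.286
\]
```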

4.1.3. Government and academic institutions

Japanese message board posters refer to governmental pages noticeably more often than their U.K. counterparts:

 

| N–gram | 2Ch | UKf |
| .go.jp | 1.5 | |
| .gov.uk | | 0.4 |
| .gov | 0.2 | 0.8 |

 

The difference in frequency between .go.jp and .gov.uk is significant at the ρ<0.05 level. This cannot be attributed simply to a possible greater abundance of government–domain pages in Japan, as Google’s search engine reveals that the approximate number of pages within .go.jp and .gov.uk is roughly the same (125–126 million pages in each case) [18].

Both groups also refer to U.S. governmental Web pages (.gov) and, not surprisingly, U.K. posters do so much more often. However, more striking is the fact that posters in the United Kingdom seem to visit .gov pages twice as often as pages belonging to their domestic government institutions.

The situation is somewhat similar in links to pages belonging to academic institutions:

 

| N–gram | 2Ch | UKf |
| .ac.jp | 0.6 | |
| .ac.uk | | 0.4 |
| .edu | 0.2 | 0.8 |

 

This time, the frequencies of .ac.jp and .ac.uk do not exhibit any statistically significant difference, but the discrepancy becomes more indicative when we take into consideration the fact that the approximate number of Web pages within each country’s academic domain was 37.7 million for .ac.jp and 50.3 million for .ac.uk overall (at the time of writing of this paper). Again, the number of times U.S. academic pages (.edu) are referred to in UKf is twice that of pages within the U.K. itself.
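
One way to read this remark, offered here only as an illustration, is to divide each link frequency by the size of the corresponding domain, giving links per 1,000 posts per million indexed pages:

```latex
\[
\frac{0.6}{37.7} \approx 0.0159 \ (\text{.ac.jp}),
\qquad
\frac{0.4}{50.3} \approx 0.0080 \ (\text{.ac.uk}),
\qquad
\frac{0.0159}{0.0080} \approx 2.0
\]
```

On this normalisation, academic pages are cited roughly twice as often relative to their number in the Japanese data as in the U.K. data.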

4.2. Hyperlinks: Content analysis

4.2.1. News, blogs and online reference

The frequency distribution of news shows a significant difference between the two corpora, with this N–gram appearing more frequently in Japanese hyperlinks:

 

| ρ | Χ² | LF | N–gram | 2Ch | UKf |
| .001 | 22.898 | -0.23 | news | 20.4 | 12.8 |

 

If we take a look at the main news sources in our data we find that the greater part of references seem to be concentrated in a few well–known sites in UKf, while on 2Channel the online editions of popular newspapers add up to a much smaller combined frequency.

 

Table 4: References to well–known news sites.
| News Web site | 2Ch | UKf |
| in Japanese | | |
| www.asahi.com | 0.9 | |
| www.yomiuri.co.jp | 0.7 | |
| www.nikkei.co.jp | 0.6 | |
| www.mainichi(–msn.co).jp | 0.6 | |
| www.nhk.or.jp/news | 0.2 | |
| www.nikkansports.com | 0.2 | |
| sankei.jp.msn.com | 0.2 | |
| www.cnn.co.jp | 0.2 | |
| www.chunichi.co.jp | 0.1 | |
| www.tokyo–np.co.jp | 0.1 | |
| in English | | |
| news.bbc.co.uk | 0.1 | 3.6 |
| www.guardian.co.uk | | 2.7 |
| www.bloomberg.com | | 0.7 |
| www.independent.co.uk | | 0.6 |
| www.reuters.com | 0.1 | 0.5 |
| www.dailymail.co.uk | | 0.4 |
| www.thesun.co.uk | | 0.2 |
| www.cnn.com | | 0.2 |
| news.sky.com | | 0.1 |
| TOTAL | 4 | 9.5 |

 

This suggests that while Japanese posters quote online news more frequently, they rely less on online newspapers and news agencies directly. A major factor in this is that 2Channel has its own ‘information agency’, so to speak, with many boards devoted to news; posters will frequently refer others to news–citing threads on the site. The N–gram newsplus (being a subset of news), which appears in the URLs of various 2Channel news–related boards, illustrates this point:

 

| LF | N–gram | 2Ch | UKf |
| -1 | newsplus | 5.2 | |

 

Blog references are also more frequent on 2Channel. The corresponding N–gram distribution exhibits a statistically significant difference:

 

| ρ | Χ² | LF | N–gram | 2Ch | UKf |
| .05 | 3.976 | -0.21 | blog | 4 | 2.6 |
| .05 | 4.274 | -1 | diary | 0.4 | |

 

Although it is a small minority, the word diary sometimes appears in Japanese hyperlink URLs and a manual check on the actual links in question reveals that they indeed point to blog pages. The corresponding Japanese word, nikki, was also found but far below the threshold of 0.1 times per 1,000 posts for N–grams used in this study.

The bulk of hyperlinks on message boards are presumably provided for extra information, and we do not find significant differences between 2Ch and UKf in the distributions of the following information–related N–grams:

 

| ρ | Χ² | LF | N–gram | 2Ch | UKf |
| | 2.501 | 0.31 | .wikipedia | 1.2 | 2.1 |
| | 0.91 | -0.15 | info | 1.5 | 1.1 |
| | 1.426 | 0.3 | .pdf | 0.7 | 1.3 |

 

As far as Wikipedia references are concerned, the difference was not calculated to be statistically significant, but it still constitutes an almost two–fold disparity; it is also in line with a previous study by the author, showing that Japanese forum posters cite Wikipedia references less often than their European counterparts [19]. A further pair of N–gram distributions indicates, not surprisingly, that Internet users in both countries consult encyclopaedia articles almost exclusively in their own language.

 

| ρ | Χ² | LF | N–gram | 2Ch | UKf |
| .001 | 11.755 | -1 | ja.wikipedia | 1.1 | |
| .001 | 16.904 | 0.91 | en.wikipedia | 0.1 | 2.1 |

 

This is a more interesting fact as far as the Japanese case is concerned, as English is a compulsory subject in school education. Incidentally, almost all of the few references to English Wikipedia articles found in 2Ch came from a discussion devoted to the world history of languages.

4.2.2. Visual and audio content

By looking at N–grams related to pictorial, video and audio content, as well as corresponding file extensions, we discover some clear differences, made even more significant by the fact that no threads specifically devoted to exchanging images or music files were included in the data sample.

 

Table 5: Distributions of N–grams indicating different media types.
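| ρ | Χ² | LF | N–gram | 2Ch | UKf |
| | | | IMAGES | | |
| .001 | 20.88 | 0.78 | .gif | 0.4 | 3.3 |
| .001 | 56.628 | 0.54 | .jpg | 5.1 | 17 |
| .001 | 20.874 | 0.87 | album | 0.2 | 2.8 |
| .01 | 8.424 | 1 | flickr | | 0.9 |
| | 0.873 | 0.5 | gallery | 0.1 | 0.3 |
| .001 | 48.545 | 0.69 | image | 1.7 | 9.4 |
| .001 | 16.164 | 0.58 | photo | 1.1 | 4.2 |
| .01 | 6.788 | 0.82 | picture | 0.1 | 1 |
| .001 | 17.395 | 0.48 | img | 2.3 | 6.5 |
| .001 | 293.001 | 0.79 | TOTAL | 5.3 | 45.5 |
| | | | VIDEO | | |
| | 2.942 | -0.67 | movie | 0.5 | 0.1 |
| .05 | 5.604 | -0.44 | video | 1.8 | 0.7 |
| | 0.957 | 0.13 | watch | 4.3 | 5.6 |
| | 1.95 | 0.18 | youtube | 3.6 | 5.2 |
| | 2.072 | -0.06 | TOTAL | 10.2 | 11.6 |
| | | | AUDIO | | |
| .01 | 6.949 | -0.8 | music | 0.9 | 0.1 |
| | 1.494 | -0.43 | .mp3 | 0.5 | 0.2 |
| | 0.009 | 0 | radio | 0.4 | 0.4 |
| | 1.069 | -1 | audio | 0.1 | |
| | 0.936 | 1 | .rm | | 0.1 |
| | 0.936 | 1 | .ram | | 0.1 |
| .05 | 4.27 | -0.36 | TOTAL | 1.9 | 0.9 |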

 

By far, the biggest difference is in hyperlinks pointing to online images and pictorial galleries. The data strongly suggests that U.K. forum users provide links to images much more often than their Japanese counterparts. In both countries, video content is linked to much more often than audio content.

4.2.3. Popular Web sites and sexual content

Table 6 shows the frequencies of hyperlink references to pages belonging to some well–known Web sites, Mixi being a large Japanese social networking service [20], and Rakuten — the Japanese equivalent of eBay [21].

 

Table 6: Hyperlink references to well–known Web sites.
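| ρ | Χ² | LF | Web site | 2Ch | UKf |
| | 0.011 | 0 | Amazon | 0.5 | 0.5 |
| | 0.936 | 1 | AOL | | 0.1 |
| | 1.872 | 1 | eBay | | 0.2 |
| | 3.744 | 1 | Facebook | | 0.4 |
| .01 | 8.424 | 1 | Flickr | | 0.9 |
| .05 | 4.936 | -0.64 | GeoCities | 0.9 | 0.2 |
| | 0.004 | 0.05 | Google | 1 | 1.1 |
| | 3.206 | -1 | Mixi | 0.3 | |
| | 1.878 | -0.4 | MSN | 0.7 | 0.3 |
| | 2.808 | 1 | MySpace | | 0.3 |
| .05 | 5.343 | -1 | Rakuten | 0.5 | |
| .001 | 20.554 | -0.69 | Yahoo! | 3.3 | 0.6 |
| | 1.95 | 0.18 | YouTube | 3.6 | 5.2 |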

 

Notably, Facebook and MySpace are missing from the 2Channel data, while Yahoo! services clearly enjoy a much bigger popularity in Japan.

The contrast between the main public broadcasters of the U.K. and Japan, the BBC and NHK, is also interesting. The relevant N–gram distributions show that BBC news and resources are much more frequently linked to in the U.K. data than NHK is in the Japanese data:

 

| N–gram | 2Ch | UKf |
| bbc.co | 0.1 | 4.7 |
| nhk | 0.4 | |

 

It is no secret that the Internet provides multiple opportunities for viewing content of an erotic or pornographic nature. The relevant N–gram distributions obtained from hyperlinks show that users on 2Channel concern themselves with Web pages featuring sexual content much more often than users on Urban75 and Digital Spy, although this comes as no surprise perhaps, given the anonymous nature of the Japanese online forum repository.

 

Table 7: Hyperlink N–grams: Sexual content.
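| ρ | Χ² | LF | N–gram | 2Ch | UKf |
| | 0.271 | 0.33 | sex | 0.1 | 0.2 |
| | 1.069 | -1 | porn | 0.1 | |
| | 2.137 | -1 | xxx | 0.2 | |
| | 6.411 | -1 | pink | 0.6 | |
| 0.001 | 11.851 | -0.78 | TOTAL | 1.6 | 0.2 |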

 

Let us examine this measure against N–gram frequencies obtained from the actual texts of the messages in all participant corpora (excluding hyperlinks). At first glance, the situation is dramatically reversed:

 

Table 8: Textual N–grams: Sexual content.
Note: asterisks indicate compound forms; for example, porn(ography)* includes words like porn, pornographic and pornography in one group; English N–grams were obtained as strings of n words, while Japanese N–grams are strings of n characters.
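| N–gram | Urban75 | Digital Spy | UKf | 2Ch | Japanese N–gram | Engl. meaning |
| sex | 9.7 | 4.66 | 14.36 | 8.7 | [Japanese text] | sex |
| porn(ography)* | 1.08 | 2.02 | 3.1 | 2.2 | [Japanese text] | sex |
| | | | | 0.1 | [Japanese text] | porn(ography) |
| | | | | 1.2 | sex | sex |
| Country TOTALS | | | 17.46 | 12.2 | | |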

 

However, we are quickly forced to make adjustments. For example, the English word sex, unlike its Japanese counterpart [Japanese text], can refer to both ‘intercourse’ and ‘type of gender’. We also discover subset N–grams like porn charges, same sex and Sex Pistols, which either do not belong to the category of explicit sexual talk or treat sexual matters as social problems. This greatly complicates the task of determining the correct frequencies, unless we look at cases where such factors do not play that big a role. Thus, if we only take phrases like have sex (in its various forms such as have sex, having sex, had sex, etc.) and their Japanese counterparts, we obtain a balance of frequencies much closer to the initial observation that 2Channel contains a greater amount of explicit sexual content:

 

Table 9: Textual N–grams: Sexual content (2).
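| N–gram | Urban75 | Digital Spy | UKf | 2Ch | Japanese N–gram | Engl. meaning |
| have sex* | 0.58 | 2.69 | 3.27 | 3.1 | [Japanese text] | have sex* |
| | | | | 1.4 | [Japanese text] | have sex* |
| | | | | 0.9 | [Japanese text] | have sex* |
| | | | | 0.4 | sex[Japanese text] | have sex* |
| | | | | 0.3 | sex[Japanese text] | have sex* |
| Country TOTALS | | | 3.27 | 6.1 | | |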
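
The restriction to verb–plus–noun phrases used for Table 9 relies on word–level N–grams for English and character–level N–grams for Japanese. A minimal sketch of both, with hypothetical function names and a deliberately simple tokeniser (lower–cased runs of letters and apostrophes), might look like this; the list of verb forms to pool is supplied by the caller and is an assumption of the example.

```python
import re


def word_ngrams(text: str, n: int) -> list:
    """Word N-grams over a crude lower-cased tokenisation of English text."""
    words = re.findall(r"[a-z']+", text.lower())
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def char_ngrams_fixed(text: str, n: int) -> list:
    """Character N-grams, as used for unsegmented Japanese text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def phrase_rate(posts, verb_forms, noun: str, per: int = 1000) -> float:
    """Occurrences of '<verb form> <noun>' as a word bigram, per `per` posts."""
    if not posts:
        raise ValueError("no posts given")
    targets = {f"{verb} {noun}" for verb in verb_forms}
    hits = sum(1 for post in posts
               for gram in word_ngrams(post, 2) if gram in targets)
    return per * hits / len(posts)
```

Pooling the inflected forms into one count corresponds to the asterisked compound rows in the tables; compounds that merely contain the noun are left out, because only bigrams beginning with one of the listed verb forms are matched.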

 

This shows how, in some cases at least, hyperlinks can inform us of certain differences between online communities in a more straightforward manner than semantic content found in the message texts, especially when different languages are being compared, unless we are prepared to do much manual checking and adjusting of raw frequencies.

4.3. Time patterns of posting activity

4.3.1. General hourly patterns

The average numbers of messages posted at different hours of the day exhibit significant divergence between our two data sets. Figure 1 clearly suggests that posting activity in the U.K. is very high during the afternoon and evening hours of the day, while Japanese posters are much more active close to and past midnight. Peak times fall within 22:00∼23:00 in the U.K. and 23:00∼24:00 in Japan. The two general hourly graphs below suggest that Japanese (JP) users are later sleepers while U.K. users post more actively during business hours.

 

Figure 1: General hourly posting patterns.
Note: the y axis indicates normalised number of posts (to 100,000).

 

4.3.2. General weekly patterns

The differences in weekly posting activity look less pronounced, but do show that Japanese posters are more active during weekends. The graph below reinforces the above impression, namely that U.K. posters are relatively more active during periods typically spent working.

 

Figure 2: General weekly posting patterns.
Note: the y axis indicates normalised number of posts (to 100,000).

 

In both countries, Tuesday is the weekday featuring the busiest posting activity [22].

4.3.3. Hourly patterns by day

Detailed graphs charting hourly activity on each day of the week (Figures 3–9) indicate stable differences in day– and night–time use patterns: U.K. posters are invariably more active during the middle hours of the day, while their Japanese counterparts always come out on top in the couple of hours immediately preceding midnight. Night–owl users are clearly more numerous in Japan.

 

Figure 3: Posting activity by hour: Monday.

 

 

Figure 4: Posting activity by hour: Tuesday.

 

 

Figure 5: Posting activity by hour: Wednesday.

 

 

Figure 6: Posting activity by hour: Thursday.

 

 

Figure 7: Posting activity by hour: Friday.

 

 

Figure 8: Posting activity by hour: Saturday.

 

 

Figure 9: Posting activity by hour: Sunday.

 

 

++++++++++

5. Conclusions and tasks for future research

The main differences that have emerged between the U.K. and Japan, based on this study’s sample of forum posters, are as follows:
i. Japanese Web–surfing behaviour is to a greater extent confined within the boundaries of the home country’s Internet space;
ii. U.K. forum users seem to concern themselves more with online images than 2Channel users, and less with music–related content; video content seems to be of equal interest;
iii. Japanese Internet users seem to rely more on government–provided information and less on Wikipedia articles; they also visit blogs more frequently;
iv. Sources of online news appear to be more dispersed in Japan, whereas in the U.K. they are concentrated in a smaller number of well–known online newspapers and news agencies;
v. Sexual talk and references are more prevalent on 2Channel than on Urban75 and Digital Spy forums; and,
vi. U.K. posters are consistently more active during business days and business hours, while Japanese posters are much more active late at night and slightly more so on weekends.

The differences suggested by the analysis of hyperlinks are all based on a crucial assumption, namely, that posted links in forum messages accurately reflect the general Web surfing behaviour of Internet users in each country. Clearly, more research needs to be done to follow up on each of the above points; in fact, the conclusions drawn here should serve as hypotheses for subsequent studies, perhaps utilising different methods or data.

We have seen that the Japanese Internet experience seems to be more closed within national boundaries. This finding corresponds to the above cited Barnett and Sung (2005), whose in–degree and out–degree measures both place Japan behind the U.K. in terms of the number of international hyperlinks leading in and out of each country. The biggest contributing factor for this situation is, no doubt, the language barrier. In an oft–cited work, Giddens (1990) convincingly makes the case for two main types of dis–embedding mechanisms at work in the processes of distantiation and globalisation: symbolic tokens and expert systems. Money is his favourite example of the former but language is not seen as a powerful mechanism in these processes. Giddens states that “language is (not) on a par with money or other disembedding mechanisms.” [23] This is clearly not the case on the Internet, where the primary boundaries limiting users’ movement in virtual space are linguistic ones. Such linguistic boundaries keep Japanese forum posters much more within their national virtual space than their U.K. counterparts, if only on account of the latter’s shared language with the United States.

Many further observations can be made on the basis of quantified semantic information obtained from hyperlinks, but that would necessitate different and more randomized data collection techniques and larger samples. For example, the hyperlink N–gram distribution

 

| LF | N–gram | 2Ch | UKf |
| 1 | football | | 0.5 |
| -1 | soccer | 0.5 | |

 

is certainly informative as to the choice of term designating a particular sport; furthermore, combining it with the observation

 

| LF | N–gram | 2Ch | UKf |
| 0.16 | sport | 1.3 | 1.8 |
| 1 | talksport | | 0.5 |

 

would make it tempting to try and interpret the remarkable coincidence in these N–gram frequencies [24]. However, the discussion threads that have gone into this study’s data samples play too big a role in determining such thematic frequencies for any interpretations of this kind to be reliable.

More randomized data collection techniques and a different approach to choosing asynchronous online conversation sources are likewise required if we are to attempt international comparisons of purely linguistic (rather than meta–linguistic) material such as phrases and words based on N–gram frequencies. Ideally, equal numbers of posts on identical topics would be collected for the target languages/communities. This is, of course, practically impossible to achieve, so appropriate sampling techniques must be utilised. Anonymity also has to be taken into account: Web etiquette in the form of greetings and expressions of gratitude can more or less easily be compared among different languages, but that has to be done for message boards allowing for similar levels of user transparency.

The value of time patterns of use obtained from Internet forums will clearly be enhanced if they are combined with research into offline social rhythms, working patterns, sleeping habits, etc. This does not diminish the utility of online temporal data, which is very easy to obtain and entails fewer concerns about invading people’s privacy or lifestyle. A larger and purposefully structured data sample would also allow for a meaningful analysis of seasonal posting patterns. On the basis of the above observations, one hypothesis to test would be whether posting levels in Japan are higher than in the U.K. during the summer holiday season or around year–end festivities, i.e., at times of low working activity. Furthermore, online social environments have emerged recently where access via mobile phone and access via desktop computer can be differentiated and the time patterns thereof compared; the temporal dynamics of such environments can reasonably be expected to be different from the PC–dominated message boards.

In the kind of N–gram analysis put forward above, no single observation can be relied upon as sound proof of certain states of affairs, but when the data is examined from numerous related angles some clear differences emerge. When empirical facts are discovered, they become hypotheses in their own right to be tested by future research. It seems, for the time being at least, that ad hoc studies of forum interactions will be the norm as opposed to the kind of systematic corpus building that Claridge (2007) envisages. However, as this paper has suggested, by using efficient tools like N–gram frequencies to analyse raw data, corpus research into online conversation can at least proceed at a more dynamic pace and not be left too far behind the ever–increasing amount and ever–accelerating morphing of computer–mediated interactions.

 

About the author

Milen Martchev is currently visiting researcher at the Graduate School of Social Studies at Hitotsubashi University, Tokyo, where he obtained his Ph.D. degree in 2008. His research focuses on linguistic innovations and behavioural patterns emerging from the medium of online conversation as well as the development of efficient data–processing techniques for analysing large corpora of electronic interactions.
Direct comments to milen [dot] martchev [at] gmail [dot] com

 

Notes

1. Cf., for example, Argamon, et al., 2007; Bumgarner, 2007; Barnes, 2006; Thelwall, 2008; and, Huberman, et al., 2009 in First Monday; or, Huffaker and Calvert, 2005; Trammel, et al., 2006; Ellison, et al., 2007; Miura and Yamashita, 2007; Pedersen and Macafee, 2007; Schmidt, 2007; Hargittai, 2008; Lewis, et al., 2008; Stefanone and Jang, 2008; and, Zywica and Danowski, 2008, in the Journal of Computer–Mediated Communication.

2. Claridge, 2007, p. 88.

3. Ibid.

4. Source: Internet World Stats, April 2009. (http://www.internetworldstats.com).

5. Shannon, 1948, p. 5.

6. Lamel and Gauvain, 2003, p. 315.

7. McAllester and Schapire, 2003, p. 272.

8. Dagan, 2000, p. 463.

9. http://www.nak.ics.keio.ac.jp/NLP/.

10. http://wwwsoc.nii.ac.jp/lsj2.

11. http://wwwsoc.nii.ac.jp/jpling/.

12. http://www.jass.ne.jp/.

13. LDC Catalog No. LDC2006T13, at http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T13.

14. A professional community forum solution by Jelsoft Enterprises (http://www.vbulletin.com/).

15. The International Computer Archive of Modern and Medieval English, at http://icame.uib.no.

16. LF is equal to the difference of the frequencies of an N–gram in each corpus divided by their sum. To be precise, the coefficient used here is the negative of the actual measure used by Leech and Fallon, for subjective reasons regarding the intuitive perception of the data layout.

17. All subsequent frequencies are also per thousand posts.

18. Google Search result counts for queries like site:.go.jp, site:.gov.uk, site:.ac.jp, and site:.ac.uk were recorded on three separate dates in March and April 2009, and this paper bases its statements on the average values of those observations.

19. From at least two European countries. As part of his PhD dissertation project, the author similarly compared the composition of hyperlinks on Japanese and Bulgarian forum sites.

20. http://mixi.jp.

21. http://www.rakuten.co.jp.

22. Incidentally, this was also the case in the said PhD (unpublished) project conducted by the author on Bulgarian Internet Forum activity.

23. Giddens, 1990, p. 23.

24. “Talksport” is a U.K. commercial sports and talk radio service, a lengthy discussion on which was included in the data sample; subtracting the frequency of talksport from that of sport gives a frequency of 1.3 which is exactly equal to that obtained from Japanese hyperlinks.

 

References

Shlomo Argamon, Moshe Koppel, James Pennebaker and Jonathan Schler, 2007. “Mining the Blogosphere: Age, gender and the varieties of self–expression,” First Monday, volume 12, number 9 (September), at http://firstmonday.org/htbin/cgiwrap/bin/ojs/index.php/fm/article/view/2003/1878, accessed 20 July 2009.

George Barnett and Eunjung Sung, 2005. “Culture and the structure of the international hyperlink network,” Journal of Computer–Mediated Communication, volume 11, issue 1, at http://jcmc.indiana.edu/vol11/issue1/barnett.html, accessed 20 July 2009.

Susan Barnes, 2006. “A privacy paradox: Social networking in the United States,” First Monday, volume 11, number 9 (September), at http://firstmonday.org/htbin/cgiwrap/bin/ojs/index.php/fm/article/view/1394/1312, accessed 20 July 2009.

Brett Bumgarner, 2007. “You have been poked: Exploring the uses and gratifications of Facebook among emerging adults,” First Monday, volume 12, number 11 (November), at http://firstmonday.org/htbin/cgiwrap/bin/ojs/index.php/fm/article/view/2026/1897, accessed 20 July 2009.

Leslie Burr and Dirk Spennemann, 2004. “Patterns of user behaviour in university online forums,” International Journal of Instructional Technology and Distance Learning, volume 1, number 10, pp. 11–28, and at http://itdl.org/Journal/Oct_04/Oct_04.pdf, accessed 20 July 2009.

Claudia Claridge, 2007. “Constructing a corpus from the Web: Message boards,” In: Marianne Hundt, Nadja Nesselhauf and Carolin Biewer (editors). Corpus linguistics and the Web. Amsterdam: Rodopi, pp. 87–104.

David Crystal, 2005. “The scope of Internet linguistics,” paper presented at the American Association for the Advancement of Science annual conference, at http://www.davidcrystal.com/DC_articles/Internet2.pdf, accessed 20 July 2009.

Ido Dagan, 2000. “Contextual word similarity,” In: H.L. Somers, Robert Dale and Hermann Moisl (editors). Handbook of natural language processing. New York: Marcel Dekker, pp. 459–475.

Nicole Ellison, Charles Steinfield and Cliff Lampe, 2007. “The benefits of Facebook ‘friends’: Social capital and college students’ use of online social network sites,” Journal of Computer–Mediated Communication, volume 12, issue 4 (July), pp. 1143–1168, and at http://jcmc.indiana.edu/vol12/issue4/ellison.html, accessed 20 July 2009.

Alex Franz and Thorsten Brants, 2006. “All our n–gram are belong to you,” Google Developer Blog, at http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html, accessed 20 July 2009.

Anthony Giddens, 1990. The consequences of modernity. Stanford, Calif.: Stanford University Press.

Scott Golder, Dennis Wilkinson and Bernardo Huberman, 2006. “Rhythms of social interaction: Messaging within a massive online network,” In: Communities and Technologies 2007: Proceedings of the Third Communities and Technologies Conference (27 June), and at http://arxiv.org/PS_cache/cs/pdf/0611/0611137v1.pdf, accessed 20 July 2009.

Eszter Hargittai, 2008. “Whose space? Differences among users and non–users of social network sites,” Journal of Computer–Mediated Communication, volume 13, issue 1, pp. 276–297, at http://jcmc.indiana.edu/vol13/issue1/hargittai.html, accessed 20 July 2009.

Geert Hofstede, 1991. Cultures and organizations: Software of the mind. London: McGraw-Hill.

Bernardo Huberman, Daniel Romero and Fang Wu, 2009. “Social networks that matter: Twitter under the microscope,” First Monday, volume 14, number 1 (January), at http://firstmonday.org/htbin/cgiwrap/bin/ojs/index.php/fm/article/view/2317/2063, accessed 20 July 2009.

David Huffaker and Sandra Calvert, 2005. “Gender, identity, and language use in teenage blogs,” Journal of Computer–Mediated Communication, volume 10, number 2, at http://jcmc.indiana.edu/vol10/issue2/huffaker.html, accessed 20 July 2009.

Peter Ingwersen, 1998. “The calculation of Web impact factors,” Journal of Documentation, volume 54, number 2, pp. 236–243. http://dx.doi.org/10.1108/EUM0000000007167

Muneo Kaigo and Isao Watanabe, 2007. “Ethos in chaos? Reaction to video files depicting socially harmful images in the Channel 2 Japanese Internet forum,” Journal of Computer–Mediated Communication, volume 12, number 4, at http://jcmc.indiana.edu/vol12/issue4/kaigo.html, accessed 20 July 2009.

Lori Lamel and Jean–Luc Gauvain, 2003. “Speech recognition,” In: Ruslan Mitkov (editor). The Oxford handbook of computational linguistics. Oxford: Oxford University Press, pp. 305–322.

Geoffrey Leech and Roger Fallon, 1992. “Computer corpora — What do they tell us about culture?” International Computer Archive of Modern English (ICAME) Journal, volume 16, pp. 29–50; and, chapter 16, In: Geoffrey Sampson and Diana McCarthy (editors). Corpus linguistics: Readings in a widening discipline. London: Continuum.

Geoffrey Leech, Paul Rayson and Andrew Wilson, 2001. Word frequencies in written and spoken English: Based on the British National Corpus. Harlow: Longman.

Kevin Lewis, Jason Kaufman and Nicholas Christakis, 2008. “The taste for privacy: An analysis of college student privacy settings in an online social network,” Journal of Computer–Mediated Communication, volume 14, issue 1, pp. 79–100, at http://www3.interscience.wiley.com/cgi-bin/fulltext/121527993/PDFSTART, accessed 20 July 2009.

David McAllester and Robert Schapire, 2003. “Learning theory and language modeling,” In: Gerhard Lakemeyer and Bernhard Nebel (editors). Exploring artificial intelligence in the new millennium. San Francisco: Morgan Kaufmann, pp. 271–287.

John Milton and Robert Freeman, 1996. “Lexical variation in the writing of Chinese learners of English,” In: Carol Percy, Charles Meyer and Ian Lancashire (editors). Synchronic corpus linguistics: Papers from the Sixteenth International Conference on English Language Research on Computerized Corpora (ICAME 16). Amsterdam: Rodopi, pp. 121–131.

Asako Miura and Kiyomi Yamashita, 2007. “Psychological and social influences on blog writing: An online survey of blog authors in Japan,” Journal of Computer–Mediated Communication, volume 12, issue 4, pp. 1452–1471, and at http://jcmc.indiana.edu/vol12/issue4/miura.html, accessed 20 July 2009.

Michael Oakes, 1998. Statistics for corpus linguistics. Edinburgh: Edinburgh University Press.

Michael Oakes and Malcolm Farrow, 2007. “Use of the chi–squared test to examine vocabulary differences in English language corpora representing seven different countries,” Journal of Literary and Linguistic Computing, volume 22, number 1, pp. 85–99. http://dx.doi.org/10.1093/llc/fql044

Han Woo Park and Mike Thelwall, 2003. “Hyperlink analyses of the World Wide Web: A review,” Journal of Computer–Mediated Communication, volume 8, issue 4, at http://jcmc.indiana.edu/vol8/issue4/park.html, accessed 20 July 2009.

Sarah Pedersen and Caroline Macafee, 2007. “Gender differences in British blogging,” Journal of Computer–Mediated Communication, volume 12, issue 4, pp. 1472–1492, and at http://jcmc.indiana.edu/vol12/issue4/pedersen.html, accessed 20 July 2009.

Jan Schmidt, 2007. “Blogging practices: An analytical framework,” Journal of Computer–Mediated Communication, volume 12, issue 4, pp. 1409–1427, and at http://jcmc.indiana.edu/vol12/issue4/schmidt.html, accessed 20 July 2009.

Claude Shannon, 1948. “A mathematical theory of communication,” Bell System Technical Journal, volume 27, pp. 379–423 and 623–656, and at http://cm.bell-labs.com/cm/ms/what/shannonday/shannon1948.pdf, accessed 20 July 2009.

Harold Somers, 2003. “Machine translation: Latest developments,” In: Ruslan Mitkov (editor). The Oxford handbook of computational linguistics. Oxford: Oxford University Press, pp. 512–528.

Michael Stefanone and Chyng–Yang Jang, 2008. “Writing for friends and family: The interpersonal nature of blogs,” Journal of Computer–Mediated Communication, volume 13, issue 1, pp. 123–140, and at http://jcmc.indiana.edu/vol13/issue1/stefanone.html, accessed 20 July 2009.

Atsufumi Suzuki, 2003. Utsukushii Nihon no Keijiban: Intaanetto Keijiban no Bunkaron [Beautiful Japanese Message Boards: A Cultural Study of Internet Forums]. Tokyo: Yosensha.

Mike Thelwall, 2008. “Text in social networking Web sites: A word frequency analysis of Live Spaces,” First Monday, volume 13, number 2 (February), at http://firstmonday.org/htbin/cgiwrap/bin/ojs/index.php/fm/article/view/2117/1939, accessed 20 July 2009.

Kaye Trammell, Alek Tarkowski, Justyna Hofmokl and Amanda Sapp, 2006. “Rzeczpospolita blogów [Republic of Blog]: Examining Polish bloggers through content analysis,” Journal of Computer–Mediated Communication, volume 11, issue 3, pp. 702–722, and at http://jcmc.indiana.edu/vol11/issue3/trammell.html, accessed 20 July 2009.

Jolene Zywica and James Danowski, 2008. “The faces of Facebookers: Investigating social enhancement and social compensation hypotheses; Predicting Facebook and offline popularity from sociability and self–esteem, and mapping the meanings of popularity with semantic networks,” Journal of Computer–Mediated Communication, volume 14, issue 1, pp. 1–34, and at http://www3.interscience.wiley.com/cgi-bin/fulltext/121527995/PDFSTART, accessed 20 July 2009.

 

Forum Web sites

2Channel, at http://www.2ch.net/, accessed 20 July 2009.

Digital Spy Forums, at http://www.digitalspy.co.uk/forums/, accessed 20 July 2009.

Urban75 Forums, at http://www.urban75.net/vbulletin/, accessed 20 July 2009.

 


Editorial history

Paper received 20 July 2009; accepted 12 September 2009.


Creative Commons License
“Patterns of online behaviour in the United Kingdom and Japan: Insights based on asynchronous online conversations” by Milen Martchev is licensed under a Creative Commons Attribution–Non–Commercial 3.0 United States License.

Patterns of online behaviour in the United Kingdom and Japan: Insights based on asynchronous online conversations
by Milen Martchev.
First Monday, Volume 14, Number 10 - 5 October 2009
http://firstmonday.org/ojs/index.php/fm/article/view/2605/2304




