First Monday

Spinning words as disguise: Shady services for ethical research? by Joseph Reagle and Manas Gaur



Abstract
Ethical researchers who want to quote public user-generated content without further exposing these sources have little guidance as to how to disguise quotes. Reagle (2021b) showed that researchers’ attempts to disguise phrases on Reddit are often haphazard and ineffective. Are there tools that can help? Automated word spinners, used to generate reams of ad-laden content, seem suited to the task. We select 10 quotations from fictional posts on r/AmItheButtface and “spin” them using Spin Rewriter and WordAi. We review the usability of the services and then (1) search for their spins on Google, and (2) ask human subjects (N=19) to judge them for fidelity. Participants also disguise three of those phrases and these are assessed for efficacy and the tactics employed. We recommend that researchers disguise their prose by substituting novel words (i.e., swapping infrequently occurring words, such as “toxic” with “radioactive”) and rearranging elements of sentence structure. The practice of testing spins, however, remains essential even when using good tactics; a Python script is provided to facilitate such testing.

Contents

Introduction
Background
Experiment 1: Locating automated spinning
Experiment 2: Surveying researchers
Discussion
Limitations and applicability
Future work
Conclusion

 


 

Introduction

Reddit is known as the “front page of the Web,” claiming “52M+ daily active users” and “100K+ communities” (Reddit, 2021). Millions of Redditors, including minors and other vulnerable populations, have thousands of subreddits to discuss extraordinarily specific and sometimes sensitive topics, including sexuality, health, violence, and drug use.

Given the public prominence, breadth, and depth of Reddit’s user-generated content, researchers use it as a data source. Proferes, et al. (2021) identified 727 such studies published between 2010 and May 2020. A fraction of these papers use what Bruckman (2002) characterized as heavy disguise, wherein usernames are elided and phrases reworded so that it’s difficult for others to locate the source. Proferes, et al. (2021) found that just 18 (2.5 percent) of their studies claimed to “paraphrase” Redditors in their reports.

The minority of researchers who claim to disguise their sources note that users’ health, relationships, employment, and legal standing are jeopardized by extra exposure. Additionally, users need not be personally identified to feel embarrassed, to be harassed, or to be forced to abandon a long-held pseudonym. Researchers themselves are at risk if their practice has fallen short of approved IRB policy or other regulations, such as health privacy regulations. Even when the use of public sources is outside institutional human subjects review, researchers might face embarrassment or repercussions if a source complains. Disguising sources’ prose can mitigate these harms.

Unfortunately, many attempts at ethical disguise fail. Reagle (2021b) interrogated 22 Reddit research reports; 19 claimed to reword phrases to disguise their sources, but only eight of those succeeded. Some researchers simply failed to reword phrases. One researcher collected and reported data they said they would not. Two others failed to scrub their reports of locatable information after they opted for heavier disguise during their reports’ review and editing. Some failed to introduce enough change into their phrases. We have no evidence Redditors were affected by these failures, but the potential for harm to them (e.g., embarrassment, harassment, or changes to employment and legal status) and to the researchers’ reputations exists.

Word spinners disguise text by substituting synonyms and rearranging, condensing, or expanding prose so that the result appears novel. Could researchers be helped by these services that automatically disguise (or “spin”) prose?

Spin Rewriter and WordAi are typically used to build content farms for “search engine optimization” (SEO) without Google detecting the spun content as copied. Spin Rewriter “takes a single article and turns it into dozens of 100% unique, human-quality articles. All these unique articles will let you rank higher, and for more profitable keywords” (NuGet, 2021). Google is likely to see a web of such interlinked articles as authoritative content, driving search traffic to the farm and increasing its advertising revenue. Like Spin Rewriter, WordAi alters sentences with some understanding of semantics, and “this high level of rewriting ensures that Google and Copyscape can’t detect your content while still remaining human readable!” (WordAi, 2021).

Perhaps these shady services, typically used by plagiarists and spammers, can be used for the ethical disguise of researchers’ online sources. We test how difficult it is to locate spun phrases, the quality of the resulting prose, and compare this with human efforts.

 

++++++++++

Background

Reddit and sensitive topics

Reddit was founded in June 2005 as a pseudonymous-friendly Web site for users to share and vote for links they had read (i.e., “I read it”). Reddit’s development as a forum of forums, where users could trivially create subreddits, each with their own moderators, led the Web site to succeed over its link-sharing peers such as Digg and Delicious. It also led to problematic content and behavior in the first half of Reddit’s life.

Like Twitter and Wikipedia, Reddit serves an extraordinary corpus of mostly public data. That is, while there are private and quarantined subreddits, the vast majority of content is public: transparently accessible to any Web browser or search engine. More so than Wikipedia and much of Twitter, Reddit hosts discussions of a personal character. Subreddits on sexuality, health (including mental health and eating disorders), interpersonal abuse and violence, and drug use and cessation have been topics of research. Reddit is a compelling and accessible venue, but with sensitive — even if public — information.

Disguising sources to mitigate their location

We speak of disguising public sources to prevent them from being located.

Bruckman (2002) identified a spectrum of disguise, from none to heavy. Under light disguise, for example, “an outsider could probably figure out who is who with a little investigation.” Under heavy disguise, some false details are introduced and verbatim quotes are avoided if a “search mechanism could link those quotes to the person in question.” If the heavy disguise is successful, “someone deliberately seeking to find a subject’s identity would likely be unable to do so.” Introducing false or combined details about a source has been referred to as fabrication, a tactic of heavy disguise. Fabrication can conflict with traditional notions of research rigor and integrity. Markham (2012) argues that if done with care, fabrication can be the most ethical approach. If not done with care, however, fabrication can lead to suspicions of fraud (Singal, 2016).

In human subjects research, such as healthcare, de-identification “involves the removal of personally identifying information in order to protect personal privacy” (EDUCAUSE, 2015). Anonymized is sometimes used synonymously with de-identified, or can have a stronger connotation of data being rendered incapable of being re-identified (Lubarsky, 2017). We avoid anonymized because it is far too assured a word given the known cases of failure (Bradbury, 2021; Ohm, 2010). And in public data contexts, there might not be personally identifiable information given the use of pseudonyms. Reagle (2021b) provides a more complete review of alternative terms (and of the ethical question of the following section), but we speak of testing the locatability of disguised phrases, especially those spun by automated services. When sources are located, user accounts could be further re-identified by adversaries.

Should researchers disguise public sources?

If and when to disguise is an ongoing conversation among researchers and ethicists. For example, Sharf [1] argued that researchers should seek the consent of public sources and “implied consent should not be presumed if the writer does not respond.” Rodham and Gavin (2006) responded “that this is an unnecessarily extreme position to take” and wrote, “messages which are posted on such open forums are public acts, deliberately intended for public consumption.” There is a substantive disagreement here, but there are also issues of definition. Sharf was studying a breast cancer e-mail list (“public” because the list is “open” for anyone to join), whereas Rodham and Gavin’s sense (i.e., “intended for public consumption”) permits the content to be transparently accessed by third-party search and archival services. These two senses of “public” can have different ethical implications.

Additionally, whether researchers should disguise is dependent on site-specific considerations, be it at Wikipedia, a mostly-pseudonymous encyclopedia (Pentzold, 2017), at 4chan, a highly-anonymous discussion board (Zelenkauskaite, et al., 2021), at sites where “we are studying people who deserve credit for their work” (Bruckman, et al., 2015), or public sites where people, nonetheless, discuss sensitive topics or share images (Andalibi, et al., 2017; Ayers, et al., 2018; Chen, et al., 2021; Dym and Fiesler, 2020; Fiesler and Proferes, 2018; Haimson, et al., 2016; Sowles, et al., 2017). Additionally, Web sites have affordances that affect how sources can be located, such as novel search capabilities or external archives (Reagle, 2021b).

We take no position on whether researchers should disguise their sources. Rather, we focus on tactics available to researchers who want to disguise their quotes, because they often do a poor job of it (Ayers, et al., 2018; Reagle, 2021b).

Locating sources

Concerned researchers have started to assess how often usernames, quotations, and media are included in research reports.

Ayers, et al. (2018) analyzed 112 health-related papers discussing Twitter and found 72 percent quoted a tweet, “of these, we identified at least one quoted account holder, representing 84%.” When usernames were disclosed, in 21 percent of the papers, all were trivially located. Ayers, et al. wrote that these practices violate International Committee of Medical Journal Editors (ICMJE) ethics standards because (1) Twitter users might protect or delete messages after collection, and (2) revealing this information has no scientific value.

Proferes, et al. (2021) performed a systematic overview of 727 research studies that used Reddit data and were published between 2010 and May 2020. “Sixty eight manuscripts (9.4%) explicitly mentioned identifiable Reddit usernames in their paper and 659 (90.7%) did not. Two hundred and seven papers (28.5%) used direct quotes from users as part of their publications, 18 papers used paraphrased quotes, noting they were paraphrased (2.5%) and 502 (69.1%) did not include direct quotes.” [2]

In the studies noted earlier, researchers who paraphrase quotes are found to be rare and laudable. But are such paraphrases effective? Reagle (2021b) interrogated 22 Reddit research reports: three of light disguise, using verbatim quotes, and 19 of heavier disguise, claiming to use reworded phrases. They concluded that disguising sources is effective only if done and tested rigorously because they located all of the verbatim sources (3/3 reports) and many of the reworded sources (11/19 reports). Researchers who elided the forum and year or collected data over multiples thereof were less likely to have their sources located. Conversely, if locating a source is like finding a needle in a haystack, “reports that focus on a single subreddit (as stated or inferred) in a single year winnow away much of the hay,” making the search easier (Reagle, 2021b). A few researchers admirably tested their disguises in Google, though their success was dependent on the specificity of their queries.

Spinning words

Unlike some writing tools, such as QuillBot (2021), that can rewrite prose for clarity’s sake, spinners are designed to be used at scale — creating dozens of variations — while avoiding detection. That is, their spins should be natural sounding and true to the original source without being detected as a copy.

Spin Rewriter was launched in 2011 by Aaron Sustar (2016). In addition to an interactive human interface, it provides a WordPress (blogging platform) plugin and an API with libraries for C#, JavaScript, PHP, and Python (Spin Rewriter API SDK, 2021). As of May 2021, the service costs US$47 a month or US$77 a year with a special discount. WordAi was launched in 2012 by Alex Cardinell (2012). It too offers an interactive Web site and API for US$49.95 a month or US$299.40 a year.

To understand how word spinning works, consider this Spin Rewriter example and the following source prose:

What’s worse is that friend doesn’t seem to understand what the problem is with some stranger coming into our living space and basically wags his tail every time he sees this stranger.

Spin Rewriter, as does WordAi, provides spintax so that the client can see the range of available variation, using curly brackets {..} to represent grouping and pipes | to represent alternatives. In this case:

What’s {worse| even worse} is that {friend| buddy| pal| good friend} {doesn’t| does not} {seem| appear} to {understand| comprehend} what the {problem| issue} is with some {stranger| complete stranger} {coming into| entering| entering into} our {living space| home} and {basically| essentially| generally} wags his tail {every time| each time| whenever} he sees this {stranger| complete stranger}.

One such instance of spun prose is:

What’s worse is that friend does not appear to understand what the issue is with some stranger coming into our living space and generally wags his tail every time he sees this complete stranger.
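To make the spintax notation concrete, the following is a minimal Python sketch of how flat (non-nested) spintax could be expanded into its variants. It is not the implementation used by Spin Rewriter or WordAi. The function names (parse, variants, count) are hypothetical, and stripping the space after each pipe is an assumption based on the example above.

```python
import itertools
import math
import re

# Matches one {a| b| c} group that contains no nested braces.
GROUP = re.compile(r"\{([^{}]*)\}")

def parse(spintax):
    """Split spintax into slots; each slot is a list of alternatives."""
    parts = GROUP.split(spintax)  # literal text at even indices, groups at odd
    slots = []
    for i, part in enumerate(parts):
        if i % 2 == 0:
            slots.append([part])  # literal text: a single "alternative"
        else:
            slots.append([opt.strip() for opt in part.split("|")])
    return slots

def variants(spintax):
    """Lazily yield every spun variant encoded by the spintax."""
    for combo in itertools.product(*parse(spintax)):
        yield "".join(combo)

def count(spintax):
    """Number of distinct variants, without generating them."""
    return math.prod(len(slot) for slot in parse(spintax))

# An excerpt of the spintax shown above.
EXCERPT = ("some {stranger| complete stranger} "
           "{coming into| entering| entering into} our {living space| home}")

print(count(EXCERPT))           # 12
print(next(variants(EXCERPT)))  # some stranger coming into our living space
```

Applied to the full spintax above, with its twelve groups, the same count is 2·4·2·2·2·2·2·3·2·3·3·2 = 27,648 possible spins of a single sentence, which illustrates how these services can produce variation at scale.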

We will assess spins for their (non)locatability and fidelity to meaning and fluency. Locatability should be inversely related to computational metrics of “lexical dissimilarity”: “how much has the paraphrase changed the original sentence?” [3] The more dissimilar the spin is from its source phrase, the less likely a search engine will return the source. However, dissimilarity is a static function of the phrases; locatability is the result of human searches of a dynamic Web using ever-changing indexes and algorithms. Our use of fidelity combines the “adequacy” of meaning preservation and the “fluency” of the result [4]. “Semantic completeness” is an alternative, and preferred, term to “adequacy” [5].

Our impressions of these services as tools for ethical disguise are as follows.

The Spin Rewriter Web site was easy to use. The spun content was comprehensible and had few tells of artificial generation. Given these services operate on the shady edges of the Web, where services require credit card information during a trial period and might create spurious charges, it was a relief that the Spin Rewriter account was easily canceled, though not without many promotional e-mail messages urging a return to the service.

WordAi was not as polished and the results had a noticeable tell: the capitalization of words was odd, such as: “Fortunately a Couple of Days Back That the door was left open ...” (We manually corrected these before showing them to human subjects because we wondered whether we had made a mistake in configuration, the corrections were easy to do, and they made the comparisons more interesting.) Cancellation required that customer service be contacted, though a representative said that canceling via the Web site should soon be available. (We did not test the new “5.0” WordAi, which was released in the last week of June 2021.)

There is significant computer science literature on “paraphrasing” text using artificial intelligence techniques (Androutsopoulos and Malakasiotis, 2010; Celikyilmaz, et al., 2021). But word spinning, as an accessible service to non-technical users, is little discussed in the literature. We’d like to see any researcher avail themselves of such a tool, even if they lack the technical expertise (Zelenkauskaite and Bucy, 2016) to understand or implement semantic modeling and transformations. In the educational context, Kannangara (2017) found that word spinners rarely improve prose quality and successfully evade plagiarism detection. The following experiments test these findings from the perspective of a researcher disguising online sources in their reports.

 

++++++++++

Experiment 1: Locating automated spinning

On Reddit, a post is followed by comments within a thread. Posts and comments are, generically, messages. We used five posts tagged as fictional on the subreddit r/AmItheButtface. This subreddit is “the cool, relaxed, bastard nephew of /r/AmItheAsshole” (r/AmItheButtface, 2020). It allows fictional posts, which are often scenarios from popular media (e.g., a TV character looking for his underwear) or the antics of toddlers and pets (e.g., a cat who enjoys swatting knickknacks off the mantel). Consequently, our phrases have the form of personal and possibly sensitive advice disclosures but are labeled as fictional by their authors.

For each of the five posts, we selected a phrase from the post and from a comment, yielding 10 quotes altogether.

Users of spinners can configure the spins, varying the amount of fidelity and structural changes performed, as well as providing custom word lists to the spinners. We typically opted for the default settings. At Spin Rewriter, we selected “Most readable: only use synonyms that are definitely correct.” Though we experimented with “Very readable” at WordAi, we opted for the default “Readable” as it provided more varied prose with no loss in fidelity. Playing with the options for rearranging sentences and paragraphs didn’t seem to be of consequence given the source phrases were short.

We developed the reddit-search.py GPLv3-licensed script (Reagle, 2021a) to help locate phrases within the first page of search results. The script iterates through the phrases in a spreadsheet, building search engine queries and opening the results in browser tabs for manual scrutiny. It queries the search engines at Google, Reddit, and RedditSearch/Pushshift (Baumgartner, et al., 2020). To ease the testing of disguise, the script can automatically check the search results for the sources’ URLs if provided in the spreadsheet’s url column.
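As a rough illustration of this workflow, and not a reproduction of reddit-search.py itself, the following Python sketch builds search URLs for each phrase in a small spreadsheet and checks a list of result URLs against the source URL. The helper names (build_queries, source_found), the phrase column name, and the sample row are hypothetical; only the url column is named in the description above. The RedditSearch/Pushshift engine is omitted, and comparing URL paths is a simplifying assumption.

```python
import csv
import io
from urllib.parse import quote_plus, urlparse

def build_queries(phrase):
    """Return one search URL per engine for the given phrase."""
    q = quote_plus(phrase)
    return {
        "google": "https://www.google.com/search?q=" + q,
        "reddit": "https://www.reddit.com/search/?q=" + q,
    }

def source_found(source_url, result_urls):
    """True if any result shares the source's URL path (host is ignored)."""
    def path(url):
        return urlparse(url).path.rstrip("/")
    return any(path(source_url) == path(u) for u in result_urls)

# A stand-in for the spreadsheet; the row below is made up for illustration.
SHEET = io.StringIO(
    "phrase,url\n"
    "found that mouse,https://www.reddit.com/r/example/comments/abc123/post/\n"
)

for row in csv.DictReader(SHEET):
    for engine, url in build_queries(row["phrase"]).items():
        print(engine, url)
```

In practice the result URLs would come from the first page of each engine's results, whether gathered manually or by the script; a disguise that still leads back to the source's URL has failed the test.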

For the present study, only Google results are reported because the Reddit-specific engines were ineffective against these non-verbatim queries.

Ethical policy

Redditors of r/AmItheButtface did not consent to the use of their prose. Though we mention the subreddit and provide a few quotes that could be searched for by readers of the present report, we elide usernames and dates. We believe lack of consent and light disguise are appropriate: posts were in a public forum, from obvious pseudonyms, marked as fictional by their authors, and had existed for more than five months without deletion at the time of capture.

This policy is part of Institutional Review Board application #20-08-30 by the first author and “approved” as DHHS Review Category #2: “Exempt ... No further action or IRB oversight is required as long as the project remains the same.” The second author joined the project later and only had access to the data within this report itself.

Results

Table 1 includes 10 source phrases, their spins, a metric of their dissimilarity from the source, and whether the spins were found. Word Mover’s Distance (WMD) measures the semantic difference between two sentences, where higher numbers indicate greater difference and zero means identical (Kusner, et al., 2015). WMD is also provided in Table 4.
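For readers unfamiliar with the metric, the following Python sketch computes the relaxed lower bound on WMD described by Kusner, et al. (2015), in which each word simply travels to its nearest word in the other sentence. It is not how the values in the tables were computed. The two-dimensional vectors are invented purely for illustration, whereas real WMD relies on pretrained word embeddings, and the function names are hypothetical.

```python
import math
from collections import Counter

# Toy 2-D "embeddings", made up for this example only.
TOY_VECTORS = {
    "your": (1.0, 0.0),
    "parents": (0.0, 1.0),
    "are": (1.0, 1.0),
    "toxic": (3.0, 3.0),
    "poisonous": (3.0, 3.5),
    "radioactive": (5.0, 1.0),
}

def nbow(tokens):
    """Normalized bag of words: each word's share of the sentence."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

def one_sided(src, dst):
    """Cost of moving every src word to its nearest word in dst."""
    return sum(
        weight * min(math.dist(TOY_VECTORS[w], TOY_VECTORS[v]) for v in set(dst))
        for w, weight in nbow(src).items()
    )

def relaxed_wmd(a, b):
    """Relaxed WMD: the larger of the two one-sided costs."""
    return max(one_sided(a, b), one_sided(b, a))

source = "your parents are toxic".split()
near = "your parents are poisonous".split()
far = "your parents are radioactive".split()

print(relaxed_wmd(source, source))  # identical sentences give zero
print(relaxed_wmd(source, near))    # a close synonym gives a small distance
print(relaxed_wmd(source, far))     # a more novel word gives a larger distance
```

With these toy vectors, the distance grows as the substituted word moves farther from the original in the embedding space, which is the behavior the WMD column is meant to capture.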

 

Table 1: Reddit phrases and automated spins
Source | Phrases | WMD | Found
1. Reddit | Luckily a few days ago the door was left slightly open when my mom was out and I went inside and found that mouse. | |
Spin Rewriter | Fortunately a few days ago the door was left somewhat open when my mom was out and I went within and found that mouse. | 0.393 |
WordAi | Fortunately a Couple of Days Back That the door was left open when my Mother was out and I went inside and Discovered that mouse. | 0.472 |
2. Reddit | Maybe she was just surprised and needed a few minutes to think about how to reciprocate this thoughtful gift you gave her. | |
Spin Rewriter | Maybe she was just stunned and needed a few minutes to consider how to reciprocate this thoughtful present you provided her. | 0.347 |
WordAi | Perhaps she was surprised and wanted a Couple of minutes to Consider how to exude this thoughtful gift you gave her. | 0.307 |
3. Reddit | What’s worse is that friend doesn’t seem to understand what the problem is with some stranger coming into our living space. | |
Spin Rewriter | What’s even worse is that buddy does not appear to understand what the problem is with some stranger coming into our home. | 0.244 | Google
WordAi | What is worse is that that friend does not Appear to know what the issue is with a stranger coming to our living area. | 0.303 | Google
4. Reddit | I think you should have made your opinion known on the first night. You let ambiguity regarding your feelings develop. | |
Spin Rewriter | I believe you must have made your viewpoint known on the first night. You let uncertainty concerning your feelings establish. | 0.441 |
WordAi | That I believe that you need to have made your opinion known about the very first night. You allow ambiguity regarding your emotions grow. | 0.368 |
5. Reddit | When I first got into her bedroom, I quickly rummaged through the piles of clothes sitting on the floor to confirm my suspicions. | |
Spin Rewriter | When I initially entered into her bed room, I quickly searched through the stacks of clothing sitting on the floor to verify my suspicions. | 0.324 |
WordAi | When I got into her bedroom, then I quickly rummaged through the piles of clothing sitting on the ground to verify my feelings. | 0.364 |
6. Reddit | She has her boundaries for a reason — she only wants people she can trust in her room, people who won’t go digging through her shit & taking things. | |
Spin Rewriter | She has her borders for a factor — she just wants individuals she can trust in her space, individuals who will not go digging through her shit & taking things. | 0.524 |
WordAi | She’s her bounds for a reason — she just needs people she can trust in her room, individuals that will not go digging through her shit & doing matters. | 0.372 |
7. Reddit | At this point, I had absolutely no choice but to press the glowing button on the Xbox and put an end to the madness. | |
Spin Rewriter | At this point, I had definitely no choice but to press the glowing button on the Xbox and put an end to the insanity. | 0.230 |
WordAi | Now, I had no option but to press on the glowing button on the Xbox and put a stop to the insanity. | 0.450 |
8. Reddit | Your parents are toxic. Your mom sounds like a narcissist. Your dad just stood by while she said that to you? He’s enabling her. Don’t let them gaslight you. | |
Spin Rewriter | Your moms and dads are hazardous. Your mama seems like a narcissist. Your daddy just waited while she stated that to you? He’s enabling her. Don’t let them gaslight you. | 0.298 |
WordAi | Your parents are poisonous. Your mother sounds like a narcissist. Your daddy just stood while she explained that for you? He is enabling her. Do not let them gaslight you. | 0.207 |
9. Reddit | Anyway, I told her to stop coming around, but she wouldn’t stop. I ended up having to call the cops on her to keep her away. | |
Spin Rewriter | Anyway, I told her to stop occurring, but she would not stop. I wound up having to call the police officers on her to keep her away. | 0.270 |
WordAi | Anyhow, I advised her to quit coming about, but she would not stop. I ended up needing to call the cops on her to maintain her away. | 0.302 |
10. Reddit | Dying in hospitals is a new thing; before that, for centuries and centuries, people have died at home. | |
Spin Rewriter | Passing away in healthcare facilities is a new thing; before that, for centuries and centuries, individuals have died at home.. | 0.370 | Google
WordAi | Dying in hospitals is a brand new item; earlier this, for centuries and centuries, people have died at home. | 0.232 |

 

The tactics employed on these short phrases by the spinners are simple: single-word substitutions. In phrase 8 we see that Spin Rewriter (awkwardly) replaced “parents” with “moms and dads” but this rare multi-word substitution is still a single substitution rather than a substitution across many words that is comprehensive of semantics.

Google located both spins of phrase 3; we suspect there were not enough words with applicable synonyms. For phrase 10, Google located the Spin Rewriter version but not WordAi’s. This was because WordAi was more aggressive: replacing “new thing” with the awkward “brand new item.” This diverted Google but sounds artificial. Fidelity and variation are often balanced against each other.

Generally, we found Spin Rewriter’s prose was more fluent, especially given WordAi’s odd capitalization in a few examples. Is this impression shared by others? And how do human subjects spin phrases?

 

++++++++++

Experiment 2: Surveying researchers

In May 2021, 20 people completed an online survey via a Google Form. We solicited participants from the Third Annual Obfuscation Workshop and on the e-mail list of the Association of Internet Researchers (AoIR). One person withdrew at the final stage of selecting “submit” or “withdraw,” for unknown reasons, and their data was not included (N=19).

Both of the solicited communities include people interested or engaged in the practice of ethical disguise. However, this is not a representative sample of those who use ethical disguise in their research reports. Even so, their responses do lead to useful insights about how researchers might spin sources’ phrases and the efficacy of those tactics. The form asked participants to fill in their occupation, which can be summarized as:

•9 researchers
       ◦6 of which specified a type of social scientist
       ◦1 of which specified data analyst
•5 students
       ◦3 of which specified Ph.D. level
•1 each of designer, digital content creator, educator, and engineer

Responses are indexed to the row in the resulting Google Form spreadsheet. For example, R02 is the first response, given that the first row has column headings. The two phases of the survey consisted of participants (1) performing their spins of three example phrases, and (2) judging the performance of Spin Rewriter and WordAi. In the experiment, subjects performed their own spins before exposure to automated examples, though this order is reversed in the sections below.

Ethical policy

Participants assented via a consent form that was the first page of the online Google Form. No identifying information was provided in this report or the publicly available data.

This policy was part of the same approved application mentioned in Experiment 1.

Experiment 2.1: Judging automated spinning

Phrases from r/AmItheButtface (1–3, in Table 1) were spun with Spin Rewriter and WordAi, with the latter’s odd capitalization corrected. Subjects were asked: “Given an original quote and two disguised versions, select the one you think is better with respect to non-discoverability and fidelity. Select ‘equivalent’ if you think them so.” The results are shown in Table 2. One subject wrote they did not understand the “equivalent” option and expressed a preference for each spinner.

 

Table 2: Subjects’ preferred spins.
Phrase | WordAi | Spin Rewriter | Equivalent
1 | 2 | 12 | 5
2 | 8 | 6 | 5
3 | 4 | 6 | 9
Total | 14 | 24 | 19

 

Spin Rewriter is favored by subjects, primarily on the strength of the first phrase’s spins. (An analysis of variance shows a statistically significant preference for Spin Rewriter on phrase 1: F = 7.56, p = 0.001 < 0.05.) Even so, a fair number of people expressed no preference between the two.
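The counts in Table 2 can be tallied with a few lines of Python. This is only an arithmetic check on the reported figures, not the analysis of variance mentioned above, and the variable names are illustrative.

```python
# Counts copied from Table 2: phrase -> votes per option.
TABLE_2 = {
    1: {"WordAi": 2, "Spin Rewriter": 12, "Equivalent": 5},
    2: {"WordAi": 8, "Spin Rewriter": 6, "Equivalent": 5},
    3: {"WordAi": 4, "Spin Rewriter": 6, "Equivalent": 9},
}

# Every phrase was judged by the same 19 subjects.
row_sums = {phrase: sum(votes.values()) for phrase, votes in TABLE_2.items()}

# Column totals across the three phrases.
totals = {}
for votes in TABLE_2.values():
    for option, n in votes.items():
        totals[option] = totals.get(option, 0) + n

all_votes = sum(totals.values())  # 3 phrases x 19 subjects = 57 judgments
for option, n in totals.items():
    print(f"{option}: {n} of {all_votes} ({n / all_votes:.1%})")
```

Of the 57 judgments, Spin Rewriter received 24 (about 42 percent), “equivalent” 19 (about 33 percent), and WordAi 14 (about 25 percent); half of Spin Rewriter’s votes came from phrase 1.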

Experiment 2.2: Tactics of human disguise

Subjects (N=18, R04 abstained from this portion) were asked to disguise phrases 6–8: “The following quotes are from pseudonymous and fictional posts on an advice forum. Disguise (fuzz) them to the degree that you think they will not be discovered via a search engine while maximizing fidelity to the original.” The (colored) boxes of dotted and dashed lines were not visible to participants; instead, we provide them to ease analysis and understanding.

phrase 6
       She has her boundaries for a reason — she only wants people she can trust in her room, people who won’t go digging through her shit & taking things.
phrase 7
       At this point, I had absolutely no choice but to press the glowing button on the Xbox and put an end to the madness.
phrase 8
       Your parents are toxic. Your mom sounds like a narcissist. Your dad just stood by while she said that to you? He’s enabling her. Don’t let them gaslight you.

In the resulting data (reddit-mask-survey-spins.csv) and its coding (reddit-mask-survey-spins-coded.xlsx) we see the following spinning tactics.

  1. ungendered nouns and pronouns
  2. single-word substitutions
  3. multi-word substitutions
  4. rearranged sentence structure
  5. removed elements of sentence structure

R12, for example, replaced the gendered “mom” and “dad” with “parent.” If you recall, Spin Rewriter clumsily did the opposite, replacing “parents” with “moms and dads.”

R12’s phrase 8
       Your parents are toxic. Your parent sounds like a narcissist. Your other parent just stood by while she said that to you? (...) enabling her. Don’t let them gaslight you.

R19 replaced the gendered pronoun “she” with the singular “they” while accidentally preserving gender at the end of their phrase.

R19’s phrase 6
       They have their boundaries for a reason. They only want people they can trust in their room, people who will not go digging through her things.

No one chose to reverse the genders of those discussed.

On phrase 7, R02’s spin showed multi-word substitutions and the rearrangement of the three elements: (a) lack of choice; (b) pressing a button; and, (c) ending madness. R02 replaced “put an end to the madness” with “stop the craziness” and moved that element to the middle of the sentence.

Original phrase 7
       At this point, I had absolutely no choice but to press the glowing button on the Xbox and put an end to the madness.

R02’s phrase 7
       At that very moment there was only one way to stop the craziness: to push the lighted X-box switch.

As seen in Table 3, almost all subjects transformed the phrases with multi-word substitutions. About a third of them rearranged the positions of major elements of the phrase. Only a few subjects exclusively used single-word substitutions or removed elements of the phrase.

 

Table 3: Subjects’ spin tactics.
Phrase | Ungender | Singles | Multiples | Rearrangements | Removals
6 | 1 | 1 | 16 | 6 | 5
7 | | 0 | 17 | 5 | 3
8 | 2 | 1 | 17 | 8 | 4

 

Two of the spins exemplify the tension of balancing fidelity (of prose) against fecundity (of variation).

Across all phrases, R08 aggressively minimized the prose. For phrase 7, they maintained the element of reaching a moment without choice, but removed the elements of turning off a game and ending “madness.”

R08’s phrase 7
       You’ve got to know when to say No.

Across all phrases, R12 tended to maintain the original prose with some words trimmed via ellipses (three times in phrase 6; once in phrases 7 and 8). In phrase 7, they removed “At this point,”.

R08’s spin should not be locatable, but it strays far from the original meaning. R12’s spin maintains much fidelity but might be locatable.

R12’s phrase 7
       (...) I had absolutely no choice but to press the glowing button on the Xbox and put an end to the madness.

Experiment 2.3: Locating human disguises

We used a Python script developed to facilitate the search for Reddit sources (Reagle, 2021a) and reviewed all hits on the first page of Google results (between 7 and 20 results). After confirming the original versions of phrases 6, 7, and 8 could be located, we tested 54 human disguises (3 phrases by 18 participants) and the six automated disguises from Table 1. Table 4 shows the located spins (see reddit-mask-survey-spins.csv for all data).

 

Table 4: Located disguises for phrases from Table 1.
Phrase | Subject | Spun | WMD | Found
6 | R12 | has (...) boundaries for a reason (...) only wants people (...) can trust in (...) room, people who won’t go digging through her shit and taking things. | 0.306 | Google
6 | R19 | They have their boundaries for a reason. They only want people they can trust in their room, people who will not go digging through her things. | 0.294 | Google
7 | R15 | Now I had to push the button on the console and stop the madness. | 0.478 | Google
8 | R06 | Your mom is a narcissist, and your dad just stood by when she said that to you! They are toxic, don’t let them get the better of you. | 0.479 | Google
8 | R11 | Your mom and dad are toxic. She seems to be narcissistic and if your dad just stood by while she said those things, then he’s enabling her. They are gaslighting you, don’t let them do it. | 0.472 | Google
8 | R14 | You need to not let your parents gaslight you. They seem toxic and I can’t believe your dad just stood by while your narcissistic mother said those things to you. | 0.349 | Google
8 | R16 | Your dad just standing by while she said that shows that he is enabling your narcissistic mother, and both of them are toxic. You shouldn’t let your parents gaslight you. | 0.313 | Google

 

Recall, from Table 1, that the automated spins of phrases 3 and 10 were located via Google searches. Here, we located the sources of phrases 6, 7, and 8. (This sequence is a coincidence, nothing more.)

It’s impossible to know why Google returned a source in the first page of results for these disguises: it’s a complex and opaque algorithm. (And as Google’s algorithm changes, so could these results.) However, when considering the human tactics mentioned above, we suspect:

• R12’s ellipses provide no disguise for phrase 6 because they are ignored by Google. R12’s other spins survived because R12 also did limited substitutions.
• R19’s use of singular “their” alone was not a sufficient disguise for phrase 6.
       ◦ We tested correcting the concluding “her” to a “their,” but it makes no difference.
• R15’s spin was not sufficient for phrase 7 for reasons unknown; other spins with similar tactics sufficed.
• Four of the subjects’ phrase 8 spins failed as disguise because the phrase is long and has unusual words (i.e., “toxic,” “narcissistic,” and “gaslight”).
       ◦ Even R14’s spin, which moved the “gaslight” element to the start of the sentence and otherwise admirably rearranged the sentence, failed.
       ◦ Replacing “toxic” with “radioactive” suffices, probably because “radioactive” is a novel term that distracts Google.
• Removing the subreddit from the query removed the disguised phrases from the first page of Google’s results. The exception is R12’s spin of phrase 6, which only used ellipses, and remained on the first page, moving from the 1st to 15th result.
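The query variations discussed in the list above (exact versus inexact, with and without the subreddit) can be sketched as plain query strings. This is an illustrative assumption about how such queries might be composed, not the queries used in the study: `build_queries` is a hypothetical helper, and scoping with `site:reddit.com` is our choice for the example. Quotation marks and `site:` are standard Google search operators.

```python
from typing import Dict, Optional


def build_queries(phrase: str, subreddit: Optional[str] = None) -> Dict[str, str]:
    """Compose exact and inexact Google queries for a disguised phrase.

    The subreddit, when given, narrows the search (a smaller haystack).
    """
    scope = "site:reddit.com"
    if subreddit:
        scope = f"{scope} r/{subreddit}"
    return {
        "exact": f'{scope} "{phrase}"',  # quoted: match the exact wording
        "inexact": f"{scope} {phrase}",  # unquoted: let Google match loosely
    }


spin = "Now I had to push the button on the console and stop the madness."
for scoped in ("AmItheButtface", None):
    for kind, query in build_queries(spin, scoped).items():
        print(kind, "->", query)
```

Running each variant and checking the first page of results for the source gives a small test matrix per disguise. A disguise that survives only the unscoped queries is still at risk from a reader who knows the forum.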

 

++++++++++

Discussion

Rewording prose can be part of effective disguise, especially the combination of:

  1. multiple-word substitutions, focusing on novel words, such as “radioactive” in place of “toxic,” as well as proper nouns and names;
  2. altering or removing elements of sentence structure.

No single tactic, however, is sufficient, and successful disguise is at risk when the source is novel and when the scope of the search is narrow. As Reagle (2021b) noted, it is easy to find a shiny needle (i.e., unusual words) in a small amount of hay (i.e., a given subreddit in a given year).
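As a rough sketch of the first tactic, a researcher might flag the unusual words in a phrase and then substitute them. Everything here is illustrative: `COMMON_WORDS` is a tiny stand-in for a real word-frequency list, and `novel_words` and `substitute` are hypothetical helpers, not part of any tool used in the study.

```python
import re
from typing import Dict, List, Set

# Tiny illustrative stand-in for a real frequency list (an assumption).
COMMON_WORDS: Set[str] = {"your", "mom", "and", "dad", "are", "they", "them"}


def novel_words(phrase: str, common: Set[str] = COMMON_WORDS) -> List[str]:
    """Return the distinct words in the phrase that are not in the common set."""
    found: List[str] = []
    for token in re.findall(r"[a-z']+", phrase.lower()):
        if token not in common and token not in found:
            found.append(token)
    return found


def substitute(phrase: str, replacements: Dict[str, str]) -> str:
    """Replace whole-word occurrences of each key with its value."""
    for old, new in replacements.items():
        phrase = re.sub(rf"\b{re.escape(old)}\b", new, phrase, flags=re.IGNORECASE)
    return phrase


phrase = "Your mom and dad are toxic."
print(novel_words(phrase))
print(substitute(phrase, {"toxic": "radioactive"}))
```

Flagging only identifies candidates; the replacement still has to be chosen by a person and the resulting spin tested against a search engine.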

For researchers who want to disguise their sources, automated spinners are viable starting tools. Despite their limitations, both services did well, though WordAi oddly capitalized some words. The spinners would still need some configuration and experimentation, but their use is more about scale and cost than quality. If a monthly or annual fee saves the researcher time and money, spinners are worth considering. QuillBot (2021) is another fee-based service; it yields spins of similar quality for free on phrases of fewer than 700 characters.

Ultimately, even good automated or human spins can fail as effective disguise. R15’s spin in Table 4 is an example of this: we located the source despite multiple-word substitutions and rearranged structure. The most important practice is testing spins to see if their queries yield their sources on the first page of search results.

 

++++++++++

Limitations and applicability

This work is relatively small in scale, and searching for and assessing spun phrases is subjective and idiosyncratic. The current work covers 10 phrases, two automated spinners, and 19 human subjects. We used Reddit posts that were tagged as fictional, focused on Google searches, limited queries to exact and inexact variations with some experimentation, and scrutinized only the first page of results.

Because the intention is to limit others from locating research sources, the choices and efforts made here likely exceed those of most members of the public. Testing additional phrases, spinners, or search engines would likely not increase the insights we gained. (Bing and DuckDuckGo were used to search for the disguised phrases Google found in Table 4; they found nothing and are likely no match for Google.)

Despite these limitations, we believe our recommendations are suitable for disguising instances on platforms other than Reddit. Of course, site-specific considerations are important. In their review of Reddit research, Reagle (2021b) also searched for phrases using Reddit itself and Pushshift’s RedditSearch. Site-specificity, however, does not negate our general suggestions of substituting novel words and rearranging sentence structure. What it means is that researchers should test their disguises against whatever other indexes and search services are relevant to their sites of study.

An important limitation is that the present study is static, and the field of study is dynamic. Forums, like Reddit, often make changes that affect their features and how legible they are to external services. Google, and other search engines, are continually updating their algorithms, affecting what users can find. And the larger information infrastructure evolves. For example, as an undergrad, one of us frequented the Internet’s Usenet (est. 1980), a massive decentralized discussion forum that predated the World Wide Web (est. 1991) and Reddit (est. 2005). As a student, he thought he was posting to a relatively ephemeral venue because storage was limited and most servers deleted messages after a few months. A Web-based archive of much of Usenet was made available by Deja News in 1995; they were bought and integrated into Google Search in 2001. Old posts had a visibility and lifespan not previously conceived. Perhaps one day RedditSearch/Pushshift will support inexact/elastic searches rivaling Google. This anecdote shouldn’t be taken as an excuse to do nothing. Rather, it means we should be as informed and rigorous as possible and be careful of the assumptions we make.

 

++++++++++

Future work

Our intention is to make recommendations to practitioners of ethical disguise: we test extant services, recommend specific tactics, and offer a script for testing disguises. Yet, more work is needed on the technical and applied fronts.

First, we make little use of the Word Mover’s Distance (WMD) metrics in Tables 1 and 4. WMD and other measures of difference between phrases should be assessed for their ability to predict the efficacy of a disguise. Again, a disguise’s “lexical dissimilarity” from its source does not guarantee non-locatability, but perhaps there is a threshold below which a disguise is likely to be insufficient.
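As a small starting point, the WMD values of the located human disguises in Table 4 can be summarized directly. The sketch below only restates those seven values. Estimating a threshold would also need the values for disguises that were not located, which are in the accompanying data file and not reproduced here.

```python
# WMD values of the located human disguises, copied from Table 4.
# Keys are (phrase number, subject id).
located_wmd = {
    (6, "R12"): 0.306,
    (6, "R19"): 0.294,
    (7, "R15"): 0.478,
    (8, "R06"): 0.479,
    (8, "R11"): 0.472,
    (8, "R14"): 0.349,
    (8, "R16"): 0.313,
}

lowest = min(located_wmd.values())
highest = max(located_wmd.values())
print(f"Located disguises span WMD {lowest} to {highest}")
```

Because a disguise with a WMD as high as 0.479 was still located, any cutoff intended to flag all of these cases would have to sit above that value. Whether such a cutoff would also flag many successful disguises cannot be judged from Table 4 alone.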

Second, techniques beyond those offered by word spinners should be explored, extended, and applied to ethical disguise. Perhaps rival techniques exist at the intersection of semantic modeling, knowledge graphs, natural language understanding, and reinforcement learning. Moreover, by leveraging the metrics described in this research, we envision a self-supervised tool for creating disguised phrases. A successful tool would maximize both the non-locatability of sources and fidelity to the source quotation (i.e., semantic completeness and fluency).

Using shady services for ethical purposes has an ironic appeal, but there’s room for techniques specific to ethical disguise, openly specified and perhaps provided as a service by research or disciplinary associations.

 

++++++++++

Conclusion

Researchers who disguise their online sources would benefit from understanding successful disguise tactics and from having a tool for testing the efficacy of the results.

In addition to avoiding the “small haystack” of using phrases from too few subreddits over too short a time (Reagle, 2021b), automated word spinners could be a part of an ethical toolkit. The best spinners advertise that they test the results of their algorithms to avoid detection, and so researchers might use this shady practice to better their reporting of online sources.

We selected 10 phrases from fictional posts on r/AmItheButtface and “spun” them using Spin Rewriter and WordAi. The results were then (1) searched for on Google; and, (2) judged for fidelity by human subjects (N=19). The spinning services fared relatively well.

The subjects were also asked to spin three of those phrases, which we assessed for efficacy and the tactics employed. Reagle (2021b) found that altering or removing mention of the source forum and date (e.g., r/AmItheButtface in 2020) limits the likelihood of finding it. We recommend that when it comes to rewording phrases, researchers use multiple-word substitutions — especially of novel words (e.g., “radioactive” in place of “toxic”) — and alter or remove elements of sentence structure.

The practice of testing spins by the researcher, however, is necessary. We offer a GPLv3-licensed Python script toward this end (Reagle, 2021a).

 

About the authors

Joseph Reagle is an Associate Professor of Communication Studies at Northeastern University.
Direct comments to: joseph [dot] 2011 [at] reagle [dot] org

Manas Gaur is a graduate researcher in the Artificial Intelligence Institute at the University of South Carolina.
E-mail: mgaur [at] email [dot] sc [dot] edu

 

Acknowledgements

We thank Nicholas Proferes for comments on an early draft.

 

Notes

1. Sharf, 1999, p. 253.

2. Proferes, et al., 2021, p. 14.

3. Liu, et al., 2010, p. 928.

4. Ibid.

5. McCarthy, et al., 2009, p. 683.

 

References

N. Andalibi, P. Ozturk, and A. Forte, 2017. “Sensitive self-disclosures, responses, and social support on Instagram: The case of #Depression,” CSCW ’17: Proceedings of the 2017 ACM Conference on Computer Supported Cooperative Work and Social Computing, pp. 1,485–1,500.
doi: http://dx.doi.org/10.1145/2998181.2998243, accessed 26 December 2021.

I. Androutsopoulos and P. Malakasiotis, 2010. “A survey of paraphrasing and textual entailment methods,” Journal of Artificial Intelligence Research, volume 38, pp. 135–187.
doi: http://dx.doi.org/10.1613/jair.2985, accessed 26 December 2021.

J.W. Ayers, T.L. Caputi, C. Nebeker, and M. Dredze, 2018. “Don’t quote me: Reverse identification of research participants in social media studies,” npj Digital Medicine, volume 1, article number 30.
doi: https://doi.org/10.1038/s41746-018-0036-2, accessed 26 December 2021.

J. Baumgartner, S. Zannettou, B. Keegan, M. Squire, and J. Blackburn, 2020. “The Pushshift Reddit dataset,” Proceedings of The International AAAI Conference on Web and Social Media, volume 14, pp. 830–839.
doi: https://ojs.aaai.org/index.php/ICWSM/article/view/7347, accessed 26 December 2021.

D. Bradbury, 2021. “De-identify, re-identify: Anonymised data’s dirty little secret,” The Register (16 September), at https://www.theregister.com/2021/09/16/anonymising_data_feature/, accessed 26 December 2021.

A. Bruckman, 2002. “Studying the amateur artist: A perspective on disguising data collected in human subjects research on the Internet,” Ethics and Information Technology, volume 4, number 3, pp. 217–231.
doi: https://doi.org/10.1023/A:1021316409277, accessed 26 December 2021.

A. Bruckman, K. Luther, and C. Fiesler, 2015. “When should we use real names in published accounts of Internet research?” In: E. Hargittai and C. Sandvig (editors). Digital research confidential: The secrets of studying behavior online. Cambridge, Mass.: MIT Press, pp. 243–258.
doi: https://doi.org/10.7551/mitpress/9386.003.0013, accessed 26 December 2021.

A. Cardinell, 2012. “How absolutely anyone can make $25/hour with spinning,” WordAi blog (10 October), at https://wordai.com/blog/how-absolutely-anyone-can-make-25hour-with-spinning/, accessed 26 December 2021.

A. Celikyilmaz, E. Clark, and J. Gao, 2021. “Evaluation of text generation: A survey,” arXiv: 2006.14799v2 (18 May), at https://arxiv.org/abs/2006.14799, accessed 26 December 2021.

Y. Chen, K. Sherren, M. Smit, and K.Y. Lee, 2021. “Using social media images as data in social science research,” New Media & Society (18 August).
doi: http://dx.doi.org/10.1177/14614448211038761, accessed 26 December 2021.

B. Dym and C. Fiesler, 2020. “Ethical and privacy considerations for research using online fandom data,” Transformative Works and Cultures, volume 33.
doi: http://dx.doi.org/10.3983/twc.2020.1733, accessed 26 December 2021.

EDUCAUSE, 2015. “Guidelines for data de-identification or anonymization” (24 July), at https://www.educause.edu/focus-areas-and-initiatives/policy-and-security/cybersecurity-program/resources/information-security-guide/toolkits/guidelines-for-data-deidentification-or-anonymization, accessed 26 December 2021.

C. Fiesler and N. Proferes, 2018. “‘Participant’ perceptions of Twitter research ethics,” Social Media + Society (10 March).
doi: https://doi.org/10.1177/2056305118763366, accessed 26 December 2021.

O.L. Haimson, N. Andalibi, and J. Pater, 2016. “Ethical use of visual social media content in research publications,” Research Ethics Monthly (20 December), at https://ahrecs.com/ethical-use-visual-social-media-content-research-publications/, accessed 26 December 2021.

D.N. Kannangara, 2017. “Quality, ethics and plagiarism issues in documents generated using word spinning software,” MIER Journal of Educational Studies, Trends and Practices, volume 7, number 1, pp. 24–32.
doi: https://doi.org/10.52634/mier/2017/v7/i1/1441, accessed 26 December 2021.

M.J. Kusner, Y. Sun, N.I. Kolkin, and K.Q. Weinberger, 2015. “From word embeddings to document distances,” ICML’15: Proceedings of the 32nd International Conference on Machine Learning, volume 37, pp. 957–966.

C. Liu, D. Dahlmeier, and H.T. Ng, 2010. “PEM: A paraphrase evaluation metric exploiting parallel texts,” Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pp. 923–932, and at https://aclanthology.org/D10-1090/, accessed 26 December 2021.

B. Lubarsky, 2017. “Re-identification of ‘anonymized’ data,” Georgetown Law Technology Review, at https://perma.cc/86RR-JUFT, accessed 26 December 2021.

A. Markham, 2012. “Fabrication as ethical practice: Qualitative inquiry in ambiguous Internet contexts,” Information, Communication & Society, volume 15, number 3, pp. 334–353.
doi: https://doi.org/10.1080/1369118x.2011.641993, accessed 26 December 2021.

P.M. McCarthy, R.H. Guess, and D.S. McNamara, 2009. “The components of paraphrase evaluations,” Behavior Research Methods, volume 41, number 3, pp. 682–690.
doi: https://doi.org/10.3758/brm.41.3.682, accessed 26 December 2021.

NuGet, 2021. “The only article spinner that truly understands the meaning of your content,” Spin Rewriter (1 January), at https://www.spinrewriter.com/, accessed 26 December 2021.

P. Ohm, 2010. “Broken promises of privacy: Responding to the surprising failure of anonymization,” UCLA Law Review, volume 58, number 2, pp. 1,701–1,777, and at https://www.uclalawreview.org/broken-promises-of-privacy-responding-to-the-surprising-failure-of-anonymization-2/, accessed 26 December 2021.

C. Pentzold, 2017. “‘What are these researchers doing in my Wikipedia?’: Ethical premises and practical judgment in internet-based ethnography,” Ethics and Information Technology, volume 19, number 2, pp. 143–155.
doi: http://dx.doi.org/10.1007/s10676-017-9423-7, accessed 26 December 2021.

N. Proferes, N. Jones, S. Gilbert, C. Fiesler, and M. Zimmer, 2021. “Studying Reddit: A systematic overview of disciplines, approaches, methods, and ethics,” Social Media + Society (26 May).
doi: https://doi.org/10.1177/20563051211019004, accessed 26 December 2021.

QuillBot, 2021. “Paraphrasing tool” (28 September), at https://quillbot.com/, accessed 26 December 2021.

r/AmItheButtface, 2020. “r/AmItheButtface,” Reddit (22 December), at https://www.reddit.com/r/AmItheButtface/wiki/index, accessed 26 December 2021.

J. Reagle, 2021a. “Tools for scraping and analyzing Reddit,” GitHub (8 June), at https://github.com/reagle/reddit, accessed 26 December 2021.

J. Reagle, 2021b. “Disguising Reddit sources and the efficacy of ethical research” (under review).

Reddit, 2021. “Reddit by the numbers” (27 January), at https://www.redditinc.com/press, accessed 26 December 2021.

K. Rodham and J. Gavin, 2006. “The ethics of using the Internet to collect qualitative research data,” Research Ethics, volume 2, number 3, pp. 92–97.
doi: http://dx.doi.org/10.1177/174701610600200303, accessed 26 December 2021.

B.F. Sharf, 1999. “Beyond netiquette: The ethics of doing naturalistic discourse research on the Internet,” In: S. Jones (editor). Doing Internet research: Critical issues and methods for examining the net. Thousand Oaks, Calif.: Sage, pp. 243–256.
doi: http://dx.doi.org/10.4135/9781452231471.n12, accessed 26 December 2021.

J. Singal, 2016. “3 lingering questions from the Alice Goffman controversy,” The Cut (15 January), at https://www.thecut.com/2016/01/3-lingering-questions-about-alice-goffman.html, accessed 26 December 2021.

S.J. Sowles, M.J. Krauss, L. Gebremedhn, and P.A. Cavazos-Rehg, 2017. “‘I feel like I’ve hit the bottom and have no idea what to do’: Supportive social networking on Reddit for individuals with a desire to quit cannabis use,” Substance Abuse, volume 38, number 4, pp. 477–482.
doi: http://dx.doi.org/10.1080/08897077.2017.1354956, accessed 26 December 2021.

Spin Rewriter API SDK, 2021. “API code samples” (14 April), at https://www.spinrewriter.com/cp-api-code-samples, accessed 26 December 2021.

A. Sustar, 2016. “Happy 5th birthday, Spin Rewriter!” (14 September), at https://www.spinrewriter.com/blog/happy-5th-birthday-spin-rewriter, accessed 26 December 2021.

WordAi, 2021. “The smartest article rewriter ever” (27 April), at https://wordai.com/, accessed 26 December 2021.

A. Zelenkauskaite and E.P. Bucy, 2016. “A scholarly divide: Social media, big data, and unattainable scholarship,” First Monday, volume 21, number 5, at https://firstmonday.org/article/view/6358/5511, accessed 26 December 2021.
doi: http://dx.doi.org/10.5210/fm.v21i5.6358, accessed 26 December 2021.

A. Zelenkauskaite, P. Toivanen, J. Huhtamäki, and K. Valaskivi, 2021. “Shades of hatred online: 4chan duplicate circulation surge during hybrid media events,” First Monday, volume 26, number 1, at https://firstmonday.org/article/view/11075/10029, accessed 26 December 2021.
doi: https://doi.org/10.5210/fm.v26i1.11075, accessed 26 December 2021.

 


Editorial history

Received 29 October 2021; revised 17 November 2021; accepted 11 December 2021.


Creative Commons License
This paper is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Spinning words as disguise: Shady services for ethical research?
by Joseph Reagle and Manas Gaur.
First Monday, Volume 27, Number 1 - 3 January 2022
https://firstmonday.org/ojs/index.php/fm/article/download/12350/10588
doi: https://dx.doi.org/10.5210/fm.v27i1.12350