A robust gender inference model for online social networks and its application to LinkedIn and Twitter
First Monday

by Athanasios Kokkos and Theodoros Tzouramanis

Online social networking services have come to dominate the dot com world: Countless online communities coexist on the social Web. Some characteristic user attributes, such as gender, age group and sexual orientation, are not automatically part of the profile information. In some cases user attributes can even be deliberately and maliciously falsified. This paper examines automated inference of gender on online social networks by analyzing written text with a combination of natural language processing and classification techniques. Extensive experimentation on LinkedIn and Twitter yielded gender identification accuracy of up to 98.4 percent.


1. Introduction
2. Related work
3. Methodology
4. Architectural and implementation issues
5. Evaluation
6. Comparison with related work
7. Conclusion



1. Introduction

In the age of the omnipresent social Web, personal information is exchanged, shared and transformed in a multitude of ways, as the amount of time that hundreds of millions of people spend connected to online communities and social networks each day keeps increasing. In these environments, individuals reveal all manner of details about themselves, to the point of exposing on a public platform their innermost thoughts and feelings, personal information, preferences, beliefs and concerns. A significant amount of attention has been aimed at extracting relevant information from these sources.

This study focuses on inferring users’ personal attributes in the context of online social networking platforms. This gender prediction model relies on two different statistical and probabilistic algorithms that take into account psycholinguistic dimensions of text in order to determine gender. This problem could be classified under the wider umbrella of authorship profiling (Argamon, et al., 2009), better understood by using demographic features in conjunction with text mining techniques. The model described in this paper uses minimal features to classify text; it has worked well in experimentation with a large trial group.

The scheme described in this paper performs efficiently even for those online social networks used for more formal interaction, such as LinkedIn. For performance comparison, the technique was also applied to Twitter, securing high levels of accuracy. This approach can be easily modified to analyze any form of text because it makes use of content–based sociolinguistic features to a much greater extent than n–gram features.

This methodology could be adapted to diverse applications. It could be useful as part of a larger user identity authentication service consisting of numerous sophisticated and complex data mining components that would serve social network providers as an efficient protection mechanism (Chaski, 2005).

After this introduction, Section 2 analyses the advantages and limitations of related work. Section 3 describes a new methodology for inferring gender on an online social network by applying advanced text analysis techniques. Section 4 describes the implementation of a prototype text mining system that infers the gender of LinkedIn and Twitter users. Section 5 examines issues related to the evaluation of this technique. Section 6 provides insights into the efficiency of this scheme compared to alternative approaches. The last section summarizes the results and suggests directions for further research.



2. Related work

Social networks have made tremendous gains in popularity, resulting in massive amounts of sensitive information about users. Research has focused on information leakage and disclosure on these networks (Chew, et al., 2008; Carter and Mistree, 2009; Mislove, et al., 2010; Xu, et al., 2008; Zhelev and Getoor, 2009).

Privacy on social networks is largely threatened by attribute disclosure. Attribute disclosure occurs when an individual is able to determine values of undisclosed user attributes. Previous research has examined the inference of personal attributes, such as age, gender, profession and religious persuasion. For example, Lanza and Svendesen (2007) and Zhelev and Getoor (2009) focused on public profiles, friendship links and group memberships to infer undisclosed private information. Becker and Chen (2009) illustrated how social relations can be easily exploited to infer private or hidden attributes with the sole use of friendship links. More specifically, for each targeted attribute, an algorithm selects the most frequent value of this attribute among a user’s friends, if the number of friends sharing this attribute exceeds a given threshold. User demographics could be inferred through social relationships alone, with a probability greater than 50 percent. This work also introduced an inference reduction algorithm to reduce this privacy risk by removing some friendship links.
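The friend–based rule attributed to Becker and Chen (2009) above can be illustrated with a minimal sketch. The function name, the use of None for undisclosed values and the strict "greater than" comparison are assumptions made for illustration; they are not taken from the cited work.

```python
from collections import Counter
from typing import Optional, Sequence


def infer_from_friends(friend_values: Sequence[Optional[str]],
                       threshold: int) -> Optional[str]:
    """Return the most frequent disclosed value among friends, or None.

    `threshold` is a hypothetical minimum: the winning value must be shared
    by more than `threshold` friends before any inference is made.
    """
    # Friends who keep the attribute private are represented by None.
    disclosed = [v for v in friend_values if v is not None]
    if not disclosed:
        return None
    value, count = Counter(disclosed).most_common(1)[0]
    return value if count > threshold else None
```

With friends reporting "A", "A", "B" and one undisclosed value, a threshold of 1 yields "A", while a threshold of 2 yields no inference.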

Mislove, et al. (2010) explored a technique to predict a private attribute by means of combining a social network graph with a similar attribute of a number of other users not directly connected to a specific user. With as few as 20 percent of users with known attributes, they were able to infer the remaining common attributes at an 80 percent accuracy rate.

Xu, et al. (2008) proposed an algorithm that determined private attributes by mining solely social relations. He, et al. (2006) utilized Bayesian networks to measure the probability that a specific user may have a given attribute value. By using LiveJournal (http://www.livejournal.com/), a high rate of accuracy was achievable even in online communities where a majority of users keep their personal data undisclosed.

This research indicates that privacy can be easily compromised through the exploitation of information on social relationships, by utilizing correlative techniques that link classes of personal data.

Cheng, et al. (2011) conducted a series of experiments with three different statistical and probabilistic techniques (Support Vector Machine–SVM, Bayesian Logistic Regression and Decision Tree), yielding a gender prediction accuracy of 85.1 percent. Peersman, et al. (2011) proposed a text categorization approach for the prediction of age and gender using chat text from NetLog. This technique was limited to token and character–based features, yielding an accuracy of 88.8 percent. Filippova (2012) focused on gender prediction of YouTube users, on the basis of comments and other evidence, reporting accuracy levels around 90 percent, depending on age.

Rao, et al. (2010) and Rao and Yarowsky (2010) experimented with stacked SVMs–based classification algorithms (Joachims, 1998) over a rich set of features, both lexical (n–gram–based) and sociolinguistic features. Gender prediction accuracy was 72.33 percent, based on an analysis of 500 Twitter accounts. Burger, et al. (2011) used a variety of different machine learning algorithms to build gender classification models. The deployed classifier, relying only on a single tweet, performed at a 76 percent accuracy level. Human prediction performance was not too far off, in comparison, scoring at 68.7 percent.

Algorithms in Miller, et al. (2012) relied on perceptron and naïve Bayes stream machine learning algorithms, making use of n–gram features to represent the tweets used as training and testing data. The perceptron algorithm achieved the highest accuracy level among all, with a maximum score of 99.3 percent, when the length of tweets is at least 75 words. Deitrick, et al. (2012) used an extensive list of 53 n–gram features and only a limited dataset of 1,484 training tweets, securing a 98.5 percent level of accuracy. Fink, et al. (2012) inferred the gender of Twitter users by implementing an SVM classification model based on word unigrams, hash tags and psychometric properties, derived from a text analysis application called Linguistic Inquiry and Word Count or LIWC (Pennebaker, et al., 2007). Accuracy was noted at 80 percent in predicting gender by using unigrams alone.

Bamman, et al. (2012) and Zamal, et al. (2012) studied the influence of immediate neighbours’ attributes in predicting attributes for Twitter users. For gender, the accuracy of their techniques was 88 percent and 80.2 percent, respectively. Bergsma, et al. (2013) examined clusters of Twitter users based on preferences and location, using this information for gender prediction. Accuracy of this method was 90.2 percent. For comparison, a manual inspection of 120 Twitter profiles carried out by human experts yielded an accuracy rate for gender prediction of 88.3 percent. Liu and Ruths (2013) incorporated self–reported names of users into a Twitter gender classifier, achieving an accuracy of 87.1 percent. Finally, Liu, et al. (2012) identified three Twitter accounts dedicated to broadcasting information about traffic, public transportation issues, and cycling in Toronto, using this community for demographic inference. They found a maximum gender prediction accuracy rate of 86.8 percent.

This paper focuses on predicting gender in online social networks. A classification model is implemented and applied to Twitter and LinkedIn, platforms which do not reveal information about gender. This research is closest to that of Argamon, et al. (2009) which used an algorithm and blog posts to achieve an accuracy of 76.1 percent.



3. Methodology

3.1. Theoretical background

The proposed methodology relies on studies in cognitive psychology, computational linguistics and computer forensics that argue that humans display unique, or near unique, behavioural patterns. Psycholinguistic techniques, analyzing properties of text, and machine learning techniques, classifying text, can be combined to infer gender, age, language, educational background, religious affiliation and sexual orientation.

Research in the field of human psychology (for example, Broverman, et al., 1972; Crawford, 1995; Eagly, 1987; Eagly and Steffen, 1984; Greenwald and Banaji, 1995) has demonstrated that women and men adopt different and almost unique gender–based behavioural patterns in communication [1]. Other studies have examined the emotional content and tone of text and shown that certain words can function as markers of emotional, psychological and cognitive states. These markers may act as indicators of gender. For example, Eagly and Steffen (1984) and Greenwald and Banaji (1995) compared feminine and masculine behavioural and attitudinal dimensions to distinguish gender. In summary, the language of men uses patterns that are characterized by more marked expressions of independence and assertions of hierarchical power. This language includes strongly assertive, aggressive and self–promoting features as well as rhetorical questions, authoritative orientation and challenges. In contrast, women tend to express themselves with a more emotional language, including the more frequent use of emotionally intensive adverbs and affective adjectives, such as really, quite, adorable, charming and lovely. Women use more attenuated assertions, apologies, questions, personal orientation and support. Broverman, et al. (1972) and Crawford (1995) have also argued that men are more proactive, directing communication at solving problems, while women are more reactive to the contributions of others. Women are more expressive of certain emotions and more concerned about maintaining intimacy in their relationships; men are found to be better at controlling their non–verbal expressions, and to be more concerned with maintaining autonomy in relationships. These significant differences between women and men are present in written communication patterns (Rubin and Greene, 1992).
In addition, there are style markers — based on characters, words, syntax and structure — that could be used for gender identification (Argamon, et al., 2009).

This research exploits both content–based features — words related to specific feelings that can act as markers of emotional, psychological and cognitive states — and traditional features — markers of writing styles — that act as strong gender indicators.

In the case of LinkedIn, content–based psycholinguistic and traditional gender features are extracted from the user’s Summary. This attribute provides a professional profile of a user. For Twitter, these tools are used to examine tweets. Each sample will be transformed into a multidimensional features vector, with each feature contributing to a gender classification. Then a machine learning text classifier will be constructed to predict gender.

3.2. Selection of features set

Corney, et al. (2002) and Cheng, et al. (2011) recognized that the word–based features and function words are linguistic features that are crucial to gender inference. Content–based features can achieve efficient results in distinguishing gender. In this study, a combination of several features was selected to build a mixed set of 423 style markers, described in Table 1.


Table 1: A view of the features set.
Character–based features (4)
1. Total number of characters (C)
2. Total number of lower–case characters (a–z)/C
3. Total number of upper–case characters/C
4. Total number of white–space characters/C
Word–based features (21)
5. Total number of words (N)
6. Average length per word (in characters)
7. Vocabulary richness (total different words/N)
8. Number of net abbreviations/N
9. Number of words longer than 6 characters/N
10–25. Psycholinguistic features (16 measures that indicate emotional state). More details in Table 2.
Syntactic features (4)
26. Number of question marks (?)/C
27. Number of multiple question marks (???)/C
28. Number of exclamation marks (!)/C
29. Number of multiple exclamation marks (!!!)/C
Gender–preferential language features (9)
30. Number of words ending in able/N
31. Number of words ending in al/N
32. Number of words ending in ful/N
33. Number of words ending in ible/N
34. Number of words ending in ic/N
35. Number of words ending in ive/N
36. Number of words ending in less/N
37. Number of words ending in ly/N
38. Number of words ending in ous/N
Function word–based features (385)
39–41. Number of articles/N
42–117. Number of pronouns/N
118–164. Number of auxiliary verbs/N
165–187. Number of conjunctions/N
188–299. Number of interjections/N
300–423. Number of adverbs and prepositions/N


On a more detailed level, the character–based features subset consists of four features: the total number of characters (C); the total number of lower–case characters/C; the total number of upper–case characters/C; and, the total number of white–space characters/C. From the range of word–based features described in the literature, five statistical metrics (lines 5 to 9 in Table 1) and 16 psycholinguistic features (extracted from PsychPage, 2014; lines 10–25 in Table 1) were chosen. Psycholinguistic features include words related to positive or pleasant emotional states (such as open, happy, alive, good, love, interested, positive and strong) as well as words related to negative or unpleasant emotional states (such as angry, depressed, confused, helpless, indifferent, afraid, hurt and sad). These are usually significant indicators of emotional states and reliable pointers to gender (Argamon, et al., 2009; Cheng, et al., 2011; Mauss and Robinson, 2009; Shields, et al., 2006). The sixteen psycholinguistic features are listed in more detail in Table 2. Every feature is related to a number of words that in turn correspond to emotions. The value of each feature can be computed by measuring the frequency of use of words pertaining to the corresponding emotional and psychological states.


Table 2: Psycholinguistic features subset (PsychPage, 2014).
Emotional & psychological state | Words corresponding to specific feelings
Positive or pleasant feelings
Open | Understanding, confident, reliable, easy, amazed, sympathetic, interested, satisfied, receptive, accepting, kind
Happy | Great, gay, joyous, lucky, fortunate, delighted, overjoyed, gleeful, thankful, important, festive, ecstatic, satisfied, glad, cheerful, sunny, merry, elated, jubilant
Alive | Playful, courageous, energetic, liberated, optimistic, provocative, impulsive, free, frisky, animated, spirited, thrilled, wonderful
Good | Calm, Peaceful, at ease, comfortable, pleased, encouraged, clever, surprised, content, quit, certain, relaxed, serene, free and easy, bright, blessed, reassured
Love | Loving, considerate, affectionate, sensitive, tender, devoted, attracted, passionate, admiration, warm, touched, sympathy, close, loved, comforted, drawn toward
Interested | Concerned, affected, fascinated, intrigued, absorbed, inquisitive, nosy, snoopy, engrossed, curious
Positive | Eager, keen, earnest, intent, anxious, inspired, determined, excited, enthusiastic, bold, brave, daring, challenged, optimistic, reinforced, confident, hopeful
Strong | Impulsive, free, sure, certain, rebellious, unique, dynamic, tenacious, hardy, secure
Negative or unpleasant feelings
Angry | Irritated, enraged, hostile, insulting, sore, annoyed, upset, hateful, unpleasant, offensive, bitter, aggressive, resentful, inflamed, provoked, incensed, infuriated, cross, worked up, boiling, fuming, indignant
Depressed | Lousy, disappointed, discouraged, ashamed, powerless, diminished, guilty, dissatisfied, miserable, detestable, repugnant, despicable, disgusting, abominable, terrible, in despair, sulky, a sense of loss
Confused | Upset, doubtful, uncertain, indecisive, perplexed, embarrassed, hesitant, shy, stupefied, disillusioned, unbelieving, sceptical, distrustful, misgiving, lost, unsure, uneasy, pessimistic, tense
Helpless | Incapable, alone, paralyzed, fatigued, useless, inferiors, vulnerable, empty, forced, hesitant, despair, frustrated, distressed, woeful, pathetic, tragic, in a stew, dominated
Indifferent | Insensitive, dull, nonchalant, neutral, reserved, weary, bored, preoccupied, cold, disinterested, lifeless
Afraid | Fearful, terrified, suspicious, anxious, alarmed, panic, nervous, scared, worried, frightened, timid, shaky, restless, doubtful, threatened, cowardly, quaking, menaced, wary
Hurt | Crushed, tormented, deprived, pained, tortured, dejected, rejected, injured, offended, afflicted, aching, victimized, heartbroken, agonized, appalled, humiliated, wronged, alienated
Sad | Tearful, sorrowful, pained, grief, anguish, desolate, desperate, pessimistic, unhappy, lonely, grieved, mournful, dismayed
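
As a rough illustration of how one psycholinguistic feature value might be computed, the sketch below counts the words of a sample that belong to an emotional state and divides by the total number of words. Several points are assumptions rather than details given in the paper: the normalization by N, the regular–expression tokenizer, the restriction to single–word entries (phrases such as "at ease" are ignored), and the abbreviated word lists, which cover only two of the 16 states of Table 2.

```python
import re

# Abbreviated, illustrative word lists drawn from Table 2 (two states only).
EMOTION_WORDS = {
    "happy": {"great", "joyous", "lucky", "delighted", "thankful", "glad", "cheerful"},
    "sad": {"tearful", "sorrowful", "grief", "unhappy", "lonely", "mournful"},
}


def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens (assumed tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def psycholinguistic_features(text):
    """Return, per emotional state, the share of tokens found in its word list."""
    tokens = tokenize(text)
    n = len(tokens)
    features = {}
    for state, words in EMOTION_WORDS.items():
        hits = sum(1 for token in tokens if token in words)
        # Assumption: the frequency is normalized by the number of words N.
        features[state] = hits / n if n else 0.0
    return features
```

For the sample "So glad and thankful today", two of the five tokens belong to the "happy" list, so that feature is 0.4 and the "sad" feature is 0.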


Four syntactic features (lines 26 to 29 in Table 1) were chosen to reflect writing style at the sentence level, since women and men make use of punctuation in different ways (Argamon, et al., 2003; Sterkel, 1988). Nine features were included to measure the use of emotionally intensive adverbs and adjectives, such as really, lovely, adorable, marvelous, and aggressive. As can be seen in lines 30 to 38 of Table 1, the frequency of such words is measured through the presence of suffixes such as -able, -ive and -less.
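
The character–based, syntactic and suffix features of Table 1 are simple ratios, and a minimal sketch of their computation is given here. The tokenizer, the dictionary keys and the reading of "multiple" marks as runs of two or more are assumptions for illustration; the paper does not specify them.

```python
import re

# Suffixes behind features 30 to 38 of Table 1.
SUFFIXES = ("able", "al", "ful", "ible", "ic", "ive", "less", "ly", "ous")


def style_features(text):
    """Compute character, syntactic and suffix features for one text sample."""
    c = len(text)                              # total number of characters (C)
    words = re.findall(r"[A-Za-z']+", text)    # simple tokenizer (assumption)
    n = len(words)                             # total number of words (N)

    def per_char(count):
        return count / c if c else 0.0

    def per_word(count):
        return count / n if n else 0.0

    feats = {
        "chars": c,
        "lower": per_char(sum("a" <= ch <= "z" for ch in text)),
        "upper": per_char(sum(ch.isupper() for ch in text)),
        "space": per_char(sum(ch.isspace() for ch in text)),
        "question": per_char(text.count("?")),
        # Assumption: a run of two or more marks counts as one "multiple" mark.
        "multi_question": per_char(len(re.findall(r"\?{2,}", text))),
        "exclamation": per_char(text.count("!")),
        "multi_exclamation": per_char(len(re.findall(r"!{2,}", text))),
    }
    for suffix in SUFFIXES:
        hits = sum(w.lower().endswith(suffix) for w in words)
        feats["suffix_" + suffix] = per_word(hits)
    return feats
```

For the sample "Really lovely!!! Is it useful?", two of the five words end in -ly and one ends in -ful, so the corresponding features are 0.4 and 0.2.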

Finally, 385 function words were included (see Appendix) for the role they play in distinguishing writing styles according to gender (Chung and Pennebaker, 2007; Newman, et al., 2008). As lines 39 to 423 of Table 1 show, these words were clustered into six groups (articles; pronouns; auxiliary verbs; conjunctions; interjections; and, adverbs and prepositions). The frequency of each word was calculated by dividing its number of appearances in a sample of text by the total number (N) of words in the sample.
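
The function word frequencies described above amount to one ratio per word. The sketch below shows the calculation for a tiny, hypothetical subset of function words; the full list of 385 words is in the paper's Appendix and is not reproduced here, and the tokenizer is an assumption.

```python
import re
from collections import Counter

# Illustrative subset only; the actual features set uses 385 function words.
FUNCTION_WORDS = ["a", "an", "the", "she", "he", "and", "but"]


def function_word_features(text, vocabulary=FUNCTION_WORDS):
    """Return one value per function word: its count divided by N."""
    tokens = re.findall(r"[a-z']+", text.lower())
    n = len(tokens)
    counts = Counter(tokens)
    return [counts[word] / n if n else 0.0 for word in vocabulary]
```

Each sample thus contributes one dimension per function word to its features vector, in the fixed order of the vocabulary.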

3.3. Classification model selection

In order to select the most efficient classification method, we performed experiments with different models using the same combination of features. Alongside an SVM, we applied both a Bayesian Network classifier and a decision tree model with the features set described earlier and the same training and testing datasets. The results demonstrated that the SVM classification algorithm was the most accurate in determining gender. The alternative classifiers were inferior, failing to reach accuracy greater than 65 to 70 percent, even after the size of the training dataset was increased. These results confirm earlier conclusions (Diederich, et al., 2000) indicating that SVM is the most suitable classifier for problems like gender prediction.

3.4. Part–of–speech tagging

Before extracting function words from text produced by users, part–of–speech tagging (Charniak, et al., 1993) was performed by training a part–of–speech tagger. Part–of–speech tagging is a process whereby tokens are labelled as adjectives, adverbs, conjunctions, determiners, nouns, prepositions, pronouns and verbs. A part–of–speech tagger attaches a tag (a short description of the part of speech) to every token in a sample of text. The purpose of a tag is to provide a grammatical class for each word and predictive features indicating the behaviour of other words in the text. Since some words may fall under more than one syntactic label, a simple dictionary lookup is not sufficient; the most likely part of speech for every word has to be chosen in context (Charniak, 1993). There are many different ways of performing this tagging, with the best tag set depending on the application. In practice, the tag set is predefined by an appropriately selected training dataset such as the Brown Corpus (Francis and Kucera, 1979). The Brown Corpus is a corpus of English text, one million words in length, made up of 500 samples of 2,000 or more words. For this research a Hidden Markov model (Kupiec, 1992) was used for part–of–speech tagging. The Hidden Markov tagger uses stochastic methods and probabilities to tag words in a sentence. The success of this approach is significantly affected by the selection of an appropriate training dataset to accurately construct probability estimations for the model. This process is described below.
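
To make the idea of stochastic tagging concrete, the following sketch decodes the most probable tag sequence under a toy Hidden Markov model with the Viterbi algorithm. It is not the tagger used in this study: the three–tag set and all probabilities are invented for illustration rather than estimated from the Brown Corpus, and the small `floor` probability for unseen words is an assumed smoothing choice.

```python
TAGS = ["DET", "NOUN", "VERB"]

# Hypothetical model parameters (not estimated from any corpus).
START_P = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
TRANS_P = {
    "DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.3, "VERB": 0.6},
    "VERB": {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1},
}
EMIT_P = {
    "DET": {"the": 0.9},
    "NOUN": {"dogs": 0.6, "run": 0.3},   # "run" is ambiguous: noun or verb
    "VERB": {"run": 0.7},
}


def viterbi(tokens, tags, start_p, trans_p, emit_p, floor=1e-6):
    """Return the most probable tag sequence for `tokens` under the HMM."""
    if not tokens:
        return []
    # best[i][t] = (probability of the best path ending in tag t at i, previous tag)
    best = [{t: (start_p[t] * emit_p[t].get(tokens[0], floor), None) for t in tags}]
    for token in tokens[1:]:
        column = {}
        for t in tags:
            column[t] = max(
                (best[-1][p][0] * trans_p[p][t] * emit_p[t].get(token, floor), p)
                for p in tags
            )
        best.append(column)
    # Backtrack from the most probable final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]
```

In this toy model the ambiguous word "run" is tagged as a verb in "the dogs run" because the transition from noun to verb is more probable than from noun to noun, which is the kind of contextual choice a dictionary lookup cannot make.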



4. Architectural and implementation issues

First we will survey the selection process for training data, required to strengthen the robustness of the proposed gender classification model. Second, we will describe the implementation of a prototype gender inference system.

4.1. Training datasets

Twitter training dataset: The Twitter platform was crawled randomly, collecting 10,000 tweets written in English, with 5,000 tweets written by women and 5,000 created by men. The gender of the authors of the tweets was subsequently verified manually by means of external information gathered on the Web (photos, personal home pages and verified presence in other social network platforms). The dataset was used by the text mining module to train the SVM classifier that predicts the gender of Twitter users on the basis of the textual content. The tweets in the training dataset were wiped clean of URLs as well as of topics (“#topic”) and Twitter user names (“@name”) in order to obtain a suitable corpus of text for the classification process.
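
The cleaning step described above (removal of URLs, "#topic" tags and "@name" mentions) might look like the sketch below. The regular expressions are assumptions chosen for illustration; the paper does not state the exact patterns that were used.

```python
import re

# Assumed patterns: web addresses, then hash tags and user mentions.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
TAG_RE = re.compile(r"[#@]\w+")


def clean_tweet(text):
    """Strip URLs, hash tags and user names, then normalize white space."""
    text = URL_RE.sub(" ", text)
    text = TAG_RE.sub(" ", text)
    return " ".join(text.split())
```

For example, "Loving this! #happy @bob http://t.co/abc ok" is reduced to "Loving this! ok".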

Reuters–21578 newsgroup dataset: This dataset (Carnegie Group and Reuters, 1987) consists of thousands of articles from a variety of international newswires, each hundreds to thousands of words in length. The articles were sorted into categories according to the gender of the journalists, and 1,000 articles were then randomly picked, 500 by female authors and 500 by male authors. The dataset was selected to train an SVM classifier that would predict the gender of LinkedIn users based on their profile’s Summary attribute. Articles that contained quotes from different authors were excluded from this dataset. Using this dataset was a more efficient means of training the classifier than sampling LinkedIn summaries and manually verifying gender.


Figure 1: Architecture of the proposed system.


4.2. Proposed architecture

As Figure 1 illustrates, the system’s data cleaning module is responsible for cleaning input (training and testing) data by removing unnecessary items that do not contribute to the part–of–speech tagging and classification tasks, such as usernames, hash tags and URLs.

For each text sample, the Features Extraction Engine computes a 423–dimensional numerical vector. This module incorporates two processing tools: a part–of–speech tagger and a Hidden Markov model–based tagger. The part–of–speech tagger was constructed with the LingPipe toolkit (LingPipe, n.d.), which provides a part–of–speech tag parser for the Brown Corpus as distributed with NLTK, the open source natural language toolkit (Bird, 2006). Training text was formatted in ASCII, with sentences delimited, tokens delimited and tags separated from tokens by a forward slash. This corpus was used to train our part–of–speech tagger, which was then loaded into the probabilistic (stochastic) Hidden Markov model–based tagger in order to extract features of interest. These form, together with style markers, the 423 features set described earlier.
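
The slash–delimited training format mentioned above can be read with a few lines of code, and the resulting token and tag pairs yield the counts from which a Hidden Markov tagger's transition and emission probabilities are usually estimated. This is a generic sketch, not the LingPipe implementation; the function names and the "<s>" sentence–start marker are hypothetical.

```python
from collections import Counter, defaultdict


def parse_tagged_sentence(line):
    """Split a line such as 'The/at dog/nn' into (token, tag) pairs."""
    pairs = []
    for item in line.split():
        if "/" not in item:
            continue  # skip malformed items without a tag
        # Split on the last slash so tokens that contain '/' stay intact.
        token, _, tag = item.rpartition("/")
        pairs.append((token, tag))
    return pairs


def count_hmm_statistics(lines):
    """Count tag-to-tag transitions and tag-to-word emissions."""
    transitions = defaultdict(Counter)
    emissions = defaultdict(Counter)
    for line in lines:
        previous = "<s>"  # hypothetical sentence-start marker
        for token, tag in parse_tagged_sentence(line):
            transitions[previous][tag] += 1
            emissions[tag][token.lower()] += 1
            previous = tag
    return transitions, emissions
```

Dividing each count by its row total would give maximum–likelihood probability estimates; in practice some smoothing is typically added for unseen words and tag pairs.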

The produced features vectors were used to build an efficient classification module, composed of a linear SVM classifier for Twitter and an analogous classifier for LinkedIn, both constructed with the LingPipe toolkit. These two binary classifiers together form the ‘Gender Binary Classification’ component of the system, which determines the gender of input text.

On LinkedIn, a single text sample was available for each targeted author. For Twitter, more than one tweet existed. The algorithm generated a prediction for each sample and deployed an error–correcting mechanism to select the most likely gender. For example, if 10 tweets were available, with seven indicating that the author was female while the three other tweets were identified as male, the system concluded that the author was most likely female.
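
The selection of the most likely gender from several per–tweet predictions, as in the example of seven "female" and three "male" predictions, behaves like a majority vote. A minimal sketch follows; returning None on a tie is an assumption, since the paper does not say how ties are resolved.

```python
from collections import Counter


def aggregate_predictions(labels):
    """Return the label predicted most often, or None if undecided."""
    if not labels:
        return None
    ranked = Counter(labels).most_common()
    # Assumption: a tie between the two leading labels gives no decision.
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]
```

With seven "female" and three "male" predictions, the function returns "female", matching the example above.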



5. Evaluation

5.1. Experimentation setup

To evaluate the accuracy of the gender prediction tool, public data samples in English from LinkedIn and Twitter were used. More specifically, the Summary attribute of 1,000 randomly selected users’ profiles on LinkedIn and 1,000 English tweets from randomly selected users’ profiles on Twitter were crawled, excluding tweets containing only hyperlinks and tweets containing a single word of text. The gender of the authors was manually examined and validated [2]. None of the tweets appeared in the Twitter training dataset.

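In the testing phase with Twitter, accuracy is defined as:

$$\text{Accuracy}_{\text{Twitter}} = \frac{\text{number of tweets for which the gender of the author is correctly predicted}}{\text{total number of tweets tested}}$$

For LinkedIn, accuracy is defined as:

$$\text{Accuracy}_{\text{LinkedIn}} = \frac{\text{number of users whose gender is correctly predicted}}{\text{total number of users tested}}$$

where gender is correctly predicted if validated by manual human inspection.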

5.2. Performance evaluation

On the basis of 1,000 tweets, 922 correct gender predictions were obtained, for a 92.2 percent accuracy rate for the Twitter classifier. The classifier failed in 78 cases, in which the tweets were poor in content and did not provide sufficient useful text. This accuracy rate is notable given the short length of tweets.

On LinkedIn, the gender classifier could use a smaller features set: the features of Table 1 that relate to emotional language were not needed, because both the training and the testing datasets consisted of descriptive texts of a neutral character. With this modification, the proposed model correctly predicted the gender of 984 of the 1,000 LinkedIn users, an accuracy of 98.4 percent.
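
The accuracy figures reported in this section are the share of test samples whose predicted gender matches the manually validated one. A minimal sketch of that calculation is shown here; the function name and the list–based interface are assumptions for illustration.

```python
def accuracy(predicted, actual):
    """Share of samples whose predicted label equals the validated label."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty label lists of equal length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Reported counts from this section, expressed as ratios.
twitter_accuracy = 922 / 1000    # 92.2 percent
linkedin_accuracy = 984 / 1000   # 98.4 percent
```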

5.3. Discussion

The experimental results for Twitter clearly indicate that the task of achieving gender inference from a single tweet is more difficult compared to analyzing a LinkedIn summary. The rate of accuracy increases in relation to the number of words in a sample: with more words, the classifier can more easily process and extrapolate. Since the average length of a LinkedIn summary is much greater than the average length of a tweet, it follows that the success rate for LinkedIn will consistently be higher.



6. Comparison with related work

Table 3 summarizes results of earlier work in terms of gender inference accuracy (percentage) for Twitter users. For every case, this table lists the highest reported accuracy levels, even if this is achievable only under restricted conditions [3]. The technique described in this paper outperforms most of the earlier work in predicting gender in Twitter and is one of the first to analyze gender in selected text from LinkedIn.


Table 3: Accuracy for gender inference in Twitter and LinkedIn.
Note: N/A=Not addressed.
Gender inference | Accuracy for Twitter | Accuracy for LinkedIn
Bamman, et al., 2012 | 88.0% | N/A
Bergsma, et al., 2013 | 90.2% | N/A
Burger, et al., 2011 | 76.0% | N/A
Deitrick, et al., 2012 | 98.5% | N/A
Fink, et al., 2012 | 80.6% | N/A
Human inspection (Burger, et al., 2011) | 68.7% | N/A
Human expert inspection (Bergsma, et al., 2013) | 88.3% | N/A
Liu, et al., 2012 | 86.8% | N/A
Liu and Ruths, 2013 | 87.1% | N/A
Miller, et al., 2012 | 99.3% | N/A
Rao, et al., 2010; Rao and Yarowsky, 2010 | 72.3% | N/A
Zamal, et al., 2012 | 80.2% | N/A
This paper | 92.2% | 98.4%




7. Conclusion

This study uses a new approach to infer gender in online social networks. Experimentation with text from Twitter demonstrated higher levels of accuracy than most earlier efforts. An examination of profile summaries from LinkedIn produced a robust accuracy of 98.4 percent. A wide spectrum of possibilities remains to be exploited with this technique in social networks. Future applications will further explore textual data from social networks in order to infer other hidden user attributes.


About the authors

Mr. Athanasios Kokkos received a B.Sc. in information technology from the Technological Educational Institute of Thessaloniki and a M.Sc. in information and communication systems security from the University of the Aegean, in Greece. Currently he is a Ph.D. candidate at the University of the Aegean. His research interests include data security, data privacy protection and social networks.
E–mail: ath [dot] kokkos [at] aegean [dot] gr

Dr. Theodoros Tzouramanis received a five–year B.Eng. in electrical and computer engineering and a Ph.D. in informatics, both from the Aristotle University of Thessaloniki, in Greece. Currently he is an assistant professor and the director of the Database Laboratory in the Department of Information and Communication Systems Engineering at the University of the Aegean. His research interests include access methods and query processing algorithms for temporal, spatial, image, multimedia databases; databases for time–evolving spatial data; GIS and online social networks data management; and, data privacy, security and forensics.
Web: http://www.icsd.aegean.gr/ttzouram/
Direct comments to: ttzouram [at] aegean [dot] gr



1. Caveat: It is important to note that this study does not mean to take a stand with regard to the debate around gender issues and the authors are fully aware that they could be seen to tread on shifty ground. The notion of gender as a women–men binary attribute in the stereotypically traditional way was adopted because it provides a suitable platform to demonstrate the approach taken by the work carried out in this research context. The terms woman/female and man/male are used interchangeably in this paper.

2. Manual inspection of data corresponded to various gender cues to determine the gender of users.

3. For example, in Miller, et al. (2012), accuracy was reported up to 99.3 percent but only when tweets were at least 75 words in length.



S. Argamon, M. Koppel, J.W. Pennebaker and J. Schler, 2009. “Automatically profiling the author of an anonymous text,” Communications of the ACM, volume 52, number 2, pp. 119–123.
doi: http://dx.doi.org/10.1145/1461928.1461959, accessed on 29 August 2014.

S. Argamon, M. Koppel, J. Fine and A.R. Shimoni, 2003. “Gender, genre, and writing style in formal written texts,” Text, volume 23, number 3, pp. 321–346.
doi: http://dx.doi.org/10.1515/text.2003.014, accessed on 29 August 2014.

D. Bamman, J. Eisenstein and T. Schnoebelen, 2012. “Gender in Twitter: Styles, stances, and social networks,” arXiv (6 October), at http://arxiv-web3.library.cornell.edu/abs/1210.4567v1, accessed on 29 August 2014.

J. Becker and H. Chen, 2009. “Measuring privacy risk in online social networks,” W2SP 2009: Web 2.0 Security and Privacy 2009, at http://www.w2spconf.com/2009/papers/s2p2.pdf, accessed on 29 August 2014.

S. Bergsma, M. Dredze, B. Van Durme, T. Wilson and D. Yarowsky, 2013. “Broadly improving user classification via communication–based name and location clustering on Twitter,” Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL–HLT), pp. 1,010–1,019, at http://www.aclweb.org/anthology/N/N13/N13-1121.pdf, accessed on 29 August 2014.

S. Bird, 2006. “NLTK: The natural language toolkit,” COLING–ACL ’06: Proceedings of the COLING/ACL on Interactive Presentation Sessions, pp. 69–72.
doi: http://dx.doi.org/10.3115/1225403.1225421, accessed on 29 August 2014.

I.K. Broverman, S.R. Vogel, D.M. Broverman, F.E. Clarkson and P.S. Rosenkrantz, 1972. “Sex–role stereotypes: A current appraisal,” Journal of Social Issues, volume 28, number 2, pp. 59–78.
doi: http://dx.doi.org/10.1111/j.1540-4560.1972.tb00018.x, accessed on 29 August 2014.

J.D. Burger, J. Henderson, G. Kim and G. Zarrella, 2011. “Discriminating gender on Twitter,” EMNLP ’11: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 1,301–1,309.

Carnegie Group and Reuters, 1987. “Reuters–21578,” at http://www.daviddlewis.com/resources/testcollections/reuters21578/, accessed on 29 August 2014.

J. Carter and B.F.T. Mistree, 2009. “Gaydar: Facebook friendships expose sexual orientation,” First Monday, volume 14, number 10, at http://firstmonday.org/article/view/2611/2302, accessed on 29 August 2014.
doi: http://dx.doi.org/10.5210/fm.v14i10.2611, accessed on 29 August 2014.

E. Charniak, 1993. Statistical language learning. Cambridge, Mass.: MIT Press.

E. Charniak, C. Hendrickson, N. Jacobson and M. Perkowitz, 1993. “Equations for part–of–speech tagging,” Proceedings of the AAAI, pp. 784–789, at http://www.aaai.org/Papers/AAAI/1993/AAAI93-117.pdf, accessed on 29 August 2014.

C. Chaski, 2005. “Who’s at the keyboard? Authorship attribution in digital evidence investigations,” International Journal of Digital Evidence, volume 4, number 1, pp. 1–13, at http://www.utica.edu/academic/institutes/ecii/publications/articles/B49F9C4A-0362-765C-6A235CB8ABDFACFF.pdf, accessed on 29 August 2014.

N. Cheng, R. Chandramouli and K.P. Subbalakshmi, 2011. “Author gender identification from text,” Digital Investigation, volume 8, number 1, pp. 78–88.
doi: http://dx.doi.org/10.1016/j.diin.2011.04.002, accessed on 29 August 2014.

M. Chew, D. Balfanz and B. Laurie, 2008. “(Under)mining privacy in social networks,” Proceedings of the Web 2.0 Security and Privacy Workshop (W2SP), at http://w2spconf.com/2008/papers/s3p2.pdf, accessed on 29 August 2014.

M. Corney, O. de Vel, A. Anderson and G. Mohay, 2002. “Gender–preferential text mining of e–mail discourse,” Proceedings of the 18th Annual Computer Security Applications Conference, pp. 282–289.
doi: http://dx.doi.org/10.1109/CSAC.2002.1176299, accessed on 29 August 2014.

M. Crawford, 1995. Talking difference: On gender and language. London: Sage.

C.K. Chung and J.W. Pennebaker, 2007. “The psychological functions of function words,” In: K. Fiedler (editor). Social communication. New York: Psychology Press, pp. 343–359, at http://www.homepage.psy.utexas.edu/HomePage/Class/Psy301/Pennebaker/HRtraining/ChungPennebaker2007.pdf, accessed on 29 August 2014.

W. Deitrick, Z. Miller, B. Valyou, B. Dickinson, T. Munson and W. Hu, 2012. “Gender identification on Twitter using the modified balanced winnow,” Communications and Networks, volume 4, number 3, pp. 189–195.
doi: http://dx.doi.org/10.4236/cn.2012.43023, accessed on 29 August 2014.

J. Diederich, J. Kindermann, E. Leopold and G. Paass, 2000. “Authorship attribution with support vector machines,” Applied Intelligence, volume 19, numbers 1–2, pp. 109–123.
doi: http://dx.doi.org/10.1023/A:1023824908771, accessed on 29 August 2014.

A.H. Eagly, 1987. Sex differences in social behavior: A social–role interpretation. Hillsdale, N.J.: L. Erlbaum Associates.

A.H. Eagly and V. Steffen, 1984. “Gender stereotypes stem from the distribution of women and men into social roles,” Journal of Personality and Social Psychology, volume 46, number 4, pp. 735–754.
doi: http://dx.doi.org/10.1037/0022-3514.46.4.735, accessed on 29 August 2014.

K. Filippova, 2012. “User demographics and language in an implicit social network,” EMNLP–CoNLL ’12: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 1,478–1,488.

C. Fink, J. Kopecky and M. Morawski, 2012. “Inferring gender from the content of tweets: A region specific example,” Sixth International AAAI Conference on Weblogs and Social Media, pp. 459–462, at https://www.aaai.org/ocs/index.php/ICWSM/ICWSM12/paper/view/4644, accessed on 29 August 2014.

W.N. Francis and H. Kucera, 1979. “Brown Corpus manual: Manual of information to accompany A Standard Corpus of Present–Day Edited American English, for use with digital computers,” at http://www.hit.uib.no/icame/brown/bcm.html; address to download the Brown Corpus, at http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml, accessed on 29 August 2014.

A.G. Greenwald and M.R. Banaji, 1995. “Implicit social cognition: Attitudes, self–esteem, and stereotypes,” Psychological Review, volume 102, number 1, pp. 4–27.
doi: http://dx.doi.org/10.1037/0033-295X.102.1.4, accessed on 29 August 2014.

J. He, W.W. Chu and Z. Liu, 2006. “Inferring privacy information from social networks,” In: S. Mehrotra, D.D. Zeng, H. Chen, B. Thuraisingham and F.–Y. Wang (editors). Intelligence and security informatics. Lecture Notes in Computer Science, volume 3975, pp. 154–165.
doi: http://dx.doi.org/10.1007/11760146_14, accessed on 29 August 2014.

T. Joachims, 1998. “Text categorization with support vector machines: Learning with many relevant features,” In: C. Nédellec and C. Rouveirol (editors). Machine learning: ECML–98. Lecture Notes in Computer Science, volume 1398, pp. 137–142.
doi: http://dx.doi.org/10.1007/BFb0026683, accessed on 29 August 2014.

J. Kupiec, 1992. “Robust part–of–speech tagging using a hidden Markov model,” Computer Speech & Language, volume 6, number 3, pp. 225–242.
doi: http://dx.doi.org/10.1016/0885-2308(92)90019-Z, accessed on 29 August 2014.

E. Lanza and B.A. Svendsen, 2007. “Tell me who your friends are and I might be able to tell you what language(s) you speak: Social network analysis, multilingualism, and identity,” International Journal of Bilingualism, volume 11, number 3, pp. 275–300.
doi: http://dx.doi.org/10.1177/13670069070110030201, accessed on 29 August 2014.

LingPipe, n.d. “Part–of–speech tutorial,” at http://alias-i.com/lingpipe/demos/tutorial/posTags/read-me.html, accessed on 29 August 2014.

W. Liu and D. Ruths, 2013. “What’s in a name? Using first names as features for gender inference in Twitter,” 2013 AAAI Spring Symposium Series, pp. 10–16, at http://www.aaai.org/ocs/index.php/SSS/SSS13/paper/view/5744, accessed on 29 August 2014.

W. Liu, F. Al Zamal and D. Ruths, 2012. “Using social media to infer gender composition of commuter populations,” Sixth International AAAI Conference on Weblogs and Social Media, at https://www.aaai.org/ocs/index.php/ICWSM/ICWSM12/paper/view/4784, accessed on 29 August 2014.

I.B. Mauss and M.D. Robinson, 2009. “Measures of emotion: A review,” Cognition and Emotion, volume 23, number 2, pp. 209–237.
doi: http://dx.doi.org/10.1080/02699930802204677, accessed on 29 August 2014.

Z. Miller, B. Dickinson and W. Hu, 2012. “Gender prediction on Twitter using stream algorithms with n–gram character features,” International Journal of Intelligence Science, volume 2, number 24, pp. 143–148.
doi: http://dx.doi.org/10.4236/ijis.2012.224019, accessed on 29 August 2014.

A. Mislove, B. Viswanath, K.P. Gummadi and P. Druschel, 2010. “You are who you know: Inferring user profiles in online social networks,” Proceedings of the Third ACM International Conference on Web Search and Data Mining, pp. 251–260.
doi: http://dx.doi.org/10.1145/1718487.1718519, accessed on 29 August 2014.

M.L. Newman, C.J. Groom, L.D. Handelman and J.W. Pennebaker, 2008. “Gender differences in language use: An analysis of 14,000 text samples,” Discourse Processes, volume 45, number 3, pp. 211–236.
doi: http://dx.doi.org/10.1080/01638530802073712, accessed on 29 August 2014.

C. Peersman, W. Daelemans and L. Van Vaerenbergh, 2011. “Predicting age and gender in online social networks,” SMUC ’11: Proceedings of the Third International Workshop on Search and Mining User–Generated Content, pp. 37–44.
doi: http://dx.doi.org/10.1145/2065023.2065035, accessed on 29 August 2014.

J.W. Pennebaker, C.K. Chung, M. Ireland, A. Gonzales and R.J. Booth, 2007. “The development and psychometric properties of LIWC2007,” at http://homepage.psy.utexas.edu/HomePage/Faculty/Pennebaker/Reprints/LIWC2007_LanguageManual.pdf, accessed on 29 August 2014.

PsychPage, 2014. “List of feeling words,” at http://www.psychpage.com/learning/library/assess/feelings.html, accessed on 29 August 2014.

D. Rao and D. Yarowsky, 2010. “Detecting latent user properties in social media,” Proceedings of the Neural Information Processing Systems (NIPS) MLSN Workshop.

D. Rao, D. Yarowsky, A. Shreevats and M. Gupta, 2010. “Classifying latent user attributes in Twitter,” SMUC ’10: Proceedings of the Second International Workshop on Search and Mining User–Generated Content, pp. 37–44.
doi: http://dx.doi.org/10.1145/1871985.1871993, accessed on 29 August 2014.

D. Rubin and K. Greene, 1992. “Gender–typical style in written language,” Research in the Teaching of English, volume 26, number 1, pp. 7–40.

S.A. Shields, D.N. Garner, B. Di Leone and A.M. Hadley, 2006. “Gender and emotion,” In: J.E. Stets and J.H. Turner (editors). Handbook of the sociology of emotions. New York: Springer, pp. 63–83.
doi: http://dx.doi.org/10.1007/978-0-387-30715-2_4, accessed on 29 August 2014.

K.S. Sterkel, 1988. “The relationship between gender and writing style in business communications,” International Journal of Business Communication, volume 25, number 4, pp. 17–38.
doi: http://dx.doi.org/10.1177/002194368802500402, accessed on 29 August 2014.

W. Xu, X. Zhou and L. Li, 2008. “Inferring privacy information via social relations,” ICDEW 2008: IEEE 24th International Conference on Data Engineering Workshop, pp. 525–530.
doi: http://dx.doi.org/10.1109/ICDEW.2008.4498373, accessed on 29 August 2014.

F. Al Zamal, W. Liu and D. Ruths, 2012. “Homophily and latent attribute inference: Inferring latent attributes of Twitter users from neighbors,” Sixth International AAAI Conference on Weblogs and Social Media, pp. 387–390, at https://www.aaai.org/ocs/index.php/ICWSM/ICWSM12/paper/view/4713, accessed on 29 August 2014.

E. Zheleva and L. Getoor, 2009. “To join or not to join: The illusion of privacy in social networks with mixed public and private user profiles,” WWW ’09: Proceedings of the 18th International Conference on World Wide Web, pp. 531–540.
doi: http://dx.doi.org/10.1145/1526709.1526781, accessed on 29 August 2014.


Appendix: Function words

Articles
a, an, the

Pronouns
all, everybody, his, most, other, that, what, your, another, everyone, I, much, others, theirs, whatever, yours, any, everything, it, myself, ours, them, which, yourself, anybody, few, its, neither, ourselves, themselves, whichever, yourselves, anyone, he, itself, no one, several, these, who, anything, her, little, nobody, she, they, whoever, both, hers, many, none, some, this, whom, each, herself, me, nothing, somebody, those, whomever, each other, him, mine, one, someone, us, whose, either, himself, more, one another, something, we, you

Auxiliary verbs
are, can, didn’t, hadn’t, haven’t, might, shouldn’t, won’t, aren’t, cannot, do, ’d, ’ve, mightn’t, was, ’ll, ain’t, can’t, don’t, has, is, mustn’t, wasn’t, would, ’re, could, does, hasn’t, isn’t, shall, were, wouldn’t, be, couldn’t, doesn’t, ’s, shan’t, weren’t, ’d, been, did, had, have, may, should, will

Conjunctions
and, or, though, now, that, if, while, in order that, in case, because, yet, unless, even though, now that, whereas, even if, nor, so, when, although, only if, whether or not, until

Interjections
adios, bah, dear, ha–ha, howdy, oops, tush, whoosh, ah, begorra, doh, hail, hoy, ouch, tut, wow, aha, behold, duh, hallelujah, huh, phew, tut–tut, yay, ahem, bejesus, eh, heigh–ho, humph, phooey, ugh, yikes, ahoy, bingo, encore, hello, hurray, pip–pip, uh–huh, yippee, alack, bleep, eureka, hem, hush, pooh, uh–oh, yo, alas, boo, fie, hey, indeed, pshaw, uh–uh, yoicks, all hail, bravo, gee, hey presto, jeepers creepers, rats, viva, yoo–hoo, alleluia, bye, gee whiz, hi, jeez, righto, voila, yuk, aloha, cheerio, gesundheit, hip, lo and behold, scat, wahoo, yummy, amen, cheers, goodness, hmm, man, shoo, well, zap, attaboy, ciao, gosh, ho, my word, shoot, whoa, aw, crikey, great, ho hum, now, so long, whoopee, ay, cripes, hah, hot dog, ooh, touché, whoops

Adverbial and prepositional words
aboard, astride, down, of, through, worth, on to, in front of, about, at, during, off, throughout, according to, onto, in lieu of, above, athwart, except, on, till, ahead to, out from, in place of, absent, atop, failing, onto, to, as to, out of, in spite of, across, barring, following, opposite, toward, aside from, outside of, on account of, after, before, for, out, towards, because of, owing to, on behalf of, against, behind, from, outside, under, close to, prior to, on top of, along, below, in, over, underneath, due to, pursuant to, versus, alongside, beneath, inside, past, unlike, except for, regardless of, concerning, amid, beside, into, per, until, far from, subsequent to, considering, amidst, besides, like, plus, up, in to, as far as, regarding, among, between, mid, regarding, upon, into, as well as, apart from, amongst, beyond, minus, round, via, inside of, by means of, around, but, near, save, with, instead of, in accordance with, as, by, next, since, within, near to, in addition to, aslant, despite, notwithstanding, than, without, next to, in case of


Editorial history

Received 4 February 2014; revised 17 August 2014; accepted 18 August 2014.

Copyright © 2014, First Monday.
Copyright © 2014, Athanasios Kokkos and Theodoros Tzouramanis. All Rights Reserved.

A robust gender inference model for online social networks and its application to LinkedIn and Twitter
by Athanasios Kokkos and Theodoros Tzouramanis.
First Monday, Volume 19, Number 9 - 1 September 2014
doi: http://dx.doi.org/10.5210/fm.v19i9.5216

© First Monday, 1995-2017. ISSN 1396-0466.