Follow the Money: A Study of Cashtags on Twitter

Martin Hentschel, Microsoft, hemartin@microsoft.com
Omar Alonso, Microsoft, omar.alonso@microsoft.com

May 12, 2014

Abstract

The popularity of Twitter goes beyond trending topics, world events, and popular hashtags. Recently, a new way of sharing financial information has emerged in social media under the name of cashtags: stock ticker symbols that are prefixed with a dollar sign. In this study we present an analysis of cashtags on Twitter. Specifically, we investigate how widespread cashtags are, which stock symbols are tweeted most often, and which users tweet about cashtags. We analyze relationships among cashtags and study hashtags in the context of cashtags. Finally, we compare tweet performance to stock market performance. We conclude that cashtags, in particular in combination with other cashtags or hashtags, provide new insights into stocks and companies.

Introduction

The increased popularity of Twitter as one of the most important sources of real-time information on the Internet makes it a great platform to broadcast time-sensitive information. While users tweet about a wide range of topics and events, certain writing styles like hashtags are now well understood as a topical marker or context. Similarly, the use of cashtags is growing as a mechanism to denote a financial theme in a tweet.

Figure 1: Screenshot of a tweet with a cashtag.

Cashtags are stock ticker symbols that are prefixed with a dollar sign. For example, to tweet about Microsoft stock, you would use $MSFT; for Apple and Google, you would use $AAPL and $GOOG. Figure 1 shows an example of a tweet with the cashtag $MSFT. Following the same strategy as with hashtags, Twitter made cashtags clickable in July 2012. A click on a cashtag results in a search for tweets containing this cashtag. Cashtags, however, were already in use before Twitter created this feature, mainly driven by third-party services like StockTwits (http://stocktwits.com).
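To make the definition of a cashtag concrete, the following is a minimal sketch of how cashtags could be pulled out of tweet text. The pattern (a dollar sign followed by one to five ASCII letters that is not glued to other word characters), the function name `extract_cashtags`, and the optional `known_symbols` filter are our assumptions for illustration; they are neither Twitter's official matching rules nor the extraction procedure used in this study.

```python
import re

# Assumption: a cashtag is "$" followed by 1 to 5 ASCII letters, and it is
# not directly attached to other letters, digits, "_" or "$" on either side.
CASHTAG_RE = re.compile(r"(?<![A-Za-z0-9_$])\$([A-Za-z]{1,5})(?![A-Za-z0-9_$])")


def extract_cashtags(text, known_symbols=None):
    """Return upper-cased ticker symbols in order of first appearance.

    If known_symbols is given (e.g. the set of listed NASDAQ/NYSE symbols),
    candidates that are not in that set are dropped.
    """
    seen = []
    for match in CASHTAG_RE.finditer(text):
        symbol = match.group(1).upper()
        if known_symbols is not None and symbol not in known_symbols:
            continue
        if symbol not in seen:
            seen.append(symbol)
    return seen


if __name__ == "__main__":
    listed = {"MSFT", "AAPL", "GOOG"}  # hypothetical symbol list
    tweet = "Watching $MSFT, $aapl and $GOOG; lunch was $12."
    print(extract_cashtags(tweet, listed))
```

Restricting matches to a list of known symbols keeps amounts such as "$12" and arbitrary words out of the result, but it cannot tell whether a matching token was actually meant as a stock reference.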
In this study, we explore cashtags in more detail as a way of communicating financial information on Twitter. In contrast to previous work, we are not interested in stock market prediction or modeling stock behavior based on sentiment. Our goal is to understand the main characteristics of cashtags and provide insights into stock market data on Twitter. In particular, we study the following properties of cashtags in this article:

1. The distribution of cashtags on Twitter, most-mentioned companies and business sectors.
2. The characteristics of users tweeting about cashtags.
3. The relationship between cashtags and between cashtags and hashtags.
4. The connection of tweet performance and market performance.

Related work

Existing research explores financial data on Twitter to predict the stock market. Typically, this work focuses on extracting sentiment from tweets and using it to predict trading volume (Bordino et al., 2014), stock market returns (Loughlin and Harnisch, accessed 12 May 2014; Zhang and Skiena, 2010; Bollen et al., 2011), and volatility (Oliveira et al., 2013). Bar-Haim et al. identify expert investors and develop a trading strategy based on user expertise (Bar-Haim et al., 2011). TweetTrader.net is a research project where users explicitly annotate tweets with sentiment in a game-like environment (Sprenger, 2011). All of this work revolves around extracting financial information from Twitter, although only a few of these works mention the term cashtag explicitly (Oliveira et al., 2013; Bar-Haim et al., 2011; Sprenger, 2011). To the best of our knowledge, there exists no prior work about the properties of cashtags themselves.

Spam detection is necessary to filter tweets sent by spammers. There is a large body of work on spam detection on Twitter. Yardi et al. were among the first to study spam on Twitter specifically (Yardi et al., 2009).
In their study, the authors examine features for spam detection such as user age, tweet frequency, friend-follower ratio, and user clusters. Follow-up work studies features such as text content and timing of posts (Chu et al., 2010), network distance and connectivity (Song et al., 2011), keywords and URLs (McCord and Chuah, 2011), message similarity and friend name similarity (Stringhini et al., 2010), and image content (Jin et al., 2011) to discriminate spammers from regular users. All of this work can be leveraged to detect and filter spam tweets and spam users before analyzing financial data on Twitter.

Dataset

The dataset used in our study consists of all public tweets in English from April 2013 that contain at least one cashtag referring to a stock listed on NASDAQ or the New York Stock Exchange (NYSE). To compare the distribution of cashtags in April 2013 to previous years, we also analyzed tweets with the same cashtags from April 2011 and April 2012. We further filtered tweets using a proprietary spam filtering technique similar to the techniques described in the related work section.

In April 2013, there were 2,691 different stock symbols (companies) listed on NASDAQ and 3,263 stock symbols listed on the NYSE. The lists of stock symbols of both stock exchanges were downloaded from http://www.nasdaq.com. In addition, these lists contain a mapping from stock symbol to company name as well as a classification of these companies into business sectors (e.g., technology or finance).

In the last section of this study we compare tweet performance to stock market performance. Stock market data (i.e., the closing price in US dollars per stock) was downloaded from MSN Money at http://money.msn.com.

Cashtag usage on Twitter

In this section we analyze basic properties of cashtags on Twitter. First, we look at how widespread tweets with cashtags are on Twitter. Second, we analyze the coverage of stock symbols.
Third, we report on the top cashtags used in tweets. And finally, we calculate a ranking of business sectors.

Distribution of cashtag tweets

First, we study the distribution of tweets with cashtags. The following table contains the total number of tweets with cashtags from NASDAQ and the NYSE. Furthermore, the table also contains an approximation of the ratio of cashtag tweets vs. all tweets. This approximation was calculated using a 1% sample of cashtag tweets and comparing it to a 1% sample of all tweets in English of the respective months.

Table 1: Distribution of tweets with cashtags on Twitter.

We make the following observations from Table 1: (1) The absolute number of cashtag tweets increased over the last years. (2) However, the fraction of tweets that contain cashtags remained fairly constant. Only 0.012% of all tweets in English (roughly every 8,000th tweet) contain a stock symbol from NASDAQ or NYSE. Even though Twitter made cashtags clickable in July 2012, the fraction of tweets that contain stock symbols did not increase.

Coverage of stock symbols

In April 2013, 2,658 different NASDAQ stocks were mentioned via cashtags on Twitter. That is a coverage of 98.8% of all NASDAQ stocks. Similarly, 2,680 different NYSE stocks were mentioned via cashtags, which is a coverage of 82.1%.

Figure 2: Distribution of cashtag mentions.

Figure 2 shows the distribution of cashtag mentions, that is, how many times each cashtag was mentioned on Twitter. We see that most cashtags were mentioned between 11 and 100 times. The distribution is similar for NASDAQ and NYSE stocks.

Top cashtags

Next, we analyze which cashtags are tweeted about most often. Table 2 shows the top 10 NASDAQ stocks that were tweeted most often in April 2013. In Table 2, Tweets (all) shows the number of all tweets that were tweeted in April 2013, including spam tweets. Tweets (no spam) only shows tweets that were classified as non-spam.

Table 2: Top 10 cashtags (NASDAQ, contains spam).
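The coverage figures and the bucketed distribution of mentions reported above can be recomputed with a few lines of code. The sketch below is illustrative only: the function names are ours, and the power-of-ten buckets (1-10, 11-100, 101-1000, ...) are an assumption inferred from the "between 11 and 100" range quoted for Figure 2. The toy mention counts are hypothetical.

```python
from collections import Counter


def coverage_percent(mentioned, listed):
    """Share of listed stock symbols that were mentioned at least once, in %."""
    return round(100.0 * mentioned / listed, 1)


def decade_bucket(mentions):
    """Label of the power-of-ten bucket a mention count falls into.

    Assumption: buckets are 1-10, 11-100, 101-1000, and so on.
    """
    if mentions < 1:
        raise ValueError("a mentioned cashtag has at least one mention")
    upper = 10
    while mentions > upper:
        upper *= 10
    lower = 1 if upper == 10 else upper // 10 + 1
    return f"{lower}-{upper}"


def mention_distribution(mention_counts):
    """Map bucket label -> number of cashtags whose count falls in that bucket."""
    return Counter(decade_bucket(n) for n in mention_counts.values())


if __name__ == "__main__":
    # Counts taken from the text: mentioned symbols vs. listed symbols.
    print("NASDAQ coverage:", coverage_percent(2658, 2691))
    print("NYSE coverage:  ", coverage_percent(2680, 3263))

    # Hypothetical per-cashtag mention counts, for illustration only.
    toy_counts = {"AAA": 3, "BBB": 42, "CCC": 77, "DDD": 1500}
    print(mention_distribution(toy_counts))
```

With the counts from the text, the two coverage calls reproduce the reported 98.8% and 82.1%.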
Interestingly, there are cashtags that have an extremely high spam rate of up to 99.9% (emphasized in Table 2). These cashtags are almost exclusively tweeted about by spam accounts on Twitter. The most spammy stock symbols even appear in the list of top 10 most tweeted stock symbols of each stock exchange: $LXRX (Lexicon Pharmaceuticals, Inc.) and $SNTS (Santarus, Inc.) for NASDAQ, and $B (Barnes Group, Inc.) for the NYSE (not shown in Table 2). We did not further analyze why these cashtags were tweeted so often; perhaps these are efforts to manipulate stock trading models that are based on Twitter data.

Table 3: Top 10 cashtags (no spam).

In the remainder of this study, we disregard spam tweets and only analyze tweets that are classified as non-spam. Table 3 shows the top 10 most tweeted stock symbols from NASDAQ and NYSE in April 2013 without spam. We make the following observations: (1) By a large margin, the number one stock discussed on Twitter in April 2013 was Apple, followed by Google, Facebook, Microsoft, and Netflix. Interestingly, all these companies are in the technology sector. (2) NASDAQ stocks are tweeted more often than stocks from the NYSE. However, the top tweeted NYSE stock, Walgreen Co., ranks number five overall. Clearly, there is an interest in stocks from both stock exchanges.

Business sectors

Table 4: Tweets per business sector.

Looking at business sectors, we see that technology companies are discussed most often on Twitter, followed by consumer services companies and finance companies. Table 4 shows the distribution of business sectors. The classification of stocks into business sectors is provided by NASDAQ.

Cashtag-tweeting Twitter users

Next, we give an overview of the users that tweet about stock symbols on Twitter. Here, we focus not only on personal Twitter accounts but on all Twitter accounts, including company accounts and news agency accounts.
Specifically, we show the following: (1) the top Twitter accounts by number of tweets, (2) the top Twitter accounts by number of followers, and (3) the distribution of tweets over the follower count of a Twitter user.

Table 5: Top 10 Twitter accounts by number of tweets.

Table 6: Top 10 Twitter accounts by number of followers.

The top 10 Twitter accounts by number of tweets are solely companies or news agencies (Table 5). These are mostly automated Twitter accounts or semi-automatic accounts that combine automated updates with tweets that were triggered manually (e.g., an author published a new website article). The top 10 Twitter accounts by number of followers are mostly news agencies (e.g., Reuters, Wall Street Journal) and accounts of celebrities (e.g., MC Hammer, Drizzy). This is shown in Table 6. The list of cashtag-tweeting Twitter accounts sorted by number of followers differs completely from the list of Twitter accounts sorted by number of tweets.

Table 7: Tweets containing cashtags from people with more than 1 million followers.

Table 7 gives some examples of tweets from accounts with more than 1 million followers that wrote only a single tweet containing a cashtag. Mostly, these are tweeted news articles, but there is also a tweet with a personal opinion about the PC business (from Om Malik, @om) and one tweet which is an outlier (from Drizzy, @Drake). Drizzy’s tweet is an outlier because the term $AP is not used as a cashtag in this context. Mitigating outliers is briefly discussed at the end of this study.

Figure 3: Distribution of tweets over followers.

Finally, we look at the distribution of tweets over the number of followers (Figure 3). Most tweets with cashtags are sent from Twitter accounts with 101–1,000 followers, because Twitter accounts in this range are more common than Twitter accounts with more followers. Stocks from the NYSE are tweeted more often from Twitter accounts with high follower counts.
The reason is that companies listed on NASDAQ (e.g., Apple, Google, Microsoft) most often produce consumer-oriented products, which are discussed more often on Twitter, including by users with lower follower counts.

Cashtag relationships

In this section we analyze relationships between cashtags, and between cashtags and hashtags. We do so by analyzing co-occurrences. We say that a cashtag co-occurs with another cashtag if they appear in the same tweet. Similarly, a cashtag co-occurs with a hashtag if the cashtag and hashtag appear in the same tweet. Analyzing co-occurrences reveals insights into relationships among companies (via cashtag co-occurrences) and allows us to group companies into categories (via cashtag/hashtag co-occurrences).

Co-occurrences of cashtags

First we study co-occurrences of cashtags with other cashtags. That is, we analyze which cashtags were mentioned together in tweets most often. For example, the tweet "Google Now comes to iPhone, challenging Apple’s Siri: http://cnb.cx/Y8Gl0H $GOOG $AAPL" mentions the cashtags $AAPL and $GOOG, and we count this as one co-occurrence of Apple and Google. We do this for every possible pair of cashtags. If there are more than two stock symbols in one tweet (e.g., the tweet "15 Ways Technology Is Reinventing Society $GOOG $AAPL $FB by @meganrosedickey http://read.bi/167sWL6"), then each pair of stock symbols is counted separately. In this example, we count the pairs $GOOG $AAPL, $GOOG $FB, and $AAPL $FB each as one co-occurrence.

The most mentioned pair of cashtags in April 2013 was $AAPL (Apple) and $GOOG (Google) with 3,216 co-occurrences. In April 2013, Google introduced Google Now for Apple’s iPhone, which led to many tweets simultaneously mentioning cashtags of both companies. The second most mentioned pair was $AAPL and $QQQ (PowerShares QQQ Trust) with 2,093 co-occurrences. QQQ is an exchange-traded fund that contains Apple stock.
QQQ’s market performance was dragged down by the market performance of Apple in April 2013, resulting in many tweets. The relationship to Apple also explains why QQQ appears in the top 10 list of most mentioned stocks in April 2013 (see Table 3).

Figure 4: Distribution of co-occurrences.

In general, co-occurrence counts follow a power-law distribution. Most of the cashtag pairs co-occur only once, while few pairs co-occur many times. The distribution of co-occurrences is shown in Figure 4. One cashtag pair was mentioned 3,216 times (the lower right point in the plot) while 14,300 unique pairs were mentioned exactly once (the upper left point).

Figure 5: Co-occurrence graph of cashtags (≥80 co-occurrences, NASDAQ only).

Using co-occurrences of cashtags, we can construct a co-occurrence graph. The graph consists of nodes, which represent cashtags, and edges, which represent co-occurrences. The co-occurrence graph is shown in Figure 5. The graph shows NASDAQ stock symbols that were mentioned together at least 80 times. (This threshold was picked for better presentation.) The thickness and darkness of the edges indicate the number of co-occurrences: the thicker and darker the edge, the more often a cashtag pair was mentioned. We make the following observations:

1. Technology companies form a large, tightly connected cluster. The cluster consists of 26 companies, of which 23 are technology companies. The exceptions are Starbucks (SBUX), PowerShares QQQ Trust (QQQ), and Amazon (AMZN), which surely has a technology affiliation as well. There are also smaller, independent clusters. For example, BIIB, AMGN, CELG, GILD, and TTWO form an independent cluster; all of them are biotech companies (with the exception of TTWO, which is an entertainment software company). Other connected companies are AVEO and DCTH (pharmaceuticals and medical companies), SSYS and XONE (computer equipment), CFFN and NDAQ (savings and brokers), and DRRX and PTIE (pharmaceuticals).

2.
From the graph, it is easy to determine the main competitors of each company. For example, Microsoft is in close competition with Apple and Google, and less so with Intel (INTC; here tweets are more about the dependence of Intel on Microsoft), RIM (BBRY), and Dell (DELL). Interestingly, Microsoft is the only company with a connection to Oracle. (Again, the graph only includes cashtag pairs with at least 80 co-occurrences.)

3. An interesting pair is Google and Vringo, Inc. (VRNG). This pair appears because both companies were in a patent lawsuit in April 2013.

In summary, the co-occurrence graph reveals not only knowledge that was expected in the first place (e.g., Google and Apple being close competitors), but also interesting new knowledge (e.g., QQQ being strongly connected to Apple, Vringo being in a lawsuit with Google). The co-occurrence graph makes such relationships between companies clearly visible.

Co-occurrences of cashtags and hashtags

Next we study co-occurrences of cashtags and hashtags. Here we analyze which cashtags co-occur most often with which hashtags. Interestingly, this allows us to group companies into categories based on keywords (hashtags). For this study we extracted hashtag/cashtag pairs from all tweets in April 2013. We computed a co-occurrence measure C using co-occurrence counts and cashtag counts. The co-occurrence count CO is the number of times the cashtag/hashtag pair was mentioned. The cashtag count CT is the number of times the cashtag appeared in total. Finally, the co-occurrence measure C is defined as:

\[ C = \frac{CO}{\log(CT)} \]

We divide the co-occurrence count by the logarithm of the cashtag count to account for the skewness of cashtags. Otherwise, if we only ranked cashtags by co-occurrence count, often-mentioned cashtags would be correlated with many hashtags because they have high co-occurrence counts with many hashtags. Dividing by the logarithm of the cashtag count mitigates this problem.
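The co-occurrence counting described in this section, both for cashtag pairs and for cashtag/hashtag pairs, together with the measure C, can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the regular expressions and function names are ours, each tag is counted at most once per tweet, the natural logarithm is used because the base of the logarithm is not specified, and a cashtag that appears only once (where the logarithm is zero) is given C = 0.

```python
import math
import re
from collections import Counter
from itertools import combinations

# Assumed tag patterns (not Twitter's official rules).
CASHTAG_RE = re.compile(r"(?<![A-Za-z0-9_$])\$([A-Za-z]{1,5})(?![A-Za-z0-9_$])")
HASHTAG_RE = re.compile(r"(?<![A-Za-z0-9_&])#(\w+)")


def tags(tweet):
    """Return (sorted unique cashtags, sorted unique hashtags) of one tweet."""
    cashtags = sorted({c.upper() for c in CASHTAG_RE.findall(tweet)})
    hashtags = sorted({h.lower() for h in HASHTAG_RE.findall(tweet)})
    return cashtags, hashtags


def count_cooccurrences(tweets):
    """Count cashtag pairs, hashtag/cashtag pairs (CO), and cashtags (CT)."""
    pair_counts = Counter()     # (cashtag, cashtag) -> co-occurrences
    tag_counts = Counter()      # (hashtag, cashtag) -> CO
    cashtag_counts = Counter()  # cashtag -> CT
    for tweet in tweets:
        cashtags, hashtags = tags(tweet)
        cashtag_counts.update(cashtags)
        # Every unordered pair of cashtags in the tweet counts once.
        pair_counts.update(combinations(cashtags, 2))
        tag_counts.update((h, c) for h in hashtags for c in cashtags)
    return pair_counts, tag_counts, cashtag_counts


def cooccurrence_measure(co, ct):
    """C = CO / log(CT), with natural log (assumed) and C = 0 for CT < 2."""
    if ct < 2:
        return 0.0
    return co / math.log(ct)


if __name__ == "__main__":
    sample = [
        "Google Now comes to iPhone, challenging Apple's Siri: $GOOG $AAPL",
        "15 Ways Technology Is Reinventing Society $GOOG $AAPL $FB",
        "First #Solar Sets CdTe Module Efficiency World Record $FSLR",
    ]
    pairs, tag_pairs, totals = count_cooccurrences(sample)
    print(pairs.most_common(3))
    for (hashtag, cashtag), co in tag_pairs.items():
        print(hashtag, cashtag, cooccurrence_measure(co, totals[cashtag]))
```

Sorting the cashtags of a tweet before forming pairs ensures that a pair such as $GOOG/$AAPL is stored under a single key regardless of the order in which the symbols appear. With hypothetical counts, a niche cashtag with CO = 20 and CT = 50 scores higher than a heavily tweeted one with CO = 50 and CT = 100,000, which is the damping effect the logarithm is meant to provide.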
Table 8: Top 10 cashtags co-occurring with hashtags #cloud, #biotech, and #retail; including co-occurrence count CO, cashtag count CT, and co-occurrence measure C.

Table 8 shows the top 10 cashtags co-occurring with hashtags #cloud, #biotech, and #retail. For hashtag #cloud, the most related companies are exclusively companies with a strong cloud business, including Amazon, HP, and Microsoft. As a side note, Twitter users seem to confuse Hewlett Packard’s stock symbol HPQ with HP. HP is the stock symbol of Helmerich & Payne, a company specialized in the drilling of oil and gas wells. Here we assume that tweets containing $HP and #cloud are referring to Hewlett Packard instead of Helmerich & Payne. For hashtag #biotech, we exclusively get biotech companies as related companies, and for hashtag #retail we get retail companies.

The main advantages of analyzing co-occurrences of cashtags and hashtags are that it is possible to (1) automatically group companies into categories, (2) have a fine-grained grouping of companies into categories, and (3) capture changing business models of companies over time. NASDAQ, NYSE, and other stock exchanges already provide a categorization of companies into business sectors. However, this categorization is static and coarse-grained (see Table 4). Using hashtag/cashtag co-occurrences allows us to have a more fine-grained grouping of companies into categories (e.g., there exists no category of cloud computing in NASDAQ as of today) and capture evolving business models of companies (e.g., Amazon is officially listed as a consumer services company in the catalog/specialty distribution industry, without any mention of its cloud business model). These insights can help people make more informed stock-trading decisions.

Tweet performance vs. market performance

In this section we compare tweet performance of cashtags versus market performance of stocks.
Specifically, we compare tweet volume (the number of cashtag tweets per day) to the closing price of the corresponding stock for the same day. The goal is to analyze whether there is a relationship between tweet performance and market performance.

Table 9: Tweets per day, April 2013.

In Table 9 we show the top 20 tweeted stocks from NASDAQ and plot, for every stock, the number of tweets per day. Interestingly, almost every stock spikes on some day during the observed time period of April 2013. Even Apple, the most tweeted stock, shows a clear spike with 8,990 tweets on April 23, which is 8.4 times the median of tweets per day for Apple for this month. Other stocks like Netflix (NFLX), First Solar (FSLR), or Starbucks (SBUX) spike at 11.7x, 20.9x, and 10.4x (the maximum divided by the median of the month). Typically, each spike corresponds to a newsworthy event that is distributed on Twitter. For Apple, the spike on April 23, 2013 corresponded to the release of the second quarter results. As we will see, sometimes market performance is related to such spikes and sometimes it is not.

We focus on four individual stocks and compare tweet performance to market performance: Netflix (NFLX), First Solar (FSLR), Starbucks (SBUX), and Tripadvisor.com (TRIP). With the exception of Tripadvisor.com, these stocks have the highest spikes of tweets per day compared to their median tweets per day. Tripadvisor.com spikes at a factor of 4.93, which is only the 16th highest spike. However, we included Tripadvisor.com to highlight the difficulties of analyzing cashtag tweets on Twitter.

Figure 6: Closing price ($) and tweet volume per day for April 2013.

Figure 6 overlays tweet volume (number of tweets per day) with market performance (closing price per day in US dollars) of the mentioned stocks. In the following we study each stock in detail.

Netflix

Figure 6a shows Netflix’s closing price and number of tweets for every day in April 2013.
Tweets per day spiked on April 22 and similarly the stock price went up 15%. On April 22, Netflix released its quarterly earnings, which were above estimates. Tweets covered the earnings report as well as stock performance. For example, @SAI tweeted "NETFLIX EXPLODES AFTER HOURS BEATING ESTIMATES $NFLX by @officialKLS http://read.bi/17eiT4Q."

First Solar

Figure 6b shows tweet vs. market performance for First Solar ($FSLR). On April 9, tweets spiked for $FSLR and similarly the stock price shot up 18%. On that day, First Solar released a press statement revising its financial guidance upward and publicizing a new world record for solar panel efficiency. Twitter and the market picked it up. The following tweets are examples of what happened on that day. @edgunther tweeted "S3 eff? RT @FirstSolar: First #Solar Sets CdTe Module Efficiency World Record, Launches Series 3 Black Module http://bit.ly/17oFz3l $FSLR", @CNBCnow tweeted "First Solar shares have been halted; $FSLR up 18%. QUOTE: http://cnb.cx/16KXJew."

Starbucks

While the previous examples suggest a correlation between tweet performance and market performance, the following examples paint another picture. Even though there are clear spikes in tweets per day, the stock prices do not follow. Figure 6c overlays tweets per day and closing price for Starbucks. On April 25, tweet volume shows a clear spike. On that day, Starbucks released its quarterly earnings. They were discussed and linked on Twitter, but because they were just as estimated, they had no dramatic effect on the stock price.

Tripadvisor.com

Figure 6d shows that there is a spike for Tripadvisor.com on April 20. However, the spike on that day had nothing to do with the company at all. The somewhat famous Twitter account @RihannaDaily with 361,000 followers (at the time) tweeted on April 20: "$trip club$ and dollar bill$...". This tweet was retweeted more than 100 times the same day and led to the spike in Figure 6d, which is a false positive.
This shows that an analysis of cashtags has to be made robust against false positives, for example by pruning similar tweets, determining the percentage of URLs in tweets, or relying on tweets from trusted users only (e.g., leveraging Twitter’s verified users).

In summary, these examples suggest that tweet volume and stock market performance are sometimes, but not always, related. Clearly, more research is needed, and sentiment analysis may play an important role here.

Conclusions

We conducted a large study that explored cashtags on Twitter. Cashtags are stock ticker symbols prefixed with a dollar sign. We showed that in April 2013, approximately every 8,000th tweet (or 0.012% of all tweets) contained a cashtag of a company listed on NASDAQ or the NYSE. Some of these stock symbols are abused by spammers. Therefore it is critical to filter out spam when analyzing stock symbols. Furthermore, there are cases where users confuse cashtags (e.g., using $HP instead of $HPQ to refer to Hewlett Packard), which requires care when analyzing financial data on Twitter.

Twitter accounts that tweet about cashtags are news agencies and journals, which tweet frequently using cashtags, but also individuals, who tweet less frequently about cashtags. Accounts with 101–1,000 followers tweet most often about cashtags.

Relationships between cashtags and between cashtags and hashtags reveal interesting insights. We measured relationships by counting co-occurring cashtags and co-occurring cashtag/hashtag pairs in tweets. Co-occurrences of cashtags reveal the main competitors of companies. Co-occurrences of cashtags with hashtags allow us to group companies into fine-grained clusters.

Finally, we analyzed the connection between tweet volume and market performance. We showed that sometimes tweet volume and market performance are indeed related. However, there are examples where tweet volume and market performance are uncorrelated.
To conclude, cashtags provide a distinct way of analyzing financial data on Twitter. Cashtag tweets can yield new insights about stocks and companies. We expect new experiences that leverage cashtags to provide users with novel ways of consuming financial information.

References

Roy Bar-Haim, Elad Dinur, Ronen Feldman, Moshe Fresko, and Guy Goldstein. Identifying and following expert investors in stock microblogs. In Conference on Empirical Methods on Natural Language Processing, pages 1310–1319, 2011.

Johan Bollen, Huina Mao, and Xiao-Jun Zeng. Twitter mood predicts the stock market. Journal of Computational Science, 2(1):1–8, 2011.

Ilaria Bordino, Nicolas Kourtellis, Nikolay Laptev, and Youssef Billawala. Stock trade volume prediction with Yahoo Finance user browsing behavior. In International Conference on Data Engineering, 2014.

Zi Chu, Steven Gianvecchio, Haining Wang, and Sushil Jajodia. Who is tweeting on Twitter: human, bot, or cyborg? In Annual Computer Security Applications Conference, pages 21–30, 2010.

Xin Jin, Cindy Xide Lin, Jiebo Luo, and Jiawei Han. SocialSpamGuard: A data mining-based spam detection system for social media networks. Proceedings of the VLDB Endowment, 4(12):1458–1461, 2011.

Chris Loughlin and Erik Harnisch. The viability of StockTwits and Google Trends to predict the stock market. http://stocktwits.com/research/Viability-of-StockTwits-and-Google-Trends-Loughlin Harnisch.pdf, accessed 12 May 2014.

Michael C. McCord and Mooi-Choo Chuah. Spam detection on Twitter using traditional classifiers. In International Conference on Autonomic and Trusted Computing, pages 175–186, 2011.

Nuno Oliveira, Paulo Cortez, and Nelson Areal. Some experiments on modeling stock market behavior using investor sentiment analysis and posting volume from Twitter. In International Conference on Web Intelligence, Mining and Semantics, page 31, 2013.

Jonghyuk Song, Sangho Lee, and Jong Kim. Spam filtering in Twitter using sender-receiver relationship.
In International Symposium on Recent Advances in Intrusion Detection, pages 301–317, 2011.

Timm O. Sprenger. TweetTrader.net: Leveraging crowd wisdom in a stock microblogging forum. In International AAAI Conference on Weblogs and Social Media, 2011.

Gianluca Stringhini, Christopher Kruegel, and Giovanni Vigna. Detecting spammers on social networks. In Annual Computer Security Applications Conference, pages 1–9, 2010.

Sarita Yardi, Daniel Romero, Grant Schoenebeck, and danah boyd. Detecting spam in a Twitter network. First Monday, 15(1), 2009.

Wenbin Zhang and Steven Skiena. Trading strategies to exploit blog and news sentiment. In International AAAI Conference on Weblogs and Social Media, 2010.

© Martin Hentschel, Omar Alonso 2014 All Rights Reserved