First Monday

Mapping YouTube: A quantitative exploration of a platformed media system by Bernhard Rieder, Oscar Coromina, and Ariadna Matamoros-Fernandez



Abstract
Over the past 15 years, YouTube has emerged as a large and dominant social media service, giving rise to a ‘platformed media system’ within its technical and regulatory infrastructures. This paper relies on a large-scale sample of channels ( n =36M+) to explore this media system along three main lines. First, we investigate stratification and hierarchization in broadly quantitative terms, connecting to well-known tropes on structural hierarchies emerging in networked systems, where a small number of elite actors often dominate visibility. Second, we inquire into YouTube’s channel categories, their relationships, and their proportions as a means to better understand the topics on offer and their relative importance. Third, we analyze channels according to country affiliation to gain insights into the dynamics and fault lines that align with country and language. Throughout the paper, we emphasize the inductive character of this research by highlighting the many follow-up questions that emerge from our findings.

Contents

1. Introduction
2. Methodology and data
3. Findings
4. Discussion and conclusion

 


 

1. Introduction

Social media platforms play important roles on a global scale and, consequently, have been studied empirically in many different ways. The overwhelming majority of studies, however, rely on issue or user samples to investigate particular slices of social media reality. Due to limited data access, painting an ‘overall’ picture of a platform has always been difficult for independent researchers and the recent ‘APIcalypse’ (Bruns, 2019) has further exacerbated the situation. In the past, researchers could rely on a limited number of newspapers or television channels to assess what kind of media content was on offer and how its production was structured in organizational terms. But the emergence of a ‘hybrid media system’ (Chadwick, 2013), where traditional actors and networked platforms enter into complex constellations, further adds to the difficulty of assessing what is out there.

One of the actors that take a central position in this ‘high-choice media environment’ (van Aelst, et al. , 2017) is YouTube. Since launching as a Web site for sharing videos in 2005 and becoming part of Google one year later, YouTube has become a dominant platform that hosts millions of channels and billions of videos, reaching an audience of more than two billion active users every month [ 1 ]. Long in the shadow of Facebook and Twitter when it comes to data-driven research, YouTube has moved into the center of scholarly interest over the last years, most notably around questions such as extreme political content (Ribeiro, et al. , 2020) and misinformation (Bounegru, et al. , 2020). This literature is concerned with the implications of YouTube’s algorithms in matters of politics and culture (Airoldi, et al. , 2016; Rieder, et al. , 2018) and follows recent media controversies on the site’s role in processes of radicalization, mischief, and abuse. Scholars writing for lay audiences have called YouTube the ‘great radicaliser’ (Tufekci, 2018), a ‘far-Right propaganda machine’ (Lewis, 2020), and a platform that inflicts ‘infrastructural violence’ on children (Bridle, 2017). Beyond these qualifications of YouTube as a threat to democracy, qualitative research has historically offered a more positive side of the platform by examining the quotidian practices of its wide range of amateur and professional users ( e.g. , Abidin, 2019, 2018; Bishop, 2019; Lange, 2007; Sayago, et al. , 2012). Broader theoretical takes ( e.g. , Kessler and Schäfer, 2009; Gillespie, 2010), anthologies ( e.g. , Burgess and Green, 2018, 2009; Lange, 2019; Lovink and Niederer, 2008; Snickars and Vonderau, 2009), and special issues (Arthurs, et al. , 2018) have added to a growing body of research focusing on the importance of the video platform for everyday life, entertainment, politics, and the economy. 
Given its worldwide reach as cultural mediator, YouTube has also attracted scholars researching ‘issues of globalization and cultural difference’ [ 2 ] and, in particular, how tensions between local and global structures challenge established ideas about media globalization (Cunningham and Craig, 2016).

While most empirical YouTube research has focused on specific content creators, genres, texts, and subcultures, an interest to ‘map’ the platform in order to account for what is on offer has driven research since the beginning. Paolillo (2008) examined the social network structure of YouTube early on and Burgess and Green (2009) provided the first broad picture of YouTube’s popular culture in the late 2000s by conducting a content analysis of the most popular videos. In the second edition of their book, however, the authors acknowledge that their empirical approach could not be replicated today since YouTube has evolved from being a Web site for sharing videos to a global media company whose commercial interests are centered on the monetization of channels (Burgess and Green, 2018). And with success: YouTube recently announced for the first time that it generated US$15 billion from advertising in 2019, roughly 10 percent of Google’s overall revenue (Statt, 2020). The quest for profitability has triggered important changes in terms of design, content, and audiences. Reacting to a series of scandals, the company has for example begun to set up stricter rules on what counts as ‘advertiser-friendly’ content, which function as a ‘deterrence to creating risky, edgy or experimental content’ [ 3 ]. The resulting ‘professionalization’ of YouTube’s content creators has attracted scholarly attention, with Cunningham and Craig (2017) coining the term ‘social media entertainment’ to describe the type of content that is popular on YouTube. At the same time, it is clear that videos are produced and uploaded by a wide variety of actors, ranging from amateurs engaging in intimate sharing of their everyday experiences, to star YouTubers with millions of subscribers, to established television networks and music labels that use the platform to distribute their content to mass audiences, and in particular younger viewers.

A series of recent papers (Bärtl, 2018; Paolillo, et al. , 2019) have attempted to provide overall characterizations of YouTube in its current complexity, including available content and user dynamics [ 4 ]. These attempts to broadly describe and analyze an online platform in empirical terms are not only producing interesting insights in their own right but also provide valuable scope and context to other researchers’ work. Our own project follows similar goals and this paper serves to document both our methodology and a number of key findings that take aim at forms of stratification and segmentation marking YouTube, that is, at the hierarchies and dividing lines that carve through the channel landscape. Relying on the Web-API that YouTube provides to external developers, we were able to implement a sampling strategy based on network crawling that resulted in a very large collection of channel data ( n =36M+). Guided by the principles of ‘exploratory data analysis’ (Tukey, 1977), we rely on this sample to shed light on a broad question: What is on YouTube and how is the channel landscape structured? Following the platform’s increasingly important division into channel ‘tiers’ ( cf. , Kumar, 2019), we focus on three analytical directions. First , we investigate stratification and hierarchization in broadly quantitative terms, connecting to well-known tropes on structural hierarchies emerging in networked systems, where a small number of elite actors often dominate visibility ( e.g. , Hindman, 2009). While Pareto distributions and power laws are nothing new, we seek to present and discuss our findings in ways that provide concrete reference points for scholars interested in YouTube, rather than abstract assessments of inequality. Second , we inquire into YouTube’s channel categories, their relationships, and their proportions as a means to better understand the topics on offer and their relative importance. 
Third , we analyze channels according to country affiliation to gain insights into the dynamics and fault lines that align with country and language. Throughout the paper, we emphasize the inductive character of this research by highlighting the many follow-up questions that emerge from our findings.

These three directions together seek to provide a broad picture of YouTube as host to an increasingly substantial ‘platformed media system’ in its own right, that is, to a large and complex media ecology that has developed within YouTube’s technical and regulatory infrastructures. Enabled, guided, and coerced by the company, this media system has fermented a ‘protoindustry of social media entertainment’ [ 5 ] that can reach large audiences and build viable businesses in the process. The next section presents our methodology in detail and discusses the analytical possibilities and limitations it affords.

 

++++++++++

2. Methodology and data

Conceptually, we situate our work within the frame of what statistician John Tukey (1977) called ‘exploratory data analysis’, which rather ‘seeks an approximate answer to the right question, which is often vague, than an exact answer to the wrong question’ [ 6 ]. Much like qualitative approaches such as grounded theory (Glaser and Strauss, 1967), exploratory data analysis can be conceived as inductive , that is, rather than using a preset theoretical framework to formulate narrow research questions and hypotheses, it generates questions, ideas, and theories iteratively in conversation with the data. Tukey’s goal was certainly not to promote carelessness or undirected data crunching, but to argue that many of the empirical phenomena we encounter are not yet sufficiently well understood to set them into the strictures of confirmatory hypothesis testing. This applies very much to large-scale platforms like YouTube, where our understanding is hampered by size and complexity. As we will see, a central outcome of our work is the formulation of new questions and problems that were hardly visible before, adding to the list of directions for follow-up research. Some of this research is already in preparation and this paper serves as a methodological introduction for forthcoming work as well as a broad attempt to investigate stratification and segmentation on YouTube.

2.1. Data collection

Looking at the basic setup or information architecture of YouTube, the two main ‘units’ that structure the platform are videos and channels. Unlike Twitter, which issues a random sample of tweets in real time (Morstatter, et al., 2014; Gerlitz and Rieder, 2013), YouTube provides no easy way to create representative datasets. Most authors have focused on channels as entry points. Bärtl (2018) proposes an approach to creating such a sample of channels (n = 19,025) that uses randomly generated search strings to retrieve channel data (e.g., the author used random key word searches such as ‘why’ or ‘gol’ to collect channels that contain the letters of the queries in their name). However, the exclusion of non-Latin characters is only one reason why it seems doubtful that this method can achieve its goal of providing an accurate picture of what is on YouTube. Popularity bias and other vagaries of YouTube’s API are another, but the main problem is the non-random character of language itself: the distinct phonetic patterns of human languages mean that channel names or video titles are not spread equally over the alphabet and will cluster around particular character patterns. Another recent paper (Paolillo, et al., 2019) uses a combination of searching, browsing, and crawling to create a large collection of channels (n = 549,383), but their idiosyncratic method is not systematic, and the authors do not discuss the biases it may contain. They erroneously state that ‘channel subscriptions are treated as private by the API’ [ 7 ], missing the most important source of connectivity in the network of channels. To be clear, both of these approaches yield interesting results in the spirit of iterative exploration, but also have clear limitations. In the following section, we propose an approach that seeks to deal with these limitations by constructing a much larger sample of channels through crawling. 
As with all strategies that lack privileged access to a platform’s database, this method comes with its own set of shortcomings.

In the late 1990s, attempts to crawl the Web in its entirety were highly popular (e.g., Albert, et al., 1999; Broder, et al., 2000), but such efforts are now mostly [ 8 ] limited to commercial companies like Google, Exalead, or Ahrefs. Software like the Digital Methods Initiative’s Issuecrawler (Rogers, 2002) or Hyphe (Jacomy, et al., 2016), developed at Sciences Po’s médialab, provides researchers with the capabilities to create networks of smaller sub-sections of the Web and similar approaches have been adopted for crawling YouTube, for example to detect extremist content (e.g., Agarwal and Sureka, 2015). One of the perks of crawling is that the collected data have a relational component, making it possible to analyze them both as populations through traditional statistical methods and as networks with the help of concepts like density, centrality, or clustering coefficient. While our work is not part of a larger ‘science of networks’ (Watts, 2004), we will discuss certain network properties of our dataset and apply graph-based methods further down. The main problem with crawling, however, is knowing whether a crawl is complete and, if not, what is missing. This question will be discussed in more detail in the following section.

On the broadest level, our approach is basically an attempt to crawl a significant portion of YouTube and it builds on years of experience with topic- and community-based channel crawling via the YouTube Data Tools (Rieder, 2015). We relied on two types of connections between channels as conduits: featured channels allow creators to highlight other channels on their profile and channel subscriptions point to the idea that creators are also users that watch content through the same account. Both can also be seen as means to gain visibility on the platform, either through ‘traditional’ networking or as input into algorithmic ranking and recommendation. While the latter is largely speculation, we will see throughout this paper that ‘algorithmic imaginaries’ (Bucher, 2017) and ‘algorithmic gossip’ (Bishop, 2019) may be important sources for explaining why channel owners are doing what they do. For both featured channels and subscriptions limitations apply: channels may simply not feature or subscribe to other channels and channel subscriptions may be set to private. An even more fundamental interrogation concerns the status of ‘channel’ itself since any user who chooses to activate their channel feature is technically a channel. For our project, which is particularly interested in ‘public-facing’ channels that seek or have already found a large audience, the question was thus how to define such a channel. 
One way to do so would be to exclude channels that have only uploaded a limited number of videos, but since we are interested in publicness and channel professionalization, we decided on subscriber count as criterion, not least because subscriber numbers are the central component of YouTube’s tiered governance [ 9 ]: when activating an account’s channel feature, one acquires ‘graphite’ status, which comes with access to basic tools such as Creator Studio; passing the threshold of 1,000 subscribers awards ‘opal’ credentials and, more importantly, access to monetization through advertisement; moving above 10,000 subscribers to the ‘bronze’ tier gives admission to the online and offline creator community ‘YouTube Space’, to pop-up events, and tailored training opportunities; ‘silver and up’ starts at 100,000 subscribers and these ‘elite’ channels get their own partner manager and awards. As Kumar (2019) has argued, these tiers also come with less visible perks, such as prioritization in appeals against demonetization. They are thus essential components of YouTube’s platformed media system. Rather than relying on arbitrary percentage groups, both our crawling strategy and our findings are organized around these four tiers and we will refer to them frequently throughout the text.
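As a minimal sketch of how the tier thresholds described above could be applied to a channel list, the following Python function maps a subscriber count to a tier label. The function name, the label strings, and the treatment of exact boundary values (thresholds are taken as inclusive lower bounds, so a channel with exactly 1,000 subscribers falls into the 1k–10k tier) are assumptions made for this illustration; the text does not specify how boundary values are handled.

```python
from bisect import bisect_right

# Lower bounds of the three upper tiers, as described in the text.
TIER_THRESHOLDS = [1_000, 10_000, 100_000]
# Labels are illustrative; tier names follow the text.
TIER_LABELS = [
    "graphite (<1k)",
    "opal (1k-10k)",
    "bronze (10k-100k)",
    "silver and up (100k+)",
]


def assign_tier(subscriber_count: int) -> str:
    """Return the tier label for a channel's subscriber count.

    Assumption: each threshold is an inclusive lower bound.
    """
    if subscriber_count < 0:
        raise ValueError("subscriber count cannot be negative")
    # bisect_right counts how many thresholds are <= subscriber_count,
    # which is exactly the index of the matching tier label.
    return TIER_LABELS[bisect_right(TIER_THRESHOLDS, subscriber_count)]
```

Grouping a dataset by the value returned from `assign_tier` would then yield the four segments used throughout the analysis.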

To collect data, we implemented a breadth-first crawler in Python that interfaces with YouTube’s Web-API. The script started from a single seed [ 10 ] and followed connections until no new channels were discovered.

 

Figure 1: A schematic overview of our crawling method.

 

Figure 1 shows how the crawler uses featuring and subscribing connections to discover new channels and collect their metadata. If a new channel’s subscriber count, which we retrieve directly from YouTube, is above the specified cutoff (1,000 subscribers), the crawler continues deeper into the network; if it is below the cutoff, the channel is added to our dataset, but its outgoing connections are not followed further. This method discovered 4,415,180 channels with more than 1,000 subscribers and a full total of 36,336,861 channels — well above existing estimates for the total number of channels on the platform (Funk, 2020).
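The crawl logic described above can be sketched roughly as follows. This is not the script used for the study: the two callables `get_subscriber_count` and `get_linked_channels` are hypothetical stand-ins for the API requests that return a channel's subscriber count and its featured or subscribed channels, and the choice to treat a channel with exactly `cutoff` subscribers as expandable is an assumption.

```python
from collections import deque
from typing import Callable, Dict, Iterable


def crawl(
    seed: str,
    get_subscriber_count: Callable[[str], int],
    get_linked_channels: Callable[[str], Iterable[str]],
    cutoff: int = 1_000,
) -> Dict[str, int]:
    """Breadth-first crawl that only expands channels at or above `cutoff`.

    Both callables are hypothetical placeholders for API requests.
    Returns a mapping of channel id to subscriber count for every
    channel that was discovered, including those below the cutoff.
    """
    discovered: Dict[str, int] = {}
    queue = deque([seed])
    while queue:
        channel_id = queue.popleft()
        if channel_id in discovered:
            continue  # already processed via another path
        subscribers = get_subscriber_count(channel_id)
        discovered[channel_id] = subscribers  # every discovered channel is kept
        if subscribers < cutoff:
            continue  # below the cutoff: record it, but do not follow its links
        for neighbour in get_linked_channels(channel_id):
            if neighbour not in discovered:
                queue.append(neighbour)
    return discovered
```

In this sketch, a channel that is only reachable through a channel below the cutoff is never discovered, which is the kind of gap the comparison of crawls with different cutoffs is meant to probe.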

For the three tiers making up what we call ‘monetizable’ YouTube (1k–10k, 10k–100k, and 100k+), we retrieved the listings of published videos, which include video ids and publication dates. We then gathered detailed metadata for all of the videos published by the elite 100k+ tier and took a one percent sample for the other two [ 11 ]. Table 1 provides an overview of all the collected data:

 

Table 1: Overview of collected data.

 

The data collection process relied on the same API access token used for the YouTube Data Tools, which provides a quota of 50,000,000 units per day [ 12 ]. YouTube unfortunately no longer seems to issue similarly generous tokens for new research projects, which makes our research difficult to replicate. Despite the high number of requests we could make in a day, the data collection lasted from 26 November 2019 until 8 January 2020, two days before YouTube’s introduction of a special ‘made for kids’ flag [ 13 ] came into effect.

2.2. Sample qualification and limitations

Despite the substantial number of channels we were able to discover, there are serious questions about the character and coverage of our method. What did it capture and what was left out? One way to begin to answer these questions is to examine the distribution of featuring and subscribing connections. When looking at the full 4.4M channels above the 1k threshold, we find that 27.37 percent feature at least one channel, yielding an overall mean of 1.04. But it is mainly the subscription numbers that explain the density of the resulting channel network and broadly justify our approach: 37.03 percent made their subscriptions publicly available and subscribed to at least one channel, with a much higher overall mean of 53.34. Table 2 provides these numbers separately for our four tiers:

 

Table 2: Subscribing and featuring by tier.

 

Channels with higher subscriber numbers tend to feature other channels more often, but subscriptions are less readily made available. One interpretation for this is that these channels are indeed seeking to professionalize, investing more heavily in public networking and reducing the more ‘private’ or ‘consumption’ oriented practices that subscribing is indicative of. In fact, many professional creators run more than one channel to differentiate their offer and to cover broader advertising targets. These channels are then connected by featuring each other. Since motivations for featuring and subscribing to other channels are diverse and represent some mixture of strategic networking and private consumption, we do not distinguish between the two connection types in this paper, even though this may be an interesting direction for future research.

A second approach to understanding our sample consists in comparing crawls with different subscriber cutoffs. While all data used for analysis are based on the 1,000 cutoff mentioned above, we also performed crawls with a limit of 10,000 and 100,000. Table 3 lists the number of channels with more than 100,000 subscribers discovered in each crawl:

 

Table 3: Discovered channels with more than 100,000 subscribers with different crawl cutoffs.

 

While moving from a 100k to a 10k cutoff grows the number of discovered elite 100k+ channels by 11.4 percent, lowering the cutoff further only adds very few channels with more than 100k subscribers to the list. This means that the ‘structural holes’ (Burt, 1992) that limit the crawl on one level are ‘filled-in’ by lowering the cutoff. The fact that lowering the cutoff further adds very little, together with the high subscribing count mentioned above, makes us confident that our method was indeed able to discover virtually all channels above 100k subscribers. If we repeat the same exercise for channels with more than 10k subscribers, a similar pattern emerges:

 

Table 4: Discovered channels with more than 10,000 subscribers with different crawl cutoffs.

 

While the growth in channels from lowering the cutoff is not exactly the same as before, it is sufficiently close to hypothesize that the structural proportions are similar, meaning that our 1k cutoff crawl was indeed able to discover close to all 10k+ channels. Extending this logic further, we can estimate that we are missing about 10–15 percent of channels in our 1k–10k dataset. We thus end up with a hierarchy of confidence in the results: while the 100k+ and 10k–100k datasets are near complete, the 1k–10k dataset needs to be interpreted with prudence and the almost 32M channels below 1,000 subscribers with even more caution. While we did attempt to run a crawl without cutoff, the resulting numbers prompted serious problems with quota limits and led further into spaces where ‘user’ and ‘channel’ become hard to distinguish. Since we are mostly interested in ‘public-facing’ YouTube, we settled on the 1,000 subscriber cutoff as a workable compromise.

Several limitations with our approach and dataset cluster around time. First, the fact that collecting the data took over six weeks means that our snapshot does not capture channel and video data at the same moment, manifesting in different ‘video list’ and ‘video data’ numbers in Table 1. We made sure, however, that the start of our crawl serves as the cutoff for incoming videos and our macro-scale approach is less sensitive to small variations. Second, numbers are cumulative, and we lack historical data for virtually all variables. Just because a video came out years ago does not mean that this is when it was viewed: an older video may well find an audience years after initial publication. A similar caveat holds for data that may have been changed over the course of a video’s or channel’s timeline. For example, descriptions or keywords may have been edited one or several times during a video’s lifetime. Third, our sample is marked by what is often referred to as ‘survivorship bias’: the channels that made it above our thresholds are indeed those that ‘made it’. While this is not a problem for the snapshot-type analyses that follow, certain historical approaches need to take into account that today’s elite is not (necessarily) the same as yesterday’s. Most of these problems could be solved or mitigated by an approach based on regular snapshots.

2.3. Analytical approach

One of the challenges of presenting an exploratory, ‘overall’ view of a very large dataset is the choice of what to highlight and what to omit. Many more analytical directions than what we present in the following sections would have been possible. As a basis for future work, we have focused on providing relatively broad overviews that will hopefully be useful for other researchers as a point of reference, for example to situate their own samples within the larger platform ecology. Section 3.1 addresses the question of stratification and hierarchization in YouTube and presents quantitative descriptions in two main ways: comparing the four YouTube tiers gives us a general idea of the differences within YouTube’s institutionalized creator hierarchization; and statistical description of variables like views and comments highlights hierarchization in more continuous terms. Section 3.2 examines channel categories as a fundamental means to segment channels along topic areas, providing insights into overall content distribution as well as the differences between them. Section 3.3 applies similar analytical means to channels’ country affiliations, investigating the complex forms of media globalization as they play out within YouTube. In both of these latter sections, we highlight differences in volume, but also investigate the subtle inequalities that manifest when intersecting these broad analytical directions with the different variables present in our dataset.

Taken together, these three sections seek to step closer to our overall research goal, that is, a broad quantitative characterization of the platformed media system emerging on YouTube, with a focus on the ‘protoindustry’ and its quest to professionalize. To keep the following at least somewhat accessible to our readers, we do not always compare all four tiers (100k+, 10k–100k, 1k–10k, <1k), but often focus on the particularly interesting ‘elite’ (100k+) segment or, in other instances, analyze the full 4.4M channels that comprise the top three tiers and make up what we have called ‘monetizable’ YouTube. In the spirit of collegiality and reproducibility, we make our data partially available, allowing researchers to dig deeper themselves (cf. Appendix).

In terms of analytical methodology, we rely mostly on descriptive statistics and visualizations. This is not a quip against more mathematically involved forms of analysis, but a concession to both the complexity of our dataset and our exploratory goals. We feel that more complex forms of multivariate analysis would require a stronger subject focus and analytical directionality than we adopt in this paper. One of the outcomes of our inquiry — in line with our inductive outlook — is indeed a call for follow-up research, which we highlight at various points in the text. We hope to address some of these questions ourselves in future publications.

 

++++++++++

3. Findings

3.1. Channel overview and stratification

In this section, we provide a quantitative overview of YouTube channels along a number of standard and derived metrics. The guiding research interest, here, is to scope the channel landscape, to provide reference points for YouTube scholars (e.g., to locate channels within the hierarchy), and to understand stratification in terms of views, subscribers, videos, and user reactions. While we cannot establish clear causalities, we pay attention to ‘success factors’, that is, characteristics that distinguish channels that are doing well from those that are not.

Channel overview

The main structuring unit on YouTube is the channel, which serves not only as a ‘binder’ for videos and playlists, but also to build more stable, non-algorithmic audiences through subscription. In line with common assessments of social media as highly unequal in terms of views, followers, shares, or other metrics, our dataset is heavily dominated by ‘elite’ channels (100k+ subscribers). As Table 5 shows, these channels accrue most of the views and subscribers despite generating only 8.9 percent of all the published videos by the four tiers. What emerges, here, is the dominance of the most popular channels in terms of visibility and popularity and the vigorous creative effort of the user base that does not obtain the same reward, a trend already observed in previous literature (Bärtl, 2018).

 

Table 5: Cumulative channel statistics separated per tier.

 

This pattern continues further toward the top: looking at the 15,496 channels that have more than 1M subscribers (0.04 percent), we found that they account for 37 percent of the subscribers, 37.4 percent of the views, but only 2.3 percent of the videos published. This is a much heavier skew than the 80/20 Pareto principle [ 14 ] and a clear indication that YouTube’s elite is central to the life (and earnings) of the platform. Overall, subscriber and view count correlate strongly (0.74) and we can easily imagine a mutually reinforcing dynamic that is further exacerbated by YouTube’s algorithmic visibility management.
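Concentration figures of this kind can be computed with a small helper, sketched below. The function name `top_share` and the toy numbers in the usage note are assumptions for illustration and are not taken from the dataset.

```python
from typing import Sequence


def top_share(values: Sequence[float], top_fraction: float) -> float:
    """Share of the total held by the top `top_fraction` of items.

    `values` holds one number per channel (e.g., subscribers or views).
    At least one item is always included in the top group.
    """
    if not 0 < top_fraction <= 1:
        raise ValueError("top_fraction must be in (0, 1]")
    if not values:
        raise ValueError("values must not be empty")
    ranked = sorted(values, reverse=True)
    k = max(1, round(len(ranked) * top_fraction))
    total = sum(ranked)
    return sum(ranked[:k]) / total if total else 0.0
```

For a hypothetical list of ten channels with view counts `[1000, 500, 100, 100, 50, 50, 50, 50, 50, 50]`, `top_share(views, 0.1)` gives the share of all views held by the single largest channel. The 0.04 percent reported above is consistent with 15,496 out of the 36,336,861 channels in the full sample.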

Table 6 provides more detailed statistical descriptions that add nuance by highlighting the considerable internal variation within our four tiers. Subscribers and per-channel views drop significantly when moving down the subscriber tiers. This is particularly visible if we move below the monetization threshold, where the average subscriber number drops to 122. When it comes to published videos, however, we see that the elite may receive disproportionate levels of exposure but have also published, on average, many more videos per channel (940) than the next lower tier (294). And this is not simply an effect of having been around for a longer time: if we compare the average number of days active since the starting date, the numbers are surprisingly constant between tiers.

 

Table 6: Descriptive statistics for channels.

 

The picture that emerges from these numbers is the existence of a preeminent and professionalized elite that has the resources to produce more content, succeeds at gaining views and subscribers, and, consequently, gains more income via monetization. This dynamic continues into the next section.

Channel network properties

Looking at the population of channels as a network produces further insights into, and questions about, the dynamics of success. The relational structure emerging from our crawl shows a relatively well-connected network. For the 4.4M channels above 1k subscribers, where we have complete linking data, there are 162,502,327 directed edges, an average of 36.8 per node. The channels with the most incoming links are not just YouTube stars like PewDiePie or Eminem, but also categories like Music that serve as video aggregators within YouTube’s information architecture rather than as ‘real’ channels and therefore lack video and view numbers. Interestingly, the popularity of channels like NoCopyrightSounds, TeamYouTube [Help], or YouTube Creators shows the relevance of material and advice for the established and budding creators in our sample. The prominence of these channels is an indicator of the importance of copyright on YouTube, but also of the aid the platform provides to creators seeking to professionalize. Table 7 shows the ten most linked channels:

 

Table 7: Top 10 linked channels.
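A ranking by incoming links, as well as the average number of edges per node, can be derived from a directed edge list along the following lines. This is a sketch under the assumption that edges are stored as deduplicated `(source, target)` pairs of channel ids; the function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def indegree_ranking(
    edges: Iterable[Tuple[str, str]], top_n: int = 10
) -> List[Tuple[str, int]]:
    """Return the `top_n` channels with the most incoming links.

    Assumption: each (source, target) pair appears at most once.
    """
    indegree = Counter(target for _source, target in edges)
    return indegree.most_common(top_n)


def mean_edges_per_node(edge_count: int, node_count: int) -> float:
    """Average number of directed edges per node."""
    return edge_count / node_count
```

With the figures given in the text, `mean_edges_per_node(162_502_327, 4_415_180)` rounds to the reported 36.8.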

 

While, unsurprisingly, indegree and subscriber count correlate quite strongly (0.793), absolute numbers for subscribers are vastly higher. We must be mindful that the sample under scrutiny is only a small part of the entire ‘YouTube network’, understood as the site’s full user base, including those who have not activated their channel feature. Indegree, here, captures the visible part of subscribing within our network, but the actual subscriber numbers reflect the activities of a much wider range of users. Using the powerlaw package for Python (Alstott, et al., 2014), we investigated the distributions of both variables statistically.

 

Figure 2: Both sides plot a Complementary Cumulative Distribution Function (CCDF), the left x-axis for indegree, the right for subscribers. The y-axis shows rank. The blue lines represent a power law fit and the green a log-normal fit. In both cases, Xmin is set to the lowest value in the dataset (1 for indegree, 1000 for subscribers).

 

While Figure 2 raises more questions than it answers, it indicates that neither variable follows a simple power law. As we have seen in the previous section already, the broad logic of the rich getting richer still applies (Borghol, et al., 2012), making the road to visibility and success harder for new creators. But the fact that a log-normal distribution is overall a better fit for both variables, in particular for indegree, indicates that the growth dynamics at play cannot be easily mapped onto a singular process [ 15 ]. Many different factors may come into play: social capital transfers across sites (e.g., when a celebrity opens a YouTube channel), ranking and recommendation algorithms affect topic-specific and overall visibility, and so forth. While advertising revenue per view can vary widely between topic domains and countries, there comes a point where a creator may be able to quit or reduce their ‘day job’, allowing them to intensify their publishing schedule and channel growth. Investigating actual growth mechanisms and thresholds in more depth is beyond the scope of this paper but would clearly be a worthwhile endeavor.
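To make the quantities behind such a plot concrete, the sketch below computes an empirical CCDF and the standard continuous maximum-likelihood estimate of a power-law exponent. It is a standard-library stand-in written for illustration, not the powerlaw package's implementation; the function names and toy values are assumptions, and it does not cover the comparison against a log-normal fit.

```python
import math


def empirical_ccdf(values):
    """Return sorted unique values and P(X >= x) for each of them."""
    data = sorted(values)
    n = len(data)
    xs, ps = [], []
    for i, x in enumerate(data):
        # Record each distinct value once, at its first position in the sort.
        if not xs or x != xs[-1]:
            xs.append(x)
            ps.append((n - i) / n)
    return xs, ps


def power_law_alpha(values, xmin):
    """Continuous MLE: alpha = 1 + n / sum(ln(x / xmin)) over x >= xmin.

    Raises ZeroDivisionError if every retained value equals xmin.
    """
    tail = [x for x in values if x >= xmin]
    log_sum = sum(math.log(x / xmin) for x in tail)
    return 1 + len(tail) / log_sum


# Toy indegree values with xmin set to the lowest value, as in Figure 2.
sample = [1, 1, 2, 4, 8]
xs, ps = empirical_ccdf(sample)
alpha = power_law_alpha(sample, xmin=min(sample))
```

On log-log axes, a pure power law would appear as a straight line in the CCDF; systematic curvature is one visual cue that another distribution may fit better.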

Channel age distribution

The question of how strongly channel age affects success is another element in the platform puzzle. While we can confirm Bärtl’s (2018) finding that there is a statistically significant correlation between channel age and success indicators, it is relatively small at 0.082 for subscriber count and 0.101 for view count, and this is for the elite only. For the full dataset, these values go down to very low levels at 0.009 and 0.003 respectively. To switch perspective, Figure 3 presents the age distribution as a percentage histogram for each tier.

 

Figure 3: Percentage histogram showing distribution of channel age in years since creation.
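A percentage histogram of this kind normalizes counts within each tier so that tiers of very different sizes become comparable. The following is a minimal sketch under assumed inputs: the tier labels and age values are hypothetical, and ages are taken as whole years since channel creation.

```python
from collections import Counter


def percentage_histogram(ages):
    """Map each channel age (in years) to its share of the tier, in percent."""
    counts = Counter(ages)
    total = len(ages)
    return {age: 100 * counts[age] / total for age in sorted(counts)}


# Hypothetical tiers with made-up channel ages.
tiers = {
    "100k+": [2, 3, 3, 5],
    "1k-10k": [1, 2, 3, 9],
}
histograms = {tier: percentage_histogram(ages) for tier, ages in tiers.items()}
```

Because each tier sums to 100 percent, a ‘flattened’ curve in a lower tier shows up as smaller peaks and heavier tails, independent of how many channels the tier contains.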

 

As with the days active averages in Table 6, the differences are not striking: in all four cases, the most common channel age was two or three years. Moving down the tiers, however, we find a ‘flattened’ curve, that is, a higher percentage of both older and younger channels. This cannot be taken as an indicator for greater longevity of less popular creators: despite the bias toward popularity, our dataset includes abandoned spaces in which no content has been published for a long time. Indeed, out of the 153,770 channels in the elite tier, 152,681 (99.29 percent) had at least one video available, but only 137,605 (91.27 percent) featured videos created in 2019. There is an entire YouTube ‘cemetery’ of inactive channels hidden in our data that would merit further investigation.

Video creation tactics

One characteristic that has distinguished YouTube from other social media sites since 2007 is the decision to share advertising revenue (a considerable US$15B in 2019) with content creators. Interestingly, the controversies around monetization that YouTube has seen almost constantly over the past years have revolved less around the split between platform (45 percent) and creators (55 percent) than around algorithmic findability and demonetization. The best tactics to reach more viewers and to increase revenue are heavily discussed topics within the creator community, and the opaque and often-changing technical and administrative mechanisms have caused much frustration. This has led many creators to seek other sources of revenue, from crowdfunding to selling merchandise. Unfortunately, there is no automated way to know whether a video or channel has been demonetized, but there are at least three directions to investigate the effects of the ‘algorithmic dance’ (Kumar, 2019) creators have to engage in to succeed.

First, publishing frequency is considered to be one of the cornerstones of a successful channel strategy, and uploading videos often and on a regular schedule is seen as essential to achieving visibility. The pressure this exerts became particularly visible in 2018 when a number of well-known creators had to pause due to burnout (Alexander, 2018). Figure 4 shows the evolution of monthly publication averages per channel, taking into account only channels that published at least one video in a given month. We chose this method of normalization to correct for the successive addition of channels to the elite tier over time and for the sometimes long periods of inactivity for certain channels.

 

Figure 4: Per-channel publication averages for active elite channels, Gaussian smoothing (sigma=4) added for readability.

 

Figure 4 shows that publication activity grew steadily until around mid-2014, then dipped slightly before starting to grow again in 2016. The high variation in our dataset means that these results need to be taken with a grain of salt, but the overall trend toward increased publishing frequency is clear.
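The normalization and smoothing described for Figure 4 can be sketched as follows. The input layout (one (channel, month) pair per video), the function names, and the kernel details (truncation at four standard deviations, renormalization at the series edges) are assumptions made for illustration; the figure caption specifies only Gaussian smoothing with sigma=4.

```python
import math
from collections import defaultdict


def monthly_active_average(uploads):
    """Average videos per channel and month, counting only active channels.

    uploads: iterable of (channel_id, 'YYYY-MM') pairs, one per video.
    """
    per_month = defaultdict(lambda: defaultdict(int))
    for channel, month in uploads:
        per_month[month][channel] += 1
    # A channel only appears in a month's dictionary if it uploaded that month.
    return {m: sum(c.values()) / len(c) for m, c in sorted(per_month.items())}


def gaussian_smooth(series, sigma=4.0):
    """Smooth a list of numbers with a truncated, edge-normalized Gaussian."""
    radius = int(4 * sigma)
    smoothed = []
    for i in range(len(series)):
        total = weight_sum = 0.0
        for j in range(max(0, i - radius), min(len(series), i + radius + 1)):
            weight = math.exp(-((j - i) ** 2) / (2 * sigma ** 2))
            total += weight * series[j]
            weight_sum += weight
        smoothed.append(total / weight_sum)
    return smoothed


# Toy data: channel "a" uploads twice in January and once in February.
uploads = [("a", "2019-01"), ("a", "2019-01"), ("b", "2019-01"), ("a", "2019-02")]
averages = monthly_active_average(uploads)
flat = gaussian_smooth([2.0] * 10)
```

Dividing by the number of active channels rather than by all channels keeps months with many dormant channels from dragging the average toward zero.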

Second, the ‘optimal’ length of videos has also received much attention in creator communities. This concerns not only the question of how to handle users’ attention spans, but again relates to estimations of algorithmic preference and advertisement. Although YouTube does not specify a minimum video length for ad eligibility, videos that are longer than 10 minutes can place so-called ‘mid-roll’ ads [ 16 ] and, according to some creators, are favored by the all-important algorithms (Peterson, 2018).

 

Figure 5: Average video length in seconds per year for the 100k+ channel tier.

 

As Figure 5 shows, average video length has more than doubled since the beginning for the elite tier, with constant yearly progress from 2011 onward. This shows adaptation to real and imaginary incentives, but also tells a story of a changing platform: while YouTube was, for a long time, considered to be mainly a home for low-effort ‘user generated content’, the trend toward more substantive videos supports the ‘professionalization’ narrative, both on the level of individual channels and for the platform itself. That said, we only detected a very low level of correlation (0.002) between video duration and view count.
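For readers who want to reproduce this kind of check, the sketch below computes a Pearson correlation coefficient from two equally long lists. Which coefficient was used for the values reported in this paper is not stated here, so Pearson is an assumption, as are the function name and the toy inputs.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(var_x * var_y)


# Toy values: durations in seconds and view counts for four videos.
durations = [60, 300, 600, 1200]
views = [500, 100, 900, 400]
r = pearson(durations, views)
```

With heavily skewed variables such as view counts, a rank-based coefficient or a log transformation may be worth considering alongside this measure.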

Third, in a situation where advertisement rules are unstable and opaque, creators have turned to product placement, sponsorships, affiliate programs, and crowdfunding as means to generate income. Video descriptions are increasingly important tools to direct viewers to other places on the Web. In the 138M+ videos posted by our elite tier, we found 577,737,068 URLs that represent valuable traces for the specific forms of ‘industrialization’ happening on YouTube. Besides mapping creators’ cross-platform activities, they allow us to trace the appearance and spread of crowdfunding platforms like Patreon, of affiliate links and merchandise stores, and of e-commerce Web sites like Etsy. In other words, they document the ways creators seek to develop their channels into media businesses that are less dependent on advertising income.

 

Figure 6: Videos with Patreon links in their description for the 100k+ channel tier.

 

Figure 6 shows the surge of Patreon links over the years. The slight dip in 2019 is explained by missing data for December. Since Patreon was only founded in May 2013, the fact that we find the first link to the crowdfunding Web site in a video from 2005 confirms the idea that creators do adapt descriptions of older videos. We will look more deeply into these practices, as well as visibility tactics such as keyword stuffing, in a follow-up publication dedicated to monetization and optimization.
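Counting videos with Patreon links per publication year can be sketched with two regular expressions. The patterns, function names, and sample descriptions are illustrative assumptions and not the extraction rules used for this paper; a production pattern would need to handle trailing punctuation and shortened links.

```python
import re
from collections import Counter

# Deliberately simple URL pattern; it may capture trailing punctuation.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")
PATREON_PATTERN = re.compile(r"https?://(?:www\.)?patreon\.com/", re.IGNORECASE)


def extract_urls(description):
    """Return all http(s) URLs found in a video description."""
    return URL_PATTERN.findall(description)


def patreon_videos_per_year(videos):
    """Count videos whose description links to Patreon, by publication year.

    videos: iterable of (publication_year, description) pairs.
    """
    counts = Counter()
    for year, description in videos:
        if any(PATREON_PATTERN.match(url) for url in extract_urls(description)):
            counts[year] += 1
    return counts


# Made-up descriptions; the 2005 entry mimics a description edited later.
videos = [
    (2005, "Support the channel: https://www.patreon.com/example"),
    (2019, "No links in this description."),
    (2019, "Merch at https://example.com/shop and https://patreon.com/example"),
]
per_year = patreon_videos_per_year(videos)
```

Each video is counted once regardless of how many Patreon links its description contains, which matches a per-video rather than per-URL reading of the figure.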

Video overview and user reactions

Yet another way to measure channel stratification and success on YouTube is to investigate user reactions such as likes, dislikes, and comments. Table 8 gives a statistical description of the videos published by the 4.4M monetizable channels with at least 1,000 subscribers. We once more notice the uneven distribution along all variables. Duration, for example, shows that over 25 percent of published videos indeed conform to the ‘short clip’ cliché (less than two minutes), but there are also many much longer ones, up to a whopping 46,043,514 seconds, which is 533 days of live feed from the International Space Station. More than half of the videos published fail to reach 550 views, despite the elite being included in this sample. Like, dislike, and comment counts are low compared to views, which suggests that interaction buttons (other than the play button) have a less central role on YouTube than on other platforms. The fact that comments generally rank higher than dislikes reinforces this observation. Another possible explanation for the centrality of views as a measure of success on YouTube is that the platform counts views from both logged-in and not logged-in users, whereas only logged-in users can interact with videos through liking, disliking, and commenting [ 17 ].

 

Table 8: Descriptive statistics for videos from channels with 1k+ subscribers

 

To facilitate aggregate assessment, we generated [ 18 ] two additional metrics: intensity and likeratio. The first one identifies those videos able to generate more engagement per view and corresponds to the sum of likes, dislikes, and comments divided by view count and multiplied by 100 for readability. The second one highlights content that generates more controversy or negative feedback by dividing likes by dislikes. It is rare to receive more dislikes than likes, with an average likeratio of 15.8 in favor of likes for monetizable YouTube. This value goes up to 41 for the elite, adding another element to their success story.
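The two metrics follow directly from their definitions and can be sketched as below. Returning None when the denominator is zero is an assumption made for this example; how such cases were treated in the actual analysis is not specified here.

```python
def intensity(likes, dislikes, comments, views):
    """Engagement per view: (likes + dislikes + comments) / views * 100."""
    if views == 0:
        return None  # Assumption: undefined without any views.
    return 100 * (likes + dislikes + comments) / views


def likeratio(likes, dislikes):
    """Likes divided by dislikes; higher values mean less negative feedback."""
    if dislikes == 0:
        return None  # Assumption: undefined without any dislikes.
    return likes / dislikes


# Made-up counts for a single video.
video_intensity = intensity(likes=80, dislikes=5, comments=15, views=1000)
video_likeratio = likeratio(likes=80, dislikes=5)
```

In this toy case, 100 reactions on 1,000 views yield an intensity of 10, and 80 likes against 5 dislikes yield a likeratio of 16.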

Taken individually these metrics have certain problems: for example, videos made unavailable on copyright grounds retain their like and dislike counts, but their view count is reset to one, resulting in extremely high intensity values. These outliers can fortunately be easily identified, and we removed them from our analysis. Intensity and likeratio become particularly useful further down when intersected with channel topics and countries.

3.2. Channel categories

Categorization exists on (at least) two levels within YouTube’s information architecture. When uploading a video, creators have to choose from a list of set labels that are introduced in the following way: ‘Content categories organize channels and videos on YouTube and help creators, advertisers, and channel managers identify with content and audiences they wish to associate with.’ [ 19 ] From these labels and possibly other data, YouTube automatically generates channel categories or topics, which are not identical to the labels used for videos and cannot be changed by creators. As already discussed, categories form navigational hubs, but they also play a role in targeted monetization. This system, as Kumar argues, ‘skews the incentives for creating particular genres and types of content’ [ 20 ], raising concerns regarding its impact on creativity and innovation. This section first introduces categories more broadly and then investigates how various metrics vary across topics.

Categories overview

Channel categories are divided into seven main branches. Entertainment, gaming, lifestyle, music, society, and sports each come with their own subcategories, but a seventh topic, knowledge, is not divided any further. Channels may be categorized into one or several parent topics, and they may or may not be slotted further into subtopics. A certain percentage of channels across all of our tiers were in no category at all (see Figure 8). Figure 7 presents a structural view of category relationships based on co-occurrence for the same channel.

 

Figure 7: Network of YouTube categories (100k+ tier) based on co-occurrence; only edges with a greater weight than ten are visible.
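A co-occurrence network of this kind can be built by counting, for every pair of categories, the number of channels that carry both. The sketch below assumes each channel comes with a list of category labels; the function name and sample labels are illustrative, and the threshold mirrors the ‘greater than ten’ rule from the caption.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(channel_categories, min_weight=10):
    """Return category pairs that co-occur on more than min_weight channels.

    channel_categories: iterable of category lists, one list per channel.
    """
    weights = Counter()
    for categories in channel_categories:
        # Sort so that each undirected pair has a single canonical key.
        for pair in combinations(sorted(set(categories)), 2):
            weights[pair] += 1
    return {pair: w for pair, w in weights.items() if w > min_weight}


# Toy input: 11 channels tagged Music and Hip hop, 10 tagged Music and Gaming.
channels = [["Music", "Hip hop"]] * 11 + [["Music", "Gaming"]] * 10
edges = cooccurrence_edges(channels)
```

Only the Music and Hip hop pair survives here, since a weight of exactly ten does not exceed the threshold. Channels with a single category or none contribute no edges.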

 

Subcategories generally cluster close to their parent topics and they are often straightforward specializations: music is organized into genres like jazz, hip hop, or reggae; gaming into role-playing, strategy, or action games, and so forth. Entertainment, however, includes ‘professional wrestling’ as one of only five subcategories and lifestyle covers a very broad range of topics. The most striking exception is the society category, a mixed bag where traditional media outlets such as BBC News coexist with channels devoted to military, religion, politics, or health. Consequently, not all society subtopics are positioned close together in Figure 7.

Connecting with recent debates on radical far-right content on YouTube, we can further problematize YouTube topics. While Politics could be considered the appropriate place for politically charged content, the category is mostly dedicated to news channels and documentary style content. These channels are grouped together into the Politics subcategory almost randomly, which has already been observed in previous literature (Paolillo, et al., 2019). As a short experiment, we took the 31 channels mentioned as the core of Lewis’ (2018) far-right ‘alternative influence network’ and examined the categorization of the 28 channels still on YouTube. Ten were not classified at all, including the most well-known ones: Steven Crowder, Stefan Molyneux, Jordan Peterson, The Rubin Report, and Ben Shapiro. Is YouTube explicitly reducing the findability of these highly controversial channels or are these signs of demonetization? Out of the remaining 18, only five were tagged as Society and only three as Politics. The most common parent topics, in fact, were Entertainment and Lifestyle (13 each). Making sense of why channels get grouped together in this category would require more forensic analysis, again pointing toward further research. Despite issues in specific areas, categories allow for broad characterizations of what is actually available on YouTube and the following section explores them in greater depth.

Categories in numbers

To provide a basic quantitative overview, Table 9 shows cumulative numbers for channel counts, subscribers, videos, and views per channel category while Figure 8 allows for the interactive exploration of the category system, in both cases for the 153k elite channels with more than 100,000 subscribers.

 

Full values per channel category