Studying Facebook and Instagram data: The Digital Footprints software
First Monday

by Anja Bechmann and Peter B. Vahlstrup



Abstract
The aim of this article is to discuss methodological implications and challenges in different kinds of deep and big data studies of Facebook and Instagram through methods involving the use of Application Programming Interface (API) data. This article describes and discusses Digital Footprints (www.digitalfootprints.dk), a data extraction and analytics software that allows researchers to extract user data from Facebook and Instagram data sources; public streams as well as private data with user consent. Based on insights from the software design process and data driven studies the article argues for three main challenges: Data quality, data access and analysis, and legal and ethical considerations.

Contents

Introduction
Extracting data from Instagram and Facebook: Related work
The Digital Footprints software
Analytical perspectives: Examples
Discussion: Data quality, access and ethics
Conclusion

 


 

Introduction

Statistics show that an increasing share of the time spent on the Internet is spent on or through social networking services (SNS), where user engagement unfolds in password-protected environments (alexa.com). This limited access requires new methods for studying the communication that takes place in such settings. In this paper the terms social networking services and social media are used synonymously to denote services where users communicate semi-publicly with a network of ‘friends’ or peers (e.g., Facebook, Weibo, Twitter, and Instagram). When studying social media, traditional digital methodological tools such as Web crawling (Gjoka, et al., 2008; Cantanese, et al., 2011) are not an option because they do not allow researchers to crawl pages that require user authentication. Raw data log files are the ideal objects of study in these environments, but they are only accessible to researchers if the social media company employs them or an extended collaboration is established (see Kramer, 2012). Until recently, this limited access made it nearly impossible to analyze, from an outside perspective, what is going on inside Facebook in terms of data and information patterns.

Still, social media communication has gained traction in media, communication and social studies. These studies are primarily conducted as interview- or survey-based studies (Caers, et al., 2013), but also as studies that use “data crawling” or “gleaning information about users from their profiles without their active participation” (Wilson, et al., 2012; Rieder, 2013). The latter studies use application programming interfaces (APIs) to retrieve data for research purposes, even though the APIs are primarily made for developers to integrate with gaming or service apps. This is especially the case with Facebook (see also Lomborg and Bechmann, 2014).

Existing software tools use a researcher’s own account to get access to other users and their activity without their consent (e.g., Netvizz). This article describes Digital Footprints (digitalfootprints.dk), a software tool that, in contrast to existing tools, uses API data retrieval to access a wider range of data types across services and asks participants (as is normal procedure in qualitative and quantitative studies) whether the researcher may retrieve and use their data in a specific research project. The description of the software provides a rich entry point for exemplifying and discussing the analytical potentials and challenges of such software-supported methods.

 

++++++++++

Extracting data from Instagram and Facebook: Related work

In this section we outline related work by focusing on the use of APIs as a data retrieval method, followed by a short overview of similar existing multiuser data retrieval tools for social media (Facebook and Instagram).

APIs as a data retrieval method

Traditionally, studying online communication has involved various kinds of survey-based or ethnographically inspired approaches, but they all entail significant limitations. Survey-based methods rely on answers about behavior but do not study the actual behavior or data. Classic physical participatory observation follows the user in real time, but the method is time-consuming, and very intrusive (Hammersley and Atkinson, 1995). The activity on social networks happens both during the day and night, during work and leisure hours, making comprehensive observation difficult. The researcher has to discuss the viability of observations, as the presence of the researcher redefines the situational context in which the use of Facebook takes place. Another method used is technology supported diary studies (Brandt, et al., 2007; Carter and Mankoff, 2005), and observations using other technically supported tools, such as screen capture client software e.g., Jing (http://www.techsmith.com). Facebook usage happens across devices, and tracking software can be installed on the different devices and combined with a unique user identifier, so that it can track across devices, as is the case with services from Apple, for example. However, the data generated by this tracking software is in the form of video recordings of screen activity, which is weak in terms of isolating each data unit (such as posts, comments and likes) and making them searchable and countable without coding the data manually.

Online virtual ethnographers (Kendall, 2002; Howard, 2002; Hine, 2000; Markham, 1998; Baym, 2000) conduct various kinds of participatory observation studies by participating in the online community, for example, by ‘friending’ research participants, commenting on updates and discussions, and asking for elaborations on different actions in the community. As a method, virtual ethnography builds on the fundamental ethnographic idea of becoming involved in the community studied, but has been criticized by researchers for lacking the detachment necessary to see and follow up on interesting aspects of the patterns of use and communication taking place (Magnani, et al., 2012). Furthermore, the studies cannot describe the context of the user, but only the data context as member of the network. For example, on Facebook, the virtual ethnographer must rely on the Facebook algorithm showing all participant activity, and the researcher cannot see the participants’ actual screen, and therefore cannot see what s/he is exposed to (e.g., the newsfeed). To document the activity, the virtual ethnographic studies needed to manually copy/paste the content from the social networking service, in order to analyze it further.

This, in turn, is one of the main motivations for using APIs as data retrieval tools: to make data searchable and sortable for analysis. On the other hand, in contrast to both survey-based and especially ethnographically inspired studies, APIs as a stand-alone method for data collection do not provide researchers with a tool to study usage or data in context, but only digital traces of activity as they occur in the API. This leaves the API method with a major blind spot, especially concerning the ability to analyze users’ reading mode or ‘lurking’ activity and click-through patterns (see also Lomborg and Bechmann, 2014). On top of that, neither the Instagram API nor, especially, the Facebook API delivers all data as they appear to the users. This shortcoming is discussed in the section on data quality.

When using APIs as a method for data retrieval, researchers can choose between two ways of collecting data: from within the social networking site or from outside. When researchers recruit within the social networking site, they often use their own account to collect data from connected users (see Rieder, 2013) or use virality and network effects (personal network, sponsored content or Facebook ads) to recruit participants. The advantage of such convenience-based collection is that it is easily available and the reach is potentially large. However, in a personal network it can be difficult to obtain a broad sample across different strata (e.g., demographics), and when using sponsored content and Facebook ads the researchers must rely on potentially false self-reported demographics and it can be difficult to secure participation. Recruitment outside can take place by asking people to “fill out questionnaires, often via so called Facebook applications” (Rieder, 2013). In this way the Facebook app is used to gather information, but does not actually collect Facebook activity data except profile data. Another method, which Digital Footprints supports, is to recruit people externally and then ask for permission to use their content. In this way researchers are able to control the demographics and to secure participation, for instance by recruiting through Internet panels or qualitatively in small sample studies. However, due to a lower participation rate (see Bechmann and Vahlstrup, 2015), an actual representative study of a broader national or international sample using APIs will be difficult to execute when researchers ask for permission.

Existing social media data retrieval tools

Some large-scale analytical systems serve as platforms for accessing data on Facebook, but they only crawl publicly available data through public profiles, e.g., SocialMention (http://socialmention.com/) and Radian6 (http://www.radian6.com/). The purpose of these services is to let companies scan social media networks for mentions. These systems scan across social media sites such as Facebook, Twitter and Instagram, but on Facebook and Instagram they only crawl data from public profiles. This is despite the fact that Facebook and Instagram allow third party companies and researchers to access user data with the permission of the user, if the third party company/researcher complies with the terms of use for developers.

NVivo is a qualitative data analysis system that, among other things, facilitates retrieval and analysis of Facebook data (http://www.qsrinternational.com/). The software provides access to semi-private data, but through the browser plugin NCapture analysts can only access public profiles and groups, or posts from users they are friends with and groups they are part of. Furthermore, NVivo cannot collect data from the newsfeed.

ArchivedBook is an app using the Graph API for data retrieval, but ArchivedBook is not a tool for researchers. It only shows the end user’s personal data. Retrieving data from other users is neither possible nor the purpose of the app. Often researchers need a system that asks for permission to retrieve data that are not publicly available from a sample of end users. Netvizz (Rieder, 2013), on the other hand, is a tool primarily designed for researchers to extract data from Facebook. The tool is also multiuser software, but it extracts data through the researcher’s own Facebook account and thereby collects only data from the researcher’s network of friends. Furthermore, Netvizz is primarily built for network analysis and therefore focuses on network graphs and relational data.

Open APIs as data collection tools

In recent years there has been a focus on medium-sized and “big data” structures in cloud services, and how Web crawlers, databases, and graphical presentation tools can help surmount the obstacles to documenting user interfaces, making data accessible and searchable for network and semantic data mining. Quite a few researchers studying open/public social networks have used APIs on a smaller or larger scale to extract and crawl data. This is particularly the case with Twitter studies, where using APIs has been the standard approach for extracting data to various external software, such as IssueCrawler, Gephi and other graphical network tools (see also Bruns, 2012). Public data has also been accessed for YouTube studies, to sample survey users on the basis of data from the Google Data API (Courtois, et al., 2011).

However, few studies have used the Instagram or Facebook API to collect private data with user consent, or in other words, to act as a third party accessing the users’ private information. Such access can inform researchers of actual user behavior and makes mixed methods studies possible by triangulating with ethnographic methods (observation, interviewing, etc.). Other researchers have used open APIs in specific studies on other social networking sites. Korn (2012) uses check-in data in his study of location based data patterns on Foursquare. This method enabled him to automatically send text messages shortly after the participants checked in. He also triangulated methods by interviewing participants afterward to obtain rich, in-depth accounts of user experiences (rich deferred contextual interviews).

A common denominator for these open API studies is that they are customized for specific purposes and often made with the assistance of computer scientists. In contrast, we believe that researchers in the social sciences and humanities need more generic multiuser systems that do not require collaboration with computer scientists in order to retrieve data from the various social media APIs. In return, such a system will not accommodate customized needs.

 

++++++++++

The Digital Footprints software

We started building the Digital Footprints software in January 2012. The initial aim of Digital Footprints was to make it easier for researchers to collect and analyze Facebook and Instagram data (private as well as public) from a selected number of participants with user consent. Compared to existing methods, we needed a tool that, among other things, could be more unobtrusive, make data searchable and sortable, access historical Facebook data, and help secure informed consent.

 

Table 1: Digital Footprints compared to existing social media methods.
Requirements & functionality by method:

| Method | Programming skills | Informed consent | Time consumption for researchers | Time consumption for participants | Searchable and sortable data | Accessing historical Facebook data |
| --- | --- | --- | --- | --- | --- | --- |
| Physical participatory observation | Low | Not facilitated | High | High | No | Yes |
| Technology supported diary studies | Low | May be facilitated | High | High | No | No |
| Screen capture software | Low | Not facilitated | Low | Low | No | No |
| Friending participants | Low | Not facilitated | High | Low | No | Yes |
| Customizing a new solution | High | Not facilitated | Low | Low | Yes | Yes |
| Digital Footprints | Low | Facilitated | Low | Low | Yes | Yes |

 

The application was initially designed to optimize small-scale, in-depth, qualitative user studies with the intent to study a small sample or a panel of users. Inspired by the seven steps in qualitative studies (Kvale and Brinkmann, 2009), the system is designed to address the following five steps: 1. Create a new Facebook research project/panel; 2. Automatically guide and collect the legal information needed from the researcher and the user; 3. Facilitate the invitation of participants to take part in a project/panel through links; 4. Retrieve the Facebook data needed from the participants; and, 5. Facilitate different analytical functionalities such as descriptive data statistics and graphical presentations, search and sorting possibilities.

The application was later developed to support larger and broader samples and quantitative analysis by making it possible for researchers to verify demographic variables in survey questions before collecting the data through a link. Furthermore, the user interface and server code have been optimized for analysis of larger amounts of data. The most recent development has been to collect social media data across services: we have made it possible to collect data from Instagram as well and offer similar analytical functionalities.

New research project

After an authentication process, researchers conduct their own research project through the Web interface of Digital Footprints, accessing our servers directly. The authentication process exists to identify the researchers as full-time researchers affiliated with a university. Thereafter, the researcher is able to administer his/her research projects. The administrative researcher can change the project status, update data and delete data. Furthermore, for privacy protection and as a security measure, it is important for us as administrators of the system to be able to approve or deny a research project. The researcher (and his/her team) only has access to his/her own data, and the researcher still needs to decide how to recruit the participants. Digital Footprints only provides the researcher with a series of links and a form in which to fill out the necessary (legal) information about the research project.

Legal information

In software designs using open APIs (Rieder, 2013; Bruns, 2012), legal considerations on privacy have only partly been explicitly addressed and discussed as a part of the study and the system design. To us, addressing them is extremely important when using APIs. We have consulted privacy lawyers in order to build the legal aspects of working with private, potentially sensitive user data into our system. Hence, an important part of the system is providing users with accurate information (e.g., compliance with EU law) about the research project in which they are participating. Even though the system is intended for international use, at this time the legal guidelines are limited to EU law (European Union, 2002; European Union, 1995; Ohm, 2010). For this reason, it is mandatory for researchers to furnish information on the duration of the project, the purpose of the project, the data needed for the project, and how and under what circumstances they will analyze the data. Furthermore, should data not be deleted at the end of the project period, researchers must verify that any archiving is done according to the law. Also according to the law, a user must be able to withdraw from a research project. Therefore, the system has a built-in function that allows researchers to select “delete profile from research project”. In addition, the participants in a research project can revoke the access token used to collect the data by deleting the Digital Footprints app in Facebook. The researcher will then be notified in the dashboard of the Digital Footprints researcher’s view, and an ‘incomplete’ icon will appear next to the participant in question. This is done because small-scale, in-depth studies in particular are more sensitive to user withdrawal than large-scale studies. Still, large-scale studies rely on “perfect” sampling criteria and therefore need to keep track of the demographic parameters of the participants who choose to leave the study.

Inviting participants

Before inviting participants to take part in a research project or panel, the researcher must provide the information required by law, which is implemented in the system as standard information. The user consent process is threefold. First, the researcher sends out e-mail messages, provides a link in an e-mail message or survey, or recruits in person (depending on methods and panel size), informing potential participants about the research project (purpose, aim, etc.) and asking them to visit a URL unique to the research project in question. Second, this URL directs potential participants to a Web interface (at this time, a browser interface optimized for laptops, Android devices, iPhones and iPads) where they receive more information on the research project along with privacy policies and are asked to log in with their Facebook/Instagram account. Third, the participants permit the research project to collect data from their profiles.

Collecting data

When a user permits Digital Footprints to collect the data specified in the invitation, the data is collected through Instagram’s and Facebook’s APIs, typically in the form of JSON (JavaScript Object Notation), and then transformed and stored in a relational database. This procedure for data collection from Instagram and Facebook gives access to private data normally only accessible in closed networks, such as user data (name, e-mail address, location, religion, political views, friends, biography, education, work history, etc.), posts (wall posts, including images, comments, check-ins, locations, etc.), news feeds (data from other users, including images, comments, locations, etc.), and groups. Researchers can select from a wide range of data types in our system, and we intend to provide access to the majority of the data available through the API, but the researcher may only choose the data types needed for his/her specific research project. Digital Footprints also provides researchers with the possibility to collect public data. In this case the researcher does not need the permission of a participant, but can simply enter the name of the public Facebook group or page or the Instagram hashtag in the Digital Footprints interface and start collecting data immediately. Researchers can also study their own data by inviting themselves as participants in a project, following the procedure for private data collection described above.
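The transformation from nested JSON to relational tables can be illustrated with a minimal sketch. It is not the Digital Footprints implementation: the JSON field names, the table layout and the function `store_posts` are assumptions made for illustration, and SQLite stands in for whatever relational database the software actually uses.

```python
import json
import sqlite3

# Hypothetical API response. The field names are illustrative and are not
# taken from the Facebook or Instagram API documentation.
SAMPLE_RESPONSE = json.dumps({
    "data": [
        {"id": "p1", "from": {"id": "u1"}, "message": "Hello festival",
         "created_time": "2014-06-13T18:30:00",
         "comments": [{"id": "c1", "from": {"id": "u2"},
                       "message": "See you there"}]},
        {"id": "p2", "from": {"id": "u1"}, "message": "Back home",
         "created_time": "2014-06-16T09:00:00", "comments": []},
    ]
})


def store_posts(conn, raw_json):
    """Flatten nested post/comment JSON into two related tables."""
    conn.execute("CREATE TABLE IF NOT EXISTS posts "
                 "(id TEXT PRIMARY KEY, author_id TEXT, message TEXT, "
                 "created_time TEXT)")
    conn.execute("CREATE TABLE IF NOT EXISTS comments "
                 "(id TEXT PRIMARY KEY, post_id TEXT REFERENCES posts(id), "
                 "author_id TEXT, message TEXT)")
    for post in json.loads(raw_json)["data"]:
        conn.execute("INSERT OR REPLACE INTO posts VALUES (?, ?, ?, ?)",
                     (post["id"], post["from"]["id"],
                      post.get("message", ""), post["created_time"]))
        # Each nested comment becomes its own row pointing back to the post.
        for comment in post.get("comments", []):
            conn.execute("INSERT OR REPLACE INTO comments VALUES (?, ?, ?, ?)",
                         (comment["id"], post["id"], comment["from"]["id"],
                          comment.get("message", "")))
    conn.commit()


conn = sqlite3.connect(":memory:")
store_posts(conn, SAMPLE_RESPONSE)
post_count = conn.execute("SELECT COUNT(*) FROM posts").fetchone()[0]
comment_count = conn.execute("SELECT COUNT(*) FROM comments").fetchone()[0]
print(post_count, comment_count)
```

Storing comments in a separate table keyed by post is one way to make each data unit individually searchable and countable, which is the property the relational storage is meant to provide.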

Data analysis

When the data is collected, Digital Footprints facilitates different kinds of searching, sorting and filtering options. To support the opportunity for researchers to study both individual users and user patterns across users in the panel, the system builds on two sorting mechanisms (participant sorting and panel sorting).

 

Figure 1: Self-reported demographic and interest data for a single participant in a research project.

 

 

Figure 2: Facebook Newsfeeds across participants in a particular research project.

 

For example, researchers can see wall posts and search for keywords either for specific participants (deep data) or for all users participating in the research project. For advanced filtering it is possible to make ranged queries by defining start and end dates, choosing which columns to search within (e.g., from, to, content, application, place, etc.) and deciding whether Facebook stories should be excluded from the search (a Facebook story is an auto-generated story from Facebook such as ‘A is now friends with B’). The current version also supports descending and ascending sorting of categories such as ‘likes’, ‘location’, ‘application’, ‘birthday’, and ‘civil status’ on Facebook projects and ‘likes’, ‘tags’, ‘place’, and ‘filter’ on Instagram data, along with the option of creating different ‘views’. The ‘views’ function allows the researcher to select specific data units and transfer them to different ‘sub-projects’, simulating manual coding techniques in qualitative studies.
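A ranged keyword query of the kind described here could look roughly like the following sketch. The table layout, the `is_story` flag and the function `search_posts` are hypothetical and chosen only for illustration; they do not reflect the actual Digital Footprints schema.

```python
import sqlite3


def search_posts(conn, keyword, start, end, include_stories=False):
    """Return posts containing `keyword` dated within [start, end], newest first."""
    sql = ("SELECT id, created_time, content FROM posts "
           "WHERE content LIKE ? AND created_time BETWEEN ? AND ?")
    params = [f"%{keyword}%", start, end]
    if not include_stories:
        # Exclude auto-generated stories such as 'A is now friends with B'.
        sql += " AND is_story = 0"
    sql += " ORDER BY created_time DESC"
    return conn.execute(sql, params).fetchall()


# Invented example rows; dates are ISO strings so they sort chronologically.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts "
             "(id TEXT, created_time TEXT, content TEXT, is_story INTEGER)")
conn.executemany("INSERT INTO posts VALUES (?, ?, ?, ?)", [
    ("p1", "2013-05-01", "Concert tonight", 0),
    ("p2", "2013-05-03", "A is now friends with B", 1),
    ("p3", "2013-07-10", "Another concert", 0),
    ("p4", "2013-05-20", "Best concert ever", 0),
])
results = search_posts(conn, "concert", "2013-05-01", "2013-05-31")
print([row[0] for row in results])
```

The keyword is passed as a bound parameter rather than concatenated into the SQL string, so search terms typed by a researcher cannot alter the query itself.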

To provide a more general overview of the data in a research project, for small-scale but especially for large-scale studies, we designed different descriptive statistics such as data type counts and visualizations, top 100 word counts, and data distributed by day, week, month and year.
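As a rough illustration of such descriptive statistics, the sketch below computes the most frequent words and the number of items per day and per month. The tokenization rule, the function names and the sample posts are assumptions; the software's own counting rules (for example stop-word handling) are not described in this article.

```python
from collections import Counter
from datetime import datetime
import re


def top_words(texts, n=100):
    """Return the n most frequent lowercase word tokens across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"\w+", text.lower()))
    return counts.most_common(n)


def counts_per_period(timestamps, fmt):
    """Count items per period; fmt is a strftime pattern such as '%Y-%m-%d'."""
    return Counter(datetime.fromisoformat(ts).strftime(fmt)
                   for ts in timestamps)


# Invented example data: (ISO timestamp, text).
posts = [
    ("2012-03-01T10:00:00", "Great day at the park"),
    ("2012-03-01T21:15:00", "Park again today"),
    ("2012-04-02T08:00:00", "Rainy day"),
]
words = top_words(text for _, text in posts)
per_day = counts_per_period((ts for ts, _ in posts), "%Y-%m-%d")
per_month = counts_per_period((ts for ts, _ in posts), "%Y-%m")
print(words[:2], dict(per_month))
```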

 

Figure 3: Examples of data type visualization and data distributed daily.

 

Furthermore, it was important to support export of datasets so that the data could be used in more advanced analytic and visualization tools (e.g., Tableau, R and IBM BigInsights). One of the challenges in making a standard tool for research purposes is, as already mentioned, that it cannot accommodate tailored solutions. An export function is a way to address such needs.
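A simple export could, for example, write the selected data units to CSV, a format that tools such as R and Tableau can read. The sketch below is only an assumption about what such an export might look like; the column names and the function `export_csv` are hypothetical, and the article does not state which export formats the software provides.

```python
import csv
import io


def export_csv(rows, fieldnames):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO()
    # extrasaction="ignore" drops keys that are not among the chosen columns.
    writer = csv.DictWriter(buffer, fieldnames=fieldnames,
                            extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Invented example rows with pseudonymous participant IDs.
rows = [
    {"participant": "P001", "type": "post", "likes": 3,
     "content": "Hello, world"},
    {"participant": "P002", "type": "comment", "likes": 0,
     "content": "Nice"},
]
csv_text = export_csv(rows, ["participant", "type", "likes", "content"])
print(csv_text)
```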

 

++++++++++

Analytical perspectives: Examples

At the time of writing, 150 researchers across 100 research projects in 13 different countries use Digital Footprints. In order to discuss more closely the analytical potentials and challenges that characterize the study of social media traces through Digital Footprints, the following section presents three different research projects that have driven the development of the software. The purpose of this presentation is to show the challenges connected to different research designs: a qualitative hybrid method study focusing on deep data (Bechmann, 2015), a quantitative study with a sample of 1,000 people mirroring the demographics of Denmark (Bechmann and Vahlstrup, 2015), and a cross-platform study of the social media layer of a music festival (Bechmann, et al., 2015). The findings of the studies are or will be presented and published elsewhere. The purpose of presenting the projects here is solely to show how we have used the software and to discuss the practical and methodological challenges in such use.

Personal data in social media networks

This first project was a qualitative study with interviews, screen dumps and API data from Facebook (using Digital Footprints) from Danish high school students and American college students. The project focused on analyzing and discussing the students’ behavior and attitudes towards personal data sharing in social media networks, with a special emphasis on connecting social media and thereby exchanging data across services. The participants were recruited in classrooms at the school/college, and the signup link was written on the blackboard. The information about the research project and how we intended to use the data was thereby given both as an oral presentation and as written text in the invitation on the signup page.

The study was inspired by ethno-mining (Anderson, et al., 2009), and the Digital Footprints software provided us with an opportunity to mix data and interviews: we would ask participants about their behavioral patterns and then look for data patterns according to the reflections of the participants (Bechmann, 2014; Bechmann, 2015). In the study we were primarily interested in ‘deep data’, looking for the ways in which the participants tend to share data across social media and the way the participants reflected on such behavior. Hence the visual interface and the search function of the software were very important in this study, as they allowed us to show the participants how we could search their data for particular topics and words and also to show them immediately the data entities and patterns that we found. This study also made use of the manual coding ‘views’ that allowed us as researchers to isolate parts of the dataset for further qualitative study.

The challenge in this study of ‘deep’ data was that we were not able to collect all data through the API. For instance, we wanted to examine what kind of external apps the participants shared their Facebook data with, but this information was not part of the API permissions at the time of data collection, so we needed the participants to take screen dumps of this information, which we then analyzed manually afterwards. Furthermore, the lack of documentation on the API made it difficult to know how much of the data we got out of the API compared to the data the participants were exposed to (in the newsfeed) and the data footprints the participants made (wall posts and group posts). For instance, we found that live activities (e.g., music played in Spotify or running routes from Runkeeper) from apps that required extra permission were not part of the extracted newsfeed. Furthermore, the newsfeed did not contain sponsored stories, and at the time of the study the API did not contain information on whether the participant had been exposed to a story or not. If a participant has many friends but does not often read the newsfeed, most likely not all of the stories that we extract have been read by or exposed to the participant.

Facebook footprints

The second project was a study in which we wanted to look at the Facebook data patterns of a national (Danish) population on Facebook. We recruited a sample of 1,000 Danes with demographics mirroring the Danish Facebook population through collaboration with the Internet panel Userneeds. The recruitment took place in the panel as a threefold written informed consent procedure where the participants were introduced to the research project on the panel Web site, followed by additional information on the Digital Footprints signup page and lastly by information on Facebook before accepting to share their data with the research team. We added the opportunity to exit the signup procedure at any time, and the Digital Footprints software was designed to hide all names from the dataset, replacing them with serial numbers. Furthermore, the software was designed to create alerts when data collection unintentionally stopped (e.g., because participants changed passwords in response to the Heartbleed security vulnerability), and it was important for us to build in more general statistics on data patterns over time and top words for the dataset. The focus was primarily on semantic analysis, behavioral data pattern recognition, and the use of platforms and applications (Bechmann and Vahlstrup, 2015).
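Replacing names with serial numbers can be sketched as follows. This is a minimal illustration under stated assumptions, not the software's actual procedure: the ID format, the function `make_pseudonymizer` and the example names are invented, and the sketch only covers structured name fields, not names that occur inside free text.

```python
def make_pseudonymizer(prefix="P"):
    """Return a function that maps each distinct name to a stable serial ID."""
    mapping = {}

    def pseudonymize(name):
        # The first time a name is seen it receives the next serial number;
        # later occurrences reuse the same ID so patterns stay traceable.
        if name not in mapping:
            mapping[name] = f"{prefix}{len(mapping) + 1:04d}"
        return mapping[name]

    return pseudonymize


pseudonymize = make_pseudonymizer()

# Invented example records.
posts = [
    {"author": "Anna Jensen", "content": "Hello"},
    {"author": "Bo Nielsen", "content": "Hi"},
    {"author": "Anna Jensen", "content": "Again"},
]
anonymized = [{**post, "author": pseudonymize(post["author"])}
              for post in posts]
print([post["author"] for post in anonymized])
```

Keeping the mapping stable within a project means that activity by the same person can still be linked across posts without exposing the name in the dataset.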

The big challenge in this project was to automate the registration of topics across the total number of, for instance, groups, rather than just hand coding a random subset. Furthermore, we struggled with automated analysis of pictures as an alternative to analyzing pictures either through manual pattern detection or through metadata such as picture text, filename and tags (Vis, 2013).

Another major obstacle was to properly identify the Danish population on Facebook and not just the Danish population in total. ‘Only’ 65 percent of the Danish population is on Facebook, but there are no continuous, reliable and official statistics on the demographics of this population. This makes sampling more difficult, as Danes who are not on Facebook are irrelevant to a study of Facebook patterns. Also, when studying data over time (in this case seven years), the demographics of the participants do not stay the same during the period. We can account for age differences, but changes in municipality are more difficult to account for (e.g., when people move).

In the study it became very evident that changes in API structure and data access, along with interface design, played a role in determining the development of data patterns over time. For example, we had a data peak in 2012 that was created by automated stories from Farmville. The stories were not Facebook stories but came from a third party app allowed to post on behalf of the user. When we compare data patterns across different versions of the API and interface, this is a methodological issue that ideally needs to be mentioned and that potentially lowers the reliability of the findings.

In such large-scale studies where only API data exist, it is clearly difficult to detect contextual factors surrounding the digital traces. At the same time the dataset contains a high level of detail in terms of digital footprints, from which new questions arise and can be tested. Still, the API sets fixed limits for what we are able to retrieve and thereby what we are able to examine (Lomborg and Bechmann, 2014; Vis, 2013).

Measuring impact across social media

The last example of analytical perspectives that we present and discuss is a study titled ‘Measuring impact across social media’. The aim of the study was to analyze the digital social layer of a music festival (the indie music festival NorthSide in Denmark). Through Digital Footprints we collected data from the festival’s official Facebook page, as well as its Twitter and Instagram accounts and five hashtags from Instagram and Twitter (using yourTwapperkeeper), including the official hashtag (NS14). The official hashtag was collected from two months before the festival until a week after the festival (Bechmann, et al., 2015). Furthermore, we interviewed and collected private streams across platforms from 15 participants sampled to maximize variation (age, gender, education) within the target group of the festival. The participants were recruited through the NorthSide newsletter and through second order networks of friends on Facebook. As our research interest was to examine the characteristics of the digital social layer of the festival and how the festival event took place across social media, we needed to decide on a way to compare the digital traces in Facebook, Twitter and Instagram when expanding the Digital Footprints software to Instagram data collection. For instance, we decided that a heart on Instagram was comparable to a like on Facebook and a favorite on Twitter; that comments, tags, time codes and geo-location data were comparable whether on Facebook, Twitter (@mention) or Instagram; and that a Facebook status update was comparable to an Instagram photo/video and a tweet.
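The comparability decisions described above can be expressed as a simple lookup table. The sketch below is illustrative only: the category labels (‘endorsement’, ‘update’, ‘comment’) and the unit identifiers are our own hypothetical names, not terms used by the platforms or by the Digital Footprints software.

```python
from collections import Counter

# Platform-specific units mapped to shared categories, following the
# comparisons described in the text. Labels and keys are illustrative.
COMPARABLE_UNITS = {
    ("facebook", "like"): "endorsement",
    ("instagram", "heart"): "endorsement",
    ("twitter", "favorite"): "endorsement",
    ("facebook", "status_update"): "update",
    ("instagram", "photo_video"): "update",
    ("twitter", "tweet"): "update",
    ("facebook", "comment"): "comment",
    ("instagram", "comment"): "comment",
    ("twitter", "mention"): "comment",
}


def normalize(platform, unit):
    """Map a platform-specific unit to its shared category, or None if unmapped."""
    return COMPARABLE_UNITS.get((platform.lower(), unit.lower()))


def count_by_category(events):
    """Count (platform, unit) events per shared category, skipping unmapped units."""
    counts = Counter()
    for platform, unit in events:
        category = normalize(platform, unit)
        if category is not None:
            counts[category] += 1
    return counts


# Invented example events; the retweet has no mapped counterpart here.
events = [("Facebook", "like"), ("Instagram", "heart"),
          ("Twitter", "favorite"), ("Twitter", "tweet"),
          ("Instagram", "photo_video"), ("Twitter", "retweet")]
totals = count_by_category(events)
print(dict(totals))
```

Units without a counterpart on every platform are deliberately left unmapped in this sketch, so they drop out of cross-platform totals instead of being forced into a category.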

However, comparing different social media services is not trivial, as they each have their own communicative rules, logic and content format (van Dijck, 2013). Instagram has no share function, as Twitter (retweet) and Facebook do, and hashtags are not ‘administered’ by one organization as a Facebook page is. Hashtag data therefore require more cleaning, because the same hashtag can relate to topics or events other than the one of interest.

Whereas hashtag analyses capture user-user or user-organization communication, the Facebook page and the Twitter and Instagram accounts primarily capture communication led by the event organizer, thus leaving much of the ‘social media layer’ communication of the festival to the personal feeds. On the other hand, we assumed that not all updates were tagged with the five most common hashtags on Instagram and Twitter, which would weaken the research design if the personal feeds were not included in the study. Therefore we chose to collect mainly public feeds, but used the smaller sample of personal feeds to, among other things, test the method: approximately how much of the data was visible in the public feeds, and how much data did we not collect?
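
The method test described above amounts to a simple overlap calculation. The sketch below assumes that items in the personal and public feeds can be matched on a shared identifier; the function name and the use of plain ID lists are hypothetical and not part of Digital Footprints.

```python
def public_coverage(personal_ids, public_ids):
    """Share of a participant's festival-related items that also appear
    in the publicly collected feeds (hashtags and official accounts)."""
    personal = set(personal_ids)
    if not personal:
        raise ValueError("no personal-feed items to compare")
    seen_in_public = personal & set(public_ids)
    return len(seen_in_public) / len(personal)
```

A value of 0.5 would mean that half of the participant's relevant updates were missed by the public collection alone.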

These data led to analytical perspectives such as measuring the amount of data uploads (updates and comments) before, during, and after the festival across the three platforms in order to analyze the data peaks on each platform. We also measured the amount of data and manually coded the content of the streams (subset) according to the time codes of the festival music program to see if some concerts created more data than others and what kind of data they created. Further, the team used word counts to compare content across platforms; geo-location data to show the impact of the festival in terms of where the data was uploaded from at different geographical scales (world, country, city, and festival venue); and network analysis to examine whether the same users posted the most on different platforms and whether the users tagged content with the same hashtags (see also Bechmann, et al., 2015).
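
The before/during/after comparison can be illustrated with a short counting routine. This is a minimal sketch under stated assumptions: each item is reduced to a hypothetical (platform, timestamp) pair, and the festival start and end times are supplied by the researcher.

```python
from collections import Counter
from datetime import datetime


def phase(ts: datetime, start: datetime, end: datetime) -> str:
    """Classify a timestamp relative to the festival period."""
    if ts < start:
        return "before"
    if ts > end:
        return "after"
    return "during"


def uploads_by_phase(items, start: datetime, end: datetime) -> Counter:
    """Count items per (platform, phase); items are (platform, timestamp) pairs."""
    counts = Counter()
    for platform, ts in items:
        counts[(platform, phase(ts, start, end))] += 1
    return counts
```

The same bucketing could be applied at a finer grain, for example per concert slot in the music program, by replacing the three phases with the program's time codes.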

 

++++++++++

Discussion: Data quality, access and ethics

As the exemplifying projects above show, there has been a large potential in using Digital Footprints, owing especially to the access to data, the ability to make data searchable and sortable in the database, and the ability to analyze data both internally in Digital Footprints through the graphical user interface and externally in third party applications. However, the exemplifying projects also highlight large challenges when it comes to data extraction and analysis. In this section we will discuss and add to these issues within three general challenges: Data quality, data access and analysis, and legal and ethical considerations.

Data quality

Even though both Facebook and Instagram provide an API that allows researchers to collect high quality data through Digital Footprints, there is still little documentation of the API structure, and the APIs are not as stable as live data feeds. Even though this article set out to open the black box of Digital Footprints, the flaws, limitations and lack of documentation on the API side still leave researchers in the dark when using Digital Footprints or any other data collection app, and restrict both research designs and data reliability. The most profound limitation depends on the research interest, but it can be server stability, time delay, or missing data that require extended permissions.

In order to secure the reliability and transparency in quantitative and qualitative studies the data received through the app ideally need to match the data delivered or viewed by the research participants. Neuhaus and Webmoor (2012) argue that the public API of Twitter only gives access to a sample size of the actual data streams. Our design shows that this seems not to be the case with Facebook. However, data collection is heavily dependent on the stability of the Facebook servers and the permission retrieved. For instance, using Digital Footprints in the first qualitative user study we had an empty time frame in the profile feed data. This was most likely due to a server outage at Facebook, and we had to restore the data by re-retrieving it from Facebook at a later date.
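
An empty time frame of the kind described above can be flagged automatically before analysis. The following sketch assumes only that each retrieved entry carries a timestamp; the function name and the gap threshold are hypothetical and would have to be chosen per participant and research design.

```python
from datetime import datetime, timedelta


def find_gaps(timestamps, max_gap: timedelta):
    """Return (previous, next) timestamp pairs that are further apart
    than max_gap, i.e. candidate holes in the collected stream."""
    ordered = sorted(timestamps)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > max_gap]
```

A flagged gap is only a candidate: it may reflect a server outage, but it may equally reflect a period in which the participant was simply inactive, so it should be checked against a later re-retrieval or against the participant's own account.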

The API data has a time delay on Facebook compared to the live data feeds, especially when studying the Facebook news feed. We wanted to register the news feed flow to which participants were exposed. However, the data showed that the API transferred all entries to which the participants could potentially be exposed. Surprisingly, at the time of the study Facebook did not register whether or not a user had been exposed to an entry. This means that, as a methodological tool, Digital Footprints cannot simulate the actual flow on the user’s screen, because we do not know whether the user has actually been exposed to all updates. For example, she would not have been exposed to all updates if she logs on to Facebook very seldom or has many friends/potential updates. In these cases, Digital Footprints retrieves more updates than the user has been exposed to. Still, the API provided exact time stamps for each entry and exact coordinates for location-tagged data, and we were able to contextualize each data unit because each unit has a unique identification code.

Furthermore, live data on Facebook from partners is not included unless the researcher knows specifically whether the user has installed a certain app (e.g., Spotify or Endomondo). We wanted to determine how many posts were updated from a third party application such as Twitter, Spotify, Instagram, or YouTube. The API supports this functionality, but when analyzing the data, no entries from Spotify and Pinterest, for example, appeared, even though we could see from the screen dumps that participants had connected to these apps. We found that an extended permission was needed to access actions from third party apps, and that this caused the lack of data. To get hold of these entries we needed to request permissions to access each individual namespace for the apps used. For some research projects this request is difficult, because they have to know which apps the user has installed in order to automate the information retrieval. For other projects focusing on a particular app (e.g., Spotify) this is trivial, as one can assume that all users have the app and therefore retrieval can easily be automated.
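
The mismatch between connected apps and retrieved entries can be checked systematically. The sketch below is illustrative only: it assumes that each retrieved post has been reduced to a dictionary with an optional "application" key holding the name of the posting app, which is a hypothetical layout rather than the Digital Footprints schema, and that the list of connected apps comes from another source such as screen dumps or interviews.

```python
from collections import Counter


def posts_per_app(posts) -> Counter:
    """Count posts by originating app; posts without the field count as 'unknown'."""
    return Counter(p.get("application") or "unknown" for p in posts)


def missing_apps(connected_apps, posts):
    """Apps the participant is known to have connected (e.g., from screen dumps)
    that never appear as a source in the retrieved data."""
    observed = set(posts_per_app(posts))
    return sorted(set(connected_apps) - observed)
```

An app listed by `missing_apps` does not prove that data is missing, since the participant may never have posted through it, but it indicates where an extended permission may be needed.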

When going back in time in data collection, it is difficult for the researcher to see whether data patterns occur due to changes in the user interface or API structure, because such changes are not accounted for. The APIs are typically designed for developers of games, quizzes, and other user action oriented apps. The degree or type of accuracy needed in such cases is different from that needed when using the API for research purposes. These types of apps do not need a complete one-to-one mapping of the user activity data, only enough to optimize the service for the user.

Karpf (2012) argues that this is the case with most software-supported methods online, but we argue that the problem is not that data is missing and that we need to kludge methods together. Rather, the challenge arises when we do not know what kind of data we collect and what we are missing. In other words: where are the blind data spots? Do we see them all? In some cases researchers can take approaches inspired by reverse engineering by looking at publicly available data, but when using private data this is more difficult. In these cases we need supplementary methods to test the data quality, for instance interviews and observations, as we have done in some of the exemplifying research projects. We can find outliers in the data that will reveal some aspects to ‘clean’, but we are not fully able to account for missing data if the gap does not create a significant pattern and if we are not able to reconstruct the data (e.g., through public streams) or to ask participants.

Due to server capacity and data limitations from social media service providers, the free version of Digital Footprints accessible to other researchers currently has data limitations. However, collaborating research partners do not have this limitation.

Data access and analysis

During the project period of three years from 2012 to 2015 it was evident to us that Facebook added more features, thereby increasing the different types of data that could be collected from the service, but at the same time the service at some points limited the API data access, in line with Twitter. At the time of writing this was not yet the case with Instagram, but in April 2015 Facebook tightened the review process for applications using private data with informed consent. This means that research apps collecting data through the API needed to verify, on equal terms with business apps, how they use the data to improve or enhance the user experience of Facebook. Facebook will then manually evaluate whether they will support this kind of usage.

Some of the large research apps (e.g., Netvizz) are closing main features due to these restrictions. This profound (nearly monopolized) commercial control over data streams, and the power to shut down expensive, government-funded, generic, multi-user research tools whose research follows strict ethical and legal control (e.g., approval from national agencies and IRBs), shows the vulnerability of APIs as a research method. It also calls into question the ability of the research field to actually provide user-friendly (and expensive) social media software tools for research as an important part of science infrastructure and digital humanities. The software tools rely solely on the access provided by the commercial social media services and can instantly be shut down or limited. On the other hand, if research relies on paid datasets from Datasift or other data companies collaborating with, for instance, Twitter, we will experience even greater difficulties in accounting for missing data and for the black boxing of the APIs and the apps collecting the data. The result would be a return to the starting point of this article: that research on social media originates from social media companies or cooperating partners. This in turn also creates an opaque empirical basis and black boxing due to competitive interests. As most socializing and time spent on the Internet happens in gated commercial services, this has great impact on Internet research and digital social sciences.

Legal and ethical considerations

Is shutting down API data retrieval good for the privacy of social media users? In some respects the review process secures users against apps that retrieve data without using it to enhance the Facebook experience (Facebook terms of use), but it does not prevent apps from using data for unauthorized purposes, or from pretending to retrieve data for one purpose and then using it for another. Closing API data may lead to a less transparent commercial exploitation of data, where data can be purchased by the highest bidder in non-public forums in the growing data brokerage industry. The immediate benefits of less open data sharing can thus lead to even less user control over data and less transparency over who has access to data and for what purpose.

Even though researchers are provided with legal permission for API data retrieval and analysis when it contains personal information it is still relevant to raise the ethical issues of mining massive amounts of data designed to be personally identifiable. This is relevant for public as well as private data and across different types of social media platforms such as Instagram, Facebook and Twitter.

In private data we both technically and legally need the informed consent of users. As the article has demonstrated, we facilitate an extensive informed consent procedure in Digital Footprints. However, existing studies (see also Bechmann, 2014) have shown that users do not read permission forms before accepting, even though Facebook provides users with information in several steps. The users mainly rely on the trustworthiness of the sender or the ‘friend’/peer distributing the app (Bechmann, 2014). In some ways securing user privacy in studies of public data, such as hashtag analysis, is more difficult because we do not have contact details of individual users. This grey zone of data still provides researchers with the ability to identify the user both intentionally and unintentionally. For instance, in the project on measuring impact across social media we wanted to examine whether the top-posting users were the same across platforms, but even though it is public data it is questionable whether users want their names to appear in this context.

When securing the informed consent of participants, Digital Footprints’ privacy policy draws attention to the fact that researchers are bound by privacy regulations to anonymize data collected from users. We also make sure to state to the user that several cases have shown that it is possible to de-anonymize data and text strings (Krishnamurthy and Wills, 2009; Ohm, 2010). Even though we try to follow the basic privacy-by-design principle of transparency (Kelley, et al., 2011; Korth, et al., 2011; Barkhuus, 2012), we are caught by the privacy as well as the transparency paradox (Barnes, 2006; Nissenbaum, 2011) because users neglect to read privacy policies. If transparency becomes too complex, as is the case with a large number of permissions on Facebook, users do not read through them. Therefore we encourage researchers to inform users personally, if possible (Lomborg and Bechmann, 2014).

When users create a Facebook account, they consent to the fact that ‘friends’ will be able to ‘open’ users’ data for third parties. The API thereby provides software, like Digital Footprints, with access to data streams containing data units from other persons than the participants consenting. This is especially the case in the read_stream permission (wall and newsfeed data streams) and groups. Even though informed consent has been provided from the privacy policy contract with Facebook, this does not mean that it is not ethically sensitive and needs to be handled with careful consideration. In our research we use the anonymization function in order to protect the identities not only in publication but also in the analytical/data mining phase. However, handling information about other individuals, other than a specific participant, in a ‘community’ is a common challenge to ethnographic research, whether located digitally or physically (see also Hammersley and Atkinson, 1995).
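
To illustrate what protecting identities already in the analytical phase can involve, the sketch below replaces user identifiers with keyed hashes. It is not the anonymization function of Digital Footprints, whose implementation is not described here; the function name, key handling and token length are assumptions. Strictly speaking this is pseudonymization: the same user keeps the same token so that patterns can still be analyzed, and the content of posts may still allow re-identification (see Ohm, 2010).

```python
import hashlib
import hmac


def pseudonymize(user_id: str, secret_key: bytes) -> str:
    """Replace a user identifier with a stable keyed hash (HMAC-SHA256).

    The key must be stored separately from the data; without it the
    tokens cannot be recomputed from known user IDs.
    """
    digest = hmac.new(secret_key, user_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]  # shortened token, for readability only
```

Applying the same function to the identifiers of non-consenting ‘friends’ in a stream lets network structure be analyzed without the researcher seeing names during data mining.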

The biggest challenges when trying to create an infrastructure based on personal data, such as information extracted from social media, are different legislative environments and ethical issues on an international scale. Having worked with cross-national research projects addressing privacy concerns in Digital Footprints, we have found that this aspect is not trivial. Designers and developers ideally need insight into regulations and ethical standards in a variety of nation states in order to secure appropriate protections. Researchers may not have proper insight into different national regulations, which places legal responsibility on a principal investigator. This situation may not necessarily provide optimal legal and ethical protection for participants. The ideal solution to this problem has yet to be discovered. We addressed this challenge by manually reviewing all applications and requiring an official university e-mail address from full-time, employed researchers. In this way, API tools with clearly defined researchers may protect participants against the use of data in commercial settings.

 

++++++++++

Conclusion

This paper proposed ‘black boxing’ (Latour, 1987) as one of the main challenges to API research, illuminating the benefits and challenges of using Digital Footprints as a social media research infrastructure. Despite different methodological implications and challenges identified in various research designs, three key challenges were identified relative to Digital Footprints as a tool and especially to APIs as data sources: Data quality; data access; and, ethical and legal concerns.

In terms of data quality, we discussed the analytical implications of not being able to recognize missing data and the associated challenge of placing data in context, actually studying usage and not just data patterns. Hence, it is important to not be overly enthusiastic about opportunities to collect data from APIs, and the ease with which this may be done. Is it possible to deliver sustainable generic multiuser tools for research when API access changes, limiting the effectiveness of specific apps for research and data analytical purposes? There are difficulties in cross-analyzing social media data because both API structures and sociological profiles of services are different.

Last but not least, we examined the ethical and legal challenges in analyzing personal information, with access to participants’ friends as well as conducting cross-national research with different legal and ethical considerations.

Based on this discussion, future research and infrastructure design need to address these challenges by tightening the focus on legal and ethical concerns in actual software design. In addition, there is a need for contextual awareness by incorporating more two-way reflexive feedback systems (such as digital diaries) that can place data patterns in context not only by interviewing participants, but through an analytical process.

 

About the authors

Anja Bechmann is Associate Professor in Digital Media and Head of Digital Footprints Research Group at Aarhus University, Denmark.
E-mail: anjabechmann [at] dac [dot] au [dot] dk

Peter B. Vahlstrup is Teaching Assistant Professor in Programming and Head of Development of Digital Footprints at Aarhus University, Denmark.
E-mail: imvpbv [at] dac [dot] au [dot] dk

 

Acknowledgments

We thank Aarhus University (DAC) and Digital Humanities Lab (DK) for help funding the system design and for support on the legal matters of our software. Furthermore we thank Facebook for engaging in a dialogue on the terms of use interpreted in this specific research context.

 

References

K. Anderson, D. Nafus, T. Rattenbury and R. Aippersbach, 2009. “Numbers have qualities too: Experiences with ethno-mining,” Ethnographic Praxis In Industry Conference, volume 2009, number 1, pp. 123–140.
doi: http://dx.doi.org/10.1111/j.1559-8918.2009.tb00133.x, accessed 24 November 2015.

L. Barkhuus, 2012. “The mismeasurement of privacy: Using contextual integrity to reconsider privacy in HCI,” CHI ’12: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 367–376.
doi: http://dx.doi.org/10.1145/2207676.2207727, accessed 24 November 2015.

S.B. Barnes, 2006. “A privacy paradox: Social networking in the United States,” First Monday, volume 11, number 9, at http://firstmonday.org/article/view/1394/1312, accessed 24 November 2015.

N.K. Baym, 2000. Tune in, log on: Soaps, fandom, and online community. Thousand Oaks, Calif.: Sage.

A. Bechmann, 2015. “Managing the interoperable self,” In: A. Bechmann and S. Lomborg (editors). The ubiquitous Internet: User and industry perspectives. New York: Routledge, pp. 54–73.

A. Bechmann, 2014. “Non-informed consent cultures: Privacy policies and app contracts on Facebook,” Journal of Media Business Studies, volume 11, number 1, pp. 21–38.
doi: http://dx.doi.org/10.1080/16522354.2014.11073574, accessed 24 November 2015.

A. Bechmann and P. Vahlstrup, 2015. “Private Facebook data patterns in a broad national sample,” paper presented at Digital sociology: The first digital sociology conference 2015 (New York).

A. Bechmann, J.L. Jensen, and P. Vahlstrup, 2015. “Studying social media data across platforms on planned event: Facebook, Instagram and Twitter data patterns at music festival,” paper presented at Users across media conference, University of Copenhagen.

J. Brandt, N. Weiss, and S.R. Klemmer, 2007. “txt 4 l8r: Lowering the burden for diary studies under mobile conditions,” CHI EA ’07: CHI ’07 Extended Abstracts on Human Factors in Computing Systems, pp. 2,303–2,308.
doi: http://dx.doi.org/10.1145/1240866.1240998, accessed 24 November 2015.

A. Bruns, 2012. “How long is a tweet? Mapping dynamic conversation networks on Twitter using Gawk and Gephi,” Information, Communication & Society, volume 15, number 9, pp. 1,323–1,351.
doi: http://dx.doi.org/10.1080/1369118X.2011.635214, accessed 24 November 2015.

R. Caers, T. De Feyter, M. De Couck, T. Stough, C. Vigna, and C. Du Bois, 2013. “Facebook: A literature review,” New Media & Society, volume 15, number 6, pp. 982–1,002.
doi: http://dx.doi.org/10.1177/1461444813488061, accessed 24 November 2015.

S.A. Cantanese, P. De Meo, E. Ferrara, G. Fiumara, and A. Provetti, 2011. “Crawling Facebook for social network analysis purposes,” paper presented at WIMS ’11; version at http://cogprints.org/7663/1/71-catanese.pdf, accessed 24 November 2015.

S. Carter and J. Mankoff, 2005. “When participants do the capturing: The role of media in diary studies,” CHI '05 Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 899–908.
doi: http://dx.doi.org/10.1145/1054972.1055098, accessed 24 November 2015.

C. Courtois, P. Mechant, and L. De Marez, 2011. “Teenage uploaders on YouTube: Networked public expectancies, online feedback preference, and received on-platform feedback,” Cyberpsychology, Behavior, and Social Networking, volume 14, number 5, pp. 315–322.
doi: http://dx.doi.org/10.1089/cyber.2010.0225, accessed 24 November 2015.

J. van Dijck, 2013. The culture of connectivity: A critical history of social media. New York: Oxford University Press.

European Union, 2002. “Data protection in the electronic communications sector: Directive 2002/58/EC” (12 July), at eur-lex.europa.eu/, accessed 24 November 2015.

European Union, 1995. “Protection of personal data: Directive 95/46/EC” (24 October), at eur-lex.europa.eu/, accessed 24 November 2015.

M. Gjoka, M. Sirivianos, A. Markopoulou, and X. Yang, 2008. “Poking Facebook: Characterization of OSN applications,” WOSN ’08: Proceedings of the First Workshop on Online Social Networks, pp. 31–36.
doi: http://dx.doi.org/10.1145/1397735.1397743, accessed 24 November 2015.

M. Hammersley and P. Atkinson, 1995. Ethnography: Principles in practice. London: Routledge.

C. Hine, 2000. Virtual ethnography. London: Sage.

P.N. Howard, 2002. “Network ethnography and the hypermedia organization: New media, new organizations, new methods,” New Media & Society, volume 4, number 4, pp. 550–574.
doi: http://dx.doi.org/10.1177/146144402321466813, accessed 24 November 2015.

D. Karpf, 2012. “Social science research methods in Internet time,” Information, Communication & Society, volume 15, number 5, pp. 639–661.
doi: http://dx.doi.org/10.1080/1369118X.2012.665468, accessed 24 November 2015.

P.G. Kelley, L. Cesca, J. Bresee, and L.F. Cranor, 2011. “Intentional design of privacy policies,” CHI 2011 Workshop on Networked Privacy, at https://networkedprivacy.files.wordpress.com/2011/04/kelley_cesca_bresee_cranor.pdf, accessed 24 November 2015.

L. Kendall, 2002. Hanging out in the virtual pub: Masculinities and relationships online. Berkeley: University of California Press.

M. Korn, 2012. “From hybrid spaces to experiencing augmented places,” paper presented at International Research Workshop about mobile communication as a cultural, spatial and social phenomenon; version at http://mkorn.binaervarianz.de/pub/korn-mtsh2012-slides.pdf, accessed 24 November 2015.

A. Korth, S. Baumann, and A. Nürnberger, 2011. “An interdisciplinary problem taxonomy for user privacy in social networking services,” CHI 2011 Workshop on Networked Privacy, at http://www.dfki.de/web/research/publications?pubid=5557, accessed 24 November 2015.

A.D.I. Kramer, 2012. “The spread of emotion via Facebook,” CHI ’12: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 767–770.
doi: http://dx.doi.org/10.1145/2207676.2207787, accessed 24 November 2015.

B. Krishnamurthy and C.E. Wills, 2009. “On the leakage of personally identifiable information via online social networks,” WOSN ’09: Proceedings of the Second ACM Workshop on Online Social Networks, pp. 7–12.
doi: http://dx.doi.org/10.1145/1592665.1592668, accessed 24 November 2015.

S. Kvale and S. Brinkmann, 2009. Interviews: Learning the craft of qualitative research interviewing. Second edition. Los Angeles: Sage.

B. Latour, 1987. Science in action: How to follow scientists and engineers through society. Cambridge, Mass.: Harvard University Press.

S. Lomborg and A. Bechmann, 2014. “Using APIs for data collection on social media,” Information Society, volume 30, number 4, pp. 256–265.
doi: http://dx.doi.org/10.1080/01972243.2014.915276, accessed 24 November 2015.

M. Magnani, D. Montesi, and L. Rossi, 2012. “Conversation retrieval from microblogging sites,” Information Retrieval, volume 15, numbers 3–4, pp. 354–372.
doi: http://dx.doi.org/10.1007/s10791-012-9189-9, accessed 24 November 2015.

A.N. Markham, 1998. Life online: Researching real experience in virtual space. Walnut Creek, Calif.: Altamira Press.

F. Neuhaus and T. Webmoor, 2012. “Agile ethics for massified research and visualization,” Information, Communication & Society, volume 15, number 1, pp. 43–65.
doi: http://dx.doi.org/10.1080/1369118X.2011.616519, accessed 24 November 2015.

H. Nissenbaum, 2011. “A contextual approach to privacy online,” Daedalus, volume 140, number 4, pp. 32–48, and at http://www.amacad.org/publications/daedalus/11_fall_nissenbaum.pdf, accessed 24 November 2015.

P. Ohm, 2010. “Broken promises of privacy: Responding to the surprising failure of anonymization,” UCLA Law Review, volume 57, pp. 1,701–1,777, and at http://www.uclalawreview.org/pdf/57-6-3.pdf, accessed 24 November 2015.

B. Rieder, 2013. “Studying Facebook via data extraction: The Netvizz application,” WebSci ’13: Proceedings of the 5th Annual ACM Web Science Conference, pp. 346–355.
doi: http://dx.doi.org/10.1145/2464464.2464475, accessed 24 November 2015.

F. Vis, 2013. “A critical reflection on big data: Considering APIs, researchers and tools as data makers,” First Monday, volume 18, number 10, at http://firstmonday.org/article/view/4878/3755, accessed 24 November 2015.
doi: http://dx.doi.org/10.5210/fm.v18i10.4878, accessed 24 November 2015.

R.E. Wilson, S.D. Gosling, and L.T. Graham, 2012. “A review of Facebook research in the social sciences,” Perspectives on Psychological Science, volume 7, number 3, pp. 203–220.
doi: http://dx.doi.org/10.1177/1745691612442904, accessed 24 November 2015.

 


Editorial history

Received 24 April 2015; accepted 11 November 2015.


Creative Commons License
This paper is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

Studying Facebook and Instagram data: The Digital Footprints software
by Anja Bechmann and Peter B. Vahlstrup.
First Monday, Volume 20, Number 12 - 7 December 2015
http://firstmonday.org/ojs/index.php/fm/article/view/5968/5166
doi: http://dx.doi.org/10.5210/fm.v20i12.5968






© First Monday, 1995-2017. ISSN 1396-0466.