Towards retrieving social events from spatio-temporal photo annotation
First Monday

by Davi Oliveira Serrano de Andrade, Anderson Almeida Firmino, Claudio de Souza Baptista, and Hugo Feitosa de Figueiredo

An event can be defined as a happening that gathers people with some common goal over a period of time and in a certain place. This paper presents a new method to retrieve social events through annotations in spatio-temporal photo collections, known as STEve-PR (Spatio-Temporal EVEnt Photo Retrieval). The proposed technique uses a clustering algorithm to gather similar photos by considering the location, date and time of the photos. The STEve-PR clustering approach clusters photos belonging to the same event. STEve-PR uses spatial clusters created to propagate event annotation between photos in the same cluster and employs TF-IDF similarity between tags to find the spatial cluster with the highest similarity for photos without a geographical location. We evaluated our approach on a public database.


1. Introduction
2. Related work
3. STEve-PR proposal
4. Validation
5. Results
6. Conclusions and future work



1. Introduction

Recent data from Facebook [1] shows that more than 300 million photos are uploaded daily, and more than 100 million photos and videos are uploaded daily to Instagram [2]. Simply put, digital photos have become part of our everyday routine (Kofoed and Larsen, 2016; Venema and Lobinger, 2017).

To help users organize their photographs, metadata such as location, date and tags can be used. Photo annotation is the process in which software or a person assigns metadata to a digital picture, in the form of subtitles or keywords (de Andrade, et al., 2018).

In the last several years, automatic photo organization techniques have been proposed to improve the photo retrieval procedure (Firmino, et al., 2019; Sansone, et al., 2017; Huang, et al., 2016). Some researchers in this field have noted that one way to organize photos is through the event concept (Nguyen, et al., 2013; Ahmad, et al., 2018; Schinas, et al., 2018). An event is a happening that gathers people with some common goal over a period of time and in a certain place (Figueirêdo, et al., 2012).

Related research follows different approaches when organizing photos by events. Some studies try to annotate the event in the photo (Dao, et al., 2013; Firmino, et al., 2019), and others seek to cluster photos from the same event (Datia, et al., 2017; Ahmad, et al., 2018), regardless of the annotation of events. These clustering and annotation approaches may cover two types of events: personal events and social events (de Andrade, et al., 2018). Personal events are those relating to the personal life of an individual, such as a “Dad’s birthday” event. Social events, on the other hand, are those that represent an event in society, for example, the “Rock in Rio” event.

Many current devices have digital processing coupled to capturing photos, generating metadata, such as face identification to determine whether people are smiling. This information can be used to facilitate the annotation of people and location, for example. Thus, the annotation techniques may be focused on the use of the information attached to the photos, apart from low-level photo metadata such as texture, colour and shape of objects that are gathered from image processing techniques (Schinas, et al., 2018).

In addition to the information generated by devices and existing annotation techniques, users are also generating semi-structured content through textual tags associated with their photos. Thus, in addition to the photo metadata, there is also information generated by the user who owns the photo.

To facilitate event detection in digital photo collections on the Internet, we propose a spatio-temporal technique to automatically annotate social events in photo collections, called STEve-PR (Spatio-Temporal EVEnt Photo Retrieval). The target audience of STEve-PR is anyone who is interested in photograph annotation in order to better organize and improve retrieval of photos in collections.

STEve-PR uses a clustering algorithm to gather similar photos by considering the location, date and time of the photos. Moreover, STEve-PR uses the clusters created and the tags recorded in photos to accomplish automatic propagation of event annotations. The main contribution presented in this article is a new spatio-temporal event-related photo retrieval technique that can cluster photos of the same event through photo location and time, independently of external social networks. Furthermore, we propose an automatic event annotation method that considers current events, user-defined events or unknown events. Our proposed technique outperforms state-of-the-art techniques on the same photo dataset.

We chose to use an open photo dataset, the ReSEED collection (Reuter, et al., 2014), and existing algorithms to compare the annotation results of our proposed method with the state of the art. The ReSEED dataset contains a large and diverse array of Flickr images corresponding to a heterogeneous assortment of different social events and social event types. We used the ReSEED dataset because it is open, which makes it easier for other researchers to replicate our experiments.

The major contribution of this work is a combination of existing clustering algorithms to perform event detection with an outstanding performance when compared to the state of the art. Our work makes use of spatio-temporal and tag information to perform event detection.

The remainder of this article is structured as follows. Section 2 presents a survey of related work on the development of event annotation in photo collections. Section 3 describes our proposed technique. Our experimental setup is described in Section 4. We present and analyze the results of our experiments in Section 5. We summarize our results, draw conclusions and identify points for future research in Section 6.



2. Related work

This section discusses related work on the event photo retrieval problem.

Petkos, et al. (2017) proposed a novel approach for clustering multimodal data. It took into account an existing example clustering and utilized it in order to create a model that predicted whether a pair of items belonged to the same cluster or not. Then, the approach computed the predictions of this model for the pairs of items in the collection and organized them in a graph. This graph had a node for each item in the collection; an edge between two nodes indicated that the prediction of the model for the corresponding pair was positive.

Feng, et al. (2014) described a new approach for detecting the presence of events in social networks. Instead of looking for events in social networks, they sought out influential people who might be organizing events. Two greedy techniques and an improved technique were proposed to ensure that they found the best answer, if any. All proposed solutions used a graph representing the social network. Although this work did not include a detection of events, it had more than one solution for finding influential people. Thus, through these people, events could be found.

Grabovitch-Zuyev, et al. (2014) used the location and text published on a microblog to find correlations involving this information. They used two main parameters to correlate observed users. The first parameter considered the geographical distance by using a function that calculated the distance between two coordinate points (latitude and longitude), taking into account the curvature of the Earth and the sets of locations representing each user. The correlation of the first parameter was found by using the average difference between the points of the sets representing each user. The second parameter considered a textual similarity based on the repetition of keywords found in the TF-IDF equation (Jones, 1972). This work provided a correlation between text content and geographic area in which the user was present. Although this study confirmed the relation between text and place, it did not present an approach to detect social events.

Mansour, et al. (2017) addressed the idea of feature-centric social events. They examined several features of an event, like time and the name of the user that uploaded a photo, for instance, where the user selected at least one central feature. Based on the chosen feature (or set of features) this approach detected the corresponding feature-centric events.

Brenner and Izquierdo (2013) presented a framework based on the MediaEval 2013 social event detection challenge (SED). They pre-processed metadata to propagate location and used temporal and spatial clusters to set up a training set. The framework presented an F measure of 0.94, but the events considered were only in the “music concert” or “sports game” domains. Thus, detection did not find specific events or user-defined events.

Manchon-Vizuete, et al. (2014) performed a clustering of events through selected properties of photographs such as date and time; geographic location; and tags. The first procedure was to perform temporal clustering. As a consequence, two distinct events performed on the same day might be considered erroneously as the same event according to this approach.

Samangooei, et al. (2013) proposed an approach based on affinity matrices. They built matrices for each analyzed aspect — time, space and textual information. Then, a unified matrix was generated for each pair of photographs. The limitation of this work was the cost of building matrices. This study addressed event detection in social media streams based on the ideas expressed in a previous study by Hare, et al. (2015).

Nguyen, et al. (2013) used photo timestamps to generate clusters. Other photo metadata (e.g., geographic location and tags) were added. One limitation of this approach was that clustering was generated using just time.

Wistuba and Schmidt-Thieme (2013) used a combination of all available information in photographs (temporal, tags and spatial) to perform a clustering of events. They used a factorization machine to combine the advantages of polynomial regression with factorization models.

Sutanto, et al. (2014) prioritized the textual information of photographs for clustering events. The similarity of a photograph with a cluster was primarily based on text. Then, this similarity was combined with spatial and temporal distances to verify whether a photo belonged to a cluster. A disadvantage of using this approach was that events in different places with the same textual information (such as tags) were taken as the same event.

Shamsolmoali, et al. (2019) tried a combination of a convolutional neural network (CNN) and a residual network to reduce the dimensionality in a multimedia classification task. The CNN architecture acted as a feature extractor, and a deep neural network was used for the final classification.

Hong, et al. (2016) proposed a methodology to annotate photographs taking into account social circles of users. They initially detected social events to generate reliable photo tags, instead of using only those tags manually entered by users. The data was a subset of the ReSEED dataset. The authors considered an “album” as the basic unit for clustering and detection of events. The assignment of a photo to a given album considered just the time dimension. If a photo A had been taken more than 60 minutes from photo B, then the photos were placed in different albums. This classification generates a problem, as some events routinely last longer than an hour.

Zaharieva, et al. (2015a) proposed an unsupervised cascade approach to cluster events across multiple platforms, first considering the most reliable photo metadata (time and location) and then less reliable textual information. The results showed that using all available information was better than using each kind of information separately to cluster events. Nonetheless, this work was not able to detect events that were not registered in the database.

Ahmad, et al. (2017) relied on multiple instance learning (MIL), a modified form of supervised learning. They considered each photo album as a single bag and selected multiple images from each album for classification purposes. In their approach, event classification was achieved with a reduced number of training samples per class.

In this paper we address event detection from photo collections. Nonetheless, it is important to mention that there are several research projects that deal with the same theme in videos. There are several surveys in the area, among which we can highlight studies by Scherp and Mezaris (2014), Jiang, et al. (2012), Ballan, et al. (2010) and Yan, et al. (2010).

Some research (Hong, et al., 2016; Manchon-Vizuete, et al., 2014; Nguyen, et al., 2013; Samangooei, et al., 2013; Sutanto and Nayak, 2013; Wistuba and Schmidt-Thieme, 2013; Zaharieva, et al., 2015a; Zaharieva, et al., 2015b) used the same database as used in this study — the ReSEED dataset (Reuter, et al., 2014). Thus, in Section 5 a comparison is made between these studies and our proposal, considering precision and F measure metrics.

We propose the STEve-PR technique with two modules: one for clustering photos from the same event and another for automatic social event annotation propagation. The clustering module uses information provided in photo metadata to gather photos of the same event by considering the location and time associated with each photo. In turn, the automatic annotation technique enables photos with or without location to be correlated to the associated social event. In addition, another contribution of our proposed approach is a clustering module that is independent of external social networks. Moreover, our automatic annotation technique considers three types of events: current, user-defined and unknown.

Table 1 presents a comparative study by analyzing the following characteristics:

C1. Detection of specific events: when the detection points to an entity representing a specific event. For example, to detect “Rock in Rio” rather than “rock concert”;
C2. User-defined event detection: when the work supports the detection of events defined by system users who were not registered in the database. For example, to detect an event “Trip to Vegas” previously created by a random user;
C3. Detection of events not registered in the database: when the work can detect events that are not registered in the database, although it is not necessary to identify the name of the event. For example, if any user uploads a new set of photos, the approach must be able to detect if that set of photos represents a new event not yet registered in the database;
C4. Independence of social networks: when the work does not depend on any social network. An event annotation technique can use social networks to improve its detection but its operation does not depend on the social network. For example, if a user does not have any social network account, the approach must be able to detect the events related to her photos;
C5. Event content clustering: When the work performs event content clustering. For example, to gather photos related to the “Rock in Rio” event.


Table 1: Related work comparison.

Work                              C1    C2    C3    C4    C5
Petkos, et al., 2017              No    No    Yes   Yes   Yes
Feng, et al., 2014                No    No    No    Yes   Yes
Grabovitch-Zuyev, et al., 2014    No    Yes   No    No    No
Mansour, et al., 2017             Yes   No    Yes   Yes   Yes
Brenner and Izquierdo, 2013       No    Yes   No    No    Yes
Manchon-Vizuete, et al., 2014     No    No    Yes   No    Yes
Hare, et al., 2015                No    Yes   No    Yes   Yes
Nguyen, et al., 2013              No    Yes   No    Yes   Yes
Wistuba and Schmidt-Thieme, 2013  No    Yes   No    Yes   Yes
Sutanto, et al., 2014             No    Yes   No    Yes   Yes
Samangooei, et al., 2013          No    Yes   No    Yes   Yes
Hong, et al., 2016                No    No    Yes   Yes   Yes
Shamsolmoali, et al., 2019        No    No    Yes   Yes   No
Zaharieva, et al., 2015a          Yes   Yes   No    Yes   Yes
Ahmad, et al., 2017               No    No    Yes   Yes   No
STEve-PR (this study)             Yes   Yes   Yes   Yes   Yes


From Table 1, one can notice that most research has not accomplished C1, focusing instead on abstract events like “rock concert”. Considering C2, most earlier studies detected user-defined events; only six did not fulfil this feature. Most works did not detect events not registered in the database (C3). However, most earlier works were independent of social networks and clustered event-related content. Only two studies (Zaharieva, et al., 2015a; Mansour, et al., 2017) complied with four of the five mentioned features. Yet, six studies (Brenner and Izquierdo, 2013; Feng, et al., 2014; Grabovitch-Zuyev, et al., 2014; Manchon-Vizuete, et al., 2014; Shamsolmoali, et al., 2019; Ahmad, et al., 2017) fulfilled fewer than three of the mentioned features. STEve-PR was the only one to address all five characteristics.



3. STEve-PR proposal

Figure 1 presents a BPMN flowchart of event annotation propagation. First, STEve-PR creates clusters and searches for photos with location that do not have an associated event. Each photo with location is annotated with the predominant event of the cluster that contains it. Then, STEve-PR finds the photos without location that have at least one tag in common with a spatial cluster of the temporal cluster to which they belong. Finally, each photo without location that was found is annotated with the predominant event of the spatial cluster with which it has the highest TF-IDF similarity.


Figure 1: STEve-PR flow chart.


In this work, event annotation occurs through existing annotation propagation. Thus, clustering is performed first, and then annotation propagation is performed between photos belonging to the same cluster. The focus of this work is on event annotation and not on photo clustering, so clustering is focused on minimizing the number of different events present in the same cluster.

The STEve-PR algorithm does not need a training set for photo clustering. The model used by STEve-PR considers spatial and temporal information to create clusters using the DBSCAN clustering algorithm (Ester, et al., 1996). Our approach creates temporal and spatial clusters separately. Spatial information is necessary for STEve-PR to perform a complete clustering. For example, if a photo does not have a location, it will only be associated with a temporal cluster created by the algorithm.

DBSCAN requires two parameters: epsilon (the maximum distance between two objects for them to be considered neighbours) and the minimum number of objects within each cluster. In our approach, the minimum number of photos per cluster is one, so an isolated photo can be considered a cluster. The temporal and spatial values of the epsilon parameter are detailed in Section 4. We chose this algorithm because it handles outliers, the number of clusters is an output of the algorithm rather than an input, and clusters are not necessarily circular.

Considering spatial and temporal dimensions, clustering can be performed in the following ways:

  1. Both dimensions;
  2. Performing the temporal dimension first and for each temporal cluster, a different spatial clustering is performed;
  3. Performing the spatial dimension first and for each spatial cluster, a different temporal clustering is performed.

Because photos always contain temporal information, we decided to perform temporal clustering first. Then, spatial clustering was performed separately for each temporal cluster. Thus, photos without a location were related to a temporal cluster only.

Temporal clustering performs a temporal segmentation of the set of photos P, using a time tmax (in minutes) as the epsilon parameter and one as the minimum points parameter. Considering tmax, we separate P into k clusters. Thus, each cluster gj is a distinct subset of P, such that:


Equation 1


The clusters have photos with temporal distances smaller than or equal to tmax minutes. By considering consecutive photos, they may be defined as follows:


Equation 2



Equation 4
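
The temporal segmentation described above can be sketched in a few lines. This is only an illustration, not the STEve-PR implementation: it relies on the observation that, with a minimum of one point per cluster and a single time axis, DBSCAN amounts to splitting the sorted timestamps wherever the gap between consecutive photos exceeds tmax. The function name `temporal_clusters` and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def temporal_clusters(timestamps, t_max_minutes):
    """Split timestamps into runs whose consecutive gaps are <= t_max_minutes."""
    if not timestamps:
        return []
    ordered = sorted(timestamps)
    limit = timedelta(minutes=t_max_minutes)
    clusters = [[ordered[0]]]
    for previous, current in zip(ordered, ordered[1:]):
        if current - previous <= limit:
            # Close enough to the previous photo: same temporal cluster.
            clusters[-1].append(current)
        else:
            # Gap larger than t_max: start a new temporal cluster.
            clusters.append([current])
    return clusters


start = datetime(2014, 1, 1, 20, 0)
# Seven photos taken every 20 minutes (a two-hour span), then one photo hours later.
photos = [start + timedelta(minutes=20 * i) for i in range(7)]
photos.append(start + timedelta(hours=5))
groups = temporal_clusters(photos, t_max_minutes=30)
```

In this example the seven photos spread over two hours end up in one cluster, because each one is within 30 minutes of its predecessor, while the late photo forms a cluster of its own.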


DBSCAN spatial clustering was performed in a similar way to temporal clustering, but the distance between photos was measured as the geodesic distance between the geographical coordinates of the photos. The spatial clustering was performed after the temporal clustering, separating each temporal cluster gj into spatial clusters. If temporal clustering created 10 clusters, spatial clustering was performed 10 times, once for each temporal cluster. Photos in different temporal clusters were in different spatial clusters.

Let gj be a temporal cluster; spatial clustering separates gj into m clusters. Thus, each cluster ci is a distinct subset of gj such that:


Equation 5


Spatial clustering performs a spatial segmentation of the set of photos, using a distance smax (in meters) as the epsilon parameter and one as the minimum points parameter. The spatial clusters have photos with spatial distance smaller than or equal to smax meters. Let geodesicDist(p1, p2) be a function that calculates the geodesic distance between two photos from their geographical coordinates; the spatial cluster may then be defined as follows:


Equation 6
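
A minimal sketch of the spatial step, under stated assumptions: the haversine great-circle formula stands in for geodesicDist (the article does not say which geodesic formula was used), and, with a minimum of one point per cluster, each spatial cluster is taken to be a connected group of photos linked by distances of at most smax. The names `haversine_m` and `spatial_clusters` are hypothetical.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius in meters; an assumption of this sketch


def haversine_m(p1, p2):
    """Great-circle distance in meters between two (lat, lon) pairs in degrees."""
    lat1, lon1 = map(math.radians, p1)
    lat2, lon2 = map(math.radians, p2)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def spatial_clusters(points, s_max_m):
    """Group point indices into connected components of the s_max_m neighbour graph."""
    unvisited = set(range(len(points)))
    clusters = []
    while unvisited:
        seed = unvisited.pop()
        component, frontier = [seed], [seed]
        while frontier:
            i = frontier.pop()
            near = [j for j in unvisited
                    if haversine_m(points[i], points[j]) <= s_max_m]
            for j in near:
                unvisited.remove(j)
            component.extend(near)
            frontier.extend(near)
        clusters.append(sorted(component))
    return sorted(clusters)


# Three photos roughly 11 m apart along the equator, plus one far away.
points = [(0.0, 0.0), (0.0, 0.0001), (0.0, 0.0002), (1.0, 1.0)]
result = spatial_clusters(points, s_max_m=20)
```

In STEve-PR this step would be run once per temporal cluster, so `points` would hold only the located photos of a single temporal cluster.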


With all photo spatial clusters created, STEve-PR performed the event annotation propagation. The identification of the event was geared towards associating a particular photo to a specific event. For event annotation propagation, STEve-PR used a training set composed of event annotations. These annotations were used in conjunction with the clusters created previously. The training set was used to identify the predominant event in each cluster. The predominant event of a cluster was the one that had the greatest number of photos within the cluster in question.

Let SC be the set of spatial clusters created by the clustering algorithm, and let ci be a cluster of SC. The predominant event of ci is the event eci, which has the largest number of photos in ci. Let p be a photo with location that has no associated event. The event to be recorded in p is defined by the PredEvent function described below:


Equation 7
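
The predominant-event rule can be illustrated with a short sketch. It assumes that each cluster is represented as a list of event labels, with `None` marking photos that have no annotation yet; the names `pred_event` and `propagate` are hypothetical, and the tie-breaking behaviour is an assumption, since the text does not say how ties are resolved.

```python
from collections import Counter


def pred_event(cluster_events):
    """Return the most frequent known event label in a cluster, or None if there is none."""
    counts = Counter(e for e in cluster_events if e is not None)
    if not counts:
        return None
    # On ties, Counter.most_common keeps the label that was seen first (an assumption here).
    return counts.most_common(1)[0][0]


def propagate(cluster_events):
    """Fill unannotated photos with the predominant event of their cluster."""
    predominant = pred_event(cluster_events)
    return [e if e is not None else predominant for e in cluster_events]


cluster = ["Rock in Rio", None, "Rock in Rio", "Other", None]
annotated = propagate(cluster)
```

Existing annotations are left untouched; only the photos without an event receive the predominant label.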


The STEve-PR algorithm is intended for use with photos with geographical information. However, let p2 be a photo without a location, let TC be the set of temporal clusters from which the spatial clusters were created, let gi be a temporal cluster belonging to TC and let cik be a spatial cluster belonging to SC and created from gi. If p2 falls into the time slot gi and has at least one tag in common with the photos in gi, the STEve-PR algorithm finds the spatial cluster cik associated with p2 and annotates p2 with the predominant event of cik. Algorithm 1 describes the STEve-PR event propagation.


Algorithm 1: STEve-PR event annotation.


As presented in Algorithm 1, STEve-PR separated all photos previously annotated in line 1 and clustered these photos into temporal and spatial clusters in lines 3 to 5. After clustering, in line 6 the algorithm ran through all photos without event annotation. In line 7 the algorithm found the temporal cluster corresponding to the photo without location, considering its time and date. In line 8 it retrieved all spatial clusters created from the temporal cluster identified in line 7. Lines 9 to 13 found the correct spatial cluster corresponding to photos without annotation, considering location or mutual tags. In line 14 the predominant event of the spatial cluster found was detected, and it was propagated to the given photo in line 15.

To find the cik cluster, STEve-PR calculated the similarity of a photo with all cik clusters created from the temporal cluster gi related to the photo. The similarity was calculated using a TF-IDF function based on textual tags (Jones, 1972). TF was calculated using the frequency of each tag in each cluster, and IDF was derived from the number of clusters that contain the tag. This is similar to the traditional TF-IDF, wherein a tag is a term and a document is a spatial cluster created by STEve-PR. Hence, the cluster chosen was the one with the highest TF-IDF score in relation to the photo.
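
A minimal sketch of this tag-based cluster selection, assuming each spatial cluster is represented by the list of tags of its photos. The exact TF-IDF weighting used by STEve-PR is not given in the text, so the relative term frequency and the smoothed IDF, log(1 + N/df), used here are assumptions, as is the function name `best_cluster_by_tfidf`.

```python
import math
from collections import Counter


def best_cluster_by_tfidf(photo_tags, cluster_tags):
    """Return the id of the cluster with the highest TF-IDF score for the photo's tags.

    cluster_tags maps a cluster id to the list of tags of all photos in that cluster.
    Returns None when the photo shares no tag with any cluster.
    """
    n_clusters = len(cluster_tags)
    # Document frequency: number of clusters in which each tag appears.
    df = Counter(tag for tags in cluster_tags.values() for tag in set(tags))
    best_id, best_score = None, 0.0
    for cluster_id, tags in cluster_tags.items():
        tf = Counter(tags)
        score = sum(
            (tf[t] / len(tags)) * math.log(1 + n_clusters / df[t])
            for t in set(photo_tags)
            if tf[t] > 0
        )
        if score > best_score:
            best_id, best_score = cluster_id, score
    return best_id


clusters = {
    "a": ["rock", "rio", "music", "rock"],
    "b": ["beach", "rio", "sun"],
}
chosen = best_cluster_by_tfidf(["rock", "rio"], clusters)
```

Here the tag "rio" appears in both clusters and therefore carries less weight than "rock", which appears only in cluster "a".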



4. Validation

This section sets out the method used to collect data, characteristics of the photo collection used to test the techniques, metrics adopted to analyze the results achieved and how the metrics were validated to generate results.

The validation of event annotation propagation was performed individually, since the techniques found in the state of the art could not be replicated faithfully owing to the lack of technical configuration information. We used precision, recall and F measure metrics to validate our proposed technique.

In this study, we used a public database to extract photos used in the tests of the proposed algorithms. This database was proposed by Reuter, et al. (2014). It was taken from Flickr and consists of photos that have annotation of their respective events. Because the clustering algorithm uses location to create clusters, only the photos of the events with at least one photo with a location were considered.

Thus, the resulting filtered database has 6,650 events and 131,551 photos. The photos have textual tags, date and time of capture and geographical location as latitude and longitude coordinates.

4.1. Clustering validation

The validation of photo clustering regarding the events was tested based on the number of photos belonging to different events within the same cluster. Because the purpose of clustering in this work was to propagate event annotation between cluster photos, the smaller the number of photos of different events, the better.

Let G be the set of clusters created by the STEve-PR algorithm and let gi be any cluster of G. The predominant event of gi is the event with the largest number of photos in gi. Let difPhotosQty(x) be a function that returns the number of photos in a cluster x that do not belong to its predominant event. The metrics used to evaluate the STEve-PR cluster settings were precision, recall and a new metric, named PGr. PGr reflects the number of photos with different events in all clusters. PGr is defined as:


Equation 8


Our approach to calculating precision and recall was based on Figueirêdo, et al. (2012). The database used contains event annotations for all photos. These annotations were used to test the spatio-temporal clustering of event photos. In our evaluation, precision is the percentage of photos of a given cluster that are correctly classified. Recall is the percentage of photos of an event that are classified in a single cluster. For STEve-PR, having more than one cluster representing the same event is not a problem, since each of those clusters still generates a correct event annotation. Because recall rewards gathering all of an event's photos in a single cluster, precision and PGr were the more important metrics in our work.

To calculate precision and recall, it was necessary to identify the predominant event of a cluster. In order to do that, we considered the event with the highest number of photos in a cluster. The precision and recall parameters were calculated according to Equations 9 and 10, respectively:


Equation 9



  • p(e,c): the number of photos correctly classified from event e in cluster c;
  • fp(e,c): the number of photos classified in cluster c that do not belong to event e;
  • fn(e,c): the number of photos from event e not classified into cluster c.
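
The quantities listed above can be combined as in the following sketch. It assumes that Equations 9 and 10 follow the usual definitions, precision = p / (p + fp) and recall = p / (p + fn); the function name `cluster_precision_recall` and the example counts are hypothetical.

```python
def cluster_precision_recall(cluster_events, event, event_total):
    """Precision and recall of one cluster with respect to its predominant event.

    cluster_events: event label of every photo in the cluster.
    event: the predominant event e of the cluster.
    event_total: number of photos of event e in the whole collection.
    """
    p = sum(1 for e in cluster_events if e == event)  # p(e,c)
    fp = len(cluster_events) - p                      # fp(e,c)
    fn = event_total - p                              # fn(e,c)
    precision = p / (p + fp) if (p + fp) else 0.0
    recall = p / (p + fn) if (p + fn) else 0.0
    return precision, recall


# A cluster with three photos of event "A" and one of event "B";
# event "A" has six photos in the whole collection.
precision, recall = cluster_precision_recall(["A", "A", "A", "B"], "A", 6)
```

In this example precision is 0.75 and recall is 0.5: the cluster is mostly pure, but it holds only half of the photos of event "A".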

To find the best clustering configuration, the following values were used for temporal clustering: 30 minutes, 1 hour, 2 hours, 3 hours, 4 hours, 8 hours, 16 hours, 1 day, 3 days and 1 week. For spatial clustering, the following values were considered: 20 meters, 50 meters, 100 meters, 500 meters, 1 kilometer and 2 kilometers. The values for the temporal and spatial clustering were selected to cover the temporal and spatial settings considered in earlier studies.

Algorithm 2 presents the result collection steps. For each of the 30 experiments conducted, a random portion of 30 percent of the photos was selected for clustering, and the values of PGr, precision and recall were checked. After the random portion had been chosen, clustering was tested for each combination of the time intervals and spatial distances.


Algorithm 2: Photo clustering validation.
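
The experiment plan described for the clustering validation can be sketched as follows. The temporal and spatial values, the 30 replicas and the 30 percent sample come from the text; the generator name `experiment_plan`, the fixed random seed and the use of photo identifiers are assumptions made for illustration.

```python
import itertools
import random

# Values listed in the text, converted to minutes and meters.
TEMPORAL_MINUTES = [30, 60, 120, 180, 240, 480, 960, 1440, 4320, 10080]
SPATIAL_METERS = [20, 50, 100, 500, 1000, 2000]


def experiment_plan(photo_ids, replicas=30, fraction=0.30, seed=0):
    """Yield (replica, t_max, s_max, sample) for every configuration of every replica."""
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    sample_size = round(len(photo_ids) * fraction)
    for replica in range(replicas):
        # One random 30 percent sample per replica, shared by all configurations.
        sample = rng.sample(photo_ids, sample_size)
        for t_max, s_max in itertools.product(TEMPORAL_MINUTES, SPATIAL_METERS):
            yield replica, t_max, s_max, sample


plan = list(experiment_plan(list(range(100))))
```

With 10 temporal and 6 spatial values there are 60 configurations per replica, so 30 replicas give 1,800 clustering runs in total.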


After results collection, we obtained the PGr, precision and recall values to identify the best setting to use in spatial and temporal clustering. These results were analysed with the appropriate statistical test, either the t-test or the Wilcoxon test depending on the normality of the data (Boslaugh and Watters, 2008), considering a significance level α = 5 percent. Thus, the tests guaranteed 95 percent confidence in the results, so we could draw conclusions for the population and for the sample in the experiment.

4.2. Event annotation propagation validation

The validation of automatic event annotation propagation used several metrics from literature as a basis. The metrics used were precision, recall and F measure, wherein each measure was based on three values:

  • T = Number of correct propagations;
  • F = Number of incorrect propagations;
  • N = Number of photos that did not receive an event annotation.

The metrics are calculated as follows:


Equation 11


All metrics have individual specificities, and they must be analyzed separately and together. The recall metric indicates the proportion of photos that received a correct event annotation, considering all of the photos in a set.

For some photos, the propagation algorithm could not find an event. The precision metric does not consider those photos that did not receive an event annotation and shows how correct the propagation of a particular algorithm is. The F measure is used to analyze recall and precision in a single metric (their harmonic mean).
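
The three metrics can be computed from T, F and N as in the sketch below. The formulas are an assumption consistent with the description above: precision ignores photos that received no annotation, recall considers all photos, and the F measure is their harmonic mean. The function name `propagation_metrics` and the example counts are hypothetical.

```python
def propagation_metrics(t, f, n):
    """Precision, recall and F measure from propagation counts.

    t: number of correct propagations (T).
    f: number of incorrect propagations (F).
    n: number of photos that did not receive an event annotation (N).
    """
    precision = t / (t + f) if (t + f) else 0.0
    recall = t / (t + f + n) if (t + f + n) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# 80 correct and 10 incorrect propagations, 10 photos left without annotation.
precision, recall, f_measure = propagation_metrics(80, 10, 10)
```

In this example recall (0.8) is lower than precision (about 0.889) because the ten photos that received no annotation count against recall only.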

The event annotation propagation algorithm needed a training set consisting of event annotations in photos. Therefore, the selection of photos that are part of the training set was crucial so that the representativeness of the data was maintained. To ensure that the training set was impartial, the photos that composed it were chosen randomly.

The experiments were conducted through cross-validation, considering different numbers of folds to verify the algorithm’s consistency. Each configuration for the number of folds has 30 replicas. Algorithm 3 shows the dynamics of how the test experiments were performed.


Algorithm 3: Event propagation validation.


First, the algorithm created photo clusters. Then, for each replica, the training set was randomly selected, and the annotation propagation was performed for the other photos. Because the original database contains event annotations for all photos, after all of the suggestions by STEve-PR had been saved, each suggested event was compared to the original event.

A propagation was considered correct when the suggested event was equal to the current event. For each replica, the test set was defined by the getTestPhotos() method, which returned the identifiers of the photos that were not part of the training set.

With the test photo identifiers, the event annotations were removed temporarily. After that, the STEve-PR algorithm analysed all of the photos that did not have events and performed event propagation. After saving all of the results, the original annotations were restored for the next replica.



5. Results

This section presents the results for the STEve-PR technique. First, we present results for the clustering algorithm followed by the event annotation propagation results related to observed metrics, and then we discuss both results. The graphics include results obtained by the statistical tests. The way the photos were clustered by STEve-PR influenced the event propagation. Therefore, finding the appropriate setting for photo clustering of the same event was crucial for the annotation spreads occurred properly.

In this work, we used the ReSEED dataset (Reuter, et al., 2014), which contains 437,370 Flickr images assigned to 21,169 events in total. The events are heterogeneous regarding type and length, e.g., it includes festivals which lasted for many days as well as protest marches only a few hours of duration.

Clustering techniques are not recent, and research using newer techniques, such as neural networks, reports excellent results. Ahmad, et al. (2018) used a CNN to perform event detection and obtained an accuracy of about 98.8 percent. Shamsolmoali, et al. (2019) also used a CNN to perform multimedia classification and obtained a precision of 93.9 percent.

Like our proposal, Mansour, et al. (2017) used a clustering technique to detect events and obtained an F measure of 97 percent, demonstrating that clustering techniques are still promising in this area. Figure 2 shows the clustering configuration results of STEve-PR, considering PGr. The vertical axis indicates the percentage of replicas in which each configuration had the best result, and the horizontal axis represents the clustering configurations.
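The quantity plotted on the vertical axis of Figure 2 can be computed along the following lines. The sketch assumes that each replica yields one score per configuration (for example PGr); the configuration labels and the scores are hypothetical and serve only to show the calculation.

```python
from collections import Counter


def best_configuration_share(scores_per_replica):
    """Percentage of replicas in which each configuration scored best."""
    wins = Counter(max(scores, key=scores.get) for scores in scores_per_replica)
    total = len(scores_per_replica)
    return {config: 100.0 * n / total for config, n in wins.items()}


# Hypothetical scores for two configurations over four replicas.
replicas = [
    {"30min/20m": 0.95, "60min/50m": 0.90},
    {"30min/20m": 0.93, "60min/50m": 0.94},
    {"30min/20m": 0.96, "60min/50m": 0.91},
    {"30min/20m": 0.97, "60min/50m": 0.92},
]
share = best_configuration_share(replicas)
# share == {"30min/20m": 75.0, "60min/50m": 25.0}
```

A configuration that never achieves the best score in any replica does not appear in the result, which matches the way such settings are omitted from the charts.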


Figure 2: Best configurations for clustering.


Figures 3 and 4 present the clustering configuration results of STEve-PR. The vertical axis indicates the values of the precision and recall metrics, respectively, and the horizontal axis represents the clustering configurations.


Figure 3: Clustering configurations precision.


The settings that are not listed in Figures 3 and 4 did not achieve the best results in any replica. The setting with 30-minute temporal variations and 20-meter spatial variations was the most suitable for clustering photos of the same event, because it minimized the number of photos from different events in the same cluster (highest precision and PGr).


Figure 4: Clustering configurations recall.


It is important to mention that the 20-meter and 30-minute clustering parameters do not mean that an event lasting, say, two hours will be categorized as four different events. The 30-minute parameter means that photos belonging to the same cluster will have a temporal distance less than or equal to 30 minutes. The same reasoning applies to the 20-meter parameter.
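The point can be illustrated with a small sketch. It assumes one possible reading of the parameter, in which the 30-minute threshold applies between neighbouring photos in time, so that a cluster keeps growing while consecutive photos stay within the threshold. The function name and the timestamps are hypothetical and this is not the STEve-PR clustering algorithm itself.

```python
from datetime import datetime, timedelta


def temporal_clusters(timestamps, max_gap=timedelta(minutes=30)):
    """Group timestamps, starting a new cluster when the gap exceeds max_gap."""
    clusters = []
    for t in sorted(timestamps):
        if clusters and t - clusters[-1][-1] <= max_gap:
            clusters[-1].append(t)  # close enough to the previous photo
        else:
            clusters.append([t])    # gap too large, start a new cluster
    return clusters


start = datetime(2019, 1, 1, 18, 0)
# A two-hour event photographed every 20 minutes (7 photos) ...
event = [start + timedelta(minutes=20 * i) for i in range(7)]
# ... plus a single photo taken three hours after the event ended.
later = [start + timedelta(hours=5)]
groups = temporal_clusters(event + later)
# len(groups) == 2: the whole two-hour event stays in one cluster.
```

Under this reading, the two-hour event is kept in a single cluster because no two consecutive photos are more than 30 minutes apart.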

Once the clustering settings had been selected, the event annotation propagation algorithm could be tested. Figures 5, 6, 7, 8 and 9 show the results of STEve-PR for the clustering configuration of 30-minute temporal variations and 20-meter spatial variations.


Figure 5: Precision for annotation propagation with geotagged photos.


Figure 5 presents the precision of STEve-PR in event annotation propagation for photos with a geographical location. The results are plotted against the size of the training set used in event annotation propagation. Because the variation in precision was low (less than one percent when the training set was reduced to 10 percent), it can be concluded that precision was stable and quite high. The line in the graph indicates a polynomial regression of the results.


Figure 6: Recall for annotation propagation with geotagged photos.


Figure 6 presents the recall of STEve-PR in event annotation propagation for photos with a geographical location. As with precision, the results are plotted against the size of the training set. It is natural that recall decreases as the training set shrinks, but a considerable decrease appears only for a 10 percent training set.

Despite the considerable decrease with a 10 percent training set, the achieved value was still high. Thus, STEve-PR is applicable to photo collections with few annotations. The line in the graph also represents a polynomial regression of the results.


Figure 7: F measure for annotation propagation with geotagged photos.


Figure 7 shows the combination of the precision and recall results presented in Figures 5 and 6, respectively. This combination (F measure) allows the two metrics to be analysed together. As with precision and recall, the F measure results are plotted against the size of the training set.

Because of the high precision and recall levels, the F measure was also high. Even with a small 10 percent training set, the results were still close to one. As in the other graphs, the line indicates a polynomial regression of the results.
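For reference, the three metrics can be computed from simple counts as in the sketch below. It assumes the balanced form of the F measure (the harmonic mean of precision and recall, often written F1); the function name and the counts are hypothetical and are not taken from the experiments.

```python
def precision_recall_f(correct, suggested, relevant):
    """Precision, recall and balanced F measure from raw counts.

    correct   -- propagated annotations that match the original event
    suggested -- all annotations propagated by the method
    relevant  -- all test photos that should have received an annotation
    """
    precision = correct / suggested if suggested else 0.0
    recall = correct / relevant if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical counts: 100 annotations propagated, 90 of them correct,
# and 120 test photos that should have been annotated.
p, r, f = precision_recall_f(correct=90, suggested=100, relevant=120)
# p == 0.9, r == 0.75, f is approximately 0.818
```

Because the harmonic mean is pulled towards the lower of its two inputs, a high F measure requires both precision and recall to be high.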

As mentioned earlier, some studies (Hong, et al., 2016; Manchon-Vizuete, et al., 2014; Nguyen, et al., 2013; Samangooei, et al., 2013; Sutanto and Nayak, 2013; Wistuba and Schmidt-Thieme, 2013; Zaharieva, et al., 2015a; Mansour, et al., 2017; Shamsolmoali, et al., 2019) proposed solutions for event annotation in photo collections using the same database as used in this work. We therefore analysed the results of these works and compared them to ours. To compare the approaches, we used the same training and test setup proposed in the literature: 70 percent for the training set and 30 percent for the test set.


Figure 8: Precision comparison.


Figure 8 presents a comparison of the precision metric for related works and our approach. The precision achieved in this work exceeds the precision of earlier studies.

We could not analyse the recall metric in the same way, as few studies reported values for it. Figure 9 presents the values of the F measure for related studies and our approach. Again, our approach outperforms earlier work on this metric.


Figure 9: F measure comparison.


For the metrics observed, STEve-PR obtained good results in event annotation propagation for photos with geographical locations. For photos without a geographical location, STEve-PR propagates annotations through tag similarity.

Concerning photos without a location, we considered all previous propagations, so the propagation through tag similarity with spatial clusters was independent of the training set. Consequently, the propagation would only produce a different outcome if the predominant event of the spatial cluster differed from the event of the photo without a location. The results are reported as the average over the replicas because they did not vary significantly.

The metrics were analysed by considering the following sets:

  1. Total: all photos without a location that have related events in the database;
  2. Has_any_tag: the subset of photos that have at least one tag in common with some photo with a location;
  3. Has_mutual_tag: the subset of photos that have at least one tag in common with at least one spatial cluster created from the temporal cluster to which the photo belongs.
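The role of shared tags can be illustrated with a small TF-IDF sketch. It treats the tags of each spatial cluster as a document, weights tags by TF-IDF, and picks the cluster with the highest cosine similarity to the photo's tags. The smoothed IDF formula, the use of cosine similarity, the function names and the example tags are assumptions made for illustration; the exact weighting used by STEve-PR may differ.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """TF-IDF weights for a list of tag lists (one list per spatial cluster)."""
    n = len(documents)
    df = Counter(tag for doc in documents for tag in set(doc))
    # Smoothed IDF (an assumption) so tags present in every cluster keep weight.
    idf = {tag: math.log(n / df[tag]) + 1.0 for tag in df}
    vectors = [{t: c * idf[t] for t, c in Counter(doc).items()}
               for doc in documents]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse tag-weight dictionaries."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def most_similar_cluster(photo_tags, cluster_tags):
    """Index and score of the most similar cluster, or (None, 0.0) if none."""
    if not cluster_tags:
        return None, 0.0
    vectors, idf = tfidf_vectors(cluster_tags)
    # Tags unknown to every cluster cannot contribute to the similarity.
    photo = {t: c * idf[t] for t, c in Counter(photo_tags).items() if t in idf}
    scores = [cosine(photo, v) for v in vectors]
    best = max(range(len(scores)), key=scores.__getitem__)
    return (best, scores[best]) if scores[best] > 0 else (None, 0.0)


# Hypothetical spatial clusters from the same temporal cluster.
cluster_tags = [["concert", "rock", "stage"], ["marathon", "run", "city"]]
with_mutual_tag = most_similar_cluster(["rock", "concert", "friends"], cluster_tags)
without_mutual_tag = most_similar_cluster(["beach"], cluster_tags)
```

In this sketch a photo that shares no tag with any spatial cluster receives no suggestion at all, which is why membership in the Has_mutual_tag set matters for recall.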

As precision takes into account only the automatic annotations made and not the entire set, the average value was the same for the three test scenarios analysed. The mean precision was 88 percent. Recall, as shown in Figure 10, varied considerably across the scenarios.
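The reason precision stays the same while recall changes can be seen with a short calculation. The counts below are hypothetical and were chosen only to show the pattern: the numbers of suggested and correct annotations are fixed, and only the size of the set against which recall is measured changes.

```python
def precision(correct, suggested):
    """Share of the propagated annotations that are correct."""
    return correct / suggested


def recall(correct, set_size):
    """Share of the photos in the evaluated set that were annotated correctly."""
    return correct / set_size


# Hypothetical counts: the same 100 suggestions (88 correct) in every scenario.
correct, suggested = 88, 100
# Hypothetical sizes of the three nested evaluation sets.
set_sizes = {"Total": 200, "Has_any_tag": 180, "Has_mutual_tag": 98}

results = {
    name: (precision(correct, suggested), recall(correct, size))
    for name, size in set_sizes.items()
}
# Precision is 0.88 in every scenario; recall grows as the set shrinks.
```

Since each narrower set discards photos that could not have been annotated through tags, the denominator of recall shrinks while the numerator stays the same.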


Figure 10: Recall for annotation propagation with no location photos.


For the total test set, recall was 44 percent. For the Has_any_tag set, the result increased slightly to approximately 49 percent. For the Has_mutual_tag set, recall increased considerably, rising to nearly 90 percent. Thus, event propagation by tags only occurred correctly when the photo had at least one tag in common with the candidate spatial cluster.

The analysis of the F measure (see Figure 11) yielded the best results for the Has_mutual_tag set but showed no significant improvement for the other sets. Thus, STEve-PR can be used to propagate event annotations to photos without a location. Annotations will have a good chance of being correct (88 percent precision). Recall for photos without a location will not be as high because it depends on the photo sharing at least one tag with a spatial cluster from its temporal cluster.

Nonetheless, we detected some patterns in wrong annotations. Most incorrect annotations were related to the clustering technique. If an event has only one related photo, STEve-PR may not annotate the event in new photos related to that event. If the clustering splits an event into several clusters, this event may not be the predominant event in some of them. If that happens, the event may not be annotated in a new photo.



6. Conclusions and future work

In this paper, we presented an approach to improve event-related photo retrieval. The proposed STEve-PR technique consists of a combination of existing clustering algorithms to achieve event annotation propagation. Clustering was evaluated by the number of photos that were not related to the predominant event in a cluster, and event annotation propagation was evaluated with precision, recall and F-measure metrics.

The analysis of the clustering algorithm suggests that photo clustering should be performed by considering a temporal variation of 30 minutes and a spatial variation of 20 meters, because this configuration presented the best precision and accuracy. By choosing the configuration with the best precision and accuracy, we aimed to minimize the number of photos that are not related to the predominant event in a cluster.

The event annotation propagation analysis suggests that event annotation propagation should use existing clustering techniques, instead of new algorithms. With STEve-PR, event-related photo retrieval is accomplished by gathering all photos annotated with a desired event.

These results make evident the importance of geographical location information in photo management systems. For photos without a location, event annotation propagation was precise (88 percent precision). In addition, it was clear that events were related to the questions “Where?” and “When?”. Furthermore, the results demonstrated that existing clustering algorithms, when used correctly, can outperform state-of-the-art techniques.

Future work may propose a technique to identify different clusters that belong to the same event. As our work does not find the clustering parameters automatically, future work can also focus on a methodology to find the clustering parameters using machine learning algorithms. Also, we suggest the use of other features to perform event detection, e.g., people and recurring objects in photographs.

In future work, we can also enhance the similarity function between spatial clusters and photos without a location through tag preprocessing. Stop word removal, stemming and lemmatization are examples of natural language processing techniques that can be used to improve the results of this similarity function. Furthermore, we intend to investigate ethical implications and limits on social event photo annotation.


About the authors

Davi Oliveira Serrano de Andrade is an M.Sc. student in computer science at the University of Campina Grande, Brazil.
E-mail: davi [dot] o [dot] serrano [at] gmail [dot] com

Anderson Almeida Firmino is a Ph.D. candidate in computer science at the University of Campina Grande, Brazil.
E-mail: andersonalmeida [at] copin [dot] ufcg [dot] edu [dot] br

Cláudio de Souza Baptista is Full Professor and Coordinator of the Information Systems Laboratory at the University of Campina Grande, Brazil.
E-mail: baptista [at] dsc [dot] ufcg [dot] edu [dot] br

Hugo Feitosa de Figueirêdo is Associate Professor at the Federal Institute of Paraíba, Brazil.
E-mail: hugo [dot] figueiredo [at] ifpb [dot] edu [dot] br



This research was partially supported by the National Council for Scientific and Technological Development (Conselho Nacional de Desenvolvimento Científico e Tecnológico or CNPq).







K. Ahmad, N. Conci, G. Boato and F.G.B. de Natale, 2017. “Event recognition in personal photo collections via multiple instance learning-based classification of multiple images,” Journal of Electronic Imaging, volume 26, number 6, 060502 (5 December).
doi:, accessed 25 November 2019.

K. Ahmad, M.L. Mekhalfi, N. Conci, F. Melgani and F. de Natale, 2018. “Ensemble of deep models for event recognition,” ACM Transactions on Multimedia Computing, Communications, and Applications, volume 14, number 2, article number 51.
doi:, accessed 25 November 2019.

L. Ballan, M. Bertini, A. Del Bimbo, L. Seidenari and G. Serra, 2010. “Event detection and recognition for semantic annotation of video,” Multimedia Tools and Applications, volume 51, number 1, pp. 279–302.
doi:, accessed 25 November 2019.

S. Boslaugh and P.A. Watters, 2008. Statistics in a nutshell. Farnham: O’Reilly.

M. Brenner and E. Izquierdo, 2013. “Social event detection, retrieval and classification in collaborative photo collections,” MediaEval 2013 Workshop, at, accessed 25 November 2019.

N. Datia, J. Moura Pires and N. Correia, 2017. “Time and space for segmenting personal photo sets,” Multimedia Tools and Applications, volume 76, number 5, pp. 7,141–7,173.
doi:, accessed 25 November 2019.

M.-S. Dao, G. Boato, F.G.B. de Natale and T.-V. Nguyen, 2013. “Jointly exploiting visual and non-visual information for event-related social media retrieval,” ICMR ’13: Proceedings of the Third ACM Conference on International Conference on Multimedia Retrieval, pp. 159–166.
doi:, accessed 25 November 2019.

D.O.S. de Andrade, L.F. Maia, H.F. de Figueirêdo, W. Viana, F. Trinta and C. de Souza Baptista, 2018. “Photo annotation: A survey,” Multimedia Tools and Applications, volume 77, number 1, pp. 423–457.
doi:, accessed 25 November 2019.

M. Ester, H.-P. Kriegel, J. Sander and X. Xu, 1996. “A density-based algorithm for discovering clusters in large spatial databases with noise,” KDD’96: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pp. 226–231, and at, accessed 25 November 2019.

K. Feng, G. Cong, S.S. Bhowmick and S. Ma, 2014. “In search of influential event organizers in online social networks,” SIGMOD ’14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, pp. 63–74.
doi:, accessed 25 November 2019.

A.A. Firmino, C. de Souza Baptista, H.F. de Figueirêdo, E.T. Pereira and B. de Sousa Pereira Amorim, 2019. “Automatic and semi-automatic annotation of people in photography using shared events,” Multimedia Tools and Applications, volume 78, number 10, pp. 13,841–13,875.
doi:, accessed 25 November 2019.

H.F. de Figueirêdo, Y.A. Lacerda, A.C. de Paiva, M.A. Casanova and C. de Souza Baptista, 2012. “PhotoGeo: A photo digital library with spatial-temporal support and self-annotation,” Multimedia Tools and Applications, volume 59, number 1, pp. 279–305.
doi:, accessed 25 November 2019.

I. Grabovitch-Zuyev, Y. Kanza, E. Kravi and B. Pat, 2014. “On the correlation between textual content and geospatial locations in microblogs,” GeoRich’14: Proceedings of Workshop on Managing and Mining Enriched Geo-Spatial Data, article number 3.
doi:, accessed 25 November 2019.

J. Hare, S. Samangooei, M. Niranjan and N. Gibbins, 2015. “Detection of social events in streams of social multimedia,” International Journal of Multimedia Information Retrieval, volume 4, number 4, pp. 289–302.
doi:, accessed 25 November 2019.

Y. Hong, T. Chen, K. Zhang and L. Sun, 2016. “Personalized annotation for mobile photos based on user’s social circle,” In: I. Kompatsiaris, B. Huet, V. Mezaris, C. Gurrin, W.-H. Cheng and S. Vrochidis (editors). MultiMedia Modeling: 25th International Conference, MMM 2019. Lecture Notes in Computer Science, volume 9516. Cham, Switzerland: Springer, pp. 76–87.
doi:, accessed 25 November 2019.

S.-C. Huang, M.-K. Jiau and Y.-H. Jian, 2016. “Optimisation of automatic face annotation system used within a collaborative framework for online social networks,” IET Computer Vision, volume 10, number 5, pp. 349–358.
doi:, accessed 25 November 2019.

Y.-G. Jiang, S. Bhattacharya, S.-F. Chang and M. Shah, 2012. “High-level event recognition in unconstrained videos,” International Journal of Multimedia Information Retrieval, volume 2, number 2, pp. 73–101.
doi:, accessed 25 November 2019.

K.S. Jones, 1972. “A statistical interpretation of term specificity and its application in retrieval,” Journal of Documentation, volume 28, number 1, pp. 11–21.
doi:, accessed 25 November 2019.

J. Kofoed and M.C. Larsen, 2016. “A snap of intimacy: Photo-sharing practices among young people on social media,” First Monday, volume 21, number 11, at, accessed 25 November 2019.
doi:, accessed 25 November 2019.

D. Manchon-Vizuete, I. Gris-Sarabia and X. Giró-i-Nieto, 2014. “Photo clustering of social events by extending PhotoTOC to a rich context,” ICMR Workshop on Social Events in Web Multimedia, at, accessed 25 November 2019.

E. Mansour, G. Tekli, P. Arnould, R. Chbeir and Y. Cardinale, 2017. “F-SED: Feature-centric social event detection,” In: D. Benslimane, E. Damiani, W.I. Grosky, A. Hameurlain, A. Sheth and R.R. Wagner (editors). Database and expert systems applications. Lecture Notes in Computer Science, volume 10439. Cham, Switzerland: Springer, pp. 409–426.
doi:, accessed 25 November 2019.

T.-V.T. Nguyen, M.-S. Dao, R. Mattivi, E. Sansone, F.G.B. De Natale and G. Boato, 2013. “Event clustering and classification from social media: Watershed-based and kernel methods,” MediaEval 2013 Workshop, at, accessed 25 November 2019.

G. Petkos, M. Schinas, S. Papadopoulos and Y. Kompatsiaris, 2017. “Graph-based multimodal clustering for social multimedia,” Multimedia Tools and Applications, volume 76, number 6, pp. 7,897–7,919.
doi:, accessed 25 November 2019.

T. Reuter, S. Papadopoulos, V. Mezaris and P. Cimiano, 2014. “ReSEED: Social event detection dataset,” MMSys ’14: Proceedings of the Fifth ACM Multimedia Systems Conference, pp. 35–40.
doi:, accessed 25 November 2019.

S. Samangooei, J. Hare, D. Dupplaw, M. Niranjan, N. Gibbins and P. Lewis, 2013. “Social event detection via sparse multi-modal feature selection and incremental density based clustering,” MediaEval 2013 Workshop, at, accessed 25 November 2019.

E. Sansone, K. Apostolidis, N. Conci, G. Boato, V. Mezaris and F.G.B. De Natale, 2017. “Automatic synchronization of multi-user photo galleries,” IEEE Transactions on Multimedia, volume 19, number 6, pp. 1,285–1,298.
doi:, accessed 25 November 2019.

A. Scherp and V. Mezaris, 2014. “Survey on modeling and indexing events in multimedia,” Multimedia Tools and Applications, volume 70, number 1, pp. 7–23.
doi:, accessed 25 November 2019.

M. Schinas, S. Papadopoulos, Y. Kompatsiaris and P. Mitkas, 2018. “Event detection and retrieval on social media,” arXiv (10 July), at, accessed 25 November 2019.

P. Shamsolmoali, D. Kumar Jain, M. Zareapoor, J. Yang and M. Afshar Alam, 2019. “High-dimensional multimedia classification using deep CNN and extended residual units,” Multimedia Tools and Applications, volume 78, number 17, pp. 23,867–23,882.
doi:, accessed 25 November 2019.

J. Sutanto, E. Palme, C.-H. Tan and C.W. Phang, 2014. “Addressing the personalization-privacy paradox: An empirical assessment from a field experiment on smartphone users,” MIS Quarterly, volume 37, number 4, pp. 1,141–1,164.
doi:, accessed 25 November 2019.

T. Sutanto and R. Nayak, 2013. “ADMRG @ MediaEval 2013 social event detection,” MediaEval 2013 Workshop, at, accessed 25 November 2019.

R. Venema and K. Lobinger, 2017. “‘And somehow it ends up on the Internet.’ Agency, trust and risks in photo-sharing among friends and romantic partners,” First Monday, volume 22, number 7, at, accessed 25 November 2019.
doi:, accessed 25 November 2019.

M. Wistuba and L. Schmidt-Thieme, 2013. “Supervised clustering of social media streams,” MediaEval 2013 Workshop, at, accessed 25 November 2019.

W. Yan, D.F. Kieran, S. Rafatirad and R. Jain, 2010. “A comprehensive study of visual event computing,” Multimedia Tools and Applications, volume 55, number 3, pp. 443–481.
doi:, accessed 25 November 2019.

M. Zaharieva, M. Zeppelzauer, M. Del Fabro and D. Schopfhauser, 2015a. “Social event mining in large photo collections,” ICMR ’15: Proceedings of the Fifth ACM on International Conference on Multimedia Retrieval, pp. 11–18.
doi:, accessed 25 November 2019.

M. Zaharieva, M. Del Fabro and M. Zeppelzauer, 2015b. “Cross-platform social event detection,” IEEE MultiMedia, volume 22, number 3, pp. 14–25.
doi:, accessed 25 November 2019.


Editorial history

Received 27 February 2019; revised 6 September 2019; accepted 25 November 2019.

Creative Commons License
This paper is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Towards retrieving social events from spatio-temporal photo annotation
by Davi Oliveira Serrano de Andrade, Anderson Almeida Firmino, Cláudio de Souza Baptista, and Hugo Feitosa de Figueirêdo.
First Monday, Volume 24, Number 12 - 2 December 2019


© First Monday, 1995-2020. ISSN 1396-0466.