Relevance feedback, the application of human feedback regarding the performance of information retrieval systems, has been a topic of study in the field of information retrieval for almost 50 years. Starting with textual document retrieval, the field has since expanded to (content-based) multimedia retrieval (such as audio, images, and video) as well. In the current decade, we are confronted with more and more multimedia content. The high rate at which open sensor networks are being set up and increasing rate at which sensor data is available today expose the need to perform semantic searches over these large heterogeneous datasets.
One of the challenges this brings is bridging the semantic gap, the discrepancy between the high level query posed by the user in natural language, and the low level sensor data features. One way of decreasing this gap is to include the user in the information retrieval system. To identify what the user is looking for, the system can iterate a number of times over its results, using the users feedback on which returned results are relevant for its query and which results are irrelevant. Adapting to the user's feedback, the system can refine the search query and retrieve more relevant results.
In the last years, relevance feedback in information retrieval is an increasingly active field of research. In this section we will give a brief overview of the status quo. One of the most well-known and applied relevance feedback algorithms that has its origins in text-retrieval is the Rocchio algorithm. The classic Rocchio algorithm of relevance feedback looks to find the most optimal query based on the user's initial query, and relevance feedback consisting of positive and negative documents. This method of relevance retrieval found its way from text-retrieval into many other domains, including video retrieval.
Relevance feedback can be performed automatically, without any user interaction, or manually, by allowing a user to mark results as relevant or not relevant in varying degrees. The first branch of relevance feedback, also called pseudo-relevance feedback, has piqued the interest of a large number of researchers and software engineers due to its non-reliance on users, though normal relevance feedback is also very relevant to this day. Whereas in normal relevance feedback the user would have to iterate over the retrieved results a number of times, this does not have to be the case in a (well performing) pseudo-relevance feedback system. This makes it favourable for implementation since often the user does not have the time or motivation to give extensive feedback. Human relevance feedback has been known to provide major improvements in precision for information retrieval system.
Figure 1. AVES UI.
In this research, when I was interning at TNO, we evaluated a novel method of multimedia retrieval that incorporates relevance feedback using concept detector weight calibration. The method proposed in this research is based on work by Rocchio. Rocchio has been proven to be a strong method of relevance feedback for text-retrieval and we believe the application for multimedia retrieval purposes is promising and worth investigating. We simulated user relevance feedback (optimal-, pseudo-, and random-relevance feedback) to evaluate the model's performance. We also evaluated the model's performance using human relevance feedback. Retrieval without relevance feedback is taken as a baseline, and the well-performing RS method is used for comparison. Giacinto and Roli introduced this method - that takes into account not only a document's relevance towards the other documents, which is usually considered, but also its irrelevance. The formula they utilize of a document's relevance takes the inverse of a documents distance to the nearest relevant document over the distance to the nearest irrelevant document (see Equation 1 below). This score, named relevance score (RS), is then used as ranking measure.
The dataset that was used contains the full video set from the [2014 TRECVID Multimedia Event Detection track][https://www.nist.gov/itl/iad/mig/med-2014-evaluation] train set. The events in this dataset are all complex activities involving interactions with people and objects. It consists of a number of human activities, including human actions that have temporal and semantic relationships to the overarching activity. The list of events includes, for example: Attempting a board trick; Feeding an animal; Birthday party; Repairing an appliance. The full set adds up to a total of 5594 videos (consisting of, on average, 75.45 keyframes).
The implementation for the video search engine has been dubbed AVES (Automatic Video Event Search). It is setup as a Python app using Flask (a lightweight web framework based on Werkzeug and Jinja2). To render the pages it uses Blueprints and a number of small (optimization/minification) tools for static asset preprocessing. This is set up as a Gulp pipeline. Nunjucks is used as templating engine and SocketIO provides a way for the app to communicate changes to and from the backend quickly over websockets. The Figure below provides a schematic overview. We will dive into each element of this overview next.
Figure 2. AVES backend.
The user enters a query, after which the query is ran through a Word2vec model trained on a model of Google News words. Word2vec allows us to create vector representations of words, and thus provides a semantic similarity measure of the query to the various concept detectors.
Figure 3. Excellent word2vec illustration by @swatimeena.
The top concept detectors that have a similarity score above a set threshold are then retrieved and used for the following steps.
We trained a total of 2048 concept detectors using deep convoluted neural networks and SVMs. For each video in our test set we extracted 1 keyframe per 2 seconds uniformly from the video. The concept detectors are then used to score each keyframe, resulting in a 2048-dimensional vector for each keyframe (normalized). Background data is also extracted by calculating detector scores on videos in which the concept is not present.
Videos are ranked according to their score (see Equation 2). It is inspired by Aly et al. (similarly using the concept prior and probability measure given confidence) and is derived from the Bayesian model.
In Aly's equation, is the expected score for a document given the vector of confidence scores , and is the occurrence probability of a concept given relevance. is the concept prior, and is the probability measure of a concept occurring in the entire set of documents given the confidence score . These parameters correspond to the detector scores, background scores, and Word2vec scores respectively in our equation. We denote to be the probability of a concept occuring in a video (evidence), and be the prior.
In our method listed in Equation 2, is the score of a given video, for a given detector. It is the maximum of it's keyframe scores (determined by the neural network). This score is then multiplied by the Word2vec similarity score for that detector, , and we substract the background score (calculated by running the neural network and determining detector scores on videos that do not contain the concepts) of that detector, . This is done for each detector, and the scores are summed. The resulting score, is then used to retrieve a ranked list of videos for which the concept detectors that are most relevant to the user's query give the highest score. Video's are returned in descending order based on their score .
The concept detectors that were used by the model are also shown to the user in the UI (limited to the best 5 out of a maximum of 30). The next Figure gives a more visual overview of this this process.
Figure 4. Overview of the scoring model.
Relevance feedback and reranking
After the videos are retrieved as an ordered list, the user is able to make a selection of which results are relevant for their query. These results are grouped into the relevant set . The rest of the retrieved results are grouped into the non-relevant set . Inspired by methods from Aly, PRFUBE, and Rocchio, we constructed a method of concept detector weight calibration to determine a video's new score: . Equation 3 describes this method of determining the updated video score.
The method uses the following equations. We calculate , the recalibrated concept detector weight, by taking the original Word2vec score, , adding the average score of the videos in the relevant set (minus the detector prior, ), and subtracting the average score of the videos in the nonrelevant set (again, minus the detector prior). These scores are adjusted with weights and , similar to Rocchio. and were calibrated on sample data ( = 1.0; = 0.5), and also match values commonly used in other information retrieval literature.
This is done for all used detectors, after which the calibrated detector weight, , is then plugged back into the initial ranking (Equation 2), where we substitute the original Word2vec score for the adjusted weight. This then results in the video's new score, , which is used to retrieve a ranked list of videos with a better precision than the initial ranking.
In addition, we determine the minimum of the maximum detector scores in , , and check for each result in , if there is a detector score that is higher than . This is shown in Equation 4 below.
If there is a detector that scores higher than the minimum of the detector scores in , we add the detector to our detector set for reranking. The reasoning is that if there is a detector that scores well in the set , it may be advantageous to use it in retrieval (although it was not picked up with the Word2vec similarity initially).
In order to evaluate the AVES system and answer the research questions as described in the previous sections, an experiment was performed using human participants to provide relevance feedback.
Ten participants with a higher education level (employees and students from TNO, and the University of Twente) were asked to volunteer in the experiment. Participants were excluded from participation if they were dyslectic, had concentration problems, RSI, or were colour blind.
Users used the system for period of up to 50 minutes. The user was asked to search for up to four example queries, serving as an introduction for the user to the system. A list with topics to be queried was then presented, and the user was asked to evaluate the results based on relevance. Users would first perform 16 queries with one method (AVES, or RS), and then switch to do the remaining 16 using the other method. Meanwhile, the user's action and system performance (mean average precision, lookup time, runtime, etc.) was recorded automatically. After the list has been completed, a semi-structured interview was scheduled to evaluate the system on factors such as usability and usefulness (using the well-known System Usability Scale).
Mean Average Precision (MAP) was calculated by taking the mean of the Average Precision (AP) per query. AP for a query is calculated after removing the first 20 results (because in comparing modern information retrieval sysems these are often equal and therefore an uninteresting metric), and the selected relevant results. To signify the difference with normal MAP, we use the following annotation for MAP after the first 20 results and any selected relevant results from the baseline result ranking: MAP∗.
We also evaluated non-human relevance feedback through simulations. Relevance selection is done using optimal- (select N known true-positive), pseudo- (select the first N results as positive), and random- (select N positive at random) selection.
Figure 5. Translation of SUS scores into quartiles, acceptability and adjectives.
There was a statistically significant difference between groups as determined by a one-way ANOVA (F(2,27)=18.972, p=.000). A post-hoc Tukey's HSD test was performed to verify intergroup differences, which showed means of all methods differing significantly (p<0.05).
Table 1. Mean and standard deviation for MAP∗ scores per method.
To get an overview of the precision per event, we calculated precisions averaging over all sessions. A bar plot is shown in Figure 6. We also investigated relevance selection per event. An overview of the True Positive (TP), False Positive (FP), True Negative (TN) and False Negative (FN) scores are shown in Figure 6. Note that this terminology does not capture the users' beliefs sufficiently, since the user's cannot be wrong in their judgement (if we assume they performed the task honestly). We can state that they are True, or False Positives only relative to the ground truth. A better terminology reflecting the users' honest evaluation might be: Correct positives, Missed negatives, Correct negatives and Missed positives.
Figure 6. Top: Average precision (AP∗) per event per method. Bottom: Percentage relevance selection per event per method.
A commonly used measure of robustness is the robustness index (RI). RI is a measure of the ratio of the difference of the number of events that benefit from one method compared to another, to the total number of events. A plot showing the difference in precision between AVES and RS per event can be found in the Figure below. We obtained an RI of 0.5625.
Figure 7. Average precision difference (AP∗) between AVES and RS per event.
SUS ratings were for RS, and for AVES, with no statistically significant difference. Both could be classified as excellent.
Precision was found to be higher using the AVES method of relevance feedback than when not using any relevance feedback, or when using a current (well-performing) method, RS. We must note here that when looking at full MAP (including first 20 results) and perfect relevance selection, AVES performs as well as RS.
Lookup time was not found to be lower using AVES than when using RS, but AVES did sport a 75x faster runtime than RS, depending on the distance measure chosen for RS.
Results were published at the International Conference on Multimedia Modeling (2017), where much more information and additional experiment results can be found. For example, investigations into the effect of number of concept detectors, simulated runs, the size of the relevant set, supplementary detectors, parameters and in relevance scoring, and using user experiment feedback in simulated runs.
A full publication is available at: https://link.springer.com/chapter/10.1007/978-3-319-51814-5_27