Data changes all the time, so how can information retrieval systems such as search engines, trained on older data sets, keep up without losing effectiveness? This is what the researchers of the Kodicare project looked into at the CLEF 2024 LongEval shared task.

The Kodicare project members from the Data Science studio co-organized a research challenge, the CLEF 2024 LongEval shared task. The goal was to test how well search engines (so-called information retrieval (IR) systems) and text classifiers (programs that categorize text) can keep their accuracy over time, even as language and information change. The researchers recently hosted a workshop at the CLEF conference in Grenoble, France. Kodicare is a bilateral project funded by the French ANR and the Austrian FWF (more on the project further down).

The CLEF 2024 LongEval Retrieval challenge asks participants to propose information retrieval systems that can handle changes over time. Recent studies have found that the performance of models such as web search engines drops when they are tested on new data that differs substantially from the data they were trained on. If you train a model on data from 2020 and test it on data from 2023, it may not perform as well. This is where LongEval Retrieval differs from typical search and classification challenges: it focuses specifically on how well IR models hold up over time.
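
To make that concrete, here is a minimal sketch of how such a temporal performance drop could be measured. It is not the official LongEval evaluation code, and all names and numbers are made up: the same system's rankings are scored against relevance judgments from an older and a newer snapshot, and the relative drop in nDCG@10 is reported.

```python
import math

def ndcg_at_k(ranking, relevance, k=10):
    """nDCG@k for one query: `ranking` is the system's ranked doc ids,
    `relevance` maps doc id -> graded relevance from the judgments."""
    gains = [relevance.get(doc, 0) for doc in ranking[:k]]
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

def mean_ndcg(run, qrels, k=10):
    """Average nDCG@k over all judged queries of one snapshot."""
    scores = [ndcg_at_k(run[q], qrels[q], k) for q in qrels if q in run]
    return sum(scores) / len(scores) if scores else 0.0

# Toy data: the same system evaluated on an older and a newer snapshot.
run_old, qrels_old = {"q1": ["d3", "d1", "d7"]}, {"q1": {"d3": 2, "d1": 1}}
run_new, qrels_new = {"q1": ["d7", "d9", "d3"]}, {"q1": {"d9": 2, "d3": 1}}

old_score = mean_ndcg(run_old, qrels_old)
new_score = mean_ndcg(run_new, qrels_new)
drop = 100 * (old_score - new_score) / old_score
print(f"nDCG@10 old: {old_score:.3f}  new: {new_score:.3f}  drop: {drop:.1f}%")
```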

 Building resilient systems for changing data 

Participants were asked to design IR models that can adapt to these changes or at least reduce how much performance declines as time goes on. In real-world applications like search engines or content recommendations, data changes all the time. Making models that can adapt to these changes could help build more reliable, long-lasting systems. This task is a step toward developing systems that aren’t just effective now but can keep up with changes in information over the years. 

The Data Science Kodicare team, with Alaa El-Ebshihy, Tobias Fink and David Iommi, processed datasets (provided by the Qwant search engine) composed of around two million web documents for training and over four million for testing. Prior to the workshop, El-Ebshihy, a researcher in the RSA FG Data Science studio and doctoral candidate at TU Wien, also gave a short presentation at the main conference, providing an overview of the lab, the participants and the results.

The workshop took place the day after the presentation. The participants of the shared tasks presented their results and discussed the future of the LongEval lab, e.g. releasing the manual assessments and comparing them to the models from the originally released datasets for the lab. The workshop was moderated by Dr. Florina Piroi, senior scientist in the Data Science studio and at TU Wien, and co-organized by El-Ebshihy and the French Kodicare partners.

CLEF 2024 consisted of an independent, peer-reviewed conference on a broad range of issues in multilingual and multimodal information access evaluation, and a set of labs and workshops designed to test different aspects of monolingual and cross-lingual information retrieval systems.

In the course of the conference, two other papers with El-Ebshihy’s participation were presented. “AMATU@Simpletext2024: Are LLMs Any Good for Scientific Leaderboard Extraction?” resulted from a submission to the SOTA shared task, which was part of the SimpleText lab. The objective was to extract all (Task, Dataset, Metric, Score) tuples from scientific papers that report leaderboard data. The team made several submissions to the task, using a neural network-based baseline as well as LLMs, and carried out a manual analysis showing that extracting these TDMS tuples from scientific text with LLMs remains challenging.
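
To illustrate the kind of LLM-based extraction involved, here is a generic sketch, not the AMATU submission code; the prompt wording and the call_llm callable are placeholders. The idea is to ask a model for (Task, Dataset, Metric, Score) lines and parse whatever comes back.

```python
from typing import Callable, List, Tuple

PROMPT_TEMPLATE = (
    "Extract every leaderboard result from the paper excerpt below.\n"
    "Return one line per result in the form: Task | Dataset | Metric | Score\n\n"
    "Excerpt:\n{text}"
)

def extract_tdms(text: str, call_llm: Callable[[str], str]) -> List[Tuple[str, ...]]:
    """Ask an LLM for (Task, Dataset, Metric, Score) tuples and parse its reply.
    `call_llm` is any function that sends a prompt to a model and returns its answer."""
    reply = call_llm(PROMPT_TEMPLATE.format(text=text))
    tuples = []
    for line in reply.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) == 4:                      # keep only well-formed lines
            tuples.append(tuple(parts))
    return tuples

# Example with a stubbed "LLM" so the sketch runs on its own:
fake_llm = lambda prompt: "Question Answering | SQuAD | F1 | 93.2"
print(extract_tdms("... we achieve 93.2 F1 on SQuAD ...", fake_llm))
```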

The doctoral candidate also presented the paper “Improving Laypeople Familiarity with Medical Terms by Informal Medical Entity Linking” at the main conference. It proposes an end-to-end medical entity linking model that helps laypeople better understand medical terminology by linking popularized medical phrases in social media posts to their specialized counterparts and to relevant Wikipedia articles. Medical experts assessed the accuracy and relevance of the entity linking model, and the study shows that it can be a valuable tool to support laypeople's understanding of medical terms, tapping the educational potential of social media.
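
For readers unfamiliar with entity linking, the toy sketch below shows the kind of output such links produce. It uses a hard-coded dictionary purely for illustration, whereas the paper's model learns these links end to end.

```python
# Toy lookup table; only illustrates what a linked result looks like.
INFORMAL_TO_SPECIALIZED = {
    "heart attack": ("myocardial infarction",
                     "https://en.wikipedia.org/wiki/Myocardial_infarction"),
    "high blood pressure": ("hypertension",
                            "https://en.wikipedia.org/wiki/Hypertension"),
}

def link_informal_terms(post: str):
    """Return (informal phrase, specialized term, Wikipedia URL) for each match."""
    text = post.lower()
    return [(phrase, term, url)
            for phrase, (term, url) in INFORMAL_TO_SPECIALIZED.items()
            if phrase in text]

print(link_informal_terms("My dad had a heart attack last year."))
```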

More on the Kodicare project:

When evaluating a search engine, you need to test it under specific conditions: this includes choosing particular datasets, query sets and evaluation metrics. Yet the choice of these testing conditions is often made without a transparent rationale, and hardly anyone measures what happens when these conditions change. This is where the Kodicare project comes in. It uses the term “knowledge delta” for the difference between such conditions; a knowledge delta could, for example, be the difference between two datasets or two sets of search queries.

Similarly, they look at the “results delta”, the difference in results when you change conditions in the testing environment. This shows how much the search results change under different conditions. By understanding the impact of different conditions (the knowledge delta) and how they influence search results (the results delta), the project aims to create a stable way to evaluate and improve search engines on an ongoing basis. This would help explain why search results change over time or why some queries work better than others.
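
A rough, purely illustrative way to picture the two notions (the Kodicare framework itself is more elaborate): the knowledge delta is the change between two testing conditions, and the results delta is the change this causes in an evaluation score. All numbers below are made up.

```python
# Illustrative per-query nDCG of the same system under two test conditions
# (e.g. an older and a newer query/document snapshot).
condition_a = {"q1": 0.62, "q2": 0.48, "q3": 0.71}   # older snapshot
condition_b = {"q1": 0.55, "q2": 0.50, "q3": 0.60}   # newer snapshot

def results_delta(scores_a, scores_b):
    """Difference in mean effectiveness between two testing conditions."""
    shared = scores_a.keys() & scores_b.keys()
    mean_a = sum(scores_a[q] for q in shared) / len(shared)
    mean_b = sum(scores_b[q] for q in shared) / len(shared)
    return mean_b - mean_a

# The knowledge delta is the change in conditions (snapshot A -> B);
# the results delta is the effect of that change on the evaluation outcome.
print(f"results delta: {results_delta(condition_a, condition_b):+.3f}")
```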

Creating such a framework is tough because there are so many moving parts: different datasets, metrics, and so on. Right now, no complete system exists for the continuous evaluation of search engines, especially with real-world data. Ultimately, the project aims to build a way to evaluate search engines continuously, in a stable and meaningful manner, and to be able to explain why the results are the way they are. This would make search engines more understandable, reproducible, and open to continuous improvement. Kodicare works with the French search engine Qwant, which also provided the data for the shared task.