Following the trilogue negotiations, large language models (LLMs) are to be covered under the AI Act. Given the widespread use of LLMs, it is particularly pertinent to ensure that they are reliable and trustworthy by developing standardized practices for evaluating their performance. In this contribution, I highlight a specific challenge that arises in the context of LLMs: assessing whether a system acquires so-called emergent abilities, that is, abilities that cannot be explained by the scaling of the model. Because emergent abilities are often poorly defined and difficult to assess, I argue that there is a need for conceptual criteria that clearly establish which ability, such as knowledge or understanding, is to be attributed and when such attribution is warranted. I further argue that evaluating such conceptual criteria requires the development of technical methods that can uncover the internal processing of LLMs. Although the standardization of LLMs is still at an early stage, this approach may help clarify what LLMs learn from the data and how their performance can best be assessed.
This poster presentation is part of a one-day workshop on standardization in NLP, which brings together NLP researchers and practitioners from academia and industry as well as standardization experts.