Repetition Improves Language Model Embeddings

Springer, Jacob M.; Kotha, Suhas; Fried, Daniel; Neubig, Graham; Raghunathan, Aditi

Springer, J. M., Kotha, S., Fried, D., Neubig, G., & Raghunathan, A. Repetition Improves Language Model Embeddings.

Resource type: Journal Article
BibTeX citation key: anon.156

Categories: General
Creators: Fried, Kotha, Neubig, Raghunathan, Springer

Attachments

URLs https://www.semant ... ?email_index=1-0-5

Abstract

Recent approaches to improving the extraction of text embeddings from autoregressive large language models (LLMs) have largely focused on improvements to data, backbone pretrained language models, or task differentiation via instructions. In this work, we address an architectural limitation of autoregressive models: token embeddings cannot contain information from tokens that appear later in the input. To address this limitation, we propose a simple approach, "echo embeddings," in which we repeat the input twice in context and extract embeddings from the second occurrence. We show that echo embeddings of early tokens can encode information about later tokens, allowing us to maximally leverage high-quality LLMs for embeddings. On the MTEB leaderboard, echo embeddings improve over classical embeddings by over 9% zero-shot and by around 0.7% when fine-tuned. Echo embeddings with a Mistral-7B model achieve state-of-the-art results compared to prior open-source models that do not leverage synthetic fine-tuning data.
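The mechanism described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the hash-based token vectors, the causal running-mean "layer" (a stand-in for a causally masked transformer), the mean pooling, and all function names are assumptions made for illustration, and the abstract does not specify the prompt format or pooling strategy actually used. The sketch only shows that, under a causal constraint, the state of the first token ignores later tokens, while the state of the first token in the repeated copy does not.

```python
import hashlib

DIM = 8  # toy embedding width (assumption, chosen only for illustration)


def token_vector(token):
    """Deterministic toy vector for a token (stand-in for a learned embedding)."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return [b / 255.0 for b in digest[:DIM]]


def causal_states(tokens):
    """Toy causal layer: state i is the mean of token vectors 0..i.

    Like a causal attention mask, position i never sees positions > i.
    """
    states = []
    running = [0.0] * DIM
    for i, tok in enumerate(tokens):
        running = [r + v for r, v in zip(running, token_vector(tok))]
        states.append([r / (i + 1) for r in running])
    return states


def mean_pool(states):
    """Average a list of per-token states into one vector."""
    return [sum(col) / len(states) for col in zip(*states)]


def classical_embedding(tokens):
    """Pool over the states of a single pass over the input."""
    return mean_pool(causal_states(tokens))


def echo_embedding(tokens):
    """Feed the input twice and pool only over the second occurrence."""
    states = causal_states(tokens + tokens)
    return mean_pool(states[len(tokens):])


def first_token_state(tokens, echo=False):
    """State of the first input token, taken from the second copy if echo=True."""
    if echo:
        return causal_states(tokens + tokens)[len(tokens)]
    return causal_states(tokens)[0]


a = ["the", "cat", "sat"]
b = ["the", "cat", "ran"]  # differs from `a` only in the last token

print("classical first-token states equal:",
      first_token_state(a) == first_token_state(b))
print("echo first-token states equal:",
      first_token_state(a, echo=True) == first_token_state(b, echo=True))
```

In this toy setting the classical first-token state depends only on the first token, whereas the echoed first-token state sits after a full copy of the input and therefore reflects every token. A real setup would replace `causal_states` with the hidden states of an autoregressive LLM.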
