WIKINDX Resources

Zhang, X., Li, Z., Zhang, Y., Long, D., Xie, P., Zhang, M., et al. Language Models are Universal Embedders.
Resource type: Journal Article
BibTeX citation key: anon.192
Categories: General
Keywords: Important!, RAG
Creators: Li, Long, Xie, Zhang, Zhang, Zhang, Zhang
URLs: https://www.semant ... 24b5a8dd98aca5784c
Abstract
In the large language model (LLM) revolution, embedding is a key component of various systems. For example, it is used to retrieve knowledge or memories for LLMs, to build content moderation filters, etc. As such cases span from English to other natural or programming languages, from retrieval to classification and beyond, it is desirable to build a unified embedding model rather than dedicated ones for each scenario. In this work, we make an initial step towards this goal, demonstrating that multiple languages (both natural and programming) pre-trained transformer decoders can embed universally when finetuned on limited English data. We provide a comprehensive practice with thorough evaluations. On English MTEB, our models achieve competitive performance on different embedding tasks by minimal training data. On other benchmarks, such as multilingual classification and code search, our models (without any supervision) perform comparably to, or even surpass heavily supervised baselines and/or APIs. These results provide evidence of a promising path towards building powerful unified embedders that can be applied across tasks and languages.
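
The abstract describes turning a pre-trained decoder-only language model into a general-purpose text embedder by pooling its hidden states. The sketch below illustrates only the inference side of that idea, under stated assumptions: the model name (bigscience/bloom-560m) is a stand-in for the multilingual decoders the paper builds on, and mask-aware mean pooling is one plausible pooling choice, not necessarily the one used by the authors.

# Minimal sketch: using a pre-trained decoder-only LM as a text embedder
# by pooling its hidden states. Model name and pooling strategy are
# assumptions for illustration, not details taken from the record above.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bigscience/bloom-560m"  # assumed stand-in for a multilingual decoder

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # ensure padding works for batching
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

def embed(texts: list[str]) -> torch.Tensor:
    """Encode texts into fixed-size vectors via mask-aware mean pooling."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state          # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()   # (B, T, 1)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # mean over real tokens only
    return F.normalize(pooled, p=2, dim=1)                 # unit-length embeddings

# Example: score a query against two candidate passages.
vecs = embed([
    "How do I reverse a list in Python?",
    "Use list slicing: reversed_list = my_list[::-1]",
    "The weather in Paris is mild in spring.",
])
print(vecs[0] @ vecs[1], vecs[0] @ vecs[2])

Because the vectors are unit-normalized, the dot products above equal cosine similarities, which is how such embeddings would typically be ranked in a retrieval or RAG pipeline.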