WIKINDX

An, N. M., Waheed, S., & Thorne, J. Capturing the Relationship Between Sentence Triplets for LLM and Human-Generated Texts to Enhance Sentence Embeddings. 
Resource type: Journal Article
BibTeX citation key: anonh
Categories: General
Creators: An, Thorne, Waheed
Attachments   URLs   https://www.semant ... ?email_index=0-0-0
Abstract
This work evaluates the quality of LLM-generated texts from four perspectives, finds an inherent difference between human and LLM-generated datasets, and proposes a novel loss function that incorporates Positive-Negative sample Augmentation (PNA) within the contrastive learning objective. Deriving meaningful sentence embeddings is crucial for capturing the semantic relationship between texts. Recent advances in building sentence embedding models have centered on replacing traditional human-generated text datasets with those generated by LLMs. However, the properties of these widely used LLM-generated texts remain largely unexplored. Here, we evaluate the quality of the LLM-generated texts from four perspectives (Positive Text Repetition, Length Difference Penalty, Positive Score Compactness, and Negative Text Implausibility) and find that there is an inherent difference between human and LLM-generated datasets. To further enhance sentence embeddings using both human and LLM-generated datasets, we propose a novel loss function that incorporates Positive-Negative sample Augmentation (PNA) within the contrastive learning objective. Our results demonstrate that PNA effectively mitigates the sentence anisotropy problem in the Wikipedia corpus (-7% compared to CLHAIF) and simultaneously improves Spearman’s correlation on standard Semantic Textual Similarity (STS) tasks (+1.47% compared to CLHAIF).
  