Chen, J., & Hanley, H. Fine Tuning Multi Downstream Tasks based on BERT with Gradient Surgery. Stanford CS224N Default Project.
Abstract
Bidirectional Encoder Representations from Transformers (BERT) is a transformer-based model that generates contextual word representations. These representations can be used for multiple downstream tasks, including sentiment analysis, paraphrase detection, and semantic textual similarity analysis. To obtain robust, semantically rich sentence embeddings, the BERT weights must be fine-tuned efficiently on task-relevant data. In this project, we test different fine-tuning techniques for improving model performance on the three downstream tasks above. The experiments cover changes to the architecture (dense layer size, Siamese network), the optimization step (training tasks separately or jointly, with or without gradient surgery), and the training setup (whether to reshuffle the data after each epoch, batch size, regularization). Within the limits of an AWS g5.2xlarge instance, the best performance is achieved by (1) adding a Siamese network with cosine similarity for the semantic textual similarity task, (2) passing the two sentence embeddings together with their difference into the dense linear layer for the paraphrase task, (3) multi-task training with gradient surgery, (4) reshuffling the dataloader after each training epoch, (5) applying L2 regularization, and (6) training with batch size 2 for the sentiment and similarity tasks and batch size 32 for the paraphrase task. The model achieves 52.1% accuracy on the sentiment task, 82.8% accuracy on the paraphrase task, and 0.820 Pearson correlation on the semantic textual similarity task on the dev set, and 52.4%, 82.7%, and 0.792 on the test set.
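The gradient surgery referenced in the abstract is presumably a PCGrad-style projection of conflicting task gradients (Yu et al., 2020). The sketch below illustrates only that core projection on flattened per-task gradients; the function name `pcgrad`, the flattened-gradient representation, and the toy tensors are assumptions of this illustration, not details taken from the paper.

```python
import torch

def pcgrad(per_task_grads):
    """Merge per-task gradients with PCGrad-style gradient surgery.

    per_task_grads: list of 1-D tensors, one flattened gradient per task.
    Returns a single merged gradient of the same shape.
    """
    merged = []
    for i, g_i in enumerate(per_task_grads):
        g = g_i.clone()
        for j, g_j in enumerate(per_task_grads):
            if i == j:
                continue
            dot = torch.dot(g, g_j)
            # Gradients conflict when their dot product is negative;
            # project out the component of g that points against g_j.
            if dot < 0:
                g = g - (dot / (g_j.norm() ** 2 + 1e-12)) * g_j
        merged.append(g)
    # Average the projected gradients across tasks.
    return torch.stack(merged).mean(dim=0)

# Toy example with two conflicting 2-D gradients.
g_sentiment = torch.tensor([1.0, 0.0])
g_paraphrase = torch.tensor([-1.0, 1.0])
print(pcgrad([g_sentiment, g_paraphrase]))
```

In a multi-task fine-tuning loop, one would compute each task's flattened gradient, merge them as above, and write the result back to the model parameters before calling `optimizer.step()`.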
Notes
[Online; accessed 1. Jun. 2024]