WIKINDX Resources

Gu, C. Finetuning a multitask BERT for downstream tasks. Stanford CS224N Default Project (mentor: Khosla, S.).
Resource type: Journal Article
BibTeX citation key: anon.40
Categories: General
Creators: Gu, Khosla
URLs: https://www.semant ... 7eebe80e087b6f01a3
Abstract
Multitask models are of interest and importance in NLP because of their ability to handle a variety of different tasks. In particular, Transformer-based models such as BERT are able to generate useful sentence-level representations, leading to strong performance on many downstream tasks. Many methods have been proposed for using BERT and its embeddings on downstream tasks, but there has been little direct comparison between these methods while holding BERT's model architecture constant. To address this gap, we perform experiments to find effective methods for building and finetuning a multitask BERT that simultaneously performs well on multiple tasks. We find that a one-size-fits-all approach is not optimal: different tasks are best served by different methods. Among the methods we test, cosine similarity works best for semantic textual similarity, while sentence concatenation works best for paraphrase detection. We also find that gradient surgery and model ensembling do not deliver significant performance gains, suggesting that training a multitask BERT may be a “natural” multitask learning problem, with few conflicting gradients between tasks.
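
To make the abstract's comparison concrete, below is a minimal PyTorch sketch of the two task heads it contrasts: a cosine-similarity head for semantic textual similarity and a concatenation head for paraphrase detection. This is an illustration based only on the abstract, not the authors' implementation; the class name, hidden size, and pretrained checkpoint are assumptions.

import torch
import torch.nn as nn
from transformers import BertModel

class MultitaskBertSketch(nn.Module):
    """Hypothetical multitask BERT with one head per downstream task."""

    def __init__(self, hidden_size=768):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        # Paraphrase detection: linear classifier over the concatenated pair embedding.
        self.paraphrase_head = nn.Linear(2 * hidden_size, 1)

    def embed(self, input_ids, attention_mask):
        # Use the pooled [CLS] output as the sentence-level representation.
        output = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        return output.pooler_output

    def predict_similarity(self, ids_a, mask_a, ids_b, mask_b):
        # Semantic textual similarity: cosine similarity of the two sentence embeddings.
        return torch.nn.functional.cosine_similarity(
            self.embed(ids_a, mask_a), self.embed(ids_b, mask_b)
        )

    def predict_paraphrase(self, ids_a, mask_a, ids_b, mask_b):
        # Paraphrase detection: concatenate the two embeddings, then classify.
        pair = torch.cat([self.embed(ids_a, mask_a), self.embed(ids_b, mask_b)], dim=-1)
        return self.paraphrase_head(pair)  # logit; apply sigmoid for a probability

Under this sketch, the cosine head yields a score in [-1, 1] that can be rescaled to the STS range, while the concatenation head yields a logit for binary paraphrase classification.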
  
Notes
[Online; accessed 1. Jun. 2024]
  