Stephens, D. Enhancing minBert Embeddings for Multiple Downstream Tasks.
Abstract
Bidirectional Encoder Representations from Transformers (BERT) is a transformer-based model that generates contextual word embeddings. Starting with a minimal implementation of BERT, called minbert, I implemented Multi-head Self-Attention, the Transformer Layer, and a portion of the Adam stochastic optimization method. My goal was to enhance this base so that it produces robust and generalizable embeddings that perform well on multiple downstream tasks: sentiment analysis, paraphrase detection, and semantic textual similarity. The completed base implementation achieved 38.5% accuracy, 37.5% accuracy, and -0.074 correlation respectively (overall average of 0.229) on holdout datasets for these tasks. After further pre-training on task-specific data with a masked language model objective, fine-tuning with cosine embedding loss, applying a learning rate decay schedule, and tuning hyperparameters, my final model achieved 52.6% accuracy, 59.3% accuracy, and 0.418 correlation respectively (overall average of 0.512) on the same datasets, a 124% relative increase in the overall average over the base implementation.
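The overall averages and the 124% figure can be checked with a few lines of arithmetic. The sketch below assumes the overall score is the unweighted mean of the two accuracies (as fractions) and the correlation, which is consistent with the reported numbers. The function names `overall_average` and `relative_increase` are hypothetical and used only for illustration.

```python
def overall_average(sentiment_acc, paraphrase_acc, sts_corr):
    """Unweighted mean of the three task scores (assumed aggregation)."""
    return (sentiment_acc + paraphrase_acc + sts_corr) / 3


def relative_increase(baseline, final):
    """Relative change of `final` over `baseline`, as a fraction."""
    return (final - baseline) / baseline


# Scores reported in the abstract, with accuracies written as fractions.
base_avg = overall_average(0.385, 0.375, -0.074)
final_avg = overall_average(0.526, 0.593, 0.418)

# Percentage gain of the final model's average over the base average.
gain_pct = 100 * relative_increase(base_avg, final_avg)

print(f"base average:  {base_avg:.3f}")
print(f"final average: {final_avg:.3f}")
print(f"relative gain: {gain_pct:.0f}%")
```

Under this assumption the base and final averages round to 0.229 and 0.512, and the relative gain rounds to 124%.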
Notes
[Online; accessed 1. Jun. 2024]