# Finetuning a Transformer model

In this notebook, we train a **RoBERTa** model for sequence classification on a **sentiment analysis** task using HuggingFace's [transformers](https://github.com/huggingface/transformers) library.

Here we perform **full finetuning** of a pre-trained model, updating all of its weights, using the standard HuggingFace setup. Refer to [this notebook](https://colab.research.google.com/drive/1QR2Vy4mJFUi5r3HaQVROY3dQ9QMTJqhR?usp=sharing) for the same guide using _AdapterHub_ and Adapters.

For training, we use the [movie review dataset by Pang and Lee (2005)](http://www.cs.cornell.edu/people/pabo/movie-review-data/). It contains movie reviews from Rotten Tomatoes, each classified as either positive or negative. We download the dataset via HuggingFace's [datasets](https://github.com/huggingface/datasets) library.

## Installation

First, let's install the required libraries:

In [None]:
!pip install transformers
!pip install datasets

## Dataset Preprocessing

Before we start to train our model, we first prepare the training data. Our training dataset can be loaded via HuggingFace `datasets` using one line of code:

In [None]:
from datasets import load_dataset

dataset = load_dataset("rotten_tomatoes")
dataset.num_rows

{'test': 1066, 'train': 8530, 'validation': 1066}
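As a quick sanity check on the split sizes above, the following sketch computes each split's share of the full dataset. The counts are copied from the output; the variable names are only illustrative.

```python
# Split sizes copied from the `dataset.num_rows` output above.
num_rows = {"test": 1066, "train": 8530, "validation": 1066}

total = sum(num_rows.values())
# Fraction of all samples that falls into each split, rounded for readability.
fractions = {split: round(n / total, 2) for split, n in num_rows.items()}

print(total)      # 10662
print(fractions)  # {'test': 0.1, 'train': 0.8, 'validation': 0.1}
```

So the data follows a roughly 80/10/10 train/validation/test split.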

Every dataset sample has an input text and a binary label:

In [None]:
dataset['train'][0]

{'label': 1,
 'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'}

Next, we need to encode all dataset samples into valid inputs for our Transformer model. Since we want to train on `roberta-base`, we load the corresponding `RobertaTokenizer`. Using `dataset.map()`, we can pass the full dataset through the tokenizer in batches. Each review is truncated or padded to 80 tokens so that all inputs have the same length:

In [None]:
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")

def encode_batch(batch):
  """Encodes a batch of input data using the model tokenizer."""
  return tokenizer(batch["text"], max_length=80, truncation=True, padding="max_length")

# Encode the input data
dataset = dataset.map(encode_batch, batched=True)
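# The transformers model expects the target class column to be named "labels".
# rename_column returns a new dataset, so we reassign the result
# (the in-place rename_column_ is deprecated in newer datasets releases).
dataset = dataset.rename_column("label", "labels")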
# Transform to pytorch tensors and only output the required columns
dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])

{'test': ['attention_mask', 'input_ids', 'labels', 'text'],
 'train': ['attention_mask', 'input_ids', 'labels', 'text'],
 'validation': ['attention_mask', 'input_ids', 'labels', 'text']}
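
To make the effect of `max_length`, `truncation=True` and `padding="max_length"` concrete, here is a simplified sketch in plain Python. It is not the tokenizer's actual implementation: `pad_or_truncate` is a hypothetical helper, the token ids are made up, and the real tokenizer additionally keeps the closing special token in place when it truncates. The pad id of 1 is meant to mirror RoBERTa's `<pad>` token.

```python
from typing import Dict, List

PAD_ID = 1  # mirrors RoBERTa's <pad> token id; treat as an assumption in this toy example


def pad_or_truncate(ids: List[int], max_length: int, pad_id: int = PAD_ID) -> Dict[str, List[int]]:
    """Cut or pad a list of token ids to exactly `max_length` entries."""
    ids = ids[:max_length]         # truncation: drop everything past max_length
    n_pad = max_length - len(ids)  # padding: number of slots that are still empty
    return {
        "input_ids": ids + [pad_id] * n_pad,
        # 1 marks real tokens, 0 marks padding the model should ignore
        "attention_mask": [1] * len(ids) + [0] * n_pad,
    }


short = pad_or_truncate([0, 100, 200, 2], max_length=6)
long = pad_or_truncate([0, 100, 200, 300, 400, 500, 600, 2], max_length=6)
print(short)
print(long)
```

Padding every sample to a fixed length keeps the example simple, at the cost of some wasted computation on short reviews.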

Now we're ready to train our model...

## Training

We use a pre-trained RoBERTa model from HuggingFace. Since we want to do sentiment analysis, we use the `RobertaForSequenceClassification` model class, which adds a sequence classification head on top of the pre-trained encoder. The config object lets us set the number of classes and their display labels.

In [None]:
from transformers import RobertaConfig, RobertaForSequenceClassification

config = RobertaConfig.from_pretrained(
    "roberta-base",
    num_labels=2,
    id2label={ 0: "👎", 1: "👍"},
)
model = RobertaForSequenceClassification.from_pretrained(
    "roberta-base",
    config=config,
)

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForSequenceClassification: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.weight', 'classifie

For training, we use the `Trainer` class built into `transformers`. We configure the training process with a `TrainingArguments` object and define a function that calculates the accuracy during evaluation. We pass both, together with the training and validation splits of our dataset, to the trainer instance:

In [None]:
import numpy as np
from transformers import TrainingArguments, Trainer, EvalPrediction

training_args = TrainingArguments(
    learning_rate=5e-5,
    num_train_epochs=3,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    logging_steps=200,
    output_dir="./training_output",
    overwrite_output_dir=True,
)

def compute_accuracy(p: EvalPrediction):
  preds = np.argmax(p.predictions, axis=1)
  return {"acc": (preds == p.label_ids).mean()}

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    compute_metrics=compute_accuracy,
)
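
The arguments above also determine how many optimizer steps the run takes and how often the loss is logged. The following back-of-the-envelope sketch assumes a single device and no gradient accumulation (neither is stated in this notebook) and takes the training set size from the `num_rows` output earlier.

```python
import math

train_samples = 8530   # size of the training split
batch_size = 32        # per_device_train_batch_size
epochs = 3             # num_train_epochs
logging_steps = 200    # logging_steps

# Assumption: one device, no gradient accumulation, last partial batch is kept.
steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs
logged_points = total_steps // logging_steps

print(steps_per_epoch)  # 267
print(total_steps)      # 801
print(logged_points)    # 4
```

Under these assumptions, the training loss would be reported four times over the whole run.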

Start the training 🚀

In [None]:
trainer.train()


Looks good! Let's evaluate our model on the validation split of the dataset to see how well it learned:

In [None]:
trainer.evaluate()

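
The accuracy reported by the evaluation comes from `compute_accuracy`: it picks the highest-scoring class for each sample and compares it with the gold label. Here is a small hand-checkable sketch of the same calculation in plain Python, using made-up logits and labels rather than real model output:

```python
# Made-up logits for four samples; columns are the scores for class 0 and class 1.
logits = [
    [2.0, -1.0],
    [0.3, 0.9],
    [-0.5, 1.5],
    [1.2, 0.4],
]
labels = [0, 1, 0, 0]  # made-up gold labels

# argmax over the class axis: index of the largest score in each row
preds = [max(range(len(row)), key=row.__getitem__) for row in logits]
# accuracy: fraction of predictions that match the gold label
accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)

print(preds)     # [0, 1, 1, 0]
print(accuracy)  # 0.75
```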

We can put our trained model into a `transformers` pipeline to be able to make new predictions conveniently:

In [None]:
from transformers import TextClassificationPipeline

classifier = TextClassificationPipeline(model=model, tokenizer=tokenizer, device=training_args.device.index)

classifier("This is awesome!")

[{'label': '👍', 'score': 0.998802125453949}]
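
The `score` in this output is a probability. For a two-class model like ours, the pipeline's post-processing essentially applies a softmax to the model's raw logits and reports the most likely class. A minimal sketch of that step, with hypothetical logits and the `id2label` mapping from the config above:

```python
import math

id2label = {0: "👎", 1: "👍"}  # same mapping as in the model config
logits = [-3.0, 3.0]           # hypothetical raw model outputs for one input

# Softmax: exponentiate (shifted by the max for numerical stability) and normalize.
shift = max(logits)
exps = [math.exp(x - shift) for x in logits]
probs = [e / sum(exps) for e in exps]

# Report the class with the highest probability, as the pipeline output does.
best = max(range(len(probs)), key=probs.__getitem__)
result = {"label": id2label[best], "score": probs[best]}
print(result)
```

A score close to 1 therefore means the logit of the predicted class is much larger than that of the other class.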

Finally, we can save the full model for later reuse:

In [None]:
model.save_pretrained("./final_model")

!ls -l final_model

total 486992
-rw-r--r-- 1 root root       635 Nov  9 17:08 config.json
-rw-r--r-- 1 root root 498674724 Nov  9 17:08 pytorch_model.bin
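
The size of `pytorch_model.bin` gives a rough idea of the parameter count. Assuming the weights are stored as 32-bit floats (4 bytes each) and ignoring the small serialization overhead, a quick estimate is:

```python
file_size_bytes = 498_674_724  # size of pytorch_model.bin from the listing above
bytes_per_param = 4            # assumption: float32 weights

approx_params = file_size_bytes // bytes_per_param
print(f"{approx_params / 1e6:.1f}M parameters")  # 124.7M parameters
```

This is in line with the roughly 125 million parameters usually quoted for `roberta-base`.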
