[Tutorial] A Step-by-Step Tutorial on BERT-based Sentence Classification of Large Corpora

In English only for the time being

by Senmiao Yang

Transformer-based sentence classification is now the method of choice when it comes to finely annotating large corpora.

In a recent paper by Do, S., Ollion, É., & Shen, R. (2022), the authors show that the annotations are oftentimes on a par with those of a skilled human. The paper comes with a Python package, AugmentedSocialScientist, which leverages BERT models (Bidirectional Encoder Representations from Transformers) to train highly accurate and proficient automated classifiers.

This tutorial provides a step-by-step guide for using the package. It supplements the already existing Colab notebook, which briefly explains how to run the code.

Update: Find the slides of our tutorial at IC2S2 2023.

Automatic annotation can be divided into the following steps:

1. Corpus Preparation
2. Manual Annotations
3. Model Training & Performance Evaluations
4. Sentence Classification (based on predictions)

Prerequisites

CPU or GPU?

Automatic annotation can be done using either a CPU or a GPU. The practical difference between the two is that GPUs handle the highly parallel computations involved in training and prediction much faster. It is thus recommended to use a GPU if you have access to one.

CPUs are normally considered less efficient for compute-intensive machine learning tasks, but they remain a cost-effective solution since they power everyone’s computer. Moreover, if your data contains sensitive information, or if uploading it online would violate privacy or security regulations, using a CPU can be the more suitable alternative, as it does not force you to rely on a cloud server, where GPU resources are often located.

To run the task on your personal computer using a CPU, make sure that Anaconda is installed, which gives you Python along with IDEs (integrated development environments) such as Jupyter Notebook.

After conducting a series of experiments using datasets of different sizes, we recommend using GPUs, as the process is time-consuming and resource-intensive.

For instance, using a basic GPU, we managed to train a model with around 1 300 annotated sentences in around 20 minutes, and to use it to generate predictions on more than 145 000 new sentences in around 1 hour 30 minutes. The same operations would take far longer on a CPU, and might not complete at all.


To use GPU resources for better computing efficiency, there are two major options:

Google Colab

Google Colab is a web-based platform for Python coding. One of its benefits is that it provides free access to GPU resources. All you need is a Google account.

Before running any machine learning code in Colab, it’s crucial to make sure that GPU resources are available:

  • Select “Runtime” from the top menu and then “Change runtime type”;
  • In the pop-up window that appears, select “GPU” under “Hardware accelerator”.

This ensures a runtime environment with GPU acceleration. In addition, you can run the following code in Colab to make sure your runtime is using a GPU accelerator:

from torch import cuda
cuda.get_device_name(0)  # returns the name of the GPU allocated to the runtime
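
Note that cuda.get_device_name(0) will raise an error if no GPU has been allocated. As a gentler check, here is a minimal sketch using only standard PyTorch calls, which falls back to CPU when no GPU is detected (the device variable name is purely illustrative):

import torch

# Pick the GPU if one is available, otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Running on: {device}")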

⚠️

Precautions

1. Free usage limits: Google Colab doesn’t specify how many resources are allocated to free users. If you are running a computing task on a very large dataset, Colab will terminate your runtime as soon as the free quota runs out, and you will lose all your computing progress.
2. Privacy / GDPR-related concerns: running computing tasks on cloud services such as Google means that you would have to upload your data on their servers. If your data contains sensitive information, or doing so would violate local regulations on personal information protection, avoid using Colab.

Furthermore, Google Colab notebooks are designed to terminate automatically if users are inactive for more than 90 minutes. Additionally, each instance has a maximum lifetime of 12 hours. If you’re running a time-consuming training task on Colab, it would be inconvenient if your session is terminated when you walk away for a coffee break. To avoid this, you can use the following method.

If you’re using a Chrome-based browser (Google Chrome or Chromium), you can right-click the page, select Inspect, click Console in the panel that opens, and enter the following JavaScript code:

function ConnectButton(){
    console.log("Connect pushed"); 
    document.querySelector("#top-toolbar > colab-connect-button").shadowRoot.querySelector("#connect").click() 
}
setInterval(ConnectButton,60000);

It defines a JavaScript function called ConnectButton() that clicks on the “Connect” button every 60 seconds on a Google Colab notebook, so that it keeps the notebook in active mode. (Note that Colab is constantly updated to prevent such functions, so this handy trick may not be effective in the future)

Institutional or paid resources

If your institution provides access to paid computing resources, that is a better solution, both in terms of computing power and of privacy. Another possibility is to rent GPU resources from commercial providers, but check their Terms & Conditions first to rule out privacy issues.

Typically, these resources come with a Python environment and Jupyter Notebook, as well as the necessary packages for machine learning activities, so you wouldn’t need to worry about how to install them.

Packages

You should have the following packages installed:

  • Our package AugmentedSocialScientist for sentence classification:
!git clone https://github.com/rubingshen/AugmentedSocialScientist.git  
!pip install ./AugmentedSocialScientist/package/
  • Importing and managing datasets: pandas and numpy
import pandas as pd
import numpy as np

pd.options.display.max_colwidth=None
pd.options.display.max_rows=100

Step 1: Corpus Preparation

Before using BERT to annotate a large corpus, it is important to properly prepare and clean the corpus. This may involve steps such as:

  • Splitting the corpus into units of your chosen level (for example, sentences or paragraphs)
  • Removing extraneous characters or formatting

Text segmentation

According to the BERT model specification, the maximum input sequence length is 512 tokens. This is a significant hurdle for document types that commonly exceed this limit. A common approach to dealing with long documents is to split them into sentence or paragraph segments before feeding them to the model.

There are several ways to do this:

RegEx

Short for Regular Expression, it is a sequence of characters that forms a search pattern, usually used in text processing for identifying and matching specific patterns of text. There are a lot of online resources on how to use RegEx to extract information from a text string:

  1. An interactive tutorial with exercises (in English)
  2. We also made a detailed explanation of how to use it in R here (in French).

Since sentence structures can differ greatly across languages, it’s recommended to search online for a RegEx solution suited to your specific use case.
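
As a minimal illustration (a sketch for English text, assuming sentences end with '.', '!' or '?' followed by whitespace), a regex-based split in Python could look like this; abbreviations, decimal numbers and other edge cases would require a more careful pattern:

import re

text = "Dr. Smith arrived. Did he speak? Yes! The crowd cheered."

# Naive rule: split after '.', '!' or '?' when followed by whitespace.
# Note how the abbreviation "Dr." is wrongly treated as a sentence boundary.
sentences = re.split(r'(?<=[.!?])\s+', text)
print(sentences)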

NLP Tokenizers (sentence-level segmentation as an example)

Regexes are fast, but NLP tokenizers are generally more accurate. Tokenizers are tools used in Natural Language Processing to split text documents into smaller pieces, such as words or sentences. Many tokenizers are freely available, including NLTK and spaCy.

spaCy is a popular open-source library for Natural Language Processing, which has a built-in tokenizer that can be used to split long documents into sentences. Below is an example of how to do it (for an English corpus):

import pandas as pd
import spacy
# loading model
nlp = spacy.load('en_core_web_sm')
# loading datasets
data = pd.read_csv("corpus.csv")
# sentence segmentation
sentences = [] # create an empty list
for i in data['text']:
  doc = nlp(i) # Process each line of text provided as input
  for sent in doc.sents: # Iterate over each sentence in the article
    sentences.append(sent.text)  # Add the sentence text to the sentences list

In this chunk of code:

  • We first import the spacy library and load an instance of the en_core_web_sm model.
  • Then, we iterate through the text column in the data dataframe and process each one using the nlp instance to create a doc object.
  • We then iterate through the sentences in each doc object using the sents attribute to tokenize the sentences.
  • We extract the text of each sentence using the text attribute, and append it to the empty sentences list created before.

ℹ️

Additional information

1. Supported languages: you can find more detailed information on the languages supported by spaCy on its official documentation page. It also provides multi-language support. Note that some languages, including Chinese, Japanese, Korean, Thai and Russian, might require external dependencies.
2. Model installation: the above example assumes that you’ve already installed the model en_core_web_sm for English sentence segmentation. If not, you can easily install it by running the following command in your terminal:
$ python -m spacy download en_core_web_sm

The time required to run the segmentation with spaCy depends on the size of your dataset (the number of texts it contains, as well as the length of each text). To give a rough estimate: using the CPU-only version of Google Colab, it took around 11 minutes to split 4 200 political articles from the New York Times into 159 857 sentences.


Tips on splitting text while keeping metadata

The aforementioned method allows you to split text into sentences, but only returns a list of segmented text for model training, testing, and predictions. If you want to conduct a social scientific analysis that incorporates metadata alongside the results, chances are that a mere list of sentences will be inadequate.

Let’s assume that you have a corpus of tweets and a series of metadata such as username, time, number of replies/likes/mentions, etc. You can use the following code to segment the texts while keeping the metadata in the same data frame:

import pandas as pd
import spacy
# loading datasets
data = pd.read_csv("corpus.csv")
# loading model
nlp = spacy.load('en_core_web_sm')
# create a new dataframe and split texts into sentences
corpus_bis = data.assign(sentence=data['text'].apply(lambda x: [sent.text for sent in nlp(x).sents]))
corpus_bis = corpus_bis.explode('sentence')

In this chunk of code:

  • It starts identically to the previous one;
  • Then, we use apply and a lambda function to process each line of text in the data['text'] column. The lambda function iterates through the sentences in each line of text, given by nlp(x).sents, and extracts their text with sent.text.
  • These sentences are returned as a list, which is assigned to a new column sentence in the corpus_bis data frame using the assign() method.
  • Finally, we use the explode() method to expand the sentence column into multiple rows, so that each row in the output corresponds to a single sentence.

R support

If you’re more comfortable / familiar with R, it’s also possible to use spaCy tokenizers:

# install the package "spacyr" if you haven't done it yet
# install.packages("spacyr")

library(tidyverse)
library(spacyr)
library(readr)

# import data
data <- read_csv("corpus.csv")

# load the model
spacy_initialize("en_core_web_md")

# sentence segmentation
system.time({
  corpus_bis <- data %>% 
    mutate(sentence = spacy_tokenize(data$text, "sentence")) %>% 
    unnest(sentence) %>% 
    unnest(sentence)
})

# release the resources used by spacyr package
spacy_finalize()

This code preprocesses text data using the tidyverse and spacyr R packages. Specifically:

  • It first reads the data and initializes the en_core_web_md model from the spaCy library using the spacy_initialize() function of the spacyr package.
  • Then, it executes a sentence segmentation on the text while timing the process with the system.time() function.
  • The mutate() function is used to add a new column to the data dataframe which contains the text of each sentence. This is done by using the spacy_tokenize() function from the spacyr package to split the text of each row in the data$text column into sentences.
  • The output of this function is a list of lists, where each sublist represents a sentence. We use the %>% operator to pipe this output through the unnest() function twice, which flattens the list step by step until there is one sentence per row; the result is stored in corpus_bis.
  • Finally, the spacy_finalize() function is used to release the resources used by the spacyr package after all the text data has been processed.

Note:

  • The spacyr package in R relies on spaCy in Python. You still need to have Python installed on your computer/the environment you’re using, as well as the selected spaCy model(s), installed through Python in the terminal.
  • Tokenizing large datasets could be very time-consuming. It is therefore recommended to use GPU resources.

Text cleaning

More traditional approaches in textual analysis suggest removing ‘unnecessary information’, such as punctuation and stopwords. Whether this helps can only be settled empirically and depends on the specific task used to measure the trained model’s performance, but for methods leveraging BERT models, removing punctuation and stopwords does not, in general, improve the results.

Theoretically, BERT models learn to build text representations from semantic context. That contextual information includes stopwords, which can be essential to a sentence’s meaning, and the same goes for punctuation (a question mark, for example, can change the overall meaning of a sentence). Discarding stopwords and punctuation would therefore remove contextual nuances that BERT models rely on to produce their best results.

Furthermore, the practice of blindly removing stopwords that could contain useful information has become a norm in traditional methods such as LDA (topic modelling). However, its effectiveness is increasingly being scrutinized, particularly for social science research.

Some optional preprocessing might still be useful: removing irrelevant information from your documents can help (e.g. URLs and/or hashtags from the tweets you’ve collected, or HTML elements in texts scraped from websites). If these elements are not what you’re looking for, you can use different functions in R or Python with RegEx to remove them (example in Python and a whole set of comprehensive tutorials on text analysis in R — note that not all preprocessing in these examples is necessary).
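
For instance, a minimal Python sketch for stripping URLs, hashtags and leftover HTML tags from a text column could look like the following (the column name text and the patterns are illustrative and should be adapted to your own corpus):

import re
import pandas as pd

data = pd.read_csv("corpus.csv")  # assumes a column named 'text'

def light_clean(s):
    s = re.sub(r'<[^>]+>', ' ', s)         # drop HTML tags
    s = re.sub(r'https?://\S+', ' ', s)    # drop URLs
    s = re.sub(r'#\w+', ' ', s)            # drop hashtags
    return re.sub(r'\s+', ' ', s).strip()  # collapse extra whitespace

data['text'] = data['text'].astype(str).apply(light_clean)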

Step 2: Manual Annotations

Prior to model training, it is essential to label (“annotate”) data appropriately. These annotations will be fed to the algorithm so that it can identify patterns in the data during training. The general goal is to obtain a dataset, normally in CSV format, with at least two columns: the text and its label.
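
As a quick sanity check, the sketch below loads such a file with pandas and counts how many annotations each label has received (the file and column names annotations.csv, text and label are hypothetical; adapt them to your own data):

import pandas as pd

# Hypothetical annotation file with one text and one label per row
annotated = pd.read_csv("annotations.csv")
print(annotated.columns.tolist())          # expected: ['text', 'label']
print(annotated['label'].value_counts())   # number of annotations per label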

Various factors, including the total number of labels and the consistency of manual annotations, can have a significant impact on the accuracy of the training process. It is therefore crucial to reflect on the labels before annotating, and this also highlights the importance of a user-friendly interface, since tagging data for extended periods can lead to erroneous annotations or mislabelling due to mental exhaustion.

This step entails two important elements to consider:
1. Evaluating the quantity and quality of manual annotations
2. Choosing the appropriate annotation tool

Quantity and quality of annotations

Quantity

When discussing “quantity”, it is important to consider two aspects: the number of labels, and the number of manual annotations needed to train and test the model:

  1. A large number of labels is, theoretically, not problematic, but you’ll need to annotate a ‘sufficient’ amount of text for each label so that the algorithm can identify potential patterns within unstructured data;
  2. A larger number of annotated texts usually (but not necessarily) results in better performance of the trained model. But:
    a) Since the primary objective is to gather a ‘sufficient’ amount of annotations to produce reliable results without squandering time, annotating a very large amount of text can defeat the purpose.
    b) It also implies more resources to train the model and compute predictions, which can be a problem if you have many annotations but no access to GPU resources.

Quality

The quality of annotations encompasses various aspects, including the extent to which the labels you use match your research objectives and enable you to produce consistent annotations. Note that a label that works in theory does not necessarily perform well in practice once you start annotating your corpus. Consequently, it is common to engage in a recursive process of reflecting on definitions and criteria and reevaluating annotation protocols.

If you’re not sure about the quality of the labels and the consistency of your annotations, you can try to train the model with a few hundred annotations and then make predictions on new data. Inspecting their accuracy can give you more insightful information before you make more annotations to improve the performance of the model.

Examples

For the number of manual annotations needed:

When training a classifier to differentiate clickbait and non-clickbait titles (cf. example used in the next section on model training), we used a total of 700 tagged texts, with 500 rows in the training set and 200 rows in the testing set.

For designing the labels and making consistent annotations:

For an ongoing work, when training a classifier to identify the “off-the-record” practices in political journalism with the New York Times as a case study, we initially referred to AP Stylebook (2022)‘s definition of Attribution as our labels:

  • Background: The information can be published, but only under conditions negotiated with the source. Generally, the sources do not want their names published, but will agree to a description of their position.
  • Deep background: The information can be used but without attribution. The source does not want to be identified in any way, even on condition of anonymity.

During the manual annotation process, we encountered challenges in distinguishing ‘Background’ from ‘Deep Background’ based solely on journalists’ written formulations, which are often inconsistent, especially when we tried to annotate sentences in strict conformity with the newsroom standards set by AP. Additionally, the criteria for annotation differed depending on the subject, which led to non-uniform labelling. In light of this inconsistency, we decided to redefine our labels:

  • Unnamed implicated sources: All sourcing practices in a sentence without specifying the name of the person(s) interviewed or cited, excluding external/analytical opinions from experts, scholars, analysts, poll numbers, etc.
  • Unnamed external opinions: The report uses analysis and viewpoints from scholars, experts, and analysts to frame its content, or it mentions poll numbers. However, the sources of these opinions/poll results remain unidentified.

By critically examining and reassessing the stipulated definitions, annotation criteria, and procedures, we’ve been able to generate more consistent manual annotations, which can be effectively used for model training and testing. Our example also shows that even clearly predefined theoretical concepts can pose practical difficulties when confronted with empirical data, and that unsatisfying outcomes in preliminary attempts are normal.

Annotation tools

There are several annotation tools to choose from according to your personal preferences:

Spreadsheets (LibreOffice, Microsoft Excel, Google Sheets, etc.)

The most commonly used and intuitive interface for text tagging and annotation is a spreadsheet. It requires minimal technical investment: simple to operate, it usually runs on your personal computer (Google Sheets excepted), thus avoiding privacy-related issues.

The only disadvantage, in comparison with other options, may be the tedious task of manually tagging texts line-by-line in a spreadsheet for extended periods of time, due to its less user-friendly format.

Open-source annotation tools

There are numerous powerful annotation tools available on the internet that are free and open source. Some of the most popular ones include Label Studio and Universal Data Tool. Each tool has its own unique features and functions, but they are generally easy to use and implement on your computer. Most importantly, they feature a more user-friendly interface that helps to mitigate the fatigue-inducing process of annotating text on a plain spreadsheet for hours.

Note that some tools are cloud-based, which requires you to upload data to their servers and raises the same privacy-related concerns. Besides, these tools offer different export methods and formats, so it’s recommended to make a few annotations, export the dataset and import it into Python, to ensure the exported file is read correctly. (An unexpected CSV separator, for example, could cause import problems, and it would be too late to notice it after you’ve already made 1 500 annotations.)

Python packages

There are also Python packages that allow you to manage the text tagging process directly inside your IDE or Jupyter Notebook. PigeonXT, for instance, is a straightforward widget that lets you quickly annotate a set of unlabelled examples directly in your Jupyter notebook. Below is a quick demonstration:

  • Install the package
!pip install pigeonXT-jupyter
  • Import your data
import pandas as pd
df = pd.read_csv("corpus.csv")
text = df["text"].astype(str).tolist()
  • Create labels and annotation interface
from pigeonXT import annotate
annotations = annotate(
        text,
        options=['others', 'unnamed implicated sources', 'unnamed external opinions']  # label names from the example above
    )

A widget will pop up below the cell for you to annotate your text, line by line:

You can click the prev or next buttons to go back and change annotations you’ve already made.

  • Inspect and export the annotated dataset
    • The annotated dataset is stored, as indicated in the previous code, in the annotations object:
[Screenshot: the annotations object displayed as a table]
  • You can also merge your original datasets with the labels you’ve created by doing a left join:
df_labeled = df.merge(annotations, on='text', how="left")
[Screenshot: the merged data frame with the new label column]

⚠️

Caution

1. If you modify the annotations object later on, your tagged dataset will be overwritten and your hard work lost.
2. Likewise, if you’re working in Google Colab and the runtime is terminated due to inactivity, the object will be deleted as well.

Simply put: while its interface is arguably more user-friendly than a plain spreadsheet, you risk losing hours of work through an unintentional mistake.
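
One simple precaution is to write the labelled data frame to disk as soon as it is created, so that a lost runtime or an overwritten object does not mean lost work (the file name below is just an example):

# Save the labelled data frame (created above with the left join) right away
df_labeled.to_csv("annotations_backup.csv", index=False)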

Step 3: Model Training & Performance Evaluations

Dataset preparation

In order to effectively train, test and then apply a machine learning model such as BERT, the corpus needs to be divided into three separate datasets. Once the documents have been split and cleaned, they should therefore be organized as follows:

  • A training set, containing text inputs as well as your manual annotations / labels for each text, will be used to train the BERT model to recognize different patterns in the data;
  • A testing set, containing the same type of information as the training set (but not the texts used to train the model), will be used to evaluate the performance of the BERT model;
  • A prediction set, containing only texts that the model has not been trained or tested on, will be used to apply the trained BERT model to new data.

Once annotation is complete, you can filter the unannotated texts into the prediction set, and keep the annotated texts to be split into training and testing sets, as sketched below.
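
Assuming your merged data frame has a label column in which unannotated rows are empty (the names df_labeled and label follow the annotation example above and are otherwise illustrative), a minimal sketch could be:

# Rows without a label form the prediction set;
# the labelled rows are kept for training and testing
pred_set = df_labeled[df_labeled['label'].isna()].copy()
annotated_set = df_labeled.dropna(subset=['label']).copy()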

There are various ways to split the dataset into training and testing sets. One commonly used method is the train_test_split() function from scikit-learn library in Python.

The key parameter of this function is the size of the train and test sets, typically expressed as a fraction between 0 and 1 that specifies the proportion of the overall dataset used for training or testing. For instance, a training size of 0.67 (i.e. 67%) means that the remaining 33% will be allocated to the test set, i.e. used to assess the quality of the trained classifier.

from sklearn.model_selection import train_test_split

train_data, test_data = train_test_split(data, train_size=0.67)
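
If some labels are rare, you may also want to preserve the label proportions in both sets. scikit-learn supports this through the stratify argument; a hedged variant of the call above (the column name label is illustrative, and random_state makes the split reproducible):

from sklearn.model_selection import train_test_split

train_data, test_data = train_test_split(
    data,
    train_size=0.67,
    stratify=data['label'],  # keep the same label distribution in both sets
    random_state=42          # reproducible split
)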

Pre-trained model selection

The AugmentedSocialScientist package includes three modules that support different languages:

  • BERT, for English-language texts:
from AugmentedSocialScientist import bert
  • CamemBERT, for French-language texts:
from AugmentedSocialScientist import camembert
  • XLM-RoBERTa (Goyal et al. 2020): a multilingual model that can perform NLP tasks on 100 different languages (see Appendix A in the paper for a list).
from AugmentedSocialScientist import xlmroberta

The syntax is the same for all three modules; you only need to prefix the encoding, training and prediction functions with the name of the module you are using.

For example, you can use the functions xlmroberta.encode, xlmroberta.run_training and xlmroberta.predict_with_model with the same syntax as for BERT and CamemBERT.
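
As an illustration, a French corpus would be handled with exactly the same calls, only prefixed by the camembert module (a sketch: fr_train and fr_test are hypothetical data frames with text and label columns, structured like the clickbait example below):

from AugmentedSocialScientist import camembert

# Same syntax as bert and xlmroberta; only the module prefix changes
train_loader = camembert.encode(fr_train.text.values, fr_train.label.values)
test_loader = camembert.encode(fr_test.text.values, fr_test.label.values)
score = camembert.run_training(train_loader, test_loader,
                               n_epochs=2, lr=5e-5,
                               save_model_as='my_french_model')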

English Text Classifier Example: Clickbait Detection

Loading pre-trained model

For this example, we use data from Chakraborty et al. 2016 to train a classifier that distinguishes between clickbait and non-clickbait titles.

from AugmentedSocialScientist import bert

Loading data

cb_train = pd.read_csv('./AugmentedSocialScientist/datasets/english/clickbait_train.csv')
cb_test = pd.read_csv('./AugmentedSocialScientist/datasets/english/clickbait_test.csv')

Inspect loaded data

cb_train
cb_test

Both datasets have the following columns:

  • headline, which is the text input
  • is_clickbait, which is the label input

Training model

  • Encode the training data
train_loader = bert.encode(cb_train.headline.values, cb_train.is_clickbait.values)
  • Encode the testing data
test_loader = bert.encode(cb_test.headline.values, cb_test.is_clickbait.values)

The following command trains, validates, and saves the model.

score = bert.run_training(train_loader, 
                          test_loader, 
                          n_epochs=2, 
                          lr=5e-5, 
                          random_state=42,
                          save_model_as='clickbait')

To improve your model, you will have to tune some hyperparameters (a small comparison sketch follows this list):

  • n_epochs: an integer that specifies the number of epochs to train the BERT model. An epoch is a single pass through the entire training dataset, and increasing the number of epochs would require longer running time;
  • lr: a float that specifies the learning rate to use for the BERT model optimizer;
  • random_state: an integer that is used to set the random seed for reproducibility. This allows you to reproduce the same results each time you run the code.
  • save_model_as: the name of model saving folder. The model will be saved at ./models/<model_name>. If you don’t want to save the model after training, set this parameter to None.
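
If you are unsure which values to pick, a simple (if time-consuming) approach is to train with a few different settings on the same train and test loaders and compare the reported scores; a minimal sketch:

# Try a few epoch / learning-rate settings and inspect the reported metrics.
# Depending on the package version, `score` may be a set of per-class metrics
# rather than a single number.
results = {}
for n_epochs in (2, 3):
    for lr in (2e-5, 5e-5):
        results[(n_epochs, lr)] = bert.run_training(
            train_loader, test_loader,
            n_epochs=n_epochs, lr=lr,
            random_state=42,
            save_model_as=None)  # don't save every trial run

for params, score in results.items():
    print(params, score)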

Once the model has completed its training phase, it calculates the following performance metrics:

Performance metrics interpretation

[Screenshot: classification report with precision, recall, F1-score and support for each class]

To interpret these performance metrics:

  • Precision: the proportion of instances that the model classified as positive that are actually positive (in other terms, the proportion of true positives among true and false positives).
  • Recall: the proportion of true positive instances that were correctly classified by the model (in other terms, the proportion of true positives among true positives and false negatives).
  • F1-Score: a measure combining precision and recall (their harmonic mean; the formulas are written out after this list). It provides a single score that balances the trade-off between the two, ranging from 0 (worst) to 1 (best).
  • Support: the number of instances in the testing set that belong to each class. It is useful for getting a sense of class imbalance in the data and whether the model performs well on rare classes.
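
In terms of true positives (TP), false positives (FP) and false negatives (FN), these definitions read:

\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}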

A more detailed explanation of these concepts can be found here.

We mainly use the F1-score (between 0 and 1) to assess the quality of the model. And since, in our case, our classifiers performed well (> .8), we decided to save them.

ℹ️

Note on metrics

Note that while these measures are useful for gaining insight into a model’s performance, they should only be taken as indicative.

The foremost criterion for evaluating a trained model remains your own qualitative assessment of its predictions on new data.

Step 4: Sentence Classification (based on the trained model)

Now that we’ve successfully trained our model, let’s check how it performs on the prediction set.

  • Load unlabelled data for prediction, and inspect the dataset.
cb_pred = pd.read_csv('./AugmentedSocialScientist/datasets/english/clickbait_pred.csv')
cb_pred
  • Encode the unlabelled texts and make predictions using the saved model
pred_loader = bert.encode(cb_pred.headline.values)  # no labels are passed for prediction
pred_proba = bert.predict_with_model(pred_loader, model_path='./models/clickbait')

The model returns, for each headline in the unlabelled dataset, the probability of belonging to each category (0: not clickbait; 1: clickbait).

pred_proba
[Screenshot: the pred_proba probability matrix]
  • Store the predicted category and probability to the data frame
    • np.argmax() is a NumPy function that returns the index of the maximum value in an array along a specified axis (in this case, axis 1 which represents the columns of the pred_proba matrix). By using the axis=1 argument, it returns the index of the column containing the highest probability value for each row.
    • np.max() is another function that returns the maximum value in an array along a specified axis. By using the axis=1 argument, it returns the maximum value for each row, which corresponds to the highest predicted probability value.
cb_pred['pred_label'] = np.argmax(pred_proba, axis=1)
cb_pred['pred_proba'] = np.max(pred_proba, axis=1)

These two lines are used to add columns to a Pandas data frame cb_pred. Specifically, they are adding a predicted label column and a predicted probability column to the data frame based on the predicted probability matrix pred_proba.

  • We can inspect the prediction results
for i in range(len(cb_pred)):
    print(f"{cb_pred.loc[i,'headline']}")
    print(f"Is clickbait: {bool(cb_pred.loc[i,'pred_label'])}, with a probability of {cb_pred.loc[i,'pred_proba']*100:.0f}%")
    print()

Note that the bool() function is used to convert the predicted label value (either 0 or 1) into a boolean indicating whether the headline is clickbait. If your labels are not binary as in our example, this conversion will not suit your use case (see the sketch below).
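
For a multi-class tag, a hedged sketch of the same inspection would map the predicted column index back to a label name instead; the list below is purely illustrative and must follow the same order in which the labels were encoded during training:

# Hypothetical label names for a three-class task; the order must match
# the label encoding used when the model was trained
label_names = ['others', 'unnamed implicated sources', 'unnamed external opinions']
pred_label_names = [label_names[i] for i in np.argmax(pred_proba, axis=1)]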

Below is a demonstration of the prediction results:

[Screenshot: sample prediction output]
  • If you want to export the result as a CSV file for further analysis, you can use the to_csv() function from the pandas library:
cb_pred.to_csv("cb_pred.csv")

The exported file will contain the classified texts, the predicted labels, and their corresponding probabilities.