First of all, my disclaimer: this is not an article for data scientists. They will already have recognized the mistakes I made in the title of this post. Topic detection and sentiment analysis are classification problems in Natural Language Processing (NLP). BERT is an algorithm that solves, among others, classification problems, but of course I formulated the title to help business people understand the topic 😉 Don’t expect a nerd discussion around BERT.

Why text to numbers? Computers love numbers not text

The generic reason we must turn text into numbers is that computers, and neural networks in particular, are unable to understand words as concepts (unless you are using a keyword search solution, which is NOT text mining). To process natural language, a system for representing text is required. The standard mechanism for text representation is word vectors, where words or phrases from a given language vocabulary are mapped to vectors of real numbers. In one of my last posts, I explained how word embedding works, adapting the content to a business audience. If you missed that post, go back one step and learn why computers don’t understand text, they just love numbers. In this post, I will try to do the same for the next step in the evolution of the word embedding concept, a new approach suggested by Google in November 2018: Bidirectional Encoder Representations from Transformers (BERT).

Traditional word vectors, the origin

Some of the traditional methods that pre-date neural embeddings are Bag of Words (BoW), TF-IDF, and distributional embeddings.

The Bag of Words (BoW) vector representation is the most commonly used traditional vector representation. Each word or n-gram is linked to a vector index and marked as 0 or 1 depending on whether it occurs in a given document. See here for the full explanation.
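
For the curious, here is a minimal Python sketch of the 0/1 marking described above. The two sample sentences and the function names are my own illustrative assumptions; real toolkits add extras such as n-grams and punctuation handling.

```python
def build_vocabulary(documents):
    """Map every distinct lowercase word to a fixed vector index."""
    words = sorted({word for doc in documents for word in doc.lower().split()})
    return {word: index for index, word in enumerate(words)}


def bag_of_words(document, vocabulary):
    """Return a 0/1 vector: 1 if the vocabulary word occurs in the document."""
    present = set(document.lower().split())
    # The vocabulary dict is iterated in index order, so positions line up.
    return [1 if word in present else 0 for word in vocabulary]


# Hypothetical mini-corpus, made up for illustration
documents = ["not happy about your product", "I love your product"]
vocabulary = build_vocabulary(documents)
vectors = [bag_of_words(doc, vocabulary) for doc in documents]

print(list(vocabulary))  # ['about', 'happy', 'i', 'love', 'not', 'product', 'your']
print(vectors[0])        # [1, 1, 0, 0, 1, 1, 1]
print(vectors[1])        # [0, 0, 1, 1, 0, 1, 1]
```

Notice that word order is lost: the vector only says which words are present, which is exactly why it is called a "bag".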


TF-IDF, short for term frequency-inverse document frequency, is a numerical statistic intended to reflect how important a word or n-gram is to a document in a collection or corpus. It weights a given word according to how often it occurs in the document and how common it is across the corpus. The TF-IDF value increases proportionally to the number of times a word appears in a document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently than others. See here for the full explanation.
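
The weighting can be sketched in a few lines of Python. This is the plain textbook variant (term frequency times the logarithm of the inverse document frequency); production libraries usually apply smoothed versions, so their numbers will differ. The mini-corpus is a made-up assumption.

```python
import math


def tf_idf(term, document, corpus):
    """Textbook TF-IDF of one term in one document of a corpus."""
    tokens = document.lower().split()
    # Term frequency: share of the document's words that are this term.
    tf = tokens.count(term) / len(tokens)
    # Document frequency: number of documents containing the term.
    docs_with_term = sum(1 for doc in corpus if term in doc.lower().split())
    idf = math.log(len(corpus) / docs_with_term) if docs_with_term else 0.0
    return tf * idf


# Hypothetical mini-corpus, made up for illustration
corpus = ["the phone is broken", "the shop is closed", "the phone shop"]

score_the = tf_idf("the", corpus[0], corpus)        # appears everywhere
score_phone = tf_idf("phone", corpus[0], corpus)    # appears in 2 of 3 documents
score_broken = tf_idf("broken", corpus[0], corpus)  # appears in 1 of 3 documents
print(score_the, score_phone, score_broken)
```

A word such as "the", which appears in every document, gets a weight of zero, while the rarer "broken" gets the highest weight of the three.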


Distributional embeddings enable word vectors to capture contextual information. Each embedding vector is built from the mutual information the word shares with the other words in a given corpus. Mutual information can be computed from a global co-occurrence frequency or restricted to a given window, either sequentially or based on dependency edges. See here for the paper describing distributional embeddings.
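
As a rough illustration, the sketch below counts co-occurrences inside a sequential window and turns them into pointwise mutual information (PMI). PMI is one common choice of measure and is my assumption here, as are the sample sentences and function names; the paper may use a different formulation.

```python
import math
from collections import Counter


def cooccurrence_counts(sentences, window=1):
    """Count single words and unordered word pairs within a sliding window."""
    pair_counts, word_counts = Counter(), Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        word_counts.update(tokens)
        for i, word in enumerate(tokens):
            # Pair the word with the next `window` words to its right.
            for j in range(i + 1, min(i + window + 1, len(tokens))):
                pair_counts[frozenset((word, tokens[j]))] += 1
    return pair_counts, word_counts


def pmi(word_a, word_b, pair_counts, word_counts):
    """Pointwise mutual information: log of p(a, b) / (p(a) * p(b))."""
    joint = pair_counts[frozenset((word_a, word_b))] / sum(pair_counts.values())
    if joint == 0:
        return float("-inf")  # the two words were never seen together
    total_words = sum(word_counts.values())
    p_a = word_counts[word_a] / total_words
    p_b = word_counts[word_b] / total_words
    return math.log(joint / (p_a * p_b))


# Hypothetical mini-corpus, made up for illustration
sentences = [
    "i checked my bank account",
    "my bank account is empty",
    "the river bank is steep",
]
pair_counts, word_counts = cooccurrence_counts(sentences, window=1)
print(pmi("bank", "account", pair_counts, word_counts))  # positive: seen together often
print(pmi("bank", "empty", pair_counts, word_counts))    # -inf: never adjacent
```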

Why BERT? The prequel

Tomas Mikolov made a great contribution to solving NLP classification problems. While working at Google, he conceptualized and deployed Word2Vec. Word2Vec is an approach to NLP that transforms words into vectors and then applies linear algebra to solve specific tasks such as classification (e.g. sentiment analysis).

How does the model work in general? It is a predictive model that learns its vectors by minimizing a loss, such as the error in predicting a target word from the vectors of the surrounding context words. In plain business language, the model takes a word or an n-gram and transforms it into a vector. That vector is then plotted in a space (picture it in 3D, although real models use hundreds of dimensions). Once all the words (or n-grams, or both) are plotted in that space, it is possible to apply linear algebra to solve several natural language problems, such as classification.
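
To give a feeling for the "linear algebra" part, here is a small sketch that compares word vectors with cosine similarity, a standard way to measure how close two vectors point. The three-dimensional vectors are hand-picked assumptions for illustration only; a trained model learns them from data.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical hand-made vectors, NOT the output of a trained model
word_vectors = {
    "happy": [0.9, 0.1, 0.2],
    "glad": [0.8, 0.2, 0.1],
    "broken": [0.1, 0.9, 0.3],
}

similar = cosine_similarity(word_vectors["happy"], word_vectors["glad"])
different = cosine_similarity(word_vectors["happy"], word_vectors["broken"])
print(similar, different)
```

Words with a related meaning end up close together in the space (high similarity), and a classifier can exploit exactly that geometry.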

Later Tomas joined Facebook AI Research and, once there, took the Word2Vec concept to a new level by adding subword information. Based on that paper, Facebook AI Research released the open-source FastText. FastText, in my opinion, is one of the best algorithms available so far for topic detection and sentiment analysis where the texts in the corpus are relatively short and your training data set is not extremely big. With a few hundred labeled training examples, the algorithm reaches very good results in terms of precision and recall. Business people can loosely read “precision” and “recall” as “accuracy”, although in AI jargon “accuracy” is a separate metric that can be misleading on its own. FastText, moreover, is very easy to use: you can run it directly in the Unix shell from the command line. It doesn’t require big computing capacity (it is not a real deep NN). You can also use a Python wrapper or, if you are not a geek, you can use VOC Classify of sandsiv+ to get an easy-to-use graphical user interface for FastText. (DISCLAIMER: I am the CEO of sandsiv+ and, of course, I am trying to promote my brand and my products, sorry for that)
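
To make “precision” and “recall” concrete, here is a minimal sketch using their standard definitions. The ten labels are invented for illustration: a lazy model that always answers POSITIVE looks good on accuracy, yet it never finds a single unhappy customer.

```python
def evaluate(true_labels, predicted_labels, target):
    """Return (precision, recall, accuracy) for one target class."""
    pairs = list(zip(true_labels, predicted_labels))
    true_pos = sum(1 for t, p in pairs if t == target and p == target)
    false_pos = sum(1 for t, p in pairs if t != target and p == target)
    false_neg = sum(1 for t, p in pairs if t == target and p != target)
    # Precision: of everything flagged as target, how much was right?
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    # Recall: of all real target cases, how many did we find?
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    return precision, recall, accuracy


# Hypothetical data: 8 happy customers, 2 unhappy ones
true_labels = ["POSITIVE"] * 8 + ["NEGATIVE"] * 2
lazy_predictions = ["POSITIVE"] * 10  # a model that never says NEGATIVE

precision, recall, accuracy = evaluate(true_labels, lazy_predictions, "NEGATIVE")
print(precision, recall, accuracy)  # 0.0 0.0 0.8
```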

In any case, to train a model in FastText or similar Word2Vec algorithms, you need data. Not simply data, but labeled data. For instance, data such as:

  • Not happy about your product | NEGATIVE
  • I love Raspberry PI | POSITIVE (…and by the way, I really love it!)

FastText doesn’t ask for a big quantity of training data; however, you need a certain amount to train the algorithm. When FastText runs a classification, it is possible to make it use a pre-built vector model, which helps it generalize better on the classification problem.
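
In practice, the fastText tool reads labeled data as plain text, one example per line, with the label marked by the prefix __label__. Below is a small sketch that prepares the two examples above in that format. The helper name and the file names in the comments are my own assumptions, and the training call is shown only as a comment because it needs the fasttext Python package installed.

```python
def to_fasttext_line(text, label, prefix="__label__"):
    """Format one labeled example as a single fastText training line."""
    # A line break inside the text would split one example into two,
    # so collapse every run of whitespace into a single space.
    clean_text = " ".join(text.split())
    return f"{prefix}{label} {clean_text}"


examples = [
    ("Not happy about your product", "NEGATIVE"),
    ("I love Raspberry PI", "POSITIVE"),
]
training_lines = [to_fasttext_line(text, label) for text, label in examples]
print("\n".join(training_lines))
# __label__NEGATIVE Not happy about your product
# __label__POSITIVE I love Raspberry PI

# With the lines saved to a file (hypothetical name "train.txt"), training
# with the fasttext Python package would look roughly like this:
#   import fasttext
#   model = fasttext.train_supervised(input="train.txt")
# A pre-built vector file can be supplied through the pretrainedVectors
# argument; its dimension has to match the model's dim setting.
```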

What does it mean exactly in business words? It means that, instead of training the model from scratch, you can add a “certain level of knowledge” of a specific language and, on top of that, use your data set to fine-tune the model. Imagine you are back at elementary school, envying the top of the class. Now imagine there is a magic way to pass the knowledge of the top of the class to you, simply by connecting your two brains over wifi. In AI language we call it transfer learning: we do not train and build a model from scratch, but apply our labeled dataset to a model that already has a certain knowledge. The goal is to improve the power of “generalization” of that model.

One of the biggest challenges in natural language processing (NLP) is the shortage of training data. Because NLP is a diversified field with many distinct tasks, most task-specific datasets contain only a few thousand or a few hundred thousand human-labeled training examples. However, modern deep learning-based NLP models benefit from much larger amounts of data, improving when trained on millions, or billions, of annotated (labeled) training examples. This is where transfer learning can provide big support.

Ladies and Gentlemen this is BERT!

To fill that gap and provide a solid transfer learning architecture, Google introduced in 2018 a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. What is so special about BERT? Why is it so different from FastText, for instance?

  • BERT is designed to pre-train deep machine learning models, which is what makes transfer learning possible. Some of the key milestones reached in the years 2017-2019 have been ELMo, ULMFiT and the OpenAI Transformer. All these approaches allow us to pre-train an unsupervised language model on a large corpus of data, such as Wikipedia articles, and then fine-tune these pre-trained models on downstream tasks.
  • BERT is a bidirectional model based on the Transformer architecture, which replaces the sequential nature of Recurrent Neural Networks (RNNs) with a much faster attention-based approach.
  • BERT is bidirectional, and this is quite interesting in certain NLP cases: it means it can “read”, in a certain way, in both directions at once. As opposed to directional models, which read the text input sequentially (left-to-right or right-to-left), the Transformer encoder reads the entire sequence of words at once (this is related to the attention mentioned before). Therefore it is considered bidirectional, though it would be more accurate to say that it’s non-directional. This characteristic allows the model to learn the context of a word based on all of its surroundings (left and right of the word).
  • BERT considers the “context”, not just the single word and its immediate neighbors (the word window and n-grams), as FastText does. In a few words, BERT considers the sentence and the sentences around it. This allows the model to understand a particular word within a sentence, and the sentence itself within a longer passage.
  • BERT is also pre-trained on two unsupervised tasks: masked language modeling and next sentence prediction. This allows us to take a pre-trained BERT model (transfer learning) and fine-tune it on specific downstream tasks such as sentiment classification, intent detection, question answering and more.
  • BERT uses a compact vocabulary. It represents words as subwords or n-grams. A vocabulary of roughly 8k to 30k subwords can represent any word in a large corpus (!). This is a significant advantage from a memory perspective.
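
The subword idea from the last point can be sketched as a greedy "longest piece first" split, the approach used by WordPiece-style tokenizers such as the one BERT relies on. The tiny vocabulary below is a made-up assumption (a real one has tens of thousands of entries), and "##" marks a piece that continues a word.

```python
def split_into_subwords(word, vocabulary, unknown="[UNK]"):
    """Greedily split a word into the longest pieces found in the vocabulary."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of a word
            if candidate in vocabulary:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unknown]  # no known piece fits: give up on this word
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary, made up for illustration
vocabulary = {"un", "##happi", "##ness", "phone", "##s", "shop"}

print(split_into_subwords("unhappiness", vocabulary))  # ['un', '##happi', '##ness']
print(split_into_subwords("phones", vocabulary))       # ['phone', '##s']
print(split_into_subwords("xyz", vocabulary))          # ['[UNK]']
```

With only six entries the toy vocabulary already covers words it has never stored as a whole, which is the memory advantage mentioned above.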

Can you please use business language to summarise those great BERT features?

Ok, let’s use specific examples. Let’s focus on the particular features that set BERT apart from the “first generation” of vector algorithms:

  • In a sentence with two words removed and replaced by a [MASK] tag, BERT is trained to predict what those two words are. Wait a minute, removed? Yes. Because BERT is bidirectional, if you allow BERT to see the word it should predict while it is trained “left to right”, then it will “cheat” by using that information when predicting “right to left”. And of course, we don’t want to be cheated by BERT!

INPUT: I used my [MASK] in a normal way. It is broken and I need to come back to the [MASK].

LABELS: [MASK]1 = phone; [MASK]2 = shop
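
Here is a minimal sketch of how such a training example can be produced from ordinary text. For a reproducible result I pass the positions to hide by hand; in the original BERT recipe about 15% of the tokens are picked at random. The function name and the simple tokenizer are my own assumptions.

```python
import re


def mask_tokens(text, positions, mask_token="[MASK]"):
    """Hide the tokens at the given positions and remember them as labels."""
    # Simple tokenizer: words and punctuation marks become separate tokens.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    labels = {}
    for position in positions:
        labels[position] = tokens[position]  # the answer the model must predict
        tokens[position] = mask_token
    return tokens, labels


text = (
    "I used my phone in a normal way. "
    "It is broken and I need to come back to the shop."
)
masked_tokens, labels = mask_tokens(text, positions=[3, 20])
print(" ".join(masked_tokens))
print(labels)  # {3: 'phone', 20: 'shop'}
```

No human had to label anything: the text itself provides both the question and the answer.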

  • Given two sentences, BERT is trained to determine whether one of these sentences comes after the other in a piece of text, or whether they are just two unrelated sentences.

SENTENCE A = I used my phone in a normal way.

SENTENCE B = It is broken and I need to come back to the shop.

LABEL = IsNextSentence

SENTENCE A = I used my phone in a normal way.

SENTENCE B = There is a problem with the right engine.

LABEL = NotNextSentence

Training BERT to predict sentences (and not just words) allows the model to understand a sentence within the context of the corpus. This is definitely a big step forward in NLP and, of course, in the whole word embeddings methodology.
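
The sentence pairs can also be generated automatically, as the sketch below shows. Half of the time the true next sentence is kept, otherwise a random other sentence is paired in. As a simplifying assumption I draw the random sentence from the same short list, whereas the original recipe picks it from elsewhere in the corpus; the sample sentences and the function name are made up for illustration.

```python
import random


def make_sentence_pairs(sentences, rng):
    """Build (sentence A, sentence B, label) training examples."""
    examples = []
    for i in range(len(sentences) - 1):
        first = sentences[i]
        if rng.random() < 0.5:
            # Keep the sentence that really follows.
            examples.append((first, sentences[i + 1], "IsNextSentence"))
        else:
            # Pick any sentence that is neither A itself nor its true successor.
            others = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
            examples.append((first, rng.choice(others), "NotNextSentence"))
    return examples


# Hypothetical text (needs at least three sentences)
sentences = [
    "I used my phone in a normal way.",
    "It is broken and I need to come back to the shop.",
    "The shop is closed on Sunday.",
    "There is a problem with the right engine.",
]
examples = make_sentence_pairs(sentences, random.Random(0))
for first, second, label in examples:
    print(label, "|", first, "|", second)
```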

  • Those two specific tasks allow us to pre-train the model without labeled datasets, which is definitely a big advantage: it avoids the high cost of human annotation. You can create your own pre-trained model or, and this is the interesting option, use one of the ready-to-go pre-trained models that Google already offers.

Worth mentioning is the model pre-trained on 104 languages. The full list of pre-trained models is available here.

Of course, training an initial model requires a lot of computing power, so it is better to consider a GPU or TPU environment. Training those models in an external cloud (e.g. Google Cloud Service) is strongly recommended to save time. Once the basic knowledge is built (transfer learning), you can work on the final layers and fine-tune the model on a task of your choice, which will benefit from the rich representations of language learned during pre-training.

Even if you load an existing pre-trained model, unless you use a BASE model, it is strongly recommended to do so in a cloud of your choice. In my case, I prefer Google Cloud with TPUs, but it works in any other cloud service offering an AI architecture.

An important advantage of BERT over the first generation of word embedding models is its ability to embed the same word differently depending on its meaning. What does that mean exactly, and what kind of advantages does it bring to the business? Let’s look at an example:

  • I checked my bank account.
  • There was a bank of dark clouds.
  • I climbed a steep bank up to the cabin.
  • Let’s bank the campfire so it could be easily revived in the morning.

The model generates an embedding for a word based on the context in which it appears, thus producing a slightly different embedding for each of its occurrences. In our case, “bank” will have a different vector in each context. In standard word embeddings such as GloVe, FastText or Word2Vec, each instance of the word “bank” would have the same vector representation. BERT enables NLP models to better disambiguate the sense of a given word.

The main reason for this game-changing feature is the use of an attention-only model instead of a Recurrent Neural Network (RNN) approach. The key idea of the attention-only model called Transformer (!) is described in the paper Attention Is All You Need. It is computationally attractive (parallel as opposed to sequential processing of the input) and performs even better (it can remember information beyond about 100 words) than the Long Short-Term Memory (LSTM) units used as building blocks for layers in a Recurrent Neural Network.
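
For readers who want to peek under the hood, here is a heavily simplified sketch of attention. Every word looks at all the words of its sentence at once and blends their vectors, weighted by similarity. A real Transformer adds learned projections, several attention heads and many layers, all of which I leave out here; the two-dimensional starting vectors are hand-made assumptions.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(score - highest) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def self_attention(vectors):
    """Scaled dot-product attention with queries = keys = values = input."""
    dim = len(vectors[0])
    outputs = []
    for query in vectors:
        # Compare this word with every word in the sentence, all at once.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        # New vector = weighted blend of all word vectors in the sentence.
        outputs.append(
            [sum(w * vec[d] for w, vec in zip(weights, vectors)) for d in range(dim)]
        )
    return outputs


# Hypothetical static vectors: "bank" starts identical in both sentences
embeddings = {
    "my": [0.5, 0.5],
    "bank": [1.0, 1.0],
    "account": [2.0, 0.0],
    "river": [0.0, 2.0],
}


def contextual_vector(sentence, target):
    tokens = sentence.split()
    outputs = self_attention([embeddings[token] for token in tokens])
    return outputs[tokens.index(target)]


money_bank = contextual_vector("my bank account", "bank")
river_bank = contextual_vector("river bank", "bank")
print(money_bank, river_bank)
```

The word "bank" enters both sentences with the same vector but leaves with two different ones: pulled towards "account" in the first sentence and towards "river" in the second.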

How can we explain this unique feature in simple words? Are you familiar with the German language? I often explain to my friends why Germans are so precise and Italians so chaotic 😉 It all starts with the language. In German, on many occasions, the verb pops up at the end of the sentence. Thus, you really need to focus on the whole sentence and wait until the verb pops up to get the whole meaning. An attention-only model works a little bit like that: it receives information in parallel and waits until all the information is available to reach its “final conclusion”.


Is BERT a killer NLP application for business? Let’s say that, from an academic point of view, it is definitely a big step forward in NLP and Natural Language Understanding (NLU). It is relatively young, and we need to apply it to real business problems in order to better understand its impact. My experience is that when I work with academic data sets and benchmark my work against academic papers, I see definite improvements. When I work in the real business world, those improvements are less evident, and the fine-tuning process is long and complex. My personal feeling is that BERT can really bring benefits to NLP and NLU. I will keep you updated on how BERT helps sandsiv+ and, in the end, your business.

Sentiment analysis
Federico Cesconi
