How can some of the latest models help further shorten the information flow chain? This time I focused on the possibility of ingesting data in tabular formats – e.g. data contained in relational databases – without knowing languages like SQL or Python. That way, business people with no prior knowledge of these technical languages can query databases and get insights quickly and conveniently.
You may wonder what tapas have to do with it. Well, they do! But not the beautiful, appetizing multicolored dishes normally displayed in the windows of the typical restaurants of the Iberian peninsula. Those who know me know how much I love Spain, and in particular Andalucia. No, unfortunately not those. TAPAS, in this case, stands for TAble PArSing, a solution that lets you query relational databases and spreadsheets using a natural language question such as, “How much was the NPS of Apple in 2019?”.
To solve the problem, Google developed a new model (TAPAS) based on another model that the Mountain View giant also uses today in its search engine, namely BERT. We are therefore talking about Transformers and NLP.
A considerable portion of the information collected in the world is organized in relational databases, that is, represented in rows, columns, and tables. Navigating these rows and columns has historically required reading a spreadsheet more or less manually, or writing a SQL query. Clearly, the ability to query data in natural language makes the task considerably easier for users, which is why the technology has been widely adopted by Google and other players in the analytics market.
The search giant says TAPAS beats or equals the top three open-source algorithms for relational data analysis. Google trained TAPAS on 6.2 million tables from the English version of Wikipedia and then put it to work on a trio of academic datasets. Benchmark tests showed that, on all three datasets, TAPAS delivers answers as accurate as, or more accurate than, those of rival algorithms.
And so far we’re all happy, but will the model generalize well enough to be used effectively in the business world as well? Let’s find out.
I created a small script, starting from the example provided by huggingface.co and available here, and fed the model new data that I invented entirely from scratch. I then wrote some questions and asked the model to answer them. Here’s what happened:
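In essence, the script boils down to the following sketch (the table is a simplified stand-in with the same shape as my invented NPS data, and the checkpoint is the WTQ-finetuned TAPAS model published on huggingface.co):

import pandas as pd
from transformers import TapasTokenizer, TapasForQuestionAnswering

# The WTQ-finetuned TAPAS checkpoint from huggingface.co
model_name = "google/tapas-base-finetuned-wtq"
tokenizer = TapasTokenizer.from_pretrained(model_name)
model = TapasForQuestionAnswering.from_pretrained(model_name)

# Invented NPS-style data; TAPAS expects every cell to be a string
data = {
    "Company": ["Apple", "Samsung", "Google"],
    "Year": ["2019", "2019", "2019"],
    "NPS": ["61", "45", "52"],
}
table = pd.DataFrame.from_dict(data)

queries = [
    "How much was the NPS of Apple in 2019?",
    "Which company had the lowest NPS?",
]

inputs = tokenizer(table=table, queries=queries, padding="max_length", return_tensors="pt")
outputs = model(**inputs)

# Map the predicted cell logits back to coordinates in the table
predicted_coordinates, predicted_aggregations = tokenizer.convert_logits_to_predictions(
    inputs, outputs.logits.detach(), outputs.logits_aggregation.detach()
)

for query, coords in zip(queries, predicted_coordinates):
    print(query, "->", ", ".join(table.iat[c] for c in coords))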
As you can see from the example, instead of creating a model bound to a specific table structure, Google decided to take a more holistic approach by building a neural network that can be adapted to any form of tabular dataset. In this case, I made up from scratch some typical data from an NPS survey and then asked the model some questions.
Keep in mind that, in order to generalize the model as much as possible and achieve this remarkably holistic level, Google decided to base TAPAS on its famous BERT encoder architecture. BERT is an “old” acquaintance to which I have already devoted several articles.
I know what you’re thinking: whatever, it’s a small table with few records and few columns. All right, I accept the comment; let’s try something more difficult, with a much larger dataset. For this, I turned to kaggle.com and picked a dataset more or less at random. I looked for NPS-related data but didn’t have much success. Then a recent dataset on a topical theme caught my attention: COVID-19 World Vaccination Progress – Daily and Total Vaccination for COVID-19 in the World. Let’s see what happened:
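For the curious, the loading part is just a few lines of pandas. The sketch below assumes the dataset ships as a single country_vaccinations.csv file, so adjust the name if your download differs:

import pandas as pd
from transformers import TapasTokenizer, TapasForQuestionAnswering

tokenizer = TapasTokenizer.from_pretrained("google/tapas-base-finetuned-wtq")
model = TapasForQuestionAnswering.from_pretrained("google/tapas-base-finetuned-wtq")

# File name assumed from the Kaggle dataset page -- adjust if yours differs
df = pd.read_csv("country_vaccinations.csv")

# TAPAS expects every cell to be a string
table = df.astype(str)

# Example questions; replace them with your own
queries = [
    "Which country uses the Moderna vaccine?",
    "What is the total number of vaccinations in Italy?",
]

# padding and truncation matter here: this table is far larger than the toy one
inputs = tokenizer(table=table, queries=queries, padding=True, truncation=True, return_tensors="pt")
outputs = model(**inputs)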
Not bad, what do you think? Consider that all this happens without any fine-tuning: the model is seeing our data for the first time, yet it manages to answer the related questions correctly.
Conclusions
The potential of TAPAS in everyday business is quite clear: retrieving information contained in databases without writing a single line of SQL or any other language, simply by asking questions in your own words.
Imagine a voice service or app that allows a sales team spread across the globe to quickly query sales databases with questions such as “What was the best-selling item yesterday in my area?” or “What is the top item viewed this month?”. Can you imagine how quickly information could be distributed and put to use?
In addition, TAPAS can go beyond simple data retrieval and perform basic calculations as well. For example, if a business user evaluating sales data asks for the average revenue of their company’s three most popular products, TAPAS can work out the answer.
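To give an idea of the mechanics, here is a sketch that continues from the earlier snippets (tokenizer, model, inputs, outputs, queries, and table are reused). At least in the huggingface.co implementation, the WTQ-finetuned checkpoint predicts an aggregation operator alongside the selected cells, and the arithmetic itself is left to the caller:

# Mapping of aggregation logits to operators, as documented by huggingface.co
id2aggregation = {0: "NONE", 1: "SUM", 2: "AVERAGE", 3: "COUNT"}

predicted_coordinates, predicted_aggregations = tokenizer.convert_logits_to_predictions(
    inputs, outputs.logits.detach(), outputs.logits_aggregation.detach()
)

for query, coords, agg in zip(queries, predicted_coordinates, predicted_aggregations):
    cells = [table.iat[c] for c in coords]
    # TAPAS selects the cells and names the operator; applying it
    # (e.g. averaging three revenue figures) is up to your code
    print(query, "->", id2aggregation[agg], cells)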
My guess is that we’ll see TAPAS at work very soon. Google could, for example, use the model to enhance its Sheets spreadsheet editor, which already provides limited natural language query options through its Explore feature. TAPAS’s ability to pull specific items from large data stores could also lend itself to improving Google Assistant’s question-answering capabilities.
The geeks’ corner
To build the script for this article, I used the examples given in the huggingface.co documentation, available here.
I use Google Colab, so I had to do a bit of research to solve a problem that occurred when installing the PyTorch Geometric packages. PyTorch is already installed in Google Colab, so the easiest thing is to run these lines of code first:
import torch
print(torch.__version__)
print('Cuda Available : {}'.format(torch.cuda.is_available()))
print('GPU - {0}'.format(torch.cuda.get_device_name()))
!python --version
# Use the first line to adjust the pip install in the next cell
And so I installed PyTorch Geometric using the code:
!pip install torch-scatter -f https://pytorch-geometric.com/whl/torch-1.8.0+cu101.html
!pip install torch-sparse -f https://pytorch-geometric.com/whl/torch-1.8.0+cu101.html
!pip install torch-cluster -f https://pytorch-geometric.com/whl/torch-1.8.0+cu101.html
!pip install torch-spline-conv -f https://pytorch-geometric.com/whl/torch-1.8.0+cu101.html
!pip install torch-geometric
# After successfully installing, RESTART THE RUNTIME
I set up Colab to use GPUs. Remember, after this step, to restart the runtime so that the newly installed versions are actually picked up.
One last important point: remember to manage padding and truncation if you use data other than the example available on huggingface.co. Nothing difficult; if you forget, the error message will remind you. For the second example, I adapted the tokenizer call like this:
inputs = tokenizer(
    table=table,
    queries=queries,
    padding=True,       # pad the queries in the batch to the same length
    truncation=True,    # drop table rows that don't fit the 512-token input limit
    return_tensors="pt",
)