Understanding Google’s BigBird — Is It Another Big Milestone In NLP?

Google Researchers recently published a paper on arXiv titled Big Bird: Transformers for Longer Sequences.

Praveen Mishra
Towards Data Science


Image by Gerd Altmann from Pixabay

Last year, Google rolled out BERT in Search, and it proved to be one of the most efficient and effective algorithm changes since RankBrain. Looking at the initial results, BigBird is showing similar promise!

In this article, I’ve covered:

  • A brief overview of Transformers-based Models,
  • Limitations of Transformers-based Models,
  • What is BigBird, and
  • Potential applications of BigBird.

Let’s begin!

A Brief Overview of Transformers-Based Models

Natural Language Processing (NLP) has improved drastically over the past few years, and Transformers-based models have played a significant role in that progress. Still, there is a lot to uncover.

The Transformer, a neural network architecture for Natural Language Processing introduced in 2017, is primarily known for improving the efficiency of handling and comprehending sequential data in tasks like text translation and summarization.

Unlike Recurrent Neural Networks (RNNs), which must process the beginning of the input before its end, Transformers can process the whole input in parallel and thus significantly speed up computation.

BERT, one of the biggest milestone achievements in NLP, is an open-sourced Transformers-based model. Like BigBird, it was introduced in a paper by Google researchers, published on 11 October 2018.

Bidirectional Encoder Representations from Transformers (BERT) is one of the most advanced Transformers-based models. It is pre-trained on a huge corpus: BooksCorpus (roughly 800 million words) and English Wikipedia (roughly 2,500 million words).

Because BERT is open source, anyone can use it to build their own question-answering system, which also contributed to its wide popularity.

But BERT is not the only contextual pre-trained model. Unlike most earlier models, however, it is deeply bidirectional, which is one of the reasons for its success and diverse applications.

Bidirectional Nature of BERT

The results of this pre-trained model are definitely impressive. It was successfully adopted for many sequence-based tasks such as summarization, translation, etc. Even Google adopted BERT for understanding the search queries of its users.

But BERT, like other Transformers-Based Models, has its own limitations.

Limitations of Previous Transformers-Based Models

While Transformers-based models, especially BERT, are a big improvement over RNNs and far more efficient, they come with a few limitations.

BERT relies on a full self-attention mechanism, in which every token attends to every other token, so computational and memory requirements grow quadratically with the length of the input. The maximum input size is capped at around 512 tokens, which means the model cannot be used for longer inputs or for tasks like large-document summarization.

In practice, this means a long document has to be broken into smaller segments before it can be fed to the model. This content fragmentation causes a significant loss of context, which limits what the model can be used for. The toy sketch below illustrates both the quadratic growth and the chunking it forces.
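As a rough illustration (the token counts and the naive 512-token chunking below are just for demonstration, not BERT's actual preprocessing code):

```python
# Toy illustration of why full self-attention limits input length.
# The numbers and the naive chunking are illustrative, not BERT's real pipeline.

def full_attention_entries(n_tokens: int) -> int:
    """Every token attends to every token: an n x n score matrix."""
    return n_tokens * n_tokens

for n in (512, 1024, 2048):
    print(f"{n:>5} tokens -> {full_attention_entries(n):>10,} attention scores")
# 512  ->    262,144
# 1024 ->  1,048,576  (2x the tokens, 4x the cost)
# 2048 ->  4,194,304  (4x the tokens, 16x the cost)

# Because of the 512-token cap, a long document must be split into chunks,
# and tokens near a chunk boundary lose all context from the other side.
def chunk(tokens, max_len=512):
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]

document = [f"tok{i}" for i in range(1300)]   # a pretend 1,300-token document
print([len(c) for c in chunk(document)])      # [512, 512, 276]
```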

So, what is Big Bird and how is it different from BERT or any other transformers-based NLP models?

Introducing BigBird — Transformers for Longer Sequences

As mentioned earlier, one of the major limitations of BERT and other Transformers-based NLP models is that they rely on a full self-attention mechanism.

This changed when researchers at Google published a paper on arXiv titled “Big Bird: Transformers for Longer Sequences”.

BigBird uses a sparse attention mechanism that allows it to overcome BERT's quadratic dependency on sequence length while preserving the properties of full-attention models. The researchers also show how models built on BigBird surpass the performance of previous models on a range of NLP tasks as well as on genomics tasks.


Before we move on to the possible applications of BigBird, let's look at its key highlights.

Key Highlights of BigBird

Here are some of the features of BigBird that make it better than previous transformer-based models.

  • Sparse Attention Mechanism

Let’s say that you are given a picture and are asked to create a relevant caption for it. You will start by identifying the key object in that picture, say a person throwing a “ball”.

Identifying this main object is easy for us as humans, but streamlining that process for computer systems is a hard problem in NLP. Attention mechanisms were introduced to let models focus on the most relevant parts of the input instead of treating everything equally.

BigBird uses a sparse attention mechanism that enables it to process sequences up to 8x longer than what was possible with BERT. Keep in mind that this is achieved using the same hardware as BERT.

In the BigBird paper, the researchers show that the sparse attention mechanism is as powerful as the full self-attention mechanism used in BERT. Beyond that, they also show that "sparse encoder-decoders are Turing Complete".

In simpler words, sparse attention means that each token attends only to a small subset of the other tokens: a local window of neighbours, a handful of global tokens, and a few randomly chosen tokens, rather than every token attending to every other token as in BERT. The sketch below builds such a sparse attention mask.
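Here is a minimal NumPy sketch of a BigBird-style attention mask that combines the three components. The window size, number of global tokens, and number of random tokens are illustrative choices, not the exact block-sparse configuration from the paper:

```python
import numpy as np

def bigbird_style_mask(n, window=3, n_global=2, n_random=3, seed=0):
    """Return an (n, n) 0/1 mask: 1 means query i may attend to key j.

    Combines the three BigBird components:
      - sliding window: each token sees `window` neighbours on each side,
      - global tokens: the first `n_global` tokens see and are seen by everyone,
      - random attention: each token sees `n_random` extra random tokens.
    """
    rng = np.random.default_rng(seed)
    mask = np.zeros((n, n), dtype=np.int8)

    for i in range(n):
        # sliding window around position i
        lo, hi = max(0, i - window), min(n, i + window + 1)
        mask[i, lo:hi] = 1
        # a few random keys for each query
        mask[i, rng.choice(n, size=n_random, replace=False)] = 1

    # global tokens attend everywhere and are attended to by everyone
    mask[:n_global, :] = 1
    mask[:, :n_global] = 1
    return mask

mask = bigbird_style_mask(n=64)
print("nonzero entries:", mask.sum(), "out of", mask.size)  # far fewer than 64*64
```

Because the number of nonzero entries per row stays roughly constant, the cost of this mask grows linearly with the sequence length instead of quadratically.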

  • Can Process Up to 8x Longer Input Sequence

One of the key features of BigBird is its ability to handle sequences up to 8x longer than was previously possible.

The team of researchers designed BigBird to meet all the requirements of full transformers like BERT.

Using BigBird and its sparse attention mechanism, the researchers reduced the complexity of BERT's O(n²) attention to just O(n). As a result, the input sequence length that was limited to 512 tokens can now be increased to 4,096 tokens (8 × 512).

Philip Pham, one of the researchers who created BigBird, says in a Hacker News discussion: “In most of our paper, we use 4096, but we can go much larger 16k+.” The back-of-the-envelope numbers below show where the 8x factor comes from.
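This is only rough arithmetic; the assumption that each BigBird token attends to about 512 keys in total (window plus global plus random) is an illustrative figure, not an exact count from the implementation:

```python
# Back-of-the-envelope comparison of attention costs.
# Assumption: each BigBird token attends to roughly k = 512 keys in total
# (window + global + random); the exact budget depends on the configuration.
k = 512

for n in (512, 4096):
    full = n * n      # full self-attention grows quadratically with n
    sparse = n * k    # sparse attention grows linearly with n
    print(f"n={n:>4}: full={full:>12,}  sparse~={sparse:>12,}")

# n= 512: full=     262,144  sparse~=     262,144
# n=4096: full=  16,777,216  sparse~=   2,097,152   (about 8x cheaper)
```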

  • Evaluated on Large QA Datasets

Google researchers used four large question-answering datasets to fine-tune and evaluate BigBird: Natural Questions, TriviaQA, HotpotQA-distractor, and WikiHop.

While BigBird is nowhere near the scale of GPT-3 (a model with 175 billion parameters), Table 3 of the research paper shows that it performs better than RoBERTa (A Robustly Optimized BERT Pretraining Approach) and Longformer (a BERT-like model for long documents) on these benchmarks.

When a user asked Philip Pham to compare GPT-3 to BigBird, he said — “GPT-3 is only using a sequence length of 2048. BigBird is just an attention mechanism and could actually be complementary to GPT-3.”

[Possible] Applications of BigBird

The paper introducing BigBird appeared very recently, on 28 July 2020, so the full potential of BigBird is yet to be determined.

But here are a few possible areas where it can be applied. A few of these applications are also proposed by the creators of BigBird in the original research paper.

  • Genomics Processing

There has been an increase in the use of deep learning for processing genomics data. The encoder takes fragments of a DNA sequence as input for tasks such as methylation analysis, predicting the functional effects of non-coding variants, and more.

Creators of BigBird say that: “we introduce a novel application of attention-based models where long contexts are beneficial: extracting contextual representations of genomics sequences like DNA”.

When BigBird is used for promoter region prediction, the paper claims an improvement of about 5% in the accuracy of the final results! The sketch below shows one common way DNA fragments can be turned into tokens for such a model.
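This is only an illustration of DNA tokenization in general, not the exact scheme used in the paper; the k-mer length and the overlap are arbitrary choices here:

```python
def dna_to_kmers(sequence, k=6):
    """Split a DNA string into overlapping k-mer "tokens".

    Overlapping k-mers are one common way to feed DNA into sequence models;
    the choice of k = 6 is purely illustrative.
    """
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

fragment = "ATGCGTACGTTAGC"            # a toy 14-base DNA fragment
tokens = dna_to_kmers(fragment)
print(tokens[:3])                       # ['ATGCGT', 'TGCGTA', 'GCGTAC']
print(len(tokens), "tokens")            # 9 tokens
```

A real fragment of thousands of bases produces thousands of tokens, which is exactly the regime where a long-sequence attention mechanism like BigBird's becomes useful.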

  • Long Document Summarization & Question Answering

Since BigBird can now handle sequences up to 8x longer, it can be used for NLP tasks such as long-document summarization and long-form question answering. While developing BigBird, the researchers also tested its performance on these tasks and reported "state-of-the-art results". A short usage sketch follows below.
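As a rough sketch of what this can look like in practice, the example below uses the Hugging Face transformers library with the google/bigbird-pegasus-large-arxiv checkpoint. Both the library support and that checkpoint arrived after the paper, so treat the model name and parameters as assumptions to verify against your installed version:

```python
# Minimal sketch of long-document summarization with a BigBird-based model.
# Assumes the Hugging Face `transformers` library with BigBird-Pegasus support
# and the `google/bigbird-pegasus-large-arxiv` checkpoint; verify both exist
# in your environment before relying on this.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/bigbird-pegasus-large-arxiv"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

long_document = "..."  # a research paper or report, several thousand tokens long

# The 4096-token limit mirrors the sequence length used in the paper.
inputs = tokenizer(long_document, return_tensors="pt",
                   truncation=True, max_length=4096)
summary_ids = model.generate(**inputs, max_length=256, num_beams=5)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```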

  • BigBird for Google Search

Google started using BERT in October 2019 to understand search queries and display more relevant results for its users. The ultimate goal of Google's search-algorithm updates is to understand search queries better and better.

With BigBird outperforming BERT on Natural Language Processing (NLP) benchmarks, it would make sense for Google to adopt this newly introduced, more effective model to interpret search queries.

  • Web & Mobile App Development

Natural Language Processing has progressed significantly over the past decade. With GPT-3-powered platforms that can turn simple statements into a functioning web app (code included) already in place, AI could truly transform the way web and mobile apps are developed.

Since BigBird can handle longer input sequences than GPT-3, it could be combined with models like GPT-3 to create web and mobile apps for your business more efficiently and quickly.

Conclusion

While there is still a lot about BigBird left to explore, it clearly has the potential to change Natural Language Processing (NLP) for good. What are your thoughts on BigBird and its contribution to the future of NLP?

References:
[1] Manzil Zaheer et al., Big Bird: Transformers for Longer Sequences (2020), arXiv.org

[2] Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2018), arXiv.org
