Transfer Learning for NLP with TensorFlow Hub

Welcome to this hands-on project on transfer learning for natural language processing with TensorFlow and TF Hub. By the time you complete this project, you will be able to use pre-trained NLP text embedding models from TensorFlow Hub, perform transfer learning to fine-tune models on real-world data, build and evaluate multiple models for text classification with TensorFlow, and visualize model performance metrics with TensorBoard.

Learning Objectives

  • Use pre-trained NLP text embedding models from TensorFlow Hub
  • Perform transfer learning to fine-tune models on real-world text data
  • Visualize model performance metrics with TensorBoard



To complete this project successfully, you should be competent in the Python programming language, be familiar with deep learning for Natural Language Processing (NLP), and have trained models with TensorFlow and its Keras API.

The project is organized into the following tasks:

  • Task 1: Introduction to the Project
  • Task 2: Set up your TensorFlow and Colab Runtime
  • Task 3: Load the Quora Insincere Questions Dataset
  • Task 4: TensorFlow Hub for Natural Language Processing
  • Tasks 5 & 6: Define Function to Build and Compile Models
  • Task 7: Train Various Text Classification Models
  • Task 8: Compare Accuracy and Loss Curves
  • Task 9: Fine-tune Model from TF Hub
  • Task 10: Train Bigger Models and Visualize Metrics with TensorBoard

While you watch me work through each step, you will have a cloud desktop with all the required software pre-installed, so you can follow along with the instructions and complete the tasks listed above. After all, we learn best by doing.



Task 2: Set up your TensorFlow and Colab Runtime




Sun May 29 08:48:46 2022       
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   47C    P8    10W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|  No running processes found                                                 |


import numpy as np
import pandas as pd

import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_datasets as tfds

import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (12, 8)
from IPython import display

import pathlib
import shutil
import tempfile

!pip install -q git+

import tensorflow_docs as tfdocs
import tensorflow_docs.modeling
import tensorflow_docs.plots

print("Version: ", tf.__version__)
print("Hub version: ", hub.__version__)
print("GPU is", "available" if tf.config.list_physical_devices('GPU') else "NOT AVAILABLE")

logdir = pathlib.Path(tempfile.mkdtemp())/"tensorboard_logs"
shutil.rmtree(logdir, ignore_errors=True)


  Building wheel for tensorflow-docs ( ... done
Version:  2.8.0
Hub version:  0.12.0
GPU is available

Task 3: Download and Import the Quora Insincere Questions Dataset

Download a copy of the Quora Insincere Questions Classification data, then decompress it and read it into a pandas DataFrame.


df = pd.read_csv('',
                 compression='zip', low_memory=False)
df.shape


(1306122, 3)


df['target'].plot(kind='hist', title='Target Distribution');




Import train_test_split from sklearn.model_selection to split the data into training and validation sets.

from sklearn.model_selection import train_test_split

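# Stratify on the target so both subsets keep the original class balance.
train_df, remaining = train_test_split(df, random_state=42, train_size=0.01,
                                       stratify=df['target'].values)
valid_df, _ = train_test_split(remaining, random_state=42, train_size=0.001,
                               stratify=remaining['target'].values)
train_df.shape, valid_df.shape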



((13061, 3), (1293, 3))


The “stratify” parameter of “train_test_split” (scikit-learn)

Stratifying preserves the class proportions of the target column in both parts of the split. Take a binary classification problem in which 80% of the targets are ‘yes’ and 20% are ‘no’. A purely random split can leave the two parts with noticeably different class ratios, and with a small sample or a rare class, one part may end up with very few ‘no’ examples, or none at all.

With stratification, the target column of the training set contains 80% ‘yes’ and 20% ‘no’, and so does the target column of the test set.

In other words, stratifying distributes the labels across the training and test sets just as they are distributed in the original dataset.

More precisely, the split is made so that the proportion of each value in the resulting samples matches its proportion in the array passed to the stratify parameter.

For example, if y is a binary categorical variable with 25% zeros and 75% ones, stratify=y ensures that each part of the random split has 25% zeros and 75% ones.
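To make the effect concrete, here is a small sketch that compares a stratified and an unstratified split. The labels are hypothetical toy data (800 zeros and 200 ones), not the Quora dataset, and the variable names are ours.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy labels: 80% class 0, 20% class 1 (not the Quora data).
y = np.array([0] * 800 + [1] * 200)
X = np.arange(len(y)).reshape(-1, 1)

# Stratified split: both parts keep the 80/20 class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.1, random_state=42, stratify=y)

# Unstratified split for comparison: the ratio is free to drift,
# and the drift tends to be larger for small samples.
_, _, y_train_plain, _ = train_test_split(
    X, y, train_size=0.1, random_state=42)

strat_train_ratio = y_train.mean()
strat_test_ratio = y_test.mean()
plain_train_ratio = y_train_plain.mean()

print("stratified train positive rate:  ", strat_train_ratio)
print("stratified test positive rate:   ", strat_test_ratio)
print("unstratified train positive rate:", plain_train_ratio)
```

Both stratified rates should match the 20% positive rate of the full label array, while the unstratified rate depends on the random draw.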



array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0])




array(['What is your experience living in Venezuela in the current crisis? (2018)',
       'In which state/city the price of property is highest?',
       'Do rich blacks also call poor whites, “White Trash”?',
       'Should my 5 yr old son and 2 yr old daughter spend the summer with their father, after a domestic violent relationship?',
       'Why do we have parents?',
       'Do we experience ghost like Murphy did in Interstellar?',
       'Are Estoniano women beautiful?',
       'There was a Funny or Die video called Sensitivity Hoedown that got pulled. Does anyone know why?',
       'Is it a good idea to go in fully mainstream classes, even if I have meltdowns that might disrupt people?',
       'What classifies a third world country as such?',
       'Is being a pilot safe?',
       'Who is Illiteratendra Modi? Why does he keep with him a Rs 1 lakh pen?',
       'Have modern management strategies such as Total supply Chain Management applied to education? Can they be?',
       'Why are Lucky Charms considered good for you?',
       'How many people in India use WhatsApp, Facebook, Twitter and Instagram?'],

Task 4: TensorFlow Hub for Natural Language Processing

Our text data consist of questions and corresponding labels.

You can think of a question vector as a distributed representation of a question. It is computed for every question in the training set, and the question vector, together with the output label, is then used to train the statistical classification model.

The intuition is that the question vector captures the semantics of the question and, as a result, can be effectively used for classification.

To obtain question vectors, we have two alternatives that have been used for several text classification problems in NLP:

  • word-based representations and
  • context-based representations

Word-based Representations

  • A word-based representation of a question combines the word embeddings of the content words in the question. A simple choice is the average of those embeddings; averaged word embeddings have been used for many NLP tasks.
  • Examples of pre-trained embeddings include:
    • Word2Vec: These are pre-trained embeddings of words learned from large text corpora. The widely distributed Google News vectors were trained on roughly 100 billion words of news text and are 300-dimensional.
    • GloVe: pre-trained on a corpus of tweets with 27 billion tokens, resulting in 200-dimensional vectors.
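The averaging idea can be sketched in a few lines of NumPy. Everything here is illustrative: the 4-dimensional vectors, the tiny vocabulary, the stop-word list, and the helper name question_vector are our own assumptions, not part of Word2Vec or GloVe.

```python
import numpy as np

# Hypothetical 4-dimensional embeddings for a tiny vocabulary. Real
# Word2Vec or GloVe vectors have hundreds of dimensions.
embeddings = {
    "pilot": np.array([0.9, 0.1, 0.0, 0.3]),
    "safe":  np.array([0.2, 0.8, 0.1, 0.0]),
    "job":   np.array([0.7, 0.2, 0.1, 0.4]),
}
STOP_WORDS = {"is", "being", "a", "the"}  # illustrative list only


def question_vector(question, embeddings, dim=4):
    """Average the embeddings of the known content words in a question."""
    tokens = question.lower().rstrip("?").split()
    vectors = [embeddings[t] for t in tokens
               if t not in STOP_WORDS and t in embeddings]
    if not vectors:               # no known content words at all
        return np.zeros(dim)
    return np.mean(vectors, axis=0)


vec = question_vector("Is being a pilot safe?", embeddings)
print(vec)
```

Here the result is the mean of the "pilot" and "safe" vectors. Note that averaging discards word order, which is the main limitation the context-based representations try to address.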

Context-based Representations

  • Context-based representations may use language models to generate vectors for sentences. Instead of learning vectors for the individual words in a sentence, they compute a vector for the sentence as a whole, taking into account the order of the words and the set of co-occurring words.
  • Examples of deep contextualized vectors include:
    • Embeddings from Language Models (ELMo): uses character-based word representations and bidirectional LSTMs. The pre-trained model computes a contextualized vector of 1024 dimensions. ELMo is available on TensorFlow Hub.
    • Universal Sentence Encoder (USE): The encoder uses a Transformer architecture with an attention mechanism to incorporate information about the order and the collection of words. The pre-trained USE model, which returns a vector of 512 dimensions, is also available on TensorFlow Hub.
    • Neural-Net Language Model (NNLM): The model simultaneously learns representations of words and probability functions for word sequences, allowing it to capture the semantics of a sentence. We will use pre-trained models available on TensorFlow Hub that are trained on the English Google News 200B corpus and compute a vector of 128 dimensions for the larger model and 50 dimensions for the smaller model.

TensorFlow Hub provides a number of modules that convert sentences into embeddings, such as the Universal Sentence Encoder, NNLM, BERT, and Wikiwords.

Transfer learning makes it possible to save training resources and to achieve good model generalization even when training on a small dataset. In this project, we will demonstrate this by training with several different TF-Hub modules.
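As a rough picture of what reusing a pre-trained module means, the NumPy sketch below trains a logistic-regression head on top of an embedding table, once with the table frozen and once with it fine-tuned. It is a toy under stated assumptions: the "pre-trained" table is random, the dataset is synthetic, and the function train and its trainable flag are our own stand-ins that only mimic the role of trainable in hub.KerasLayer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a pre-trained embedding table:
# 20 token ids, 8 dimensions. Random here, learned in practice.
pretrained = rng.normal(size=(20, 8))

# Synthetic dataset: each "question" is 5 token ids, and the label says
# whether at least 3 of them are low ids (< 10).
tokens = rng.integers(0, 20, size=(200, 5))
labels = ((tokens < 10).sum(axis=1) >= 3).astype(float)


def train(embedding, tokens, labels, trainable=False, lr=0.5, steps=200):
    """Train a logistic-regression head on averaged embeddings."""
    emb = embedding.copy()
    w = np.zeros(emb.shape[1])
    b = 0.0
    eps = 1e-12
    losses = []
    for _ in range(steps):
        feats = emb[tokens].mean(axis=1)                 # (n, dim)
        probs = 1.0 / (1.0 + np.exp(-(feats @ w + b)))
        losses.append(-np.mean(labels * np.log(probs + eps)
                               + (1 - labels) * np.log(1 - probs + eps)))
        err = (probs - labels) / len(labels)             # dLoss/dlogit
        if trainable:
            # Fine-tuning: the gradient also flows into the table.
            grad_feats = np.outer(err, w) / tokens.shape[1]
            np.add.at(emb, tokens, -lr * grad_feats[:, None, :])
        # The head is always updated.
        w -= lr * (feats.T @ err)
        b -= lr * err.sum()
    return emb, losses


frozen_emb, frozen_losses = train(pretrained, tokens, labels, trainable=False)
tuned_emb, tuned_losses = train(pretrained, tokens, labels, trainable=True)

print("frozen table unchanged:  ", np.array_equal(frozen_emb, pretrained))
print("fine-tuned table changed:", not np.array_equal(tuned_emb, pretrained))
print("final losses (frozen, fine-tuned):", frozen_losses[-1], tuned_losses[-1])
```

With the table frozen, only the small head is learned, which is why training is cheap. Fine-tuning updates far more parameters and can fit the task more closely, at the cost of more computation and a higher risk of overfitting on a small dataset.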


module_url = ""  #@param [] {allow-input: true}

Tasks 5 & 6: Define Function to Build and Compile Models


Define a function that builds, compiles, and trains a model, and returns the History object.

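def train_and_evaluate_model(module_url, embed_size, name, trainable=False):
  # Wrap the TF Hub module as a Keras layer that maps raw strings to vectors.
  hub_layer = hub.KerasLayer(module_url, input_shape=[], output_shape=[embed_size], dtype=tf.string, trainable=trainable)

  model = tf.keras.models.Sequential([
                                      hub_layer,
                                      tf.keras.layers.Dense(255, activation='relu'),
                                      tf.keras.layers.Dense(64, activation='relu'),
                                      tf.keras.layers.Dense(1, activation='sigmoid')
  ])

  # Assumption: the optimizer settings and epoch count below are placeholders.
  model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
  model.summary()

  history = model.fit(train_df['question_text'], train_df['target'],
                      epochs=100,
                      validation_data=(valid_df['question_text'], valid_df['target']),
                      callbacks=[tfdocs.modeling.EpochDots(),
                                 tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=2, mode='min'),
                                 tf.keras.callbacks.TensorBoard(logdir/name)],
                      verbose=0)
  return history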
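The EarlyStopping callback above uses monitor='val_loss', patience=2, and mode='min'. The rule behind those settings can be sketched in plain Python. This is a simplified illustration, not the Keras implementation: the helper name early_stopping_epoch is ours, it ignores options such as min_delta and restore_best_weights, and the loss values are made up.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the 0-based epoch at which training would stop, or None.

    Training stops once the validation loss has failed to improve on
    its best value for `patience` consecutive epochs.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:        # mode='min': lower is better
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Hypothetical validation losses, not taken from the runs in this post.
stop_at = early_stopping_epoch([0.50, 0.40, 0.41, 0.42, 0.30])
print(stop_at)
```

In this made-up sequence, training would stop before reaching the final, lower loss. That is the trade-off of a small patience value: it saves time but can stop on a temporary plateau.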


Task 7: Train Various Text Classification Models



histories = {}

module_url = ""

histories['gnews-swivel-20dim'] = train_and_evaluate_model(module_url, embed_size=20, name='gnews-swivel-20dim')


Model: "sequential_15"
 Layer (type)                Output Shape              Param #   
 keras_layer_15 (KerasLayer)  (None, 128)              124642688 
 dense_45 (Dense)            (None, 255)               32895     
 dense_46 (Dense)            (None, 64)                16384     
 dense_47 (Dense)            (None, 1)                 65        
Total params: 124,692,032
Trainable params: 49,344
Non-trainable params: 124,642,688

Epoch: 0, accuracy:0.9309,  loss:0.3219,  val_accuracy:0.9381,  val_loss:0.2110,  


histories['nnlm-en-dim50'] = train_and_evaluate_model(module_url, embed_size=50, name='nnlm-en-dim50')


Model: "sequential_16"
 Layer (type)                Output Shape              Param #   
 keras_layer_16 (KerasLayer)  (None, 50)               48190600  
 dense_48 (Dense)            (None, 255)               13005     
 dense_49 (Dense)            (None, 64)                16384     
 dense_50 (Dense)            (None, 1)                 65        
Total params: 48,220,054
Trainable params: 29,454
Non-trainable params: 48,190,600

Epoch: 0, accuracy:0.9338,  loss:0.3285,  val_accuracy:0.9381,  val_loss:0.2246,  


histories['nnlm-en-dim128'] = train_and_evaluate_model(module_url, embed_size=128, name='nnlm-en-dim128')


Model: "sequential_17"
 Layer (type)                Output Shape              Param #   
 keras_layer_17 (KerasLayer)  (None, 128)              124642688 
 dense_51 (Dense)            (None, 255)               32895     
 dense_52 (Dense)            (None, 64)                16384     
 dense_53 (Dense)            (None, 1)                 65        
Total params: 124,692,032
Trainable params: 49,344
Non-trainable params: 124,642,688

Epoch: 0, accuracy:0.9309,  loss:0.3167,  val_accuracy:0.9381,  val_loss:0.2104,  
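The "Trainable params" figures in these summaries can be checked by hand. Because the hub layer is frozen (trainable=False), only the three Dense layers are trained, and a Dense layer has inputs × units weights plus one bias per unit. The helper names below are ours.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units


def head_params(embed_size, hidden=(255, 64), n_outputs=1):
    """Trainable parameters of the Dense stack on top of the embedding."""
    sizes = [embed_size, *hidden, n_outputs]
    return sum(dense_params(a, b) for a, b in zip(sizes, sizes[1:]))


for embed_size in (50, 128):
    print(embed_size, head_params(embed_size))
```

For a 50-dimensional embedding this gives 13,005 + 16,384 + 65 = 29,454, and for a 128-dimensional embedding 32,895 + 16,384 + 65 = 49,344, matching the summaries above. The embedding itself accounts for the much larger non-trainable count.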

Task 8: Compare Accuracy and Loss Curves


plt.rcParams['figure.figsize'] = (12, 8)
plotter = tfdocs.plots.HistoryPlotter(metric = 'accuracy')
plotter.plot(histories)  # draw one curve per entry in histories
plt.xlabel("Epochs")
plt.legend(bbox_to_anchor=(1.0, 1.0), loc='upper left')
plt.title("Accuracy Curves for Models")
plt.show()



Plot the loss

plotter = tfdocs.plots.HistoryPlotter(metric = 'loss')
plotter.plot(histories)  # draw one curve per entry in histories
plt.xlabel("Epochs")
plt.legend(bbox_to_anchor=(1.0, 1.0), loc='upper left')
plt.title("Loss Curves for Models")
plt.show()



Task 9: Fine-tune Model from TF Hub

Here we fine-tune the model and look at the training results.

histories['gnews-swivel-20dim-finetuned'] = train_and_evaluate_model(module_url,
                                                                     embed_size=20,
                                                                     name='gnews-swivel-20dim-finetuned',
                                                                     trainable=True)




Task 10: Train Bigger Models and Visualize Metrics with TensorBoard



histories['universal-sentence-encoder'] = train_and_evaluate_model(module_url,
                                                                   embed_size=512,
                                                                   name='universal-sentence-encoder')


Model: "sequential_22"
 Layer (type)                Output Shape              Param #   
 keras_layer_22 (KerasLayer)  (None, 512)              256797824 
 dense_66 (Dense)            (None, 255)               130815    
 dense_67 (Dense)            (None, 64)                16384     
 dense_68 (Dense)            (None, 1)                 65        
Total params: 256,945,088
Trainable params: 147,264
Non-trainable params: 256,797,824

Epoch: 0, accuracy:0.9377,  loss:0.2853,  val_accuracy:0.9381,  val_loss:0.1698,  


histories['universal-sentence-encoder-large'] = train_and_evaluate_model(module_url,
                                                                         embed_size=512,
                                                                         name='universal-sentence-encoder-large')


Model: "sequential_23"
 Layer (type)                Output Shape              Param #   
 keras_layer_23 (KerasLayer)  (None, 512)              147354880 
 dense_69 (Dense)            (None, 255)               130815    
 dense_70 (Dense)            (None, 64)                16384     
 dense_71 (Dense)            (None, 1)                 65        
Total params: 147,502,144
Trainable params: 147,264
Non-trainable params: 147,354,880

Epoch: 0, accuracy:0.9002,  loss:0.3592,  val_accuracy:0.9381,  val_loss:0.1811,  



Loss comparison


%load_ext tensorboard
%tensorboard --logdir {logdir}




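How do you browse TensorFlow Hub?

Enter the TensorFlow Hub address in your browser to open the hub.

TensorFlow Hub provides models that can be fine-tuned: you only need to set trainable=True.

In this project, we also walked through the process of training several different models.

Finally, why does transfer learning work?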

Here are a few reasons I could think of:

  • Many NLP tasks share common knowledge about language (linguistic representations, structural similarities, syntax, semantics…).
  • Annotated data is rare, so make use of as much supervision as possible. If you can, combine the datasets used for several tasks into one much bigger dataset; bigger datasets are generally better for deep learning models.
  • Unlabelled data is abundant (e.g. on the world wide web) and one should try to use as much of it as possible.
  • Empirically, transfer learning has produced state-of-the-art results for many supervised NLP tasks (e.g. classification, information extraction, Q&A, etc.).

Source: Snehan Kekre

Added: 2022-06-03 23:58:52  Updated: 2022-06-03 23:59:00
