The goal of Dialog State Tracking (DST) [NLP-DST3] is to build a representation of the status of an ongoing conversation, i.e. a sequence of utterances exchanged between the dialog participants. In other words, the goal of a DST system is to capture user goals and intentions and encode them as a set of slots along with the corresponding values. DST is considered an important module in most goal-oriented dialogue systems.


Fig. 1: An example of a multi-domain dialog along with the associated state tracking (source: [NLP-DST4])

In this tutorial we will focus on the multi-domain dialogue MultiWOZ dataset [NLP-DST1] and show how one can train TRADE [NLP-DST4], one of the recent state-of-the-art models. The multi-domain setting introduces several challenges, the most important of which stems from the need for multi-turn mapping. In a single-turn mapping scenario, the (domain, slot, value) triplet can be inferred from a single turn. In the multi-turn setting this assumption does not hold, and the DST system must infer the triplets from multiple turns, possibly spanning several different domains.
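
To make the multi-turn mapping concrete, below is a minimal sketch, in plain Python, of how a dialogue state can be represented and accumulated across turns. The utterances, slot names, and values are invented purely for illustration.

# Illustrative only: a dialogue state represented as {(domain, slot): value},
# updated turn by turn as new (domain, slot, value) triplets are inferred.
state = {}

def update_state(state, turn_triplets):
    """Merge the (domain, slot, value) triplets inferred from one turn."""
    new_state = dict(state)
    for domain, slot, value in turn_triplets:
        new_state[(domain, slot)] = value
    return new_state

# Turn 1: "I need a cheap restaurant in the centre."
state = update_state(state, [("restaurant", "pricerange", "cheap"),
                             ("restaurant", "area", "centre")])

# Turn 2: "Also book a taxi to take me there." -- the taxi destination must be
# inferred from the restaurant discussed in earlier turns (multi-turn mapping).
state = update_state(state, [("taxi", "destination", "restaurant from turn 1")])
print(state)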

The MultiWOZ Dataset

The Multi-Domain Wizard-of-Oz dataset (MultiWOZ) is a collection of human-to-human conversations spanning 7 distinct domains and containing over 10,000 dialogues. The original MultiWOZ 2.0 dataset was introduced in [NLP-DST1]. However, in this tutorial we will use MultiWOZ 2.1 [NLP-DST2], an updated version of MultiWOZ 2.0 that fixes several issues with the original dataset (errors in states, utterances, value canonicalization, etc.). Our model can also be trained on MultiWOZ 2.0.

The MultiWOZ dataset covers the following domains:
  1. restaurant

  2. hotel

  3. attraction

  4. taxi

  5. train

  6. hospital

  7. police

It also covers the following slots:
  • inform (∗)

  • address (∗)

  • postcode (∗)

  • phone (∗)

  • name (1234)

  • no of choices (1235)

  • area (123)

  • pricerange (123)

  • type (123)

  • internet (2)

  • parking (2)

  • stars (2)

  • open hours (3)

  • departure (45)

  • destination (45)

  • leave after (45)

  • arrive by (45)

  • no of people (1235)

  • reference no. (1235)

  • trainID (5)

  • ticket price (5)

  • travel time (5)

  • department (7)

  • day (1235)

  • no of days (123).

Please note that some of the actions and slots are associated with particular domain(s), indicated by the numbers in parentheses that refer to the domain list above, whereas some are universal, i.e. domain independent. The latter are denoted with (∗).

MultiWOZ offers 10,438 dialogues with 115,434 turns in total. Dialogues are generally classified into single- and multi-domain dialogues. Dialogue length varies from 1 to 31 turns, with around 70% of dialogues having more than 10 turns. The average number of turns is 8.93 for single-domain and 15.39 for multi-domain dialogues.

Each dialogue consists of a goal, multiple user and system utterances, and, for every turn, a belief state and a set of dialogue acts with slots. Additionally, each dialogue is accompanied by a task description. Moreover, the dataset contains both system and user dialogue act annotations (the latter were introduced in MultiWOZ 2.1).
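
As a rough illustration, a single annotated turn can be thought of as the structure below. The field names are simplified placeholders, not the exact MultiWOZ JSON schema.

# A simplified, hypothetical view of one annotated dialogue turn; the real
# MultiWOZ annotation format differs in naming and nesting.
example_turn = {
    "user_utterance": "I am looking for a cheap hotel with free parking.",
    "system_utterance": "There are several options. Which area do you prefer?",
    "belief_state": {("hotel", "pricerange"): "cheap",
                     ("hotel", "parking"): "yes"},
    "dialogue_acts": [("hotel", "inform", "pricerange", "cheap"),
                      ("hotel", "inform", "parking", "yes")],
}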


The TRADE Model

The TRAnsferable Dialogue statE generator (TRADE) [NLP-DST4] is a model designed specifically for multi-domain task-oriented dialogue state tracking. The model generates dialogue states from utterances and dialogue history. It learns embeddings for domains and slots, and also benefits from a copy mechanism that facilitates knowledge transfer between domains. This enables the model to predict (domain, slot, value) triplets not encountered during training in a given domain.


Fig. 2: Architecture of the TRADE model (source: [NLP-DST4])

The model is composed of three main components:

  • Utterance Encoder,

  • Slot Gate, and

  • State Generator.

The utterance encoder is a bi-directional Gated Recurrent Unit (GRU) that returns both per-word context vectors and an aggregated context vector encoding the whole dialogue history.
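
A minimal PyTorch sketch of such an encoder is shown below. The hidden size and the way the two directions are combined are illustrative assumptions, not the exact NeMo implementation.

import torch
import torch.nn as nn

class UtteranceEncoder(nn.Module):
    """Bi-directional GRU over the concatenated dialogue history."""
    def __init__(self, vocab_size: int, hidden_size: int = 400):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size, batch_first=True,
                          bidirectional=True)

    def forward(self, token_ids: torch.Tensor):
        emb = self.embedding(token_ids)            # (batch, seq, hidden)
        outputs, hidden = self.gru(emb)            # outputs: (batch, seq, 2*hidden)
        half = outputs.size(-1) // 2
        # Sum the forward and backward directions so every vector keeps size `hidden`.
        word_contexts = outputs[..., :half] + outputs[..., half:]
        dialogue_context = hidden[0] + hidden[1]   # (batch, hidden)
        return word_contexts, dialogue_context

# Usage: encode a toy batch of token ids.
enc = UtteranceEncoder(vocab_size=1000)
words, context = enc(torch.randint(0, 1000, (2, 12)))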

The state generator also uses a GRU to predict the value for each (domain, slot) pair. The generator employs soft-gated pointer-generator copying to combine a distribution over the vocabulary and a distribution over the dialogue history into a single output distribution.
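
The soft-gated copy mechanism itself can be sketched as follows. This is a schematic rather than the exact TRADE code, and the tensor names and shapes are assumptions: a scalar gate p_gen blends a vocabulary distribution with the attention distribution over history tokens scattered back into vocabulary space.

import torch

def soft_gated_copy(p_vocab: torch.Tensor,
                    history_attention: torch.Tensor,
                    history_token_ids: torch.Tensor,
                    p_gen: torch.Tensor) -> torch.Tensor:
    """Blend a vocabulary distribution with a copy distribution over the history.

    p_vocab:           (batch, vocab_size) distribution from the generator
    history_attention: (batch, history_len) attention weights over history tokens
    history_token_ids: (batch, history_len) vocabulary ids of the history tokens
    p_gen:             (batch, 1) soft gate in [0, 1]
    """
    p_copy = torch.zeros_like(p_vocab)
    # Scatter the attention mass onto the vocabulary ids of the history tokens.
    p_copy.scatter_add_(1, history_token_ids, history_attention)
    return p_gen * p_vocab + (1.0 - p_gen) * p_copy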

Finally, the slot gate is a simple classifier that maps a context vector taken from the encoder hidden states to a probability distribution over three classes: ptr, none, and dontcare.
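
A minimal sketch of such a gate is given below: a plain linear classifier over the context vector, with the hidden size chosen as an assumption.

import torch
import torch.nn as nn

class SlotGate(nn.Module):
    """Classify each (domain, slot) context vector into {ptr, none, dontcare}."""
    GATES = ("ptr", "none", "dontcare")

    def __init__(self, hidden_size: int = 400):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, len(self.GATES))

    def forward(self, context_vector: torch.Tensor) -> torch.Tensor:
        # Returns a probability distribution over the three gate classes.
        return torch.softmax(self.classifier(context_vector), dim=-1)

gate = SlotGate()
probs = gate(torch.randn(2, 400))   # batch of slot contexts -> (2, 3)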

Data Pre-processing

First, you need to download MULTIWOZ2.1.zip from the MultiWOZ 2.1 project website; it contains the MultiWOZ 2.1 dataset. Alternatively, you can download the MULTIWOZ2.zip archive from MultiWOZ 2.0, which contains the older version of this dataset.

Next, we need to preprocess and reformat the dataset, which results in splitting the data into three parts:

  • training split (8242 dialogs in the train_dials.json file)

  • development/validation split (1000 dialogs in the dev_dials.json file)

  • test split (999 dialogs in the test_dials.json file)

In order to preprocess the MultiWOZ dataset you can use the provided process_multiwoz.py script:

cd examples/nlp/dialogue_state_tracking/data/
python process_multiwoz.py \
    --source_data_dir <path to MultiWOZ dataset> \
    --target_data_dir <path to store the processed data>


The --source_data_dir argument specifies the folder into which you have copied and extracted the data. The script stores the processed dataset in the folder given by --target_data_dir. Both the MultiWOZ 2.0 and MultiWOZ 2.1 datasets can be processed with the same script.
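
Once preprocessing has finished, you can quickly sanity-check the resulting files. The snippet below only assumes that each *_dials.json file stores a JSON list of dialogues; print the keys to inspect the exact schema produced on your machine.

import json
from pathlib import Path

# Hypothetical location: use the same folder you passed as --target_data_dir.
data_dir = Path("<path to store the processed data>")

with open(data_dir / "train_dials.json") as f:
    train_dials = json.load(f)

print("number of training dialogues:", len(train_dials))   # expected: 8242
print("fields of the first dialogue:", sorted(train_dials[0].keys()))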

Building the NeMo Graph

The NeMo training graph consists of the following six modules: a data layer, an encoder, a decoder, and three loss modules:

  • data_layer (nemo.collections.nlp.nm.data_layers.MultiWOZDataLayer)

  • encoder (nemo.backends.pytorch.common.EncoderRNN)

  • decoder (nemo.collections.nlp.nm.trainables.TRADEGenerator)

  • gate_loss_fn (nemo.backends.pytorch.common.losses.CrossEntropyLossNM)

  • ptr_loss_fn (nemo.collections.nlp.nm.losses.MaskedLogLoss)

  • total_loss_fn (nemo.collections.nlp.nm.losses.LossAggregatorNM)
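
Conceptually, the three loss modules compute a cross-entropy loss over the gate classes, a masked log loss over the generated value tokens, and their sum. The plain-PyTorch sketch below illustrates the idea only; it is not the NeMo modules themselves, and the tensor shapes are assumptions.

import torch
import torch.nn.functional as F

def trade_style_loss(gate_logits, gate_targets, ptr_log_probs, value_targets, value_mask):
    """Schematic loss aggregation: gate cross-entropy + masked pointer log loss.

    gate_logits:   (batch, num_slots, 3)            slot gate scores
    gate_targets:  (batch, num_slots)               gold gate classes
    ptr_log_probs: (batch, num_slots, len, vocab)   log-probs of generated value tokens
    value_targets: (batch, num_slots, len)          gold value token ids
    value_mask:    (batch, num_slots, len)          1 for real tokens, 0 for padding
    """
    gate_loss = F.cross_entropy(gate_logits.flatten(0, 1), gate_targets.flatten())
    token_nll = -ptr_log_probs.gather(-1, value_targets.unsqueeze(-1)).squeeze(-1)
    ptr_loss = (token_nll * value_mask).sum() / value_mask.sum().clamp(min=1)
    return gate_loss + ptr_loss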


Training

In order to train an instance of the TRADE model on the MultiWOZ dataset and evaluate it on the test data, simply run the dialogue_state_tracking_trade.py script with default parameters:

cd examples/nlp/dialogue_state_tracking
python dialogue_state_tracking_trade.py \
    --data_dir <path to the data> \
    --work_dir <path to store the experiment logs and checkpoints> \
    --eval_file_prefix <test or dev>

You may find the list of parameters in the example file and update them as you see fit. By default, the script trains the model for 10 epochs on a single GPU. The police and hospital domains are excluded from training by default, as they do not exist in the development set. The list of domains can be updated in the example script.

Evaluating Checkpoints

By default, a folder named “checkpoints” is created under the working folder specified by --work_dir, and checkpoints are stored in it. To evaluate a checkpoint on the test or dev set, you may run the same script, passing --checkpoint_dir and setting --num_epochs to zero to skip training:

cd examples/nlp/dialogue_state_tracking
python dialogue_state_tracking_trade.py \
    --data_dir <path to the data> \
    --checkpoint_dir <path to checkpoint folder> \
    --eval_file_prefix <test or dev> \
    --eval_batch_size <batch size for evaluation> \
    --num_epochs 0

Metrics and Results

In the following table, we compare the results achieved by our TRADE model implementation with the results reported in the original paper [NLP-DST4]. We trained our models for 10 epochs on a single GPU with 16 GB of memory. As the authors reported results only on the MultiWOZ 2.0 dataset, we ran the original implementation on the MultiWOZ 2.1 dataset and report those results as well.

We used the same parameters as the original implementation, but there are some differences between our implementation and the original one. The main difference is that our model does not use pre-trained embeddings, which does not seem to affect the performance of the model. Another difference is that we use SquareAnnealing as the learning rate policy instead of a fixed learning rate. Additionally, we create the vocabulary based on the training data only, while the default in the original implementation is to build the vocabulary from all the data, including the test and development sets. The main reason behind our model's improvement in accuracy is the better learning rate policy; when we used a fixed learning rate in our implementation, we obtained results similar to the original one.

We also made some improvements to the implementation of the model to speed up training, which makes our implementation significantly faster than the original one. Additionally, NeMo supports multi-GPU training, which enables even faster training. Note that the learning rate needs to be increased for multi-GPU training to account for the larger effective batch size.

Following [NLP-DST4], we used two main metrics to evaluate the model performance:

  • Joint Goal Accuracy compares the predicted dialogue states to the ground truth at each dialogue turn, and the output is considered correct if and only if all the predicted values exactly match the ground truth values.

  • Slot Accuracy independently compares each (domain, slot, value) triplet to its ground truth label.
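
Both metrics can be computed directly from the predicted and ground-truth states. Below is a minimal sketch, assuming each state is represented as a dict mapping (domain, slot) pairs to values, with unfilled slots treated as "none".

from typing import Dict, List, Tuple

State = Dict[Tuple[str, str], str]

def joint_goal_and_slot_accuracy(predictions: List[State],
                                 ground_truths: List[State],
                                 all_slots: List[Tuple[str, str]]):
    """Compute joint goal accuracy and slot accuracy over a list of turns."""
    joint_correct = 0
    slot_correct = 0
    for pred, gold in zip(predictions, ground_truths):
        # Joint goal: the whole predicted state must match the ground truth exactly.
        if pred == gold:
            joint_correct += 1
        for slot in all_slots:
            # Slot accuracy: each slot is checked independently ("none" if unfilled).
            if pred.get(slot, "none") == gold.get(slot, "none"):
                slot_correct += 1
    num_turns = len(ground_truths)
    return joint_correct / num_turns, slot_correct / (num_turns * len(all_slots))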

TRADE implementations             MultiWOZ 2.0      MultiWOZ 2.1
Original [NLP-DST4]
NeMo's Implementation of TRADE

You may find the checkpoints for the trained models on MultiWOZ 2.0 and MultiWOZ 2.1 datasets on NGC:


During training, the TRADE model uses an additional supervisory signal, enforcing the Slot Gate to properly predict special values like dontcare or none for the slots. The process_multiwoz.py script extracts these additional labels from the dataset, and the dialogue_state_tracking_trade.py script reports the Gating Accuracy as well.



References

[NLP-DST1] Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Inigo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Gašić. MultiWOZ - a large-scale multi-domain Wizard-of-Oz dataset for task-oriented dialogue modelling. arXiv preprint arXiv:1810.00278, 2018.

[NLP-DST2] Mihail Eric, Rahul Goel, Shachi Paul, Abhishek Sethi, Sanchit Agarwal, Shuyang Gao, and Dilek Hakkani-Tur. MultiWOZ 2.1: multi-domain dialogue state corrections and state tracking baselines. arXiv preprint arXiv:1907.01669, 2019.

[NLP-DST3] Matthew Henderson. Machine learning for dialog state tracking: a review. research.google, 2015.

[NLP-DST4] Chien-Sheng Wu, Andrea Madotto, Ehsan Hosseini-Asl, Caiming Xiong, Richard Socher, and Pascale Fung. Transferable multi-domain state generator for task-oriented dialogue systems. arXiv preprint arXiv:1905.08743, 2019.