DiagNet

[poster] [report] [code]
Course Project at UM (Jan. - May 2019): Mark Jin, Shengyi Qian, Xiaoyang Shen, Yi Wen, Xinyi Zheng
EECS 498-12: Deep Learning, instructed by Honglak Lee

Visual Question Answering (VQA) is the task of answering open-ended natural language questions about images. Modern VQA tasks require reading and reasoning over both images and text. We propose DiagNet, an attention-based neural network model that effectively combines multiple sources of evidence: text, images, and text in images. Within DiagNet, a novel multi-task training strategy combines answer-type evidence in a hybrid fusion. We conduct a comprehensive evaluation on multiple VQA tasks and achieve competitive results.

Contribution

  • We propose a new neural architecture, DiagNet, which can read text in images and answer questions by reasoning over the text, the objects, and the question.
  • We propose a novel multi-task training strategy that combines answer-type evidence in a hybrid fusion (a sketch follows this list).
  • We conduct a comprehensive evaluation on multiple VQA tasks and achieve competitive results.
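The following is a minimal PyTorch sketch of what such a multi-task head could look like, assuming the fused features feed both an answer classifier and an answer-type classifier whose losses are summed with a weighting factor. The module name, layer shapes, and loss weight are illustrative assumptions, not taken from the report.

```python
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskHead(nn.Module):
    """Hypothetical multi-task head: answer prediction + answer-type prediction."""
    def __init__(self, fused_dim, num_answers, num_answer_types, type_loss_weight=0.5):
        super().__init__()
        self.answer_head = nn.Linear(fused_dim, num_answers)     # scores over the answer vocabulary
        self.type_head = nn.Linear(fused_dim, num_answer_types)  # answer-type evidence (e.g. vocab answer vs. OCR answer)
        self.type_loss_weight = type_loss_weight                 # assumed weighting, not from the report

    def forward(self, fused, answer_target=None, type_target=None):
        answer_logits = self.answer_head(fused)
        type_logits = self.type_head(fused)
        loss = None
        if answer_target is not None and type_target is not None:
            # joint (multi-task) objective: answer loss plus weighted answer-type loss
            loss = F.cross_entropy(answer_logits, answer_target) \
                   + self.type_loss_weight * F.cross_entropy(type_logits, type_target)
        return answer_logits, type_logits, loss
```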

Architecture

The neural network consists of three branches. The first row of the architecture figure corresponds to the image object branch, the second row to the question branch, and the third row to the OCR token branch.
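Below is a hypothetical PyTorch sketch of a three-branch forward pass in this spirit: question-guided attention pools the object features and the OCR token features, and the pooled vectors are fused with the question representation before classification. The class name, feature dimensions, and the element-wise fusion are assumptions made for illustration, not the exact DiagNet design.

```python
import torch
import torch.nn as nn

class ThreeBranchVQA(nn.Module):
    """Sketch of a three-branch VQA model: objects, question, OCR tokens."""
    def __init__(self, obj_dim=2048, ocr_dim=300, q_vocab=20000,
                 q_dim=1024, hidden=1024, num_answers=3129):
        super().__init__()
        self.embed = nn.Embedding(q_vocab, 300)
        self.q_rnn = nn.GRU(300, q_dim, batch_first=True)   # question branch
        self.obj_proj = nn.Linear(obj_dim, hidden)           # image object branch (e.g. detector features)
        self.ocr_proj = nn.Linear(ocr_dim, hidden)           # OCR token branch (e.g. word embeddings)
        self.q_proj = nn.Linear(q_dim, hidden)
        self.obj_attn = nn.Linear(hidden, 1)
        self.ocr_attn = nn.Linear(hidden, 1)
        self.classifier = nn.Linear(hidden, num_answers)

    def attend(self, feats, q, attn_layer):
        # question-guided attention: score each region/token against the question
        scores = attn_layer(torch.tanh(feats + q.unsqueeze(1)))  # (B, N, 1)
        weights = torch.softmax(scores, dim=1)
        return (weights * feats).sum(dim=1)                      # weighted sum -> (B, hidden)

    def forward(self, obj_feats, ocr_feats, question_ids):
        _, q_h = self.q_rnn(self.embed(question_ids))
        q = self.q_proj(q_h.squeeze(0))                          # (B, hidden) question summary
        obj = self.attend(self.obj_proj(obj_feats), q, self.obj_attn)
        ocr = self.attend(self.ocr_proj(ocr_feats), q, self.ocr_attn)
        fused = q * obj + q * ocr                                # simple element-wise fusion (assumed)
        return self.classifier(fused)
```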

Experimental Results (see the report for details)

Ablation Study on the TextVQA v0.5 dataset

Model                     Accuracy (%)
DiagNet w/o BUTD & OCR    11.42
DiagNet w/o OCR           11.25
DiagNet-late              15.34
DiagNet-binary            15.86
DiagNet w/o BUTD          18.22
DiagNet-OCR               18.44
DiagNet                   18.77

Performance (in %) on the TextVQA v0.5 dataset

Model            Accuracy
Pythia           13.04
DiagNet          18.77
LoRRA+Pythia     26.56

Error Examples