Visual Question Answering (VQA) is the task of answering open-ended natural-language questions about images. Modern VQA tasks require reading and reasoning over both images and text. We propose DiagNet, an attention-based neural network model that effectively combines evidence from question text, image objects, and text appearing in images. Within DiagNet, a novel multi-task training strategy incorporates answer-type evidence in a hybrid fusion. We conduct comprehensive evaluations on multiple VQA tasks and achieve competitive results.
- We propose a new neural architecture, DiagNet, that reads text in images and answers questions by reasoning over the text, objects, and question.
- We propose a novel multi-task training strategy that combines answer-type evidence in a hybrid fusion.
- We conduct comprehensive evaluations on multiple VQA tasks and achieve competitive results.
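The multi-task strategy above can be sketched as a joint objective over the answer and its answer type. The weighting factor `alpha` and the exact loss form below are illustrative assumptions, not the implementation from the report:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, targets):
    # mean negative log-likelihood of the target classes
    p = softmax(logits)
    return -np.log(p[np.arange(len(targets)), targets]).mean()

def multitask_loss(answer_logits, type_logits, answer_gt, type_gt, alpha=0.5):
    # Hypothetical hybrid objective: answer loss plus a weighted
    # auxiliary answer-type loss (alpha is an assumed hyperparameter).
    return cross_entropy(answer_logits, answer_gt) + alpha * cross_entropy(type_logits, type_gt)
```

With `alpha=0` the objective reduces to plain answer classification, so the answer-type head can be ablated by a single hyperparameter.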
The neural network consists of three branches: the first row corresponds to the image object branch, the second row to the question branch, and the third row to the OCR token branch.
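A minimal sketch of how the three branches could be fused is shown below. The feature dimensions, the dot-product attention, and the simple concatenation are illustrative assumptions; the actual model uses learned attention and fusion layers:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(query, keys):
    # Question-guided attention: weight each candidate feature by its
    # dot-product similarity to the question, then take a weighted sum.
    weights = softmax(keys @ query)
    return weights @ keys

d = 8  # assumed common feature dimension
obj_feats = rng.standard_normal((36, d))  # image object branch (detected regions)
q_feat = rng.standard_normal(d)           # question branch (encoded question)
ocr_feats = rng.standard_normal((10, d))  # OCR token branch (text read in the image)

obj_ctx = attend(q_feat, obj_feats)  # question-attended object evidence
ocr_ctx = attend(q_feat, ocr_feats)  # question-attended OCR evidence
fused = np.concatenate([obj_ctx, q_feat, ocr_ctx])  # combined evidence vector
```

The fused vector would then feed the answer (and answer-type) classifiers.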
Experimental Results (see the report for details)
Ablation study on the TextVQA v0.5 dataset
| Model | Accuracy (%) |
| --- | --- |
| DiagNet w/o BUTD & OCR | 11.42 |
| DiagNet w/o OCR | 11.25 |
| DiagNet w/o BUTD | 18.22 |
Performance (in %) on the TextVQA v0.5 dataset