Publications

CONFERENCE (INTERNATIONAL) On Text Localization in End-to-End OCR-Free Document Understanding Transformer without Text Localization Supervision

Geewook Kim (NAVER Cloud), Shuhei Yokoo, Sukmin Seo (NAVER Cloud), Atsuki Osanai, Yamato Okamoto, Youngmin Baek (NAVER Cloud)

10th International Workshop on Camera-Based Document Analysis and Recognition (CBDAR2023)

August 25, 2023

This paper presents a simple yet effective approach for weakly supervised text localization in end-to-end visual document understanding (VDU) models. The traditional approach in VDU is to use off-the-shelf OCR engines in conjunction with natural language understanding models. However, to simplify the VDU pipeline and improve efficiency, recent research has focused on OCR-free document understanding transformers. A limitation of these models is that they do not provide the location of the text to the user. To alleviate this, we propose a method for text localization in OCR-free models that does not require any additional supervision, such as bounding box annotations. The method is based on properties of the attention mechanism and outputs text areas with accuracy competitive with supervised methods. The proposed method can be easily applied to most existing OCR-free models, making it an attractive solution for practitioners in the field. We validate the method through experiments on document parsing benchmarks, and the results demonstrate its effectiveness in generalizing to various camera-captured document images, such as receipts and business cards. The implementation will be available at https://github.com/clovaai/donut.
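The abstract does not spell out how attention is turned into text regions, so the following is only an illustrative sketch of the general idea, not the paper's procedure. It assumes that, for each decoded token, a vector of attention weights over the image patch grid is available, and it thresholds those weights relative to their maximum to obtain a bounding box. The function name `attention_to_box`, the `rel_threshold` parameter, and the toy grid and image sizes are all hypothetical.

```python
import numpy as np

def attention_to_box(attn, grid_hw, image_hw, rel_threshold=0.5):
    """Turn one token's attention over image patches into a pixel box.

    attn          : 1-D sequence of length grid_h * grid_w (assumed to be
                    non-negative attention weights for a single token).
    grid_hw       : (grid_h, grid_w), the patch grid shape.
    image_hw      : (img_h, img_w), the input image size in pixels.
    rel_threshold : keep patches whose weight is at least this fraction
                    of the maximum weight (an illustrative choice).
    Returns (x0, y0, x1, y1) in pixel coordinates.
    """
    grid_h, grid_w = grid_hw
    img_h, img_w = image_hw
    heat = np.asarray(attn, dtype=float).reshape(grid_h, grid_w)

    # Keep only the patches that received relatively high attention.
    mask = heat >= rel_threshold * heat.max()
    rows = np.where(mask.any(axis=1))[0]
    cols = np.where(mask.any(axis=0))[0]

    # Map the patch-index extent back to pixel coordinates.
    cell_h = img_h / grid_h
    cell_w = img_w / grid_w
    x0 = int(cols[0] * cell_w)
    y0 = int(rows[0] * cell_h)
    x1 = int((cols[-1] + 1) * cell_w)
    y1 = int((rows[-1] + 1) * cell_h)
    return (x0, y0, x1, y1)

# Toy example: a 4x4 patch grid over a 400x400 image, with attention
# concentrated on two neighbouring patches in the second row.
attn = np.full(16, 0.01)
attn[1 * 4 + 2] = 0.4
attn[1 * 4 + 3] = 0.5
box = attention_to_box(attn, (4, 4), (400, 400))
print(box)  # (200, 100, 400, 200)
```

A box obtained this way can be no finer than the patch grid, and how well it fits the text depends on the threshold. Both are properties of this sketch and say nothing about the published method.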

Paper: On Text Localization in End-to-End OCR-Free Document Understanding Transformer without Text Localization Supervision