Tokenization Repair in the Presence of Spelling Errors

NAACL 2021: Reproducibility Material for Submission #344

Web demo

Try our tokenization repair methods in the interactive web demo.

Visit the web demo

Evaluation web application

Click through our benchmarks and get a visualisation of the results in the evaluation web app.

Evaluation web app


The data contains the benchmarks described in the paper, as well as trained models and predicted sequences from all our methods (1 GB compressed). In addition, you can download our training data (6 GB compressed).

Download data

Download training data


The code comes with a Docker setup for easy reproducibility. A readme file in the code directory explains how to set up the Docker container. If you are not familiar with Docker, please visit

The Docker container allows you to try our methods interactively, run them on our benchmarks (or on yours!), and run the evaluation. Make targets simplify the program calls and give further explanations. It is really fun and simple, we promise!
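To give a rough idea of what an evaluation of predicted sequences can look like, here is a minimal Python sketch. It is not the code shipped in the download: the function names, the example sentences, and the choice of exact-match sequence accuracy as the metric are assumptions made for illustration. The sketch also checks that a prediction differs from its input only in where the spaces are, since tokenization repair is meant to fix spacing and leave spelling errors untouched.

```python
def strip_spaces(text: str) -> str:
    """Remove all space characters, keeping the other characters in order."""
    return text.replace(" ", "")


def is_whitespace_only_edit(corrupt: str, predicted: str) -> bool:
    """Return True if `predicted` differs from `corrupt` only in its spaces."""
    return strip_spaces(corrupt) == strip_spaces(predicted)


def sequence_accuracy(predicted: list[str], ground_truth: list[str]) -> float:
    """Fraction of predicted sequences that match the ground truth exactly."""
    if len(predicted) != len(ground_truth):
        raise ValueError("predicted and ground_truth must have the same length")
    if not ground_truth:
        return 0.0
    correct = sum(p == g for p, g in zip(predicted, ground_truth))
    return correct / len(ground_truth)


# Hypothetical example data (not taken from the benchmarks).
# The misspelling "Tokenizaton" is intentional: only the spacing is repaired.
corrupt = ["Th e cat sat onthe mat.", "Tokenizaton re pair"]
predicted = ["The cat sat on the mat.", "Tokenizaton re pair"]
ground_truth = ["The cat sat on the mat.", "Tokenizaton repair"]

accuracy = sequence_accuracy(predicted, ground_truth)
all_whitespace_only = all(
    is_whitespace_only_edit(c, p) for c, p in zip(corrupt, predicted)
)
print(f"sequence accuracy: {accuracy:.2f}")
print(f"all predictions change only spaces: {all_whitespace_only}")
```

In this made-up example the first sequence is repaired correctly and the second is left unchanged, so one of the two predictions matches its ground truth. The Make targets in the Docker container remain the reference for how the evaluation is actually run.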

The latest version is 1.1.2 (December 2, 2020).

Download code