opensource.google.com

Menu

arXiv LaTeX cleaner: safer and easier open source research papers

Friday, February 22, 2019

Open source is usually associated with code behind utilities and applications, though you can find it in many other places: such as the LaTeX source code that describes the PDFs of scientific papers.

As an example, the following source code:


Generates this PDF when compiled using pdflatex:
You can see a huge repository of such open source code at arXiv.org, an open access repository of scientific papers currently containing about 1.5 million entries (140,616 uploads in 2018). One can not only download all papers in PDF format, but also obtain the source code to regenerate them and freely reuse any of their parts.

Open sourcing LaTeX code, however, comes with its risks and challenges. We’ve built and released the code of arXiv LaTeX cleaner to remedy some of these.

Scrubbing the Code

The main risk one faces when sharing LaTeX code with the world is accidentally releasing private information, primarily through commented code left over in the file itself.

While authors put a lot of effort into polishing the final PDF, the code isn’t usually cleaned up and is left with many pieces of text that don’t actually appear in the PDF. Things like, “I do not see why the following statement should be correct,” or “Look, I’m citing you!,” make it into arXiv for everyone to see. This happens so often there’s even a Twitter bot that finds and publishes them!

Cleaning up this commented out code manually is laborious, so arXiv LaTeX cleaner automatically removes it for you.

Private information can also be found in the many auxiliary files that LaTeX generates when the code is compiled. Some of them are needed in arXiv (e.g., .bbl files), some of them are not: arXiv LaTeX cleaner will delete the unneeded ones and keep the rest automatically.

Cleaning and Autoscaling Images

Challenges also come our way when preparing the code to submit to arXiv: one needs to upload a ZIP file smaller than 10 MBytes. With high resolution pictures and figures, it’s easy to go beyond the limit.

Manually resizing images and deleting images that aren’t actually in the final version is time consuming and cumbersome, so arXiv LaTeX cleaner does that automatically, too. If there’s a very intricate figure you’d like to keep in high resolution, you can specify a list of images and their expected resolution.

We hope that, by making open sourcing research papers faster and safer, arXiv LaTeX cleaner will help even more researchers embrace open access and make their work freely available.

arXiv LaTeX cleaner itself is open source, so you can adapt it to your needs. If you think your adaptation would be useful for others, we’d love your contributions, too.

By Jordi Pont-Tuset, Machine Perception team
.