opensource.google.com

arXiv LaTeX cleaner: safer and easier open source research papers

Friday, February 22, 2019

Open source is usually associated with the code behind utilities and applications, but you can find it in many other places, such as the LaTeX source code that describes the PDFs of scientific papers.

As an example, the following source code:


Generates this PDF when compiled using pdflatex:

You can see a huge repository of such open source code at arXiv.org, an open access repository of scientific papers currently containing about 1.5 million entries (140,616 uploads in 2018). One can not only download all papers in PDF format, but also obtain the source code to regenerate them and freely reuse any of their parts.

Open sourcing LaTeX code, however, comes with its risks and challenges. We’ve built and released the code of arXiv LaTeX cleaner to remedy some of these.

Scrubbing the Code

The main risk one faces when sharing LaTeX code with the world is accidentally releasing private information, primarily through commented code left over in the file itself.

While authors put a lot of effort into polishing the final PDF, the code isn’t usually cleaned up and is left with many pieces of text that don’t actually appear in the PDF. Things like, “I do not see why the following statement should be correct,” or “Look, I’m citing you!,” make it into arXiv for everyone to see. This happens so often there’s even a Twitter bot that finds and publishes them!

Cleaning up this commented out code manually is laborious, so arXiv LaTeX cleaner automatically removes it for you.
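
To make the idea concrete, here is a minimal sketch of comment removal. It is an illustration only, not the code arXiv LaTeX cleaner actually runs: the function name `strip_latex_comments` is hypothetical, and the sketch ignores verbatim environments and the `\\%` edge case.

```python
import re

# Matches an unescaped '%' and everything after it on the line.
# The lookbehind (?<!\\) skips '\%', which LaTeX renders as a literal percent sign.
_COMMENT = re.compile(r"(?<!\\)%.*")


def strip_latex_comments(source: str) -> str:
    """Return `source` with LaTeX line comments removed (hypothetical helper)."""
    cleaned = []
    for line in source.splitlines():
        if line.lstrip().startswith("%"):
            continue  # whole-line comment: drop the line entirely
        # Keep a bare '%' so LaTeX still ignores the line break, as it did before.
        cleaned.append(_COMMENT.sub("%", line))
    return "\n".join(cleaned)


if __name__ == "__main__":
    sample = "Text % secret note\n% whole line\nA 50\\% gain"
    print(strip_latex_comments(sample))
```

Keeping the bare `%` at the end of a line is a deliberate choice in this sketch: in LaTeX a trailing `%` suppresses the space that the line break would otherwise produce, so removing it could subtly change the typeset output.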

Private information can also be found in the many auxiliary files that LaTeX generates when the code is compiled. Some of them are needed in arXiv (e.g., .bbl files), some of them are not: arXiv LaTeX cleaner will delete the unneeded ones and keep the rest automatically.
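
As a rough sketch of that filtering step, the snippet below drops common LaTeX build byproducts by file extension and keeps everything else, including .bbl files. The extension list and the name `files_to_upload` are assumptions made for illustration; the tool's real rules may differ.

```python
from pathlib import PurePosixPath

# Assumed list of build byproducts that a submission does not need.
# '.bbl' is deliberately absent because arXiv needs it.
BUILD_BYPRODUCTS = {".aux", ".log", ".out", ".toc", ".blg", ".lof", ".lot"}


def files_to_upload(paths):
    """Return the paths whose extension is not a known build byproduct."""
    return [
        path
        for path in paths
        if PurePosixPath(path).suffix.lower() not in BUILD_BYPRODUCTS
    ]


if __name__ == "__main__":
    print(files_to_upload(["main.tex", "main.aux", "main.bbl", "main.log"]))
```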

Cleaning and Autoscaling Images

Challenges also come our way when preparing the code to submit to arXiv: one needs to upload a ZIP file smaller than 10 MBytes. With high resolution pictures and figures, it’s easy to go beyond the limit.

Manually resizing images and deleting images that aren’t actually in the final version is time consuming and cumbersome, so arXiv LaTeX cleaner does that automatically, too. If there’s a very intricate figure you’d like to keep in high resolution, you can specify a list of images and their expected resolution.
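
The resizing logic can be sketched as a small planning step that computes the target dimensions for each image, with per-image overrides for figures that should stay sharp. The default of 500 pixels and the names `target_size` and `plan_resizes` are hypothetical, chosen only for this example; the actual pixel resampling would be done by an image library.

```python
def target_size(width, height, max_side):
    """Scale (width, height) down so the longer side is at most `max_side`."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # already small enough; never upscale
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


def plan_resizes(images, default_max_side=500, overrides=None):
    """Map each image name to its target size.

    `images` maps name -> (width, height). `overrides` maps name -> max side
    for images that should keep a higher resolution.
    """
    overrides = overrides or {}
    return {
        name: target_size(w, h, overrides.get(name, default_max_side))
        for name, (w, h) in images.items()
    }


if __name__ == "__main__":
    images = {"photo.png": (4000, 2000), "diagram.png": (4000, 2000)}
    print(plan_resizes(images, overrides={"diagram.png": 2000}))
```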

We hope that, by making open sourcing research papers faster and safer, arXiv LaTeX cleaner will help even more researchers embrace open access and make their work freely available.

arXiv LaTeX cleaner itself is open source, so you can adapt it to your needs. If you think your adaptation would be useful for others, we’d love your contributions, too.

By Jordi Pont-Tuset, Machine Perception team

Open sourcing ClusterFuzz

Thursday, February 7, 2019

Fuzzing is an automated method for detecting bugs in software that works by feeding unexpected inputs to a target program. It is effective at finding memory corruption bugs, which often have serious security implications. Manually finding these issues is both difficult and time consuming, and bugs often slip through despite rigorous code review practices. For software projects written in an unsafe language such as C or C++, fuzzing is a crucial part of ensuring their security and stability.
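
To illustrate the basic loop, here is a toy random fuzzer. It is a deliberately naive sketch, far simpler than the coverage-guided fuzzers used in practice, and both `parse_record` (a target with a planted bug) and `fuzz` are hypothetical names. In C or C++ the planted bug would be an out-of-bounds read; in Python it surfaces as an IndexError.

```python
import random


def parse_record(data: bytes) -> int:
    """Toy target: the first byte is a length, the rest is the payload."""
    if not data:
        return 0
    length = data[0]
    payload = data[1:]
    # Planted bug: `length` is trusted without checking it against the payload.
    return payload[length - 1] if length else 0


def fuzz(target, iterations=1000, max_len=8, seed=0):
    """Feed random byte strings to `target` and collect the inputs that crash it."""
    rng = random.Random(seed)  # seeded so that runs are reproducible
    crashes = []
    for _ in range(iterations):
        size = rng.randrange(max_len + 1)
        data = bytes(rng.randrange(256) for _ in range(size))
        try:
            target(data)
        except Exception as exc:  # any exception counts as a crash here
            crashes.append((data, type(exc).__name__))
    return crashes


if __name__ == "__main__":
    found = fuzz(parse_record)
    print(f"{len(found)} crashing inputs, first: {found[0]}")
```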

In order for fuzzing to be truly effective, it must be continuous, done at scale, and integrated into the development process of a software project. To provide these features for Chrome, we wrote ClusterFuzz, a fuzzing infrastructure running on over 25,000 cores. Two years ago, we began offering ClusterFuzz as a free service to open source projects through OSS-Fuzz.

Today, we’re announcing that ClusterFuzz is now open source and available for anyone to use.



We developed ClusterFuzz over eight years to fit seamlessly into developer workflows, and to make it dead simple to find bugs and get them fixed. ClusterFuzz provides end-to-end automation, from bug detection, to triage (accurate deduplication, bisection), to bug reporting, and finally to automatic closure of bug reports.
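
Bisection, one of the triage steps mentioned above, is essentially a binary search over revisions. The sketch below shows the general technique under stated assumptions (the first revision is good, the last is bad, and the bug persists once introduced); it is not ClusterFuzz's implementation, and `first_bad_revision` and `crashes_at` are hypothetical names.

```python
def first_bad_revision(revisions, crashes_at):
    """Binary-search for the earliest revision where `crashes_at` returns True.

    Assumes revisions[0] is good, revisions[-1] is bad, and that every
    revision after the first bad one is also bad.
    """
    lo, hi = 0, len(revisions) - 1  # invariant: lo is good, hi is bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if crashes_at(revisions[mid]):
            hi = mid  # bug already present: look earlier
        else:
            lo = mid  # still good: look later
    return revisions[hi]


if __name__ == "__main__":
    revisions = list(range(100, 120))
    # Pretend the bug landed in revision 113.
    print(first_bad_revision(revisions, lambda rev: rev >= 113))
```

With N revisions this needs roughly log2(N) test runs rather than N, which is what makes automated bisection practical on a busy repository.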

ClusterFuzz has found more than 16,000 bugs in Chrome and more than 11,000 bugs in over 160 open source projects integrated with OSS-Fuzz. It is an integral part of the development process of Chrome and many other open source projects. ClusterFuzz is often able to detect bugs hours after they are introduced and verify the fix within a day.

Check out our GitHub repository. You can try ClusterFuzz locally by following these instructions. In production, ClusterFuzz depends on some key Google Cloud Platform services, but you can use your own compute cluster. We welcome your contributions and look forward to any suggestions to help improve and extend this infrastructure. Through open sourcing ClusterFuzz, we hope to encourage all software developers to integrate fuzzing into their workflows.

By Abhishek Arya, Oliver Chang, Max Moroz, Martin Barbella and Jonathan Metzman, ClusterFuzz team

Dopamine 2.0: providing more flexibility in reinforcement learning research

Wednesday, February 6, 2019

Reinforcement learning (RL) has become one of the most popular fields of machine learning, and has seen a number of great advances over the last few years. As a result, there is a growing need from both researchers and educators to have access to a clear and reliable framework for RL research and education.

Last August, we announced Dopamine, our framework for flexible reinforcement learning. For the initial version we decided to focus on a specific type of RL research: value-based agents evaluated on the Atari 2600 framework supported by the Arcade Learning Environment. We were thrilled to see how well it was received by the community, including a live coding session, inclusion in a recently announced benchmark for RL, being considered the top “Cool new open source project of 2018” by the Octoverse, and over 7K GitHub stars on our repository.

One of the most common requests we have received is support for more environments. This confirms what we have seen internally, where simpler environments, such as those supported by OpenAI’s Gym, are incredibly useful when testing out new algorithms. We are happy to announce Dopamine 2.0, which includes support for discrete-domain Gym environments (i.e., discrete states and actions). The core of the framework remains unchanged; we have simply generalized the interface with the environment. For backwards compatibility, users will still be able to download version 1.0.
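
For readers unfamiliar with this kind of environment interface, the sketch below shows a toy discrete environment that follows the classic Gym convention, where `reset()` returns an observation and `step(action)` returns `(observation, reward, done, info)`. It is plain Python written for illustration, not Dopamine or Gym code, and `Corridor` and `run_episode` are hypothetical names.

```python
class Corridor:
    """Toy environment: states 0..length-1, actions 0 (left) and 1 (right)."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        """Start a new episode at the left end and return the first observation."""
        self.state = 0
        return self.state

    def step(self, action):
        """Apply `action` and return (observation, reward, done, info)."""
        move = 1 if action == 1 else -1
        self.state = min(self.length - 1, max(0, self.state + move))
        done = self.state == self.length - 1  # reaching the right end ends the episode
        reward = 1.0 if done else 0.0
        return self.state, reward, done, {}


def run_episode(env, policy, max_steps=100):
    """Run one episode with `policy` (state -> action) and return the total reward."""
    state = env.reset()
    total = 0.0
    for _ in range(max_steps):
        state, reward, done, _ = env.step(policy(state))
        total += reward
        if done:
            break
    return total


if __name__ == "__main__":
    print(run_episode(Corridor(), lambda state: 1))  # always move right
```

Because an agent only ever touches `reset` and `step`, any environment exposing those two methods can be swapped in without changing the agent, which is the kind of generalization described above.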

We include default configurations for two classic control environments: CartPole and Acrobot; on these environments one can train a Dopamine agent in minutes. When compared with the training time for a standard Atari 2600 game (around 5 days on a standard GPU), these environments allow researchers to iterate much faster on research ideas before testing them out on larger Atari games. We also include a Colaboratory that illustrates how to train an agent on CartPole and Acrobot. Finally, our GymPreprocessing class serves as an example of how to use Dopamine with other custom environments.

We are excited by the new opportunities enabled by Dopamine 2.0, and look forward to seeing what the research community creates with it!

By Pablo Samuel Castro and Marc G. Bellemare, Dopamine Team