
Posts from May 2009

Improving Freenet's Performance

Friday, May 29, 2009

The Free Network project is the community that creates and maintains Freenet, free software that allows you to publish and obtain information on the Internet without fear of censorship by means of a decentralized, anonymous network. Since version 0.7, the software has had built-in support for downloading and uploading large files. These are long-term downloads, which persist between restarts of the node. This support has improved performance and usability, but it has also meant that when lots of downloads are going on at the same time, Freenet uses a lot of memory, takes a long time to complete the startup process, and crashes if you queue too many downloads. By storing the current progress of uploads and downloads in db4o's open source object database (essentially a file on disk) rather than in memory, Freenet's memory usage can be greatly reduced, the end user doesn't need to worry about running out of memory, we can have an unlimited number of uploads and tens of gigabytes of downloads, and so on.
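To make that concrete, here is a minimal sketch of what persisting a download's progress in a db4o file can look like with the standard embedded API. The DownloadState class and its fields are hypothetical illustrations, not Freenet's actual schema:

import com.db4o.Db4o;
import com.db4o.ObjectContainer;
import com.db4o.ObjectSet;

// Hypothetical record of one download's progress; Freenet's real classes differ.
class DownloadState {
    String uri;
    int blocksFetched;
    int blocksTotal;

    DownloadState(String uri, int blocksFetched, int blocksTotal) {
        this.uri = uri;
        this.blocksFetched = blocksFetched;
        this.blocksTotal = blocksTotal;
    }
}

public class PersistDownloads {
    public static void main(String[] args) {
        // Open (or create) the database file on disk.
        ObjectContainer db = Db4o.openFile("downloads.db4o");
        try {
            // Store progress; the object lives on disk rather than in the Java heap.
            db.store(new DownloadState("CHK@example", 42, 1024));
            db.commit();

            // After a restart, query the same file to resume where we left off.
            ObjectSet<DownloadState> resumed = db.query(DownloadState.class);
            for (DownloadState d : resumed) {
                System.out.println(d.uri + ": " + d.blocksFetched + "/" + d.blocksTotal);
            }
        } finally {
            db.close();
        }
    }
}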

To begin at the beginning, Freenet divides all files into 32KB blocks (called CHKs), which are each fetched and decrypted separately. On top of that we have a layer of redundancy, and various complexities surrounding reassembling files and in-Freenet websites, which together make up the client layer. Before the db4o branch, uploads were persistent, but downloads were restarted from scratch after every restart, pulling huge numbers of blocks from the datastore (on-disk cache). Worse, memory usage was rather large if you had any significant number of downloads on the queue.
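Ignoring encryption, redundancy, and the real CHK key derivation, the block layer's basic job of chopping a file into 32KB pieces can be sketched as follows; splitIntoBlocks is just an illustrative helper, not a Freenet API:

import java.io.FileInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BlockSplitter {
    static final int BLOCK_SIZE = 32 * 1024; // Freenet data blocks are 32KB

    // Illustrative helper: read a file and return its raw 32KB chunks.
    static List<byte[]> splitIntoBlocks(String path) throws IOException {
        List<byte[]> blocks = new ArrayList<byte[]>();
        FileInputStream in = new FileInputStream(path);
        try {
            byte[] buf = new byte[BLOCK_SIZE];
            int filled = 0;
            int read;
            while ((read = in.read(buf, filled, BLOCK_SIZE - filled)) != -1) {
                filled += read;
                if (filled == BLOCK_SIZE) {
                    blocks.add(buf.clone()); // a complete 32KB block
                    filled = 0;
                }
            }
            if (filled > 0) {
                // Short final block; the real client layer pads it before encryption.
                blocks.add(Arrays.copyOf(buf, filled));
            }
            return blocks;
        } finally {
            in.close();
        }
    }
}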

The db4o project puts the client layer (persistent downloads and uploads) into a database (db4o). I had initially hoped that this would be a relatively quick project, which shows how much I knew about databases then! We decided to use db4o in a fairly low-level way, specifically to minimize memory usage. We had heard from testimonials that some embedded applications had done this, but unfortunately this is not really the way that db4o is usually used, which caused some complications. Overall, the project took one developer most of a year, the final diff was over 46K lines of code covering 320 files, and the work went well beyond its original remit, solving many long-standing problems in the process. New architecture was required for optimal performance, including using Bloom filters to identify blocks we are interested in, a queue of database jobs, major refactoring in many areas of the client layer, a new system for handling temporary files, and so on.
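To give a rough idea of the Bloom filter trick (a probabilistic set that can say "definitely not interested" or "maybe interested" using very little memory), here is a minimal, naive sketch; Freenet's real filter and hash functions are different:

import java.util.BitSet;

// Minimal Bloom filter sketch: remembers which block keys we care about using a
// fixed-size bit array and k hash probes. Not Freenet's actual implementation.
public class SimpleBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashes;

    public SimpleBloomFilter(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    // Naive double hashing over the key bytes; real filters use stronger hashes.
    private int probe(byte[] key, int i) {
        int h1 = 17, h2 = 31;
        for (byte b : key) {
            h1 = h1 * 31 + b;
            h2 = h2 * 37 + b;
        }
        return Math.abs((h1 + i * h2) % size);
    }

    public void add(byte[] key) {
        for (int i = 0; i < hashes; i++) {
            bits.set(probe(key, i));
        }
    }

    // May return false positives, but never false negatives.
    public boolean mightContain(byte[] key) {
        for (int i = 0; i < hashes; i++) {
            if (!bits.get(probe(key, i))) {
                return false;
            }
        }
        return true;
    }
}

A node can add the keys of every block its queued downloads still need and then test incoming block keys against the filter, instead of hitting the database for each one.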

The effort was well worth it. Our client layer has vastly improved overall, and Freenet now:
  • starts up quickly

  • resumes work on downloads and uploads almost instantly on startup

  • can have an almost unlimited number of downloads and uploads

  • doesn't need the user to worry about or configure the maximum memory usage

  • doesn't go into limbo with constant 100% CPU usage desperately trying to scrounge a few more bytes

  • can insert DVD-sized files and huge websites (or git/hg repositories) on relatively low end systems

  • uses fewer file handles

This project would not have happened without support from Google's Open Source Programs Office. It will be one of the most important changes in version 0.8 of Freenet when it is released later this year, and current work includes Bloom filter sharing, a new feature that should greatly improve performance for both popular and rare content. Google is also funding that project, so watch this space!

Web Storage Portability Layer: A Common API for Web Storage

Thursday, May 28, 2009

As discussed in our Google Code Blog post on HTML5 for Gmail Mobile, Google's new version of Gmail for iPhone and Android-powered devices uses the Web Storage Portability Layer (WSPL) to let the same database code run on browsers that provide either Gears or HTML5 structured storage facilities. The WSPL consists of a collection of classes that provide asynchronous transactional access to both Gears and HTML5 databases and can be found on Project Hosting on Google Code.

There are five basic classes:

  • google.wspl.Statement - A parametrizable SQL statement class

  • google.wspl.Transaction - Used to execute one or more Statements with ACID properties

  • google.wspl.ResultSet - Arrays of JavaScript hash objects, where the hash key is the table column name

  • google.wspl.Database - A connection to the backing database, also provides transaction support

  • google.wspl.DatabaseFactory - Creates the appropriate HTML5 or Gears database implementation


Also included in the distribution is a simple note-taking application with a persistent database cache built using the WSPL library. This application (along with Gmail mobile for iPhone and Android-powered devices) is an example of the cache pattern for building offline web applications. In the cache pattern, we insert a browser-local cache into the web application to break the synchronous link between user actions in the browser and server-generated responses. Instead, we have two data flows. First, entirely local to the device, contents flow from the cache to the UI while changes made by the user update the cache. In the second flow, the cache asynchronously forwards user changes to the web server and receives updates in response.

By using this architectural pattern, a web application can be made tolerant of a flaky (or even absent) network connection!

We'll be available at the Developer Sandbox at Google I/O to discuss the cache pattern, HTML5 development and the WSPL library. Check it out! If you have questions or comments, please visit our discussion list.

Support for Mercurial Now Available for All Projects Hosted on Google Code

You may recall that we recently asked for help from some early testers of Mercurial on Project Hosting on Google Code. As of today, all of our Project Hosting users can make use of this added functionality. For full details, check out the Google Code Blog. Better still, if you happen to be joining us at Google I/O, stop by the Mercurial on Big Table Tech Talk to learn more.

2009 Google Summer of Code™ Celebration at Beijing LUG

Thursday, May 21, 2009


Newly elected BLUG President, Pockey Lam, handing over Google Summer of Code stickers

The Beijing Linux User Group (BLUG) invited all the Google Summer of Code students and mentors in China to a special BBQ lamb skewers night (we call these "chuan'r nights" actually) on Saturday, May 9th. The BLUG has been supporting this fantastic initiative from Google as much as possible since 2007 by organizing conferences in universities and spreading the word online on Chinese websites.

This year we're very proud to have 2 mentors and at least 5 student members of our group selected for Google Summer of Code. Both mentor proposals are centered around building better Linux support for MIPS and the Chinese Loongson CPU. We also managed to get the two main hardware manufacturers (Dexxon/Emtec and Lemote) to donate free MIPS laptops to the selected students.

And to add icing on the cake, Google sent schwag for our little party, Lemote donated a few free T-shirts, and the Beijing LUG distributed the 2009 edition of its own T-shirts for free as well! Definitely something not to be missed if you were in town (Beijing, China)!

Celebration announcement details here, location details here, and full picture gallery here.

Happy Birthday, ICU!

Tuesday, May 19, 2009

The ICU project is celebrating 10 years of being open source this month.

"ICU" in this case stands for International Components for Unicode - not to be confused with Intensive Care Unit or International Communist Union... It is the premier software internationalization library, appearing in everything from your Google Android phone or your iPod all the way up to IBM mainframes. It provides the Unicode support that all of these programs need for handling the languages of the world, from Arabic to Chinese to Vietnamese.

ICU originated back in an Apple/IBM/HP joint venture. That code was morphed into the core of Java internationalization for JDK 1.1.4 - a large portion of this code still exists in the java.text and java.util packages. At that time, it included pretty much just sorting, locale/message support, and formatting for dates, numbers and so on. (If you're interested in early history, see an older paper by Laura Werner - now at Google). The libraries were refined over time and ported back to C and C++; now there are also wrappers for other languages, such as PHP.

ICU's data comes from CLDR, the Unicode Consortium's open source project for locale data, and ICU typically releases each new version right after CLDR does. CLDR 1.7 was just released Friday, May 8, with ICU 4.2 following on the very same day.

While ICU was around before Google, more recently Google has played a strong role in the development of ICU and has provided major contributions to the Unicode CLDR project. ICU forms the foundation of our 40-language initiative, so we look forward to many successful future birthdays!


Report from Day 2 of the Linux Storage and Filesystem Workshop, April 6-7th, 2009

Monday, May 18, 2009

(The sequel to Report from Day 1)

The first official discussion of the day was around optimizing Solid State Disk performance, led by Matthew "willy" Wilcox (Intel). SSD behaviors, and how to model that behavior, are still of interest. File system developers need to understand how the performance costs have shifted compared to regular disks. Matthew noted that the Intel SSDs are adaptive and export a 512-byte sector illusion. Ric Wheeler (Red Hat) also raised concerns about current use of fallocate() and how adding use of "TRIM" to fallocate() would affect thin provisioning. The consensus was that any form of oversubscription of the hardware would cause problems, and the TRIM command might exacerbate those issues. But TRIM would have a measurable positive impact for most users (of SSDs). Ted Ts'o expected ext4 to already properly issue TRIM at the right times.

Moving on, "4KB sector" hard disk support is mostly done, and Martin Petersen did much of the recent work to prepare the Linux kernel for it. Performance issues are still lurking, however, when using FAT partition tables or anything else that affects the alignment offset of a write. The root cause of this disaster is that the drives export a 512-byte logical sector (despite implementing 4KB physical sectors). They export 512-byte sectors in order to boot from older BIOSes and function with "legacy" operating systems. But in order to get good performance with FAT partition tables, hardware vendors have "offset" the logical-to-physical block mapping so that logical sector 63 (the size of a "track" in ancient BIOSes) is well aligned physically. "Badly aligned" means a 4KB write will require reading and writing parts of two 4KB physical sectors and thus "burning" an extra rotation (the drive must read the data first and then write it out on the next rotation). Anyone using the full disk or some other partition table (e.g. GPT) will learn the joys of unnecessary complexity when they demonstrate, and try to explain, two levels of disk performance for the same application binary. The only way to avoid this mess is for HDD vendors to provide the means to directly use native 4KB blocks.
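The arithmetic behind the alignment mess is simple enough to sketch. The snippet below just checks whether a partition's starting logical sector lands on a 4KB physical boundary, with and without an assumed one-sector vendor offset of the kind described above:

public class AlignmentCheck {
    static final int LOGICAL = 512;   // logical sector size exported by the drive
    static final int PHYSICAL = 4096; // real physical sector size

    // Is a partition starting at this logical sector 4KB-aligned, given the
    // vendor's logical-to-physical offset (expressed in logical sectors)?
    static boolean isAligned(long startSector, long vendorOffset) {
        return ((startSector + vendorOffset) * LOGICAL) % PHYSICAL == 0;
    }

    public static void main(String[] args) {
        // Classic DOS/FAT partitions start at sector 63: 63 * 512 = 32256 bytes,
        // not a multiple of 4096, so every 4KB write straddles two physical
        // sectors and costs a read-modify-write.
        System.out.println(isAligned(63, 0)); // false

        // With a one-sector vendor offset, logical sector 63 maps to byte 32768,
        // which is 8 * 4096, so FAT-style partitions become well aligned...
        System.out.println(isAligned(63, 1)); // true

        // ...while a partition deliberately created on an aligned boundary,
        // say sector 64, now becomes the one paying the misalignment penalty.
        System.out.println(isAligned(64, 1)); // false
    }
}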

The second issue with 512-byte block emulation is error handling. Performance will be horrid for sequential small writes to a single 4KB block unless the intermediate writes are cached and then written in one go when the last sector arrives. But if the 8th write fails, the OS will think the previous 7 sectors are fine and just rewrite the last sector again, and the previous 7x512 bytes are gone. With Write Cache Enabled (WCE) turned on, as it is for most SATA drives, this problem already exists. The only thing new this speculation exposes is that disk vendors have strong incentives to violate the intent of "WCE off" despite the dire consequences.

The last presentation I want to mention was "Virtual Machine I/O". The challenge was how block IO schedulers need to manage bandwidth in the various topologies typically seen in virtualized IO. Google's Nauman Rafique was one of the presenters, with a focus on a "Proportional IO" implementation. Hannes Reinecke summarized the core problems nicely: IO scheduling is only needed when there is contention at the device level, so keep the mechanism to enforce scheduling at that level. Different policies should be implemented at higher levels as needed.

Later on, when talking with the group in a "hacking session", I backed up my assertion that this is not a new problem by showing my copy of the OLS 2004 schedule, where IO prioritization was mentioned by Jens Axboe in his talk and in a BOF led by Werner Almsberger (http://abiss.sourceforge.net/). My advice was to solve and push the simplest piece first before confusing everyone with grand designs and huge patches.

And I'll close with my kudos to the Linux Foundation staff for pulling this off smoothly! Really. It was nice to see a small event get handled so professionally and courteously.

Zurich Open Source Jam 7

Friday, May 15, 2009

We did it again, and we are getting better at it. Last Thursday, May 7th we hosted the 7th Open Source Jam in Zurich. It was probably our largest event, with close to 50 participants!

The event ran a bit over 3 hours, and 10 projects were presented: EVMS (Enterprise Volume Management Software), which promises to be a new model of volume management for Linux; MyPaint, a very interesting piece of software for creating images from scratch; RTEMS, a real-time operating system for multiprocessor systems; Solid State Drives; Monitoring Systems; Gurtle, an issue tracker integration for TortoiseSVN; RDKit, cheminformatics and machine learning software; a very nice overview of the development model and features of Drupal; SQL for Google App Engine; and OSGi, the dynamic module system for Java.

We would like to thank you all again for participating and sharing the interesting projects you are working on, and also invite you to subscribe to our Open Source Jam Zurich Group to stay informed about other events in Zurich.

Unfortunately, our official photographer was out of town, but we promise pictures for the next Open Source Jam!

Google Update Releases Update Controls

Thursday, May 14, 2009

Whenever we build out new products and features at Google, we try to ensure that we provide users with two key components: transparency and control. About a month ago we released the Google Update source code to give users and developers transparency into Google's update mechanism. Today we hope to fulfill this second component by providing advanced users the ability to control the installation and updating of Google products via Google Update. Thanks to automatic updates, most users should already have this version of Google Update.

The update policy is controlled via Windows Group Policy, allowing network administrators to apply policies to all computers on their domain and power users with administrative privileges to set the policy on individual machines. We provide an Administrative Template file that allows selection of policies using standard graphical user interfaces such as Group Policy Editor.

The new Group Policy support allows an administrator to specify which Google applications can be installed and how they should be updated. You can select from one of three update options: automatically, manually, or not at all. Administrators can also control how frequently Google Update checks for software updates.

Mac users have similar controls over Google Software Update: they can change how often update checks occur or disable update checks altogether. See the Managing updates in Google Software Update Help Center article for details.

We work hard to keep our users safe and secure when using our applications, and we believe that making sure users have the latest software available using automatic updates is a key component of that. However, we realize that there are situations where automatic updates may not be desirable so we wanted to provide the ability to control updates when necessary.

To get started, take a look at the Google Update for Enterprise documentation.

Spreading the Summer Love in Chicago

Tuesday, May 12, 2009

There are many ways to get the word out about Google Summer of Code™, and one of the most fun is the Google Summer of Code meetups that participants organize around the world. After a successful meetup last year for University of Chicago students, the University's ACM student chapter decided to organize another meetup this year, open to other universities in the Chicago area and again hosted by Google. Thus, on April 30th, around 70 students from the University of Chicago, Northwestern University, DePaul University, the Illinois Institute of Technology, and the University of Illinois at Chicago converged upon Google's office in downtown Chicago, where our hosts happily allowed us to use their space, eat their food, and consume their caffeinated beverages.

Like last year, the event revolved around a series of lightning talks where Google Summer of Code students and mentors in Chicago talked about their upcoming work, and Google engineers talked about cool stuff they are involved in. Among the evening's highlights, there was mingling:



Whereas last year we were greeted by the Tower of Hanoi of Chinese Food, this year we were met by the Mexican Food Buffet of Awesomeness:



The talks took place in Google's offices on the 17th floor of a downtown building, with stunning views of the city:


(That tall black building is the John Hancock Center.)



Our Google Summer of Code speakers were (left to right, top to bottom):



Not pictured are a couple of other Google Summer of Code students from Chicago who couldn't make it to the meetup or did not give talks: Joe Doliner (University of Chicago, student for BRL-CAD), Caden Howell (DePaul University, student for the Electronic Frontier Foundation), Chandra Ramachandran (University of Illinois at Urbana-Champaign, student for the National Center for Supercomputing Applications), and Ori Rawlings (Illinois Institute of Technology, student for the Natural User Interface Group).

Our Google speakers were (left to right) Nathaniel Manista, Jon Trowbridge, and Jacob Lee, who provided us with a steady supply of very amusing slides.








Oodles of thanks go to Google for hosting this event (especially to Jon Trowbridge, organizer extraordinaire) and congratulations to our Chicago-area students for making it into Google Summer of Code!

Introducing WebDriver

Friday, May 8, 2009

WebDriver is a clean, fast framework for automated testing of webapps. Why is it needed? And what problems does it solve that existing frameworks don't address?

For example, Selenium, a popular and well-established testing framework, is a wonderful tool that provides a handy unified interface that works with a large number of browsers and allows you to write your tests in almost every language you can imagine (from Java or C# through PHP to Erlang!). It was one of the first Open Source projects to bring browser-based testing to the masses, and because it's written in JavaScript it's possible to quickly add support for new browsers that might be released.

Like every large project, it's not perfect. Selenium is written in JavaScript, which causes a significant weakness: browsers impose a pretty strict security model on any JavaScript that they execute in order to protect users from malicious scripts. Examples of where this security model makes testing harder are when trying to upload a file (IE prevents JavaScript from changing the value of an INPUT file element) and when trying to navigate between domains (because of the same-origin policy).

Additionally, being a mature product, the API for Selenium RC has grown over time, and as it has done so it has become harder to understand how best to use it. For example, it's not immediately obvious whether you should be using "type" instead of "typeKeys" to enter text into a form control. Although it's a question of aesthetics, some find the large API intimidating and difficult to navigate.

WebDriver takes a different approach to solve the same problem as Selenium. Rather than being a JavaScript application running within the browser, it uses whichever mechanism is most appropriate to control the browser. For Firefox, this means that WebDriver is implemented as an extension. For IE, WebDriver makes use of IE's Automation controls. By changing the mechanism used to control the browser, we can circumvent the restrictions placed on JavaScript by the browser's security model. In those cases where automation through the browser isn't enough, WebDriver can make use of facilities offered by the Operating System. For example, on Windows we simulate typing at the OS level, which means we are more closely modeling how the user interacts with the browser, and that we can type into "file" input elements.

With the benefit of hindsight, we have developed a cleaner, Object-based API for WebDriver, rather than follow Selenium's dictionary-based approach. A typical example using WebDriver in Java looks like this:

// Create an instance of WebDriver backed by Firefox
WebDriver driver = new FirefoxDriver();

// Now go to the Google home page
driver.get("http://www.google.com");

// Find the search box, and (ummm...) search for something
WebElement searchBox = driver.findElement(By.name("q"));
searchBox.sendKeys("selenium");
searchBox.submit();

// And now display the title of the page
System.out.println("Title: " + driver.getTitle());

Looking at the two frameworks side-by-side, we found that the weaknesses of one are addressed by the strengths of the other. For example, whilst WebDriver's approach to supporting browsers requires a lot of work from the framework developers, Selenium can easily be extended. Conversely, Selenium always requires a real browser, yet WebDriver can make use of an implementation based on HtmlUnit which provides lightweight, super-fast browser emulation. Selenium has good support for many of the common situations you might want to test, but WebDriver's ability to step outside the JavaScript sandbox opens up some interesting possibilities.
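For instance, swapping the earlier Java example over to the HtmlUnit-backed driver should be little more than a change to the first line (assuming the HtmlUnitDriver class shipped with the standard WebDriver download):

// Same test as above, but backed by HtmlUnit instead of a real browser.
WebDriver driver = new HtmlUnitDriver();

// Everything else is unchanged:
driver.get("http://www.google.com");
WebElement searchBox = driver.findElement(By.name("q"));
searchBox.sendKeys("selenium");
searchBox.submit();
System.out.println("Title: " + driver.getTitle());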

These complementary capabilities explain why the two projects are merging: Selenium 2.0 will offer WebDriver's API alongside the traditional Selenium API, and we shall be merging the two implementations to offer a capable, flexible testing framework. One of the benefits of this approach is that there will be an implementation of WebDriver's cleaner APIs backed by the existing Selenium implementation. Although this won't solve the underlying limitations of Selenium's current JavaScript-based approach, it does mean that it becomes easier to test against a broader range of browsers. And the reverse is true; we'll also be emulating the existing Selenium APIs with WebDriver too. This means that teams can make the move to WebDriver's API (and Selenium 2) in a managed and considered way.

If you'd like to give WebDriver a try, it's as easy as downloading the zip files, unpacking them and putting the JARs on your CLASSPATH. For the Pythonistas out there, there's also a version of WebDriver for you, and a C# version is waiting in the wings. The project is hosted at http://webdriver.googlecode.com, and, like any project on Google Code, is Open Source (we're using the Apache 2 license). If you need help getting started, the project's wiki contains useful guides, and the WebDriver group is friendly and helpful (something which makes me feel very happy).

So that's WebDriver: a clean, fast framework for automated testing of webapps. We hope you like it as much as we do!

O BSDCanada!

Thursday, May 7, 2009

BSDCan 2009, an annual BSD conference at the University of Ottawa in Ontario, Canada, will be held this year on May 8th and 9th, 2009. The Open Source Team's Leslie Hawthorn and Cat Allman will be there to mingle with the Open Source community and present a talk on Getting Started in Free and Open Source on May 8th at 11 AM local time. This talk is a fantastic introduction to the Open Source community for those who are new and want to get involved. In addition, Open Source veterans will discover insights into the concerns of newbies and learn ways to improve retention and make their projects more welcoming. By running programs such as Google Summer of Code™ and the Google Highly Open Participation™ Contest, Leslie and Cat have gathered a huge amount of experience working with Open Source newcomers, and they are excited to share their knowledge with the rest of the community.

This will mark Google's third year at BSDCan, with Brian 'Fitz' Fitzpatrick and Ben Collins-Sussman speaking there in 2007 and Leslie presenting in 2008. If you are in the area, make sure to attend this year's talk, and feel free to say hello or introduce yourself afterward!
