More Adventures from SciPy: Jenny Qing Qian

Friday, December 5, 2008

You may recall our recent post from Rachel McCreary detailing her experiences at the 7th Annual Python in Science Conference (a.k.a. SciPy 2008). Also joining Rachel at SciPy was Jenny Qing Qian, one of Rachel's fellow Google Summer of Code™ 2008 students and a Pygr developer. Jenny created a Python Ensembl API for her Summer of Code project, and attending SciPy gave her the opportunity to showcase her work and learn more from her co-developers. She was kind enough to send us this report from the conference:

The introductory tutorials, held during the first two days of the conference, were fantastic. In the tutorials, I was given an excellent hands-on demo of the interactive Python shell – IPython – and other general Python tools and libraries for scientific computing, such as NumPy and SciPy. I was also fascinated by the diversity of plotting tasks the Matplotlib package can perform, tasks that were traditionally carried out using Matlab.

During the conference, I really enjoyed the keynote speech from Alex Martelli, who currently works at Google. His talk addressed the fundamental yet often neglected problem of treating a numeric software package as a 'black box' in scientific and engineering computing. Supported by many vivid real-world examples, he effectively conveyed the message that you must be crystal clear about what you're computing and understand what the 'black boxes' can and cannot do. Otherwise, results may be far from accurate, which can lead to disastrous outcomes, especially in engineering. This typically happens when the 'black box' is not well-conditioned for the specific task or the set of input data. His talk provided useful input both to users of software packages and to their developers. It rightly prompts developers to carefully document the behavior and functionality of their software packages, especially the conditions under which they should be used. In turn, this might help prevent the software from being used for anything other than its intended purposes.
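
To make the conditioning point concrete, here is a minimal sketch in plain Python. It is not taken from the keynote; the function `solve_2x2` and the numbers are illustrative assumptions. It solves a nearly singular 2x2 linear system twice, changing one right-hand-side value by 0.0001, and shows that the computed solution shifts by about 1 in each coordinate.

```python
def solve_2x2(a, b, c, d, e, f):
    """Solve a*x + b*y = e and c*x + d*y = f using Cramer's rule."""
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    x = (e * d - b * f) / det
    y = (a * f - e * c) / det
    return x, y


# The two rows are almost parallel, so the system is ill-conditioned.
x1, y1 = solve_2x2(1.0, 1.0, 1.0, 1.0001, 2.0, 2.0001)

# Same system, with the second right-hand-side value nudged by 0.0001.
x2, y2 = solve_2x2(1.0, 1.0, 1.0, 1.0001, 2.0, 2.0002)

print("original input: ", (x1, y1))   # close to (1, 1)
print("perturbed input:", (x2, y2))   # close to (0, 2)
```

A solver used as a black box would return both answers without complaint. Only knowing that the system is nearly singular tells you that neither answer should be trusted beyond the precision of the input data.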

It was also interesting to listen to talks from various developers about how they apply or further develop general Python or SciPy libraries to solve their domain-specific problems. One such talk was Summarizing Complexity in High Dimensional Spaces. In this talk, Karl Young presented a very useful method that can provide diagnostic summary information for multi-dimensional and multi-spectral medical image data. The method was developed using SciPy's powerful array computation capabilities. I have implemented methods in R to analyze high-dimensional biological image data sets, including time series analysis of microarray data. I have also implemented algorithms in Matlab to analyze and classify data with a large number of features, such as documents. Inspired by the talk, I'd love to try using SciPy to develop analysis tools for large, high-dimensional, multivariate data sets that characterize fundamental properties of dynamic and complex biological systems.

After the conference concluded, I stayed for the coding sprints. My Summer of Code project was about prototyping a database API using standard components of Pygr, a Python Graph Database Framework. The API retrieves information from a central biological data warehouse, the core Ensembl database system. At the sprint session, I finally got to meet the main Pygr developers, Dr. Christopher Lee (the project's founder) and Dr. Titus Brown.

During the sprint session, both Rachel and I presented and demoed our summer projects to the whole group, and we got some great feedback on our progress to date.

After the presentation, Dr. Lee and I discussed how best to reuse existing Pygr components to simplify my API framework, making it more maintainable and easier to extend. We also debugged the problems I had encountered while porting the Ensembl database schema to the pygr.Data namespace. For this project, we had decided to model the complex Ensembl database schemas using the strong support provided by the pygr.Data module. More specifically, rather than implementing a complex database schema in a conventional ORM (Object Relational Mapping) way, this module transforms a schema into a portable Python namespace. In doing so, we hope to provide API developers as well as end users with a much cleaner and more intuitive interface for accessing and distributing the relations among Ensembl data objects. Through a joint effort, I finally managed to save typical Ensembl database schemas into the pygr.Data namespace and retrieve them from it! Needless to say, these discussions and my entire SciPy experience left me feeling incredibly motivated to continue working on my project.
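
The idea of exposing a schema as a namespace can be sketched in plain Python. This sketch does not use Pygr or the pygr.Data API, and it is not the project's actual code. The table contents, the `build_namespace` helper, and the attribute names are all hypothetical, chosen only to show how relations between tables might be reached by attribute access rather than through hand-written ORM classes.

```python
from types import SimpleNamespace

# Toy rows standing in for two related tables (hypothetical data).
genes = {1: {"name": "geneA"}, 2: {"name": "geneB"}}
transcripts = {10: {"gene_id": 1}, 11: {"gene_id": 1}, 12: {"gene_id": 2}}


def build_namespace(genes, transcripts):
    """Bundle the tables and their relations into one nested namespace."""
    # Forward relation: gene id -> list of transcript ids.
    gene_to_transcripts = {}
    for transcript_id, row in transcripts.items():
        gene_to_transcripts.setdefault(row["gene_id"], []).append(transcript_id)

    # Reverse relation: transcript id -> gene id.
    transcript_to_gene = {tid: row["gene_id"] for tid, row in transcripts.items()}

    return SimpleNamespace(
        Gene=SimpleNamespace(rows=genes, transcripts=gene_to_transcripts),
        Transcript=SimpleNamespace(rows=transcripts, gene=transcript_to_gene),
    )


ensembl = build_namespace(genes, transcripts)

# Relations are reached by name, in both directions.
print(ensembl.Gene.transcripts[1])
print(ensembl.Gene.rows[ensembl.Transcript.gene[12]]["name"])
```

In this toy version, the namespace is just a set of nested attributes, so a user can discover the tables and their relations by name without knowing the underlying SQL.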

Many thanks to Jenny for sharing her thoughts with us and many congratulations to her and Rachel for their successes this summer!