Monday, October 19, 2009

Open Source Repository Software

When most bioinformatics professionals think of "open source repository" software, they are probably thinking of things like SourceForge, or gForge, that provide a platform for sharing open source software and its associated source code, binaries, and related files. There is also, however, software that provides a platform for the management and sharing of documents, in a wide variety of different forms, with associated, highly structured and closely managed metadata.

The sciences, and the biomedical sciences in particular, generate a wide range of documents as a key part of any research effort. Developing and managing a laboratory notebook is one of the most important skills that a graduate student learns as part of his/her training. As more and more of these documents are created electronically, developing a means to manage them has also become more important. There have been many, many efforts to develop "electronic laboratory notebooks," with a mixed record of success. Given that practically every scientific activity involves the storage and management of a range of document types, some kind of mechanism for robust, secure and version-controlled management is critical, beyond just the final paper or complete data set.

Researchers and information technologists in library science have been working on this problem for a long time, and have started to come up with a range of solutions, both commercial and open source. Of note is an open source system from the eSciDoc team, designed to support the management of scientific materials. It is based on a widely used set of technologies and standards, such as those from DuraSpace.org, common web standards and a Service-Oriented Architecture (SOA). With both REST and SOAP interfaces, a well-defined structure for controlled metadata elements and vocabularies, and distributed authentication (Shibboleth), it is a candidate for immediate integration into many scientific research information environments.


Figure from the eSciDoc website
https://www.escidoc.org/JSPWiki/en/Overview


Systems such as these may cover an important "last mile" of collaborative research, by providing a structured, metadata-rich and standards-based means to manage and share the unstructured documents that make up the background of the kind of structured scientific data that is provided in systems like GEO or caArray. The extent to which this kind of software can be combined with large-scale and distributed collaborative translational science research remains to be seen. The fact, however, that such software is developing at a rapid pace, and that it appears to share important standards and interface approaches with other emerging programs, suggests that we may be seeing the emergence of an interesting and important aspect of scientific research management.

Monday, October 5, 2009

10,000 Hours of caBIG?

In his recent book Outliers, Malcolm Gladwell raises a very interesting point about subject mastery. He looked at the amount of time that individuals had spent practicing an activity that they have truly mastered. Looking across a variety of disciplines, he arrived at an estimate of about 10,000 hours of work as the threshold. When you think about it, it is not really that surprising. Employers often look for "5 years' experience" as the measure of sufficient experience in a specific job. This equates to 40 hours a week for about 5 years, totaling roughly 10,000 hours. Gladwell found a similar rule true for a range of professions, sports and interests.
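The back-of-the-envelope arithmetic behind that "5 years" equivalence can be sketched in a few lines of Python. The figure of 50 working weeks per year is an assumption (roughly two weeks off a year), not a number taken from Gladwell, and the function name is purely illustrative.

```python
# Rough sketch of the "5 years of full-time work is about 10,000 hours" arithmetic.
# WEEKS_PER_YEAR is an assumption (about two weeks off per year).
HOURS_PER_WEEK = 40
WEEKS_PER_YEAR = 50
YEARS = 5


def total_hours(hours_per_week, weeks_per_year, years):
    """Return the total hours worked over the given number of years."""
    return hours_per_week * weeks_per_year * years


if __name__ == "__main__":
    print(total_hours(HOURS_PER_WEEK, WEEKS_PER_YEAR, YEARS))
```

Counting all 52 weeks of the year instead gives 10,400 hours, so the estimate is not very sensitive to the vacation assumption.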

The question, then, is how to provide the opportunities and environment to support the development of true expertise in the caBIG tools, and the related underlying informatics framework. Given that the program itself is relatively new, and that there has hardly been time yet for anyone to develop expertise at the level that Gladwell describes, it shouldn't be a surprise that there are still many people within the caBIG community and outside of it who dismiss the tools as too complex and difficult to understand. The challenge for caBIG is to continue to establish a place where this expertise can be developed by the stakeholder community. One such critical component is the Knowledge Centers. Another is the experienced pool of caBIG implementers and developers who have successfully used the caBIG tools (often in conjunction with a wide range of tools from other sources) to satisfy the needs of their end-users. The formal documentation and training infrastructure that is provided by the caBIG Documentation and Training Workspace, as well as by the Center for Biomedical Informatics and Information Technology (CBIIT) at NCI, provides the baseline for ensuring that the information needed by the ultimate users of caBIG is available, and is structured and consistent.

All the formal documentation and training in the world cannot substitute for functional working systems at other institutes, and code and tools that can provide the template for successful efforts at similar interested institutions. Developers and integrators, especially in open source-rich environments, have long looked to software skeletons, "Hello World" examples and detailed tutorials to provide the foundation for their efforts. These kinds of frameworks also provide demonstrable evidence that their particular needs can be satisfied by the tools. This kind of sharing of experiences is one of the things that has provided real impetus to many of the deployment efforts currently ongoing in the caBIG program, and has already led to sharing of innovative solutions to several shared needs within the community. To get to that true level of mastery, caBIG will need to ensure that such frameworks, skeletons and open demonstrators are available to everyone who is interested, and are well-categorized and presented, so that what is needed can be easily found.

One thing is clear about gaining real proficiency in anything: the sooner someone gets started, the better they will be in the long run. The old saying about training a longbowman (you start by training his grandfather as a child) holds just as true with software systems and tools. Each of us develops our approach to solving problems (and the associated toolkit) early in our careers. One of the great opportunities of the caBIG program is to give those just starting out in biomedical informatics (and there are bound to be more and more every year) a solid toolkit supporting things like data modeling, semantics, security, and the other components of robust systems. Perhaps the program could even provide graduate students and postdocs with support for attending caBIG meetings, or even grant support for participating in the development and integration of caBIG tools where they are supporting specific scientific and biomedical goals.

Gladwell points out in his book that success stories are almost never the result of a single independent effort, but rather the product of long-term and community-wide support. By providing the environment to support this kind of horizontal community participation in each deployment effort, we can collectively give those separate development efforts the best possible chance for success.

Monday, September 28, 2009

Temple Smith, Bioinformatics Pioneer

I had the unbelievable privilege of spending my postdoctoral years working for Temple Smith, studying 3D protein structure prediction and multiple protein structure alignment. I can honestly say that those are still some of the best and most rewarding years of my professional life. This is due almost entirely to Temple's amazing brilliance, perspicacity, encouragement, and generous nature. Temple became a professor emeritus at Boston University last Friday, which was celebrated with a remarkable series of talks given by his students, collaborators and colleagues. The depth and breadth of the presentations were remarkable, as was the collection of luminaries, each of whom demonstrated the amazing impact Temple has had on the science of bioinformatics. Equally wonderful was how each of these researchers, successful in their own right, acknowledged the support and inspiration that they have drawn for their own work from Temple and his original contributions to science.

Mike Waterman shows a slide of himself and Temple at Los Alamos, NM
Summer, 1980, in a photo taken by David Lipman

Temple is (rightly) famous for his fractious demeanor, and his willingness to question the status quo of any situation, scientific or otherwise, and for his iconoclastic "cowboy" behavior and dress. This, as was acknowledged by all the speakers, hides an open and giving heart, and a true and deep desire to see those with whom he works succeed. Temple likes nothing more than to "stir the pot" and upset the commonly-held wisdom, something that he still continues to do with remarkable efficacy.

No article or post about Temple would be complete without an anecdote, so I will relate one of my own experiences from several years ago. I was a postdoc in Temple's lab, and was being recruited by one of the large East Coast pharmaceutical companies, and I was going to have lunch with their executive recruiter in Boston. We were to meet at the lab, and go from there to lunch. As we were walking out of the offices, we passed the mailbox, where a gentleman with a mustache, serape and cowboy hat was picking up his mail. This person proceeded to ask who the "character in the suit" was, and what I was doing with him, and then bustled past before I could make any introductions. As the recruiter and I walked out of the office, he asked me who "that guy" was, and I responded by asking him if he was familiar with the Smith-Waterman algorithm. My executive friend indicated that he was, and that knowledge of the same was a prerequisite for the job I was being considered for. When I told him that "that guy" was Smith, the look on his face was priceless, and something that I treasure to this day.

It was great talking to Temple on Friday, and getting a chance to catch up with so many of the remarkable people whom he has taught and with whom he has collaborated over the years. It is a wonderful legacy, and a fantastic group of friends - we all expressed a hope that Temple would become an emeritus professor every year. I certainly owe what I have been able to achieve in my own career to Temple, and am glad to wish him a very happy retirement!

Monday, September 21, 2009

Shan Zhai Bioinformatics

Can't argue with success: bioinformatics software that produces results favors agile, time- and cost-effective work by staff very close to (or even participating in!) the research activity. Some potential inspirations come from interesting places. A recent article on China's Shan Zhai in Strategy & Business magazine (formerly our own house organ here at Booz Allen Hamilton, but now run out of Booz & Co.) and some very interesting commentary throughout the blogosphere got me thinking about the very different approaches to software development used by the bioinformatics community. The term shan zhai, literally translated, means "mountain fortress" and suggests banditry and lawlessness. It has come in recent years to be associated with knock-off Western consumer products (for the iPhone, imitation is the sincerest form of flattery...) but has also come to be associated with a kind of native Chinese cleverness and DIY hackery. In the spirit of what the French call bricolage, the shan zhai practice an iterative, just-in-time form of product development, often targeting almost invisibly small communities with their products. This leads to remarkably nimble business behavior, although sometimes at the cost of the ultimate quality, manufacturability or broad relevance of the resulting products. Like the open source development community worldwide, the shan zhai make a practice of sharing information on materials and construction of their products. This level of information sharing is unheard of in the commercial world, but it allows the rapid creation and iteration of products from a geographically distributed community. In doing so, this community relies on emergent properties of this process to distribute and improve product ideas and concepts, rather than the planned approach common to the large multinational companies that produce most consumer goods worldwide.

This blog post made a fascinating connection between the shan zhai and Situated Software, a concept floated by Clay Shirky several years ago. Situated software is software that is designed for use by a specific social group, rather than designed for generality. As such, it can be developed quickly, and usually iteratively, by the community that will ultimately use it, or at least with that community's direct participation in the development process. This approach has significant weaknesses from a commercial point of view, since commercial development often depends on scalability and re-use to maximize the profits resulting from the initial investment of engineering effort. It also can lead to software which must be continuously managed and maintained in order to stay useful and relevant. The recent surge in websites and interactive services that are focused on particular communities of use is an example of this situated approach. With the proliferation of rapid-development software tools, the easy interfacing made possible by REST web services, and the flexible and scalable hosting provided by inexpensive webhosts and even cloud providers like Amazon and Google, this kind of development has already resulted in a wide range of remarkable and community-specific software. Everywhere you look on the web, there are niche services and sites that are directed at very specific user communities, from artists to autopilot enthusiasts and community activists of all stripes.

Which brings me back around to bioinformatics. Bioinformatics is a classic case of software that is often developed for use either by or for a very small community of users - often just a single lab or individual scientist. Since bioinformatics software is regularly developed to support scientific research activities, and most scientific research activity is by necessity bespoke and unique, it is not surprising that a lot of bioinformatics software development is itself also bespoke and small-scale. When one looks at the mixed success of commercial, large-scale software development in bioinformatics, one can see why smaller scale, situated efforts are often more successful in solving the day-to-day problems that face life scientists. Such efforts, when taken as a whole, are often less efficient and more repetitive than those done in a more conventional fashion, but the costs associated with individual small-scale development can be lower than those of larger-scale efforts, since they rely upon graduate students, post-docs and small contractors, all of whom are notorious for their low cost.

The challenge, of course, is finding a path between situated, small-scale/low-cost development and scalable, reusable data and systems upon which larger scale efforts such as comparative effectiveness research, translational medicine and molecular clinical research can rely. The real payoff for bioinformatics and the research that it supports is that which leads to the development of new therapies and molecular markers for disease and treatments. Given the wide range of participants involved in such research, and their distribution both geographically and in discipline, the need for scalable, secure and standardized systems begins to transcend what can effectively (or cost-effectively) be done by the small individual researcher acting alone. Now we look at the shan zhai. They succeed in a field which is notorious for standardization, large scale and risk-averse behavior. Consumer electronics has gone long past the days of small, garage operations dominating the field. But the shan zhai have identified the market inherent in the long tail of consumer electronics purchasers, those that want/need something just a little different (even if what is different is the lower price implied by rampant IP violation - I am not advocating piracy, merely indicating that the shan zhai are able to nimbly respond to a market demand for it).

Key to the success of the shan zhai is their ability to share effectively with their colleagues throughout China and beyond. What they have recognized is that they can be more effective by reusing standardized components and data resources, and often improving them and sharing the result openly, than they can by fiercely protecting what they know as individuals or small companies. This leads to an important observation for those of us developing bioinformatics software and data resources. As we design and build standardized platforms and supporting infrastructure which can facilitate the development of bioinformatics software, it is critical to consider the means by which these tools can be flexibly and easily re-used by the many small developers that comprise the majority of the bioinformatics community. If the goal is to facilitate the development of standards-based software, and to ensure that the data collected by the community is made available using standardized and future-proofed components and representations, then we have an obligation to ensure that the tooling that we provide can be used by not only large cadres of professional software engineers, but also the graduate students and post-docs who are the bricoleurs constructing most of the software that is in use at any given time by the scientific community.

As we continue to develop software and systems that support the needs of translational research, comparative effectiveness research and other transdisciplinary medical and scientific work, we need to ensure that, along with well-recognized standards, we provide equally standardized and easy-to-use interfaces that reflect the needs of those deploying the research solutions to the end-users. Those end-users are often just a single lab or even a single researcher, and the developer is a post-doc or graduate student who is doing his or her work using some of the simplest tools available - often no more than Perl or PHP and a website. If we can support these folks with reusable modules that they understand, and provide a platform through which they can receive help and share results, we will have gone a long way toward creating our own shan zhai, and can begin to reap the rewards of doing so.

Wednesday, September 2, 2009

The PHIN Grid (and other great stuff)

I was at this year's Public Health Information Network (PHIN) conference again, held without fail in Atlanta (always on my kid's first week of school.) This meeting is always interesting to me because of the breadth of participants and the depth and importance of the topics. This year, with H1N1 flu topmost in everyone's mind, there was a special urgency from many of the participants to get the informatics of many public health efforts operating as efficiently and effectively as possible. Key to these efforts was leveraging the products of many other programs, with federal and local informatics teams often making extensive use of open source tools and technologies- the resulting talks were both inspiring and cool.

I attended as many of the grid- and cloud-related talks as I could fit into the conference schedule, and was rewarded with a truly remarkable view of how the stakeholders throughout the PHIN enterprise have been able to leverage technology products from a wide range of programs to satisfy their unique public health requirements. It was inspiring to hear how the CDC has been using many of the tools developed by the National Cancer Institute's caBIG program, especially key parts of the caGrid infrastructure. Equally cool was how many of the key participants in the caBIG program have been directly involved in leveraging those capabilities in an entirely new setting. In particular, Tom Savel and his team talked at length in a number of sessions about using these Grid tools to implement a range of services, and about the challenges, such as security and reliability, in using them in a public health setting. Hearing about how familiar tools like GAARDS, Grid Grouper and Introduce are being used in a new community was well worth the trip, as was hearing how facilities that the caBIG program has implemented, such as the Knowledge Centers, are providing important means of support in diverse settings of national (and even international) importance.

As much as I appreciate hearing the accolades and credit given to the caBIG program (and I do!) and as rewarding as it is to see our community providing support to these important areas, I am reminded of how teams providing software and tools must continue to improve and iterate that software, and continue to ensure that the communities using these systems do not become detached from the processes used to create them. The Knowledge Centers and their staffs are clearly leading the way here, and it will be critical for them (and us) to continue to listen closely to what is happening "out there" and ensure that the needs get "in here." To that end, though, it is both inspiring and heartening to see that we have a growing, talented, committed and capable group who will not only consume the resources that the caBIG program's participants create, but who can also significantly contribute to the infrastructure's development. The challenge for caBIG is to come up with the most effective possible means of incorporating these contributions, and ensuring that they mesh with, support and inform the program going forward.

Onward and upward, PHIN! Every time I sneeze now, I am going to wonder if I can enter that event on some aspect of the emerging public health Grid.

Monday, August 24, 2009

Best thing ever- caBIG iPhone Apps and more!


We are seeing a watershed in caBIG software development - the appearance of caBIG® software products in more places, with ever more potential users. Konrad Rokicki's amazing caBIO iPhone app (link) represents a really, really cool working proof of concept, and opens up so many more possibilities! In a similar vein, caBIG developers at Ohio State and Emory are working on integrating the caBIG imaging tools with existing iPhone applications such as Osirix (link). Our own team is hard at work developing an open source iPhone front-end to the caTISSUE application, to facilitate tissue collection in the field, and there are many other developers we talked to at this year's caBIG Annual Meeting who are working on similar projects at various stages of maturity.

Added to all of this is some equally cool and significant work on developing a robust .NET platform for constructing and integrating caBIG software and tools (link). Be sure to check out the screencasts from this year's Hackathon. This kind of effort is critical for extending the caBIG concepts to the parts of the community that have committed to a Microsoft infrastructure. As the program matures, developing these kinds of bridges to the wider community will be increasingly important. Over the next couple of years, we hope to see all kinds of new applications running in new places - it will certainly be cool!

Be competitive!

Driving business in an open source environment can be challenging - just ask anyone who runs a software service and support effort on top of an open source software stack. Being creative, leveraging relationships and defining a role in the community can all create a means for establishing a business position, developing a customer base, and deriving growth as a service provider in an open source world.

Working without the safety net of deep (and unique) knowledge of proprietary software can be seen as just asking for trouble - proprietary commercial vendors can not only charge a premium for that knowledge, they can closely control who has access to critical information such as the current and future APIs, software release plans, and marketing collateral like logos and branding. With that kind of advantage, authorized vendors can have the double benefit of access to an established customer base, and the ability to charge premium rates for their work. In this environment, becoming an authorized vendor means a clearer path to profitability than being just another provider of services.

Against this backdrop, providing services in an open source context can seem much more difficult: competing with a whole range of other companies, all of whom have equal access to the underlying codebase, can appear to dilute opportunities and decrease profitability. In order to develop business, service providers can do some specific things to increase their profile, establish their street cred, and develop a strong customer base. Here are a few examples:
  • Spend quality time with the community. By being present at the meetings attended by the customer base, a provider can raise their profile, develop personal relationships, and both learn and teach what they know about the open source products they are supporting. Almost every open source project has a community behind it, and whether they meet online, via teleconference or face-to-face, those meetings provide a unique business development opportunity.
  • Answer questions and participate in forums. Most open source projects have forums / wikis where the collective knowledge around the products is collected and developed. By answering questions and participating positively in discussions, capable service providers can demonstrate their knowledge and identify where there are opportunities for business development. There is a thin line between using these forums to demonstrate goodwill and develop relationships, and using them for spam or hard selling. Being able to tell the difference between the two is critical for success...
  • Participate in the development of the code. The cool thing about open source code is that it is really owned by everyone. Becoming one of the contributors to the codebase is a very good way both to gain deep experience with the tools and to demonstrate goodwill and community engagement. One of the positive effects of this effort is that work for clients that extends the shared codebase can often be relevant to the broader community, and can be the basis for the contributor who added it to become the obvious expert on those components (obvious because it was shared with the community, and because the contributor was an active community participant - see above).
  • Become formally identified with the program. When there is a way to get formal recognition as a designated provider of services around the open source codebase, it can help raise a vendor's profile and earn the vendor a not insignificant amount of credibility as an expert in the underlying software. This is particularly useful when the potential customers are new to the field, and are looking for someone with credentials or other concrete demonstrations of capability.
By getting out there and getting involved, interested organizations can rise above the noise, get noticed, and engage in a virtuous circle that itself can create more business opportunities. Working in an open source environment is never easy, and developing business is hard no matter where that business is located, but by placing effort in community contribution, that work can pay off downstream.

Monday, August 17, 2009

5 years' worth of caBIG!

The deeply intertwined fields of bioinformatics and computational biology are maturing and evolving at a startling and ever-increasing rate. This progress was really evident at the recent caBIG Annual Meeting held last month here in Washington DC. As in much of biomedical informatics, the real development has been in the community. At the July meeting, we had the opportunity to see a whole range of interesting and cool things that people have been doing in the cancer informatics community - all of which represent a significant development over previous years' meetings. Similarly, it was really cool to see how groups within the program have started to develop their own collective approaches to problems that they share. In particular, the Center Deployment folks have initiated an impressive shared effort, working together to instantiate strong, relevant and shared applications throughout the country. On this 5th anniversary of the program, I can't help but wonder what we are going to see in the next 5 years. It is certain, though, that we are going to continue to see more and more of the kind of creative, inspired leadership and technology from the program, and that there will be even more interesting and creative solutions to the problems shared by the entire cancer informatics community displayed at future caBIG Annual Meetings!