Monday, October 19, 2009

Open Source Repository Software

When most bioinformatics professionals think of "open source repository" software, they are probably thinking of things like SourceForge or GForge, which provide a platform for sharing open source software and its associated source code, binaries, and related files. There is also, however, software that provides a platform for managing and sharing documents, in a wide variety of forms, with associated, highly structured and closely managed metadata.

The sciences, and the biomedical sciences in particular, generate a wide range of documents as a key part of any research effort. Developing and managing a laboratory notebook is one of the most important skills that a graduate student learns as part of his/her training. As more and more of these documents are created electronically, developing a means to manage them has also become more important. There have been many, many efforts to develop "electronic laboratory notebooks," with a mixed record of success. Given that practically every scientific activity involves the storage and management of a range of document types, some kind of mechanism for robust, secure and version-controlled management is critical, beyond just the final paper or complete data set.

Researchers and information technologists in library science have been working on this problem for a long time, and have started to come up with a range of solutions, both commercial and open source. Of note is an open source solution from the eSciDoc team, designed to support the management of scientific materials. It is built on widely used technologies and standards, including those from DuraSpace.org, common web standards and a service-oriented architecture (SOA). With both REST and SOAP interfaces, a well-defined structure for controlled metadata elements and vocabularies, and distributed authentication (Shibboleth), it is a candidate for immediate integration into many scientific research information environments.


Figure from the eSciDoc website
https://www.escidoc.org/JSPWiki/en/Overview


Systems such as these may cover an important "last mile" of collaborative research, by providing a structured, metadata-rich and standards-based means to manage and share the unstructured documents that make up the background of the kind of structured scientific data that is provided in systems like GEO or caArray. The extent to which this kind of software can be combined with large-scale and distributed collaborative translational science research remains to be seen. The fact, however, that such software is developing at a rapid pace, and that it appears to share important standards and interface approaches with other emerging programs, suggests that we may be seeing the emergence of an interesting and important aspect of scientific research management.

Monday, October 5, 2009

10,000 Hours of caBIG?

In his recent book Outliers, Malcolm Gladwell raises a very interesting point about subject mastery. He looked at the amount of time individuals spent practicing an activity that they have truly mastered. Looking across a variety of disciplines, he arrived at an estimate of about 10,000 hours of work as that threshold. When you think about it, it is not really that surprising. Employers often look for "5 years experience" as the measure of sufficient experience in a specific job. This equates to 40 hours a week for about 5 years, totaling roughly 10,000 hours. Gladwell found a similar rule to hold for a range of professions, sports and interests.
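As a quick check on that arithmetic, here is a minimal sketch. The figure of 50 working weeks per year is my own assumption (roughly two weeks off), not something taken from Gladwell's book, and the function name is purely illustrative.

```python
# Rough check of the "5 years of full-time work is about 10,000 hours" estimate.
# Assumption (not from the book): about 50 working weeks per year.

def total_hours(hours_per_week, weeks_per_year, years):
    """Return the total hours worked over the given number of years."""
    return hours_per_week * weeks_per_year * years

if __name__ == "__main__":
    # 40 hours a week, 50 weeks a year, 5 years
    print(total_hours(40, 50, 5))  # 10000
    # Same schedule with no time off at all (52 weeks a year)
    print(total_hours(40, 52, 5))  # 10400
```

Either way, the total lands close enough to 10,000 hours that the "5 years experience" rule of thumb and Gladwell's threshold are describing about the same amount of practice.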

The question, then, is how to provide the opportunities and environment to support the development of true expertise in the caBIG tools and the related underlying informatics framework. Given that the program itself is relatively new, and that there has hardly been time yet for anyone to develop expertise at the level that Gladwell describes, it shouldn't be a surprise that there are still many people within the caBIG community and outside of it who dismiss the tools as too complex and difficult to understand. The challenge for caBIG is to continue to establish a place where this expertise can be developed by the stakeholder community. One such critical component is the Knowledge Centers. Another is the experienced pool of caBIG implementers and developers who have successfully used the caBIG tools (often in conjunction with a wide range of tools from other sources) to satisfy the needs of their end-users. The formal documentation and training infrastructure provided by the caBIG Documentation and Training Workspace, as well as by the Center for Biomedical Informatics and Information Technology (CBIIT) at NCI, provides the baseline for ensuring that the information needed by the ultimate users of caBIG is available, structured and consistent.

All the formal documentation and training in the world cannot substitute for functional working systems at other institutes, and for code and tools that can provide the template for successful efforts at similarly interested institutions. Developers and integrators, especially in open source-rich environments, have long looked to software skeletons, "Hello World" examples and detailed tutorials to provide the foundation for their efforts. These kinds of frameworks also provide demonstrable evidence that their particular needs can be satisfied by the tools. This kind of sharing of experiences is one of the things that has provided real impetus to many of the deployment efforts currently ongoing in the caBIG program, and has already led to the sharing of innovative solutions to several common needs within the community. To get to that true level of mastery, caBIG will need to ensure that such frameworks, skeletons and open demonstrators are available to everyone who is interested, and are well categorized and presented, so that what is needed can be easily found.

One thing is clear about gaining real proficiency in anything: the sooner someone gets started, the better they will be in the long run. The old saying about training longbowmen (you start by training his grandfather as a child) holds just as true for software systems and tools. Each of us develops our approach to solving problems (and the associated toolkit) early in our careers. One of the great opportunities of the caBIG program is to give those just starting out in biomedical informatics (and there are bound to be more and more every year) a solid toolkit supporting things like data modeling, semantics, security, and the other components of robust systems. This could even include providing graduate students and postdocs with support for attending caBIG meetings, or grant support for participating in the development and integration of caBIG tools where they serve specific scientific and biomedical goals.

Gladwell points out in his book that success stories are almost never the result of a single independent effort, but are rather built on long-term and community-wide support. By providing the environment to support this kind of horizontal community participation in each deployment effort, we can collectively give those separate development efforts the best possible chance of success.