dMv (daemonv) wrote,


So, Murphy just sent out the course description for a new mini-course. Sounds like it will be hard, and a very intense experience.

Of course I'm interested. Even if it does involve selling my soul, and the school's soul, to Pharma.

Date: Fri, 13 Dec 2002 19:16:05 -0500
From: Robert F. Murphy <>
To: unspecified-recipients: ;

Bioinformatics Data Integration Practicum

03-410B 6 units (for undergraduate students) /
03-700B 6 units (for graduate students)

Spring 2003 second half mini-course (March 10 - May 2)
Instructor: Robert F. Murphy

This course will provide a practical experience in integration of bioinformatics data of diverse types in collaboration with a major pharmaceutical company, GlaxoSmithKline. At the beginning of the semester, students will be presented with a description of the problem and sample data sets. During the semester, students will work as part of independent teams to design, implement and evaluate an appropriate data integration system (with the opportunity for interaction with GlaxoSmithKline developers for advice and feedback). The course grade will be based on an oral presentation of the developed software system and a written report describing its development and evaluation. Selected students will have the opportunity to travel to GSK to present their projects.

Prerequisites: 03-310 or 03-311 or 03-510 and 15-211 (15-415 or 15-451 recommended), or permission of instructor. (The course will be offered this year as a special section of the Independent Study course and of the Masters Research course.)


Genetics Knowledge Management (GKM) is a data integration system developed at GlaxoSmithKline that allows scientists to correlate and analyze data from disparate scientific technologies. Scientists routinely analyze data across technologies, yet the effort required to do so on a high-throughput scale is prohibitive, resulting in incomplete understanding of the results. GKM provides scientists a way to combine data sets from multiple high-throughput data sources in a way that preserves the integrity of the original data, yet allows correlations across the data sets.

The underlying data merging processes take place in large part on the scientist's desktop computer using basic (and slow) algorithms, which works well for small data sets. A proprietary technique was implemented to perform these operations on the server-side for larger data sets, yet was not optimized.

Project Task

The task in this project course is to develop an optimal technique for data integration from multiple sources that preserves the ability to analyze data.

Developing a solution involves the following steps: understand the problem and requirements; set up a test case scenario using a small sample data set to verify data integrity; design a highly optimized solution for delivery of large data sets in minimal time; implement the design; and evaluate the results. Students are encouraged to bring their skills, whether from the biological sciences or computer science, directly into the development phase.
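As a rough illustration of the "small sample data set to verify data integrity" step above, here is a minimal Python sketch. It is not GKM's method (the description calls that proprietary); the gene IDs, field names, and the functions `merge_by_key` and `verify_integrity` are all made up for the example. It assumes each source is a list of records sharing a `gene_id` key, merges them with an outer join, and prefixes every field with its source name so that original values are never overwritten.

```python
from collections import defaultdict


def merge_by_key(sources, key="gene_id"):
    """Outer-join several record lists on `key`.

    Every field is stored as "<source>.<field>" so values from
    different sources never collide.
    """
    merged = defaultdict(dict)
    for source_name, records in sources.items():
        for record in records:
            row = merged[record[key]]
            for field, value in record.items():
                if field != key:
                    row[f"{source_name}.{field}"] = value
    return dict(merged)


def verify_integrity(sources, merged, key="gene_id"):
    """Return True if every original value can be read back unchanged."""
    for source_name, records in sources.items():
        for record in records:
            row = merged.get(record[key], {})
            for field, value in record.items():
                if field != key and row.get(f"{source_name}.{field}") != value:
                    return False
    return True


# Hypothetical sample data: two sources keyed by a shared gene identifier.
sources = {
    "expression": [
        {"gene_id": "G1", "level": 2.5},
        {"gene_id": "G2", "level": 0.4},
    ],
    "annotation": [
        {"gene_id": "G1", "function": "kinase"},
        {"gene_id": "G3", "function": "transporter"},
    ],
}

merged = merge_by_key(sources)
integrity_ok = verify_integrity(sources, merged)
```

Note that if one source lists the same key twice with different values, the later record overwrites the earlier one and `verify_integrity` returns False, which is the kind of problem a small test case is meant to surface before scaling up.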

The project must result in a working system to be demonstrated during an oral presentation near the conclusion of the course. Also required is a written report describing one or more of the following: metrics for a variety of tested solutions; a model for interfacing the merge component(s) with an existing web services data delivery engine (GKM); or a model for delivering a scalable version of merge, involving high usage and high-volume data sets running in parallel on a server.

Students with an understanding of genomics technologies (gene expression data, protein-protein interaction data, gene sequence annotation data) may add innovative steps to enhance the resultant data set, and allow deeper analyses by scientists. These may include drawing on web-based biological databases. Students with a strong computer science background may develop techniques for managing large data sets, apply parallel processing, test implementation strategies, and focus on application interface issues. Students with interest and background in both areas may ultimately create a "workflow engine" capable of streaming numerous data sets into one highly enriched data set ready for deeper analysis, yet deliver the data in a minimum of time.

Interaction with GlaxoSmithKline

Students may interact with GlaxoSmithKline bioinformaticists and application developers to develop solutions. At the end of the semester, selected students will present their solutions to the GlaxoSmithKline evaluation panel consisting of, but not limited to, members from the GKM development team and representatives from the Information Technology Development Program.
