Hannah Thomas

hannah@hannah-thomas.com

Introduction

For the Senior Design Project Computer Science course, students were put into groups of six to work on projects for client companies. Five other students and I chose to work on a project for the National Center for Atmospheric Research (NCAR) to create a web application that analyzed climate data. In addition to analyzing the data, the application was required to save the workflows users created and allow users to share these workflows with others. The application was to serve as a proof-of-concept for the NCAR team and to help NCAR apply for a grant to turn the proof-of-concept into a fully implemented application that works with more than just climate data.

Fall Semester

The fall semester (August to December) was mostly dedicated to research and preparation for building the application. First, we hashed out the basic requirements for the application.

At a basic level, our application had to do seven tasks:

Since this application was designed to be a proof-of-concept, we only focused on data provided from the North American Regional Climate Change Assessment Program (NARCCAP).

To get a better idea of what architecture and infrastructure would be required to make this application, we made a list of the conceptual software elements we would need and a list of the conceptual functionality elements.

Conceptual Software Elements

Conceptual Functionality Elements

Our initial System Context Diagram.

We also created several diagrams to help us visualize the structure of our system. These included a basic System Context Diagram, a Dependency Structure Matrix (DSM), a Domain Mapping Matrix (DMM), and a conceptual Architecture Diagram. We knew that we would need to access the NARCCAP database, have some sort of analysis engine, have a visualization engine, and, of course, interact with the user.

For many of us in the group, including myself, this was the first time starting a large project from scratch. We had very little idea of which technologies to use, and of the technologies we had heard of, we were not sure which would be best for our needs. Luckily, our sponsor had some ideas about the basic technologies we would be using.

NCAR already has a command-line language for handling and manipulating NetCDF files: the NCAR Command Language (NCL). Our sponsor recommended using NCL as a starting language and expanding to other languages (such as R) later. We also learned about OPeNDAP, a protocol for pulling the NetCDF files from the NARCCAP server.

We also took a look at several web frameworks for our project. We had originally leaned towards Django, but our sponsor introduced us to a new framework in development, Tangelo. The largest benefit of this framework was that it was built with data visualization in mind. Originally, we had hoped to use this functionality for visualizing our data.

Once we had a basic idea of the technologies we would use, we went about setting up our virtual machine, a CentOS machine provided by NCAR.

Over the winter break we began creating our web application’s first working demo. First we installed Tangelo and NCL. For this version of the application, we used an already-downloaded NetCDF file and did not implement any OPeNDAP calls. This version of the application showed the basic workflow of Subset Data → Run an Analysis Step on the Data → Plot the resulting Data.

Our application was designed such that each action called a Python script that ran an NCL command as a subprocess. When the subprocess finished, the Python script returned a JSON object containing either an error message or the filename of the NetCDF and/or image file generated. This JSON object was put into localStorage and then used for subsequent steps of the application.
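
As a rough illustration of that pattern (not the project's actual code), a step runner might look like the sketch below. The function name `run_step`, the JSON keys `error` and `result`, and the file names are all assumptions made for this example, and the Python interpreter stands in for the `ncl` executable so the sketch runs without NCL installed.

```python
import json
import subprocess
import sys


def run_step(command, output_file):
    """Run one external command and report the outcome as a JSON string."""
    try:
        completed = subprocess.run(command, capture_output=True, text=True)
    except OSError as exc:
        # The executable could not be started at all (e.g. it is not installed).
        return json.dumps({"error": str(exc)})
    if completed.returncode != 0:
        # Prefer the command's own error output; fall back to the exit code.
        message = completed.stderr.strip() or "exit code %d" % completed.returncode
        return json.dumps({"error": message})
    # On success, hand back the name of the file the step was asked to produce.
    return json.dumps({"result": output_file})


# Stand-in for an NCL call such as ["ncl", "subset.ncl"] (hypothetical script name).
response = run_step([sys.executable, "-c", "pass"], "subset_output.nc")
```

The front end would then store `response` and pass the returned filename to the next step.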

We began the spring semester by showing this basic workflow to our sponsors.

The first iteration of our application.

Presentations from Fall Semester

A PDF of our first Presentation.
Download our first in-class presentation on Scientific Data. This presentation is an overview of the project and our initial thoughts on the project.
A PDF of our second Presentation.
Download our second in-class presentation on Scientific Data. This presentation includes many different diagrams that represent our system, including our initial Architecture, the DSM/DMM, and a System Context Diagram.
A PDF of our DSM/DMM.
Download our DSM/DMM.

Spring Semester

The spring semester (January to May) was dedicated to building, testing, and delivering our application.

We brought our initial demo that we made over the winter break to our sponsors to get feedback on the general process. They seemed pleased with what we had provided but were worried about actually implementing a user-defined workflow. They also expressed interest in what TangeloHub had to offer for our application.

It turned out that Tangelo and TangeloHub were two different products by Kitware. Tangelo was the web framework, and TangeloHub was a technology for creating scientific workflows. On our initial perusal of Tangelo, we had assumed Tangelo and TangeloHub were the same thing. When we realized they were different, we stepped back and began to look more closely at what TangeloHub had to offer. We eventually came to the conclusion that while TangeloHub was indeed a powerful tool, its learning curve was too steep for us to overcome in the time we had. It was also not fully implemented yet, so we were wary of using it at such an unstable stage, especially since we were already taking some of that risk with Tangelo.

After showing our sponsor the basic demo we had created, we were primed to begin making a more robust application. The demo only allowed for three steps that used a single NARCCAP file. As part of our next iteration, we worked to integrate the OPeNDAP protocol into our subsetting page and began working on a way to let users add and delete steps to make more advanced workflows. To keep track of these steps, we created a sidebar that listed each step. Users could switch between steps and add and delete any steps they chose.

Our new version of the application, with a sidebar that added and deleted steps.

We also began working on the reproducibility aspect of our project in earnest. Since our app called a process for each step that could easily be run in the terminal, we initially created a BashWriter program that created a shell script consisting of all shell commands for a given workflow.
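
The idea behind such a writer can be sketched in a few lines. This is only an illustration under assumptions: the function name `write_bash_script` and the NCL script names are hypothetical, and the original BashWriter may have been structured quite differently.

```python
import shlex


def write_bash_script(steps):
    """Return the text of a shell script that replays each step in order."""
    lines = ["#!/bin/bash", "set -e  # stop at the first failing step"]
    for argv in steps:
        # Quote every argument so paths with spaces survive the round trip.
        lines.append(" ".join(shlex.quote(arg) for arg in argv))
    return "\n".join(lines) + "\n"


# Hypothetical three-step workflow: subset, analyze, plot.
steps = [
    ["ncl", "subset.ncl"],
    ["ncl", "aggregate.ncl"],
    ["ncl", "plot.ncl"],
]
script = write_bash_script(steps)
```

Writing `script` to a file gives a user something they can rerun in a terminal without the web application.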

We also needed a way to store these complex workflows on the back end. We stumbled across a Python tool called pyutilib.workflow, which stored workflows as a series of tasks with inputs and outputs. This was perfect for our application, and we began modifying our existing Python code to take advantage of these workflows.

Our sponsors had originally stressed the importance of workflow reproducibility, and it had been our main focus, pushing the user interface itself to be a low-priority item. However, after showing them our latest demo, they changed their priorities. It became essential that the workflows be easily navigable and support nonlinear events.

Our sidebar was inadequate for nonlinear workflows; translating parallel steps into a single list that was easily navigable would be nearly impossible. So, we began work on a new workflow visualization tool. Instead of using a textual sidebar, we would use a built-in Tangelo plugin, Nodelink.js (based on the D3 Force Layout), to create a workflow graph.

A mockup of our design.

Since this would greatly change the user interface of our app, we went back to the drawing board and designed new mockups to better show our plan to our sponsors. We also implemented some dummy workflows to show how we would use the Nodelink plugin to our advantage.

This required us to rewrite parts of the pyutilib.workflow code, so that it could give easily parsable data for Nodelink to use. We implemented a basic workflow visualization and demoed it for our sponsors.

After this demo, we created a final requirements list for our deliverable:

With these requirements set, we began the final integration of all of the pieces.

The new Home Screen for the final version of the app.
The new Loading Screen for the final version of the app.
The new Workflow Builder Screen for the final version of the app.

First, we implemented the new design of the website based on the mockups we had previously made. This ended up being a rather large rework of the HTML. We ended up with three different pages: the home page, the workflow loading page, and the workflow builder page.

The home page was very simple, basically consisting of some brief explanation and the two necessary buttons: Create a New Workflow, and Load an Existing Workflow.

The load page was also very simple, basically consisting of a text box for a workflow serial number and a button to load the workflow.

The workflow builder itself is where most of the work took place. Our initial mockups provided a basic idea for the site, but it was severely lacking. For example, by putting the "New Step" button at the bottom of the screen, we would have to animate it along with the drawer that would display a task’s details. This seemed like a poor decision, especially once we realized that Tangelo came with a drawer plugin (called controlPanel) that we could utilize. So the "new step" button was moved to the top of the screen instead. In addition, we had completely forgotten about the "save" functionality we needed to implement, and did not have a "save" button anywhere. This was also added to the design. Finally, we also realized that it would be beneficial for the user to be able to see their workflow serial number at any given time, so this too was added.

Then it was a simple matter of reimplementing the add task functionality, adding the ability to delete tasks, and finally, adding the ability to update tasks.

As part of this process, we refactored our workflow code to be a Tangelo plugin, and also implemented tasks in the pyutilib.workflow library as PluginTasks, which can be generated via a built-in pyutilib.workflow TaskFactory. This simplified a lot of our code and helped us compartmentalize it better.

Most of the trouble came from adding tasks. Once we figured out the bugs in adding a task, deleting and modifying tasks was simply a matter of expanding core components that already worked.

In addition, we were able to take advantage of Tangelo’s built-in temporary storage to store workflow representations while a user was creating their workflow.

We chose to represent workflows as large dictionary objects, which in turn became JSON strings we could parse and update. To add a task, a JSON string was “deserialized” into a workflow, the task was added, and the workflow was “reserialized”. For deleting a task, we modified the deserialization method to skip the task matching the deleted node's unique ID, and to check the inputs of other nodes so that any inputs referring to the deleted task were removed. Similarly, for updating tasks, the deserialization method was modified so that the task matching the unique ID of the modified task would be given the new inputs passed in instead of reading its inputs from the original JSON string. Every time a task is added, updated, or deleted, the workflow is rerun to check for errors and give the user an opportunity to download the resulting NetCDF or PNG.
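
The delete and update behavior described above can be sketched with plain dictionaries. Everything about the data layout here is an assumption for illustration: the function names, the `type` and `inputs` keys, and the `{"from": task_id}` convention for an input that comes from another task are not taken from the project, which built real pyutilib.workflow tasks rather than dictionaries.

```python
import json


def refers_to(value, uid):
    """True when an input value points at the output of task `uid`."""
    return isinstance(value, dict) and value.get("from") == uid


def rebuild(workflow_json, deleted_id=None, updated_id=None, new_inputs=None):
    """Rebuild a workflow from JSON, skipping or updating one task on the way."""
    tasks = {}
    for uid, task in json.loads(workflow_json).items():
        if uid == deleted_id:
            continue  # the deleted task is simply never deserialized
        # An updated task takes the new inputs instead of the stored ones.
        inputs = dict(new_inputs) if uid == updated_id else dict(task["inputs"])
        if deleted_id is not None:
            # Remove inputs that referred to the deleted task.
            inputs = {name: value for name, value in inputs.items()
                      if not refers_to(value, deleted_id)}
        tasks[uid] = {"type": task["type"], "inputs": inputs}
    return json.dumps(tasks, sort_keys=True)


workflow = json.dumps({
    "t1": {"type": "subset", "inputs": {"url": "narccap.nc"}},
    "t2": {"type": "aggregate",
           "inputs": {"data": {"from": "t1"}, "interval": "monthly"}},
})
without_t1 = rebuild(workflow, deleted_id="t1")
```

In this example, `without_t1` keeps only `t2`, and its `data` input is dropped because it pointed at the deleted task.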

Saving workflows to the database and loading them back ended up being relatively trivial. Since we had already come up with a method for storing the JSON locally, we simply needed to add some Python functions to interface with the database and modify localStorage appropriately.

We also managed to prove that our tasks can be written in multiple languages, with some tasks written in NCL and others in R. This shows that each task in the workflow application can be written in the language best suited to it, which makes our application very versatile.

On the other hand, our application runs exceptionally slowly because it re-executes every task in a workflow, even tasks that have not changed and already have output. pyutilib.workflow does not provide a way to check whether output has already been calculated, because the library is not specifically designed to pass filenames as input and output, as our application does.

Overall, this has been an amazing experience, and I have learned a lot about building a web application from the ground up. I am happy to have had the opportunity to work on this project and hope to utilize the experience it has given me.

Version 2 of the Scientific-Data Workflow

Final Version of the Scientific-Data Workflow (renamed NCAR Climate Model Analysis)

Presentations from Spring Semester

A PDF of our third Presentation.
Download our third in-class presentation on Scientific Data. This presentation focused on the progress made over winter break and what our next steps were.
A PDF of our fourth Presentation.
Download our fourth in-class presentation on Scientific Data. This presentation includes the documentation expectations from our sponsor, as well as what we had completed since our previous presentation and the next steps we planned to take.
A PDF of our Computer Science Expo poster.
Download the poster from the Computer Science Expo.
A PDF of our Computer Science Expo handout.
Download the handout from the Computer Science Expo.

The Web Application

The Stack

Operating System: CentOS 7
Database: MongoDB
Data Analysis Languages: NCL, R
Backend Language: Python
Server/Web Framework: Tangelo
Frontend: JavaScript (jQuery, D3), CSS, HTML

General Process

The general flow of the application. The user clicks “Add New Task,” “Update Task,” or “Delete Task,” which sends an HTTP GET request to the Python workflow builder. The builder applies the change to the workflow, and then returns a JSON object with the results of the workflow and a representation for rendering the workflow. The JavaScript updates the workflow visualization, and then the user is free to add, update, or delete another step.
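
A minimal sketch of the builder side of this round trip is shown below, under several assumptions: the function name `handle_request`, the `after` key listing a task's upstream tasks, and the response keys are hypothetical, and actually running the workflow to produce results is left out. The `nodes`/`links` shape with index-based `source` and `target` follows the convention commonly used with D3 force layouts, which is what a node-link view would consume.

```python
import json


def handle_request(workflow, action, task_id, task=None):
    """Apply one add/update/delete action and return the JSON response text."""
    tasks = dict(workflow)
    if action == "delete":
        tasks.pop(task_id, None)
    elif action in ("add", "update"):
        tasks[task_id] = task
    else:
        return json.dumps({"error": "unknown action: " + action})
    ids = sorted(tasks)
    # One node per task, one link per dependency that still exists.
    nodes = [{"id": uid, "type": tasks[uid]["type"]} for uid in ids]
    links = [{"source": ids.index(dep), "target": ids.index(uid)}
             for uid in ids
             for dep in tasks[uid].get("after", [])
             if dep in tasks]
    return json.dumps({"workflow": tasks,
                       "graph": {"nodes": nodes, "links": links}})


reply = handle_request({"t1": {"type": "subset"}}, "add", "t2",
                       {"type": "plot", "after": ["t1"]})
```

The browser would parse `reply`, redraw the graph from `graph`, and keep `workflow` for the next request.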

Demo

A simple demo showing the process of adding tasks together to make a workflow.

Documentation

You can read the documentation by downloading the PDF here. You can also view the code on GitHub.
