Introduction
For the Senior Design Project Computer Science course, students were put into groups of six to work on projects for client companies. Five other students and I chose to work on a project for the National Center for Atmospheric Research (NCAR) to create a web application that analyzed climate data. In addition to analyzing the data, the application was required to save created workflows and allow users to share those workflows with others. The application was to serve as a proof of concept for the NCAR team and to help NCAR apply for a grant to turn it into a fully implemented application that works with more than just climate data.
Fall Semester
The fall semester (August to December) was mostly dedicated to research and preparation for building the application. First, we hashed out the basic requirements for the application.
At a basic level, our application had to do seven tasks:
- Subset chosen data by latitude, longitude, and time.
- Perform analysis steps on data.
- Save the analysis steps to a script or other easily-maintained format.
- Allow the user to download the subsetted and/or analyzed NetCDF file.
- Load previously run analysis workflows.
- Provide some form of data visualization for analysis results.
- Provide an easily navigable user interface.
Since this application was designed to be a proof of concept, we focused only on data provided by the North American Regional Climate Change Assessment Program (NARCCAP).
To get a better idea of what architecture and infrastructure would be required to make this application, we made a list of the conceptual software elements we would need and a list of the conceptual functionality elements.
Conceptual Software Elements
- A database to store analysis scripts
- A web server
- A Data Access API
- Data Analysis Packages
- Visualization Tools
- Some sort of plugin architecture or package
Conceptual Functionality Elements
- Pulling data from NARCCAP
- Subsetting data
- Calculation/Analysis of the data
- Visualization
- Storage
- Script Building
We also created several diagrams to help us visualize the structure of our system. These included a basic System Context Diagram, a Dependency Structure Matrix (DSM), a Domain Mapping Matrix (DMM), and a conceptual Architecture Diagram. We knew that we would need to access the NARCCAP database, have some sort of analysis engine, have a visualization engine, and, of course, interact with the user.
For many of us in the group, myself included, this was our first time starting a large project from scratch. We had very little idea of which technologies to use, and among the technologies we had heard of, we were not sure which would best fit our needs. Luckily, our sponsor had some ideas about the basic technologies we would be using.
NCAR already maintains NCL (the NCAR Command Language), a command-line language for handling and manipulating NetCDF files. Our sponsor recommended using NCL as a starting language and expanding to other languages (such as R) later. We were also introduced to OPeNDAP, a protocol for pulling NetCDF files from the NARCCAP server.
We also looked at several web frameworks for our project. We had originally leaned towards Django, but our sponsor introduced us to a new framework in development, Tangelo. The biggest benefit of this framework was that it was built with data visualization in mind, and we originally hoped to use that functionality for visualizing our data.
Once we had a basic idea of the technologies we would use, we went about setting up our virtual machine, a CentOS machine provided by NCAR.
Over the winter break we began creating our web application’s first working demo. First we installed Tangelo and NCL. For this version of the application, we used an already-downloaded NetCDF file and did not implement any OPeNDAP calls. This version of the application showed the basic workflow of Subset Data → Run an Analysis Step on the Data → Plot the resulting Data.
Our application was designed such that each action called a Python script that ran an NCL command as a subprocess. When the subprocess finished, the Python script would return a JSON object containing either an error message or the filename of the NetCDF and/or image file generated. This JSON object was put into localStorage and then used for subsequent steps of the application.
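The subprocess-and-JSON pattern described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the command, arguments, and the convention that the tool prints its output filename are all stand-ins.

```python
import json
import subprocess

def run_step(cmd):
    """Run one analysis step as a subprocess and return a JSON string:
    an error message on failure, or the generated filename on success."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        return json.dumps({"error": result.stderr.strip()})
    # Assume the tool prints the generated file's name on its last output line.
    return json.dumps({"filename": result.stdout.strip().splitlines()[-1]})

# In the real app, cmd would invoke NCL, e.g. ["ncl", "subset.ncl", ...].
print(run_step(["echo", "subset_output.nc"]))
```

The frontend would then parse this JSON, stash it in localStorage, and feed the filename into the next step's form.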
We began the spring semester by showing this basic workflow to our sponsors.
Presentations from Fall Semester
Spring Semester
The spring semester (January to May) was dedicated to building, testing, and delivering our application.
We brought our initial demo that we made over the winter break to our sponsors to get feedback on the general process. They seemed pleased with what we had provided but were worried about actually implementing a user-defined workflow. They also expressed interest in what TangeloHub had to offer for our application.
It turned out that Tangelo and TangeloHub were two different products by Kitware: Tangelo was the web framework, and TangeloHub was a technology for creating scientific workflows. In our initial perusal of Tangelo, we had assumed the two were the same thing. When we realized they were different, we stepped back and looked more closely at what TangeloHub had to offer. We eventually concluded that while TangeloHub was indeed a powerful tool, the learning curve we would have to overcome to use it was too steep. It was also not fully implemented yet, and we were wary of adopting it at such an unstable stage, especially since we were already taking some of that risk with Tangelo.
After showing our sponsor the basic demo we had created, we were primed to begin making a more robust application. The demo only allowed for three steps that used a single NARCCAP file. As part of our next iteration, we worked to integrate the OPeNDAP protocol into our subsetting page and began working on a way to let users add and delete steps to make more advanced workflows. To keep track of these steps, we created a sidebar that listed each step. Users could switch between steps and add and delete any steps they chose.
We also began working on the reproducibility aspect of our project in earnest. Since each step of our app called a process that could just as easily be run in the terminal, we initially created a BashWriter program that generated a shell script containing all of the shell commands for a given workflow.
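A minimal sketch of the BashWriter idea, assuming each recorded step is a list of command arguments; the exact script format the project emitted is not shown here.

```python
import shlex

def write_bash_script(commands, path="workflow.sh"):
    """Write every recorded step's command into one runnable shell script.

    A sketch of the BashWriter concept; 'commands' is a list of argument
    lists, one per workflow step.
    """
    lines = ["#!/bin/bash", "set -e  # stop at the first failing step"]
    for cmd in commands:
        # Quote each argument so filenames with spaces survive the round trip.
        lines.append(" ".join(shlex.quote(part) for part in cmd))
    script = "\n".join(lines) + "\n"
    with open(path, "w") as f:
        f.write(script)
    return script
```

Because every step already ran as a shell command, replaying a workflow outside the web app became a matter of executing this one script.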
We also needed a way to store these complex workflows on the back end. We stumbled across a Python tool, pyutilib.workflow, which stores workflows as a series of tasks with inputs and outputs. This was perfect for our application, and we began modifying our existing Python code to take advantage of these workflows.
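The tasks-with-inputs-and-outputs model can be illustrated with a simplified stand-in. This is not the real pyutilib.workflow API, just the shape of the idea: each task fills named outputs from named inputs, and the workflow chains them together.

```python
class Task:
    """Simplified stand-in for a pyutilib.workflow-style task (not the
    library's actual API): read named inputs, fill named outputs."""
    def __init__(self):
        self.inputs = {}
        self.outputs = {}

    def execute(self):
        raise NotImplementedError

class SubsetTask(Task):
    def execute(self):
        # The real task would call NCL to subset a NetCDF file by lat/lon/time.
        self.outputs["filename"] = "subset_" + self.inputs["filename"]

class PlotTask(Task):
    def execute(self):
        # The real task would render an image from the subsetted file.
        self.outputs["image"] = self.inputs["filename"].replace(".nc", ".png")

def run_workflow(tasks, **data):
    """Run tasks in order, feeding each task's outputs into the next's inputs."""
    for task in tasks:
        task.inputs.update(data)
        task.execute()
        data.update(task.outputs)
    return data

result = run_workflow([SubsetTask(), PlotTask()], filename="narccap.nc")
```

In the real library, task wiring is by declared connections rather than a fixed linear order, which is what made nonlinear workflows possible later.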
Our sponsors had originally stressed the importance of workflow reproducibility, and it had been our main focus, leaving the user interface itself a low-priority item. However, after seeing our latest demo, they changed their priorities: it became essential that workflows be easily navigable and support nonlinear events.
Our sidebar was inadequate for nonlinear workflows; translating parallel steps into a single list that was easily navigable would be nearly impossible. So, we began work on a new workflow visualization tool. Instead of using a textual sidebar, we would use a built-in Tangelo plugin, Nodelink.js (based on the D3 Force Layout), to create a workflow graph.
Since this would greatly change the user interface of our app, we went back to the drawing board and designed new mockups to better show our plan to our sponsors. We also implemented some dummy workflows to show how we would use the Nodelink plugin to our advantage.
This required us to rewrite parts of the pyutilib.workflow code, so that it could give easily parsable data for Nodelink to use. We implemented a basic workflow visualization and demoed it for our sponsors.
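The conversion from workflow data to graph data can be sketched as follows. The field names here (`id`, `inputs_from`, `source`, `target`) follow the common D3 force-layout convention and are assumptions for illustration, not Nodelink's documented schema or our actual task fields.

```python
def to_nodelink(tasks):
    """Flatten a task list into the nodes/links shape a D3 force layout
    consumes, with links pointing from each task's inputs to the task."""
    index = {task["id"]: i for i, task in enumerate(tasks)}
    nodes = [{"name": task["name"], "type": task["type"]} for task in tasks]
    links = [
        {"source": index[src], "target": index[task["id"]]}
        for task in tasks
        for src in task.get("inputs_from", [])
    ]
    return {"nodes": nodes, "links": links}
```

Serving this structure as JSON lets the frontend render each task as a colored node and each input dependency as an edge, including parallel branches a linear sidebar could not show.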
After this demo, we created a final requirements list for our deliverable:
- Workflow Visualization
- Each type of step a different color
- Each step is uniquely labeled
- User can click on a step in the workflow visualization to bring up the configuration of that step
- Non-linear workflow support
- Calculation steps will run either R or NCL scripts via the command line
- A user can download plots and NetCDF files anytime in the workflow using a download step.
- A user can save a workflow to the database.
- A user can enter a workflow serialization number to load a workflow from the database.
- A user can delete steps in a workflow.
- Supply the following calculations:
- Aggregation
- Unit Conversion (temperature conversion for certain; other unit conversions if time allows)
- Ranges (e.g., number of days above a given temperature)
- Percentiles
- Delta between subsets
- Supply standard and native plotting (1 time step)
- General template/wrapper for adding custom analysis steps
- Investigate iterative/looping workflows and deliver a sample and/or documentation around creating them. Deliver answers to these questions:
- Is this possible?
- What would the work be to implement?
- How would we visualize them?
With these requirements set, we began the final integration of all of the pieces.
First, we implemented the new design of the website based on the mockups we had previously made. This ended up being a rather large rework of the HTML. We ended up with three different pages: the home page, the workflow loading page, and the workflow builder page.
The home page was very simple, consisting of a brief explanation and the two necessary buttons: Create a New Workflow and Load an Existing Workflow.
The load page was similarly simple, consisting of a text box for a workflow serial number and a button to load the workflow.
The workflow builder itself is where most of the work took place. Our initial mockups provided a basic idea for the site, but it was severely lacking. For example, by putting the "New Step" button at the bottom of the screen, we would have to animate it along with the drawer that would display a task’s details. This seemed like a poor decision, especially once we realized that Tangelo came with a drawer plugin (called controlPanel) that we could utilize. So the "new step" button was moved to the top of the screen instead. In addition, we had completely forgotten about the "save" functionality we needed to implement, and did not have a "save" button anywhere. This was also added to the design. Finally, we also realized that it would be beneficial for the user to be able to see their workflow serial number at any given time, so this too was added.
Then it was a simple matter of reimplementing the add task functionality, adding the ability to delete tasks, and finally, adding the ability to update tasks.
As part of this process, we refactored our workflow code to be a Tangelo plugin, and also implemented tasks in the pyutilib.workflow library as PluginTasks, which can be generated via a built-in pyutilib.workflow TaskFactory. This simplified a lot of our code and also helped us to better compartmentalize our code.
Most of the trouble came from adding tasks. Once we figured out the bugs in adding a task, deleting and modifying tasks was simply a matter of expanding core components that already worked.
In addition, we were able to take advantage of Tangelo’s built-in temporary storage to store workflow representations while a user was creating their workflow.
We chose to represent workflows as large dictionary objects that in turn became JSON strings we could parse and update. To add a task, a JSON string was “deserialized” into a workflow, the task was added, and the workflow was “reserialized”. For deleting a task, we modified the deserialization method to skip the task matching the deleted node's unique ID and to check the inputs of the other nodes, removing any inputs that referred to the deleted task. Similarly, for updating a task, the deserialization method was modified so that the task matching the unique ID of the modified task would be given the new inputs passed into it instead of reading inputs from the original JSON string. Every time a task was added, updated, or deleted, the workflow was rerun to check for errors and give the user an opportunity to download the resulting NetCDF or PNG.
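The add and delete round trips described above can be sketched with the standard `json` module. The `tasks`/`id`/`inputs` field names are hypothetical stand-ins for our actual workflow schema.

```python
import json

def add_task(workflow_json, task):
    """Deserialize, append the new task, reserialize -- the round trip
    described above."""
    workflow = json.loads(workflow_json)
    workflow["tasks"].append(task)
    return json.dumps(workflow)

def delete_task(workflow_json, task_id):
    """Drop the task with the given ID, and scrub any inputs on the
    remaining tasks that referred to it."""
    workflow = json.loads(workflow_json)
    workflow["tasks"] = [t for t in workflow["tasks"] if t["id"] != task_id]
    for t in workflow["tasks"]:
        t["inputs"] = [i for i in t.get("inputs", []) if i != task_id]
    return json.dumps(workflow)
```

After each such mutation, the app would rerun the whole workflow to surface any errors the change introduced.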
Implementing the saving and loading of workflows to and from the database ended up being relatively trivial. Since we had already devised a method for storing the JSON locally, we simply needed to add some Python functions to interface with the database and update localStorage appropriately.
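A minimal sketch of that save/load interface, with an in-memory dict standing in for the MongoDB collection and a short random hex string standing in for the serial number; both are illustrative choices, not the project's actual implementation.

```python
import uuid

# An in-memory dict stands in here for the MongoDB collection the app used.
DB = {}

def save_workflow(workflow_json):
    """Store the serialized workflow and return the serial number the user
    can later type into the load page."""
    serial = uuid.uuid4().hex[:8]
    DB[serial] = workflow_json
    return serial

def load_workflow(serial):
    """Fetch a stored workflow by its serial number, or None if unknown."""
    return DB.get(serial)
```

Sharing a workflow then amounts to sharing its serial number, which is also why the builder page always displayed it.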
We also managed to prove that our tasks could be written in multiple languages, with some tasks written in NCL and others in R. This showed that the workflow application could have each task written in the language best suited to it, making our application very versatile.
On the other hand, our application ran exceptionally slowly, because it re-executed every task in a workflow even if a task had not changed and already had output. pyutilib.workflow does not provide a way to check whether output has already been calculated, because the library is not specifically designed to pass filenames as inputs and outputs the way our application does.
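The missing caching amounts to a simple check before each task runs. A sketch, assuming each task record carries the filename of its declared output (an assumption about the task shape, not pyutilib.workflow's behavior):

```python
import os

def run_task_if_needed(task, run):
    """Run a task only if its declared output file does not already exist.

    'task' is assumed to be a dict carrying an "output" filename; 'run' is
    whatever callable actually executes the task's subprocess.
    """
    output = task.get("output")
    if output and os.path.exists(output):
        return output  # reuse the cached file instead of re-running the step
    return run(task)
```

A more careful version would also compare timestamps or hashes so that edited upstream steps correctly invalidate downstream outputs.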
Overall, this has been an amazing experience, and I have learned a lot about building a web application from the ground up. I am happy to have had the opportunity to work on this project and hope to utilize the experience it has given me.
Version 2 of the Scientific-Data Workflow
Final Version of the Scientific-Data Workflow (renamed NCAR Climate Model Analysis)
Presentations from Spring Semester
The Web Application
The Stack
Operating System: CentOS 7
Database: MongoDB
Data Analysis Languages: NCL, R
Backend Language: Python
Server/Web Framework: Tangelo
Frontend: JavaScript(JQuery, D3), CSS, HTML
General Process
Demo
Documentation
You can read the documentation by downloading the PDF here. You can also view the code on GitHub.