Big Data Experts

Intro

This is the fourth and final post in a series for Big Data experts on choosing a language for a project. This post presents a Big Data case in which Mosaic Software, big data project experts, assisted NASA by building a tool to help identify inefficiencies in the national airspace. Specifically, Mosaic Software built a simple prototype tool that helps analysts compare and contrast groups of flights in order to identify the key factors behind delayed landing times.

At the heart of this project is a Hadoop Big Data cluster that NASA built to fuse and store much of the national airspace data. Because so many factors can influence the landing time, the model included over 80 factors parsed from recorded aircraft events, flight summary information, and weather reports near the time of landing.

Which Language? What Tools? Most would agree that Python is currently a data scientist’s language of choice due to its simple syntax, maturity, extensive libraries, and visualization capabilities. However, Scala arguably provides more speed and scalability and is easier to deploy and maintain. And what about Julia? Does its potential gain in computational speed outweigh its immaturity?

Point Scala

For this project, the heavy lifting was done using Scala. As stated in Part II, using Scala to connect to a Hadoop database and manipulate data over a Spark cluster is very straightforward and provides better performance than Python. Another reason for choosing Scala was the use of DL4J. Although not as flexible or as well documented as some Python Deep Learning frameworks, DL4J provides a wrapper for Deep Learning models that handles the parallelization of the model across Spark nodes and the synchronization of the model weights during training. Most other deep learning frameworks do not provide this capability out of the box. (Yahoo does have an open-source GitHub repository for using TensorFlow on Spark, but it currently has only 17 contributors and so was ruled out as a viable option.)

Another reason for this choice was the flexibility that Scala affords in defining simple case classes to organize the code. Case classes provided a convenient and intuitive way to organize and bundle model inputs during data pre-processing, aggregating all of the disparate data types into the form the model needs during training. In the end, the model achieved an R^2 score of 0.74 when predicting the landing time.
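For readers less familiar with the metric, here is a minimal Python sketch of how an R^2 score is computed from actual and predicted values. It is not the project's evaluation code (which ran in Scala), and the function name and the landing times below are hypothetical values chosen only for illustration.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    mean_actual = sum(actual) / len(actual)
    # Residual sum of squares: error left over after the model's predictions.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all actual values are equal")
    return 1.0 - ss_res / ss_tot


# Hypothetical landing times in minutes, for illustration only.
actual_minutes = [10.0, 12.0, 15.0, 20.0]
predicted_minutes = [11.0, 12.0, 14.0, 18.0]
score = r_squared(actual_minutes, predicted_minutes)
print(f"R^2 = {score:.3f}")
```

A score of 1.0 means the predictions match exactly, while 0.0 means the model does no better than always predicting the mean landing time.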

However, more interesting than predicting delays in landing time was the ability to identify the reasons behind long delays. Gaining a better understanding of the model outputs called for a visual, interactive platform, which is not a strength of Scala.

Point Python

Due to the plethora of well-supported tools, the visualization, analysis, and exploration of the output were left to Python. While the finished product may not be as refined as something produced using, for example, D3, building a prototype platform for data exploration using ipywidgets and interactive plots from Plotly within a Jupyter notebook, or with the Dash library, is relatively quick and straightforward. This approach will start to bog down when working with tens of thousands of points, but these tools make it possible to build simple prototypes of web-based analytical applications for data exploration fairly quickly, without having to delve into HTML and JavaScript.

Figure 1 shows the interactive tool that was built for this project. The Plotly selection tools allow the user to select two different groups of flights and then compare and contrast the input values to identify key differences between the groups that lead to different landing times.

Figure 1: Jupyter Notebook with ipywidgets and Plotly charts.

Figure 2 shows one such comparison where the ceiling height was found to be significantly different between two groups with very different transit times and, thus, provided an explanation for the difference. 

Figure 2: Comparison of ceiling height at time of landing between two groups of flights.
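The kind of group comparison described above can be sketched in a few lines of Python. This is a hedged illustration rather than the tool's actual logic: the function name, the factor names (`ceiling_ft`, `wind_kts`), the sample values, and the choice of a standardized mean difference as the ranking measure are all assumptions made for the example.

```python
from statistics import mean, pstdev


def rank_factor_differences(group_a, group_b):
    """Rank factors by standardized mean difference between two groups.

    Each group is a non-empty list of dicts mapping factor name to a
    numeric value. Returns (factor, difference) pairs, largest first.
    """
    factors = sorted(set(group_a[0]) & set(group_b[0]))
    ranked = []
    for name in factors:
        a_values = [flight[name] for flight in group_a]
        b_values = [flight[name] for flight in group_b]
        # Scale by the spread of the combined sample so that factors
        # measured in different units can be compared.
        spread = pstdev(a_values + b_values)
        if spread == 0:
            continue  # factor is constant, so it cannot explain anything
        ranked.append((name, (mean(a_values) - mean(b_values)) / spread))
    ranked.sort(key=lambda item: abs(item[1]), reverse=True)
    return ranked


# Hypothetical selections: flights with short and long landing times.
fast_group = [
    {"ceiling_ft": 9000, "wind_kts": 8},
    {"ceiling_ft": 10000, "wind_kts": 10},
    {"ceiling_ft": 9500, "wind_kts": 9},
]
slow_group = [
    {"ceiling_ft": 1500, "wind_kts": 9},
    {"ceiling_ft": 2000, "wind_kts": 8},
    {"ceiling_ft": 1000, "wind_kts": 10},
]

ranked = rank_factor_differences(fast_group, slow_group)
for factor, difference in ranked:
    print(f"{factor}: {difference:+.2f}")
```

With these made-up values, ceiling height ranks first because the two groups differ sharply on it, while wind speed has the same mean in both groups and ranks last.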

Conclusion

So…what language is best? It depends on the project at hand. Scala is great for building robust, maintainable, fast code, but is lacking when it comes to visualization. Python is great for rapid prototyping and data visualization, and has a plethora of well-maintained libraries. Julia, while young, appears to be growing and looks to be a very promising tool when speed is the most important design element.