One function of a data scientist is to interpret data for business users. A common way to do this is to retrieve the data, create some graphs, and drop it all into a PowerPoint deck. A different concept – the interactive web-based notebook – has revolutionized the data science industry in the past couple of years, replacing archaic notions of how we present data. When we think of a notebook now, it is not the dusty, spiral-bound sheaf of papers of yore but a flashy, customizable web interface.
The big data challenge
At the same time, an array of big data technologies – Hadoop, Spark, Hive – has also risen in popularity among programmers and data scientists alike. Even large corporations, usually hesitant to change, have adopted Hadoop for their data warehousing. Along with this change, we see business users who want more control over their data. However, for those with minimal programming experience, setting up and configuring these technologies can be daunting. Beyond that lies another hurdle: command-line interfaces, which are not always the most user-friendly.
The solution: Apache Zeppelin
We see a solution in Apache Zeppelin, a marriage of big data and the notebook. The project – still in incubation – allows anyone with a web browser to navigate to a URL and begin mining data and creating visualizations, with a host of languages to choose from. Integration with Spark is built into Zeppelin, reducing startup time. Collaboration is easy, with multiple people able to access the same notebook and make changes in real time. Visualizations are supported by the notebook itself, so simple charts no longer require extra lines of plotting code. Projects such as ZeppelinHub take it one step further, allowing notebooks to be shared and made available to a widespread audience. Together, these projects are proving that code for big data doesn’t have to exist only in text files and IDEs. Notebooks allow the writer to tell a story: the process by which one explores a dataset. More importantly, they let you show the path you chose and the conclusions you drew at each point along the way. Adding big data technologies into the mix gives anyone the tools to be a data scientist.
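To give a rough feel for the built-in visualization, here is a minimal sketch that assumes a notebook paragraph running under Zeppelin's Python interpreter. Zeppelin's display system treats printed output that begins with %table as tabular data, with columns separated by tabs and rows by newlines, and the notebook can then render it as a table or a simple chart. The helper name to_zeppelin_table and the sample figures are hypothetical, made up purely for illustration.

```python
def to_zeppelin_table(header, rows):
    """Format data as Zeppelin %table output.

    Columns are tab-separated and rows are newline-separated;
    the first row is treated as the header.
    """
    lines = ["\t".join(str(cell) for cell in header)]
    for row in rows:
        lines.append("\t".join(str(cell) for cell in row))
    return "%table " + "\n".join(lines)


# Hypothetical sample data, not taken from any real dataset.
sales = [("north", 120), ("south", 95), ("west", 143)]

output = to_zeppelin_table(("region", "orders"), sales)

# Inside a Zeppelin paragraph, printing this string lets the
# notebook display the data without any plotting code.
print(output)
```

Outside Zeppelin the same code simply prints tab-separated text, which makes the formatting easy to check before pasting it into a notebook.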
All this is not to say that there are no shortcomings. One of the biggest is security, a possible impediment to adoption at large corporations. However, time is on the side of these projects, and they have already shown much promise at this early point in their development.
Let us know what you think about the intersection of big data and notebooks, or about Apache Zeppelin. We welcome your comments!