Looking back at SAP HANA

Shortly after my transition from academia to the enterprise world of SAP, I published an article in ITNews Africa. At the time I was in the midst of rapidly cross-skilling into enterprise (SAP) technology, and I couldn’t help noticing how the value-adds of in-memory computing echoed the staples of big data processing in big physics (i.e., nuclear/particle physics experiments):

  • Flexibility
  • Scale
  • Layers of data structures

Looking back at the article a few years later, though embarrassed by the blatant product-pushing, I’m quite proud of how the emphasis on virtual data structures and models has remained relevant.

I’m currently back in the open-source world, working with the Apache big data stack, and in our data pipeline, virtualization is king. Layers of views, with layers of caching on top of Hadoop, combined with the flexibility of arbitrarily complex Python transformations in PySpark, mean that as a data scientist I’m free to ask the questions that need asking.
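To make that layering concrete, here is a minimal PySpark sketch of the pattern: a view defined over raw data on Hadoop, a cached intermediate layer, and arbitrary Python logic applied on top. The paths, table names, and scoring function are illustrative assumptions, not our actual pipeline.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("virtual-layers").getOrCreate()

# Hypothetical raw table stored on HDFS; path and schema are assumptions.
raw = spark.read.parquet("hdfs:///data/events")

# A virtual layer: a view defined on top of the raw data, no copy made.
raw.createOrReplaceTempView("events")
cleaned = spark.sql("""
    SELECT user_id, event_type, amount
    FROM events
    WHERE amount IS NOT NULL
""")

# Cache the intermediate layer so repeated exploratory queries stay fast.
cleaned.cache()

# Arbitrarily complex Python logic, wrapped as a UDF and applied like SQL.
def risk_score(amount):
    # Placeholder for a more involved Python transformation.
    return min(amount / 1000.0, 1.0)

risk_udf = F.udf(risk_score, DoubleType())
scored = cleaned.withColumn("risk", risk_udf(F.col("amount")))

scored.groupBy("event_type").agg(F.avg("risk")).show()
```

Nothing here is materialized up front: each layer is just a definition until a query forces computation, and the cache keeps the hot intermediate results close at hand.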

Whereas SAP HANA, a fantastic enterprise tool, delivers well on traditional, well-defined use cases closely coupled to SAP business processes, it quickly loses its shine in the hands of a demanding analyst or data scientist. The diversity and complexity of queries in a modern data science workflow can only be delivered by an open-source stack.

These are two different worlds, and both are needed in the enterprise. Today I’ll push another idea, one I’ve been steering our infrastructure towards: the concept of a *just-in-time data warehouse*, as conceptualized by Databricks.

If you’re not familiar with what Databricks is about, I recommend heading over to their website. There are, of course, many angles from which to approach data in your organization; I’d warn against a strategy that doesn’t include a modern component.
