April 16th, 2015
Most of us in the data management space are familiar with relational databases and their ETL tools. That’s why, as companies implement Hadoop, IT teams sometimes reach for the same relational database ETL tools, adapted for use with Hadoop. Those tools are familiar and comfortable. However, if you really want to get the most out of your Hadoop investment, a bit of a mind shift, away from how you’ve always done things, is in order.
Structural and data processing dissimilarities aside, there are three main points of differentiation to consider: schema, data quality and data access.
With relational databases, you’re required to pre-define the schema and can only capture data that conforms to it. In contrast, Hadoop accepts any type of data in any format, including unstructured data like email, multi-media, web pages, presentations and other business documents. That said, you’ll still need to define schema in order to use the data, but you do have the flexibility to do it with third-party tools either when you load it (on write) or when you use it (on read). We advocate adding schema on write whenever possible, because it will save you a lot of grief on the back end. For more about why, read this earlier post on why schema and metadata matter.
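To make the on-write versus on-read distinction concrete, here is a minimal sketch in plain Python rather than in any particular Hadoop tool. The schema, the field names and the helper functions are hypothetical and exist only for illustration.

```python
import json

# Hypothetical schema for illustration: field name -> type converter.
SCHEMA = {"id": int, "name": str, "salary": float}

def apply_schema(raw_line, schema=SCHEMA):
    """Parse one raw JSON line and coerce it to the schema.

    Fields that are missing (or null) in the raw record become None.
    """
    record = json.loads(raw_line)
    typed = {}
    for field, cast in schema.items():
        value = record.get(field)
        typed[field] = cast(value) if value is not None else None
    return typed

# Raw data as it might arrive: everything is a string, one field is missing.
raw_lines = [
    '{"id": "1", "name": "Ana", "salary": "52000.5"}',
    '{"id": "2", "name": "Raj"}',
]

# Schema on write: apply the schema once, at load time, and store typed records.
typed_store = [apply_schema(line) for line in raw_lines]

# Schema on read: store the raw lines untouched and apply the schema
# every time the data is read.
raw_store = list(raw_lines)

def read_with_schema(store):
    return [apply_schema(line) for line in store]
```

Both paths yield the same typed records here. The difference is where the work and the risk sit: on write, problems surface once at load time; on read, every consumer has to apply (and agree on) the schema each time.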
Hadoop offers more flexibility, but also presents additional challenges, when it comes to data quality. With relational databases, you have a high degree of confidence that your data quality is good, because data can only be captured if it meets the schema criteria. With Hadoop, however, you capture all of the raw data, which may or may not have some “fields” missing.
For example, say you have employee records data and you find that in 20% of the records the date of birth is incomplete or missing. In a relational database, these records would likely be rejected and either you’d go in and resolve the issue or the records wouldn’t be included in the data set.
In Hadoop, such incomplete records can be stored as-is without any constraint. This matters for data users who don’t need the date of birth but can use other data from the records (e.g., gender and geolocation). With Hadoop, it’s the user who determines the level of data quality that’s acceptable, potentially allowing businesses to derive more value from their data.
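The employee-records example can be sketched in a few lines of Python. The records, field names and the usable_for helper below are made up for illustration; the point is only that each consumer states which fields it needs, instead of the store rejecting incomplete records up front.

```python
# Hypothetical employee records, stored as-is. Some fields are missing or empty.
records = [
    {"id": 1, "dob": "1980-04-02", "gender": "F", "location": "Raleigh"},
    {"id": 2, "dob": None,         "gender": "M", "location": "Austin"},
    {"id": 3, "dob": "",           "gender": "F", "location": "Denver"},
    {"id": 4, "dob": "1975-11-30", "gender": "M", "location": None},
]

def usable_for(records, required_fields):
    """Return the records in which every required field is present and non-empty."""
    return [r for r in records if all(r.get(f) for f in required_fields)]

# An age analysis needs the date of birth, so it keeps only complete dob values.
age_analysis = usable_for(records, ["dob"])

# A geographic analysis ignores the date of birth entirely.
geo_analysis = usable_for(records, ["gender", "location"])

print([r["id"] for r in age_analysis])  # [1, 4]
print([r["id"] for r in geo_analysis])  # [1, 2, 3]
```

Records 2 and 3 would have been rejected by a store that required a date of birth, yet they remain fully usable for the geographic analysis.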
The ways you can access data are far broader in Hadoop than in SQL-based relational databases. With Hadoop, storage and the access mechanism are separate. This is critical as the Hadoop ecosystem continues to evolve. It means that you don’t have to move your data, even as access methods change. It also means that users can use different access tools on the same data, depending on their needs. As companies look for new ways to slice and dice data, capture new types of data, use larger data sets and employ predictive modeling, new approaches to understanding big data are needed. Some of these include graph analytics, machine learning, and other advanced algorithms. This flexibility is part of the reason why there’s so much creativity in the Hadoop space right now.
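As a rough, small-scale analogy for separating storage from access, the sketch below stores one dataset once and then reads it with two unrelated access methods, without moving or converting the data. The file name, its contents and both functions are hypothetical; a local file stands in for a Hadoop-style shared store.

```python
import csv
import os
import tempfile
from collections import Counter

# Store the data once, in a plain file.
path = os.path.join(tempfile.mkdtemp(), "events.csv")
with open(path, "w", newline="") as f:
    f.write("user,action\nana,login\nraj,login\nana,purchase\n")

def count_actions(path):
    """Access method 1: structured, column-aware aggregation."""
    with open(path, newline="") as f:
        return Counter(row["action"] for row in csv.DictReader(f))

def find_lines(path, term):
    """Access method 2: unstructured text search over the same file."""
    with open(path) as f:
        return [line.rstrip("\n") for line in f if term in line]

action_counts = count_actions(path)
ana_lines = find_lines(path, "ana")

print(dict(action_counts))  # {'login': 2, 'purchase': 1}
print(ana_lines)            # ['ana,login', 'ana,purchase']
```

A third access method could be added later without touching the stored file, which is the property the paragraph above describes.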
Because data is managed and accessed so differently in Hadoop than with traditional relational databases, savvy IT professionals are investigating new approaches when it comes to determining how they’ll implement and manage their Hadoop projects. Specifically, instead of using existing relational database tools adapted to work with Hadoop – and staying within their comfort zone – many are turning to data management tools that were developed expressly for and work more seamlessly with Hadoop. This is where Zaloni plays. Our Platform, Arena, was built from the ground up with Hadoop in mind.