By Paige Roberts
In my previous post, we pondered the merits of the philosophies of (schema on read, and schema on write. Link to previous post.) But schema, wherefore art though schema? Would not a Hadoop data lake by any other strategy be as useful?
Let’s have a look at some of the SQL in Hadoop technologies used in data lakes today. To make the decision as to which strategy you would want to use, it would help to know which Hadoop ecosystem technologies use which strategies before you choose.
Examples of the Two Schema Strategies in Hadoop Ecosystem Technologies
Drill is probably the best example of a pure schema on read SQL engine in the Hadoop ecosystem today. It gives you the power to query a broad set of data, from a wide variety of different data stores, including hierarchical data such as JSON and XML. It also gives you the flexibility to ask any question of that data. Drill is still young technology, but it shows tremendous promise to be the ideal data exploration tool of the Hadoop data lake.
Hive is the original schema on read technology, but is, in fact, a marvelous hybrid of the two technologies. It can do queries on a broad set of data types and sources. However, like Drill, it has speed issues due to the complexity of imposing schema during query. Under the covers, it used to generate MapReduce to essentially do ETL at query time, but due to the limitations of MapReduce, many Hive implementations have moved to using Tez for that same purpose. This has given Hive a nice speed boost.
In order for Hive to gain the advantages of a schema on write data store, ORC file format was created. This is a pre-structured format optimized for Hive queries. By combining strategies, Hive has gained many of the advantages of both camps.
Facebook Presto has a sort of schema on read technology that gives you the ability to query a lot of different data sets, except that they have an intermediate step that you have to do before querying that defines or fetches schemas from the various source data sets. So, it’s a bit of a hybrid as well, and has shown some really impressive speed and flexibility because of it.
Spark SQL is entirely a schema on write technology, but they have a unique way of short-cutting a lot of the slow ETL processes normally associated with schema on write. Spark, itself, is not just an in-memory database format, but also a data processing engine that can do high speed ETL processes entirely in memory. This shortens the up-front cost of the schema on write strategy. Like Hive, Spark SQL often does high speed ETL in the background at query time.
There are several other SQL on write options, each with their own way of doing things that offers some sort of unique advantage or disadvantage: Impala with Parquet, Actian Vortex with Vector in Hadoop, IBM Big SQL with BigInsights, HAWQ with Greenplum, and the list goes on.
There are, of course, the NOSQL databases as well that have their own way of handling the schema dilemma, but that’s a whole other blog post.
So, Which One Should I Choose?
As to which one to choose, that depends entirely on what your goals are, what balcony you want to climb, and what features are important to you accomplishing those goals.
Schema on read options tend to be a better choice for exploration, for “unknown unknowns,” when you don’t know what kind of questions you might want to ask, or the kinds of questions might change over time. They’re also a better option when you don’t have a strong need for immediate responses. They’re ideal for data exploration projects, and looking for new insights with no specific goal in mind.
Schema on write options tend to be very efficient for “known unknowns.” When you know what questions you’re going to need to ask, especially if you will need the answers fast, schema on write is the only sensible way to go. This strategy works best for old school BI types of scenarios on new school big data sets.
In the end, you must decide what you need most, flexibility or precision, speed or power, or some combination of each. Start by deciding what your business most needs to accomplish with its new Hadoop data lake.
No one says that you have to choose only one option. The many parts of the Hadoop ecosystem are designed to live in harmony with YARN keeping peace. Just remember that choosing to go with a Hadoop data lake may mean some trade-offs, but it doesn’t mean giving up SQL.
A good SQL query by any other name still can’t be beat.
Paige Roberts has spent a lot of her life stuffing her brain full of information about big data, data integration, data quality, and analytics software, markets, and systems, and is likely to tell you about it in great detail if you don’t run away fast enough. Find her on Twitter or LinkedIn a RobertsPaige, or check out her blog at bigdatapage.com.
Hope you enjoyed reading this blog.
Visit the Resources page for more eBooks or Whitepages