Thursday, May 17, 2012

Big Data Refinery Fuels Next-Generation Data Architecture

Since joining Hortonworks at the beginning of the year, a question I’ve heard over and over again is “What is Apache Hadoop and what is it used for?”

There’s clearly a lot of hype [and confusion] in this emerging Big Data market, and it feels as if each new technology, as well as each existing one, is pushing the meme of all your data are belong to us. It is great to see the wave of innovation occurring across the landscape of SQL, NoSQL, NewSQL, EDW, MPP DBMS, Data Marts, and Apache Hadoop (to name just a few), but enterprises and the market in general could use a healthy dose of clarity on just how to use and interconnect these various technologies in ways that benefit the business.

In my post entitled “7 Key Drivers for the Big Data Market”, I asserted that the Big Data movement is not only about the classic world of transactions, but it factors in the new(er) worlds of interactions and observations. And this new world brings with it a wide range of multi-structured data sources that are forcing a new way of looking at things.

In order to make sense of this emerging space, I’ve created two graphics designed to walk through a vision of a next-generation data architecture. At the highest level, I describe three broad areas of data processing and outline how these areas interconnect.

The three areas are:
  • Business Transactions & Interactions
  • Business Intelligence & Analytics
  • Big Data Refinery
The graphic below illustrates a vision for how these three types of systems can interconnect in ways aimed at deriving maximum value from all forms of data.   


Enterprise IT has been connecting systems via classic ETL processing, as illustrated in Step 1 above, for many years in order to deliver structured and repeatable analysis. In this step, the business determines the questions to ask and IT collects and structures the data needed to answer those questions.
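To make Step 1 concrete, here is a minimal extract-transform-load sketch in Python. It is purely illustrative: the table names, file names, and the use of sqlite3 as a stand-in for both the transactional system and the warehouse are assumptions, not a reference to any particular product.

```python
import sqlite3

# Minimal ETL sketch (Step 1): extract raw orders from a transactional
# store, transform them into the daily summary the business asked for,
# and load the result into a warehouse table. sqlite3 stands in for
# both systems; all table and column names are hypothetical.

source = sqlite3.connect("transactions.db")   # transactional system
warehouse = sqlite3.connect("warehouse.db")   # BI & Analytics system
source.execute("CREATE TABLE IF NOT EXISTS orders (order_date TEXT, amount REAL)")

# Extract: pull the raw, per-order records.
rows = source.execute("SELECT order_date, amount FROM orders").fetchall()

# Transform: aggregate into the structure needed to answer the
# business question (here, total revenue per day).
daily = {}
for order_date, amount in rows:
    daily[order_date] = daily.get(order_date, 0.0) + amount

# Load: write the structured summary into the warehouse.
warehouse.execute(
    "CREATE TABLE IF NOT EXISTS daily_revenue (day TEXT PRIMARY KEY, revenue REAL)"
)
warehouse.executemany(
    "INSERT OR REPLACE INTO daily_revenue VALUES (?, ?)", daily.items()
)
warehouse.commit()
```

The point of Step 1 is visible in the shape of the code: the questions, and therefore the target schema, are fixed up front, and the pipeline exists to populate that schema.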

The “Big Data Refinery”, as highlighted in Step 2, is a new system capable of storing, aggregating, and transforming a wide range of multi-structured raw data sources into usable formats that help fuel new insights for the business. The Big Data Refinery provides a cost-effective platform for unlocking the potential value within data and discovering the business questions worth answering with this data. A popular example of big data refining is processing Web logs, clickstreams, social interactions, social feeds, and other user-generated data sources into more accurate assessments of customer churn or more effective creation of personalized offers.
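As a sketch of what such refining can look like in practice, here is a pair of hypothetical Hadoop Streaming scripts in Python that boil raw web logs down to a per-user page-view count, a simple feature one might feed into a churn model. The log format (a user ID and a URL, tab-separated) is an assumption made for illustration.

```python
#!/usr/bin/env python
# mapper.py -- Hadoop Streaming mapper: emit (user_id, 1) per page view.
# Assumes tab-separated log lines of the form "user_id<TAB>url".
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) >= 2:
        print("%s\t1" % fields[0])
```

```python
#!/usr/bin/env python
# reducer.py -- Hadoop Streaming reducer: sum the counts for each user.
# Streaming delivers the mapper output sorted by key, so a single pass
# with a running total per user is enough.
import sys

current_user, count = None, 0
for line in sys.stdin:
    user_id, value = line.rstrip("\n").split("\t")
    if user_id != current_user:
        if current_user is not None:
            print("%s\t%d" % (current_user, count))
        current_user, count = user_id, 0
    count += int(value)
if current_user is not None:
    print("%s\t%d" % (current_user, count))
```

The pair can be smoke-tested locally with `cat access.log | python mapper.py | sort | python reducer.py` before being submitted to a cluster with the Hadoop Streaming jar.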

More interestingly, there are businesses deriving value from processing large video, audio, and image files. Retail stores, for example, are leveraging in-store video feeds to help them better understand how customers navigate the aisles as they find and purchase products. Retailers that provide optimized shopping paths and intelligent product placement within their stores are able to drive more revenue for the business. In this case, while the video files may be big in size, the refined output of the analysis is typically small in size but potentially big in value.

The Big Data Refinery platform provides fertile ground for new types of tools and data processing workloads to emerge in support of rich multi-level data refinement solutions.

With that as a backdrop, Step 3 takes the model further by showing how the Big Data Refinery interacts with the systems powering Business Transactions & Interactions and Business Intelligence & Analytics. Interacting in this way enables businesses to build a richer and more informed 360-degree view of customers, for example.

By directly integrating the Big Data Refinery with existing Business Intelligence & Analytics solutions that contain much of the transactional information for the business, companies can enhance their ability to more accurately understand the customer behaviors that lead to the transactions.

Moreover, systems focused on Business Transactions & Interactions can also benefit from connecting with the Big Data Refinery. Complex analytics and calculations of key parameters can be performed in the refinery and flow downstream to fuel runtime models powering business applications with the goal of more accurately targeting customers with the best and most relevant offers, for example.

Since the Big Data Refinery is great at retaining large volumes of data for long periods of time, the model is completed with the feedback loops illustrated in Steps 4 and 5. Retaining the past 10 years of historical “Black Friday” retail data, for example, can benefit the business, especially if it’s blended with other data sources such as 10 years of weather data accessed from a 3rd-party data provider. The point here is that the opportunities for creating value from multi-structured data sources available inside and outside the enterprise are virtually endless if you have a platform that can do it cost effectively and at scale.
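As a toy illustration of that kind of blending, the sketch below joins daily sales history with third-party weather data on date. The file names and CSV layouts ("date,revenue" and "date,temp_f,precip_in") are assumptions, not a real data feed.

```python
import csv

# Hypothetical blend (Steps 4 and 5): enrich retained daily sales
# history with third-party weather data, joined on date.

weather = {}
with open("weather.csv") as f:  # assumed layout: date,temp_f,precip_in
    for row in csv.DictReader(f):
        weather[row["date"]] = (row["temp_f"], row["precip_in"])

with open("daily_sales.csv") as f, \
     open("sales_with_weather.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["date", "revenue", "temp_f", "precip_in"])
    for row in csv.DictReader(f):  # assumed layout: date,revenue
        temp_f, precip_in = weather.get(row["date"], ("", ""))
        writer.writerow([row["date"], row["revenue"], temp_f, precip_in])
```

At refinery scale the same join would be expressed as a Hadoop job rather than a single-machine script, but the shape of the enrichment is the same.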

Let me conclude by describing how the various data processing technologies fit within this next-generation data architecture.


In the graphic above, Apache Hadoop acts as the Big Data Refinery. It’s great at storing, aggregating, and transforming multi-structured data into more useful and valuable formats.

Apache Hive is a Hadoop-related component that fits within the Business Intelligence & Analytics category since it is commonly used for querying and analyzing data within Hadoop in a SQL-like manner. Apache Hadoop can also be integrated with other EDW, MPP, and NewSQL components such as Teradata, Aster Data, HP Vertica, IBM Netezza, EMC Greenplum, SAP HANA, Microsoft SQL Server PDW, and many others.
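For a flavor of that SQL-like access, here is a hypothetical HiveQL query submitted through Hive's `-e` command-line option from Python. The page_views table and its columns are invented for illustration.

```python
import subprocess

# Hypothetical HiveQL query run via the Hive CLI (`hive -e "<query>"`
# executes a single query). The page_views table is an assumption.
query = """
SELECT user_id, COUNT(*) AS views
FROM page_views
WHERE view_date >= '2012-05-01'
GROUP BY user_id
ORDER BY views DESC
LIMIT 10
"""

result = subprocess.run(
    ["hive", "-e", query], capture_output=True, text=True, check=True
)
print(result.stdout)
```

Under the covers, Hive compiles the query into MapReduce jobs, which is what lets analysts apply familiar SQL idioms to data stored in Hadoop.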

Apache HBase is a Hadoop-related NoSQL key/value store that is commonly used for building highly responsive next-generation applications. Apache Hadoop can also be integrated with other SQL, NoSQL, and NewSQL technologies such as Oracle, MySQL, PostgreSQL, Microsoft SQL Server, IBM DB2, MongoDB, DynamoDB, MarkLogic, Riak, Redis, Neo4j, Terracotta, GemFire, SQLFire, VoltDB, and many others.
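As a sketch of HBase's key/value access pattern, here is a small read/write example using the happybase Python client, which talks to HBase through its Thrift server. The host, table name, and column family are assumptions.

```python
import happybase

# Hypothetical HBase access via happybase (requires a running HBase
# Thrift server). The 'profiles' table with an 'info' column family
# is assumed to already exist.
connection = happybase.Connection("hbase-thrift-host")
table = connection.table("profiles")

# Write: a row key mapped to column/value pairs (HBase stores bytes).
table.put(b"user42", {b"info:name": b"Ada", b"info:segment": b"high-value"})

# Read: low-latency point lookups by row key are what make HBase a fit
# for highly responsive applications.
row = table.row(b"user42")
print(row[b"info:segment"])

connection.close()
```

Refinery output, such as customer segments computed upstream, can be loaded into a store like this so that live applications can act on it at request time.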


Finally, data movement and integration technologies help ensure data flows seamlessly between the systems in the above diagrams; the lines in the graphic are powered by technologies such as WebHDFS, Apache HCatalog, Apache Sqoop, Talend Open Studio for Big Data, Informatica, Pentaho, SnapLogic, Splunk, Attunity and many others.
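To make one of those plumbing layers concrete, here is a minimal sketch against the WebHDFS REST API using Python's `requests` library. The NameNode host, the port (50070 was the common default for the NameNode web interface), and the /data path are assumptions.

```python
import requests

# Minimal WebHDFS sketch: list a directory and read a file over HTTP.
base = "http://namenode-host:50070/webhdfs/v1"

# LISTSTATUS returns the directory listing as JSON.
listing = requests.get(base + "/data", params={"op": "LISTSTATUS"})
for entry in listing.json()["FileStatuses"]["FileStatus"]:
    print(entry["pathSuffix"], entry["length"])

# OPEN reads a file; WebHDFS redirects the request to a DataNode,
# and requests follows the redirect automatically.
content = requests.get(base + "/data/sample.txt", params={"op": "OPEN"})
print(content.text)
```

Because WebHDFS is plain HTTP, data can be moved in and out of Hadoop from almost any environment without native client libraries.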

Key Takeaway
A next-generation data architecture is emerging that connects the classic systems powering Business Transactions & Interactions and Business Intelligence & Analytics with Apache Hadoop, a “Big Data Refinery” capable of storing, aggregating, and transforming multi-structured raw data sources into usable formats that help fuel new insights for the business.

Enterprises that get good at maximizing the value from all of their data (i.e. transactions, interactions, and observations) will put themselves in a position to drive more business, enhance productivity, or discover new and lucrative business opportunities.
