There’s clearly a lot of hype (and confusion) in this emerging Big Data market, and it feels as if
each technology, new and existing alike, is pushing the meme of “all your data are belong to us”.
It is great to see the wave of innovation occurring across the landscape of
SQL, NoSQL, NewSQL, EDW, MPP DBMS, Data Marts, and Apache Hadoop (to name just
a few), but enterprises and the market in general can use a healthy dose of
clarity on just how to use and interconnect these various technologies in ways
that benefit the business.
In my post entitled “7 Key Drivers for the Big Data Market”, I
asserted that the Big Data movement is not only about the classic world of transactions but also factors in the
new(er) worlds of interactions and observations. And this new world brings
with it a wide range of multi-structured data sources that are forcing a new
way of looking at things.
In order to make sense of this emerging space, I’ve created two
graphics designed to walk through a vision of a next-generation data
architecture. At the highest level, I describe three broad areas of data
processing and outline how these areas interconnect.
The three areas are:
- Business Transactions & Interactions
- Business Intelligence & Analytics
- Big Data Refinery
The graphic below illustrates a vision for how these three
types of systems can interconnect in ways aimed at deriving maximum value from
all forms of data.
Enterprise IT has been connecting systems via classic ETL processing, as
illustrated in Step 1 above, for
many years in order to deliver structured and repeatable analysis. In this step,
the business determines the questions to ask and IT collects and structures the
data needed to answer those questions.
The “Big Data Refinery”, as highlighted in Step 2, is a new system capable of
storing, aggregating, and transforming a wide range of multi-structured raw
data sources into usable formats that help fuel new insights for the business. The
Big Data Refinery provides a cost-effective platform for unlocking the
potential value within data and discovering the business questions worth
answering with this data. A popular example of big data refining is processing
Web logs, clickstreams, social interactions, social feeds, and other user
generated data sources into more accurate assessments of customer churn or more
effective creation of personalized offers.
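To make the refining idea concrete, here is a minimal sketch in plain Python of reducing raw clickstream lines to one small summary record per user, the kind of output a churn model might consume. The log layout (tab-separated timestamp, user, URL), the sample lines, and the names `RAW_LOG` and `refine` are all assumptions made up for illustration; a real refinery would run this kind of logic at scale on Hadoop rather than in a single process.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical raw clickstream lines: "timestamp<TAB>user_id<TAB>url"
RAW_LOG = """\
2012-03-01T10:00:00\talice\t/home
2012-03-01T10:01:30\talice\t/product/42
2012-03-01T10:05:00\tbob\t/home
malformed line without tabs
2012-03-20T09:00:00\talice\t/cart
"""


def refine(raw_log, as_of):
    """Reduce raw page-view lines to one small summary record per user."""
    views = defaultdict(int)
    last_seen = {}
    for line in raw_log.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # skip records that do not match the assumed layout
        ts_text, user, _url = parts
        ts = datetime.fromisoformat(ts_text)
        views[user] += 1
        if user not in last_seen or ts > last_seen[user]:
            last_seen[user] = ts
    return {
        user: {
            "page_views": views[user],
            "days_since_last_visit": (as_of - last_seen[user]).days,
        }
        for user in views
    }


summary = refine(RAW_LOG, as_of=datetime(2012, 3, 31))
```

The raw input can be arbitrarily large and messy, while `summary` holds only a couple of numbers per user; that asymmetry between input size and output value is the point of the refinery.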
More interestingly, there are businesses deriving value from
processing large video, audio, and image files. Retail stores, for example, are
leveraging in-store video feeds to help them better understand how customers
navigate the aisles as they find and purchase products. Retailers that provide
optimized shopping paths and intelligent product placement within their stores are
able to drive more revenue for the business. In this case, while the video
files may be big in size, the refined output of the analysis is typically small
in size but potentially big in value.
The Big Data Refinery platform provides fertile ground for
new types of tools and data processing workloads to emerge in support of rich
multi-level data refinement solutions.
With that as backdrop, Step
3 takes the model further by showing how the Big Data Refinery interacts with
the systems powering Business Transactions & Interactions and Business
Intelligence & Analytics. Interacting in this way opens up the ability for
businesses to get a richer and more informed 360-degree view of customers, for
example.
By directly integrating the Big Data Refinery with existing
Business Intelligence & Analytics solutions that contain much of the
transactional information for the business, companies can enhance their ability
to more accurately understand the customer
behaviors that lead to the transactions.
Moreover, systems focused on Business Transactions & Interactions can also benefit from connecting with the Big Data Refinery.
Complex analytics and calculations of key parameters can be performed in the
refinery and flow downstream to fuel runtime models powering business
applications with the goal of more accurately targeting customers with the best
and most relevant offers, for example.
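One way to picture this flow is as a batch step that computes a small lookup table in the refinery, followed by a cheap runtime lookup in the business application. The sketch below assumes a made-up history of (segment, offer, accepted) observations and uses a simple acceptance rate as the "key parameter"; the data and the names `compute_offer_table` and `choose_offer` are hypothetical, and a production model would likely be far richer.

```python
from collections import defaultdict

# Hypothetical historical observations: (customer_segment, offer_id, accepted)
history = [
    ("students", "free_shipping", True),
    ("students", "free_shipping", True),
    ("students", "ten_percent_off", False),
    ("students", "ten_percent_off", True),
    ("families", "ten_percent_off", True),
    ("families", "free_shipping", False),
]


def compute_offer_table(observations):
    """Batch step (refinery): best offer per segment by acceptance rate."""
    shown = defaultdict(int)
    accepted = defaultdict(int)
    for segment, offer, was_accepted in observations:
        shown[(segment, offer)] += 1
        accepted[(segment, offer)] += int(was_accepted)
    best = {}
    for (segment, offer), count in shown.items():
        rate = accepted[(segment, offer)] / count
        if segment not in best or rate > best[segment][1]:
            best[segment] = (offer, rate)
    return {segment: offer for segment, (offer, _rate) in best.items()}


def choose_offer(offer_table, segment, default="generic_offer"):
    """Runtime step (application): a cheap lookup in the precomputed table."""
    return offer_table.get(segment, default)


offer_table = compute_offer_table(history)
```

The heavy computation happens once over the full history, and the application only needs the resulting table, which could be pushed to whatever store backs the transactional system.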
Since the Big Data Refinery is great at retaining large
volumes of data for long periods of time, the model is completed with the
feedback loops illustrated in Steps 4
and 5. Retaining the past 10 years of historical “Black Friday” retail
data, for example, can benefit the business, especially if it’s blended with
other data sources such as 10 years of weather data accessed from a 3rd-party
data provider. The point here is that the opportunities for creating value from
multi-structured data sources available inside and outside the enterprise are virtually
endless if you have a platform that can do it cost-effectively and at scale.
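As a rough illustration of blending retained history with an outside source, the sketch below joins yearly sales figures with weather observations on the year and then compares average sales on rainy versus dry days. Every number here is invented for the example, only three years are shown for brevity, and the names `blend` and `mean_sales` are hypothetical.

```python
# Made-up Black Friday sales (millions of dollars) from the retailer's own history.
sales_by_year = {2009: 40.0, 2010: 46.0, 2011: 50.0}

# Made-up weather observations for the same day, as if from a third-party provider.
weather_by_year = {
    2008: {"temp_c": 2, "rain": False},  # no matching sales record
    2009: {"temp_c": 4, "rain": True},
    2010: {"temp_c": 9, "rain": False},
    2011: {"temp_c": 6, "rain": True},
}


def blend(sales, weather):
    """Join the two sources on year, keeping only years present in both."""
    rows = []
    for year in sorted(sales.keys() & weather.keys()):
        rows.append({"year": year, "sales_musd": sales[year], **weather[year]})
    return rows


def mean_sales(rows, rain):
    """Average sales over the blended rows with the given rain flag."""
    matching = [r["sales_musd"] for r in rows if r["rain"] is rain]
    return sum(matching) / len(matching) if matching else None


blended = blend(sales_by_year, weather_by_year)
rainy_avg = mean_sales(blended, rain=True)
dry_avg = mean_sales(blended, rain=False)
```

Neither source answers the question "does rain change Black Friday sales?" on its own; the blended rows do, which is why retaining both for long periods can pay off.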
Let me conclude by describing how the various data processing
technologies fit within this next-generation data architecture.
In the graphic above, Apache Hadoop acts as the Big Data
Refinery. It’s great at storing, aggregating, and transforming multi-structured
data into more useful and valuable formats.
Apache Hive is a Hadoop-related component that fits within
the Business Intelligence & Analytics category since it is commonly used
for querying and analyzing data within Hadoop in a SQL-like manner. Apache
Hadoop can also be integrated with other EDW, MPP, and NewSQL components such
as Teradata, Aster Data, HP Vertica, IBM Netezza, EMC Greenplum, SAP HANA,
Microsoft SQL Server PDW and many others.
Apache HBase is a Hadoop-related NoSQL Key/Value store that
is commonly used for building highly responsive next-generation applications. Apache
Hadoop can also be integrated with other SQL, NoSQL, and NewSQL technologies such
as Oracle, MySQL, PostgreSQL, Microsoft SQL Server, IBM DB2, MongoDB, DynamoDB,
MarkLogic, Riak, Redis, Neo4j, Terracotta, GemFire, SQLFire, VoltDB and many
others.
Finally, data movement and integration technologies help
ensure data flows seamlessly between the systems in the above diagrams; the
lines in the graphic are powered by technologies such as WebHDFS, Apache
HCatalog, Apache Sqoop, Talend Open Studio for Big Data, Informatica, Pentaho, SnapLogic,
Splunk, Attunity and many others.
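Of the technologies listed, WebHDFS is the easiest to show in a few lines because it exposes HDFS over plain HTTP, with URLs of the form `http://<host>:<port>/webhdfs/v1/<path>?op=<OPERATION>`. The sketch below only builds such URLs and does not contact a cluster; the host name, the port 50070, and the paths are assumptions that depend on your Hadoop version and configuration, and `webhdfs_url` is a hypothetical helper rather than part of any Hadoop client library.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, port, hdfs_path, op, **params):
    """Build a WebHDFS REST URL: http://host:port/webhdfs/v1/<path>?op=..."""
    query = urlencode({"op": op, **params})
    # quote() keeps "/" intact and escapes characters that are unsafe in a URL path
    return f"http://{host}:{port}/webhdfs/v1{quote(hdfs_path)}?{query}"


# Assumed NameNode address and paths, for illustration only.
list_url = webhdfs_url("namenode.example.com", 50070, "/data/refined", "LISTSTATUS")
open_url = webhdfs_url(
    "namenode.example.com", 50070, "/data/refined/part-00000", "OPEN",
    offset=0, length=1024,
)
```

An HTTP GET on `list_url` would list the directory, and a GET on `open_url` would read the first 1024 bytes of the file, so any downstream system that can speak HTTP can pull refined output without a native Hadoop client.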
Key Takeaway
A next-generation data architecture is emerging that connects
the classic systems powering Business Transactions & Interactions and
Business Intelligence & Analytics with Apache Hadoop, a “Big Data Refinery”
capable of storing, aggregating, and transforming multi-structured raw data sources
into usable formats that help fuel new insights for the business.
Enterprises that get good at maximizing the value from all
of their data (i.e., transactions, interactions, and observations) will put
themselves in a position to drive more business, enhance productivity, or
discover new and lucrative business opportunities.