Thursday, May 17, 2012

Big Data Refinery Fuels Next-Generation Data Architecture

Since joining Hortonworks at the beginning of the year, I’ve heard one question over and over again: “What is Apache Hadoop and what is it used for?”

There’s clearly a lot of hype [and confusion] in this emerging Big Data market, and it feels as if each new technology, as well as each existing one, is pushing the meme of “all your data are belong to us.” It is great to see the wave of innovation occurring across the landscape of SQL, NoSQL, NewSQL, EDW, MPP DBMS, Data Marts, and Apache Hadoop (to name just a few), but enterprises and the market in general can use a healthy dose of clarity on just how to use and interconnect these various technologies in ways that benefit the business.

In my post entitled “7 Key Drivers for the Big Data Market”, I asserted that the Big Data movement is not only about the classic world of transactions, but also about the new(er) worlds of interactions and observations. These newer worlds bring with them a wide range of multi-structured data sources that are forcing a new way of looking at things.

In order to make sense of this emerging space, I’ve created two graphics designed to walk through a vision of a next-generation data architecture. At the highest level, I describe three broad areas of data processing and outline how these areas interconnect.

The three areas are:
  • Business Transactions & Interactions
  • Business Intelligence & Analytics
  • Big Data Refinery
The graphic below illustrates a vision for how these three types of systems can interconnect in ways aimed at deriving maximum value from all forms of data.   


For many years, Enterprise IT has been connecting systems via classic ETL processing, as illustrated in Step 1 above, in order to deliver structured and repeatable analysis. In this step, the business determines the questions to ask, and IT collects and structures the data needed to answer those questions.
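To make Step 1 concrete, here is a minimal sketch of that classic extract-transform-load pattern in Python. The databases, tables, and columns are hypothetical stand-ins for a real transactional system and warehouse, and the transactional schema is assumed to already exist:

import sqlite3

# Hypothetical ETL sketch: extract raw order lines from a transactional
# store, transform them into a per-product daily summary, and load the
# result into a reporting table that BI tools can query.
src = sqlite3.connect("transactions.db")   # stand-in for the OLTP system
dst = sqlite3.connect("warehouse.db")      # stand-in for the EDW target

dst.execute("""CREATE TABLE IF NOT EXISTS daily_sales
               (day TEXT, product_id INTEGER, revenue REAL)""")

# Extract + transform: aggregate order lines into one row per product/day,
# answering the structured questions the business defined up front.
rows = src.execute("""SELECT date(order_ts), product_id, SUM(quantity * price)
                      FROM order_lines
                      GROUP BY date(order_ts), product_id""").fetchall()

# Load: append the pre-aggregated answers into the warehouse.
dst.executemany("INSERT INTO daily_sales VALUES (?, ?, ?)", rows)
dst.commit()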

The “Big Data Refinery”, as highlighted in Step 2, is a new system capable of storing, aggregating, and transforming a wide range of multi-structured raw data sources into usable formats that help fuel new insights for the business. The Big Data Refinery provides a cost-effective platform for unlocking the potential value within data and discovering the business questions worth answering with this data. A popular example of big data refining is processing Web logs, clickstreams, social interactions, social feeds, and other user-generated data sources into more accurate assessments of customer churn or more effective personalized offers.
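To give a flavor of how such refining jobs are often written, here is a hedged sketch of a Hadoop Streaming job in Python that boils raw web logs down to per-user activity counts, one simple input to a churn model. The log format, field positions, and HDFS paths are assumptions, not a prescription:

#!/usr/bin/env python
# refine_mapper.py -- emits (user_id, 1) for each valid web-log line.
# Assumes tab-separated input: timestamp, user_id, url, status.
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 4:
        continue  # skip malformed lines rather than failing the job
    print("%s\t1" % fields[1])

#!/usr/bin/env python
# sum_reducer.py -- sums the mapper's counts; Hadoop Streaming delivers
# lines to the reducer sorted (and therefore grouped) by key.
import sys

current_user, total = None, 0
for line in sys.stdin:
    user_id, count = line.rstrip("\n").split("\t")
    if user_id != current_user and current_user is not None:
        print("%s\t%d" % (current_user, total))
        total = 0
    current_user = user_id
    total += int(count)
if current_user is not None:
    print("%s\t%d" % (current_user, total))

The job would be launched with something like the following, where the jar location and HDFS paths will vary by installation:

hadoop jar hadoop-streaming.jar \
    -input /logs/web -output /refined/user_activity \
    -mapper refine_mapper.py -reducer sum_reducer.py \
    -file refine_mapper.py -file sum_reducer.py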

More interestingly, there are businesses deriving value from processing large video, audio, and image files. Retail stores, for example, are leveraging in-store video feeds to help them better understand how customers navigate the aisles as they find and purchase products. Retailers that provide optimized shopping paths and intelligent product placement within their stores are able to drive more revenue for the business. In this case, while the video files may be big in size, the refined output of the analysis is typically small in size but potentially big in value.

The Big Data Refinery platform provides fertile ground for new types of tools and data processing workloads to emerge in support of rich multi-level data refinement solutions.

With that as a backdrop, Step 3 takes the model further by showing how the Big Data Refinery interacts with the systems powering Business Transactions & Interactions and Business Intelligence & Analytics. Interacting in this way enables businesses to build a richer and more informed 360-degree view of customers, for example.

By directly integrating the Big Data Refinery with existing Business Intelligence & Analytics solutions that contain much of the transactional information for the business, companies can enhance their ability to more accurately understand the customer behaviors that lead to the transactions.

Moreover, systems focused on Business Transactions & Interactions can also benefit from connecting with the Big Data Refinery. Complex analytics and calculations of key parameters can be performed in the refinery and flow downstream to fuel the runtime models powering business applications, with the goal of targeting customers more accurately with the best and most relevant offers, for example.

Since the Big Data Refinery is great at retaining large volumes of data for long periods of time, the model is completed with the feedback loops illustrated in Steps 4 and 5. Retaining the past 10 years of historical “Black Friday” retail data, for example, can benefit the business, especially if it’s blended with other data sources such as 10 years of weather data accessed from a third-party data provider. The point here is that the opportunities for creating value from multi-structured data sources available inside and outside the enterprise are virtually endless if you have a platform that can do it cost-effectively and at scale.
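As a toy illustration of what “blending” means, the sketch below enriches each historical sales record with that day’s weather, which is conceptually just a join on date. The file names and layouts are invented for the example:

import csv

# Load a decade of third-party weather data keyed by date (hypothetical
# two-column extract: date, precipitation).
weather = {}
with open("weather_10yr.csv") as f:
    for date, precip in csv.reader(f):
        weather[date] = precip

# Stream a decade of retail data (hypothetical columns: date, store_id,
# revenue) and emit each record blended with that day's weather.
with open("black_friday_sales_10yr.csv") as f:
    for date, store_id, revenue in csv.reader(f):
        print(date, store_id, revenue, weather.get(date, "n/a"))

At refinery scale, the same join would run as a Hadoop job rather than on a single machine, but the blending logic is no more complicated than this.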

Let me conclude by describing how the various data processing technologies fit within this next-generation data architecture.


In the graphic above, Apache Hadoop acts as the Big Data Refinery. It’s great at storing, aggregating, and transforming multi-structured data into more useful and valuable formats.

Apache Hive is a Hadoop-related component that fits within the Business Intelligence & Analytics category, since it is commonly used for querying and analyzing data within Hadoop in a SQL-like manner. Apache Hadoop can also be integrated with other EDW, MPP, and NewSQL components such as Teradata, Aster Data, HP Vertica, IBM Netezza, EMC Greenplum, SAP HANA, Microsoft SQL Server PDW, and many others.
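To sketch what “SQL-like” means in practice, the HiveQL below could be submitted from Python by shelling out to the Hive command-line client. The clickstream table and its columns are hypothetical:

import subprocess

# Hedged sketch: find the ten most active users on a given day from a
# refined clickstream table (table and column names are assumptions).
query = """
    SELECT user_id, COUNT(*) AS page_views
    FROM clickstream
    WHERE dt = '2012-05-15'
    GROUP BY user_id
    ORDER BY page_views DESC
    LIMIT 10
"""
subprocess.check_call(["hive", "-e", query])

Anyone comfortable with SQL can read and write queries like this, which is exactly why Hive sits so naturally in the Business Intelligence & Analytics category.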

Apache HBase is a Hadoop-related NoSQL key/value store that is commonly used for building highly responsive next-generation applications. Apache Hadoop can also be integrated with other SQL, NoSQL, and NewSQL technologies such as Oracle, MySQL, PostgreSQL, Microsoft SQL Server, IBM DB2, MongoDB, DynamoDB, MarkLogic, Riak, Redis, Neo4j, Terracotta, GemFire, SQLFire, VoltDB, and many others.
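To sketch the kind of low-latency, key-based access HBase enables, the example below uses the third-party happybase Python client against a hypothetical table, and assumes an HBase Thrift server is running on localhost:

import happybase  # third-party client that talks to the HBase Thrift server

# Hedged sketch: store a refined customer profile in HBase and read it
# back by key. Table name, column family, and fields are hypothetical.
connection = happybase.Connection("localhost")
table = connection.table("customer_profiles")

# Write one profile row; columns live under the "p" column family.
table.put(b"user42", {b"p:churn_score": b"0.87", b"p:best_offer": b"10PCT_OFF"})

# A customer-facing application can fetch the row by key in milliseconds.
profile = table.row(b"user42")
print(profile[b"p:best_offer"])

The design point is that results computed in batch by the refinery land in a store that online applications can read fast enough to power the next page view or offer.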


Finally, data movement and integration technologies help ensure data flows seamlessly between the systems in the above diagrams; the lines in the graphic are powered by technologies such as WebHDFS, Apache HCatalog, Apache Sqoop, Talend Open Studio for Big Data, Informatica, Pentaho, SnapLogic, Splunk, Attunity and many others.
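WebHDFS, for instance, exposes HDFS over plain HTTP, so anything that can speak REST can move data in and out of the refinery. Here is a minimal sketch in Python; the host and path are assumptions, and 50070 is the classic default NameNode HTTP port:

import json
import urllib.request

# Hedged sketch: list a directory in HDFS via the WebHDFS REST API.
url = "http://namenode:50070/webhdfs/v1/refined/user_activity?op=LISTSTATUS"
with urllib.request.urlopen(url) as resp:
    statuses = json.load(resp)["FileStatuses"]["FileStatus"]

for status in statuses:
    print(status["pathSuffix"], status["length"])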

Key Takeaway
A next-generation data architecture is emerging that connects the classic systems powering Business Transactions & Interactions and Business Intelligence & Analytics with Apache Hadoop, a “Big Data Refinery” capable of storing, aggregating, and transforming multi-structured raw data sources into usable formats that help fuel new insights for the business.

Enterprises that get good at maximizing the value from all of their data (i.e., transactions, interactions, and observations) will put themselves in a position to drive more business, enhance productivity, or discover new and lucrative business opportunities.

Tuesday, May 15, 2012

7 Key Drivers for the Big Data Market


I attended the Goldman Sachs Cloud Conference and participated in a panel focused on “Data: The New Competitive Advantage”. The panel covered a wide range of topics, but it kicked off with two basic questions:

“What is Big Data?” and “What are the drivers behind the Big Data market?”

While most definitions of Big Data focus on the new forms of unstructured data flowing through businesses with new levels of “volume, velocity, variety, and complexity”, I tend to answer the question using a simple equation:

Big Data = Transactions + Interactions + Observations

The following graphic illustrates what I mean:


ERP, SCM, CRM, and transactional Web applications are classic examples of systems processing Transactions. And the highly structured data in these systems is typically stored in SQL databases.

Interactions are about how people and things interact with each other or with your business. Web Logs, User Click Streams, Social Interactions & Feeds, and User-Generated Content are classic places to find Interaction data.

Observational data tends to come from the “Internet of Things”. Sensors for heat, motion, and pressure, along with RFID and GPS chips embedded in such things as mobile devices, ATMs, and even aircraft engines, are just some examples of “things” that output Observation data.

With that basic definition of Big Data as background, let’s answer the question:

What are the 7 Key Drivers Behind the Big Data Market?

Business
1. Opportunity to enable innovative new business models
2. Potential for new insights that drive competitive advantage

Technical
3. Data collected and stored continues to grow exponentially
4. Data is increasingly everywhere and in many formats
5. Traditional solutions are failing under new requirements

Financial
6. Cost of data systems, as % of IT spend, continues to grow
7. Cost advantages of commodity hardware and open source software

There’s a new generation of data management technologies, such as Apache Hadoop, providing an innovative and cost-effective foundation for the emerging landscape of Big Data processing and analytics solutions. Needless to say, I’m excited to see how this market will mature and grow over the coming years.

Key Takeaway
Being able to dovetail the classic world of Transactions with the new(er) worlds of Interactions and Observations in ways that drive more business, enhance productivity, or discover new and lucrative business opportunities is why Big Data is important.

One promise of Big Data is that companies that get good at collecting, aggregating, refining, analyzing, and maximizing the value derived from Transactions, Interactions, and Observations will put themselves in a position to answer questions such as:

What are the behaviors that lead to the transaction?

And even more interestingly:

How can I better encourage those behaviors and grow my business?


So ask yourself, what’s your Big Data strategy?