Thursday, May 17, 2012

Big Data Refinery Fuels Next-Generation Data Architecture

Since joining Hortonworks at the beginning of the year, I’ve heard one question over and over: “What is Apache Hadoop and what is it used for?”

There’s clearly a lot of hype (and confusion) in this emerging Big Data market, and it feels as if each new technology, as well as each existing one, is pushing the meme of “all your data are belong to us.” It is great to see the wave of innovation occurring across the landscape of SQL, NoSQL, NewSQL, EDW, MPP DBMS, Data Marts, and Apache Hadoop (to name just a few), but enterprises and the market in general could use a healthy dose of clarity on just how to use and interconnect these various technologies in ways that benefit the business.

In my post entitled “7 Key Drivers for the Big Data Market”, I asserted that the Big Data movement is not only about the classic world of transactions, but it factors in the new(er) worlds of interactions and observations. And this new world brings with it a wide range of multi-structured data sources that are forcing a new way of looking at things.

In order to make sense of this emerging space, I’ve created two graphics designed to walk through a vision of a next-generation data architecture. At the highest level, I describe three broad areas of data processing and outline how these areas interconnect.

The three areas are:
  • Business Transactions & Interactions
  • Business Intelligence & Analytics
  • Big Data Refinery
The graphic below illustrates a vision for how these three types of systems can interconnect in ways aimed at deriving maximum value from all forms of data.   

Enterprise IT has been connecting systems via classic ETL processing, as illustrated in Step 1 above, for many years in order to deliver structured and repeatable analysis. In this step, the business determines the questions to ask and IT collects and structures the data needed to answer those questions.

The “Big Data Refinery”, as highlighted in Step 2, is a new system capable of storing, aggregating, and transforming a wide range of multi-structured raw data sources into usable formats that help fuel new insights for the business. The Big Data Refinery provides a cost-effective platform for unlocking the potential value within data and discovering the business questions worth answering with this data. A popular example of big data refining is processing Web logs, clickstreams, social interactions, social feeds, and other user generated data sources into more accurate assessments of customer churn or more effective creation of personalized offers.
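To make the clickstream example concrete, here is a minimal sketch of that kind of refinement, with the log format, field names, and churn heuristic all invented for illustration (this is not any vendor's actual pipeline). Raw, semi-structured log lines go in; a small, usable per-customer signal comes out:

```python
from collections import defaultdict

# Hypothetical raw clickstream: "customer_id<TAB>action" (format invented for illustration)
RAW_LOG = """\
alice\tview_product
alice\tvisit_support
alice\tvisit_support
bob\tview_product
bob\tadd_to_cart
alice\tvisit_support
"""

def churn_signals(raw_log, support_visit_threshold=3):
    """Refine raw log lines into a per-customer churn-risk flag.

    A crude proxy: customers with many support-page visits get flagged
    as churn risks. A real refinery would blend many more signals.
    """
    support_visits = defaultdict(int)
    for line in raw_log.splitlines():
        customer, action = line.split("\t")
        if action == "visit_support":
            support_visits[customer] += 1
    return {c: n >= support_visit_threshold for c, n in support_visits.items()}

print(churn_signals(RAW_LOG))  # → {'alice': True}
```

The point is the shape of the workload, not the heuristic: bulky raw input, cheap storage-side transformation, and a compact refined output the business can act on.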

More interestingly, there are businesses deriving value from processing large video, audio, and image files. Retail stores, for example, are leveraging in-store video feeds to help them better understand how customers navigate the aisles as they find and purchase products. Retailers that provide optimized shopping paths and intelligent product placement within their stores are able to drive more revenue for the business. In this case, while the video files may be big in size, the refined output of the analysis is typically small in size but potentially big in value.

The Big Data Refinery platform provides fertile ground for new types of tools and data processing workloads to emerge in support of rich multi-level data refinement solutions.

With that as backdrop, Step 3 takes the model further by showing how the Big Data Refinery interacts with the systems powering Business Transactions & Interactions and Business Intelligence & Analytics. Interacting in this way opens up the ability for businesses to get a richer and more informed 360 degree view of customers, for example.

By directly integrating the Big Data Refinery with existing Business Intelligence & Analytics solutions that contain much of the transactional information for the business, companies can enhance their ability to more accurately understand the customer behaviors that lead to the transactions.

Moreover, systems focused on Business Transactions & Interactions can also benefit from connecting with the Big Data Refinery. Complex analytics and calculations of key parameters can be performed in the refinery and flow downstream to fuel runtime models powering business applications with the goal of more accurately targeting customers with the best and most relevant offers, for example.

Since the Big Data Refinery is great at retaining large volumes of data for long periods of time, the model is completed with the feedback loops illustrated in Steps 4 and 5. Retaining the past 10 years of historical “Black Friday” retail data, for example, can benefit the business, especially if it’s blended with other data sources such as 10 years of weather data accessed from a 3rd-party data provider. The point here is that the opportunities for creating value from multi-structured data sources available inside and outside the enterprise are virtually endless if you have a platform that can do it cost effectively and at scale.
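As a hedged sketch of that blending step (the figures, dates, and field names below are invented for illustration), joining retained historical sales against a third-party dataset can be as simple as matching on a shared date key:

```python
# Hypothetical refined datasets keyed by date (values invented for illustration)
black_friday_sales = {"2010-11-26": 1.20, "2011-11-25": 1.35}   # revenue, $M
weather_by_date    = {"2010-11-26": "snow", "2011-11-25": "clear"}

def blend(sales, weather):
    """Join two refined datasets on their shared date key."""
    return {
        date: {"revenue_musd": revenue, "weather": weather.get(date, "unknown")}
        for date, revenue in sales.items()
    }

blended = blend(black_friday_sales, weather_by_date)
print(blended["2010-11-26"])  # → {'revenue_musd': 1.2, 'weather': 'snow'}
```

At scale this join would run inside the refinery itself, but the idea is the same: cheap long-term retention makes it practical to keep asking new questions of old data.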

Let me conclude by describing how the various data processing technologies fit within this next-generation data architecture.

In the graphic above, Apache Hadoop acts as the Big Data Refinery. It’s great at storing, aggregating, and transforming multi-structured data into more useful and valuable formats.

Apache Hive is a Hadoop-related component that fits within the Business Intelligence & Analytics category since it is commonly used for querying and analyzing data within Hadoop in a SQL-like manner. Apache Hadoop can also be integrated with other EDW, MPP, and NewSQL components such as Teradata, Aster Data, HP Vertica, IBM Netezza, EMC Greenplum, SAP Hana, Microsoft SQL Server PDW and many others.

Apache HBase is a Hadoop-related NoSQL Key/Value store that is commonly used for building highly responsive next-generation applications. Apache Hadoop can also be integrated with other SQL, NoSQL, and NewSQL technologies such as Oracle, MySQL, PostgreSQL, Microsoft SQL Server, IBM DB2, MongoDB, DynamoDB, MarkLogic, Riak, Redis, Neo4J, Terracotta, GemFire, SQLFire, VoltDB and many others.

Finally, data movement and integration technologies help ensure data flows seamlessly between the systems in the above diagrams; the lines in the graphic are powered by technologies such as WebHDFS, Apache HCatalog, Apache Sqoop, Talend Open Studio for Big Data, Informatica, Pentaho, SnapLogic, Splunk, Attunity and many others.

Key Takeaway
A next-generation data architecture is emerging that connects the classic systems powering Business Transactions & Interactions and Business Intelligence & Analytics with Apache Hadoop, a “Big Data Refinery” capable of storing, aggregating, and transforming multi-structured raw data sources into usable formats that help fuel new insights for the business.

Enterprises that get good at maximizing the value from all of their data (i.e. transactions, interactions, and observations) will put themselves in a position to drive more business, enhance productivity, or discover new and lucrative business opportunities.

Tuesday, May 15, 2012

7 Key Drivers for the Big Data Market

I attended the Goldman Sachs Cloud Conference and participated on a panel focused on “Data: The New Competitive Advantage”. The panel covered a wide range of questions, but kicked off covering two basic questions:

“What is Big Data?” and “What are the drivers behind the Big Data market?”

While most definitions of Big Data focus on the new forms of unstructured data flowing through businesses with new levels of “volume, velocity, variety, and complexity”, I tend to answer the question using a simple equation:

Big Data = Transactions + Interactions + Observations

The following graphic illustrates what I mean:

ERP, SCM, CRM, and transactional Web applications are classic examples of systems processing Transactions. And the highly structured data in these systems is typically stored in SQL databases.

Interactions are about how people and things interact with each other or with your business. Web Logs, User Click Streams, Social Interactions & Feeds, and User-Generated Content are classic places to find Interaction data.

Observational data tends to come from the “Internet of Things”. Sensors for heat, motion, and pressure, and RFID and GPS chips within such things as mobile devices, ATMs, and even aircraft engines provide just some examples of “things” that output Observation data.

With that basic definition of Big Data as background, let’s answer the question:

What are the 7 Key Drivers Behind the Big Data Market?

1. Opportunity to enable innovative new business models
2. Potential for new insights that drive competitive advantage

3. Data collected and stored continues to grow exponentially
4. Data is increasingly everywhere and in many formats
5. Traditional solutions are failing under new requirements

6. Cost of data systems, as a % of IT spend, continues to grow
7. Cost advantages of commodity hardware and open source software

There’s a new generation of data management technologies, such as Apache Hadoop, providing an innovative and cost-effective foundation for the emerging landscape of Big Data processing and analytics solutions. Needless to say, I’m excited to see how this market will mature and grow over the coming years.

Key Takeaway
Being able to dovetail the classic world of Transactions with the new(er) worlds of Interactions and Observations in ways that drive more business, enhance productivity, or discover new and lucrative business opportunities is why Big Data is important.

One promise of Big Data is that companies that get good at collecting, aggregating, refining, analyzing, and maximizing the value derived from Transactions, Interactions, and Observations will put themselves in a position to answer such questions as:

What are the behaviors that lead to the transaction?

And even more interestingly:

How can I better encourage those behaviors and grow my business?

So ask yourself, what’s your Big Data strategy?

Monday, February 6, 2012

Solving the Data Problem in a Big Way

I recently joined Hortonworks as VP of Corporate Strategy, and I wanted to share my thoughts as to what attracted me to Hortonworks.

For me, it’s important to 1) work with a top-notch team and 2) focus on unique market-changing business opportunities.

Hortonworks has a strong team of technical founders (Eric14, Alan, Arun, Devaraj, Mahadev, Owen, Sanjay, and Suresh) doing impressive work within the Apache Hadoop community. Hortonworks also has an impressive Board of Directors that includes folks like Peter Fenton, Mike Volpi, Jay Rossiter, Rob Bearden, as well as our most recent board member Paul Cormier (Red Hat’s President of Products and Technology).

So “top-notch team”? Check!

Regarding “unique market-changing business opportunities”, the top 3 technology areas right now are arguably: Mobility, Cloud, and Big Data. Apache Hadoop is clearly a technology in the Big Data category that is enabling a new approach to data processing (both from a capabilities perspective and an economics perspective).

I’ve spent the last few years in the Cloud space (at SpringSource and VMware), and I met with many customers who loved VMware’s Cloud Application Platform vision. One of the common questions that came up, however, was:

“What are you going to do about the data problem?”

Traditional application architectures focus on moving structured data from backend datastores to the applications that need the data. Elastic Caching Platforms such as VMware’s vFabric GemFire help with scalability and latency issues for these types of applications.

Rather than move data to applications, Hadoop provides a platform that cost effectively stores petabytes of data and enables application logic to execute directly on that data in a massively parallel manner.
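That “move the logic to the data” model is easiest to see in MapReduce terms. Here is a minimal in-process sketch of the idea using the classic word-count example; on a real Hadoop cluster the map and reduce phases would run in parallel on the nodes that already hold the data blocks, and the shuffle would happen across the network:

```python
from collections import defaultdict
from itertools import chain

def map_phase(record):
    """Map: emit (word, 1) for each word in one input record."""
    return [(word, 1) for word in record.split()]

def reduce_phase(word, counts):
    """Reduce: sum the counts emitted for one word."""
    return word, sum(counts)

def mapreduce(records):
    """Simulate map -> shuffle -> reduce in a single process."""
    # Shuffle: group mapped pairs by key (Hadoop does this across the cluster)
    groups = defaultdict(list)
    for key, value in chain.from_iterable(map_phase(r) for r in records):
        groups[key].append(value)
    return dict(reduce_phase(k, v) for k, v in groups.items())

print(mapreduce(["big data big refinery", "big value"]))
# → {'big': 3, 'data': 1, 'refinery': 1, 'value': 1}
```

The application logic is just the two small map and reduce functions; the platform takes care of spreading them across petabytes of data.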

I believe Hadoop provides a very compelling solution to the “data problem” since it’s explicitly designed to deal with the volume, velocity, variety and exponential scale of unstructured and semi-structured data that businesses increasingly need to deal with. Moreover, Hadoop does this within an economic model (a la commodity servers and storage) that makes the platform useful for a wide range of problems.

While 2011 was the year where a critical mass of enterprise customers and vendors began to realize the size and scope of the opportunity and value behind this Apache Hadoop phenomenon, the wave is just getting started, and I’m excited to be a part of the fun!

Monday, January 23, 2012

Food Zombies - A Tasty Top-Down Shooter Game

Over the past few years, I've written various blog posts covering my son's interest in programming.

Billy is a high school junior now, and his efforts over the past 4 years have spanned a wide range: creating interactive virtual playworlds and sharing Lua scripting code in ROBLOX; writing cool programs for the TI-84 calculator; building Block Dude Evolved for iOS devices (a port of the classic TI-84 Block Dude game); and creating Prom Checklist, an app for high school girls to keep track of all their prom-related details on their iOS devices, as well as Prom Checklist West, a branded version for Cherry Hill High School West whose proceeds go to the school's "project graduation".

His most recent effort brings him back to his area of passion...that is game development for iOS devices (iPod Touch, iPhone, etc.).

Food Zombies is his craziest, most entertaining game to date.

While Food Zombies is a classic top-down zombie shooter, it's also good clean (and unique) fun since the zombies are fast food (in the form of pizza, fries, donuts, pies, etc.):

The power-ups are, naturally, good-for-you fruits:

And since this a shooter game, it offers many different weapons to satisfy even the pickiest food-slaying cravings:

I highly recommend the flame thrower. :-)

Anyhow, if you have an Apple iPod Touch or iPhone and you can't get enough of top-down Zombie shooter games, then I recommend you satisfy your cravings by giving Food Zombies a try.

Oh yeah, one final note for those Block Dude Evolved fans out there. Billy has just finished a major 2.0 version of the game that adds in a bunch of new features. The game is working its way through Apple approvals, so stay tuned!

Saturday, July 24, 2010

Block Dude Evolved

My son's first foray into programming started a few years ago playing ROBLOX, a virtual playworld where kids can create and customize the look and behavior of their own online worlds. I wrote a couple of posts, including "Professor ROBLOX: Class In Session", covering how ROBLOX is actually shaping the lives of future programmers, since kids use the Lua scripting language to customize the behavior of their worlds.

My son moved on past Lua and taught himself TI Assembly Programming and Visual C++. His goal: create games for the TI-84 graphing calculator so he and his middle-school friends could play games rather than pay attention in math class. :-)

With the rise of the Apple iPod Touch and iPhone, he has launched headfirst into Objective-C and delivered "Block Dude Evolved", a recreation of the all-time classic TI calculator game called "Block Dude".

Block Dude Evolved is a puzzle game. The goal is to move your little man across obstacles and out the exit door on the level. The challenge is that you need to pick up and move blocks to help you climb over obstacles that are between you and the door. You can only step up one block at a time, so if you are facing a wall two blocks high, then you need to grab a movable block and plop it down so you can climb up. The first level is pretty simple, but the levels increase in difficulty after that.

The controls of Block Dude Evolved are pretty simple. To move the little man left or right, you just tap those sides of the screen. To climb up a block, just tap the upper portion of the screen. If you are standing next to a block that can be picked up, just tap the block and you will lift it above your head. Then you simply move to where you want to be and tap the spot where you want to drop the block. If you want to exit out of the game, just tap two fingers at the same time.

Block Dude Evolved has a Settings dialog that enables you to customize the look:

For example, you can choose the Future look:

Or the Revamped look:

Anyhow, if you have an Apple iPod Touch or iPhone and yearn for the days of classic brain-puzzle games, then I recommend you give Block Dude Evolved a try.

Tuesday, May 11, 2010

Job Trends: Spring, WebSphere, WebLogic - what a difference a year makes!

Last year I wrote "Job Trends: Tomcat, Spring, Weblogic, JBoss, EJB" where I discussed the trend towards "Lean Software" and the role that Spring plays in this important movement.

A lot has happened over the past year. CIOs have identified Virtualization and Cloud computing as their top two strategic technologies for 2010. Lean Software has become even more of a Business Technology Imperative than it was a year ago. And the job market over the past year has been challenging at best.

With that as a backdrop, let's see what the job market looks like for Spring Java developer skills versus the other industry heavyweights.

The chart nicely illustrates that Spring Java developer skills (green line) have been on an inexorable path upwards for the past 5 years. WebSphere Java developer skills (blue line) are next and have been on a downward path for the past year and a half. WebLogic Java developer skills (orange line) round out the chart and have been relatively flat over the past few years.

Companies continue to value lightweight application infrastructure skills (i.e. Spring) since this provides them a way to create applications more quickly and therefore be more competitive. More speculatively, I believe that Virtualization and Cloud computing initiatives are accelerating this trend since these initiatives are forcing enterprises to take a hard look at how they are building and deploying applications...and to take measures (and hire talent) that dramatically simplify the process.

Since I work at the SpringSource division of VMware, I have a keen interest in the health and vibrancy of the Spring community. I'm happy to see that even in a tough job market, the demand for Spring Java developer skills continues to grow.

Credits: The chart above was generated with a job-trends search service that indexes millions of jobs from thousands of job sites and lets you see job trends for whatever search criteria you may have. My search criteria: Java Developers with Spring, WebSphere, or WebLogic skills.

Saturday, September 12, 2009


Things that make you go hmmmmm... I never realized that, when looked at in the mirror, 3.14 can be both mathematical and delicious:

3.14 = PIE