Hadoop Distributions – do we care which one to use?

It’s almost 9 months since Hortonworks went public, and Cloudera continues to provide some insight into its growth. Add in MapR, IBM, Pivotal (EMC), Amazon’s Hadoop offerings and straight Apache Hadoop, and we can get a pretty clear picture of how each distribution is doing. Of course this is Open Source (mostly), so each Distribution Vendor’s work is contributed back into Apache as well as making it into the vendor’s own version.

If we add up the numbers, Hortonworks expects revenue of something like $110m to $115m (http://www.nasdaq.com/press-release/hortonworks-reports-financial-results-for-second-quarter-2015-20150805-01151). Cloudera will probably top $200m (http://venturebeat.com/2015/05/04/cloudera-could-be-worth-almost-5b-with-199m-in-revenue-this-year-analysts-say/). Take these two dominant players, add in some guesses about IBM and Amazon, and we are probably looking at a software subscription market (i.e. no services, no hardware) of less than $1bn. Enterprise customers love to know this type of data so they can be sure to back the ‘winner’. My point is that in the world of open source and big data there can be no ‘winner’.
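For readers who want to see the arithmetic, here is a minimal sketch using only the figures quoted above. It treats Cloudera’s "probably top $200m" as a floor and $1bn as the market ceiling; both are simplifying assumptions, and the variable names are purely illustrative.

```python
# All figures in USD millions, taken from the estimates quoted above.
hortonworks_low, hortonworks_high = 110, 115
cloudera_floor = 200      # assumption: "will probably top $200m" read as a floor
market_ceiling = 1000     # assumption: "less than $1bn" read as an upper bound

# Combined revenue of the two dominant players.
combined_low = hortonworks_low + cloudera_floor
combined_high = hortonworks_high + cloudera_floor

# Whatever is left under the ceiling is the most everyone else could account for.
others_max = market_ceiling - combined_low
others_min = market_ceiling - combined_high

print(f"Hortonworks + Cloudera: ${combined_low}m to ${combined_high}m")
print(f"Room left for all other vendors: under ${others_min}m to ${others_max}m")
```

On these assumptions the two leaders account for roughly a third of the ceiling, which is why the guesses about IBM and Amazon matter so much to the total.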

When we talk to clients and they ask us ‘Which Distribution do you recommend?’, we give one of two answers:

If they have nothing deployed or (more typically) have failed in-house deployments, we always recommend Cloudera. This is because it has the most enterprise-ready ecosystem, with security and governance capabilities, and is easy to scale quickly.
If they already have something deployed (usually Hortonworks, but sometimes Pivotal or Amazon), we tell them to keep using it, and we help them achieve a quicker time to value and/or get a return on their existing investment.

We do this because, in my view, the Distribution that customers choose is a really insignificant part of the success or failure of their projects. Want to use Cloudera? Go right ahead. Prefer to use Hortonworks because it’s cheaper and maybe you have bought into the whole Open Data Platform idea (http://www.cmswire.com/cms/big-data/hey-cloudera-mapr-open-data-platform-is-the-real-deal-028787.php)? Sounds good too.

In our experience working with clients in Telco, Retail, Finance, Government and other sectors, there are 5 real factors to consider when implementing a Big Data project or trying to rescue a failed one:

Are you buying the right hardware? I have covered in previous posts how the Hardware landscape for Big Data is changing. With the ecosystem changing so rapidly, make sure you are buying hardware that will meet both current and future needs.
Is your Big Data strategy driven by the business side of the company? Big Data projects driven by IT get stuck in all sorts of pointless discussions about, for example, which Distribution to use!
Are your use cases clearly defined?
Does the partner you are working with have real, referenced projects and customers? I can’t tell you how many times we have worked with clients with failed projects who, in reality, had been paying their consulting provider for on-the-job training in Big Data without knowing it.
Are you ready to scale? Once a Big Data infrastructure is in place, the business usually demands rapid adoption of new use cases.

As you can see, with these 5 factors, choice of the Hadoop distribution can be an afterthought. So sure, we can all get excited about ODP, or Vora from SAP (http://fortune.com/2015/09/01/sap-to-bridge-big-data-gap/), or many other technology issues, but in our experience the most important factors are those above, and the technology is secondary.

What do you think? Are Hadoop Distributions not that important? And what about ODP: hype or important?

If you have additional questions, get in touch with us!
