nav search
Data Center Software Security Transformation DevOps Business Personal Tech Science Emergent Tech Bootnotes BOFH

Yahoo!'s big, fat clustered Google Machine Learning wedding

Yahoo! open-source code fling builds a better Google, again

By Gavin Clarke, 23 Mar 2017

Analysis Yahoo! last month married clustered compute to Google’s machine learning.

The firm’s engineers released TensorFlowOnSpark (TFoS), getting the Google Brain Team’s machine-learning framework up and running on Spark and Hadoop clusters.

Spark is the open-source cluster framework overseen by Apache and employed by Yahoo!, Netflix and others processing petabytes of data across thousands of nodes.

TFoS code is available on GitHub under an Apache licence and for use on Amazon’s EC2.

The idea of TFoS is deep learning on massively clustered systems – and all the benefits of processing and storage that entails – only in a Google-free setting and using an architecture that’s “easy” to build and that also delivers fast throughput. Not something that in building gets the job done, but with tradeoffs in the plumbing, like complexity or bottlenecks.

They’ve also given Google’s TensorFlow a speed boost: an Infiniband-friendly protocol, freeing it from a rather restrictive marriage to Ethernet.

Using TFoS you can run TensorFlow free of Google’s cloud and have it share servers already running other big-data apps and processes rather than dedicated clusters.

It follows last year’s release of CaffeOnSpark from Yahoo! – Caffe being a deep learning framework.

There already exist, of course, at least two projects aimed at getting Google’s ML framework running on Spark – SparkNet and TensorFrame.

According to Yahoo!, though, they have problems – all related to hardwiring TensorFlow to Spark, a fact that makes installations complicated to set up and difficult to repeat. Also, TensorFlow processes cannot communicate directly with each other, which leads to delays and latency.

Yahoo!’s answer is to use TFoS direct tensor communications among TensorFlow processes and therefore not to rely on Spark drivers.

Yahoo!’s framework reads directly from HDFS files using file readers and QueueRunners while Spark RDD data is fed into a TensorFlow graph using the feed_dict mechanism.

The search firm’s engineers have upgraded the TensorFlow plumbing for speed, using InfiniBand in-memory access.

TensorFlow runs over Ethernet officially, but Yahoo!’s own Hadoop clusters employed Infiniband along with Ethernet. Yahoo!’s code is now on GitHub.

Yahoo! has also released grpc protocol so that any Remote Direct Memory Access (RDMA) can go via InfiniBand. An RDMA rendezvous manager will now ensure tensors are written into remote servers’ memory with the time taken to reduce the creation of tensor buffers.

Will this have any impact? Amazon Web Services last year moved to embrace MXNet as its chosen machine-learning framework. AWS has the scale of its cloud to make MXNet matter. Google, of course, has TensorFlow itself, which it open-sourced but which it will use as a means to cover itself in data and compute from those using the service. Facebook, meanwhile, is working on its own machine-learning models via its Applied Machine Learning to process billions of data points in posts.


If you think of Yahoo! at all these days, you’ll probably do so for the wrong reasons: massive hacks, gargantuan losses of personal data, the company’s failure as an early pioneer to see Google coming. In the near future, this one-time pioneer in internet search and ads will see its head added to the collection owned by faceless US telco Verizon, which includes that of another one-time internet pioneer and business giant – America Online.

While all true, this simple analysis would misrepresent and overlook the impact of Yahoo!’s engineering achievements. Over the decades Yahoo! has contributed substantially to the greater good, publishing its own code as open source.

Arguably Yahoo!’s greatest legacy once it is a division of Verizon will be big data, after one of its engineers – Doug Cutting – wrote an open-source implementation of Google’s MapReduce that became Hadoop. What followed was an entire ecosystem of startups and projects crunching data at scale – Cloudera, Hortonworks, MapR to name three in a market some calculate will be worth $50bn by 2020.

Admittedly, it has been a slow start, with Hadoop held back by lack of skills and missing features that would let otherwise “ordinary” users install and run Hadoop. This has slightly helped firms like Cloudera and Hortonworks sell services. Also, Hadoop is no simple software purchase: Hadoop is a “platform sell”, meaning if you buy in, you are committing the architectural direction of your organisation to Hadoop, too – data, tools, development. This, on the other hand, has not helped firms like Cloudera and Hortonworks.

That said, Hadoop users are no longer restricted to the Silicon Valley elite. It’s now employed by high-street names like Barclays and M&S in the UK.

Yahoo! used Hadoop for years on its ads and search services.

If past performance is an indicator of potential, TFoS is a potentially grand parting gesture from a firm that has achieved greater success with its code than with its business. ®

The Register - Independent news and views for the tech community. Part of Situation Publishing