Isilon and a question of Big Data
Or was that ingestion?
Interview Xiotech technology VP Rob Peglar has moved to Isilon, now an EMC business, to become chief technology officer (CTO) for the Americas.
We interviewed Rob and asked him questions that reveal quite a lot about Isilon's prospects, big data, the role of flash in scale-out filers, deduplication and Isilon, and what we should think about archiving data from Isilon clusters.
El Reg Why did you join Isilon?
Rob Peglar: Primarily, for a personal reason - to take the CTO Americas role. Secondarily, significant parts of the industry are moving towards greater use of file-based storage and the resultant use (gathering, analysis, reduction) of data stored in files. Isilon is an innovator and leader in that space and I joined to help end users realize new capabilities in their use of file data as well as be a key participant in the next generation(s) of file-based storage architectures.
El Reg What does the CTO Americas do that's different from the overall CTO?
Rob Peglar: The CTO Americas role is an allied position to the corporate CTO (Paul Rutherford). Isilon has a thrice-distributed CTO function in world geographies; Americas (basically, the Western Hemisphere), EAME and Asia-Pacific (AP). These roles have an outward (i.e. towards end users and channels) function as well as an inward (i.e. towards products, roadmap, strategy, engineering, etc.) function. In my role, I will be facing customers and channels to give them a thorough understanding of not only what Isilon does, how and why we do it, and so on, but also higher-level industry trends, techniques, technologies, and executive-level briefings on the strategic implications of file data to businesses and organizations.
El Reg Is big data in general different from big data in the HPC world and, if so, how?
Rob Peglar: In general, it is. While there are some similarities – both being unstructured data, for example – there are typically differences between big data in the commercial/business world and big data in the traditional HPC/supercomputing world. I am fortunate to have experience in both worlds, dating back to 1978 on the traditional HPC side. HPC typically involves the analysis of very large but ‘fixed’ sets of data, i.e. a dataset describing an initial condition. That data is then ingested and subjected to an iterative process, typically a very large job which simulates and analyzes the forward-in-time progress of the computation, performing a certain computational model based on the initial condition.
During the job, large intermediate files are produced to save the job’s state and its data at a given time step. This process is often referred to as ‘checkpointing’. Checkpoints are taken because HPC jobs may run for weeks at a time; restarting a job from its initial condition is to be avoided, for all the obvious reasons. The end result of the HPC job may actually be very little data; just a set of results or a visualisation, computed over a given time interval. Or, the net result may be another very large dataset which would then in turn undergo yet another set of analysis, perhaps by a different job.
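To illustrate the checkpointing idea described here, the following is a minimal Python sketch, not anything taken from a real HPC code. The function name `run_simulation`, the JSON checkpoint format and the toy update rule are all assumptions made for the example; a real job would save far larger state, typically to a parallel or scale-out file system.

```python
import json
import os
import tempfile


def run_simulation(total_steps, checkpoint_every, checkpoint_path):
    """Advance a toy simulation, saving state periodically so it can resume."""
    # Resume from the last checkpoint if one exists; otherwise start from
    # the initial condition.
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)
    else:
        state = {"step": 0, "value": 1.0}

    while state["step"] < total_steps:
        # Stand-in for one time step of a real computational model.
        state["value"] = state["value"] * 1.01 + 0.5
        state["step"] += 1

        if state["step"] % checkpoint_every == 0:
            # Write to a temporary file, then rename, so a crash mid-write
            # never leaves a truncated checkpoint behind.
            tmp_path = checkpoint_path + ".tmp"
            with open(tmp_path, "w") as f:
                json.dump(state, f)
            os.replace(tmp_path, checkpoint_path)
    return state


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    path = os.path.join(workdir, "checkpoint.json")
    run_simulation(30, 10, path)          # job stops (or fails) after step 30
    final = run_simulation(50, 10, path)  # restart picks up from step 30
    print(final["step"], final["value"])
```

The second call does not recompute the first 30 steps; it reloads the saved state and continues, which is the property that makes multi-week jobs restartable.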
Contrast this with commercial/business ‘big data’ as being generated and stored by what I call ‘constantly running’ applications, e.g. web hits, cookie-based widgets, error logs, transaction logs, streaming apps, and the like. This kind of data, while unstructured like its HPC cousins, is constantly changing and being appended to by the outside world.
Data analysis jobs in this world typically take a ‘chunk’ of this big data and attempt to reduce it for specific analysis, pattern matching, searching, and/or general data mining, seeking to understand the data itself for a business purpose. The key to this kind of big data is that it’s constantly evolving, whereas data in the HPC world typically doesn’t. Both types of big data, however, require large, reliable and – the seminal characteristic by far – scalable storage.
El Reg How is Isilon's scale-out NAS product better than competing products from HP (IBRIX), IBM (SONAS), BlueArc, and Dell (Exanet and DSFS)?
Rob Peglar: There's much more to your question than meets the eye. Your piece of 11 April had a nice overview, though.
El Reg What role does flash have to play in scale-out filers?
Rob Peglar: A very interesting one. Flash, or fast non-volatile memory in general, has an interesting role in scale-out. Most of its impact is currently around holding metadata, and it’s quite useful for that. Isilon in particular can use flash nicely in that the backend communication path is already very fast and scalable, using InfiniBand. Internodal messages traverse very quickly and efficiently. Combining this with holding node-based metadata in flash, and ensuring all nodes are in sync via InfiniBand, is a solid architectural solution.
Doing metadata synchronisation using rotating disk is less efficient because of the inherent latency involved, and the interposition of write cache. However, using flash devices to quickly get metadata onto stable storage is highly efficient. At scale, this becomes an overriding concern. One can easily sync two nodes’ metadata using HDD, for example, but that is not scale-out, and adding an extra layer of file system overhead (e.g. an aggregation layer) on top of legacy file systems to simulate scale-out is highly inefficient. Scale-out starts at three nodes and goes to N. The current challenge is to increase N without adding quadratic latency, and flash helps greatly here.
A secondary role for flash in scale-out is to hold data itself which is quite read-intense, in particular big data after it has been mapped (the map stage of map/reduce) and now is undergoing processing, again mostly reads. There is much research and development going on currently in this area, especially as flash devices become denser and less expensive to procure.
El Reg Do scale-out filers need an integrated archiving/backup back-end system to store cold data, perhaps in a deduplicated form?
Rob Peglar: In general, the answer is yes. Cold data is only one use case; the other is more strategic, i.e. the preservation of important/critical data, albeit infrequently used (‘cold’). Data of high importance must be archived not only for protection’s sake but also for legal and/or security concerns. Thus, any system, scale-out or not, must be so protected. Scale-out has a very important role to play here because it can serve as both primary and secondary repository – i.e. archive to scale-out. Archiving in particular lends itself to scale-out approaches by its very nature – typically always adding data to a permanent archive.
Archive is also typically the ‘repository of last resort’, so protection is paramount. This is another reason why scale-out is a superior approach; it adds not only disk protection but node protection as well, thus isolating the archive at large from any set of individual failures. Isilon in particular has developed an M+N approach to scale-out, thus minimizing the probability of data loss not only due to drive (media) failure but also node failure (e.g. power outage, cable pulls, human error, etc.).
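As a rough illustration of protection spread across nodes, the sketch below shows only the simplest case: several data blocks plus a single XOR parity block, each imagined as living on a different node. This is an assumption-laden toy, not Isilon's implementation, which the interview names only as an "M+N approach"; tolerating more than one simultaneous failure generally requires a stronger erasure code. All function names are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def protect(data_blocks):
    """Return a stripe: the data blocks followed by one parity block."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def recover(stripe, lost_index):
    """Rebuild the block at lost_index from the surviving blocks."""
    survivors = [b for i, b in enumerate(stripe) if i != lost_index]
    return xor_blocks(survivors)


def capacity_overhead(data_units, protection_units):
    """Fraction of raw capacity spent on protection rather than data."""
    return protection_units / (data_units + protection_units)


if __name__ == "__main__":
    stripe = protect([b"node", b"fail", b"safe"])
    # Pretend the node holding block 1 has gone away and rebuild its block.
    print(recover(stripe, 1))
    print(capacity_overhead(3, 1))
```

Because the XOR of every block in the stripe is zero, any one missing block, data or parity, equals the XOR of the rest, so the loss of a single node does not lose data.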
This is a superior approach to tape archive, for example, because the failure of a given tape library means the cartridges contained therein – the media of last resort – are inaccessible and must be physically removed and transported to another library of similar characteristics. This is not scale-out. Scale-out archives imply one copy of the archival data, and protection via architecture is paramount.
El Reg Would scale-out filers benefit from having deduplicated files, assuming that did not reduce performance?
Rob Peglar: In general, the answer is no. First, the assumption is incorrect; any data reduction technique of the three known (compression, deduplication, incrementalization) has a performance (time) implication. In addition, deduplication also has a space implication; the tradeoff is metadata versus data. At large scale, deduplication metadata becomes very significant. For example, holding hashes (CRCs) of each 4KB of data – a very common granularity – implies 4 trillion items of metadata for a data repository of small size, 4PB.
If each hash structure (CRC & disk pointer, i.e. given a hash, where is its data?) is only 64 bits, or 8 bytes, which is quite small, this means 32TB of hash metadata which must be completely consistent across all nodes at all times. One must not only store that 32TB of data in stable and protected storage, but that storage must also be very fast, nearly as fast as the CPU’s ability to ingest it. It’s cost-prohibitive to have each node with 32TB of RAM just to hold hashes. Plus, even if you did have 32TB of RAM, it also means the CPUs in each node having to read 16TB worth of metadata (in the worst case) for each and every write access to a file, no matter how small – to perform the dedup hash check - and that searching alone is non-trivial, taking significant time.
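The size of such a hash index can be worked through in a few lines of Python. The sketch below assumes decimal units (1KB = 10^3 bytes, 1PB = 10^15 bytes) and the 8-byte entry mentioned above; the function name is hypothetical. Under those assumptions, 4PB at 4KB granularity comes to about 1 trillion entries and 8TB of index, while the quoted 4 trillion entries and 32TB correspond to a 16PB repository at the same granularity. Either way the index runs to many terabytes, so the order-of-magnitude point stands.

```python
KB = 10**3
TB = 10**12
PB = 10**15


def dedup_index_size(repository_bytes, chunk_bytes, entry_bytes):
    """Return (number of hash entries, total index size in bytes)."""
    entries = repository_bytes // chunk_bytes
    return entries, entries * entry_bytes


if __name__ == "__main__":
    for repo in (4 * PB, 16 * PB):
        entries, index_bytes = dedup_index_size(repo, 4 * KB, 8)
        print(f"{repo // PB}PB repository: {entries:,} entries, "
              f"{index_bytes / TB:.0f}TB of index")
```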
The basic problem is that data is growing faster than the CPU’s ability to efficiently process the metadata for it. This is why at scale, deduplication is a non-optimal technique. It may save some ‘end’ space, but consider ‘big data’ as discussed before. This data is often highly unique and rarely can be deduplicated. For example, web hits and traffic from end users. Each end user is unique – by definition – and must be able to be identified as such to analytic software. Each hit is at a different point in time, always changing, always incrementing.
Constant streams of new data being ingested are therefore rarely duplicated across the user universe. So, for ‘big data’, deduplication is most often a bad trade-off – even if the CPU were infinitely fast, you wouldn’t save much space on disk. Contrast this with traditional VM or VDI usage, where OS images are immutable and mostly read-only; here, deduplication is a good trade-off. But that is not the problem the industry is trying to solve now. The problem is ingestion, analysis and long-term storage and protection of massive (and growing!) amounts of unique and ever-changing data.
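The contrast between unique streams and cloned images can be sketched with a toy fixed-size-chunk deduplicator. Everything here is an assumption for illustration: the 4KB chunk size, the use of SHA-256 as the chunk hash, and the synthetic "log" and "image" data. It is not a model of any vendor's deduplication engine.

```python
import hashlib


def dedup_ratio(streams, chunk_size=4096):
    """Return total chunks divided by unique chunks across all streams."""
    total = 0
    unique = set()
    for data in streams:
        for offset in range(0, len(data), chunk_size):
            chunk = data[offset:offset + chunk_size]
            unique.add(hashlib.sha256(chunk).digest())
            total += 1
    return total / len(unique) if unique else 1.0


def make_log_stream(user_id, hits=1000):
    """Synthetic web log: every line carries a user ID and a timestamp."""
    return "".join(f"user={user_id} ts={t}\n" for t in range(hits)).encode()


def make_base_image(blocks=512):
    """Synthetic 'OS image': pseudo-random but reproducible content."""
    return b"".join(hashlib.sha256(str(i).encode()).digest()
                    for i in range(blocks))


if __name__ == "__main__":
    logs = [make_log_stream(u) for u in range(10)]
    clones = [make_base_image()] * 10
    print("log streams:", dedup_ratio(logs))
    print("cloned images:", dedup_ratio(clones))
```

In this toy setup the ten cloned images collapse to a single copy's worth of chunks, while the per-user log streams share no chunks at all, which mirrors the trade-off described above.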
El Reg Could a company have a single logical file store with big data being a sectioned off part of that, rather than having a physically separate big data silo?
Rob Peglar: Most certainly. The ‘sectioning’ is trivial – it could be as simple as a single directory, under which all ‘big data’ is held. Isilon has a huge advantage here, in that the entire filesystem is coherent and sits under a single name – ‘ifs’. One could easily have /ifs/bigdata for all analytic data, and then /ifs/users for home directories and such, /ifs/exchange for mail, and so on. Each directory has its own attributes (metadata) regarding protection levels, tiering/residence, movement, QoS, replication, snaps, and so on.
One realizes the advantages of having multiple file systems for different purposes without the management nightmare of having to administer hundreds or thousands of different filesystems under different mount points, held on different nodes, and so on. At scale, there is a clear advantage to single namespace and single filesystem.
El Reg It was surprising to find out that deduplication was not a useful technology for big data. The dismissal of tape as the best big data archive media was also interesting to hear. We wonder if big data system vendors such as IBM and Oracle, with tape libraries in their product portfolio, will have the same view.
Also, flash is set to play an increasingly important role in big data storage, as it will in enterprise storage generally. Lots of grist here for Rob Peglar's Isilon mill to grind out for customers as he undertakes the CTO Americas role. ®