
This is how you count all the trees on Earth

NASA project combines algorithms and AWS

By Drew Cullen, 9 Dec 2015

The Earth has three trillion trees and this is seven times more than was thought, until very recently, to exist. It is also half the number that existed before humanity went to work on the environment.

We can deduce this through algorithms used to crunch reams of photographic data captured by satellite. But the headline figure is broad-brush and the results need more detail. Also, the methodology for processing the satellite imagery is time-consuming and expensive.

Climate scientists seek to better understand the impact of deforestation. What is the actual biomass of the world’s trees? How tall are they? How broad? The biggest question: how much carbon is stored in those trees?

These questions are altogether easier to ask than to answer. Scientists may have supercomputers at their disposal, but “high performance computing systems are optimised for large-scale simulations and not for data analysis,” Dr Dan Duffy, head of the NASA Center for Climate Simulation (NCCS), wrote recently.

So, how do we improve the methodology for counting trees and for measuring their height and trunk diameter?

Two American scientists – Dr. Jim Tucker, of NASA Goddard Space Flight Center in Maryland, and Dr. Paul Morin, who works at the University of Minnesota – came up with a cunning plan. They proposed a tree-counting project that combined better algorithms with cheaper and quicker data processing. The latter would utilise a mixture of NASA Goddard’s private cloud and “bursting” to the public cloud, by way of AWS.

Counting trees

The idea combined climate science with computer science and Intel liked it so much that it awarded a grant to the project through its Head in the Clouds initiative. And the scientists liked the name so much that in turn they dubbed the project "Head in the Clouds".

Funding in place, they plumped for sub-Saharan Africa for their test run because trees are so sparse there and difficult to detect with traditional Earth observing systems. Also, there were workers available on the ground to double-check the results by manually counting trees.

There are 11 UTM (or, more formally, Universal Transverse Mercator) zones in sub-Saharan Africa. For the test run, the project leaders selected one area in one zone: central Niger, in UTM 32.

Head in the Clouds had at its disposal images from a constellation of four commercial satellites; for the trial the project leaders – Tucker, Morin and Dr Tsengdar Lee, NASA high end computing program manager – concentrated on data captured by just one, called “Quickbird”.

The data is organised in 112 (16 x 7) tiles of 100km x 100km per UTM zone. Each tile has 16 (4 x 4) sub-tiles of 25km x 25km. The sub-tiles are then divided into 625 (25 x 25) chunks of 1km x 1km.
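The tiling arithmetic can be checked with a short sketch. It works only from the tile, sub-tile and chunk dimensions stated above; the `locate` helper and its origin at the corner of the zone grid are hypothetical, added purely for illustration, and are not NASA's indexing scheme.

```python
# Sizes taken from the article; everything else here is illustrative.
TILE_KM = 100        # each tile is 100km x 100km
SUBTILE_KM = 25      # each sub-tile is 25km x 25km
CHUNK_KM = 1         # each chunk is 1km x 1km
TILES_PER_ZONE = 16 * 7

subtiles_per_tile = (TILE_KM // SUBTILE_KM) ** 2      # 4 x 4
chunks_per_subtile = (SUBTILE_KM // CHUNK_KM) ** 2    # 25 x 25
chunks_per_zone = TILES_PER_ZONE * subtiles_per_tile * chunks_per_subtile


def locate(x_km, y_km):
    """Map a position (km from an assumed grid origin) to grid indices.

    Returns (tile, sub_tile, chunk), each as a (column, row) pair.
    Hypothetical helper: the real indexing scheme is not described.
    """
    tile = (int(x_km // TILE_KM), int(y_km // TILE_KM))
    sub_tile = (int(x_km % TILE_KM // SUBTILE_KM),
                int(y_km % TILE_KM // SUBTILE_KM))
    chunk = (int(x_km % SUBTILE_KM // CHUNK_KM),
             int(y_km % SUBTILE_KM // CHUNK_KM))
    return tile, sub_tile, chunk


print(subtiles_per_tile, chunks_per_subtile, chunks_per_zone)
print(locate(130, 260))
```

On these figures a single zone breaks down into more than a million 1km x 1km chunks, which gives a sense of how many independent units of work there are.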

Of course trees cross those boundaries. So the initial work for NASA’s computer scientists was to make a mosaic of those images to create a single seam and then make this “stand up” before deploying the counting algorithm. This process is called orthorectified mosaicing.

The next step was to process the results. The data presents itself as a classically embarrassingly parallel problem: each chunk can be analysed without reference to any other. As such it is suitable for parallel processing on general purpose server clusters – or processing via cloud services.
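A minimal sketch of what "embarrassingly parallel" means in practice: every chunk is handed to a worker on its own and the per-chunk counts are simply added up at the end. The `count_trees_in_chunk` function is a hypothetical stand-in (it just sums marked pixels) because the article does not describe the real detection algorithm, and the tiny grids are made-up sample data.

```python
from concurrent.futures import ThreadPoolExecutor


def count_trees_in_chunk(chunk):
    """Hypothetical stand-in for the real algorithm: count pixels marked 1."""
    return sum(cell for row in chunk for cell in row)


def count_all(chunks, executor):
    """Process every chunk independently, then combine the results.

    No chunk needs data from any other chunk, so the work can be
    spread over as many workers as are available.
    """
    return sum(executor.map(count_trees_in_chunk, chunks))


# Made-up sample data: three tiny "chunks" of 0/1 pixels.
chunks = [
    [[0, 1], [1, 0]],
    [[1, 1], [0, 0]],
    [[0, 0], [0, 1]],
]

with ThreadPoolExecutor(max_workers=3) as pool:
    total = count_all(chunks, pool)

print(total)
```

Threads keep the sketch short. For genuinely CPU-bound work the same `count_all` could be given a process pool, or the chunks could be farmed out to separate machines, which is in effect what bursting to cloud instances does.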

Hoot Thompson takes up the story. He has a very long job title – Advanced Technology Lead, NASA Center for Climate Simulation, NASA Goddard Space Flight Center – and, among other things, he is responsible for NASA Goddard’s very own "private cloud". “It’s more of a managed VM environment – we don’t have OpenStack,” he told us.

The cloud consists of about 300 servers, all hand-me-downs from the supercomputer facility at NASA Goddard. “As the organisation decommissions hardware I get it – not Haswells, I get Westmeres, for instance. But they make a very nice hypervisor,” said Hoot.

Atop the hardware NASA runs its own platform-as-a-service for specific science projects. This is called ADAPT – or Advanced Data Analytics Platform. The hardware architecture delivers near HPC levels of performance, with great emphasis placed on high speeds and low latency in node to node communication, according to Hoot. As with the public cloud, storage is extensible.

Cloud bursting

But NASA scientists use those 400 hypervisors for very many projects - and Hoot estimated that it would take about 10 months to analyse all the data using NASA systems exclusively. Hence the requirement to burst out into the commercial cloud for Head in the Clouds. The goal was to finish the first iteration of the analysis in one month.

“I have been investigating cloud bursting for years,” Hoot said. “But I didn’t have to invent a project. We have a project that needs it.”

Hoot was cautious at first about network bandwidth and so opted for AWS East, the closest Amazon data centre to NASA Goddard. He has now transferred processing to AWS in Oregon, citing the facility’s zero carbon footprint. “We took the view that we shouldn’t influence the climate while we are studying it,” he said.

For the burst itself, the mosaiced data was pre-staged in AWS and processed on 200 instances bought at AWS spot prices. All the jobs ran successfully over five to six hours and none was interrupted, or "preempted", by other users outbidding NASA for the resources.

Each job consumed about 4.3GB peak of memory using a single core and then all the results were pushed to S3, using Cycle Computing’s DataMan software.

Spot pricing

Cycle Computing “helps us with spot pricing and helps keep the pricing down. It reduces costs by 50-90 per cent from on-demand instances,” said Hoot. The startup also helps Head in the Clouds to manage the data. “Very little comes back,” he said. This is a cool feature as AWS charges much more for returning than for receiving data.

The upshot? The test run cost a measly $80, which means that NASA can process data collected for an entire UTM zone for just $250. The cost for all 11 UTM zones in sub-Saharan Africa and the use of all four satellites comes in at just $11,000.

“We have turned what was a $200,000 job into a $10,000 job and we went from 100 days to 10 days [to complete],” said Hoot. “That is something scientists can build easily into their budget proposals.”
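The figures above hang together, as a quick check shows. The sketch assumes the $250 per-zone cost applies to one satellite's imagery, so the total scales with both zones and satellites; that reading is an inference from the numbers, not something the project spelled out.

```python
# Figures quoted in the article.
COST_PER_ZONE_USD = 250     # one UTM zone, assumed to be one satellite's data
UTM_ZONES = 11              # zones covering sub-Saharan Africa
SATELLITES = 4              # commercial satellites in the constellation

total_cost_usd = COST_PER_ZONE_USD * UTM_ZONES * SATELLITES

# Hoot's before-and-after comparison (rounded figures from his quote).
cost_reduction = 200_000 / 10_000   # old job cost vs new job cost
time_reduction = 100 / 10           # days before vs days after

print(total_cost_usd)
print(cost_reduction, time_reduction)
```

On Hoot's rounded numbers, that is a twentyfold cut in cost and a tenfold cut in turnaround time.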

So what next for NASA’s Head in the Clouds? There are plenty of tweaks but the main work is to find better ways to count trees and their biomass, for instance, by deducing the height of trees from the shadows they cast. “The algorithms are a work in progress,” said Hoot. “There is six months of work to do.” ®
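For the curious, the shadow idea rests on schoolbook trigonometry. The sketch below is not the project's algorithm, which is unpublished here and still in progress. It assumes flat ground, a vertical tree and a known sun elevation angle at the moment the image was taken; the function name and parameters are hypothetical.

```python
import math


def tree_height_from_shadow(shadow_length_m, sun_elevation_deg):
    """Estimate height from shadow length on flat ground.

    height = shadow length * tan(sun elevation above the horizon).
    Illustrative only: ignores slope, canopy shape and viewing angle.
    """
    if not 0 < sun_elevation_deg < 90:
        raise ValueError("sun elevation must be between 0 and 90 degrees")
    return shadow_length_m * math.tan(math.radians(sun_elevation_deg))


# With the sun 45 degrees above the horizon, shadow and tree are equal length.
print(round(tree_height_from_shadow(10.0, 45.0), 2))
```

A lower sun stretches the shadow, so the same tree casts a longer one early or late in the day, which is why the sun's angle has to be known for each image.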

The Register - Independent news and views for the tech community. Part of Situation Publishing