As I was recently pondering EMC’s Data Lake Federation announcement, I stumbled upon this article that discusses Roy Hill Mining Company’s (RHC) use of automated trains in the Australian Outback. The big challenge for RHC and its partner, GE, is that the mine is in a remote location where temperatures exceed 130 degrees Fahrenheit (54 Celsius) during the day. This inhospitable environment creates a challenge for human workers and an opportunity for big data.
RHC will extract 55 million tons of iron ore each year, which it must transport more than 200 miles to Port Hedland for distribution. GE provided a novel solution that includes remotely operated trains to transport the ore. As an aside, I have to say that the idea of a full-size RC train is just really cool; it is the dream of almost every child to have access to such a gadget. RHC’s robo-trains will spend about 8 hours each day traveling to and from the mine, and in order to ensure optimal performance, each train will constantly capture data from 250 sensors, including metrics such as speed and temperature.
GE says that these 250 sensors will generate roughly 9 million data points every hour the trains operate. RHC will have 5 trains making the 8-hour run to and from Port Hedland daily. As a result, RHC’s engines will generate 360 million data points in a day, or 1.8 billion data points in a five-day week! That is some serious data. A very rough calculation suggests that the GE engines would require about 200 terabytes of storage per day or 1 petabyte per week. Clearly the scope of storage and associated analytics are significant, and I wanted to explore this challenge in more detail.
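To sanity-check those figures, here is a back-of-envelope sketch. The inputs come from the article (5 trains, 8 operating hours per day, 9 million data points per train-hour); the five-day operating week is my assumption, since it is what reconciles the daily and weekly totals.

```python
# Back-of-envelope check of the data volumes cited above.
# Assumption (mine, not GE's): a five-day operating week.
points_per_train_hour = 9_000_000
trains = 5
hours_per_day = 8
operating_days_per_week = 5  # assumed; reconciles the daily and weekly figures

points_per_day = points_per_train_hour * trains * hours_per_day
points_per_week = points_per_day * operating_days_per_week

print(f"{points_per_day:,} data points per day")    # 360,000,000
print(f"{points_per_week:,} data points per week")  # 1,800,000,000
```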
RHC/GE’s transport mechanism is critical because a failure could negatively impact mining operations and hence productivity. As a result, GE will be using the telemetry information to maximize the performance and reliability of its robo-trains. When thinking about how these metrics could be analyzed, four big data approaches come to mind, each of which provides increasing intelligence at the cost of greater storage requirements.
Simplistic – The most basic method for GE to measure train performance is to look at absolute values. For example, they could monitor a motor’s temperature and determine that if it crosses a certain threshold, maintenance is required. While informative, this approach does not take significant advantage of big data analytics because it is based on data from a single point in time. (e.g., what is the temperature right now, and how does it compare to the warning level?)
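The simplistic approach boils down to a single comparison. A minimal sketch, with an invented threshold value and function name (nothing here reflects GE’s actual telemetry system):

```python
# Hypothetical warning threshold -- an illustrative value, not a GE figure.
MOTOR_TEMP_WARNING_C = 110.0

def needs_maintenance(current_temp_c: float) -> bool:
    """Flag a motor when a single reading crosses the warning threshold."""
    return current_temp_c >= MOTOR_TEMP_WARNING_C

print(needs_maintenance(95.0))   # False: reading is below the threshold
print(needs_maintenance(115.0))  # True: reading has crossed the threshold
```

Note that the decision uses only the latest reading; history is ignored entirely, which is exactly the limitation the next approach addresses.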
Basic analytics – This improves upon the simplistic approach by using trend models. Returning to the previous example, GE could model a motor’s temperature over time; instead of looking at just one data point, they could look at temperature trends to help identify potential issues further in advance. This approach should better forecast problems and, more importantly, reduce train failures.
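One simple way to act on a trend is to fit a line to recent readings and project when the warning threshold would be crossed. This sketch uses an ordinary least-squares slope; the function name, sample readings, and threshold are all illustrative assumptions.

```python
def hours_until_threshold(times_h, temps_c, threshold_c):
    """Fit a straight line to (time, temperature) readings and estimate
    how many hours remain until the threshold is crossed.
    Returns None if the temperature is flat or falling."""
    n = len(times_h)
    mean_t = sum(times_h) / n
    mean_y = sum(temps_c) / n
    # Least-squares slope and intercept of temperature vs. time.
    slope = (sum((t - mean_t) * (y - mean_y) for t, y in zip(times_h, temps_c))
             / sum((t - mean_t) ** 2 for t in times_h))
    intercept = mean_y - slope * mean_t
    if slope <= 0:
        return None  # no upward trend, so no projected crossing
    return (threshold_c - intercept) / slope - times_h[-1]

# Readings drifting upward by ~1 C per hour toward a 110 C threshold:
print(hours_until_threshold([0, 1, 2, 3], [100, 101, 102, 103], 110))  # 7.0
```

Where the simplistic check fires only after the threshold is reached, this projection gives roughly seven hours of warning on the same data.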
Comparative analytics – RHC will have 20 GE locomotives in active use. (Five trains running, with each having four locomotives.) In order to further optimize performance and improve reliability, GE could use big data analytics to compare sensor values across all 20 locomotives. Hence, rather than looking at a single value as in the simplistic approach or at the trend of one train as in the basic case, GE could compare all of the motors across all engines to see how each is performing. Variations in metrics would enable earlier detection of potential failures. As an added benefit, GE could also leverage the data to improve locomotive performance across RHC’s fleet. (For example, they may find that a set of locomotives is performing better than others and could use that information to assess whether there are learnings or configuration changes that could be applied to other trains to improve results.)
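A basic version of this fleet-wide comparison is outlier detection: flag any locomotive whose reading deviates sharply from the fleet average. The readings and locomotive IDs below are invented for illustration, and the two-standard-deviation cutoff is an arbitrary assumption.

```python
import statistics

def fleet_outliers(readings, z_limit=2.0):
    """Return locomotive ids whose reading sits more than z_limit
    standard deviations from the fleet mean."""
    mean = statistics.mean(readings.values())
    stdev = statistics.stdev(readings.values())
    return [loco for loco, temp in readings.items()
            if abs(temp - mean) > z_limit * stdev]

# 19 locomotives with motor temperatures clustered around 95-98 C...
readings = {f"loco-{i:02d}": 95.0 + (i % 4) for i in range(1, 20)}
readings["loco-20"] = 130.0  # ...and one running noticeably hot.

print(fleet_outliers(readings))  # ['loco-20']
```

The same comparison could be run per component (motors, bearings, brakes), which is where 250 sensors per train start to pay off.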
Massive scale comparative analytics – Finally, GE could extend the comparative analysis by including all trains of the same type across the world. In this scenario, they could not only gain further insight into specific components like motors, but also significantly expand upon the efficiency analysis used in the comparative model. This is the most powerful use of big data, and it is highly beneficial because any learnings can be applied to all trains monitored by GE across the globe, so small improvements in efficiency can result in massive global savings.
In summary, RHC is pioneering a new method of transporting natural resources in partnership with GE. The new approach can reduce costs, accelerate resource deliveries and improve reliability. A key hallmark of the strategy is powerful big data analytics driven by large volumes of sensor data. The four scenarios highlighted in this post illustrate how larger data sets and deeper analyses can deliver increased business value. In order for GE to maximize the value of their sensor data, they must have a powerful storage infrastructure that can grow over time and an equally advanced analytics engine to turn that data into actionable business results. EMC’s new Federated Data Lake provides a compelling offering that combines highly scalable Isilon storage with advanced Pivotal analytics. Regardless of the technology that you choose, it is clear that data analytics as illustrated by RHC and GE can provide insights that were previously unobtainable.