A software tool has been developed in Java using the popular open-source Hadoop MapReduce framework. MapReduce is a widely used paradigm for problems that decompose naturally into independently computable chunks of data. The Hadoop framework distributes the data across the compute nodes of a distributed-memory cluster and parallelizes the computation, allowing use of high-performance computing hardware. The tool has been deployed at beamline 8-ID with a full analysis workflow pipeline that lets users integrate data analysis into the measurement process. Although analysis is initiated only after data collection completes, the software is fast enough that data are analyzed in near real-time, enabling users to view their experimental results while they are still conducting measurements rather than after completing them. A MATLAB user interface hides the complications of the low-level programming language from the general user while still allowing customization.
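To make the decomposition concrete, the sketch below shows how per-pixel correlation work maps onto Hadoop's Mapper/Reducer API: the mapper keys each intensity sample by pixel index so that each reducer receives one pixel's complete time series and can correlate it independently of every other pixel. The class names and the simplified text record format are hypothetical illustrations, not the production tool's design (which reads HDF5 and uses a multi-tau scheme).

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class XpcsSketch {

    // Mapper: assumes one text record per sample, "frameIndex pixelIndex intensity".
    // Keying by pixel index routes a pixel's whole time series to one reducer.
    public static class PixelMapper
            extends Mapper<LongWritable, Text, IntWritable, Text> {
        @Override
        protected void map(LongWritable offset, Text record, Context context)
                throws IOException, InterruptedException {
            String[] fields = record.toString().trim().split("\\s+");
            int pixel = Integer.parseInt(fields[1]);
            // Emit (pixel -> "frame intensity").
            context.write(new IntWritable(pixel),
                          new Text(fields[0] + " " + fields[2]));
        }
    }

    // Reducer: each reduce call sees one pixel's full time series, so pixels
    // are the independently computable chunks that MapReduce parallelizes.
    public static class CorrelationReducer
            extends Reducer<IntWritable, Text, IntWritable, Text> {
        @Override
        protected void reduce(IntWritable pixel, Iterable<Text> samples,
                              Context context)
                throws IOException, InterruptedException {
            // Order the samples by frame index.
            TreeMap<Integer, Double> series = new TreeMap<>();
            for (Text t : samples) {
                String[] f = t.toString().split("\\s+");
                series.put(Integer.parseInt(f[0]), Double.parseDouble(f[1]));
            }
            List<Double> intensities = new ArrayList<>(series.values());
            int n = intensities.size();
            double mean = 0.0;
            for (double v : intensities) mean += v;
            mean /= n;

            // Brute-force autocorrelation at a single example lag; the real
            // tool computes many lags with a multi-tau scheme.
            int tau = 1;
            if (n <= tau || mean == 0.0) return;
            double num = 0.0;
            for (int t = 0; t + tau < n; t++) {
                num += intensities.get(t) * intensities.get(t + tau);
            }
            num /= (n - tau);
            context.write(pixel, new Text("g2(tau=1) = " + num / (mean * mean)));
        }
    }
}
```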
The X-ray photon correlation spectroscopy (XPCS) technique must process “Big Data” generated by high frame-rate area detectors operating at 100-1000 frames per second. Each frame typically contains one million pixels, resulting in a sustained raw data throughput of 0.2-2 GB/s. Initial data reduction via sophisticated compression algorithms cuts the data rate by a factor of ten, but a typical dataset comprises 10K-100K frames and still occupies tens of gigabytes. Nonetheless, users need to see the temporal autocorrelation functions from their samples in near real-time so they can make intelligent choices about subsequent measurements.
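The autocorrelation function in question is the standard normalized intensity autocorrelation of XPCS; in the usual notation,

```latex
g_2(\mathbf{q}, \tau) =
  \frac{\langle I(\mathbf{q}, t)\, I(\mathbf{q}, t + \tau) \rangle_t}
       {\langle I(\mathbf{q}, t) \rangle_t^{\,2}},
```

where I(q, t) is the detected intensity at wave vector q and time t, and the averages are taken over time t (and, in practice, over equivalent pixels within a q bin).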
Distribution & Impact
Since January 2013, this real-time XPCS software has been used, with high reliability, to analyze 95% of the general-user experiments at 8-ID. Most APS XPCS users are able to go home with time-correlation data ready for interpretation. A new “two-time correlation” analysis capability is being added by a contractor.
Funding Source |
This project was produced using operational funding from the APS, under contract DE-AC02-06CH11357.
Related Publications |
F. Khan, J. Hammonds, S. Narayanan, A. Sandy, and N. Schwarz, “Effective End-to-end Management of Data Acquisition and Analysis for X-ray Photon Correlation Spectroscopy,” Proceedings of ICALEPCS 2013, the 14th International Conference on Accelerator and Large Experimental Physics Control Systems, San Francisco, California, 10/07/2013 – 10/11/2013.

F. Khan, N. Schwarz, J. Hammonds, C. Saunders, A. Sandy, and S. Narayanan, “Distributed X-ray Photon Correlation Spectroscopy Analysis using Hadoop,” NOBUGS 2012, the 9th NOBUGS Conference, Poster Session, Didcot, United Kingdom, 09/24/2012 – 09/26/2012.
Additional Development Goals |
While the software provides the tools required for most common XPCS analyses, in the near term we are working with a contractor to extend its capabilities to include “two-time correlation” analysis, which is essential for measuring dynamics in materials under non-equilibrium conditions. The long-term plan is to provide this capability as software as a service (SaaS) to the APS user community through a cloud-computing platform; APS-SSG is currently prototyping this using Argonne’s Magellan cloud-computing resource.
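For context, the two-time correlation function is conventionally defined as

```latex
C(\mathbf{q}, t_1, t_2) =
  \frac{\langle I(\mathbf{q}, t_1)\, I(\mathbf{q}, t_2) \rangle}
       {\langle I(\mathbf{q}, t_1) \rangle\, \langle I(\mathbf{q}, t_2) \rangle},
```

with the averages taken over equivalent pixels. Because both time arguments are retained, non-equilibrium behavior such as aging appears as structure away from the t1 = t2 diagonal, which the one-time g2 averages away.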
Implementation Methodology |
Low latency between data acquisition and analysis is critically important to any experiment. The combination of a faster parallel algorithm and a data pipeline connecting disparate components (detectors, clusters, file formats) enabled us to greatly enhance the operational efficiency of the X-ray photon correlation spectroscopy facility at the Advanced Photon Source. The improved workflow starts with raw data streaming directly from the detector camera, through an on-the-fly discriminator implemented in firmware, to Hadoop's distributed file system (HDFS) in a structured HDF5 data format. The user then triggers the MapReduce-based parallel analysis. For effective bookkeeping and data management, the provenance information and reduced results are added back to the original HDF5 file. Finally, the data pipeline triggers user-specific software for visualizing the data. The whole process completes shortly after data acquisition, a significant operational improvement over the previous setup. The faster turnaround time helps scientists make near real-time adjustments to their experiments.
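As a sketch of the step where the user triggers the analysis, a minimal Hadoop job driver along these lines would wire the illustrative mapper and reducer above to input already staged on HDFS. The class names and paths are assumptions for illustration, not the production tool's API; in the real pipeline, the reduced output and provenance are subsequently merged back into the experiment's HDF5 file as described above.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class XpcsJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "xpcs-g2");
        job.setJarByClass(XpcsJobDriver.class);
        job.setMapperClass(XpcsSketch.PixelMapper.class);
        job.setReducerClass(XpcsSketch.CorrelationReducer.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(Text.class);
        // Input: detector frames staged on HDFS by the acquisition pipeline.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // Output: reduced correlation results, later merged, along with
        // provenance records, into the experiment's HDF5 file.
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```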