Systems Research (HPC/ML)
Research on scaling machine learning algorithms, graph databases, geosimulations, in-situ analytics, …
- Scaling Machine Learning Algorithms
- Gaussian Process (GP) Learning: GP-based approaches are increasingly used as kernel machine learning tools for nonparametric regression and classification. We developed several novel algorithms based on GP learning for classification, change detection, and anomaly detection. Despite the great advantages of GPs in various machine learning tasks, in particular for spatial and temporal data, their wide adoption by the GIS community is limited by high computational complexity, O(t³), and memory footprint, O(t²), where t is the length of the time series. In applications such as monitoring crop biomass at regional scales, one has to deal with billions of time series, and the solution requires computing and inverting a covariance matrix for each time series. To overcome these limitations, we chose an exponential covariance function that not only captures the periodic nature of crop growth (phenology) but is also efficient, since the resulting covariance is a Toeplitz matrix. Our new solution, GPChange, reduced the computational complexity to O(t²) and the memory footprint to O(t). In addition, we developed mixed parallel algorithms to take advantage of heterogeneous compute clusters at the Oak Ridge Leadership Computing Facility. Our experiments showed a significant reduction in computing time, from days to seconds, compared to standard Cholesky decomposition-based GP learning on a 128-node SGI Altix ICE 8200 cluster. This solution was operationalized at ORNL for regional-scale continuous biomass monitoring using MODIS satellite-based daily NDVI time series. In addition to producing several papers in leading conferences and journals, the project won the ORNL Lab Director's best LDRD project award for 2010.
- Varun Chandola, Ranga Raju Vatsavai: A scalable gaussian process analysis algorithm for biomass monitoring. Statistical Analysis and Data Mining 4(4): 430-445 (2011)
- Varun Chandola, Ranga Raju Vatsavai: A Gaussian Process Based Online Change Detection Algorithm for Monitoring Periodic Time Series. SIAM Data Mining (SDM) 2011: 95-106
- Varun Chandola, Ranga Raju Vatsavai: Scalable Time Series Change Detection for Biomass Monitoring Using Gaussian Process. NASA Conference on Intelligent Data Understanding (CIDU) 2010: 69-82. (Selected as one of the six best papers at the NASA/CIDU and published in the special issue of Statistical Analysis and Data Mining Journal)
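The Toeplitz structure described above can be sketched in a few lines. This is a minimal illustration, not the actual GPChange implementation: the kernel form (a common periodic "exp-sine-squared" covariance), length scale, period, and noise level are all assumptions chosen for the example.

```python
import numpy as np
from scipy.linalg import solve_toeplitz, toeplitz

def periodic_exp_cov(t, length=0.5, period=365.0, noise=0.1):
    # A periodic exponential covariance evaluated on a regularly sampled
    # series (illustrative form; the exact GPChange kernel may differ).
    # On a regular grid the kernel is stationary, so the full t x t matrix
    # is Toeplitz and this first column determines it entirely:
    # O(t) storage instead of O(t^2).
    lags = np.arange(t)
    c = np.exp(-2.0 * np.sin(np.pi * lags / period) ** 2 / length ** 2)
    c[0] += noise  # observation-noise jitter on the diagonal
    return c

t = 128
c = periodic_exp_cov(t)
y = np.sin(2 * np.pi * np.arange(t) / 365.0)  # synthetic NDVI-like signal

# Levinson-recursion Toeplitz solve: O(t^2) time, matching the reduced
# complexity quoted above, versus O(t^3) for a dense Cholesky route.
alpha = solve_toeplitz(c, y)

# Sanity check against the dense solve.
alpha_dense = np.linalg.solve(toeplitz(c), y)
assert np.allclose(alpha, alpha_dense)
```

Because each of the billions of pixel time series shares the same regular sampling, the same first column can be reused across series, which is what makes the per-series cost dominated by the O(t²) solve.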
- Scaling GeoSimulations:
- Geosimulations are widely used in urban growth modeling, environmental studies, disease spread modeling, traffic modeling, and land-use and land-cover change studies. Geosimulations are becoming increasingly sophisticated, but their computational cost has also grown with larger spatial/temporal extents and finer spatial resolutions. To address these computational and I/O challenges, we developed various parallelization strategies for scaling the cellular automaton-based FUTURES model in a distributed computing environment. In particular, we developed intelligent strategies for data partitioning, task scheduling, and task synchronization. These strategies resulted in a highly scalable pFUTURES model, which made it possible to study urban simulations over larger spatial and temporal extents and at finer spatial resolutions than previously reported in the literature. In addition, we developed an adaptive mesh refinement strategy, FUTURES-AMR, that further reduced the computation and memory requirements of large-scale simulations. Typical data-parallel approaches in geosimulation use a static partitioning strategy with load balancing; however, in many practical situations some data partitions (tiles) may require simulations at finer spatial resolutions than others (e.g., to account for important and critical events such as flooding, which is localized to a few tiles). To handle such scenarios, we developed a novel provisioning system, FUTURES-DPE, that dynamically allocates additional computing resources to the required tiles at run time. This extension, along with our computational steering work, allowed geosimulation modelers for the first time to explore what-if scenarios on the fly. The FUTURES-AMR work was nominated for the best paper award at the top-ranked biennial GIScience 2018 conference.
- Ashwin Shashidharan, Ranga Raju Vatsavai, Ross K. Meentemeyer: FUTURES-DPE: Towards Dynamic Provisioning and Execution of Geosimulations in HPC Environments. The ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (SIGSPATIAL/GIS) 2018: 464-467
- Ashwin Shashidharan, Ranga Raju Vatsavai, Derek B. van Berkel, Ross K. Meentemeyer: FUTURES-AMR: Towards an Adaptive Mesh Refinement Framework for Geosimulations. The International Conference on Geographic Information Science (GIScience) 2018: 16:1-16:15 (nominated for the best paper award)
- Qiang Zhang, Ranga Raju Vatsavai, Ashwin Shashidharan, Derek B. van Berkel: Agent Based Urban Growth Modeling Framework on Apache Spark. BigSpatial@SIGSPATIAL 2016: 50-59
- Ashwin Shashidharan, Derek B. van Berkel, Ranga Raju Vatsavai, Ross K. Meentemeyer: pFUTURES: A Parallel Framework for Cellular Automaton Based Urban Growth Models. The International Conference on Geographic Information Science (GIScience) 2016: 163-177
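The data-partitioning idea behind pFUTURES-style parallelism can be sketched as a row-block decomposition with halo exchange. The transition rule below is a toy stand-in (the real FUTURES rule is probabilistic and driven by suitability data), and the tile counts and grid sizes are arbitrary; the point is only that a halo of one cell makes each tile independently updatable.

```python
import numpy as np

def step(grid):
    # One synchronous update of a toy urban-growth cellular automaton:
    # a cell becomes "developed" (1) if it already is, or if it has at
    # least 3 developed Moore neighbours. (Illustrative rule only; the
    # actual FUTURES transition rule is probabilistic and data-driven.)
    p = np.pad(grid, 1)
    nbrs = (p[:-2, :-2] + p[:-2, 1:-1] + p[:-2, 2:] +
            p[1:-1, :-2]               + p[1:-1, 2:] +
            p[2:, :-2]  + p[2:, 1:-1]  + p[2:, 2:])
    return ((grid == 1) | (nbrs >= 3)).astype(np.int8)

def step_partitioned(grid, n_tiles=4):
    # Static row-block partitioning with a one-cell halo: each tile can
    # be updated independently (e.g., on a different node), and only the
    # halo rows must be exchanged between neighbouring tiles per step.
    h = grid.shape[0]
    bounds = np.linspace(0, h, n_tiles + 1).astype(int)
    out = np.empty_like(grid)
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        a, b = max(lo - 1, 0), min(hi + 1, h)  # tile plus halo rows
        tile = step(grid[a:b])
        out[lo:hi] = tile[lo - a : hi - a]     # keep interior, drop halo
    return out

rng = np.random.default_rng(42)
grid = (rng.random((60, 80)) < 0.25).astype(np.int8)
# The partitioned update must agree exactly with the serial one.
assert np.array_equal(step_partitioned(grid), step(grid))
```

Because the rule's neighbourhood radius is one cell, a one-row halo suffices; a wider stencil, or AMR-style refinement on selected tiles, would widen the exchanged region accordingly.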
- In Situ Analytics:
- High-resolution simulations on exascale supercomputers produce petabytes of data. Storing and retrieving such large datasets for analytics is a challenging task. In situ analytics aims at on-the-fly data compression and summarization before the data reaches secondary storage, thus minimizing data movement. Summarization and compression at current and future scales require a framework for developing and benchmarking algorithms. We developed a novel framework by integrating existing, production-ready projects in VTK and demonstrated scaling results for two algorithms that serve as exemplars of summarization: a wavelet-based data reduction filter and a generator that creates image-like databases of extracted features (isocontours, in this case). Both solutions support browser-based, post-hoc, interactive visualization of the summary for decision-making. We also demonstrated weak scaling on a distributed multi-GPU system.
- David C. Thompson, Sébastien Jourdain, Andrew C. Bauer, Berk Geveci, Robert Maynard, Ranga Raju Vatsavai, Patrick O'Leary: In Situ Summarization with VTK-m. ISAV@SC 2017: 32-36
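The wavelet-based reduction idea can be illustrated with a one-level 2D Haar transform and coefficient thresholding. This is a minimal sketch, not the VTK-m filter itself: the transform level, the keep fraction, and the synthetic field are all assumptions for the example.

```python
import numpy as np

def haar2d(a):
    # One-level 2D Haar transform: pairwise averages (low-pass) and
    # differences (detail) along rows, then along columns.
    def fwd(x):
        return np.concatenate([(x[..., ::2] + x[..., 1::2]) / 2,
                               (x[..., ::2] - x[..., 1::2]) / 2], axis=-1)
    return fwd(fwd(a).swapaxes(0, 1)).swapaxes(0, 1)

def ihaar2d(c):
    # Exact inverse of haar2d: reconstruct each pair from average + detail.
    def inv(x):
        n = x.shape[-1] // 2
        out = np.empty_like(x)
        out[..., ::2] = x[..., :n] + x[..., n:]
        out[..., 1::2] = x[..., :n] - x[..., n:]
        return out
    return inv(inv(c).swapaxes(0, 1)).swapaxes(0, 1)

def reduce_field(field, keep=0.25):
    # In situ reduction step: transform, keep only the largest `keep`
    # fraction of coefficients, and write the sparse result to storage
    # instead of the raw field.
    c = haar2d(field)
    cutoff = np.quantile(np.abs(c), 1.0 - keep)
    return np.where(np.abs(c) >= cutoff, c, 0.0)

# Toy simulation field standing in for one timestep of a large run.
x = np.sin(np.linspace(0, np.pi, 64))
field = np.outer(x, x)
sparse = reduce_field(field)
recon = ihaar2d(sparse)  # post-hoc reconstruction for visualization
```

For a smooth field most detail coefficients are near zero, so discarding 75% of them changes the reconstruction only slightly; that trade-off between keep fraction and reconstruction error is what a summarization benchmark framework has to measure systematically.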