Research Projects
AI & High Performance Data Mining - Materials Informatics - Healthcare Informatics - Social Media Analytics - Bioinformatics
Our ability to collect high volume, velocity, and variety of data (popularly known as big data) in practically all fields has greatly surpassed our analytical capability to make sense of it, underscoring the emergence and popularity of the fourth paradigm of science, which is data-driven science. My research on artificial intelligence (AI) and high performance data mining aims at a coherent integration of high performance computing (HPC) and data mining, so as to address big data challenges and enable large-scale data-driven discovery in various application domains, with most of the applications research done in collaboration with domain experts in respective fields. Below I briefly describe the projects I am involved in, grouped by research area.
AI & High Performance Data Mining
We envision creating a library of a variety of highly optimized data mining algorithms, which have the ability to deal with big data on tens of thousands of processors with good scalability, leveraging emerging architectures and memory storage technologies, to make data mining algorithms amenable to extreme scale.
Co-PI, “EAGER: XAISE: Explainable Artificial Intelligence for Science and Engineering”, National Science Foundation (NSF), $300,000, 2023-2025.
Team Members: Alok Choudhary (PI), Wei-keng Liao (Co-PI), Ankit Agrawal (Co-PI)
Overview: This project is aimed at exploring a high-risk, high-payoff approach of ML-DL integration to realize explainable AI in terms of the four NIST principles.
Co-PI, “RAPIDS2: A SciDAC Institute for Computer Science, Data, and Artificial Intelligence”, Department of Energy (DOE), $650,000 (Total $28,750,000), 2020-2025.
Team Members: Wei-keng Liao (PI), Ankit Agrawal (Co-PI), Alok Choudhary (Co-PI)
Collaborators: Rob Ross (Lead PI), Prasanna Balaprakash
Overview: The objective of RAPIDS2 is to assist the Office of Science (SC) application teams in overcoming computer science, data, and AI challenges in the use of DOE supercomputing resources to achieve scientific breakthroughs. For more information, please visit the main project webpage.
Co-PI, “PROTEUS: Machine Learning Driven Resilience for Extreme-scale Systems”, Department of Energy (DOE), $1,248,115, 2018-2023.
Team Members: Alok Choudhary (PI), Wei-keng Liao (Co-PI), Ankit Agrawal (Co-PI), Reda Al-Bahrani
Overview: This project is aimed at developing scalable algorithms and software to enable enhanced resilience, efficient checkpointing, and program restart, along with the ability to dual-use the data for detailed analysis.
Co-PI, “Scalable, In-situ Clustering and Data Analysis for Extreme Scale Scientific Computing”, Department of Energy (DOE), $1,219,899, 2015-2021.
Team Members: Alok Choudhary (PI), Wei-keng Liao (Co-PI), Ankit Agrawal (Co-PI), Dianwei Han, Sunwoo Lee, Qiao Kang, Kaiyuan Hou, Arindam Paul
Overview: The goal of this project is to develop scalable algorithms and software for spatio-temporal data clustering, machine learning based data transformation and reduction, anomaly detection, and learning data distributions, all designed for in-situ implementation and execution.
Co-PI, “SHF: Medium: Collaborative Research: Scalable Algorithms for Spatio-temporal Data Analysis”, National Science Foundation (NSF), $709,342 (Total $934,342), 2014-2019.
Team Members: Alok Choudhary (PI), Wei-keng Liao (Co-PI), Ankit Agrawal (Co-PI), Dianwei Han, Qiao Kang, Sunwoo Lee, Steve Rangel, Arindam Paul, Dipendra Jha, Zijiang Yang, Reda Al-Bahrani
Collaborators: Salman Habib, Katrin Heitmann, Tom Peterka
Overview: This project develops innovative, scalable, and sustainable data analytics algorithms to enable analysis and mining of massive data on high-performance parallel computers.
Co-PI, “EAGER: Scalable Big Data Analytics”, National Science Foundation (NSF), $300,000, 2013-2016.
Team Members: Alok Choudhary (PI), Wei-keng Liao (Co-PI), Ankit Agrawal (Co-PI), Dianwei Han, Yusheng Xie, Zheng Yuan, Qiao Kang, Sunwoo Lee
Overview: The project explored the holistic ecosystem or a virtuous cycle that optimizes data generation, organizes this data, performs knowledge discovery, and leads to actionable insights.
Co-PI, “Scalable Data Management, Analysis, and Visualization (SDAV) Institute”, Department of Energy (DOE), $750,000 (Total $25,000,000), 2012-2019.
Team Members: Alok Choudhary (PI), Wei-keng Liao (Co-PI), Ankit Agrawal (Co-PI), William Hendrix, Mostofa Patwary, Dianwei Han, Yusheng Xie, Zheng Yuan, Kaiyuan Hou, Dipendra Jha, Reda Al-Bahrani, Qiao Kang, Justin Liao
Collaborators: Arie Shoshani (Lead PI), Salman Habib
Overview: The SciDAC SDAV Institute actively worked with application teams to assist them in achieving breakthrough science and provide technical solutions in the data management, analysis, and visualization regimes that are broadly applicable in the computational science community. For more information, please visit the main project webpage.
Senior Researcher, “Expeditions in Computing: Understanding Climate Change: A Data Driven Approach”, National Science Foundation (NSF), $900,000 (Total $10,000,000), 2010-2016.
Team Members: Alok Choudhary (PI), Wei-keng Liao, Ankit Agrawal, Kui Gao, William Hendrix, Mostofa Patwary, Saba Sehrish, Zhengzhang Chen, Chen Jin, Prabhat Kumar, Dianwei Han, Steve Rangel, Diana Palsetia, Qiao Kang, Pranjal Daga
Collaborators: Vipin Kumar (Lead PI), Arindam Banerjee, Nagiza Samatova, Fred Semazzi, Auroop Ganguly, Abdollah Homaifar
Overview: This Expeditions project aimed to address key challenges in the science of climate change by developing methods that take advantage of the wealth of climate and ecosystem data available from satellite and ground-based sensors, the observational record for atmospheric, oceanic, and terrestrial processes, and physics-based climate model simulations. For more information, please visit the main project webpage.
Research Participant, “Scalable and Power Efficient Data Analytics for Hybrid Exascale Systems”, Department of Energy (DOE), $705,000, 2010-2014.
Team Members: Alok Choudhary (PI), Wei-keng Liao (Co-PI), Ankit Agrawal, Kui Gao, William Hendrix, Seung Woo Son, Prabhat Kumar, Sanchit Misra, Ramanathan Narayanan, Abhishek Das, Dan Honbo, Yuhong Zhang, Siddharth Gupta
Collaborators: Nagiza Samatova, John Wu
Overview: This project developed a library of functions and software to accelerate data analytics, mining, and knowledge discovery for large-scale scientific applications, using hybrid architectures that include many-core systems, GPUs, and other accelerators. For more information, please visit the project webpage.
Related Research Products: [Publications] [Software]
Materials Informatics (AI for Materials)
The over-arching goal here is to use artificial intelligence to better understand processing-structure-property-performance (PSPP) linkages, i.e., develop explainable, data-driven AI models for predicting the properties of a given material (forward models), and to discover and design new materials with a given target property (inverse models).
PI, “AI-Driven Nanocombinatorics for Accelerated Structural Characterization: Automated High-Throughput Nanoparticle Library Screening and Analytics (Renewal)”, Center for Nanocombinatorics, Northwestern University, $100,000, 2024-2025.
Team Members: Ankit Agrawal (PI), Alok Choudhary (Co-PI), Wei-keng Liao (Co-PI), Alexandra Day, Muhammed Nur Talha Kilic, Youjia Li
Collaborators: Vinayak Dravid (Co-PI), Roberto dos Reis (Co-PI), Chad Mirkin, Carolin Wahl
Overview: The goal of this project is to integrate AI with structure characterization for automated quality check and segmentation of images of nanoparticles deposited in megalibraries.
Co-PI, “EAGER: XAISE: Explainable Artificial Intelligence for Science and Engineering”, National Science Foundation (NSF), $300,000, 2023-2025.
Team Members: Alok Choudhary (PI), Wei-keng Liao (Co-PI), Ankit Agrawal (Co-PI)
Overview: This project is aimed at exploring a high-risk, high-payoff approach of ML-DL integration to realize explainable AI in terms of the four NIST principles, for materials science and engineering applications.
PI, “AI-Driven Nanocombinatorics for Accelerated Structural Characterization: Automated High-Throughput Nanoparticle Library Screening and Analytics (Renewal)”, Center for Nanocombinatorics, Northwestern University, $100,000, 2023-2024.
Team Members: Ankit Agrawal (PI), Alok Choudhary (Co-PI), Wei-keng Liao (Co-PI), Alexandra Day, Vishu Gupta, Yuwei Mao, Muhammed Nur Talha Kilic
Collaborators: Vinayak Dravid (Co-PI), Roberto dos Reis (Co-PI), Chad Mirkin, Carolin Wahl
Overview: The goal of this project is to integrate AI with structure characterization for automated quality check and segmentation of images of nanoparticles deposited in megalibraries.
Co-PI, “Nitrogen Activation at Catalyst Surfaces to Catalyze Net-Zero: An AI-Driven Approach”, Center for Nanocombinatorics, Northwestern University, $100,000, 2023-2024.
Team Members: Ankit Agrawal (Co-PI), Alok Choudhary (Co-PI), Wei-keng Liao (Co-PI), Vishu Gupta, Alec Peltekian
Collaborators: Pengfei Ou (Co-PI), Ted Sargent (Co-PI)
Overview: The goal of this project is to integrate AI with computation and experiment to characterize nitrogen activation.
PI, “AI-Driven Nanocombinatorics for Accelerated Structural Characterization: Automated High-Throughput Nanoparticle Library Screening and Analytics”, Center for Nanocombinatorics, Northwestern University, $100,000, 2021-2022.
Team Members: Ankit Agrawal (PI), Alok Choudhary (Co-PI), Wei-keng Liao, Alexandra Day, Vishu Gupta
Collaborators: Vinayak Dravid (Co-PI), Roberto dos Reis (Co-PI), Chad Mirkin, Carolin Wahl
Overview: The goal of this project is to integrate AI with structure characterization for automated quality check of images of nanoparticles deposited in megalibraries.
PI, “AI-Driven Nanocombinatorics for Functional Characterization and Optimization: Predictive Modeling and Active Learning of Catalysis in Megalibraries”, Center for Nanocombinatorics, Northwestern University, $90,000, 2021-2022.
Team Members: Ankit Agrawal (PI), Alok Choudhary (Co-PI), Wei-keng Liao, Vishu Gupta, Alexandra Day
Collaborators: Daniel Apley (Co-PI), Linsey Seitz (Co-PI), Neil Schweitzer (Co-PI), Chad Mirkin, Jordan Swisher
Overview: The goal of this project is to integrate AI with functional characterization for catalytic performance prediction and optimization for nanomaterials deposited in megalibraries.
PI, “Collaborative Research: AI-Driven Multi-Scale Design of Materials under Processing Constraints”, National Science Foundation (NSF), $379,022 (Total $651,462), 2021-2025.
Team Members: Ankit Agrawal (PI), Alok Choudhary (Co-PI), Wei-keng Liao, Yuwei Mao, Muhammed Nur Talha Kilic
Collaborators: Pinar Acar (Lead PI)
Overview: The objective of this project is to improve the knowledge of materials design by developing a multi-scale methodology that combines physics-based models of thermo-mechanical processing and materials with artificial intelligence (AI) and machine learning (ML).
PI, “Center for Hierarchical Materials Design (CHiMaD): Phase II” (Agrawal Subproject), National Institute of Standards and Technology (NIST), $953,221 (Total $25,000,000), 2019-2023.
Team Members: Ankit Agrawal (Subproject PI), Alok Choudhary (Subproject PI), Wei-keng Liao, Arindam Paul, Dipendra Jha, Zijiang Yang, Vishu Gupta, Yuwei Mao, Alexandra Day, Muhammed Nur Talha Kilic, Alec Peltekian
CHiMaD Collaborators: Peter Voorhees (Lead PI), Greg Olson, Chris Wolverton, Wei Chen, Laura Bartolo, Ian Foster, Abhinav Saboo
NIST Collaborators: Carelyn Campbell, Francesca Tavazza, Kamal Choudhary, Andrew Reid, Gilad Kusne, Alden Dima
External Collaborators: Wei Xiong, Stefanos Papanikolaou
Overview: CHiMaD is a NIST-sponsored center of excellence for advanced materials research to enable accelerated design of novel materials and their integration to industry, one of the primary goals of the U.S. Government’s Materials Genome Initiative (MGI). We are leading the data mining and analytics group at this center. For more information, please visit CHiMaD.
PI, “Data-driven Analytics for Understanding Materials Properties”, Toyota Motor Corporation, $300,000, 2019-2020.
Team Members: Ankit Agrawal (PI), Alok Choudhary (Co-PI), Wei-keng Liao, Zijiang Yang, Arindam Paul, Dipendra Jha, Reda Al-Bahrani
Collaborators: Tetsushi Watari, Daisuke Ichigozaki, Mitsutoshi Akita, Hideaki Kume, Yoshinori Suga
Overview: This is a Northwestern-Toyota collaborative project aimed at exploring data-driven analytics for understanding materials properties.
PI, “Digital Innovation Design (DID)”, Defense Logistics Agency via Steel Founders Society of America, $200,000, 2019-2020.
Team Members: Ankit Agrawal (PI), Alok Choudhary (Co-PI)
Collaborators: Greg Olson
Overview: This is seed funding to investigate data-driven analytics in steels as part of the Navy hull steel initiative.
PI, “The investigation of machine learning for material development”, Toyota Motor Corporation, $200,000, 2017-2018.
Team Members: Ankit Agrawal (PI), Alok Choudhary (Co-PI), Wei-keng Liao, Zijiang Yang
Collaborators: Tetsushi Watari, Daisuke Ichigozaki, Kei Morohoshi, Yoshinori Suga
Overview: This is a Northwestern-Toyota collaborative project aimed at investigating machine learning techniques for materials science.
Senior Personnel, “BD Spokes: SPOKE: MIDWEST: Collaborative: Integrative Materials Design (IMaD): Leverage, Innovate, & Disseminate”, National Science Foundation (NSF), $123,847 (Total $989,700), 2017-2022.
Team Members: Ankit Agrawal, Alok Choudhary, Wei-keng Liao
Collaborators: Peter Voorhees (PI), Ian Foster (Lead PI)
Overview: The Midwest Big Data Spoke (MBD Spoke) for Integrative Materials Design (IMaD) connects researchers in industry, universities, and government to the people and services needed to easily find, access, and use data, tools, and services for materials design.
PI, “Advanced Materials Center for Excellence: Center for Hierarchical Materials Design (CHiMaD)” (Agrawal Subproject), National Institute of Standards and Technology (NIST), $505,358 (Total $25,000,000), 2014-2018.
Team Members: Ankit Agrawal (Subproject PI), Alok Choudhary (Subproject PI), Wei-keng Liao, Kasthurirangan Gopalakrishnan, Ruoqian Liu, Amar Krishna, Arindam Paul, Dipendra Jha, Zijiang Yang
CHiMaD Collaborators: Peter Voorhees (Lead PI), Greg Olson, Chris Wolverton, Wei Chen, Laura Bartolo, Ian Foster, Abhinav Saboo
NIST Collaborators: Carelyn Campbell, Francesca Tavazza, Kamal Choudhary, Andrew Reid, Gilad Kusne, Alden Dima
External Collaborators: Wei Xiong, Stefanos Papanikolaou
Overview: CHiMaD is a NIST-sponsored center of excellence for advanced materials research to enable accelerated design of novel materials and their integration to industry, one of the primary goals of the U.S. Government’s Materials Genome Initiative (MGI). We led the data mining tool group at this center. For more information, please visit CHiMaD.
PI, “Data-driven analytics for understanding processing-structure-property-performance relationships in steel alloys”, Northwestern Data Science Initiative, $45,000, 2016-2017.
Team Members: Ankit Agrawal (PI), Alok Choudhary (Co-PI), Wei-keng Liao, Zijiang Yang
Collaborators: Greg Olson
Overview: This project aimed at mining publicly available steel data for understanding processing-structure-property-performance relationships in steel alloys.
Co-PI, “Scaling up the screening of molecular networks in the rational design of optically active materials”, Northwestern Data Science Initiative, $9,000 (Total $45,000), 2016-2017.
Team Members: Ankit Agrawal (Co-PI), Alok Choudhary (Co-PI), Wei-keng Liao
Collaborators: Kevin Kohlstedt (PI), George Schatz
Overview: This project aimed at building upon an existing molecular network methodology to describe charge transport networks in disordered materials, by scaling it up to the length scale of an entire active layer of a photovoltaic cell or organic light-emitting diode.
Co-PI, “SIMPLEX: Data-driven Discovery of Novel Thermoelectric Materials”, Defense Advanced Research Projects Agency (DARPA), $601,250 (Total $1,559,999), 2015-2018.
Team Members: Alok Choudhary (PI), Ankit Agrawal (Co-PI), Wei-keng Liao, Alona Furmanchuk
Collaborators: Greg Olson (Lead PI), James Saal, Jeff Doak
Overview: This project aimed at developing tools for scientific data analysis to facilitate big hypothesis generation and accelerate scientific discovery, in particular for the design of high-performance thermoelectric materials that are capable of converting heat into electricity and vice-versa.
Co-PI, “MURI: MANAGING THE MOSAIC OF MICROSTRUCTURE: Image analysis, data structures, mathematical theory of microstructure, and hardware for the structure-property relationship”, Air Force Office of Scientific Research (AFOSR), Department of Defense (DOD), $750,000 (Total $5,658,616), 2012-2018.
Team Members: Alok Choudhary (PI), Ankit Agrawal (Co-PI), Wei-keng Liao, Kasthurirangan Gopalakrishnan, Ruoqian Liu, Amar Krishna, Arindam Paul
Collaborators: Marc De Graef (Lead PI), Surya Kalidindi, Veera Sundararaghavan
Overview: This project aimed to create breakthrough concepts and methodologies for elucidating the microstructure-properties link in materials to enable materials design by bringing together cutting-edge theories and techniques from materials science, mathematics, and information science. For more information, please visit the main project webpage.
Related Research Products: [Publications] [Software]
Healthcare Informatics (AI for Healthcare)
The goal here is to develop methods to effectively analyze the huge amounts of heterogeneous healthcare-related data, such as electronic health records, genomics/proteomics data, professional biomedical literature, and social media postings, to extract actionable healthcare insights and incorporate them in practice with the help of our collaborators.
PI, “Social Media mining of caregiver experiences: Opportunity for preventing caregiver burnout”, Northwestern Data Science Initiative, $25,000, 2017-2017.
Team Members: Ankit Agrawal (PI), Alok Choudhary (Co-PI), Wei-keng Liao, Reda Al-Bahrani
Collaborators: Margaret Danilovich
Overview: This project aimed at mining publicly available social media Twitter data, specifically to learn and analyze informal caregiver experiences using a data-driven automated qualitative approach.
PI, “Analyzing caregiving experience on Twitter”, Feinberg School of Medicine, $14,246, 2015-2016.
Team Members: Ankit Agrawal (PI), Alok Choudhary, Wei-keng Liao, Reda Al-Bahrani
Collaborators: Margaret Danilovich
Overview: This was seed funding to mine publicly available social media data and analyze informal caregiver experiences on Twitter.
Other Collaborators: Jai Raman, David Baker, Karl Bilimoria, Mark Russo
Related Research Products: [Publications] [Software]
Social Media Analytics (AI for Social Media)
There is an increasing need to uncover the wealth of information hidden in huge amounts of publicly available textual information in the form of social media websites, forums, blogs, research publications and reports, and so on. In recent years, we have done significant research on sentiment analysis, web-text clustering, recommendation systems, behavioral targeting, and other related problems. Read more here.
PI, “Social Media mining of caregiver experiences: Opportunity for preventing caregiver burnout”, Northwestern Data Science Initiative, $25,000, 2017-2017.
Team Members: Ankit Agrawal (PI), Alok Choudhary (Co-PI), Wei-keng Liao, Reda Al-Bahrani
Collaborators: Margaret Danilovich
Overview: This project aimed at mining publicly available social media Twitter data, specifically to learn and analyze informal caregiver experiences using a data-driven automated qualitative approach.
PI, “Analyzing caregiving experience on Twitter”, Feinberg School of Medicine, $14,246, 2015-2016.
Team Members: Ankit Agrawal (PI), Alok Choudhary, Wei-keng Liao, Reda Al-Bahrani
Collaborators: Margaret Danilovich
Overview: This was seed funding to mine publicly available social media data and analyze informal caregiver experiences on Twitter.
Co-PI, “EAGER: Discovering Knowledge from Scientific Research Networks”, National Science Foundation (NSF), $256,000, 2011-2014.
Team Members: Alok Choudhary (PI), Wei-keng Liao (Co-PI), Ankit Agrawal (Co-PI), Zhengzhang Chen, Lu Liu, Yu Cheng, Kunpeng Zhang, Kathy Lee, Diana Palsetia, Yusheng Xie, Lalith Polepeddi
Overview: This project developed an infrastructure to collect, clean and mine scientific data from publications, discussion forums, etc. to construct an enriched scientific research network to discover dynamics of scientific progress and new trends.
Related Research Products: [Publications] [Software]
Bioinformatics
Collaborators: Xiaoqiu Huang, Sanchit Misra
Sequence-structure-function relationships form the basis of almost everything in bioinformatics, but they are far from well understood. Sequence data is the most widely available and fastest-growing form of data in bioinformatics. We have conducted significant research in applying high performance data mining techniques to sequence data, and released several software tools for this purpose, such as pairwise statistical significance estimation for biological sequence alignment, which is used to identify related sequences, and a fast sequence mapping tool for long read mapping.
Related Research Products: [Publications] [Software]