Big data, machine learning, and artificial intelligence are revolutionizing numerous fields, and materials science is no exception. At this timely moment, the NOMAD summer school will introduce novice and advanced researchers (in academia and industry) to data-driven computational methods as well as practical, readily usable tools for novel materials discovery developed within the Novel Materials Discovery (NOMAD) Centre of Excellence.
An enormous amount of materials data, produced with millions of CPU hours spent every day in HPC centers worldwide, is already stored in data repositories. These data represent an invaluable resource. But how can knowledge be extracted from them?
The NOMAD Center of Excellence (https://NOMAD-CoE.eu) develops tools to obtain insight into physical processes in materials. Converting inputs and outputs produced by many different computer codes into a common format ensures that they can be compared with each other. This makes the data ready for the next steps, which are urgently needed in academia and industry and which are the focus of this Summer School: making Big Data of materials comprehensible to the outside world.
In particular, this school will introduce both novice and advanced researchers in academia and industry to methods and practical tools to:
1. upload, share, and download materials science data using the NOMAD Repository and Archive;
2. visualize physical processes and complex relationships between materials properties with Advanced Graphics;
3. search and retrieve the vast amount of computed materials properties using the NOMAD Encyclopedia;
4. identify correlations and structure in big data of materials, towards the final goal of predicting novel materials with tailored properties, using the NOMAD Analytics-Toolkit.
The school will feature 7 sessions on the different topics listed below. Each session will comprise talks (45 min) by the invited speakers introducing the topics, and hands-on sessions guided by tutors from the NOMAD team. The attached program details the topical sessions, including the team presenting each topic (speakers and tutors). Additionally, we will include keynote talks by international researchers at the forefront of data-driven materials science, who will speak on relevant, timely topics.
Workshop information provided by CECAM
(i) Data Repositories, Archives and Metadata
Repositories are the prerequisite for hosting, organizing, and sharing materials data. We will provide an overview of existing data collections with an emphasis on the NOMAD Repository, which is part of the NOMAD Center of Excellence. We will address the question of how the existing input and output files, produced by many different computer codes, are transformed into a common format (using NOMAD Metadata), as realized in the NOMAD Archive. In this way, different calculations can be compared with each other. Participants will learn how to use the Repository and the Archive. We will also collect feedback, particularly from industrial users, about their special needs when using these tools and data.
Speakers will include M. Scheffler and J. Vreeken.
(ii) Advanced graphics
This area will focus on the analysis and visualization of time-dependent periodic, molecular structures in combination with scalar or vector fields in three-dimensional space, as produced by electronic structure simulation codes. In addition, the visualization of abstract data in a multidimensional space of parameters, as encountered, for example, in the data-analytics context, will be covered. NOMAD provides tailored software tools, both in a traditional remote visualization context and in different virtual reality (VR) environments. Users will learn about powerful open-source visualization tools (VisIt and ParaView), which are deployed and customized by NOMAD but can also be operated as standalone software. VisIt, in particular, combines tailored molecular visualization features (including support for handling periodic structures) with general-purpose capabilities for the analysis of scalar and vector fields.
Furthermore, we will demonstrate the benefits of immersive visualization using virtual reality (VR) devices. Specifically, users will be able to explore various three-dimensional (e.g. Fermi surfaces and crystal structures), four-dimensional (e.g. time-dependent simulations), and six-dimensional data sets (e.g. electron-hole interactions) by using various virtual reality viewers (e.g. HTC Vive, Samsung GearVR, Google Cardboard). The NOMAD visualization team has a multi-year track record in applying these methods and tools in different scientific contexts and in providing training courses to a scientific audience.
Speaker: M. Rampp.
(iii) NOMAD Encyclopedia
How can the vast amount of materials data be accessed in a user-friendly way? The NOMAD Encyclopedia is an infrastructure that is developed within the NOMAD CoE for this purpose. It displays all possible information on the computed materials, thus facilitating the extraction of knowledge from the data. It serves two main purposes: the comprehensive characterization of single materials, and the search for materials exhibiting certain features or combinations of various properties.
For example, users will be able to directly compare calculations performed with different approximations and get information about the underlying methodology and associated error bars. The graphical user interfaces also allow for tracing results back to the respective calculations. The Encyclopedia displays data from the NOMAD Archive and also incorporates graphics tools of the visualization team.
Speakers will include C. Draxl and G. Huhs.
(iv) Preparation and analysis of high-throughput simulations
Despite the many millions of calculations already available in databases and repositories all over the world, a huge range of materials on the one hand, and of properties on the other, remains largely unexplored. High-throughput calculations will therefore remain important, for instance to meet industrial needs or to create data for assessing error bars related to methodology, approximations, or numerical noise. Various tools have been suggested to facilitate such tasks, such as the Atomic Simulation Environment (ASE) and AiiDA. We will address this issue by providing a tutorial that comprises both the preparation of such high-throughput studies and the analysis of the related results.
Speakers will include G. Ceder and C. Carbogno.
(v) Big-data analytics
The major part of the school will be dedicated to data-analytics tools, i.e. various ways of extracting knowledge from materials data. They embrace machine-learning, statistical-learning, and data-mining methods. The topics listed below will be covered in terms of lectures as well as hands-on tutorials:
- Structure prediction by compressed sensing: Experience suggests that many properties of materials are determined by a few key variables. Physical intuition, which can help in finding them, is in many cases difficult to develop because of the complexity of the task. Compressed sensing is a recent technique from the field of signal processing. For a given accuracy of the property prediction, it allows one to extract, in an unbiased way, the smallest possible set of variables from a huge pool of candidates for the statistical learning of materials properties.
- Neural networks: Recently, neural networks have revolutionized the field of artificial intelligence, outperforming existing machine learning algorithms in a variety of tasks such as speech recognition and natural language generation. We will explain how neural networks can be applied to relevant materials science problems like crystal-structure recognition or atomization energy prediction.
- Subgroup discovery: This is a data mining technique for identifying subgroups of materials according to some property of interest. With its help, interpretable descriptors or variables describing the subgroups can be uncovered.
- Cluster expansion: The cluster expansion technique allows one to build models for the quick calculation of materials properties. It uses as input both the structural and compositional information and the calculated properties of materials. These are readily available in the Archive. As an example, this technique will be used to model the formation energies of alloys, in order to uncover phase transitions and the stability of materials.
- First Principles Molecular Dynamics with Machine-Learned Forces: We present a molecular dynamics (MD) scheme which combines first-principles and machine-learning techniques in a single information-efficient approach. Forces on atoms are learned from first-principles forces available in databases, using Bayesian techniques. Thus, the workload of costly MD simulations is minimized, making it possible to tackle problems currently beyond reach.
- Cross validation: A topic that will be given high relevance in all the presented applications (lectures and tutorials) is cross-validation, i.e., a set of strategies to quantify the ability of a learned model to make accurate predictions on data that are not included in the training set. Cross-validation has to be adapted to the specific method, but an effort will be made to present this topic in a unified way.
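As an illustration of the cross-validation idea described above, the following is a minimal sketch of k-fold cross-validation in plain Python. It is not taken from the NOMAD Analytics-Toolkit: the helper names `fit_line` and `k_fold_rmse`, the toy data, and the choice of a simple straight-line model are all assumptions made for this example. The data are split into k folds, the model is fitted on k-1 folds, and the prediction error is measured only on the held-out fold.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def k_fold_rmse(xs, ys, k=5, seed=0):
    """Root-mean-square prediction error over all held-out points."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    # Every index lands in exactly one fold.
    folds = [indices[i::k] for i in range(k)]
    squared_errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        # Fit only on the training part; the held-out fold is never seen here.
        slope, intercept = fit_line([xs[i] for i in train],
                                    [ys[i] for i in train])
        for i in held_out:
            prediction = slope * xs[i] + intercept
            squared_errors.append((ys[i] - prediction) ** 2)
    return (sum(squared_errors) / len(squared_errors)) ** 0.5


# Hypothetical toy data: a noisy linear relation between one descriptor
# and one target property (values invented for illustration only).
rng = random.Random(42)
descriptor = [0.1 * i for i in range(40)]
target = [2.0 * x + 1.0 + rng.gauss(0.0, 0.1) for x in descriptor]

cv_rmse = k_fold_rmse(descriptor, target, k=5)
print(f"5-fold cross-validation RMSE: {cv_rmse:.3f}")
```

In this sketch the reported error should be comparable to the noise added to the toy data, since the model form matches the underlying relation; a model that overfits would typically show a cross-validation error noticeably larger than its training error.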
Speakers will include L. Ghiringhelli, R. Ouyang, A. Ziletti, M. Boley, and R. Ramprasad.
Besides focusing the summer school on the tools developed within the NOMAD CoE, we will also include external researchers at the forefront of data-driven materials science.
Selected topics are:
- Exploratory Data Analysis and Causal Inference (J. Vreeken)
- Data-driven Rational Materials Design (R. Ramprasad)
- High-throughput Calculations (G. Ceder)
- Dimensionality reduction for Big-Data analytics (M. Ceriotti)