The SimGrid History

SimGrid has been an active project for more than 10 years. Since its beginnings, SimGrid has evolved, changed, and tackled new systems. For instance, at its birth, SimGrid was dedicated to Grid simulation and was oriented toward scheduling simulations. Nowadays SimGrid deals with many kinds of distributed systems, ranging from grids to peer-to-peer systems, and allows users to create hybrid systems combining different characteristics.

Many people have asked about the origins of the SimGrid project, about the history of its development up to now, and about the plans for the future. Here they are, in (perhaps excruciating) detail.

1999: SimGrid v1

In 1999 Henri Casanova joined the AppLeS research group in the Computer Science and Engineering Department at the University of California at San Diego, as a postdoc. The AppLeS group, led by Francine Berman, focused mostly on the study of practical scheduling algorithms for parallel scientific applications on heterogeneous, distributed computing platforms. Shortly after Henri joined the group, he faced the need to run simulations instead of or in addition to merely running real-world experiments. At that time, Arnaud Legrand, a 1st year graduate student at Ecole Normale Superieure de Lyon, France, spent 2 months in the summer in the AppLeS group as a visiting student. He worked with Henri that summer on a research project as part of which he implemented an ad-hoc simulator.

After Arnaud left UCSD, Henri realized that it was very likely that every researcher in the AppLeS group would eventually need to run simulations, and that they would most likely all end up rewriting the same code at one point or another. He took apart the simulator that Arnaud had developed, and packaged it as a more generic simulation framework with a simple API, and called it SimGrid v1.0 (a.k.a. SG). This version was simple, and in retrospect a bit naive. However, it was surprisingly useful to study "centralized" scheduling (e.g., off-line scheduling of a DAG on a heterogeneous set of distributed compute nodes). SimGrid v1.0 was described in "SimGrid: A Toolkit for the Simulation of Application Scheduling, by Henri Casanova, in Proceedings of CCGrid 2001". Henri became the first user of SimGrid and used it for several research projects from then on.

2001: SimGrid v2

By 2001 Arnaud was engaged in his Ph.D. thesis work and started studying "decentralized" scheduling heuristics, that is, the ones in which scheduling decisions are made by more or less autonomous agents that typically have only partial knowledge of the applications and/or computing platform. Although simulating decentralized scheduling with SimGrid v1.0 was actually possible (and done by one Ph.D. student at UCSD in fact!), it was extremely cumbersome and limited in scope. So Arnaud built a layer on top of SG, which he called MSG (for Meta-SimGrid). MSG added threads and introduced the concept of independently running simulated processes that performed computations and communication tasks in a possibly asynchronous fashion. MSG was described in "MetaSimGrid: Towards realistic scheduling simulation of distributed applications, by Arnaud Legrand and Julien Lerouge, LIP Research Report". This resulted in the following layered architecture:

                (user code)
                -----------
                | MSG |   |
                -------   |
                |    SG   |
                -----------

With Henri and some of his students using SG and Arnaud using MSG, the project started having a (tiny) user base. It was time to be more ambitious and to address one of the key limitations of SG: its inability to simulate multi-hop network communications realistically. In the summer of 2003 Loris Marchal, a 1st year graduate student at Ecole Normale Superieure, came to UCSD to work with Henri. During that summer, based on results in the TCP modeling literature, he implemented a macroscopic network model as part of SG. This model dramatically increased the level of realism of SimGrid simulations and was initially described in: "A Network Model for Simulation of Grid Applications, by Loris Marchal and Henri Casanova, LIP research report". By the end of 2003 the work at UCSD and at Ecole Normale was merged into what became SimGrid v2, as described in: "Scheduling Distributed Applications: the SimGrid Simulation Framework, by Henri Casanova, Arnaud Legrand, and Loris Marchal, in Proceedings of CCGrid 2003".

2004: SimGrid v3

SimGrid v2, with its much improved features and capabilities, garnered a larger user base, and many friends and collaborators of Arnaud and Henri started using it for their research. One of these friends was Martin Quinson, then a Ph.D. student at Ecole Normale Superieure, who was working in the area of distributed resource monitoring systems. As part of his Ph.D. Martin attempted to develop a network topology discovery tool and quickly found out that it was difficult and required prototyping in simulation. Faced with the prospect of first implementing a throw-away prototype in simulation and then reimplementing the whole thing for production, Martin started working on a framework that would easily compile the same code in "simulation mode" or in "real-world mode". He found this ability to be invaluable when developing distributed systems and built his framework, called GRAS, on top of MSG (for the simulation mode) and on top of the socket layer (for the real-world mode). GRAS is described in "GRAS: A Research & Development Framework for Grid and P2P Infrastructures, by Martin Quinson, in Proceedings of PDCS 2006". This led to the following layered software architecture:

        (user code for either SG, MSG or GRAS)
        -----------------------------
        |   |     |    GRAS API     |
        |   |     -------------------
        |   |     |GRAS S | |GRAS R |
        |   |     --------- ---------
        |   |    MSG      | |sockets|
        |   --------------| ---------
        |        SG       |
        -------------------

At this point, with more users running more complex simulations, it became clear that the initial SG foundation inherited from SimGrid v1 was too limiting in terms of scalability and performance. In 2005 Arnaud took the bull by the horns and replaced SG with a new simulation engine called SURF, thus removing the SG API. Users reported acceleration factors of up to 3 orders of magnitude when going from SG to SURF. Furthermore, SURF is much more extensible than SG ever was and has enabled the evolution of simulation models used by SimGrid. Although it made sense at the time to re-implement GRAS on top of SURF, it was never accomplished due to the "too many things to do not enough time" syndrome. Martin added a layer on top of GRAS called AMOK, to implement high-level services needed by many distributed applications, thus leading to the new overall layered architecture:

    (user code for either MSG or GRAS -- using AMOK or not)
                           -------
                           | AMOK|
         -------------------------
         |     |    GRAS API     |
         |     -------------------
         |     |GRAS S | |GRAS R |
         |     --------- ---------
         |    MSG      | |sockets|
         --------------| ---------
         |   SURF      |
         ---------------

This architecture culminated in SimGrid v3! One development worth mentioning is that of SimDAG, written by Christophe Thiery during an internship with Martin Quinson. Many users had indeed asked for functionality similar to what the SG API provided in SimGrid v1 and v2, to study centralized scheduling without all the power of the MSG API. SimDAG provides an API especially for this purpose and was integrated in SimGrid v3.1, leading to the following layered architecture:

    (user code for either SimDag, MSG or GRAS)
                                -------
                                | AMOK|
       --------------------------------
       |      |     |    GRAS API     |
       |      |     -------------------
       |      |     |GRAS SG| |GRAS RL|
       |      |     --------- ---------
       |SimDag|    MSG      | |sockets|
       |--------------------| ---------
       |        SURF        |
       ----------------------

SimGrid 3.2, the current publicly available version as this document is being written, implements the above architecture and also provides a (partial) port to the Windows operating system.

2011: project status and work

As the project advances, it has become increasingly clear that there is a need for an intermediate layer between the base simulation engine, SURF, and higher level APIs. In the previously shown software architecture MSG plays the role of an intermediate layer between SURF and GRAS, but is itself a high-level API, which is not a very good design. Bruno Donassolo, during an internship with Arnaud, has developed an intermediate layer called SIMiX, and both GRAS and MSG have been rewritten on top of it.

Another development is that of SMPI, a framework to run unmodified MPI applications in either simulation mode or in real-world mode (sort of GRAS for MPI). The development of SMPI, by Mark Stillwell who works with Henri, is being greatly simplified thanks to the aforementioned SIMiX layer. Finally, somewhat unrelated, is the development of Java bindings for the MSG API by Malek Cherier who works with Martin. The current software architecture thus looks as follows:

    (user code for either SimDAG, MSG, GRAS, or MPI)
       ----------------------------------
       |      |   |jMSG|    |AMOK|      |
       |      |   -----|    ------      |
       |SimDag| MSG    | GRAS    | SMPI |     (Note that GRAS and SMPI also run on top of
       |      ---------------------------      sockets and MPI, not shown on the figure)
       |      |           SIMiX         |
       ----------------------------------
       |              SURF              |
       ----------------------------------

While the above developments are about adding simulation functionality, a large part of the research effort in the SimGrid project relates to simulation models. These models are implemented in SURF, and Arnaud has refactored SURF to make it more easily extensible so that one can experiment with different models, in particular different network models. Pedro Velho, who works with Arnaud, is currently experimenting with several new network models. Also, Kayo Fujiwara, who works with Henri, has interfaced SURF with (a patched version of) the GTNetS packet-level simulator.

The current architecture in the CVS tree at the time this document is being written is as follows:

    ----------------------------------
    |      |   |jMSG|    |AMOK|      |
    |      |   ------    ------      |
    |SimDag| MSG    | GRAS    | SMPI |  (Note that GRAS and SMPI also run on top of
    |      |        |     -------    |   sockets and MPI, not shown on the figure)
    |      |        |     |SMURF|    |  
    |      ---------------------------  
    |      |          SIMiX          |
    ----------------------------------
    |         SURF interface         |
    ----------------------------------
    |    SURF kernel   |    | GTNetS |
    | (several models) |    |        |
    --------------------    ----------

As of the end of 2011, projects have evolved widely:

  • ns-3 can now be used for packet-level simulation
  • Model Checking is constantly evolving
  • New network models and descriptions enable a low memory footprint and fast simulation times
  • SMPI works great now!
  • Many tools for tracing and simulations have been developed

Ongoing work

The primary short-term future direction is to develop a distributed version of SIMiX to increase the scalability of simulations in terms of memory. This can be done using the GRAS "real world" functionality to run SIMiX in a distributed fashion across multiple hosts, thus allowing running simulations that are not limited by the amount of memory on a single host. The simulation itself would still be centralized and sequential, meaning that a single simulated process would run at a time. Bruno Donassolo is currently working on this idea, which is currently called SMURF.

Future directions

Funding allows us to keep SimGrid evolving and growing, and to make it better and more useful. One of the constant challenges in this project is its duality: it is a useful tool for scientists (hence our efforts on APIs, portability, documentation, etc.), but it is also a scientific project in its own right (so that we can publish papers).

2008: USS-SimGrid

USS-SimGrid has been funded for three years (2009-2011) by the French National Research Agency (ANR) under contract no. ANR-08-SEGI-022.

The USS-SimGrid project aims at Ultra Scalable Simulations with SimGrid. The main goal of this project was to allow the use of SimGrid in the simulation of desktop grids and peer-to-peer settings.

The planned work spans several axes, split into work packages:

  • Improving the models used in SimGrid: increasing their scalability (WP1) and easing their instantiation (WP2);
  • Providing associated tools for experimenters, such as result analysis assistants (WP3) and test campaign managers (WP4);
  • Increasing the simulator scalability by parallelization (WP5) and optimization.

We aim at producing a scientific instrument directly usable by a large community. We work in a close loop with end-users to ensure that the tool is well adapted to their needs (WP6).

More information on the USS-SimGrid home page.

2012: SONGS

SONGS is funded for four years (2012-2015) by the French National Research Agency (ANR) under contract no. ANR-11-INFRA-13.

As demonstrated by the USS SimGrid project funded by the ANR in 2008, simulation has proved to be a very effective approach for studying such platforms. Although even more challenging, we think the issues raised by petaflop/exaflop computers and emerging cloud infrastructures can be addressed using similar simulation methodologies. The goal of the SONGS project is to extend the applicability of the SimGrid simulation framework from Grids and Peer-to-Peer systems to Clouds and High Performance Computation systems. Each type of large-scale computing system will be addressed through a set of use cases and led by researchers recognized as experts in this area. Any sound study of such systems through simulations relies on the following pillars of simulation methodology: Efficient simulation kernel; Sound and validated models; Simulation analysis tools; Campaign simulation management.

More information on the SONGS webpage

2017: HAC-SPECIS

The goal of the HAC SPECIS (High-performance Application and Computers: Studying PErformance and Correctness In Simulation) project is to answer the methodological needs of HPC application and runtime developers and to allow the study of real HPC systems from both the correctness and performance points of view. To this end, we gather experts from the HPC, formal verification and performance evaluation communities.

Context

In the last decades, both hardware and software of modern computers have become increasingly complex. Multi-core architectures comprising several accelerators (GPUs or the Intel Xeon Phi) and interconnected by high-speed networks have become mainstream in the field of High-Performance Computing. Obtaining the maximum performance of such heterogeneous machines requires breaking the traditional uniform programming paradigm. To scale, application developers have to make their code as adaptive as possible and to release synchronizations as much as possible. They also have to resort to sophisticated and dynamic data management, load balancing, and scheduling strategies. This evolution has several consequences:

  • First, this increasing complexity and the release of synchronizations are even more error-prone than before. The resulting bugs may almost never occur at small scale but systematically occur at large scale and in a non-deterministic way, which makes them particularly difficult to identify and eliminate.
  • Second, the dozens of software stacks and their interactions have become so complex that predicting the performance (in terms of time, resource usage and energy) of the system as a whole is extremely difficult. Understanding and configuring such systems has therefore become a key challenge.

We believe these two challenges related to correctness and performance can be answered by gathering the skills of experts in formal verification, performance evaluation and high performance computing. The goal of the HAC SPECIS Inria Project Laboratory is to answer the methodological needs raised by the recent evolution of HPC architectures by allowing application and runtime developers to study such systems from both the correctness and performance points of view.