Welcome to the SMPI/PARAVER integration
This project has been on hold since 2015. The web page is kept for reference. Paraver traces unfortunately lack some important information needed by SMPI, so the integration is compromised unless heavier tracing is done in Extrae, which is not desirable for performance reasons.
1. Installation
Here are the different files needed for this integration.
- smpi-paraver/pjdump2prv.pl
- smpi-paraver/prv2pj.pl
- smpi-paraver/dimemas-wrapper.sh
- smpi-paraver/install.sh
- smpi-paraver/smpi2pj.sh
- Makefile
For this to work "system-wide", these Perl and shell scripts should be in the PATH. Eventually, they will be shipped with SMPI.
2. Achievements
2.1. April 2013 (Grenoble)
- First prototype of a prv to csv conversion
2.2. October 2014 (Grenoble @ BSC, for Mont-Blanc)
- First prototype of an integration of SMPI with Paraver
2.3. November 2014 (Grenoble and BSC @ Chicago, for JLESC)
- Discussions between Judit and Arnaud on how to model a new machine in SimGrid.
- Improve state naming with Harald
- Minor cleanups
2.4. June 2015 (Grenoble @ BSC, for JLESC)
- Moved this project from blog entry to the contrib section of SimGrid
- Got access to MareNostrum and ensured we can compile and run pj_dump and SMPI.
- Major reorganization and cleanups of the scripts, revamp of the documentation.
- Judit had an issue with a very simple trace she had generated directly with smpicc. The catch is that I was not converting Send and Recv yet, as BigDFT only has collective operations. This is now fixed.
- Jesus confirmed that he was running into this TCP_RTO issue as well on some machines, that fixing it was hard, and that being able to account for it would definitely be useful to understand how sensitive applications can be to it.
Jesus provided me with large traces to play with.
tar jtf /home/alegrand/Work/SimGrid/bsc/lulesh.tar.bz2
lulesh2.0_p1000_n500_t1.chop1.pcf
lulesh2.0_p1000_n500_t1.chop1.prv
lulesh2.0_p1000_n500_t1.chop1.row
lulesh2.0_p1331_n666_t1.chop1.pcf
lulesh2.0_p1331_n666_t1.chop1.prv
lulesh2.0_p1331_n666_t1.chop1.row
lulesh2.0_p216_n108_t1.chop1.pcf
lulesh2.0_p216_n108_t1.chop1.prv
lulesh2.0_p216_n108_t1.chop1.row
lulesh2.0_p512_n256_t1.chop1.pcf
lulesh2.0_p512_n256_t1.chop1.prv
lulesh2.0_p512_n256_t1.chop1.row
lulesh2.0_p729_n365_t1.chop1.pcf
lulesh2.0_p729_n365_t1.chop1.prv
lulesh2.0_p729_n365_t1.chop1.row
It turns out that my prv2pj.pl script currently fails on these traces, as they comprise two MPI processes per node. I could easily fix that, but for now I have many issues with send and receive matching.
2.5. July 2015 (Grenoble @ BSC, for Harald Servat's PhD defense)
I kept investigating this Isend/Irecv/Wait matching issue and it seems much more difficult than expected. I think there is not enough information in the paraver format to do it properly.
3. Roadmap
3.1. TODO Interaction between Paraver and SMPI [0/2]
- [ ] Make a model of MareNostrum, the Mont-Blanc prototype, so that BSC staff can really play with SMPI. (Edit: this was discussed in Chicago with Judit. I explained the SimGrid XML platform representation to her and she will try to play with SMPI and come back to me with questions.)
- [ ] Convert the 12 GB Nancy LU trace (700 processes on 3 clusters) to Paraver to see whether the behavior exhibited by Ocelotl can be observed in Paraver. This involves slightly modifying the Paje to Paraver converter, which was designed for SMPI Paje traces.
This trace was on flutin and I got it here: file:///exports/nancy_700_lu.C.700.pjdump.bz2
Most of these issues are specific to this trace, so they can be ignored by anyone but me.
- [ ] Fix the state name conversion and the event conversion
- [ ] In pjdump2prv.pl there is probably something wrong with the number of communicators. I use $nb_nodes at the moment.
- [ ] The resulting prv starts from the pjdump and I forgot to sort it. Could we give an option to pjdump so that it sorts it according to time?
- [ ] Do not use state 0 as it's reserved for computation
- [ ] Create a state and event for the MPI application (derived from being outside MPI calls)
- [ ] Clock resolution issue
4. Description of the interaction between Paraver and SMPI
We explain in this document how SMPI can be used as an alternative to Dimemas within the paraver framework. To this end, we need to make sure that SMPI can simulate paraver traces and output paraver traces.
Ideally, we would modify SMPI so that it can parse and generate such traces itself. It is an option we keep in mind, as it would be much cleaner and faster, but it would require us to:
- scavenge the Dimemas trace parsing (in C/C++) and meld it with the SMPI trace replay;
- make sure SMPI can directly generate the Paraver format.
This is potentially a lot of work to do within our time frame, so instead we decided to go for simple trace conversions, i.e., a Paraver to SMPI time-independent trace format conversion and a Paje to Paraver conversion.
Some simple sample traces are available here:
4.1. Paraver to CSV and SMPI format Conversion
Method
Juan Gonzalez provided us with a description of the Paraver and Dimemas formats. The Paraver description is available here, i.e., from the Paraver documentation. Remember that the pcf file describes events, the row file defines the cpu/node/thread mapping, and the prv file is the trace with all the events. I reworked my old script to convert from Paraver to csv, pjdump and the SMPI time-independent trace format during the night. Unfortunately, in the morning, Juan explained to me that I should not trust the state records but only the event and communication records. Ideally, I should have worked from the Dimemas trace instead of the Paraver trace to obtain the SMPI trace, but at least this allowed me to get a converter to csv/pjdump, which is very useful to Damien for framesoc/ocelotl.
So I really struggled to make it work and had to make several assumptions and "ugly hacks" (indicated in the code). In particular, something that is really ugly at the moment is that the V collective operations, where sends and receives are process specific, appear as many times as there are processes, and since I translate on the fly, I do not produce a correct input for SMPI. The easiest solution to handle this is probably to have two passes, but never mind for a first proof of concept.
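For reference, here is a minimal sketch of how such an on-the-fly conversion can dispatch on the prv record types. The field layout is my reading of the Paraver documentation and of the excerpts further down this page; this is not the actual prv2pj.pl code.

#!/usr/bin/perl
# Minimal sketch: stream a .prv file and dispatch on the record type.
# Field layout is an assumption based on the Paraver documentation and on the
# excerpts shown on this page; the real prv2pj.pl handles many more cases.
use strict;
use warnings;

while (defined(my $line = <>)) {
    chomp $line;
    next unless $line =~ /^[123]:/;    # skip the #Paraver header and comment lines
    my @f = split /:/, $line;
    if ($f[0] == 1) {                  # state record: 1:cpu:appl:task:thread:begin:end:state
        my ($cpu, $appl, $task, $thread, $begin, $end, $state) = @f[1 .. 7];
        # ... emit a csv/pjdump State line ...
    } elsif ($f[0] == 2) {             # event record: 2:cpu:appl:task:thread:time:type:value[...]
        my ($cpu, $appl, $task, $thread, $time, $type, $value) = @f[1 .. 7];
        # in these traces, type 50000001 carries the MPI call
        # (e.g. value 3 = MPI_Isend, 0 = outside MPI), see the excerpt below
    } elsif ($f[0] == 3) {             # communication record: sender, logical/physical send,
                                       # receiver, logical/physical receive, size, tag
        # ... see the two-pass sketch further down ...
    }
}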
head paraver_trace/bigdft_8_rl.csv
State, 1, MPI_STATE, 0, 10668, 10668, 0, Not created
State, 2, MPI_STATE, 0, 5118733, 5118733, 0, Not created
State, 3, MPI_STATE, 0, 9374527, 9374527, 0, Not created
State, 4, MPI_STATE, 0, 17510142, 17510142, 0, Not created
State, 5, MPI_STATE, 0, 5989994, 5989994, 0, Not created
State, 6, MPI_STATE, 0, 5737601, 5737601, 0, Not created
State, 7, MPI_STATE, 0, 5866978, 5866978, 0, Not created
State, 8, MPI_STATE, 0, 5891099, 5891099, 0, Not created
State, 1, MPI_STATE, 10668, 25576057, 25565389, 0, Running
State, 2, MPI_STATE, 5118733, 18655258, 13536525, 0, Running
TODO Regression tests
Currently, it works well on an old small 8 node BigDFT paraver trace.
TODO Cleanups
A few ugly things had to be done here (reduce, alltoallV, no handling of p2p operations, the second/nanosecond issue, …) and need to be cleaned up.
TODO Extrae extension
Maybe it would be interesting to have an option that allows Extrae to trace all the parameters?
TODO Distinguish between MPI process and nodes
Give lulesh2.0_p216_n108_t1.chop1 a try.
TODO Correctly handle (I)sends and (I)recv
Here is an excerpt from lulesh2.0_p216_n108_t1.chop1.prv
2:10:1:10:1:625871180:50000001:3          # 2:...:50000001:3 is MPI_Isend
1:10:1:10:1:626136517:626252559:1
2:10:1:10:1:626136517:50000001:0          # 2:...:50000001:0 is Outside MPI
3:10:1:10:1:625871180:626136517:46:1:46:1:429496:677301531:104544:1024
                                          # This is a communication starting at the same
                                          # time as the MPI_Isend (625871180) and ending
                                          # at 677301531. The 104544 is the size and the
                                          # 1024 is the tag.
                                          # This tells us that the emitter is process 10:1:10:1
                                          # while the receiver is process 46:1:46:1
1:10:1:10:1:626252559:626484354:10
2:10:1:10:1:626252559:50000001:3          # again, another MPI_Isend
1:10:1:10:1:626484354:626601813:1
2:10:1:10:1:626484354:50000001:0          # computing outside MPI, blabla
...
...
...
                                          # and way later....
...
1:46:1:46:1:677298906:677301531:8
2:46:1:46:1:677298906:50000001:5          # and here finally comes the MPI_Wait on the receiver side
1:46:1:46:1:677301531:677416782:1
2:46:1:46:1:677301531:50000001:0          # followed by computations
                                          # But I could not find anything about the
                                          # corresponding Irecv...
So to summarize, here is where the different pieces of information about this particular communication appear in the file:
cd ~/Work/SimGrid/bsc/
grep -n -e 625871180 -e 677301531 -e 677298906 ~/Work/SimGrid/bsc/lulesh2.0_p216_n108_t1.chop1.prv | sed 's/:/ /'
870 1:10:1:10:1:625786054:625871180:1
872 1:10:1:10:1:625871180:626136517:10
873 2:10:1:10:1:625871180:50000001:3
876 3:10:1:10:1:625871180:626136517:46:1:46:1:429496:677301531:104544:1024
10046 1:46:1:46:1:677297322:677298906:1
10048 1:46:1:46:1:677298906:677301531:8
10049 2:46:1:46:1:677298906:50000001:5
10050 1:46:1:46:1:677301531:677416782:1
10051 2:46:1:46:1:677301531:50000001:0
There is thus no way to do an online conversion without storing every communication, so I'll go for a two-pass conversion: I first parse all the "3:" lines, which contain the information about who, what and when, and then use this information when converting the "2:" lines, which explain how the communication is done.
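Here is a rough sketch of that two-pass idea. The %comm index, keyed by the sender object and the logical send time, is a hypothetical layout of mine, not the actual prv2pj.pl data structure; the field positions follow the excerpt above.

# Pass 1: index every communication ("3:") record by sender object and logical send time.
# Pass 2: when converting an MPI event ("2:") record, look up the matching communication.
use strict;
use warnings;

my (%comm, @events);
open my $in, '<', $ARGV[0] or die "cannot open $ARGV[0]: $!";
while (<$in>) {
    chomp;
    if (/^3:/) {
        my (undef, @f) = split /:/;
        my $sender = join ':', @f[0 .. 3];      # cpu:appl:task:thread of the sender
        my $lsend  = $f[4];                     # logical send time
        $comm{"$sender/$lsend"} = {             # collisions ignored in this sketch
            receiver => join(':', @f[6 .. 9]),
            lrecv    => $f[10], precv => $f[11],
            size     => $f[12], tag   => $f[13],
        };
    } elsif (/^2:/) {
        push @events, $_;                       # keep event records for the second pass
    }
}
close $in;

for my $ev (@events) {                          # second pass over the stored events
    my (undef, @f) = split /:/, $ev;
    my ($obj, $time) = (join(':', @f[0 .. 3]), $f[4]);
    if (my $c = $comm{"$obj/$time"}) {
        # emit a send towards $c->{receiver} of $c->{size} bytes with tag $c->{tag}
    }
}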
Unfortunately, even when doing this, I can get what I think is the corresponding Wait but not the receive operation. And this is where things get strange: the first Irecv operations on this node appear way later:
grep -n -e '2:46:1:46:1:.*:50000001:[4]' ~/Work/SimGrid/bsc/lulesh2.0_p216_n108_t1.chop1.prv | sed 's/:/ /' | head
38519 2:46:1:46:1:730783842:50000001:4
38535 2:46:1:46:1:730792925:50000001:4
38539 2:46:1:46:1:730796759:50000001:4
38547 2:46:1:46:1:730800800:50000001:4
38553 2:46:1:46:1:730804050:50000001:4
38561 2:46:1:46:1:730807217:50000001:4
38569 2:46:1:46:1:730810342:50000001:4
38575 2:46:1:46:1:730813384:50000001:4
38593 2:46:1:46:1:730829009:50000001:4
38601 2:46:1:46:1:730832676:50000001:4
Actually, the first point-to-point operations on this process are Isends and Waits, and only then, way later, come the Irecvs.
grep -n -e '2:46:1:46:1:.*:50000001:[3456]' ~/Work/SimGrid/bsc/lulesh2.0_p216_n108_t1.chop1.prv | sed 's/:/ /' | head -n 55
1541 2:46:1:46:1:643680033:50000001:3
1550 2:46:1:46:1:644017869:50000001:3
1560 2:46:1:46:1:644360913:50000001:3
1572 2:46:1:46:1:644700749:50000001:3
1591 2:46:1:46:1:645320878:50000001:3
1610 2:46:1:46:1:645643089:50000001:3
1637 2:46:1:46:1:645925591:50000001:3
1647 2:46:1:46:1:645953216:50000001:3
1652 2:46:1:46:1:645976341:50000001:3
1657 2:46:1:46:1:645999716:50000001:3
1662 2:46:1:46:1:646020758:50000001:3
1672 2:46:1:46:1:646042967:50000001:3
1682 2:46:1:46:1:646067342:50000001:3
1692 2:46:1:46:1:646085550:50000001:3
1702 2:46:1:46:1:646104551:50000001:3
1711 2:46:1:46:1:646127342:50000001:3
1724 2:46:1:46:1:646148468:50000001:3
1739 2:46:1:46:1:646170051:50000001:3
1744 2:46:1:46:1:646188135:50000001:3
1749 2:46:1:46:1:646202593:50000001:3
1762 2:46:1:46:1:646216468:50000001:3
1773 2:46:1:46:1:646241302:50000001:3
1778 2:46:1:46:1:646254093:50000001:3
1783 2:46:1:46:1:646267802:50000001:3
1796 2:46:1:46:1:646279260:50000001:3
1803 2:46:1:46:1:646291969:50000001:3
1808 2:46:1:46:1:646306552:50000001:6
10049 2:46:1:46:1:677298906:50000001:5
10079 2:46:1:46:1:677416782:50000001:5
10100 2:46:1:46:1:677530782:50000001:5
21162 2:46:1:46:1:707872381:50000001:5
21231 2:46:1:46:1:708029966:50000001:5
21459 2:46:1:46:1:708461719:50000001:5
34674 2:46:1:46:1:728184448:50000001:5
34694 2:46:1:46:1:728210073:50000001:5
34703 2:46:1:46:1:728236323:50000001:5
34712 2:46:1:46:1:728246740:50000001:5
34716 2:46:1:46:1:728256157:50000001:5
34730 2:46:1:46:1:728263990:50000001:5
34736 2:46:1:46:1:728273282:50000001:5
34744 2:46:1:46:1:728284823:50000001:5
34756 2:46:1:46:1:728293115:50000001:5
34768 2:46:1:46:1:728301365:50000001:5
34778 2:46:1:46:1:728313407:50000001:5
34796 2:46:1:46:1:728320824:50000001:5
34813 2:46:1:46:1:728330115:50000001:5
34821 2:46:1:46:1:728334740:50000001:5
34857 2:46:1:46:1:728352824:50000001:5
34869 2:46:1:46:1:728358116:50000001:5
38485 2:46:1:46:1:730759633:50000001:5
38489 2:46:1:46:1:730765425:50000001:5
38497 2:46:1:46:1:730769883:50000001:5
38501 2:46:1:46:1:730774300:50000001:5
38519 2:46:1:46:1:730783842:50000001:4
38535 2:46:1:46:1:730792925:50000001:4
So in the previous code, there are first 26 Isends, then a Waitall that probably generates the 26 corresponding Waits (but how can I know which requests were actually provided to the Waitall???), and then finally the corresponding series of MPI_Irecv that will be waited on way later. It turns out that on the receiver side, MPI handles the receptions while waiting on the Isends, but I thus have absolutely no way to match them, to know in which order the receives were done, or on which particular receives the receiver was waiting (even if it seems that in this particular case it did not happen).
Actually, after discussing this with Judit, it appears that the trace was cut. So the Waits we see on the receiver side actually correspond to Irecvs that are not present in the trace (and not to previous Isends as I initially thought). E.g.,
grep -n -e '3:.*46:1:46:1:.*' ~/Work/SimGrid/bsc/lulesh2.0_p216_n108_t1.chop1.prv | sed 's/:/ /' | head -n 54
876 3:10:1:10:1:625871180:626136517:46:1:46:1:429496:677301531:104544:1024
1369 3:9:1:9:1:636610547:636635631:46:1:46:1:456247:728238073:1584:1024
1544 3:46:1:46:1:643680033:643922576:10:1:10:1:241942:648985350:104544:1024
1555 3:46:1:46:1:644017869:644238078:82:1:82:1:248060:679151189:104544:1024
1570 3:46:1:46:1:644360913:644574248:40:1:40:1:109029:709356245:104544:1024
1581 3:46:1:46:1:644700749:644918042:52:1:52:1:300746:702191368:104544:1024
1594 3:46:1:46:1:645320878:645333170:45:1:45:1:455956:702425925:104544:1024
1633 3:46:1:46:1:645643089:645908674:47:1:47:1:202569:730677120:104544:1024
1645 3:46:1:46:1:645925591:645948883:39:1:39:1:119654:710792058:1584:1024
1650 3:46:1:46:1:645953216:645969966:4:1:4:1:189402:708996725:1584:1024
1655 3:46:1:46:1:645976341:645992758:9:1:9:1:268400:715031158:1584:1024
1660 3:46:1:46:1:645999716:646016675:53:1:53:1:691480:727903557:1584:1024
1670 3:46:1:46:1:646020758:646036925:88:1:88:1:403516:711836359:1584:1024
1680 3:46:1:46:1:646042967:646058425:83:1:83:1:238664:730162810:1584:1024
1690 3:46:1:46:1:646067342:646081717:51:1:51:1:356288:714355232:1584:1024
1700 3:46:1:46:1:646085550:646100717:76:1:76:1:244486:730628399:1584:1024
1707 3:46:1:46:1:646104551:646119176:81:1:81:1:313561:715956508:1584:1024
1719 3:46:1:46:1:646127342:646144801:41:1:41:1:290240:727544484:1584:1024
1735 3:46:1:46:1:646148468:646164718:16:1:16:1:297752:719185976:1584:1024
1742 3:46:1:46:1:646170051:646185510:11:1:11:1:89850:719665701:1584:1024
1747 3:46:1:46:1:646188135:646200635:3:1:3:1:202611:711050599:24:1024
1757 3:46:1:46:1:646202593:646214676:75:1:75:1:265069:722029411:24:1024
1769 3:46:1:46:1:646216468:646228301:5:1:5:1:478618:728457736:24:1024
1776 3:46:1:46:1:646241302:646251802:77:1:77:1:639886:733204626:24:1024
1781 3:46:1:46:1:646254093:646265802:15:1:15:1:314170:721488302:24:1024
1792 3:46:1:46:1:646267802:646277427:87:1:87:1:468891:722540417:24:1024
1801 3:46:1:46:1:646279260:646289885:17:1:17:1:71365:728537857:24:1024
1806 3:46:1:46:1:646291969:646304427:89:1:89:1:544023:728380413:24:1024
2076 3:88:1:88:1:648286682:648319557:46:1:46:1:462622:728258657:1584:1024
2206 3:4:1:4:1:648613124:648631708:46:1:46:1:453288:728228656:1584:1024
2658 3:51:1:51:1:651750721:651793430:46:1:46:1:480413:728274907:1584:1024
3363 3:83:1:83:1:654689958:654751209:46:1:46:1:476538:728265782:1584:1024
4280 3:82:1:82:1:656929413:657166832:46:1:46:1:433705:677419823:104544:1024
5292 3:52:1:52:1:660879922:661167926:46:1:46:1:440705:707876673:104544:1024
9355 3:45:1:45:1:675422726:675439059:46:1:46:1:443830:708035007:104544:1024
10276 3:81:1:81:1:678370724:678409683:46:1:46:1:486497:728295032:1584:1024
11017 3:39:1:39:1:681302334:681349627:46:1:46:1:449997:728192031:1584:1024
11410 3:75:1:75:1:682741815:682787982:46:1:46:1:500747:728336699:24:1024
12859 3:76:1:76:1:686359862:686386779:46:1:46:1:483788:728287407:1584:1024
13284 3:5:1:5:1:687447840:687458590:46:1:46:1:503830:728355032:24:1024
14768 3:17:1:17:1:691480245:691684582:46:1:46:1:515247:730771675:24:1024
15926 3:41:1:41:1:694613356:694662273:46:1:46:1:489372:728303032:1584:1024
16621 3:87:1:87:1:696696110:696711985:46:1:46:1:512122:730766883:24:1024
17045 3:16:1:16:1:698185630:698201922:46:1:46:1:492039:728315449:1584:1024
17677 3:40:1:40:1:700177777:700446073:46:1:46:1:437288:707713047:104544:1024
17774 3:11:1:11:1:700480087:700788134:46:1:46:1:495164:728322324:1584:1024
20077 3:53:1:53:1:705672360:705713485:46:1:46:1:459580:728248865:1584:1024
21080 3:3:1:3:1:707627740:707643615:46:1:46:1:497997:728332115:24:1024
21595 3:89:1:89:1:708657448:708778949:46:1:46:1:519164:730776092:24:1024
26894 3:15:1:15:1:717486244:717532578:46:1:46:1:509372:730762717:24:1024
33288 3:47:1:47:1:723885205:725113262:46:1:46:1:446997:727742403:104544:1024
38242 3:77:1:77:1:730635045:730651337:46:1:46:1:506622:730746717:24:1024
48766 3:46:1:46:1:771492475:771554143:10:1:10:1:728395763:778474104:209088:2048
48833 3:46:1:46:1:771814853:771848228:40:1:40:1:730372678:799469231:209088:2048
As can be seen, except for the last two entries, all the logical receive times are completely bogus. From this, we can make the following assumptions (which will be quite annoying if the control flow of the code is non-deterministic):
- If we find an Isend, a Wait or an Irecv whose date does not correspond to a communication event (a "3:" record), we should just skip it (see the sketch below).
- If we find a Waitall, well, we need to think about it to be sure. :(
We will still have the issue that we should actually collect the request ids to do the matching correctly and to handle wait_any and wait_all gracefully.
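A minimal sketch of the first rule, reusing the hypothetical %comm index from the previous snippet plus a symmetric %comm_recv index keyed by the receiver and the physical receive time (which side a Wait should be matched against is itself part of the open question):

# Hypothetical helper implementing the skipping rule above.
our (%comm, %comm_recv);   # built during the first pass, as in the previous sketch

sub keep_p2p_event {
    my ($obj, $time, $value) = @_;   # value: 3 = Isend, 4 = Irecv, 5 = Wait
    return 1 unless $value == 3 || $value == 4 || $value == 5;
    return 1 if exists $comm{"$obj/$time"};        # date matches a send side
    return 1 if exists $comm_recv{"$obj/$time"};   # date matches a receive side
    return 0;                                      # unmatched: skip it
}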
4.2. Let's try to replay on SMPI
Method
This is the platform file I currently use for replaying:
The script used for calling SMPI is actually quite simple: smpi2pj.sh
DONE Use command line arguments
TODO Trace the running state (outside MPI)
Currently, it does not appear in the Paje trace.
4.3. Pjdump/smpi to Paraver Conversion
Method
Here is my ugly script with many hardcoded values: pjdump2prv.pl
DONE Collective naming
Improve the conversion to export events so that collective operation names are the same and things are easily comparable. This was done in Chicago with Harald.
DONE Factorization
There were originally two scripts (pjdump2prv.pl and pjsmpi2prv.pl). I've finally merged them into a single one: pjdump2prv.pl
TODO Links
Add links (arrows) so that bandwidth can be computed in paraver
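Here is a rough idea of what this could look like. The pjdump Link field order, the rank-to-Paraver-object mapping and the missing size/tag values are all assumptions of mine, so this is only a sketch and not the final pjdump2prv.pl code:

# Rough sketch: emit Paraver communication ("3:") lines from pjdump Link records.
# The pjdump field order assumed below is (container, type, start, end, duration,
# value, source, destination); size and tag are not available in such a line.
use strict;
use warnings;

my $timescale = 1e9;                  # pjdump times in seconds, prv times in nanoseconds
while (<>) {
    chomp;
    next unless /^Link/;
    my (undef, $container, $type, $start, $end, $duration, $value, $src, $dst)
        = map { s/^\s+|\s+$//gr } split /,/;
    my ($s, $r) = (object_of($src), object_of($dst));   # cpu:appl:task:thread objects
    my $send_date = int($start * $timescale);           # logical = physical send, for
    my $recv_date = int($end   * $timescale);           # lack of better information
    my ($size, $tag) = (0, 0);                          # not available in pjdump Links
    print "3:$s:$send_date:$send_date:$r:$recv_date:$recv_date:$size:$tag\n";
}

sub object_of {                       # hypothetical rank -> Paraver object mapping
    my ($rank) = @_;
    $rank =~ s/\D//g;                 # e.g. "rank-9" -> 9
    return join ':', $rank + 1, 1, $rank + 1, 1;
}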
4.4. Gluing everything together to allow calling SMPI
Method
The Dimemas wrapper called by paraver is file:///usr/local/stow/wxparaver-4.5.4-linux-x86_64/bin/dimemas-wrapper.sh
Here is how I proceeded. I made a backup copy of it:
mv /usr/local/stow/wxparaver-4.5.4-linux-x86_64/bin/dimemas-wrapper.sh /usr/local/stow/wxparaver-4.5.4-linux-x86_64/bin/dimemas-wrapper.sh.backup
Basically, what I wanted to do is something like
perl prv2pj.pl
sh smpi2pj.sh >/dev/null
perl pjsmpi2prv.pl
So here is an equivalent version inspired by the Dimemas wrapper: dimemas-wrapper.sh
TODO Library issue
When running inside Paraver, I can't call pj_dump from my Perl script. When trimming the fat to get an error message, here is what you can get:
---> Input file is a paje trace. Running /home/alegrand/bin/pj_dump /tmp/EXTRAE_Paraver_trace_mpich.sim.trace 2>&1
-----> /home/alegrand/bin/pj_dump: /usr/local/stow/wxparaver-4.5.4-linux-x86_64/lib/paraver-kernel/libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by /home/alegrand/lib/libpaje.so.1)
---> Intermediary file /tmp/EXTRAE_Paraver_trace_mpich.sim.pjdump
I think this is due to the fact that Paraver is often statically compiled and must be doing something strange with dynamic library preloading.
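An untested possible workaround, assuming the culprit is indeed the LD_LIBRARY_PATH (and possibly LD_PRELOAD) that wxparaver sets up to point at its bundled libraries: scrub these variables in the Perl script just before spawning pj_dump, so that the child resolves the system libstdc++ again.

# Hypothetical workaround: call pj_dump with a cleaned-up dynamic-linker environment
# so that it does not pick up Paraver's bundled libstdc++.
sub run_pj_dump {
    my ($trace) = @_;
    local $ENV{LD_LIBRARY_PATH} = '';   # drop whatever wxparaver injected
    local $ENV{LD_PRELOAD}      = '';
    my $out = `pj_dump $trace 2>&1`;
    die "pj_dump failed:\n$out" if $?;
    return $out;
}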
TODO Better integration
Currently, we replace the Dimemas wrapper and the platform file is hardcoded… This should be changed to allow specifying the platform and deployment.