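High-performance and distributed computing jobs often run for days, weeks, or even months. If a job is interrupted partway through by a power failure or some other unavoidable problem, a great deal of time and labor is wasted, along with the electricity the supercomputer consumed during the run. We have not had power problems, but we recently hit a related situation: our lab had to undergo a mandatory safety inspection that required every computer to be shut down. So we need a tool that can pause a program at a checkpoint, save its state to disk, and resume it on demand, much as a programmer sets a breakpoint in an IDE debugger and then single-steps or resumes execution.
In a complex distributed environment with multiple hosts and multiple threads, checkpointing a program is not easy, because parts of it may be running on other hosts. DMTCP (Distributed MultiThreaded CheckPointing) is one of the tools we are currently evaluating. One reason we like it is that it works without modifying the Linux kernel and does not depend on kernel modules.
First, install the packages needed to build DMTCP: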
$ sudo apt-get install build-essential
Download the DMTCP source, then unpack, configure, build, and install it:
$ wget http://ufpr.dl.sourceforge.net/project/dmtcp/dmtcp/1.2.7/dmtcp-1.2.7.tar.gz
$ tar zxvf dmtcp-1.2.7.tar.gz
$ cd dmtcp-1.2.7
$ ./configure
$ make
$ sudo make install
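After the install step it can be worth confirming that the commands used in the rest of this walkthrough are actually reachable. The following is only a minimal sketch: the function name `missing_dmtcp_tools` is made up for illustration, and the list of commands is simply the four that appear in this post.

```shell
# Report which DMTCP commands used in this walkthrough are not on PATH.
# Prints one missing name per line; prints nothing when all are found.
# (Hypothetical helper, not part of DMTCP.)
missing_dmtcp_tools() {
    for tool in dmtcp_coordinator dmtcp_checkpoint dmtcp_restart dmtcp_command; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}
```

Running `missing_dmtcp_tools` right after `sudo make install` should print nothing; any name it does print points to a PATH or install-prefix problem.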
Let's start with an example program. Change to the test directory inside dmtcp-1.2.7 and run dmtcp1. It is a very simple program that prints the numbers 1 2 3 … in order:
$ cd test
$ ./dmtcp1
1 2 3 4 5 6 ^C
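Next we want to interrupt this example program and see whether it can be restored. The steps are as follows.
Open another terminal window (or a screen session) and start the coordinator, dmtcp_coordinator. Use l to list the currently connected nodes: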
$ dmtcp_coordinator
dmtcp_coordinator starting...
    Port: 7779
    Checkpoint Interval: disabled (checkpoint manually instead)
    Exit on last client: 0
Type '?' for help.
?
COMMANDS:
  l : List connected nodes
  s : Print status message
  c : Checkpoint all nodes
  i : Print current checkpoint interval
      (To change checkpoint interval, use dmtcp_command)
  f : Force a restart even if there are missing nodes (debugging only)
  k : Kill all nodes
  q : Kill all nodes and quit
  ? : Show this message
l
Client List:
#, PROG[PID]@HOST, DMTCP-UNIQUEPID, STATE
[7845] NOTE at dmtcp_coordinator.cpp:1039 in onConnect; REASON='worker connected'
     hello_remote.from = b51cf8-7846-516fbedc(-1)
Then run the dmtcp1 example again, this time under dmtcp_checkpoint:
$ dmtcp_checkpoint ./dmtcp1
dmtcp_checkpoint (DMTCP + MTCP) 1.2.7
Copyright (C) 2006-2011 Jason Ansel, Michael Rieker, Kapil Arya, and Gene Cooperman
This program comes with ABSOLUTELY NO WARRANTY.
This is free software, and you are welcome to redistribute it
under certain conditions; see COPYING file for details.
(Use flag "-q" to hide this message.)
1 2 3
Back in the coordinator, type l again and a new node entry appears: the dmtcp1 program (PID 7846) is running on the host named vpseedev.
l
Client List:
#, PROG[PID]@HOST, DMTCP-UNIQUEPID, STATE
1, dmtcp1[7846]@vpseedev, b51cf8-7846-516fbedc, RUNNING
Now take a checkpoint from the coordinator with the c command:
c
[7845] NOTE at dmtcp_coordinator.cpp:1315 in startCheckpoint; REASON='starting checkpoint, suspending all nodes'
     s.numPeers = 1
[7845] NOTE at dmtcp_coordinator.cpp:1317 in startCheckpoint; REASON='Incremented Generation'
     UniquePid::ComputationId().generation() = 1
[7845] NOTE at dmtcp_coordinator.cpp:643 in onData; REASON='locking all nodes'
[7845] NOTE at dmtcp_coordinator.cpp:678 in onData; REASON='draining all nodes'
[7845] NOTE at dmtcp_coordinator.cpp:684 in onData; REASON='checkpointing all nodes'
[7845] NOTE at dmtcp_coordinator.cpp:694 in onData; REASON='building name service database'
[7845] NOTE at dmtcp_coordinator.cpp:713 in onData; REASON='entertaining queries now'
[7845] NOTE at dmtcp_coordinator.cpp:718 in onData; REASON='refilling all nodes'
[7845] NOTE at dmtcp_coordinator.cpp:747 in onData; REASON='restarting all nodes'
[7845] NOTE at dmtcp_coordinator.cpp:919 in onDisconnect; REASON='client disconnected'
     client.identity() = b51cf8-7846-516fbedc
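The c command above is typed into the coordinator's console by hand. For an unattended run, such as a script executed before a planned shutdown, the coordinator's help text points to dmtcp_command as the way to talk to it from outside. The sketch below rests on assumptions not shown in this post: that dmtcp_command accepts `--checkpoint`, that it reaches the coordinator on its default host and port, and that image files show up as `ckpt_*.dmtcp` in the directory you pass in. The function name `request_checkpoint` is made up for illustration.

```shell
# Ask a running dmtcp_coordinator for a checkpoint, then wait until an
# image file is visible. Hypothetical helper, not part of DMTCP.
# Assumption: `dmtcp_command --checkpoint` is a valid invocation.
#   $1 = directory where the checkpointed program was started (default: .)
#   $2 = seconds to wait before giving up (default: 60)
request_checkpoint() (
    dir=${1:-.}
    timeout=${2:-60}

    dmtcp_command --checkpoint || exit 1

    waited=0
    while [ "$waited" -lt "$timeout" ]; do
        for f in "$dir"/ckpt_*.dmtcp; do
            # The glob stays literal when nothing matches, so test for a real file.
            if [ -f "$f" ]; then
                echo "$f"
                exit 0
            fi
        done
        sleep 1
        waited=$((waited + 1))
    done
    echo "no checkpoint image in $dir after ${timeout}s" >&2
    exit 1
)
```

Note the limits of this check: an image left over from an earlier checkpoint also satisfies it, and the file existing does not prove it has been fully written. Treat it as a starting point rather than a robust wait.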
Then kill the running dmtcp1 with Ctrl + c:
$ dmtcp_checkpoint ./dmtcp1
dmtcp_checkpoint (DMTCP + MTCP) 1.2.7
Copyright (C) 2006-2011 Jason Ansel, Michael Rieker, Kapil Arya, and Gene Cooperman
This program comes with ABSOLUTELY NO WARRANTY.
This is free software, and you are welcome to redistribute it
under certain conditions; see COPYING file for details.
(Use flag "-q" to hide this message.)
1 2 3 4 5 6 7 8 ^C
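The checkpoint leaves a file in the current directory, test, that holds the program's image (as we will see shortly, this is the file used to restore the program):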
$ ls -l ckpt_dmtcp1_b51cf8-7846-516fbedc.dmtcp
-rw------- 1 vpsee vpsee 2532431 Apr 18 09:37 ckpt_dmtcp1_b51cf8-7846-516fbedc.dmtcp
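The image file name carries the same ID the coordinator listed under DMTCP-UNIQUEPID. Judging only from this one sample (not from DMTCP documentation), the name looks like `ckpt_<program>_<host id>-<PID>-<hex timestamp>.dmtcp`. The sketch below splits a name on that assumption; `parse_ckpt_name` is a made-up helper name, and the field meanings are inferred, so treat them as a guess.

```shell
# Split a checkpoint file name of the assumed form
#   ckpt_<prog>_<hostid>-<pid>-<hextime>.dmtcp
# into its parts. The layout is inferred from the sample in this post.
parse_ckpt_name() (
    name=${1##*/}          # drop any leading directory
    name=${name#ckpt_}     # drop the "ckpt_" prefix
    name=${name%.dmtcp}    # drop the ".dmtcp" suffix
    id=${name##*_}         # text after the last underscore: the unique ID
    prog=${name%_*}        # text before it: the program name
    hostid=${id%%-*}       # first dash-separated field
    rest=${id#*-}
    pid=${rest%%-*}        # second field
    hextime=${rest#*-}     # third field, read here as hexadecimal seconds
    printf 'prog=%s hostid=%s pid=%s time=%s\n' \
        "$prog" "$hostid" "$pid" "$((0x$hextime))"
)
```

For the file above, `parse_ckpt_name ckpt_dmtcp1_b51cf8-7846-516fbedc.dmtcp` prints `prog=dmtcp1 hostid=b51cf8 pid=7846 time=1366277852`. Read as a Unix timestamp, 1366277852 is 2013-04-18 09:37:32 UTC, which is consistent with the `Apr 18 09:37` shown by ls if that machine's clock was on UTC.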
Restore it with dmtcp_restart, and the dmtcp1 example resumes from the saved state (here, right where it was interrupted):
$ dmtcp_restart ckpt_dmtcp1_b51cf8-7846-516fbedc.dmtcp
dmtcp_checkpoint (DMTCP + MTCP) 1.2.7
Copyright (C) 2006-2011 Jason Ansel, Michael Rieker, Kapil Arya, and Gene Cooperman
This program comes with ABSOLUTELY NO WARRANTY.
This is free software, and you are welcome to redistribute it
under certain conditions; see COPYING file for details.
(Use flag "-q" to hide this message.)
9 10 11 12 ^C
DMTCP even lets us pause a half-finished program, save it to disk, copy it to another server, and continue running it there. Pretty cool, right?