Brian Helenbrook and his group hammered out a working qsub script for using Fluent on the cluster. It’s well commented, so be sure to read through it and set parameters as appropriate for your own use. If you have questions, Brian can be reached at bhelenbr@clarkson.edu
Jobs running slow? Disable processor affinity in mvapich2
We’ve discovered the hard way that one of mvapich2’s default settings does not play nicely in this environment.
When you’re running MPI jobs, be sure to disable processor affinity. This allows the queue and OS scheduler to place your job on the least-loaded CPU core(s) to maximize the amount of time your job spends on the CPU.
Disabling this setting is accomplished by setting an environment variable either in your shell or in the script used to launch your job:
export MV2_ENABLE_AFFINITY=0
You’ll see the impact easily: your jobs will be able to use more than two cores (if you asked for more than two), and they’ll run those cores at nearly 100% utilization.
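As a minimal sketch, the setting can sit near the top of a job script, before the MPI launch line. Only the variable name and value come from the note above; the check at the end is purely illustrative and just confirms that child processes inherit the setting.

```shell
#!/bin/sh
# Disable mvapich2 processor affinity for this shell and everything it launches.
export MV2_ENABLE_AFFINITY=0

# Illustrative check (not required): a child process should report the value 0.
sh -c 'echo "MV2_ENABLE_AFFINITY=$MV2_ENABLE_AFFINITY"'
```

Because the variable is exported, any MPI launcher started later in the same script inherits it.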
Many thanks to Brian Helenbrook for his persistence and patience in working through this problem.
Checking for Hanging Jobs
MPI jobs often leave zombie processes on the machine. To check what processes you have running on all of the compute nodes, use the command
rocks run host 'ps -U <username>'
Deleting Hanging Jobs
If you know you have hanging jobs on the cluster (see the post on checking for hanging jobs), you can delete them in several ways. The brute-force, kill-everything method is ( rocks run host 'killall <executable name>' ).
This will kill all of your processes that have the name <executable name>. Don’t do this if you have some legitimate jobs running in the queue and some hanging process running an executable of the same name.
To be more selective, first log in to the node with ( ssh compute-X-X ), where X-X identifies the compute node that has the hanging job. Then type ( ps -U <username> ) to see the process names and ids.
To delete a particular process, type ( kill <pid> ), where <pid> is the id reported by the ps command. Lastly, type ( exit ) to log out of that node.
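The selective approach can be sketched as a small shell helper. The function name pids_for_exe and the executable name my_solver are hypothetical, made up for illustration; the helper only filters ps output and leaves the decision to kill up to you.

```shell
#!/bin/sh
# Hypothetical helper: read "PID COMMAND" lines on stdin and print the PIDs
# whose command name exactly matches the first argument.
pids_for_exe() {
    awk -v exe="$1" '$2 == exe { print $1 }'
}

# Example use on a compute node (commented out so nothing is killed by accident):
#   ps -U "$USER" -o pid= -o comm= | pids_for_exe my_solver
#   kill <pid>    # for each PID you have confirmed is a hanging process
```

Matching on the exact command name keeps unrelated processes out of the list, but it still cannot tell a hanging process from a legitimate job running the same executable, so check each PID before killing it.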
Lots and lots of little jobs
Ever run into the situation where you had a lot of jobs you needed to submit that were only the slightest bit different from each other? Turns out there’s a nifty way that SGE can help with that. Check out this wiki post to learn more:
http://wiki.gridengine.info/wiki/index.php/Simple-Job-Array-Howto
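For a rough idea of what this looks like, here is a minimal sketch of an SGE array job script. The -t option and the SGE_TASK_ID variable are standard SGE features; the job name sweep, the range 1-10, and the input.N.dat naming scheme are assumptions made up for illustration.

```shell
#!/bin/sh
#$ -N sweep
#$ -cwd
#$ -t 1-10

# Hypothetical naming scheme: map a task number to its input file.
input_for_task() {
    printf 'input.%s.dat\n' "$1"
}

# SGE sets SGE_TASK_ID for each task of an array job; default to 1 so the
# script can also be tried by hand outside the queue.
task_id="${SGE_TASK_ID:-1}"
infile=$(input_for_task "$task_id")
echo "Task $task_id would process $infile"
```

Submitted once with qsub, a script like this runs as ten tasks, each seeing a different SGE_TASK_ID, so a single submission replaces ten nearly identical ones.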