Brian Helenbrook and his group hammered out a working qsub script for using Fluent on the cluster. It’s well commented, so be sure to read through it and set parameters as appropriate for your own use. If you have questions, Brian can be reached at bhelenbr@clarkson.edu.

 

Jobs running slow? Disable processor affinity in mvapich2


We’ve discovered the hard way that one of mvapich2’s default settings does not play nicely in this environment.

When you’re running MPI jobs, be sure to disable processor affinity.  This allows the queue and OS scheduler to place your job on the least-loaded CPU core(s) to maximize the amount of time your job spends on the CPU.

Disabling this setting is accomplished by setting an environment variable either in your shell or in the script used to launch your job:

export MV2_ENABLE_AFFINITY=0
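
As a rough illustration of where that export might sit in a job script, here is a minimal sketch. The job name, the parallel environment name ("mpi"), the slot count, the solver path and the use of mpirun as the launcher are all assumptions made for the example, not settings taken from this cluster.

```shell
#!/bin/bash
# Sketch of an SGE job script; the job name, parallel environment ("mpi"),
# slot count and solver path are hypothetical placeholders.
#$ -N affinity_demo
#$ -cwd
#$ -pe mpi 4

# Disable mvapich2 processor affinity before the MPI launch.
export MV2_ENABLE_AFFINITY=0
echo "MV2_ENABLE_AFFINITY=${MV2_ENABLE_AFFINITY}"

# Launch only if the (hypothetical) solver binary and a launcher are present.
solver=./my_solver
if [ -x "$solver" ] && command -v mpirun >/dev/null 2>&1; then
    mpirun -np "${NSLOTS:-4}" "$solver"
fi
```

The export comes before the launch line so that the variable is already set in the environment the MPI launcher starts from.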

You’ll see the impact easily: your jobs will be able to use more than two cores (if you asked for more than two), and they’ll run those cores at nearly 100% utilization.

Many thanks to Brian Helenbrook for his persistence and patience in working through this problem.

 

Checking for Hanging Jobs


MPI jobs often leave zombie processes on the machine. To check what processes you have running on all of the compute nodes, use the command

           rocks run host 'ps -U <username>'
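
A slightly more detailed variant of the check above can be sketched as follows. It assumes you want to look at your own account, and it adds the standard ps output columns for process id, elapsed time and command name so that old leftovers stand out. The fallback branch exists only so the snippet does something sensible on a machine without the rocks command.

```shell
# Use the current login name; substitute another account if needed.
me="$(id -un)"

# pid, elapsed time and command name; long-running entries on nodes where
# you have no queued job are the likely leftovers.
ps_cmd="ps -U ${me} -o pid,etime,comm"

if command -v rocks >/dev/null 2>&1; then
    rocks run host "${ps_cmd}"     # cluster front end: runs on every node
else
    echo "rocks not found; would run: rocks run host '${ps_cmd}'"
fi
```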

Deleting Hanging Jobs


If you know you have hanging jobs on the cluster (see “Checking for Hanging Jobs” above), you can delete them in several ways. The brute-force, kill-everything method is

           rocks run host 'killall <executable name>'

This kills all of your processes named <executable name>. Don’t do this if you have legitimate jobs running in the queue alongside a hanging process that runs an executable of the same name.
To be more selective, log in to the compute node that has the hanging job with

           ssh compute-X-X

where X-X identifies the node. Then type

           ps -U <username>

to see the process names and ids. To delete a particular process, type

           kill <pid>

where <pid> is the id reported by the ps command. Lastly, type exit to log out of that node.
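
The kill step can be tried out safely away from the cluster. The sketch below uses a background sleep as a stand-in for a hung process (an assumption made purely for illustration), sends the default polite signal first, and only escalates to kill -9 if the process is still there.

```shell
# Stand-in for a hung process; on a real node the PID comes from
# `ps -U <username>` instead.
sleep 300 &
pid=$!

kill "$pid"                        # plain kill sends SIGTERM
# wait works here only because the stand-in is a child of this shell;
# on a compute node, re-run ps to confirm the process is gone.
wait "$pid" 2>/dev/null || true

if kill -0 "$pid" 2>/dev/null; then
    kill -9 "$pid"                 # still alive: escalate to SIGKILL
    status="killed with SIGKILL"
else
    status="terminated"
fi
echo "process ${pid}: ${status}"
```

Trying plain kill before kill -9 gives the process a chance to clean up after itself.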

Lots and lots of little jobs


Ever run into the situation where you had a lot of jobs you needed to submit that were only the slightest bit different from each other?  Turns out there’s a nifty way that SGE can help with that.  Check out this wiki post to learn more:

http://wiki.gridengine.info/wiki/index.php/Simple-Job-Array-Howto
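
For a quick feel of what that looks like, here is a minimal sketch of an array job script. The task range 1-10 and the case_N.in file naming are hypothetical; the -t option and the SGE_TASK_ID variable are standard SGE features.

```shell
#!/bin/bash
# Sketch of an SGE array job; the range 1-10 and the case_N.in naming
# are hypothetical.
#$ -cwd
#$ -t 1-10

# SGE sets SGE_TASK_ID for each task; fall back to 1 when the script is
# run by hand (unset) or submitted without -t ("undefined").
task="${SGE_TASK_ID:-1}"
if [ "$task" = "undefined" ]; then
    task=1
fi

input="case_${task}.in"
echo "task ${task} would process ${input}"
```

Submitted once with qsub, a script along these lines would run as ten separate tasks, each seeing a different SGE_TASK_ID.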