Configuring Ganglia
My group would like to get more metrics out of Ganglia than we currently get, such as the ability to view any one-hour interval in the recorded past rather than just the previous hour. We would also like to see activity down to the thread level to help us identify problems as they arise across our clusters. Is it possible to configure Ganglia to provide more parameters and unlock additional features, and if so, how? Help greatly appreciated.
Support Staff 2 Posted by David E. Young on 07 Jun, 2010 11:42 AM
I'll investigate these questions. There are no immediate answers; sorry about that. This will take some research.
Support Staff 3 Posted by David E. Young on 07 Jun, 2010 01:19 PM
Created a lighthouse ticket.
Support Staff 4 Posted by David E. Young on 10 Jun, 2010 12:56 PM
Hey Wes. Would you try and elaborate on what thread-level metrics you're interested in? This will help me focus on how we might obtain this information and via what type of interface. Thanks.
-- david
Support Staff 5 Posted by David E. Young on 02 Jul, 2010 12:32 PM
Hey Wes. I saw your posting on the ganglia support list. I've written gmond plugins in both Python and C, and find the C approach generally less onerous. I can send you the source and compiled gmond modules if you like so you can take a look. We have two modules; one collects mysqld metrics, the other some metrics from SMB. These modules should be considered experimental, but they seem to run fine here.
I built the C modules on a Fedora 11 box and run them on CentOS 5.4 with Ganglia 3.1.7. They're 64-bit and will not run with the ganglia version as shipped with our original CentOS DVD.
Let me know.
-- david
Support Staff 6 Posted by David E. Young on 02 Jul, 2010 12:39 PM
I should qualify the comment regarding running with ganglia 3.1.1. You may certainly compile the modules on a 32-bit CentOS 5.4 machine and they should (probably) run just fine. I just haven't tried that so I can't make promises.
7 Posted by Wes Stevens on 02 Jul, 2010 05:08 PM
Yea sure, would love to have them thanks!
8 Posted by Wes Stevens on 02 Jul, 2010 08:35 PM
Hey Dave I'm not sure if this is the best place to ask this, but I ran the makefile on the ganglia_plugins package you provided this morning and I get this compile error:
[wjs@A3797200 ganglia_plugins]$ make
gcc -c -g -fno-strict-aliasing -fPIC -fpic -fno-omit-frame-pointer -std=gnu99 -Wall -I. -I/usr/include/ganglia -I/usr/include/apr-1 -m64 -D_GNU_SOURCE -D_LARGEFILE64_SOURCE utils.c -o lib64/utils.o
In file included from utils.c:10:
utils.h:5:25: error: gm_protocol.h: No such file or directory
make: *** [lib64/utils.o] Error 1
I'm not sure what it does; my guess is something related to communication with the gmond daemon. I can't find gm_protocol.h anywhere, although it's included in utils.h. Thanks again in advance.
9 Posted by Wes Stevens on 02 Jul, 2010 09:16 PM
Ah, never mind. Harry found me the ganglia development package that contains this stuff.
10 Posted by Wes Stevens on 06 Jul, 2010 04:53 PM
Hey Dave do you know if the SaffronOne_0 and SaffronOne_1 metrics will work if Saffron is configured to run as a single process and we run an ingest? Any other caveats I might not be aware of?
I'm also taking a look at your C code and wondering if you are still developing this stuff and you might have later versions of this available?
Lastly, I'm wondering if you guys have any documentation for this project. I've been taking a look at it, and given the sheer amount of code it will take me a while to digest exactly what was done here, although I've got it running and displaying metrics in Ganglia just fine. Thanks
Support Staff 11 Posted by David E. Young on 06 Jul, 2010 06:53 PM
If you're running single-process then the metrics won't be available. The current version assumes a standard deployment. This is mostly because the smb metrics module gets all of its information from the proc filesystem; there is no API as of yet to smb itself.
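As an illustration of the proc-filesystem approach (this is not the modsmb code, which is written in C), here is a minimal Python sketch that pulls a per-process thread count out of /proc/<pid>/status on Linux. The function names are hypothetical; the only thing assumed about the system is the "Threads:" field that Linux writes into that file.

```python
def parse_thread_count(status_text):
    """Return the value of the 'Threads:' line from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("Threads:"):
            return int(line.split(":", 1)[1].strip())
    return None  # field not present in the text we were given


def thread_count(pid):
    """Read /proc/<pid>/status (Linux only) and return the thread count."""
    with open("/proc/%d/status" % pid) as f:
        return parse_thread_count(f.read())


if __name__ == "__main__":
    # Sample text stands in for a real status file so this runs anywhere.
    sample = "Name:\tjava\nState:\tS (sleeping)\nThreads:\t42\n"
    print(parse_thread_count(sample))
```

A count like this is per process, so it says nothing about what the individual threads inside a single JVM are doing, which matches the single-process limitation described above.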
You have the latest version. I'm on different tasking right now, but if you folks see bugs or a need for new features then I should have time to look at these requests.
There was zero ganglia documentation (at least that I found) on how to write metric modules in C. In the ganglia 3.1.7 source tree there is a module shell directory (ganglia-3.1.7/gmond/modules/example), plus of course the modules shipped with ganglia. Those are what I used to get going. It's actually not as bad as it might first appear. If you look at the skeleton module you'll see that there isn't all that much code to hook into the ganglia API.
Finally, remember that you have two alternatives to a C approach: a) Python; and b) gmetric. I found Python to impact gmond a bit more than I liked; I prefer the tight integration of C and knowing the metric modules are running as fast/efficiently as possible (assuming of course my code is good). Gmetric is the old interface/way to insert metrics into gmond. It doesn't format as nicely as the API, but it is a possible alternative. Google around for some tutorials; I played with gmetric a bit.
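For anyone weighing the gmetric route, here is a minimal Python sketch of pushing one value through the gmetric command-line tool. The helper names and the example metric name are hypothetical. It assumes the gmetric binary is installed and on PATH, and it uses gmetric's --name, --value, --type and --units options.

```python
import subprocess


def build_gmetric_command(name, value, value_type="string", units=""):
    """Build the argument list for one gmetric call (hypothetical helper)."""
    cmd = ["gmetric", "--name", name, "--value", str(value), "--type", value_type]
    if units:
        cmd += ["--units", units]
    return cmd


def send_metric(name, value, value_type="string", units=""):
    """Run gmetric once; assumes the gmetric binary is installed and on PATH."""
    return subprocess.call(build_gmetric_command(name, value, value_type, units))


if __name__ == "__main__":
    # Print the command rather than running it, so this works without Ganglia.
    print(" ".join(build_gmetric_command("ingest_threads", 12, "uint16", "threads")))
```

Each call publishes a single sample, so something like this would have to be run repeatedly (from cron or a loop, for example) to produce a graph.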
-- david
12 Posted by Wes Stevens on 09 Jul, 2010 04:52 PM
Hey Dave, at some point in the future we may be interested in changing the smb metrics to suit our configuration, however my biggest fear is overburdening you and keeping you from doing other important work, especially if that work is helping the rest of my team. I am not really sure what to do at this point, so I figure I will wait until my team starts raising a fuss. May have other questions in the future though. Thanks
Support Staff 13 Posted by David E. Young on 12 Jul, 2010 12:28 PM
Don't worry about that. At present, part of my standing tasking is extensions to ganglia, so unless something changes drastically here I'm available to look at these extensions. Just let me know. Thanks!
-- david
14 Posted by Wes Stevens on 12 Jul, 2010 04:57 PM
Probably the biggest problem I'm having right now is writing our own simple Python template for generating our own metrics. We think such a template will really give us an idea of where we want to go as far as a final template, whether we want it in C or Python. I have followed this tutorial almost to a T with zero results:
http://tech.xlab.si/2008/10/writing-custom-python-metric-modules-fo...
despite how incredibly simple and straightforward the tutorial is. The only significant way I stray from the tutorial is putting test.py in /usr/lib64/ganglia/python_modules instead of /usr/local/share/ganglia/python_modules. I'm not sure if that's keeping everything from working, but I do not have permission to use /usr/local/share. When we run gmond with the debug option we get an error that it cannot find metric_init in test.py, which doesn't make any sense because it is clearly there. It must not be finding test.py at all, or something else is wrong.
In relation to what we've been talking about, we haven't had any of the SMB metrics plotting data aside from the Tomcat ones. This doesn't mean all of them don't work, but I strongly suspect at least most of them don't. Even during ingests they show nothing. If you have any questions about our configuration please let me know, perhaps even Jim Fleming could help you there.
Thanks again in advance
Support Staff 15 Posted by David E. Young on 12 Jul, 2010 05:53 PM
I generally see all of the SMB metrics plot. On the occasions I haven't, restarting gmetad/gmond on my test cluster (sometimes several times) clears that up. I don't think it's a problem with the module, but you can verify by looking in /var/log/messages. Modsmb will write diagnostics to the system logger. All of this assumes you're using ganglia 3.1.7 (64-bit); I haven't tested with any other version. Ganglia can be flaky; even the newest version.
I also had a python metric module working on ganglia 3.1.1 (the initial version shipped with our original DVD). On that dvd, ganglia is 32-bit because at that time only 32-bit rpms were available.
The ganglia 3.1.7 RPMs we sent you folks were compiled without python support, so unless you've rebuilt them you'll have trouble. Also, ganglia 3.1.1 only supports python up to version 2.3.
I don't know enough about the environment you're working in to offer anything more constructive. Let me know.
-- david
16 Posted by Wes Stevens on 12 Jul, 2010 06:05 PM
Indeed this environment meets all those requirements, it is 3.1.7, 64-bit, the rpm has python support, we have python 2.5 installed, it's just not working.
I have rebooted gmond and gmetad daemons dozens of times and even run ingests and we have yet to have any of the non-Tomcat metrics plot anything.
As far as the SaffronOne_0 and SaffronOne_1 metrics we think they don't work because our Saffron is configured for a single process and this is two processes, probably not showing anything because the plugin isn't made to recognize a process name for a single process instance, but I dunno I haven't investigated.
As for the Enterprise or ClusterCoordinator stuff, I have no idea why they're not plotting anything.
Support Staff 17 Posted by David E. Young on 12 Jul, 2010 06:11 PM
Oh, of course. Yes, as mentioned previously the modsmb plugin will only plot SMB metrics if you're running multi-process. That includes all smb processes, not just SaffronOne. HostManager is not currently plotted at all because it really doesn't impact the system.
As for python, that I have not tried (yet) with 3.1.7. Unless things have changed, in addition to your module-specific config file, you must also place modpython.conf in /etc/ganglia/conf.d. There's a sample for this in the examples directory that came with the modsmb tarball.
-- david
18 Posted by Wes Stevens on 12 Jul, 2010 06:21 PM
Could we get it to plot data for a single saffron process then? Like, just SaffronOne, without any postfix like 0 or 1.
As for the modpython.conf in /etc/ganglia/conf.d, we have indeed had this located there, as well as test.conf. Perhaps we should set up a meeting on webex? I guess I would need Harry's approval.
Support Staff 19 Posted by David E. Young on 12 Jul, 2010 06:36 PM
Plotting a specific class instance within a single VM (such as SaffronOne) would require some type of API to either the Java or VM layer of SMB, neither of which exists right now.
When I get a bit of time (I'm on a specific task right now) I'll see about getting python support for ganglia onto my test cluster. A webex is probably a bit premature right now; I'd go to the ganglia forum first.
This is a good project; have fun with it.
-- david
Support Staff 20 Posted by David E. Young on 14 Jul, 2010 12:17 PM
Hi Wes. I found a bit of time yesterday to try python with ganglia 3.1.7. I installed the example.py module that comes with the distribution and it works ok. I noticed one subtlety that's easy to miss, though. In modpython.conf you'll see this line:
params = "/usr/lib64/ganglia/python_modules"
make sure you've created the python_modules directory in /usr/lib64/ganglia, and drop your python modules in there. Alternatively, of course, you can change the params attribute.
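A quick way to check this is to read the params value out of modpython.conf and confirm that the directory exists. The sketch below is only that, a sketch: the function names are hypothetical, and the default config path is the one used in this thread.

```python
import os
import re


def python_module_dir(conf_text):
    """Extract the params path from modpython.conf text, or None if absent."""
    match = re.search(r'^\s*params\s*=\s*"([^"]+)"', conf_text, re.MULTILINE)
    return match.group(1) if match else None


def check_module_dir(conf_path="/etc/ganglia/conf.d/modpython.conf"):
    """Return (path, exists) for the directory named by params."""
    with open(conf_path) as f:
        path = python_module_dir(f.read())
    return path, bool(path) and os.path.isdir(path)


if __name__ == "__main__":
    sample = 'modules {\n  module {\n    params = "/usr/lib64/ganglia/python_modules"\n  }\n}\n'
    print(python_module_dir(sample))
```

The regular expression only matches an assignment (params = "..."), so the "params - path to the directory" line in the file's comment header is ignored.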
hth, david
Support Staff 21 Posted by David E. Young on 14 Jul, 2010 01:29 PM
Also, you've probably read this but I'm going to throw it into the mix anyhow:
http://sourceforge.net/apps/trac/ganglia/wiki/ganglia_gmond_python_...
-- david
22 Posted by Wes Stevens on 14 Jul, 2010 06:43 PM
Already had test.py located in /usr/lib64/ganglia/python_modules and also read that tutorial a while back. Here's the actual error message I get:
[wjs@A3797200 ~]$ /usr/sbin/gmond -d10
loaded module: core_metrics
loaded module: cpu_module
loaded module: disk_module
loaded module: load_module
loaded module: mem_module
loaded module: net_module
loaded module: proc_module
loaded module: sys_module
loaded module: multicpu_module
loaded module: smb_module
loaded module: python_module
loaded module: mysql_module
[PYTHON] Can't find the metric_init function in the python module [test].
...
Unable to find the metric information for 'test'. Possible that the module has not been loaded.
...
......
.........
EDIT: /var/log/messages says the same thing and doesn't reveal anything new
23 Posted by Wes Stevens on 14 Jul, 2010 07:04 PM
Here's more information that might help (in following the tutorial http://tech.xlab.si/2008/10/writing-custom-python-metric-modules-for-ganglia/)
[wjs@A3797200 ~]$ ls -1 /etc/ganglia/conf.d
diskusage.pyconf.off
modgstatus.conf
modmysql.conf
modpython.conf
modpython.conf~
modpython.conf.save
modsmb.conf
multicpu.conf
tcpconn.pyconf
test.pyconf
test.pyconf~
[wjs@A3797200 ~]$ ls -1 /usr/lib64/ganglia/python_modules
compile.py
compiler.py
example.py
example.pyc
multidisk.py
multidisk.pyc
tcpconn.py
tcpconn.pyc
test.py
test.py~
test.pyc
[wjs@A3797200 ~]$ cat /etc/ganglia/conf.d/modpython.conf
/*
  params - path to the directory where mod_python
           should look for python metric modules

  the "pyconf" files in the include directory below
  will be scanned for configurations for those modules
*/
modules {
  module {
    name = "python_module"
    path = "modpython.so"
    params = "/usr/lib64/ganglia/python_modules"
  }
}

include ('/etc/ganglia/conf.d/*.pyconf')
[wjs@A3797200 ~]$ cat /etc/ganglia/conf.d/test.pyconf
modules {
  module {
    name = "test"
    language = "python"
  }
}

collection_group {
  collect_once = yes
  time_threshold = 20
  metric {
    name = "test"
    title = "Prints some string"
  }
}
[wjs@A3797200 ~]$ cat /usr/lib64/ganglia/python_modules/test.py
import os

def getString(name):
    test_file = "/home/notec/wjs/Desktop/test"
    try:
        p = os.popen(test_file, 'r')
        return p.read()
    except IOError:
        return "Error"

def metric_init(params):
    d1 = {'name': 'test',
          'call_back': getString,
          'time_max': 90,
          'value_type': 'string',
          'units': '',
          'slope': 'both',
          'format': '%s',
          'description': 'Example metric'}
    return [d1]

def metric_cleanup():
    pass
Support Staff 24 Posted by David E. Young on 14 Jul, 2010 07:11 PM
Tell you what. Tar up all of your python stuff (.py files, config files) and send me the tarball. I want to make sure I've got the exact stuff you do; cut'n'pasting the stuff might alter the formatting some way. I'll take a look...
-- david
25 Posted by Wes Stevens on 14 Jul, 2010 07:25 PM
Sure thing (see attached)
Support Staff 26 Posted by David E. Young on 14 Jul, 2010 08:01 PM
Ok, Wes. I reproduced your problem but don't understand it. Especially after finding a solution. I simply renamed your module to getstr.py (rather than test.py), and changed the metric name to Get_String. Works just fine.
With your original code I also got this message from gmond:
Exception AttributeError: "'module' object has no attribute 'metric_init'" in 'garbage collection' ignored
Fatal Python error: unexpected exception during garbage collection
Makes no sense does it? I told you ganglia could be flaky. Try the attached version.
-- david
27 Posted by Wes Stevens on 14 Jul, 2010 08:41 PM
Alright no longer getting the error. However, no luck getting Ganglia to show the Get_String metric anywhere, rebooted a bunch of times with no effect. Are you able to get this?
Support Staff 28 Posted by David E. Young on 15 Jul, 2010 12:36 PM
Yes, I see it on the node view page under Time and String Metrics. Note the first line:
Prints some string          Frodo Baggins
Last Boot Time              Tue, 06 Jul 2010 08:03:29 -0400
Gexec Status                OFF
Gmond Started               Thu, 15 Jul 2010 08:34:02 -0400
Last Reported               0 days, 0:00:10
Machine Type                x86_64
Operating System            Linux
Operating System Release    2.6.30.10-105.2.23.fc11.x86_64
Uptime                      9 days, 0:30:53
Support Staff 29 Posted by David E. Young on 15 Jul, 2010 01:22 PM
Oh, one more thing. I'm sure you saw this, but when I was experimenting with your code I changed the os.popen() to reference a script named /tmp/frodo. That script simply echoes 'Frodo Baggins' to stdout. Also, note that the script must (obviously) be executable by everyone. I made this mistake.
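The attached getstr.py is not reproduced in this thread. Putting together the rename, the new metric name and the note about the script being executable, the working module might look roughly like the sketch below. The structure mirrors the test.py posted earlier and /tmp/frodo is the script path mentioned above; the explicit os.access check and the error string are additions for illustration, not something taken from the attachment.

```python
import os

SCRIPT = "/tmp/frodo"  # script path from the thread; it echoes a string to stdout


def get_string(name):
    """Callback: run the script and return its output as the metric value."""
    if not os.access(SCRIPT, os.X_OK):
        # Added for illustration: surface the permission problem in the metric.
        return "Error: %s is not executable" % SCRIPT
    p = os.popen(SCRIPT, "r")
    try:
        return p.read().strip()
    finally:
        p.close()


def metric_init(params):
    """Called by gmond's Python module support; returns metric descriptors."""
    return [{
        "name": "Get_String",
        "call_back": get_string,
        "time_max": 90,
        "value_type": "string",
        "units": "",
        "slope": "both",
        "format": "%s",
        "description": "Prints some string",
    }]


def metric_cleanup():
    pass


if __name__ == "__main__":
    # Lets the module be exercised from the command line, outside gmond.
    for d in metric_init({}):
        print("%s = %s" % (d["name"], d["call_back"](d["name"])))
```

The matching .pyconf would need the module name getstr and the metric name Get_String in place of test.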
-- david
Support Staff 30 Posted by David E. Young on 15 Jul, 2010 05:15 PM
And another note. I think I understand why your original code, when named test.py, didn't work. Python has a package named 'test' in its standard path. If the ganglia libpython module simply imports python metric modules, then what probably happened is the python 'test' package was imported rather than your module.
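If that explanation is right, a candidate module name can be checked before use by asking Python whether the name already resolves on its default path. This is a minimal sketch with a hypothetical helper name. It uses importlib from current Python versions, which the Python 2.x releases discussed in this thread do not have (imp.find_module would be the rough equivalent there), and it should be run with the same interpreter that gmond embeds.

```python
import importlib.util


def name_is_taken(module_name):
    """True if `import module_name` already resolves on the default sys.path."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    for candidate in ("test", "getstr"):
        status = "already importable" if name_is_taken(candidate) else "free"
        print("%s: %s" % (candidate, status))
```

Run it from a directory that does not contain the metric modules themselves, otherwise the candidate name will be found there and reported as taken.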
Only thing that makes sense right now.
-- david