What's wrong with network monitoring tools? Where do I start...

For as long as I can remember I've worked in an environment where there's a screen on the wall showing the status of the company's systems. Or actually, in one case, showing the status of the company's systems unless there was a test match on. From time to time that information's been useful. Unfortunately, most of the time we …

COMMENTS

This topic is closed for new posts.
  1. Ruairi

    What I'd really like to see is event correlation in an intelligent way...

    By parsing flow data in almost-real time, looking for patterns in syslog, interface changes (i.e. a flap, or an interface counter going +/- X% across samples), and snarfing up accounting data. Hell, even take an iBGP feed of updates from my eBGP peers, and a feed of OSPF LSAs, and correlate an event with a specific set of updates. There's so much room for correlation, there's just nothing about that I've found that works for me.

    I think overall, we have all the tools we need to do this, but the time needed to integrate them all, make them talk nicely, and set intelligent thresholds, relative thresholds and even a little historical prediction based on previous events is just not worth it. Lately, instead of spending time on this, I'm fighting to get nfsen/observium/smokeping/homegrown scripts to talk together and give me a coherent view of my traffic patterns.
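
    A minimal sketch of the "relative threshold" idea mentioned above, in Python. The function name, the sample values and the 50% threshold are all illustrative assumptions, not anything taken from nfsen, observium or smokeping.

```python
def relative_change_alerts(samples, threshold_pct):
    """Return (index, pct_change) for each sample that moved by more than
    threshold_pct relative to the sample before it."""
    alerts = []
    for i in range(1, len(samples)):
        prev, cur = samples[i - 1], samples[i]
        if prev == 0:
            continue  # relative change is undefined against a zero baseline
        pct = (cur - prev) / prev * 100.0
        if abs(pct) > threshold_pct:
            alerts.append((i, round(pct, 1)))
    return alerts


# Hypothetical per-sample interface rates in Mbit/s.
rates_mbps = [400, 410, 405, 900, 880, 300]
print(relative_change_alerts(rates_mbps, 50))
```

    Comparing each sample with its predecessor flags both the jump and the later drop, which a fixed absolute threshold set for the busy period would miss.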

    I'm battling with the stupidity of SNMP traps, SNMP's format and the absurdity of 5 minute samples when I have 40Gbit interfaces.
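
    One concrete reading of the 5-minute complaint (an assumption about what is meant here) is plain arithmetic: at 40Gbit/s a 32-bit octet counter such as IF-MIB's ifInOctets can wrap many times between polls, and even with the 64-bit ifHCInOctets a single 5-minute average hides everything shorter than the interval. A rough Python sketch, with the helper name made up for illustration:

```python
def wrap_seconds(counter_bits, link_bps):
    """Seconds for an octet counter of the given width to wrap at full line rate."""
    return (2 ** counter_bits) * 8 / link_bps


LINK_BPS = 40e9          # 40Gbit/s interface
POLL_SECONDS = 5 * 60    # classic 5-minute polling interval

print("32-bit counter wraps after %.2f s" % wrap_seconds(32, LINK_BPS))
print("64-bit counter wraps after %.0f years"
      % (wrap_seconds(64, LINK_BPS) / (365 * 86400)))
print("bytes averaged into one sample at line rate: %.2e"
      % (LINK_BPS / 8 * POLL_SECONDS))
```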

    Monitoring makes me stabby.

  2. Steve Davies 3 Silver badge
    Mushroom

    The same can apply to System monitoring tools

    but the less said about SNMP the better.

    From the devices that:-

    - have to be polled for status rather than issue traps (WTF?)

    - Kit where the maker won't release the MIB because it is a 'trade secret'

    - Kit that uses a different MIB from the one supplied by the maker

    And so it goes on and on and on.

    Finally we get to a Building Management System supplier who wants a cool $100,000 just to 'develop' an SNMP Module for his system when we know that they already have it.

    ----> to the lot of them.

    1. Anonymous Coward

      Steve Davies 3 - Re: The same can apply to System monitoring tools

      Sorry, but you still have to poll a device in SNMP. How exactly would you know the temperature inside your core switch? Are you expecting to receive a trap every few minutes telling you it is still 35 degrees Celsius? This means you should have a way of configuring the interval between sending traps. And how about an interface for which you did not receive any trap: is it up or down?

      1. Steve Davies 3 Silver badge

        Re: Steve Davies 3 - The same can apply to System monitoring tools

        Several of the devices I work with have the ability to configure the emitter times for various status params such as CPU temp. One device will tell you, say, every hour that it is within +/- 5C. If it goes outside that range you get a trap.
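
        The behaviour described above (a periodic "still in range" report plus an immediate trap on leaving the band) can be sketched roughly as below. This is an illustration in Python, not any vendor's agent; the function name, the 35C nominal value and the return labels are assumptions.

```python
def should_emit(reading_c, nominal_c, band_c, secs_since_last, heartbeat_secs):
    """Decide what, if anything, a device should send for one temperature reading."""
    if abs(reading_c - nominal_c) > band_c:
        return "trap"        # outside the band: report immediately
    if secs_since_last >= heartbeat_secs:
        return "heartbeat"   # in range, but confirm the device is still alive
    return None              # in range and reported recently: stay quiet


# Hypothetical settings: nominal 35C, +/- 5C band, hourly heartbeat.
print(should_emit(36, 35, 5, 120, 3600))    # quiet
print(should_emit(41, 35, 5, 120, 3600))    # trap
print(should_emit(36, 35, 5, 3600, 3600))   # heartbeat
```

        The heartbeat is what addresses the "no trap received, so is it up or down?" objection: prolonged silence becomes a signal in itself.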

        Being frank, for a lot of devices SNMP functionality is very much an afterthought. For others it is core functionality and well documented and supported.

        One system was sending traps with a totally different MIB to the one documented. It took the manufacturer three frigging months to accept this fact and issue a fix to their system.

        Getting SNMP set up properly for all the systems on the network took almost four months.

        1. Destroy All Monsters Silver badge
          Flame

          Re: Steve Davies 3 - The same can apply to System monitoring tools

          Kit where the maker won't release the MIB because it is a 'trade secret'

          KILL IT WITH FIRE.

          YOU KNOW THAT ... the thing sucks donkey balls if anything as low-brow as a MIB is considered "Intellectual Property" in any way, shape, or form.

    2. SteveR

      Re: The same can apply to System monitoring tools

      See my comment below about CloudView NMS. They add features for us for free: serious SNMP support for all the SNMP versions and GUI screens for many standard MIBs.

  3. TheWeenie

    Not sure I agree with this entirely.

    A network monitoring tool - provided it's been set up and maintained by someone that knows what they're doing - is invaluable. Nested maps, sensible views, common-sense approach to polling and retention, and the right choice of software can all work well. Granted, that's a lot of variables though.

    The most common issues I've seen are under-investment, under-performance, and under-specification.

    Your network is mission-critical. You have tens of thousands of pounds a year spent on contracts, maintenance etc - spend a few more on some decent monitoring tools. Don't just install a single piece of software on an old bit of tin and expect it to work. Invest in high-performance equipment and the right software (there's plenty out there to choose from), and budget money for annual maintenance and some time (say, 0.25FTE) to administer and update it.

    Once you've got your baselines in place you can look at some of the more esoteric stuff out there - remote probes for end-point monitoring; inline taps; netflow; buffer-replay captures etc.

    Oh and I'm anal as hell about this stuff, so all maps are drawn by hand. That way I know what I see is actually what's there.

    It's like anything we do in IT - you get out of a system what you put in.

  4. Stuart Castle Silver badge

    Usability, We've heard of it..

    I think the problem with the GUIs for a lot of products written for sys admins or other technical people (I'm including engineers, mathematicians and other disciplines here, not only computing) is that the programmers of the software believe the functionality is all they need to worry about. The GUI is something they can just tack on.

    Even the big vendors (some of whom spend a *lot* on user interface design) fall foul of this. Anyone who has used Apple's workgroup manager, or a lot of Microsoft's system admin tools can tell you that.

    I think the problem is that a lot of people think that good user interface design is just a trendy thing talked about by designers and other "meeja" types in poncy bars in West London, so they tend to just knock up their own which includes access to all the functions, but they forget that the average user doesn't have access to the development team, so may have difficulty finding the obscure place they've just put that menu item. I've seen various Mathematics, Engineering and Computer Systems Analysis/Design packages that are like this, and have options in the most obscure places.

    To a large extent, UI design *is* just a trendy thing talked about by designers, as stated above. This is the other part of the problem.

    Some designers come up with an incredibly pretty GUI but have no understanding of what the user needs, or any real understanding of the product. We had a NAS once that would regularly stop serving files via SMB. I logged on to the web interface, and saw all sorts of pretty dials and bar graphs telling me everything from the amount of work being done by the CPU through RAM used, Storage space used, number of reads/writes to each disk, various internal temperatures right down to the fact that both PSUs were online. What it never warned me of though is that Samba had crashed. As it turned out Samba was more likely to crash if the storage was full, but I kept a network drive on Windows connected to one of the shares and noticed this.. I was not connected to the web interface.

    The problem is that GUIs need to be designed by people who have a good concept of how the average user thinks (which tends to exclude programmers - no offence intended to anyone) but who also have a good idea of how the program or system should work. I'll admit that it can be difficult finding someone who has both of the above qualities but it is doable. Probably the best way to do it is simply talk to people who use your product. Find out what they need to know and provide it to them in a way they are happy with.

    I know a lot of people don't like Apple, but I'd like to cite one of their products, Apple Remote Desktop, as an example of good user interface design. By default when you run it, you get a list showing the computers you are monitoring, their current status, their IP, who is logged on, the application that currently has focus and the version of OSX they are running. If you need more info, you double click on a computer. If you need to do something to one of the PCs, you can click one of the buttons on the toolbar, and drag the list of the computers you want that action performed on to the window that is shown.

    A product like that is never going to be easy for beginners because a lot of what it does is built around using Unix scripts (so you do need some scripting knowledge), but the UI is (IMO) quite simple and very effective.

  5. Anonymous Coward

    Have a look at the Solarwinds suite. What I am seeing in the new Network Performance Monitor, Server and Application Monitor and Network Traffic Analyser (Flows) makes me wish I had had this functionality for the last 15 years ...

    Being able to see L3 traffic end to end is bloody brilliant. Being able to see on a nice big graphical map where all the offices are and which link between offices is simples, and so are silly things like making alert dependencies: if I have 40 devices behind a router and the router goes down, I don't want the monitoring system to fire out 41 alerts for me to wade through, I want to know the router's gone down and that I have to fix that.
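
    The alert-dependency idea is simple enough to sketch. This is not how Solarwinds implements it, just a minimal Python illustration that assumes the topology is a tree described by a hypothetical child-to-parent map.

```python
def root_cause_alerts(down, parent_of):
    """Keep only the down devices that have no down device upstream of them."""
    roots = []
    for dev in sorted(down):
        node = parent_of.get(dev)
        suppressed = False
        while node is not None:          # walk up towards the core
            if node in down:
                suppressed = True        # an upstream failure explains this one
                break
            node = parent_of.get(node)
        if not suppressed:
            roots.append(dev)
    return roots


# 40 hypothetical hosts behind one router, and everything is unreachable.
parent_of = {f"host{i}": "router1" for i in range(40)}
down = set(parent_of) | {"router1"}
print(root_cause_alerts(down, parent_of))   # ['router1']
```

    Forty-one unreachable devices collapse into the one alert that matters. A real topology with redundant paths would need a graph rather than a single parent per device.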

    Network config manager backing up all router / switch configs: how many times have I seen MSPs looking after someone else's kit and never backing up those configs, let alone the number of people not backing up their own configs.

    That is a very small subset of those apps' features. There is another feature where you, as an infrastructure manager, can make feature requests on the forums, and other customers can vote on features that will go into future builds.

    Like the L3 monitoring. But yeah, there are a thousand things missing, not all routing protocols are there and it does take some experience to set it up properly (Properly.. that is an ongoing process) and yes, I am a fan boi about Network monitoring / management. Network Management and Monitoring are two very different things though. Trying to use Management tools for monitoring jobs .. will probably get you there, but it's hard work..

    (Zenoss is next on my list to do some testing on and Netforts LanGuardian is a software DPI tool that neatly fits into your network, very little setup)

    Anon because this is a sector I work in :)

    1. Jay 2
      Linux

      Zenoss

      I use Zenoss Core (the FOSS one) from a more system/application point of view. I've been using it since v1 and it's now v4.2.x. It takes a little while (and some Python) to get used to it, but it seems to be OK for my needs. I've written customised SNMP-based collector plugins to talk nicely to boxes running Dell OMSA, HP SMH or a plain shell script for VMs.

      For all things syslog, instead of spamming Zenoss direct, we route all syslog messages via some central logger servers which run syslog-ng. syslog-ng is a replacement for normal syslog and it's pretty powerful in that you can create filters to match patterns etc and also re-arrange the actual message or override certain parts of it. This is handy when an application message, for example, says ERROR, but for some strange reason it actually has a level of critical. So in our case we tidy up messages (and filter out the crap) to make sure they're vaguely useful when they hit Zenoss. We've also customised the actual dashboard alerts and emails that Zenoss produces to make it a bit more useful.
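
      The tidy-up step can be illustrated in a few lines. This is deliberately plain Python rather than syslog-ng configuration syntax, and the keyword table and drop pattern are made-up examples, not the rules actually in use.

```python
import re

# Hypothetical rules: trust the keyword in the message text over the
# severity the sender attached, and drop known noise entirely.
SEVERITY_BY_KEYWORD = {"CRITICAL": "crit", "ERROR": "err", "WARNING": "warning"}
DROP_PATTERNS = [re.compile(r"debug heartbeat", re.IGNORECASE)]


def normalise(severity, message):
    """Return the (severity, message) pair to forward, or None to drop it."""
    if any(p.search(message) for p in DROP_PATTERNS):
        return None
    for keyword, corrected in SEVERITY_BY_KEYWORD.items():
        if re.search(rf"\b{keyword}\b", message):
            return (corrected, message)
    return (severity, message)


print(normalise("crit", "billing: ERROR retrying upstream call"))
print(normalise("info", "agent: debug heartbeat ok"))
```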

      However our US colleagues just seem to throw *everything* at Zenoss without a care, as such the front end is unusable as it has thousands of messages on it, and it just spams us all constantly as they take no care in trying to get quality alerts out of it. The result is very much "boy who cried wolf" and I've just set a rule up in Notes to ignore anything that comes from their servers to stop my mailbox hitting 100% every weekend.

    2. John Sanders
      Linux

      Solarwinds

      It is hugely expensive for what it does, and "most" of what it does can be done using open source software.

      Yes, I know it is not point and click and you need people who understand scripting proper, but it tends to produce better results in the long term.

      For most ISPs Solarwinds and the like are ineffective and as I mentioned before way too expensive.

  6. Anonymous Coward

    not always about the monitor, but the response to an alert

    Sometimes it's not about the monitoring systems, but what processes get triggered when they generate an alert...

    The commercial ones are generally tied into an operations team and any alerts they generate have to indicate a real problem that has to go through the whole 'incident' rigmarole, so the thresholds can't usually be set to a level of "let's check that, just in case".

    We wrote ourselves a simple MQ Q depth monitor that beeps and tells you if any Q hits a predetermined threshold. Even a small queue buildup, on a high throughput system, can indicate a backend having issues. The threshold for ours is set so that we know there might be an issue about 5 mins before it red flags on the 'real' monitors and is considered a problem by the company. It also might be nothing, a fraction of a second blip, that we couldn't set the official systems to alert on. We know it happened and we check it out, regardless of whether there was a problem or not.
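
    The two-tier idea (a low "go and look" threshold underneath the official incident threshold) might look something like this. It is a guess at the shape of such a monitor in Python, not the actual tool; the queue names and numbers are invented, and how the depths are fetched from MQ is left out entirely.

```python
def classify_depth(depth, early_warn, official):
    """Map one queue depth to 'ok', 'check' (informal) or 'incident' (official)."""
    if depth >= official:
        return "incident"
    if depth >= early_warn:
        return "check"
    return "ok"


def queues_to_look_at(depths, early_warn, official):
    """Return {queue: status} for every queue that is not 'ok'."""
    statuses = {q: classify_depth(d, early_warn, official) for q, d in depths.items()}
    return {q: s for q, s in statuses.items() if s != "ok"}


# Hypothetical snapshot of queue depths.
snapshot = {"ORDERS.IN": 3, "PAYMENTS.IN": 40, "AUDIT.OUT": 900}
print(queues_to_look_at(snapshot, early_warn=25, official=500))
```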

    When something goes wrong, users may be aware of slow running, or a few timeouts, but generally we are alerted by our monitor and have someone well on the way to fixing the issue before they start calling in.

  7. Anonymous Coward

    So basically what you all want is a package which makes your job something which can be done by a 10 year old? "Flashing amber screen one of your WAN links is down - Call link providers." type solutions... hmmm.... and it's not even Friday...

  8. Stretch

    Your job relies on the lack of availability of the above. If such a thing existed you would find it only in use in India, as real professionals such as yourself would quickly be burnt in favour of cheap idiots.

  9. Uncle Siggy

    Square Wheel

    Instead of recreating the square wheel that conditions admins to ignore rather than respond to a high noise to signal ratio, I suggest a different route. We use Splunk to scrape our logs and deliver diagnostic payloads to the people who need them. Our developers have caught bugs before they hit production because the queries are tested along with the code. Want to see a trend before services tip over and be notified? What about researching dependency trees? Making assumptions based on static graphs, rather than interrogating logs? More "classic" monitoring is the answer to nothing.

  10. Anonymous Coward

    The root of the problem is that this is all about Operations.

    The guys that specify what they want to the vendor, the guys that control the purse strings, are rarely interested in making life for Operations staff better. This is because business cases pass when they are about revenue growth. Cost reduction is far harder to prove and make look attractive.

    Cost reduction business cases need to be done by the Operations team. But most operations teams are not populated with folks who are thinking of ways to make their group more efficient, and hence reduce the need for staff.

    The software exists to do all that you ask. Including analysing the collective alarms and instructing what to do about it based upon rulesets. You can even configure the monitoring software to integrate with your provisioning/maintenance interfaces and take corrective action automatically if you dare. But it all takes Professional Services to tailor it. Then more Services to migrate you onto it from what you have now. Then more Services to keep it maintained and keep tailoring it as your network evolves.

    Unless you invest in people to train and do it in-house, but then that's even more OPEX to justify in your business case, which is the hard bit for Operations guys to do in a way attractive enough to the business to sign off.

  11. Anonymous Coward

    YES!

    No further comment.

  12. Brian Miller

    I'm sorry, Dave. I'm afraid I can't do that.

    Love the wish list! Especially all the packet capturing.

    Honestly, a lot of what you want is not software, but hardware. Seriously expanded hardware. "What was the traffic for the last five minutes?" On what again, on how many ports in the system? You want something that the NSA would love, and only the NSA would be able to pay for it. Routers have 256Mb to 512Mb of memory, and switches have practically none. And you want the last five minutes of traffic available for all of those ports?? Insert appropriate Cheech and Chong quote here.
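
    To put rough numbers on that: the sketch below estimates how much storage "the last five minutes of traffic" implies. The 48-port, 1Gbit/s switch and the 10% average utilisation are assumptions picked for illustration, only one direction per port is counted, and the 512Mb figure above is read as 512 megabytes.

```python
def capture_bytes(ports, link_bps, seconds, utilisation=1.0):
    """Bytes needed to retain `seconds` of traffic on every port (one direction)."""
    return ports * link_bps / 8 * seconds * utilisation


needed = capture_bytes(ports=48, link_bps=1e9, seconds=5 * 60, utilisation=0.10)
router_memory = 512e6  # assumed: 512 MB

print("bytes to keep five minutes of traffic: %.2e" % needed)
print("multiples of router memory: %.0f" % (needed / router_memory))
```

    Even at a tenth of line rate on a modest access switch, the buffer comes out hundreds of times larger than the memory quoted for a router.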

    The reason that you haven't seen things like this is because companies don't devote a lot of resources to creating monitoring tools. When I worked for a "very large" firm that produced such a package, the development team wasn't very big. What you have asked for is rather close to Los Angeles asking for fiber, WiFi, and unicorns for everyone.

    Sure, what you want is technically feasible. But at what cost? "I want a fancy flying fortress for two Cracker Jacks box tops."

    "I'm sorry, Dave. I'm afraid I can't do that."

  13. AustinAggie

    The problem is you are managing the network over the network

    Several of these complaints are about better software, some hardware, but the real problem is that you are talking about managing network devices over the network they provide. The model is inherently flawed and guaranteed to give you the least (if any) information at the moments when you need it the most.

    A solution is to take the monitoring of network devices out-of-band using the console port and monitoring intelligence right in the rack with the gear you care about. With local monitoring, local evaluation against expected responses, and an out-of-band connection back to your NOC in the case of the network actually going down, you move to a management by exception role. Utilizing the console connection like you would if you were sitting at a crash cart in front of the device also means that basic actions can be automated to manage a device - basic stuff (cycle an interface, recognize and recover a router in ROMmon, etc.).

    Anyway, my company is Uplogix. Sorry for the self-promotion, but it didn't sound like you have seen the light yet when it comes to monitoring and I figured I could brighten your day.

  14. Smidget

    SNMP : Let's face it, nobody with any sense is about to try to produce an alternative

    Not sure what it says about the folks at the IETF, but they have been working on NETCONF (http://datatracker.ietf.org/wg/netconf/) for a good few years following on from the 2002 IAB Network Management Workshop (http://tools.ietf.org/pdf/rfc3535.pdf). Monitoring is covered by NETCONF Event Notifications (http://www.rfc-editor.org/rfc/pdfrfc/rfc5277.txt.pdf).

    Give it time ...

  15. Pu02

    WTF?

    Dude what you completely miss is the massive mountain in the middle of your critical path

    - Because any comprehensive monitoring system presents a massive security risk comparable only to the usefulness of the data it monitors (and sometimes manages)

    With the monitoring and control functions handled over the same wires, the nature of the traffic, let alone uncontrolled access (and storage!) of data and metadata of all classifications... or all the attack vectors around anything requiring read and control requiring privileged access... all this makes effective monitoring (let alone management) very hard to sell (or write, maintain and polish comprehensive 'single view' solutions for) when there is always some bright spark asking in detail about security or running pen-tests in the background.

    Big Corporates would rather hire an army of network monkeys than deliberately implement a system that allows an intruder to scope, access and attack their data. The net monkeys automagically give them the excuses they need to keep their positions in the Executive team. A monitoring system just tells everyone the truth and gets those responsible for screw ups fired. Who will sign off on the capex for that? Shareholders? They barely have a say in strategy..

    Anyone with investments in this space not hiring pros has their proverbial ass hanging over a pan as an approaching Tsunami sucks the contents of the pipes from underneath them...

  16. SteveR
    Holmes

    It is all about your NMS

    Well, interesting article. But it is all about the tools you use. Our choice is CloudView NMS http://www.cloudviewnms.com . For example, your issue with 40Gbit interfaces and their SNMP presentation is solved there... It took us a long time and an intensive search to find this one. We wanted to make complex things simple and CloudView was the only one which had the exact set of features we needed out-of-box. They claim it is scalable to thousands of nodes (we have 750 so far), can monitor/manage practically anything because it is based on standards (SNMP, syslog, TL-1, e-mail alerts....more). Web interface with multiple remote user profiles. They monitor "service path" across multiple devices. And unlike others they do not charge per size of your network - very important feature from my point of view. The manual is not that good but their e-mail support is great and enthusiastic. They add features by our request...
