Occasionally, systems were observed to be stuck in 'connect'.
Provide a backup system to detect and forcibly kick the console in such a case.
In theory, pyghmi should be doing a self-health check. It has been discovered at scale that
this self-health check may encounter issues. For now, try to work around this by having another
health check at the confluent level, deferred by console activity. It's also spaced far apart
so it should not significantly add to idle load (one check every ~5 minutes, spread out).
Previously, offline nodes would be rechecked automatically on average every 45 seconds. Extend this
to an average of 180 seconds, to reduce ARP traffic significantly when there is a large number of
undefined nodes. The 'try to connect on open' behavior is retained, so this would mean a longer loss
of connectivity only in a background monitored session.
Knowing ahead of time that confluent is the sort of app that, despite
best efforts, is filehandle heavy, automatically attempt to raise the soft
limit to equal the hard limit. A sufficiently large cluster (i.e. more than 2000
nodes) would still need to have the limit adjusted at the system level for now.
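
A minimal sketch of raising the soft open-file limit to the hard limit, using the
standard `resource` module. The function name `raise_nofile_soft_limit` is
hypothetical and not taken from confluent; the choice to silently keep the old
limit on failure is also an assumption.

```python
import resource


def raise_nofile_soft_limit():
    """Raise the soft open-file limit to the hard limit; return the new pair."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft != hard:
        try:
            # An unprivileged process may raise soft up to hard, never beyond.
            resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
        except (ValueError, OSError):
            # Some platforms reject the request; keep the existing limit.
            pass
    return resource.getrlimit(resource.RLIMIT_NOFILE)
```

Raising the hard limit itself needs privileges or system configuration, which is
why very large clusters still need a system-level adjustment.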
The gdbm backend does not support the 'iterkeys' interface directly,
requiring manual traversal instead. Unfortunately, dbhash
does not implement the gdbm interface for this, so we have
to have two codepaths.
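
The two codepaths could look roughly like the sketch below. It relies on the
`firstkey()`/`nextkey()` methods that gdbm handles provide, and assumes any
other handle offers a mapping-style `keys()`. The helper name `iter_db_keys`
is hypothetical.

```python
def iter_db_keys(db):
    """Yield every key from a dbm-style handle, whichever backend it is."""
    if hasattr(db, 'firstkey'):
        # gdbm-style handle: walk the keys with firstkey()/nextkey().
        key = db.firstkey()
        while key is not None:
            yield key
            key = db.nextkey(key)
    else:
        # Handles that expose a mapping-style key listing (dbhash, dict).
        for key in db.keys():
            yield key
```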
Now that the problematic use of an os pipe is no more,
go ahead and patch pyghmi in a straightforward way. This
was needed for the sake of pyghmi plugins that use a webclient.
If a plugin emits a datetime object, convert it to an ISO-8601 string
on the way out. This allows plugins to work directly with datetime
objects and allows the messaging layer to normalize them to ISO-8601.
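
As an illustration of the normalization, here is a small sketch using
`datetime.isoformat()`. The function name `normalize_value` is hypothetical;
where the messaging layer actually performs this step is not shown here.

```python
import datetime


def normalize_value(value):
    """Return an ISO-8601 string for datetime values, anything else as-is."""
    if isinstance(value, datetime.datetime):
        return value.isoformat()
    return value


print(normalize_value(datetime.datetime(2020, 1, 2, 3, 4, 5)))
# prints 2020-01-02T03:04:05
```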
Messages that were not about a node (e.g. confluent users) erroneously
had data put into 'databynode'. Correct the mistake by omitting
the insertion of databynode when the message is clearly not
node-related.
The IPMI plugin was issuing redundant calls to remove the same
watcher. Track that a session has already unhooked to
avoid a double unhook (which runs at least a slight risk
of unhooking the wrong handler, *if* it were allowed).
IPMI health requests are relatively expensive. Health is
also a pretty popular thing to query, and therefore prone to be the target of
inadvertently aggressive concurrent requests. Mitigate the harm
by detecting concurrent usage and having callers share an answer.
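
The idea of having concurrent callers share one answer can be sketched as
follows. This uses plain `threading` primitives as an assumption; the class
`SharedCall` and its `call` method are hypothetical names, and the real plugin
may coordinate callers differently.

```python
import threading


class SharedCall(object):
    """Let concurrent callers for the same key share one result."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # key -> (done event, result holder dict)

    def call(self, key, func):
        with self._lock:
            entry = self._inflight.get(key)
            leader = entry is None
            if leader:
                # First caller for this key does the expensive work.
                entry = (threading.Event(), {})
                self._inflight[key] = entry
        done, holder = entry
        if leader:
            try:
                holder['value'] = func()
            except Exception as exc:
                holder['error'] = exc
            finally:
                with self._lock:
                    del self._inflight[key]
                done.set()
        else:
            # Later callers wait for the leader and reuse its answer.
            done.wait()
        if 'error' in holder:
            raise holder['error']
        return holder['value']
```

Once the leader finishes, the entry is dropped, so a later request triggers a
fresh check rather than reusing a stale answer.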
On receiving a USR1 signal, confluent already provided
'the' current stack, which is useful for diagnosing really hard
hangs. However, it's frequently informative to see all
the thread stack traces, so add that data to the diagnostic
feature.
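
A rough sketch of dumping every thread's stack on USR1, using
`sys._current_frames()` and `traceback.format_stack()`. The function names are
hypothetical, writing to stderr is an assumption, and this covers OS-level
Python threads only (green threads would need separate handling).

```python
import signal
import sys
import threading
import traceback


def format_all_thread_stacks():
    """Return one text block with the stack of every Python thread."""
    names = {t.ident: t.name for t in threading.enumerate()}
    chunks = []
    for ident, frame in sys._current_frames().items():
        chunks.append('Thread %s (%s):\n' % (ident, names.get(ident, '?')))
        chunks.extend(traceback.format_stack(frame))
    return ''.join(chunks)


def dump_on_usr1(signum, frame):
    # Assumption: stderr is an acceptable destination for the dump.
    sys.stderr.write(format_all_thread_stacks())


def install_usr1_handler():
    # Must be called from the main thread, as signal.signal() requires.
    signal.signal(signal.SIGUSR1, dump_on_usr1)
```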
When a terminal closes and notifies the server, it was
pulling the rug out from under asyncsession consoles.
Make asyncsession aware that the console may be gone
and discard tracking it rather than give a 500.
When an error (to be fixed) happened while updating expiry,
an asyncsession failed to have a reaper scheduled for cleanup.
Correct this by putting the reaper schedule right after the
cancellation.
Further, an async being destroyed did not reap related console
sessions. Add code to reap related console sessions when
the async session gets destroyed.
When read_recent_text ran off a cliff looking for buffer data,
it left the current textfile handle in a bad state. This caused
the buffer rebuild to fail completely in a scenario where all the
current logs put together don't have enough data to satisfy the
buffer. Fix this by making the handle more obviously broken, and
repairing while seeking out data.
Users have complained that log data was lost and that old data was unavailable. This changes
the default behavior to be indefinite retention. Users noting a lot of logs using space have a nice
intuitive indication of old files to delete, and the option remains for those who want to request a log expiration.
Previously, the connection would fail and log to trace without anything
particularly informative for the client (they just saw 'unexpected error').
Provide a more informative behavior for the client.
If exiting from a shell session, the databuffer will contain needed info for the client
to work properly. Preserve databuffer existence. Responsibility for deleting the
object should be in the hands of the caller.
The rollback support and replay did not follow more than one log back. Do the work to recurse
into older and older files, until the buffer is big enough or we run out of files.
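
The walk back through older files can be sketched as below. This is a
simplified illustration, not the actual log reader: the function name
`collect_recent_data` and the newest-first list of paths are assumptions, and
each file is read whole for brevity.

```python
def collect_recent_data(paths_newest_first, wanted):
    """Collect up to `wanted` trailing bytes, walking back through older files."""
    chunks = []
    remaining = wanted
    for path in paths_newest_first:
        if remaining <= 0:
            break  # buffer is big enough
        with open(path, 'rb') as handle:
            data = handle.read()
        # Only the tail of each file is needed; older data goes in front.
        tail = data[-remaining:]
        chunks.insert(0, tail)
        remaining -= len(tail)
    # Running out of files simply yields a shorter result.
    return b''.join(chunks)
```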
If a user can connect, but gets removed mid session, traces were
being generated. Correct by recognizing the circumstance and returning
the appropriate error to the client.
When initializing the security key, a background thread may be started. Sometimes,
the system would go to daemonize while that thread was still running, and
the whole system could exit, leading to an incomplete write to globals as well
as leaving the daemon looking at the data copied over from pre-fork and
seeing the last state of that thread forever frozen. Make sure the background
threads are fully done prior to exiting.
It seems it is possible in some circumstances for the thread id to become stale,
perhaps due to a different threadid executing the code for some reason.
Just in case, ensure the same exact value that was added is later discarded.
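
The pattern of discarding exactly the value that was added can be sketched like
this. The `active` set and `run_tracked` are hypothetical names, and using
`threading.current_thread().ident` as the identifier is an assumption.

```python
import threading

active = set()  # hypothetical set of ids currently inside the section


def run_tracked(func):
    # Capture the identifier once so add and discard use the same value,
    # even if a fresh lookup would return something different later.
    token = threading.current_thread().ident
    active.add(token)
    try:
        return func()
    finally:
        active.discard(token)
```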