The secret of system administration is listening.
The underlying hardware and
operating system
can either impose significant penalties
or, on the other hand, offer significant reliability
and performance boosts,
depending on how they are handled.
However, we should not be enticed by technology
into ignoring the human factor,
which in fact will normally have a more decisive effect
on performance.
In some large shops, the system administration (SA)
and database administration (DBA) roles are kept separate.
But tighter integration of these two roles
can offer significant benefits in terms of reliability and
performance.
For instance,
the restore options provided with many
relational database
products force restores to be done for the entire instance
or not at all.
Adding backup procedures at the operating system level
can offer significant improvements in flexibility,
and also provide insurance against bugs and other failures
in the database vendor's own products.
The rule that communication is key applies between
administrators and operators as well.
At one point John Ashmead was in charge of about twenty-five
VAXes at a remote site.
As was the custom at the company,
he attempted to manage all administration
electronically and by phone,
only going physically over to the site
when there was a serious problem.
As an experiment,
he tried swinging by the site on the way into work
in the morning
and chatting with the operators,
about problems on the machines of course
but also about nothing in particular.
Over time, he noticed he was actually having to spend
significantly less time trouble-shooting at the site.
In the morning discussions, hiccups and trends "too small"
to merit a formal report would come up.
Often these led to pre-emptive strikes against problems.
And the continual interaction
ensured that the operators and he were "on the same page."
They knew exactly what was required under various circumstances
and would get started even while the report was being called in.
Uptimes went up; trouble-shooting times went down.
Communication with users is obviously the most significant of all.
Several examples:
-
Three scheduled downtimes are less disruptive than one unscheduled.
If they know it is coming,
the users will find other ways to spend their time.
But if they don't,
they will lose not only the time associated with the interruption,
but also the time spent refocusing once the machine is back up,
and the time spent wondering about what they might have forgotten
in the shock of the crash.
One implication is that problematic hardware should be scheduled
for fixes at the first convenient opportunity
rather than left in service until you are sure it really is bad.
-
Technical trickery is not a substitute for talking with people.
For instance,
at one point Ashmead found that one of his VAXes was
running out of terminal lines.
The traditional recourse was to time out those who had not touched their
line in X minutes.
As an experiment,
he tried talking with the
people using the machine and explaining the problem.
It turned out that management itself was the problem: the managers
for the workgroup had not realized that the lines were
a precious resource
and that, by tying them up, they were keeping their own people from getting in.
Once this was clear,
they stayed off voluntarily: no more need to drop lines via software.
-
At most client sites, one or two people,
even though they are not members of IT,
will be particularly techno-savvy.
Extra time with these people, and extra training for them,
will head off many problems.