even partial lessons are lessons
Last week I was at corporate HQ, where the rest of my group is, for a few days. Everything about the trip in on Monday was a model of efficiency -- the plane got in early, getting off the plane was faster than usual, Uber came right away, traffic was light -- so I got to the office about half an hour earlier than any of us expected me to.
Given that, I was a little surprised to be greeted with "oh thank heavens you're here!".
The previous weekend there'd been a catastrophic power failure and many of our servers came tumbling down. (I didn't hear the gory details. We have what I understand to be the usual precautions, and yet...) The small team responsible for that infrastructure was understandably frazzled. My teammates were happy to see me because the (internal) documentation servers are not managed by that team but by us. But their main custodian, G, was on vacation, and another person who knows relevant stuff, J, was on vacation, and that left me. I know some of the systems well but not others -- which put me ahead of anybody not on vacation. Okay.
Our doc infrastructure team has two newer members, an experienced writer who joined the company last fall and a recent grad who joined the company last month and the infrastructure team a couple weeks ago. The former has been focusing on git as my backup, and the latter is solidly in learning mode.
So first we did the usual dance of "this is not the right dock for my laptop / these are not the right monitor cables / why TF can't Windows see both of these monitors? / network, we have network right?". Once I could actually use my laptop, I settled down to investigate -- with the two newer team members watching everything I did and taking notes. It was kind of like pair programming, I think.
I think debugging (or diagnosis) is one of the most important technical skills one can have, so this is what I ended up teaching my coworkers -- not explicitly, but as I narrated the whys of what I was doing, I realized that this was what I was doing. There was plenty of backtracking, but they learned why I did the things I did even when they didn't turn out to be the right things. Like when I used ssh to connect to the server, got wonky display stuff, and realized I was talking to a Windows machine -- oops! And, err, our Windows server has sshd running on it? Today I learned. (Switched to remote desktop after that.)
The web server isn't responding -- well, is it running? The process list shows httpd; ok, where on this machine is the web server running? On Linux you can easily get the path for a process; on Windows I saw no way to do it, so off to Google and the right search terms, which took me to an answer on Stack Overflow (naturally), so that got me to the right directory and thus the server logs. At one point somebody said I must know a lot about web servers, but actually I don't -- not modern ones, anyway. But I know how to look for stuff, including response codes in the server logs. (Which told us that the server thought it was serving content just fine, even though the browser was getting errors -- even a local browser.)
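For anyone who wants to try the "look at the response codes" step themselves, here is a minimal sketch of the idea, not what I actually ran that day. It assumes Apache-style access-log lines (the "common" or "combined" format, where the status code comes right after the quoted request); the function name and the sample lines are made up for illustration.

```python
import re
from collections import Counter

# Assumption: Apache-style "common"/"combined" access-log lines, where the
# three-digit status code follows the quoted request, e.g.
#   ... "GET /docs/ HTTP/1.1" 200 5120
STATUS_RE = re.compile(r'"\s(\d{3})\s')


def count_status_codes(lines):
    """Tally HTTP status codes found in an iterable of access-log lines."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match:  # skip lines that don't look like access-log entries
            counts[match.group(1)] += 1
    return counts


# Made-up sample lines, only to show the shape of the input.
sample_lines = [
    '10.0.0.5 - - [10/Jun/2019:09:15:02 -0400] "GET /docs/ HTTP/1.1" 200 5120',
    '10.0.0.5 - - [10/Jun/2019:09:15:03 -0400] "GET /docs/a.css HTTP/1.1" 200 830',
    '10.0.0.7 - - [10/Jun/2019:09:16:40 -0400] "GET /missing HTTP/1.1" 404 196',
]

if __name__ == "__main__":
    for code, n in count_status_codes(sample_lines).most_common():
        print(code, n)
```

If the log is full of 200s while the browser shows errors, as happened to us, that points at something sitting between the server and the browser rather than at the server itself.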
There was a lot of this sort of digging. The web server was particularly mysterious because, it turned out, it was serving some content just fine but not most of it, and a chunk of our investigation revolved around unsuccessfully trying to find differences among those cases. We noted and otherwise ignored, for now, that builds were running slowly -- running is better than not running and, well, priorities. Eventually we split up and my teammates did some exploration and experiments on their own, coming to me with questions when needed. They had good instincts, yay.
We were not able to solve the problem with the web server that day. We were able to characterize some of it, but we bumped into the wall of specific missing knowledge. I wrote up what we knew and where we were blocked for the infrastructure list, and we decided that we could live with internal builds being down for a few days and that we were not going to bother G on vacation. We also identified some areas where we need to improve our internal documentation. (We do have internal documentation and we were consulting it. But there are some gaps, we learned. That happens.)
We had a team outing planned for Wednesday that G was going to be able to join us for, and everybody agreed not to say anything about this to him because we didn't want to ruin his vacation. But Tuesday night he checked email to confirm plans for the outing, saw the email thread on the infrastructure list, and fixed it.
He'll be back tomorrow and then I can ask him WTF is nginx. (I mean ok, I googled, but I have no real idea how it fits into anything on this server.)