As adeven’s primary “tech support”, I’d like to discourage call-center connotations and phrases like “turn it off; turn it on again” from rolling off your tongue. We’re a lean, mean startup, and our internal tools for troubleshooting are called tail, head and grep.
Our servers process almost a billion mobile attributions every month, and ~400 million log lines every day. These lines are designed for high verbosity and readability, for consistent structure, and to provide just the right data to simplify our troubleshooting. Unlike a system with restrictions and opinions, our log lines can explain everything with minimal overhead.
This post isn’t meant as a Stack Overflow answer, but more as an insight into the tools and applications we use on a daily basis. If you’re reading this as a sysadmin, a CTO, or an engineer as irritable as my dear engineering colleagues, I’d like to demonstrate just how easy it would be for your average technical account manager to learn these built-in techniques and tools.
Reading Massive Log Files
To start with, you have a heavy text file. Even with weekly rotation, the files are far too large to search through indiscriminately. We use the standard fare of tail, head and grep to filter the output we need. It usually goes something like this:
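A minimal, self-contained sketch of that kind of pipeline. The log format and file name are invented for illustration, not our real format:

```shell
# The log format and file name here are invented for illustration.
# Create a small sample log so the pipeline below actually runs:
printf '%s\n' \
  '2014-05-01T10:00:01 click app_token:abc123 ref:ckxyuan' \
  '2014-05-01T10:00:02 install app_token:abc123 ref:qqzzwwe' \
  '2014-05-01T10:00:03 click app_token:def456 ref:aabbccd' \
  > sample.log

# tail -f sample.log would follow the file as it grows; here we take the
# last two lines, keep only the clicks, and show the first match:
tail -n 2 sample.log | grep 'click' | head -n 1
```

In practice the input is the live log file, and the grep pattern is whatever token, app ID or error you’re chasing.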
Getting Just the Right Data
With tail, head and grep we can easily pull out the right log lines, and this is usually enough for us to be able to read logs from a particular subset of traffic. Errors will be systematic, and then it doesn’t take very long to spot what’s wrong. A lot of the time, though, we might need to grab specific data from a specific subset - especially for use in other software or in other commands. So we’d like to filter out particular fields.
There are really two ways that we go about it. On the one hand, there’s awk:
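A sketch of the column-printing idea, with an invented fourteen-field line (our real field layout isn’t shown here):

```shell
# An invented space-delimited line with fourteen fields:
line='2014-05-01T10:00:01 srv1 click v1 200 12ms eu-1 a b c d e f ref:ckxyuan'

# Print the fourteenth column; awk splits on whitespace by default:
echo "$line" | awk '{print $14}'
# -> ref:ckxyuan
```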
print $14 instructs awk to print the fourteenth column of the input, where columns are delimited by a space by default. Every time I use awk I know there is plenty more it offers. To be really effective, though, we stick to the tools we know; awk may be able to compete with Excel, but learning it properly would delay our response to an irate client. We go a little, but not much, further:
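A sketch with an invented semicolon-delimited line; -F is the short, portable spelling of GNU awk’s --field-separator:

```shell
# Invented field layout: event;app_token;reftag;timestamp;country.
# -F';' is equivalent to GNU awk's --field-separator=';':
echo 'click;abc123;ckxyuan;1398938401;DE' | awk -F';' '{print $2 "," $3 "," $5}'
# -> abc123,ckxyuan,DE
```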
In this case, we’ve asked for three columns, adding commas in between. Adding --field-separator sets the delimiter to a semicolon instead of the default space. That’s not exactly rocket science. We also occasionally play with sed for similar outputs.
You’ll notice that we’ve structured the lines to provide a label in front:
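For illustration, a labeled field might look like this (the labels and values are invented, not our real format):

```
2014-05-01 10:00:01 click app_token:abc123 ref:ckxyuan fingerprint:9f8e7d6c
```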
This is extremely useful for making sure that we pick the right value; we have a lot of similar alphanumeric hashes of various lengths. If I’m going to feed this data into another piece of software, however, I need to get rid of the label. awk lets me ask for a substring:
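A sketch of the substring trick, with an invented line; the four-character label ref: is stripped, leaving the seven-character tag:

```shell
# Column 14 carries a labeled value such as ref:ckxyuan (invented format).
# substr($14, 5, 7) starts at character five and takes seven characters:
echo 'f1 f2 f3 f4 f5 f6 f7 f8 f9 f10 f11 f12 f13 ref:ckxyuan' \
  | awk '{print substr($14, 5, 7)}'
# -> ckxyuan
```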
substr() takes column 14, starts at the fifth character, and outputs the next seven characters. That’s basically all we need to filter and collect data from our systems. The backend just logs an array - c’est tout.
Counting Frequency of Matches
A few weeks ago I was scratching my head at a malfunctioning ad campaign. There was a large volume of traffic, very little result, and some disagreements between different systems. Pulling out different fields and twisting the data back and forth revealed that a large amount of traffic shared the same reference tag - which is designed to be device-specific.
This was suspicious, especially as none of the other data - user agents, fingerprints, geolocations - followed the pattern. It only occurred on this one campaign, which ruled out most suggestions that it might be a bug in our system.
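A self-contained sketch of that count, assuming an invented format where the second field is an HH:MM:SS timestamp; -c is the short form of uniq’s --count:

```shell
# Invented log format; field 2 is an HH:MM:SS timestamp:
printf '%s\n' \
  '2014-05-01 09:12:01 click ref:ckxyuan' \
  '2014-05-01 09:45:13 click ref:ckxyuan' \
  '2014-05-01 10:02:40 click ref:ckxyuan' \
  > clicks.log

# Grep for the offending tag, cut the hour out of the second field,
# and count each run of identical hours (-c is uniq's --count):
grep 'ref:ckxyuan' clicks.log | awk '{print substr($2, 1, 2)}' | uniq -c
```

Because the log is already in time order, identical hours are adjacent and uniq needs no prior sort.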
Here, we grep for clicks from the offending reference tag, print out part of the second field (the hour), and count the runs of matching hours with uniq --count. This outputs an hourly breakdown.
It turned out that some server somewhere was faking a large number of clicks on the campaign - taking enough care to look sufficiently like real traffic to most algorithms, but not sophisticated enough to avoid leaving a few traces in our logs. The strange burst patterns, even more so when you started looking at intervals of seconds, made it clear that this was no coincidence. Drawing these conclusions was well within the comfortable grasp of the business side of the company.
In a similar case, we were asking ourselves whether there were clients with a faulty environment setting, as we were making some UX changes that could adversely affect them. This switch is, naturally, present in our logs.
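A sketch with invented session lines, counting a hypothetical env:sandbox setting per app ID:

```shell
# Invented format: field 3 is the app ID, env:sandbox marks the faulty setting.
printf '%s\n' \
  'session env:sandbox app:111' \
  'session env:sandbox app:222' \
  'session env:sandbox app:111' \
  > sessions.log

# Sort so identical app IDs are adjacent, count them with uniq -c,
# then sort again numerically by the first key (the count), descending:
grep 'env:sandbox' sessions.log | awk '{print $3}' | sort | uniq -c | sort -rn
```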
It’s not something we’ve worked on in particular before, and certainly nothing for which we’d have software. In this case, our support guys can easily pull up the frequency of faulty settings per app ID: just sort the data ahead of uniq, and sort it again afterwards, this time not by the hour but by the frequency, the first key.
Matching Two Different Files with a Bash While Loop
In a recent case, I was working with a partner and looking to replay some data on rejected installs into another system. Not all the data we might need is always logged in one place - sometimes it’s not available to a given method; in other cases we haven’t been foresighted enough. We add a few extra morsels of data every so often, and take them away as the codebase evolves.
In order to collect all the necessary data on this subset, I was looking to connect a few different log lines and match them up into a single dataset. We start by identifying lines like this:
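An invented illustration of such a line (the labels and the reason are hypothetical):

```
2014-05-01 10:00:01 rejected_install app:111 ref:ckxyuan reason:too_many_engagements
```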
From there, we’ll pull out the adjust reference tag (here ckxyuan) and place all of those in a list:
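A sketch of building that list, with invented input lines; substr() strips the four-character ref: label as before:

```shell
# Invented sample of rejected-install lines:
printf '%s\n' \
  'rejected_install app:111 ref:ckxyuan' \
  'rejected_install app:222 ref:aabbccd' \
  > rejected.log

# Keep the seven-character tag from the third field, one per line:
grep 'rejected_install' rejected.log | awk '{print substr($3, 5, 7)}' > refs_rejected.log
cat refs_rejected.log
```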
We’ll output a second file with all the install and callback data from that period, in install_callback_data.log. These are a mix of log lines such as these:
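For illustration, assume the mix looks something like this (invented shapes):

```
2014-05-01 10:00:02 install app:111 ref:ckxyuan
2014-05-01 10:00:03 callback app:111 ref:ckxyuan status:200
```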
You’ll note that the reftag is also mentioned in all of these lines. We need to output all the lines in install_callback_data.log that contain a reference tag from refs_rejected.log. Fortunately, grep is smart enough to do all of it on its own:
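A self-contained sketch; the two input files are stubbed with invented lines so the command can run:

```shell
# Stand-ins for the two real files, with invented contents:
printf '%s\n' 'ckxyuan' 'aabbccd' > refs_rejected.log
printf '%s\n' \
  'install app:111 ref:ckxyuan' \
  'callback app:111 ref:ckxyuan status:200' \
  'install app:333 ref:zzzzzzz' \
  > install_callback_data.log

# Keep every line that matches any pattern (one per line) from refs_rejected.log:
grep -f refs_rejected.log install_callback_data.log > subset_install_callback_data.log
cat subset_install_callback_data.log
```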
The -f switch instructs grep to take patterns (one per line) from refs_rejected.log, and to try to match each of them against the lines of install_callback_data.log. All output gets redirected to subset_install_callback_data.log. Et voilà.
Funnily enough, this is a task that most of the other software I use simply can’t do.
Exporting Matches into CSV Files
Whenever I mention the time I spend in Excel, Salesforce and various other bloated enterprise applications, a chorus of sympathetic groans erupts from my colleagues in the tech team. It’s an unfortunate reality of present-day business, no matter how much you like your open-sourced, cloud-based, mobile, social document manager.
The weapon of choice for making the bloat play nice is the CSV file. One way to produce one is with awk:
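A sketch with an invented semicolon-delimited layout: pick the fields, join them with commas, and redirect into a .csv file:

```shell
# Invented layout: event;app_token;reftag;timestamp;country
printf '%s\n' \
  'click;abc123;ckxyuan;1398938401;DE' \
  'click;def456;aabbccd;1398938500;US' \
  > events.log

# Reorder the fields we want and emit comma-separated values:
awk -F';' '{print $2 "," $3 "," $5}' events.log > events.csv
cat events.csv
```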
Recently I’ve picked up enough Ruby - with the quick assistance of my dear Rails team - to build a little .rb script that runs right in the production console and prints me a list of all our invoices. We’ve also been bringing our newest tech support guy up to speed on SQL over the last few weeks.
Conclusion
That’s a general introduction to the commands and methods we use to work with our server logs without getting too mired in the depths of bash scripts. The cases described are derived from our day-to-day analysis, and I hope they give an insight into how these methods can be used outside of e.g. system administration, and how they can solve problems more effectively than Excel, log readers or in-house software.
We also work extensively in SQL, but the database is designed for a different purpose: the explicit production of the data we provide to our clients. Where the database requires migrations and consistency, the logs can be modified with a single deploy - and when we’re done, we just throw them in the trash.
For the enterprising business school graduate, there’s much more you can do with the CSVs and SQL outputs you might come across. Most of the time, they’re combined with Excel and text editors like Sublime - but I have found the ease, efficiency and utility of basic terminal commands to be an indispensable addition to my toolbox.