Have You Tried Grepping It? Minimalistic Tech Support With Bash

As adeven’s primary “tech support”, I’d like to discourage call-center connotations and phrases like “turn it off; turn it on again” from rolling off your tongue. We’re a lean, mean startup, and our internal tools for troubleshooting are called tail, head and grep.

Our servers process almost a billion mobile attributions every month and write ~400 million log lines every day. The lines are designed for high verbosity and readability, to be consistent in structure, and to provide just the right data to simplify our troubleshooting. Unlike a more restrictive, opinionated system, our log lines can explain everything with minimal overhead.

This post isn’t meant as a Stack Overflow answer, but more as an insight into the tools and applications we use on a daily basis. If you’re reading this as a sysadmin, a CTO, or an engineer as irritable as my dear engineering colleagues, I’d like to demonstrate just how easy it is for your average technical account manager to pick up these built-in tools and techniques.

Reading Massive Log Files

To start with, you have heavy text files. Even with weekly rotation, they’re far too large to search through indiscriminately. We use the standard fare - tail, head and grep - to filter the output we need. It usually goes something like this:

$ tail -100000000 log/adjust.log | grep 'event tracked'
# tail outputs the last 100 million lines and grep filters for the log lines describing events

$ tail -100000000 log/adjust.log | head -1
# find the first line of a 100-million-line segment, to verify when the segment starts

$ tail -100000000 log/adjust.log | grep 'click tracked' | grep 'f0ob4r'
# find the clicks from a particular campaign ID
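
For a quick sanity check, the same pipeline can end in grep’s -c switch, which prints a count instead of the matching lines - a minimal example along the same lines:

$ tail -100000000 log/adjust.log | grep -c 'event tracked'
# -c prints the number of matching lines rather than the lines themselves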

Getting Just the Right Data

With tail, head and grep we can easily pull out the right log lines, and this is usually enough for us to be able to read logs from a particular subset of traffic. Errors will be systematic, and then it doesn’t take very long to spot what’s wrong. A lot of the time, though, we might need to grab specific data from a specific subset - especially for use in other software or in other commands. So we’d like to filter out particular fields.

There are really two ways that we go about it. On the one hand, there’s awk:

$ tail -100000000 log/adjust.log | grep 'install tracked' | awk '{print $14}' 
ca:2014-01-06T17:51:03
ca:2014-01-11T14:07:43
ca:2014-01-11T14:07:29
ca:2014-01-11T14:07:42
ca:2014-01-11T14:07:44
ca:2014-01-11T14:07:42
ca:2014-01-11T14:07:43
ca:2014-01-11T14:07:44
ca:2014-01-11T14:07:31
ca:2014-01-11T14:07:43
ca:2014-01-11T14:07:36
ca:2014-01-11T14:07:43
ca:2014-01-11T14:07:44

print $14 instructs awk to print the fourteenth column of the input, where columns are split on whitespace by default. Every time I use awk I know there is plenty more it offers. To be really effective, though, we stick to the tools we know; awk may be able to compete with Excel, but learning it properly would delay our response to an irate client. We go a little, but not much, further:

$ awk --field-separator ';' '{print $14 ", " $15 ", " $22}' log/subset.log > log/subset.csv
10:00:48.810783, 128, nonce:9qslji1he
10:00:48.845548, 124, nonce:29t8orj6m
10:00:48.859746, 130, nonce:2cjt84ru1
10:00:48.860098, 127, nonce:017ejbrrq

In this case, we’ve asked for three columns, adding commas in between. --field-separator ';' sets the delimiter to a semicolon instead of the default whitespace (the quotes keep the shell from treating the semicolon as a command separator). That’s not exactly rocket science. We also occasionally play with sed for similar outputs.
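
As a taste of what that can look like, a sed one-liner in the same spirit might be the following - a rough sketch, with the nonce label taken from the output above:

$ sed -nE 's/.*(nonce:[a-z0-9]+).*/\1/p' log/subset.log
# print only the nonce field from lines that contain one; -n plus the p flag suppresses everything else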

You’ll notice that we’ve structured the lines to provide a label in front:

ref:abc1234
nonce:2cjt84ru1

This is extremely useful to make sure that we pick the right value; we have a lot of similar alphanumeric hashes of various lengths. If I’m going to feed this data into another piece of software, however, I need to get rid of the label. awk lets me ask for a substring:

$ grep 'session tracked.*2cjt84ru1' log/adjust.log | awk '{print substr($14, 5, 7)}'
abc1234

substr() takes column 14, skips the first four characters (the ref: label) and outputs the next seven. That’s basically all we need to filter and collect data from our systems. The backend just logs an array - c’est tout.
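
When the value behind the label varies in length, awk’s split() is a slightly more forgiving alternative to hard-coded offsets - a sketch against the same field:

$ grep 'session tracked.*2cjt84ru1' log/adjust.log | awk '{split($14, parts, ":"); print parts[2]}'
# split "ref:abc1234" on the colon and print only the value, whatever its length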

Counting Frequency of Matches

A few weeks ago I was scratching my head at a malfunctioning ad campaign. There was a large volume of traffic, very little in the way of results, and some disagreement between different systems. Pulling out different fields and twisting the data back and forth revealed that a large amount of traffic shared the same reference tag - which is designed to be device-specific.

This was suspicious, especially as none of the other data - user agents, fingerprints, geolocations - followed the pattern. It only occurred on this one campaign, which precluded most suggestions that it might be a bug in our system.

$ grep 'click.*cFz43Dr' ad_campaign_23_07.log | awk '{print substr($2, 1, 2)}' | uniq --count

Here we grep for clicks carrying the offending reference tag, print the first two characters of the second field (the hour), and count the occurrences of each hour with uniq --count - the log is chronological, so identical hours sit next to each other and don’t need a separate sort. This outputs an hourly breakdown:

 7796 10
10776 11
11523 12
  700 13
    0 14
  359 15
 9001 16
15234 17
  664 18

It turned out that some server somewhere was faking a large number of clicks on the campaign - taking enough care to look sufficiently like real traffic to most algorithms, but not sophisticated enough to avoid leaving a few traces in our logs. The strange burst patterns, even more so when you started looking at intervals of seconds, made it clear that this was no coincidence. Drawing these conclusions was well within the comfortable grasp of the business side of the company.
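
For reference, zooming in from hours to minutes is just a longer substring - a sketch against the same log file as above:

$ grep 'click.*cFz43Dr' ad_campaign_23_07.log | awk '{print substr($2, 1, 5)}' | uniq --count
# substr($2, 1, 5) keeps "HH:MM", so uniq --count produces a per-minute breakdown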

In a similar case, we were asking ourselves whether any clients had a faulty environment setting, as we were making some UX changes that could adversely affect them. This setting is, naturally, present in our logs.

$ tail -100000000 log/adjust.log | grep 'sandbox' | grep --only-matching --extended-regexp 'app:[a-z0-9]{11}' | sort | uniq -c | sort --numeric-sort

It’s not something we’d worked on in particular before, and certainly nothing for which we’d have dedicated software. In this case, our support guys can easily pull up the frequency of faulty settings per app ID: just sort the matches ahead of uniq, then sort the counted output again, this time numerically by the count in the first column.

109 app:29nnx242ndm
133 app:cabwra82104
136 app:bxjbk83ndkj
143 app:klxinwur93d
159 app:cnus924byjh
168 app:una8na28r83
198 app:nvsu38new99
240 app:oe2u92dfher
273 app:nj4wunru393
279 app:94u3jngu373
284 app:nvjawn83h4k
732 app:8ndjel184nf
1132 app:n4nf35shuwe
1336 app:vncjs323meo
3057 app:iwre349v42i
11206 app:ffdsakl3134
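
When the list grows long, a final awk condition can trim it down to the worst offenders - the cutoff of 1000 here is arbitrary, purely for illustration:

$ tail -100000000 log/adjust.log | grep 'sandbox' | grep --only-matching --extended-regexp 'app:[a-z0-9]{11}' | sort | uniq -c | sort --numeric-sort | awk '$1 > 1000'
# awk's $1 is the count column written by uniq -c, so only apps with more than 1000 hits survive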

Matching Two Different Files with a Bash While Loop

In a recent case, I was working with a partner and was looking to replay some data on rejected installs into another system. Not all the data we might possibly need is always logged in one place - sometimes it’s not available to a given method; in other cases we haven’t been foresightful enough. We add a few extra morsels of data every so often, and take them away as the codebase evolves.

In order to collect all the necessary data on this subset, I was looking to connect a few different log lines and match them up into a single dataset. We start by identifying lines like this:

2014/01/12 07:18:09.126266 install rejected fingerprint with device tags [C app:nf3djs231a tracker:vdnska ref:ckxyuan hd:[HD co:ch sdk:android1.x an:Mozilla av:5.0 re:de dt:phone on:android ov:7.0.4 la:de sl:1]]

From there, we’ll pull out the adjust reference tag (here ckxyuan) and place all of those in a list:

$ grep 'install rejected' log/adjust.log | grep -oE 'ref:[0-9a-z]{7}' > refs_rejected.log

We’ll output a second file, install_callback_data.log, with all the install and callback data from that period - a mix of log lines like these:

2014/01/11 10:02:09.180040 install tracked [IC app:nf3djs231a tracker:vdnska ref:ckxyuan adid:dbaa10aa60becf8af2ff56f537d250ca sdk:android2.1.2 co:sa fc:sa fdt:phone ca:2014-01-11T10:01:48 fst:2014-01-11T10:01:48 zo:180 sc:1 hd:[HD co:sa sdk:android2.1.2 av:1.1.0 re:ae dn:SM-N9005 dt:phone on:android ov:18 la:ar ds:normal df:long dd:high dw:1080 dh:1920 sl:9]] [D tracker:vdnska ref:ckxyuan fc:sa fdt:phone fst:2014-01-11T10:01:48 lst:2014-01-11T10:01:48] [2672us]
2014/01/11 10:02:09.180212 session sent callback to callback worker [CB url:big-elephants.com data:[CD app:nf3djs231a4 tracker:vdnska ref:ckxyuan adid:dbaa10aa60becf8af2ff56f537d250ca sdk:android2.1.2 on:android ov:1389434508 co:sa nonce:dqnttho4u rand:146983781]]

You’ll note that the reftag is also mentioned in all of these lines. We need to output all the lines in install_callback_data.log that contain a reference tag from refs_rejected.log. Fortunately grep is smart enough to do all of it on its own:

$ grep -f refs_rejected.log install_callback_data.log > subset_install_callback_data.log

The -f switch instructs grep to take patterns (one per line) from refs_rejected.log, and try to match each of them against the lines of install_callback_data.log. Also, all output gets redirected to subset_install_callback_data.log. Et voilà.
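
For completeness - since the section title mentions it - the same matching can be spelled out as a bash while loop, reading one reference tag at a time; a sketch that is equivalent to, but much slower than, grep -f:

$ while read ref; do grep "$ref" install_callback_data.log; done < refs_rejected.log > subset_install_callback_data.log
# read each tag from refs_rejected.log and grep the data file for it, one pass per tag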

Funnily enough, this is a task that most of the other software I have simply can’t do.

Exporting Matches into CSV Files

Whenever I mention the time I spend in Excel, Salesforce and various other bloated enterprise applications, a chorus of sympathetic groans erupts from my colleagues on the tech team. It’s an unfortunate reality of present-day business, no matter how much you like your open-sourced, cloud-based, mobile social document manager.

The weapon of choice for making the bloat play nice is the CSV file. One way to produce one is with awk:

$ grep 'install tracked' log/installs.log | awk '{print $1, $2 "," $4 "," substr($5, 3, 5)}' > installs.csv
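
A variant worth knowing: setting awk’s OFS (output field separator) in a BEGIN block saves the manual "," strings - a sketch with the same columns as above:

$ grep 'install tracked' log/installs.log | awk 'BEGIN { OFS="," } {print $1 " " $2, $4, substr($5, 3, 5)}' > installs.csv
# with OFS set to a comma, the commas between print's arguments become CSV separators;
# $1 " " $2 keeps the date and time together as a single column, as in the version above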

Recently I’ve picked up enough Ruby - with the quick assistance of my dear Rails team - to build a little .rb that runs right in the production console and prints me a list of all our invoices. We’ve also been bringing our newest tech support guy up to speed on SQL over the last few weeks.

Conclusion

That’s a general introduction to the commands and methods we use to work with server logs without getting too mired in the depths of bash scripting. The cases described are drawn from our day-to-day analysis and, I hope, give some insight into how these methods can be used outside of, say, system administration, and how they can solve problems more effectively than Excel, log readers or in-house software.

We also work extensively in SQL, but the database is designed for a different purpose: the explicit production of the data we provide to our clients. Where the database requires migrations and consistency, the logs can be modified with a single deploy - and when we’re done, we just throw them in the trash.

For the enterprising business school graduate, there’s much more you can do when working with the CSVs and SQL outputs you might come across. Most of the time it’s combined with Excel and text editors like Sublime - but I have found the ease, efficiency and utility of basic terminal commands to be an indispensable addition to my toolbox.
