virtuallypotato/content/posts/2020/09/finding-the-most-popular-ips-in-a-log-file.md at 829bd5c6d6a229a3d6fa3a00601903c3b59a8fe9

mirror of https://github.com/jbowdre/virtuallypotato.git synced 2024-11-26 17:02:18 +00:00

John Bowdre 829bd5c6d6 organize content posts by year and month

2021-12-20 16:48:29 -06:00

Find IP-ish strings

This will get you all occurrences of things which look vaguely like IPv4 addresses:

grep -o -E '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' ACCESS_LOG.TXT

(It's not a perfect IP address regex since it would match things like 987.654.321.555 but it's close enough for my needs.)
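If the false positives ever matter, one possible refinement is to filter the extracted strings so that every octet is at most 255. This is only a sketch: the sample log lines are made up, and the `awk` filter is an addition that is not part of the pipeline used in the rest of this post.

```shell
# Hypothetical sample lines standing in for ACCESS_LOG.TXT
sample='GET /index.html from 10.0.0.1
bogus 987.654.321.555 entry
GET /about.html from 192.168.1.20'

# Extract IP-ish strings, then keep only those whose four octets are all <= 255
valid_ips=$(printf '%s\n' "$sample" |
  grep -o -E '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' |
  awk -F. '$1 <= 255 && $2 <= 255 && $3 <= 255 && $4 <= 255')

printf '%s\n' "$valid_ips"
```

Here 987.654.321.555 is matched by grep but dropped by the awk filter, leaving only the two plausible addresses.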

Filter out `localhost`

The log likely includes a LOT of traffic to/from 127.0.0.1 so let's toss out localhost by piping through grep -v "127.0.0.1" (-v will do an inverse match - only return results which don't match the given expression):

grep -o -E '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' ACCESS_LOG.TXT | grep -v "127.0.0.1"
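A slightly stricter variation, if it is ever needed: in a regular expression the unescaped dots in "127.0.0.1" match any character, and the pattern also matches longer addresses such as 127.0.0.10. Adding -F (fixed string) and -x (whole line) limits the exclusion to exactly 127.0.0.1. The list of addresses below is made up for illustration.

```shell
# Hypothetical output of the grep -o step, one address per line
ips='127.0.0.1
10.0.0.5
127.0.0.1
192.168.1.20'

# -v inverts the match, -F treats the pattern literally, -x requires a whole-line match
remote_ips=$(printf '%s\n' "$ips" | grep -v -F -x '127.0.0.1')

printf '%s\n' "$remote_ips"
```

For the purposes of this post the looser grep -v "127.0.0.1" is fine, since anything it additionally drops is still in the loopback range.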

Count up the duplicates

Now we need to know how many times each IP shows up in the log. uniq -c will collapse duplicate lines and prefix each one with a count of how many times it appeared, but uniq only compares adjacent lines, so we first pass the output through sort to group identical addresses together:

grep -o -E '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' ACCESS_LOG.TXT | grep -v "127.0.0.1" | sort | uniq -c
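Because uniq only collapses duplicate lines that sit next to each other, the counts depend on whether the input was sorted first. The sketch below compares the two cases on a made-up list of addresses; the trailing awk is only there to strip the padding that uniq -c puts in front of each count, which varies between implementations.

```shell
# Hypothetical addresses, with the duplicate deliberately not adjacent
ips='10.0.0.5
192.168.1.20
10.0.0.5'

# Without sort: the two 10.0.0.5 lines are counted separately
unsorted_counts=$(printf '%s\n' "$ips" | uniq -c | awk '{print $1, $2}')

# With sort: identical lines become adjacent and are counted together
sorted_counts=$(printf '%s\n' "$ips" | sort | uniq -c | awk '{print $1, $2}')

printf 'unsorted:\n%s\n\nsorted:\n%s\n' "$unsorted_counts" "$sorted_counts"
```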

Sort the results

We can use sort a second time, now on the counts, to rank the results. -n tells it to sort based on numeric rather than character values, and -r reverses the list so that the larger numbers appear at the top:

grep -o -E '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' ACCESS_LOG.TXT | grep -v "127.0.0.1" | sort | uniq -c | sort -n -r
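To see why -n matters, here is a small comparison on made-up "count address" lines. Sorting as text compares character by character, so 9 lands ahead of 10; sorting numerically ranks by the value of the leading count.

```shell
# Hypothetical counted results, in the same "count address" shape that uniq -c produces
counts='9 10.0.0.5
10 192.168.1.20
2 172.16.0.3'

# Character-based reverse sort: "9" > "2" > "1", so the count of 10 ends up last
by_text=$(printf '%s\n' "$counts" | sort -r)

# Numeric reverse sort: 10 > 9 > 2
by_number=$(printf '%s\n' "$counts" | sort -n -r)

printf 'text order:\n%s\n\nnumeric order:\n%s\n' "$by_text" "$by_number"
```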

Top 5

And, finally, let's use head -n 5 to only get the first five results:

grep -o -E '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' ACCESS_LOG.TXT | grep -v "127.0.0.1" | sort | uniq -c | sort -n -r | head -n 5
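If this gets run often, the whole pipeline could be wrapped in a small shell function. The sketch below is one way to do that; the function name top_ips, its arguments, and the sample log contents are all hypothetical, and it includes a sort ahead of uniq -c so that duplicate addresses are adjacent when they get counted.

```shell
# top_ips FILE [N]: print the N most frequent IP-ish strings in FILE (N defaults to 5)
top_ips() {
  grep -o -E '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' "$1" |
    grep -v "127.0.0.1" |
    sort | uniq -c | sort -n -r |
    head -n "${2:-5}"
}

# Build a throwaway log with made-up entries to try it out
log=$(mktemp)
printf '%s\n' \
  '10.0.0.5 GET /' \
  '127.0.0.1 GET /health' \
  '192.168.1.20 GET /' \
  '10.0.0.5 GET /about' \
  '10.0.0.5 GET /contact' \
  '192.168.1.20 GET /about' \
  '172.16.0.3 GET /' > "$log"

# awk only normalizes the padding that uniq -c adds before each count
top2=$(top_ips "$log" 2 | awk '{print $1, $2}')
printf '%s\n' "$top2"

rm -f "$log"
```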

Bonus round!

You know how old log files get rotated and compressed into files like logname.1.gz? I very recently learned that there are versions of the standard Linux text manipulation tools which can work directly on compressed log files, without having to first extract the files. I'd been doing things the hard way for years - no longer, now that I know about zcat, zdiff, zgrep, and zless!
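As a quick illustration of the idea, zgrep on a compressed file should give the same answer as decompressing with zcat and piping into plain grep. This sketch assumes the gzip tools are installed; the temporary directory and sample log lines are made up.

```shell
# Create a small compressed log with made-up entries
tmpdir=$(mktemp -d)
printf '%s\n' '10.0.0.5 GET /' '192.168.1.20 GET /' '10.0.0.5 GET /about' |
  gzip > "$tmpdir/sample.log.1.gz"

# Count matching lines directly on the compressed file...
via_zgrep=$(zgrep -c -F '10.0.0.5' "$tmpdir/sample.log.1.gz")

# ...and the long way round, decompressing to stdout first
via_zcat=$(zcat "$tmpdir/sample.log.1.gz" | grep -c -F '10.0.0.5')

printf 'zgrep: %s, zcat | grep: %s\n' "$via_zgrep" "$via_zcat"
rm -rf "$tmpdir"
```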

So let's use a for loop to iterate through 20 of those compressed logs, and use date -r [filename] to get the timestamp for each log as we go:

for i in {1..20}; do date -r ACCESS_LOG.$i.gz; zgrep -o -E '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' ACCESS_LOG.$i.gz | grep -v "127.0.0.1" | sort | uniq -c | sort -n -r | head -n 5; done
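A possible variation on that loop, for when the number of rotated logs isn't known in advance, is to iterate over a glob instead of a fixed range. This is only a sketch: the temporary directory, the file names, and the log contents are made up, and it prints the file name as a header in place of date -r so that the output is reproducible.

```shell
# Create two small compressed logs with made-up entries
workdir=$(mktemp -d)
printf '%s\n' '10.0.0.5 GET /' '10.0.0.5 GET /a' '192.168.1.20 GET /' |
  gzip > "$workdir/ACCESS_LOG.1.gz"
printf '%s\n' '172.16.0.3 GET /' '127.0.0.1 GET /health' |
  gzip > "$workdir/ACCESS_LOG.2.gz"

report=$(
  for f in "$workdir"/ACCESS_LOG.*.gz; do
    [ -e "$f" ] || continue            # skip the literal pattern if nothing matched
    echo "== $(basename "$f")"
    zgrep -o -E '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' "$f" |
      grep -v "127.0.0.1" |
      sort | uniq -c | sort -n -r | head -n 5 |
      awk '{print $1, $2}'             # normalize the padding from uniq -c
  done
)

printf '%s\n' "$report"
rm -rf "$workdir"
```

One thing to keep in mind with the glob: files are visited in lexical order, so ACCESS_LOG.10.gz would come before ACCESS_LOG.2.gz.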

Nice!
