A46 Tech Blog: I Can Linux and So Can You (bash commands) pt.3

Previously we worked on viewing and editing files using cat, vi, and sed. Now let's talk about searching and reporting. While it would arguably be simpler to learn Python or Perl for these purposes, we are instead going to discuss grep, awk, and some basic regex. Despite the awkwardness of awk and the easy access to regex in languages like Perl and Python, awk is still used by many. Variants also exist, such as gawk (GNU awk) and nawk (new awk), but what I cover here works in all of them, and some flavor of awk generally ships with the box. So I will stick to plain awk for now. First though, I think a discussion of regex is in order. So buckle up, because it's gonna get bumpy.

Sidebar, you can test these out with grep real quick. Simply do this:

$ echo 'The quick brown fox jumped over the lazy dog.' > test.txt

Then run the regex without the outside slashes like so:

$ grep --color -E 'regex here' test.txt
or
$ egrep --color 'regex here' test.txt

Regular expressions use lots of special syntax for searching and grouping complex text. Oftentimes people will already be familiar with some of it because the same symbols show up elsewhere. I'm sure most are familiar with "wildcards" (*), although in regex the asterisk behaves a little differently than it does in a shell, as we'll see. There are also several different flavors of regular expressions, but I will try to avoid flavor-specific things and keep this as generic as possible. So let's start with a sentence to search.

The quick brown fox jumped over the lazy dog.

Nearly every letter in the alphabet is in there (the classic version says "jumps", which is where the s comes from). Now let's assume we are trying to find this line. Let's work to match as much of this line as possible and cover as much syntax as we can. To encapsulate each regex, I will keep it between two forward slashes because that's commonly how you will come across it. So let's talk about the start of the sentence. That uses a special character, in this case ^. There are many special characters ([\^$.|?*+(){}), and matching one of them literally requires putting a backslash (\) in front of it. Next we want to match a capital letter. The thing we want for this is the bracket expression ([]), also called a character class. This allows us to match any one of a set of characters in a single position. For example, vowels would be [aeiou], capitals would be [A-Z], digits would be [0-9] (or \d in Perl-style regex; grep -E does not understand \d), lower case would be [a-z], and alphanumeric would be [A-Za-z0-9] (\w is close, but it also matches the underscore). Okay, so let's make magic.

/^[A-Z]/
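
To see the anchor and the bracket expression in action, here is a small sketch. It assumes GNU grep, and the sample lines and variable names are made up for the demonstration.

```shell
# Force the C locale so [A-Z] means exactly the ASCII capital letters.
export LC_ALL=C

# Two sample lines: one starts with a capital letter, one does not.
sample=$(printf '%s\n' 'The quick brown fox' 'the lazy dog')

# ^ anchors the match to the start of the line; [A-Z] is one capital letter.
starts_with_capital=$(printf '%s\n' "$sample" | grep -E '^[A-Z]')
echo "$starts_with_capital"

# A special character such as the period needs a backslash to be literal.
literal_dot=$(printf '%s\n' 'a.b' 'axb' | grep -E 'a\.b')
echo "$literal_dot"
```

Only the line starting with a capital T survives the first search, and only a.b survives the second.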

Okay, that will only match a string that begins with a capital letter. Now let's narrow it a little more. Next we want to match 2 non-space characters and a space. To match a non-space character we use \S, and to match whitespace (a space or a tab) we use \s; a literal space is just a space. Or we can use \w for a word character, or [a-z] for a lower case letter. (Some tools, vim for one, accept \l for lower case, but grep does not.) So all of these work.

/^[A-Z]\S\S\s/
/^[A-Z]\w\w /

Confused yet? Yeah, it's complicated, with many ways to do nearly the same thing that are each slightly different. Okay, so now let's talk about the next word. It's five lowercase letters. We can do this pretty easily, because there's a way to look for a pattern of length n, or of between n and m characters. We do something like [a-z]{5} for exactly five characters, or something like say... [a-z]{2,5} for between 2 and 5 characters. So let's do the 5.

/^[A-Z]\S\w\s[a-z]{5} /
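
Here is a rough sketch of the counted repeats, assuming GNU grep (for \S, \w, \s, and the -o option). The second input string is made up just to show the range form.

```shell
# Force the C locale so the letter ranges mean plain ASCII.
export LC_ALL=C
sentence='The quick brown fox jumped over the lazy dog.'

# -o prints only the part of the line that matched.
# {5} demands exactly five lowercase letters before the trailing space.
exact=$(echo "$sentence" | grep -oE '^[A-Z]\S\w\s[a-z]{5} ')
echo "[$exact]"

# {2,5} takes as many letters as it can, up to five, per match.
ranged=$(echo 'ab abcdefg' | grep -oE '[a-z]{2,5}')
echo "$ranged"
```

The first command prints [The quick ] and the second prints ab, abcde, and fg on separate lines, because the seven-letter run is split into a match of five and a match of two.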

Okay, so that takes care of quick, and now we are at the word brown. Let's assume we don't know the length of the word we are matching, just that it's made of word characters. We know it's at least one character long. The plus (+) character comes into play here. A pattern followed by a + will match 1 or more instances of it. So \w+ matches brown, and [0-9]+ (or \d+ in Perl-style regex) matches 8675309 or 1.

/^[A-Z]\S\w\s[a-z]{5} \w+/
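
A quick sketch of the plus, assuming GNU grep. The phone-number line is a made-up sample, and [0-9] stands in for \d since grep -E does not know \d.

```shell
# One or more digits in a row: each run is reported as a separate match.
digits=$(echo 'Call 8675309 or press 1' | grep -oE '[0-9]+')
echo "$digits"

# One or more word characters: each word is reported as a separate match.
words=$(echo 'brown fox' | grep -oE '\w+')
echo "$words"
```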

Can we get more complicated? Yes we can. After brown comes a space, then fox and the rest of the sentence. Let's match literally anything from here on, 0 or more characters. For that we use the period (.), which matches one of anything, and the asterisk (*), which matches 0 or more of the preceding pattern. If you wanted one or more, you would use .+ to achieve that. We'll add a space after, so it would either match two spaces or some text followed by a space. The catch is that .* is greedy: it will match from the end of brown all the way to the last instance of a space.

/^[A-Z]\S\w\s[a-z]{5} \w+ .* /
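
To see just how much .* swallows, here is a sketch assuming GNU grep. The second pattern, with [^ ]* (any run of non-space characters), is my own contrast and not something used elsewhere in this post.

```shell
export LC_ALL=C
sentence='The quick brown fox jumped over the lazy dog.'

# .* followed by a space runs all the way to the LAST space on the line.
greedy=$(echo "$sentence" | grep -oE '^[A-Z]\S\w\s[a-z]{5} \w+ .* ')
echo "[$greedy]"

# [^ ]* cannot cross a space, so the match stops after the next word.
one_word=$(echo "$sentence" | grep -oE '^[A-Z]\S\w\s[a-z]{5} \w+ [^ ]* ')
echo "[$one_word]"
```

The first prints everything up to and including the space before dog, while the second stops right after fox.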

Okay, so two things left, the word dog and a period. Let's say we want to match either dog, or cat, or neither? Well, we can do all of that. To do "or" we use the pipe character (|), and to keep it clean we will use a group, which goes in parentheses (()). To match 1 or 0 instances of something, the pattern gets followed by a question mark (?).

/^[A-Z]\S\w\s[a-z]{5} \w+ .* (dog|cat)?/
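
Here is the group on its own, as a sketch with made-up sample lines, each ending in a period so the optional group can be seen doing its job.

```shell
# (dog|cat)? allows dog, cat, or nothing at all between the space and period.
animals=$(printf '%s\n' 'lazy dog.' 'lazy cat.' 'lazy .' 'lazy cow.' \
  | grep -E '^lazy (dog|cat)?\.$')
echo "$animals"
```

The first three lines match; the cow does not, since it is neither of the alternatives nor nothing.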

Okay, the light at the end of this tunnel is near. Or is that a train? No matter, into the breach! The sentence ends in a period. We need to match a period. Two things. First, since the period is a special character we need to escape it with a backslash (\). Second, the end is marked with a dollar sign ($), like in vi and vim.

/^[A-Z]\S\w\s[a-z]{5} \w+ .* (dog|cat)?\.$/
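
The escape matters more than it looks. A small sketch, using made-up sample endings, of what happens with and without the backslash:

```shell
# Three sample endings; the last one has a trailing space after the period.
endings=$(printf '%s\n' 'dog.' 'dog!' 'dog. ')

# Escaped: only a real period directly before the end of the line matches.
escaped=$(printf '%s\n' "$endings" | grep -E 'dog\.$')
echo "$escaped"

# Unescaped: the period means "any one character", so dog! sneaks in too.
unescaped=$(printf '%s\n' "$endings" | grep -E 'dog.$')
echo "$unescaped"
```

Neither version matches the line with the trailing space, because $ insists that the match ends where the line ends.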

Now look at that. What a mess that is. Obviously we won't often need super complicated stuff like that. There's still a lot more, but this should help get you started. For a better reference you can check out this regex quick reference, and it also goes over the difference in types of regex. If you want a good amount of flexibility, Perl compatible regex is usually the way to go.

So now let's talk commands. Searching and reporting are two things computers should excel at. Most Linux distributions today come with grep and awk, along with their extended counterparts egrep and gawk. Running grep -E is the same as running egrep, and newer GNU grep releases print a warning that egrep is obsolescent, so grep -E is the safer habit. I suggest using the --color option if it's not already aliased on your distribution so you can see what part matches. I'll cover aliases in another portion. The awk and gawk commands implement a small programming language made for searching and reporting. Usually you can turn to Perl or another programming language, but awk works for a quick and dirty one-liner.

So let's start with searching. The grep and egrep commands can be used to search files or the output of other commands. When you need to narrow your output to a readable level, this will be the go-to. Oftentimes I just use plain grep because I'm searching for a word and not an expression, but we'll check out both. I think a good, well-populated folder to use for demonstration is /dev, since the names in there follow consistent patterns. So let's take a look.

To get an idea of how many files are in dev, take a moment to just look.

$ ls /dev

I'm not going to post mine because that's a long list. Now let's say we are trying to find out whether a partition exists on our hard drive. Well, it's listed in here. We know our drive is sda, so how do we list all the sda entries? Well, we pipe the output from ls through grep to do a search.

$ ls /dev | grep 'sda'
sda
sda1
sda2
sda3
sda4
sda5
sda6

Cool, so here we see I have 6 partitions (The first three are Windows related, 4 is extended, 5 is kali, and 6 is swap). So let's assume we want to just make a cut-and-paste command to search for hard drives in other older systems. Some use hda instead of sda. Let's also assume we want to check for multiple hard drives, so there could be an sdb or an hdb. Easy.

$ ls /dev | grep -E '^(h|s)d[a-z]'

If we don't include the ^ at the beginning, I get watchdog in the result because of the hdo in the middle of it. That's really all there is to it for simple searches of output. We can also search through the files of entire directories with the -r option.
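
Both points can be tried without touching a real /dev. This is a sketch assuming GNU grep; the device names and the temporary files are stand-ins I made up so the result is the same on any machine, and -l (list matching file names only) is an extra option not covered above.

```shell
export LC_ALL=C

# Stand-in for the output of `ls /dev`.
listing=$(printf '%s\n' hda hda1 sda sda1 sdb watchdog tty0)

# Anchored: only names that START with hd or sd plus a letter.
anchored=$(printf '%s\n' "$listing" | grep -E '^(h|s)d[a-z]')
echo "$anchored"

# Unanchored: watchdog slips in because of the "hdo" in the middle.
unanchored=$(printf '%s\n' "$listing" | grep -E '(h|s)d[a-z]')
echo "$unanchored"

# -r searches every file under a directory; -l prints only the file names.
tmpdir=$(mktemp -d)
mkdir -p "$tmpdir/sub"
echo 'needle here' > "$tmpdir/sub/a.txt"
echo 'nothing to see' > "$tmpdir/b.txt"
found=$(grep -rl 'needle' "$tmpdir")
echo "$found"
rm -rf "$tmpdir"
```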

Beyond search filters, it's often necessary to report the findings, and it helps to dump the results into an actually readable format. For now, let's use a custom file we will call searchtest.txt. Here's what I put in it.

id:title:author
1:The Origins of Modern Science:Herbert Butterfield
2:Catch 22:Joseph Heller
3:1984:George Orwell
4:Animal Farm:George Orwell

I made a column title as well. So let's run a quick search.

$ grep -i 'george' searchtest.txt
3:1984:George Orwell
4:Animal Farm:George Orwell

The -i option tells it to ignore the case. Now this is all fine and dandy, but still a bit difficult to read. So let's clean up the reporting phase using awk. We'll discard the id and just print the title and author.

$ awk -v FS=: '{print $2,$3}' searchtest.txt
title author
The Origins of Modern Science Herbert Butterfield
Catch 22 Joseph Heller
1984 George Orwell
Animal Farm George Orwell

Okay, so we have a lot going on here. We are using -v to set a variable, in this case changing the Field Separator (FS) from the default of whitespace to a colon (:). This allows us to access each field with a dollar sign ($) followed by its position number. In the case of this file, we have $1, $2, and $3. We then have an actual code block where we print fields $2 and $3 separated by the Output Field Separator (OFS), which defaults to a space. Okay, this is still messy. We could alter the OFS to make it clearer. There is also another method that formats your print statements a little better, printf.
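
Changing the OFS looks something like this. It's a sketch that recreates the sample file in a temporary location (the temporary path and the ' | ' separator are my own choices for the demonstration).

```shell
# Recreate the sample file from above in a temporary location.
books=$(mktemp)
cat > "$books" <<'EOF'
id:title:author
1:The Origins of Modern Science:Herbert Butterfield
2:Catch 22:Joseph Heller
3:1984:George Orwell
4:Animal Farm:George Orwell
EOF

# The comma in print is what inserts OFS between the two fields.
report=$(awk -v FS=: -v OFS=' | ' '{print $2, $3}' "$books")
echo "$report"
rm -f "$books"
```

Each line now comes out like "Catch 22 | Joseph Heller", which at least shows where the title ends, even if the columns still don't line up.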

$ awk -v FS=: '{printf "%-29s|%s\n", $2, $3}' searchtest.txt
title                        |author
The Origins of Modern Science|Herbert Butterfield
Catch 22                     |Joseph Heller
1984                         |George Orwell
Animal Farm                  |George Orwell

Okay, so now we can actually read it. Here's how the printf is working. The %s is a string we're going to substitute in, in order of the arguments passed to printf. The %-29s is to make sure the string is padded to a length of 29 characters long and the - makes it align to the left, default without the - is to the right. Now let's say we want to not include that first line. We can add a filter for that.

$ awk -v FS=: '/^[0-9]/ {printf "%-29s|%s\n", $2, $3}' searchtest.txt
The Origins of Modern Science|Herbert Butterfield
Catch 22                     |Joseph Heller
1984                         |George Orwell
Animal Farm                  |George Orwell
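
The filter in front of the block doesn't have to be a regex against the whole line. As a sketch of two alternatives (the temporary file is a stand-in for searchtest.txt, and -F: is shorthand for -v FS=:), NR is awk's built-in line counter and ~ tests a single field against a regex:

```shell
# Recreate the sample file in a temporary location.
books=$(mktemp)
cat > "$books" <<'EOF'
id:title:author
1:The Origins of Modern Science:Herbert Butterfield
2:Catch 22:Joseph Heller
3:1984:George Orwell
4:Animal Farm:George Orwell
EOF

# NR > 1 skips the first line, whatever it happens to contain.
no_header=$(awk -F: 'NR > 1 {printf "%-29s|%s\n", $2, $3}' "$books")
echo "$no_header"

# $3 ~ /Orwell$/ keeps only lines whose author field ends in Orwell.
orwell=$(awk -F: '$3 ~ /Orwell$/ {print $2}' "$books")
echo "$orwell"
rm -f "$books"
```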

So there we can see that it accepts a regex filter. You can use this to filter through, select, and print out reports of any text files you have. Learning more about printf will also allow you to do a lot of formatting on the reporting. As mentioned, awk and gawk are programming languages in themselves, but going into all of that detail right now would be lengthy. So for now, let's discuss a bit more of the ins and outs of printf.

The printf function exists in many programming languages and even as a Bash command. You may be wondering what %s means. Well, the % introduces a format specifier and the s indicates a string, so in this case it's printing a string. If you wanted to simply print a percent sign, you'd have to type %%. You also have %c, which prints a single character (in awk, a numeric argument is treated as a character code), %d and %i which print integers, %e and %E which print a number in scientific notation, %f and %F for floating point numbers, %g and %G which print in either scientific notation or floating point (whichever takes fewer characters), %o which prints numbers in octal, %u which prints unsigned integers, and %x and %X which print in hexadecimal, where %X uses upper case and %x uses lower case.

Formatting modifiers can be added as well. A - justifies to the left; the default is to the right. A + tells it to always print the sign, positive or negative. A number sets the minimum width, and a decimal such as 5.2 indicates a minimum width of 5 characters with a precision of 2 digits after the decimal point. A leading zero pads a number with zeros instead of spaces. A # tells it to use an alternate form for certain numbers, like the 0x prefix for hexadecimal. A ' prints numbers with a thousands separator, where the locale and the printf implementation support it. As you can see, most of the formatting revolves around numbers, but the main thing is that setting a width lets you put your output in columns, which can make it easier to read.
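
Most of these can be tried straight from the prompt with Bash's own printf. A quick sketch, with made-up values and variable names; the C locale is forced so the decimal point is a period:

```shell
export LC_ALL=C

left=$(printf '[%-8s]' 'awk')        # [awk     ]
right=$(printf '[%8s]' 'awk')        # [     awk]
signed=$(printf '%+d' 42)            # +42
zero_padded=$(printf '%05d' 42)      # 00042
hex_lower=$(printf '%#x' 255)        # 0xff
hex_upper=$(printf '%X' 255)         # FF
octal=$(printf '%o' 8)               # 10
fixed=$(printf '%5.2f' 3.14159)      # " 3.14" (width 5, precision 2)
scientific=$(printf '%.2e' 12345.678) # 1.23e+04

printf '%s\n' "$left" "$right" "$signed" "$zero_padded" \
  "$hex_lower" "$hex_upper" "$octal" "$fixed" "$scientific"
```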

So for example, if you did a printf with %07.2f and gave it 12.3, you'd get

0012.30

It's seven characters wide, padded at the beginning with zeros and held to a precision of two.
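
You can check that from either Bash or awk. A minimal sketch, with the C locale forced so the decimal point is a period:

```shell
export LC_ALL=C

# Same format string, once through Bash's printf and once through awk's.
from_bash=$(printf '%07.2f' 12.3)
from_awk=$(awk 'BEGIN { printf "%07.2f", 12.3 }')

echo "$from_bash"
echo "$from_awk"
```

Both print 0012.30.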

For now that's enough, next I will cover applying the regex to searching and substituting with vi and sed.

Monday, April 8, 2019
