Saturday, June 29, 2013

My AWK Code for Parsing a Physics Exercise

Bismillahirrahmanirrahim.

I have a project to translate a physics exercise into a list of known variables. I use only awk to do it. Here is an example exercise, in Indonesian:

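Dari salah satu bagian gedung yang tingginya 20 m, dua buah batu dijatuhkan secara berurutan. Massa kedua batu masing-masing 1/2 kg dan 5 kg. Bila percepatan gravitasi bumi di tempat itu g = 10 m/s2, tentukan waktu jatuh untuk kedua batu itu (Abaikan gesekan udara).

(In English: From one part of a building 20 m high, two stones are dropped one after the other. The masses of the two stones are 1/2 kg and 5 kg respectively. If the acceleration due to gravity at that place is g = 10 m/s2, determine the fall time for both stones. Ignore air friction.)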

I want to turn it into a list of known variables (diketahui in Indonesian), like this:

Tinggi = 20 m
Massa 1 = 1/2 kg
Massa 2 = 5 kg
Gravitasi = 10 m/s2

I tried awk and got stuck on this code for extracting the numbers:

{
  for(i=1; i<=NF; i++){   
    if($i ~ /^[[:digit:]]+/) 
      print $i     
  }
}

And on this second script for extracting the units (like m, kg, m/s2):

{  
  for(i=1; i<=NF; i++){      
  if(($i ~ /^m\/s2/) || ($i ~ /^kg$/) || ($i ~ /^m$/))    
      print $i       
  }
}

Then I tried to join the two scripts into one:

BEGIN { FS = "[, ]+" }      

#getting units

{  
  for(i=1; i<=NF; i++){      
  if(($i ~ /^m\/s2/) || ($i ~ /^kg$/) || ($i ~ /^m$/))    
      print $i         
  }
}

#getting numbers

{
  for(i=1; i<=NF; i++){    
    if($i ~ /^[[:digit:]]+/) 
      print $i      
  }
}

Result
master@master:~/Dokumen/Pelajaran/Semester 4/Pak Anom$ awk -f plasma.awk soal1 
m
20
kg
m/s2
1/2
5
10
master@master:~/Dokumen/Pelajaran/Semester 4/Pak Anom$ 

But it all failed. Why? Because I don't understand awk syntax and logic. I asked on Stack Overflow (you can see my question at http://stackoverflow.com/questions/17312343/parsing-physic-exercise-in-awk), and two to five minutes later I got an answer. So quick. The best code was this:

{
    for(i=1;i<=NF;i++) {
        gsub(/[,.]/,"",$(i+1))
        if($i~/^[[:digit:]]/ && $(i+1)=="m") {
            print "Height = "$i,$(i+1)
        }
        else if($i~/^[[:digit:]]/ && $(i+1)=="kg") {
            print "Mass "++x" = "$i,$(i+1)
        }
        else if($i~/^[[:digit:]]/ && $(i+1)=="m/s2") {
            print "Gravity = "$i,$(i+1)
        }
    }
}

Result
Height = 20 m
Mass 1 = 1/2 kg
Mass 2 = 5 kg
Gravity = 10 m/s2

Short Analysis 
My first script follows my basic idea: scan every field (read: column), save every match into a variable, then print the contents of the variable. The main problem is that I don't understand how to use variables in awk.
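
A minimal sketch of that idea, to show what saving matches into a variable can look like in awk. The names collect_numbers, numbers, and count are made up for this illustration and are not from my scripts or from the Stack Overflow answer. Every field that starts with a digit is stored in an array, and the END block prints what was stored after all input has been read.

```shell
# Hypothetical helper: collect every field that starts with a digit into an
# awk array, then print the stored values once all input has been read.
collect_numbers() {
  awk '
    {
      for (i = 1; i <= NF; i++)
        if ($i ~ /^[[:digit:]]+/)
          numbers[++count] = $i      # save the match in an array variable
    }
    END {
      for (j = 1; j <= count; j++)
        print j ": " numbers[j]      # print what was saved
    }
  '
}

printf 'tingginya 20 m, dua buah batu\nmasing-masing 1/2 kg dan 5 kg.\n' | collect_numbers
```

awk variables need no declaration: count starts out empty, which awk treats as 0, and numbers becomes an array the first time it is indexed.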


for(i=1; i<=NF; i++)


This for() loop in awk works the same as a for() loop in C. The main difference is the NF variable. NF is a built-in awk variable that holds the Number of Fields (read: number of columns) in the current line. awk runs the loop once for every input line, so with this for() loop I scan my whole exercise. I use the variable i to count through the fields one by one.
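
To see what NF and the loop counter do, here is a small sketch. The function name show_fields is made up for this illustration. For each input line it prints the field count and then each field together with its index, which also shows the difference between i (the field number) and $i (the content of that field).

```shell
# Hypothetical helper: print the field count of each line, then each field
# with its index.
show_fields() {
  awk '
    {
      print "line " NR " has " NF " fields"
      for (i = 1; i <= NF; i++)
        print "  $" i " = " $i
    }
  '
}

printf 'tingginya 20 m\nMassa 1/2 kg dan 5 kg\n' | show_fields
```

NR is another built-in variable: the number of the line (record) currently being read.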


if($i ~ /^[[:digit:]]+/)


This if() statement tests a pattern. The ~ operator checks whether the field $i matches the regex I specified; it does not save anything. When the test is true, the statement that follows (print $i) runs. Remember that $i is different from plain i: i is the field number, while $i is the content of that field. My regex is /^[[:digit:]]+/, which means:

  • ^ = the match must start at the beginning of the field, not in the middle or at the end. Example: /^anu/ matches anu1 and anu3, but not banu or 8anu, or any word that does not begin with anu. That is ^ (caret).
  • [[:digit:]] = POSIX character class for a single digit, 0 through 9. Letters and other characters do not match it. Because the regex is only anchored at the front, the rest of the field is not checked, which is why 1/2 also matches.
  • + = one or more of the preceding item. [[:digit:]]+ matches a run of digits of any length, such as 20, and never matches an empty string; it needs at least one digit. Note that without an end anchor, /^[[:digit:]]/ already selects the same fields, because one leading digit is enough. The + makes a difference when the whole field must be digits, as in /^[[:digit:]]+$/.
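
The pieces above can be tried out with a small sketch. The function name classify and the variables starts and only are made up for this illustration. It tests each input word against /^[[:digit:]]+/ (starts with digits) and against a stricter variant with an end anchor, /^[[:digit:]]+$/ (digits only), which is an extra pattern I add here for comparison.

```shell
# Hypothetical helper: report, for each input word, whether it matches
# /^[[:digit:]]+/ (starts with digits) and /^[[:digit:]]+$/ (digits only).
classify() {
  awk '
    {
      starts = ($1 ~ /^[[:digit:]]+/)  ? "yes" : "no"
      only   = ($1 ~ /^[[:digit:]]+$/) ? "yes" : "no"
      print $1 ": starts with digits = " starts ", digits only = " only
    }
  '
}

printf '20\n1/2\n5kg\nm/s2\n' | classify
```

Here 20 passes both tests, 1/2 and 5kg pass only the first, and m/s2 passes neither.
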
My second script is the same as the first. The only difference is this condition:

if(($i ~ /^m\/s2/) || ($i ~ /^kg$/) || ($i ~ /^m$/))  

What does that code mean? It is actually simple. We can divide it into 3 parts. First:


($i ~ /^m\/s2/)


Second:


($i ~ /^kg$/)


Third:


($i ~ /^m$/)


These 3 parts are joined by the || (OR) operator. Together they mean: the condition is true if the field $i matches m/s2 OR kg OR m, and then print $i prints that field. Nothing is saved into $i. Note that the first pattern has no $ at the end, so it also matches a field such as m/s2 followed by a comma. Why must we write m\/s2 and not m/s2 directly? Because awk has to tell the pattern apart from its own syntax. The slash (/) is awk syntax that delimits a regex: anything between two slashes is a pattern. So to search for a literal slash, we must tell awk that this slash is not a delimiter. How? Put the awk escape character, backslash (\), directly in front of the slash (/).

And what is the $ (dollar) sign? Inside a regex it is the opposite of ^. Where ^ anchors the match at the front, $ anchors it at the end. The pattern /anu$/ matches lanu, 8anu, banu, and any word that ends with anu. It does not match anu9, anuyu, anum, or any word with characters after anu. (Outside a regex, $ means something else in awk: it selects a field, as in $i.)
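
A small sketch of the unit test with both anchors in place. The function name only_units is made up for this illustration, and I add $ to the m/s2 pattern here so that all three patterns require an exact match, which differs slightly from my script above.

```shell
# Hypothetical helper: print only the words that are exactly one of the
# units m, kg, or m/s2.
only_units() {
  awk '
    {
      for (i = 1; i <= NF; i++)
        if ($i ~ /^m\/s2$/ || $i ~ /^kg$/ || $i ~ /^m$/)
          print $i
    }
  '
}

printf '20 m, 5 kg 10 m/s2 massa km m\n' | only_units
```

With the default field separator this prints kg, m/s2, and the final m. The field written as m with a comma after it is skipped because the comma breaks the /^m$/ match, and massa and km are skipped because they are not exactly m.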

My third script, the joined one, contains my first and second scripts. But there is one difference:


BEGIN { FS = "[, ]+" }


What is that? This section sets FS, the built-in awk variable for the Field Separator. The value "[, ]+" is a regex meaning one or more commas or spaces, so both comma (,) and space ( ) act as separators. If a unit such as m, kg, or m/s2 is followed by a comma in my exercise, awk drops the comma because it is part of the separator. In other words, awk takes the unit but not the comma. So I get kg as output, not kg, (notice the comma). A period is not in the separator, so kg. keeps its period and does not match /^kg$/.
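
A small sketch to see the effect of this FS setting. The function name split_fields is made up for this illustration. It splits a line on runs of commas and spaces and prints each field in square brackets, so any punctuation left on a field is visible.

```shell
# Hypothetical helper: split on runs of commas and spaces and print each
# field in square brackets so stray punctuation is visible.
split_fields() {
  awk 'BEGIN { FS = "[, ]+" } { for (i = 1; i <= NF; i++) print "[" $i "]" }'
}

printf '20 m, dua batu 5 kg. g\n' | split_fields
```

The comma after m disappears, but the period after kg stays on the field, because only comma and space are in the separator.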

