_AnonymousUser
Specialist III

regex parsing of apache access log

Hey Guys,
Does anyone know if this can be done? I use a regex in my monster Perl ETL script (attached at bottom) that goes through my access logs and pulls out the GET variables in every HTTP request. It builds a hash with the GET variable name as the key and the GET variable value as the value. This is nice because it lets me deal with GET requests that don't all have the same number of parameters; the missing ones basically just evaluate to NULL.
Can I do this with Talend? I read up on "Setting up a File Regex schema" on p. 65 of the user guide, but I am still not sure.
Thanks.
-----
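# NOTE: the character classes ([^,]* and [^&]*) are assumed; the posted pattern lost them.
if ($_ =~ m/\,([^,]*),(\/slacker\.jpg|\/slacker\.gif),([^,]*)/) {
    # $10 presumably refers to a capture group of the full pattern, which has more fields than shown here
    $arg = $10;
    # collect every name=value pair of the GET request into %args
    while ($arg =~ m/(\w+)=([^&]*)/g) {
        $args{$1} = $2;
    }
}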
Example input:
,24.24.24.24,var1=x&var2=6374451b48368cf558,400
,24.24.24.24,var1=x&var2=f5c99a6c552c032123&var3=anything,400
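
For readers who don't speak Perl, here is a minimal sketch of the same hash-building loop in Python. It is only an illustration of the idea, not the original script: the value pattern `[^&]*` and the name `parse_query` are assumptions.

```python
import re

# Mirrors the Perl loop: every name=value pair goes into a dict,
# so neither the order nor the number of parameters matters.
PAIR = re.compile(r"(\w+)=([^&]*)")  # value pattern is an assumption


def parse_query(query):
    """Return a dict mapping GET variable name -> value."""
    return {name: value for name, value in PAIR.findall(query)}


args = parse_query("var1=x&var2=6374451b48368cf558")
print(args.get("var1"))  # x
print(args.get("var3"))  # None, because var3 is absent
```

A missing variable simply has no key, which is the "evaluates to basically NULL" behaviour described above.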
7 Replies
_AnonymousUser
Specialist III
Author

Sorry, I just noticed my input lines were off (because I stripped out some fields for anonymity) - but the basic idea is the same!
Anonymous
Not applicable

Is that really an Apache access log?
I've implemented a tApacheLogInput for TOS 2.4 (available for TOS 2.3 in the ecosystem) that deals with "standard" Apache access log lines.
You gave an example of the input lines; can you also give the corresponding expected output lines?
_AnonymousUser
Specialist III
Author

It's actually a lighttpd log; I changed the conf file so that it only logs the data fields I need.
The key point is that I need to break up the variables in the GET request into a CSV file that I then ETL through my own custom process (which I want to replace with Talend).
Here are the lines:
,24.24.24.24,var1=x&var2=6374451b48368cf558,400
,24.24.24.24,var1=x&var2=f5c99a6c552c032123&var3=anything,400
Here would be the output:
19/Apr/2008:22:59:59 -0700,24.24.24.24,x,6374451b48368cf558,,400
19/Apr/2008:22:59:59 -0700,24.24.24.24,x,f5c99a6c552c032123,anything,400
Notice: Line 1 only has 2 GET variables (var1 and var2), while Line 2 has 3 GET variables (var1, var2, var3). In the output, even though line 1 has only 2 variables, a placeholder is inserted for var3. Another thing to keep in mind is that I simplified the naming of these GET variables for the example, but my users can name them anything (e.g. var1 OR variable1 OR myvariable1).
That's why I need the regexp - to grab all variables and drop them into a hash, where I only perform operations on the variables I expect, and toss the rest of the junk.
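
The transformation described above (fixed output columns, an empty placeholder for a missing variable, unknown variables tossed) could be sketched roughly like this in Python. This is an illustration only: `EXPECTED` and `to_row` are made-up names, the `junk=1` parameter was added to the second sample line to show the "toss the rest" part, and the plain `split(",")` assumes no field ever contains a comma. Note that `parse_qsl` also percent-decodes values, which may or may not be wanted.

```python
import csv
import io
from urllib.parse import parse_qsl

EXPECTED = ["var1", "var2", "var3"]  # the variables worth keeping (assumed names)


def to_row(line):
    """Split one log line and expand its query string into fixed columns."""
    first, ip, query, status = line.rstrip("\n").split(",")
    args = dict(parse_qsl(query))  # names outside EXPECTED are never read
    return [first, ip] + [args.get(name, "") for name in EXPECTED] + [status]


out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
for line in [
    ",24.24.24.24,var1=x&var2=6374451b48368cf558,400",
    ",24.24.24.24,var1=x&var2=f5c99a6c552c032123&var3=anything&junk=1,400",
]:
    writer.writerow(to_row(line))
print(out.getvalue(), end="")
# ,24.24.24.24,x,6374451b48368cf558,,400
# ,24.24.24.24,x,f5c99a6c552c032123,anything,400
```

The first output column is empty here only because the first field of the sample input lines is empty as posted.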
Thanks!!
Anonymous
Not applicable

In the tFileInputRegex, the regex is:
'
^\,
(+)
,
(?:var1=(+))?
(?:&?var2=(+))?
(?:&?var3=(+))?
,(\d+)
$'

The input of my job is:
,24.24.24.24,var1=x&var2=6374451b48368cf558,400
,24.24.24.24,var1=x&var2=f5c99a6c552c032123&var3=anything,400
,24.24.24.24,var1=x&var3=foo,400
,1.2.3.4,var2=bar&var3=foobar,400
,1.2.3.4,var1=foo&var3=barfoo,400
,1.2.3.4,var3=barfoo,400

The limitation is that you must respect the order var1, var2, var3. Some or all of the variables can be missing, but those that are present must appear in that order.
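
To make the ordering limitation concrete, here is a small Python sketch of a pattern with the same shape (one optional non-capturing group per variable). The character classes `[^,]+` and `[^&,]+` are assumptions, since they are not legible in the regex as posted, and `match_ordered` is a made-up helper name.

```python
import re

# Same structure as the tFileInputRegex pattern; the character
# classes ([^,]+ and [^&,]+) are assumptions, not taken from the post.
ORDERED = re.compile(
    r"^,([^,]+),"
    r"(?:var1=([^&,]+))?"
    r"(?:&?var2=([^&,]+))?"
    r"(?:&?var3=([^&,]+))?"
    r",(\d+)$"
)


def match_ordered(line):
    """Return the captured groups, or None if the line does not match."""
    m = ORDERED.match(line)
    return m.groups() if m else None


print(match_ordered(",24.24.24.24,var1=x&var3=foo,400"))
# ('24.24.24.24', 'x', None, 'foo', '400')
print(match_ordered(",1.2.3.4,var3=foo&var1=barfoo,400"))
# None: var1 after var3 breaks the required order
```

A skipped optional group yields `None`, which plays the role of the empty placeholder column.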
[3 attached images]
_AnonymousUser
Specialist III
Author

I must say... that's pretty cool.
_AnonymousUser
Specialist III
Author

Ok, so does tFileInputRegex not support looping? Your code basically does the same thing, except you don't have the loop I put at the top that allows for a hash key/value system that doesn't care about order or presence.
Anonymous
Not applicable

"Ok, so does tFileInputRegex not support looping?"

No. Well, not with my current knowledge of regular expressions 🙂
"Your code basically does the same thing, except you don't have the loop I put at the top that allows for a hash key/value system that doesn't care about order or presence."

Here comes another solution, which does not care about the order of the variables. I think this solution is slower than the first one I gave you. Be warned as well that the day you have a var4 and a var5, you'll have to modify the tMap. This "problem" is not related to regex but to our static way of defining the schema.
My new input is (the last 2 lines are new):
,24.24.24.24,var1=x&var2=6374451b48368cf558,400
,24.24.24.24,var1=x&var2=f5c99a6c552c032123&var3=anything,400
,24.24.24.24,var1=x&var3=foo,400
,1.2.3.4,var2=bar&var3=foobar,400
,1.2.3.4,var1=foo&var3=barfoo,400
,1.2.3.4,var3=barfoo,400
,1.2.3.4,var3=foo&var1=barfoo,400
,1.2.3.4,var3=foo&var1=barfoo&var2=hithere,400
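
As a rough Python illustration of an order-independent extraction with a static schema: the whole query string is captured as one field, then each expected variable is looked up separately. This is not a reproduction of the Talend job; `LINE`, `SCHEMA`, `lookup` and `extract` are made-up names and the character classes are assumptions.

```python
import re

LINE = re.compile(r"^,([^,]+),([^,]*),(\d+)$")  # field classes are assumptions
SCHEMA = ["var1", "var2", "var3"]               # static: a var4 means editing this list


def lookup(query, name):
    """Find one variable anywhere in the query string, or return ''."""
    m = re.search(r"(?:^|&)" + re.escape(name) + r"=([^&]*)", query)
    return m.group(1) if m else ""


def extract(line):
    """Return [ip, var1, var2, var3, status], or None if the line is malformed."""
    m = LINE.match(line)
    if m is None:
        return None
    ip, query, status = m.groups()
    return [ip] + [lookup(query, name) for name in SCHEMA] + [status]


print(extract(",1.2.3.4,var3=foo&var1=barfoo&var2=hithere,400"))
# ['1.2.3.4', 'barfoo', 'hithere', 'foo', '400']
```

Each column costs its own scan of the query string, and supporting a new variable means changing `SCHEMA`, which is the same kind of static-schema constraint mentioned above.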

[4 attached images]