
Specialist III
2008-04-20
04:03 AM
regex parsing of apache access log
Hey Guys,
Does anyone know if this can be done? I use a regex in my monster Perl ETL script (attached at bottom) that goes through my access logs and pulls the GET variables out of every HTTP request. It builds a hash whose keys are the GET variable names and whose values are the GET variable values. This is nice because it lets me deal with GET requests that don't all have the same number of parameters: the missing ones simply evaluate to NULL.
Can I do this with Talend? I read up on "Setting up a File Regex schema" on p. 65 of the user guide but I am still not sure.
Thanks.
-----
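# [^,] and [^&] are assumed here; the bracketed character classes were lost from the posted regex.
if ($_ =~ m/\,([^,]*),(\/slacker\.jpg|\/slacker\.gif),([^,]*)/) {
    # The full script reads $10; with the abbreviated regex shown here the query string is group 3.
    $arg = $3;
    # Collect every name=value pair into %args, keyed by the variable name.
    while ($arg =~ m/(\w+)=([^&]*)/g) {
        $args{$1} = $2;
    }
}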
Example input:
,24.24.24.24,var1=x&var2=6374451b48368cf558,400
,24.24.24.24,var1=x&var2=f5c99a6c552c032123&var3=anything,400
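For readers who do not use Perl, here is a minimal Python sketch of the same key/value idea. The function name `parse_get_vars` is made up for illustration, and the `[^&,]` character class is an assumption about what a value may contain.

```python
import re

def parse_get_vars(query):
    """Return {name: value} for every name=value pair in a query string."""
    # findall yields (name, value) tuples; dict() keeps the last value per name.
    return dict(re.findall(r"(\w+)=([^&,]*)", query))

args = parse_get_vars("var1=x&var2=6374451b48368cf558")
print(args.get("var1"))  # x
print(args.get("var3"))  # None, because var3 was not in the request
```

Looking a variable up with `.get()` is what makes a missing parameter come back as an empty placeholder instead of raising an error.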
451 Views
7 Replies

Specialist III
2008-04-20
04:30 AM
Author
Sorry, I just noticed my input lines were off (because I stripped out some fields for anonymity) - but the basic idea is the same!

Anonymous
Not applicable
2008-04-20
04:28 PM
Is that really an Apache access log?
I've implemented a tApacheLogInput for TOS 2.4 (available for TOS 2.3 in the ecosystem) that deals with "standard" Apache access log lines.
You gave an example of the input lines, can you also give the corresponding expected output lines?

Specialist III
2008-04-21
01:40 AM
Author
It's actually a lighttpd log; I modified the conf file so that it logs only the data fields I need.
The key point is that I need to break up the variables in the GET request into a CSV file that I then ETL through my own custom process (which I want to replace with Talend).
Here are the lines:
,24.24.24.24,var1=x&var2=6374451b48368cf558,400
,24.24.24.24,var1=x&var2=f5c99a6c552c032123&var3=anything,400
Here would be the output:
19/Apr/2008:22:59:59 -0700,24.24.24.24,x,6374451b48368cf558,,400
19/Apr/2008:22:59:59 -0700,24.24.24.24,x,f5c99a6c552c032123,anything,400
Notice: Line 1 has only 2 GET variables (var1 and var2), while Line 2 has 3 GET variables (var1, var2, var3). In the output, even though Line 1 has only 2 variables, an empty placeholder is inserted for var3. Another thing to keep in mind is that I simplified the naming of these GET variables for the example, but my users can name them anything (e.g. var1 OR variable1 OR myvariable1).
That's why I need the regexp: to grab all variables and drop them into a hash, where I only perform operations on the variables I expect and toss the rest of the junk.
Thanks!!
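To make the input-to-output mapping above concrete, here is a rough Python sketch of it, not the actual ETL script. The names `EXPECTED`, `flatten` and `to_csv` are hypothetical. It assumes each log line has exactly four comma-separated fields (timestamp, IP, query string, status) and that the timestamp shown in the output sits in the first field of the input.

```python
import csv
import io
import re

EXPECTED = ["var1", "var2", "var3"]  # assumed list of variables to keep, in output order

def flatten(line, expected=EXPECTED):
    """Turn one log line into a fixed-width row, with "" for missing variables."""
    timestamp, ip, query, status = line.rstrip("\n").split(",")
    # Every name=value pair goes into a dict; unexpected names are simply never read.
    args = dict(re.findall(r"(\w+)=([^&]*)", query))
    return [timestamp, ip] + [args.get(name, "") for name in expected] + [status]

def to_csv(lines):
    """Render all rows as CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in lines:
        writer.writerow(flatten(line))
    return out.getvalue()

sample = [
    "19/Apr/2008:22:59:59 -0700,24.24.24.24,var1=x&var2=6374451b48368cf558,400",
    "19/Apr/2008:22:59:59 -0700,24.24.24.24,var1=x&var2=f5c99a6c552c032123&var3=anything,400",
]
print(to_csv(sample), end="")
```

Because the columns come from `EXPECTED` rather than from the line itself, the order of the variables in the request does not matter, and anything not listed is dropped.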

Anonymous
Not applicable
2008-04-21
07:19 PM
In the tFileInputRegex, the regex is:
'
^\,
([^,]+)
,
(?:var1=([^&,]+))?
(?:&?var2=([^&,]+))?
(?:&?var3=([^&,]+))?
,(\d+)
$'
The input of my job is:
,24.24.24.24,var1=x&var2=6374451b48368cf558,400
,24.24.24.24,var1=x&var2=f5c99a6c552c032123&var3=anything,400
,24.24.24.24,var1=x&var3=foo,400
,1.2.3.4,var2=bar&var3=foobar,400
,1.2.3.4,var1=foo&var3=barfoo,400
,1.2.3.4,var3=barfoo,400
The limitation is that the variables must appear in the order var1, var2, var3. Some or all of them can be missing, but those that are present must keep that order.
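The behaviour of this pattern can be checked outside Talend with a short Python sketch. The `[^,]` and `[^&,]` character classes are an assumption about the intended pattern, and `ORDERED` and `parse_ordered` are names invented for this example.

```python
import re

# Optional groups in a fixed order: var1, then var2, then var3.
ORDERED = re.compile(
    r"^,([^,]+),"
    r"(?:var1=([^&,]+))?"
    r"(?:&?var2=([^&,]+))?"
    r"(?:&?var3=([^&,]+))?"
    r",(\d+)$"
)

def parse_ordered(line):
    """Return (ip, var1, var2, var3, status), or None if the line does not match."""
    m = ORDERED.match(line)
    return m.groups() if m else None

print(parse_ordered(",24.24.24.24,var1=x&var3=foo,400"))
# ('24.24.24.24', 'x', None, 'foo', '400')
print(parse_ordered(",1.2.3.4,var3=foo&var1=barfoo,400"))
# None, because var1 comes after var3
```

A variable that is absent leaves its group unset, while a line whose variables are out of order is rejected as a whole.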

Specialist III
2008-04-22
08:03 PM
Author
I must say.. that's pretty cool

Specialist III
2008-04-22
11:48 PM
Author
Ok, so does tFileInputRegex not support looping? Your code basically does the same thing, except you don't have the loop I put at the top that allows for a hash key/value system that doesn't care about order or presence.

Anonymous
Not applicable
2008-04-24
07:15 PM
Ok, so does tFileInputRegex not support looping?
No. Well, not with my current knowledge of regular expressions 🙂
Your code basically does the same thing, except you don't have the loop I put at the top that allows for a hash key/value system that doesn't care about order or presence.
Here comes another solution, which does not care about the order of the vars. I think this solution is slower than the first one I gave you. Also be warned that the day you have a var4 and a var5, you'll have to modify the tMap. This "problem" is not related to the regex but to our static way of defining schemas.
My new input is (the last 2 lines are new):
,24.24.24.24,var1=x&var2=6374451b48368cf558,400
,24.24.24.24,var1=x&var2=f5c99a6c552c032123&var3=anything,400
,24.24.24.24,var1=x&var3=foo,400
,1.2.3.4,var2=bar&var3=foobar,400
,1.2.3.4,var1=foo&var3=barfoo,400
,1.2.3.4,var3=barfoo,400
,1.2.3.4,var3=foo&var1=barfoo,400
,1.2.3.4,var3=foo&var1=barfoo&var2=hithere,400
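The job itself is not reproduced in this thread. As a separate illustration only, a single regex can be made order-independent by giving each expected variable its own lookahead. The Python sketch below is untested in Talend; `NAMES`, `build_pattern` and `parse_any_order` are hypothetical names, and the character classes are assumptions.

```python
import re

NAMES = ["var1", "var2", "var3"]  # assumed fixed list of expected variables

def build_pattern(names):
    # One lookahead per variable: each scans the query field on its own,
    # so the order of the pairs in the line does not matter.
    # (?<![^&,]) requires the name to follow "&" or "," (not e.g. "myvar1").
    lookaheads = "".join(
        rf"(?=(?:[^,]*?(?<![^&,]){re.escape(name)}=([^&,]*))?)" for name in names
    )
    return re.compile(rf"^,([^,]+),{lookaheads}[^,]*,(\d+)$")

PATTERN = build_pattern(NAMES)

def parse_any_order(line):
    """Return (ip, var1, var2, var3, status), or None if the line does not match."""
    m = PATTERN.match(line)
    return m.groups() if m else None

print(parse_any_order(",1.2.3.4,var3=foo&var1=barfoo&var2=hithere,400"))
# ('1.2.3.4', 'barfoo', 'hithere', 'foo', '400')
```

The static-schema caveat still applies: a var4 would need a new entry in `NAMES`, just as it would need a new column in the tMap.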
