kleinmat
Contributor III

Tokenizing Log files

Hi,
I would like to process log files in a McAfee format. Before I can do anything meaningful with the logs, I have to tokenize them, but I don't know how to tokenize such log files. Any ideas?
Here is a (deliberately simplified) example of two records:

192.168.1.12 10.2.33.12  123 4711 "www.google.com" TCP_MISS "something else"
127.0.0.1:12345 10.3.211.3 4321 53344 "www.domedomain.com/bla/xyz.php?x=1,y=2" TCP_MISS "more text"


As you can see:

Fields are delimited by a single space character.
Strings are normally enclosed in double quotes ("") and can contain space characters (and theoretically even " characters).
The TCP status field is the exception: it is a string but is not enclosed in double quotes.
Dates are enclosed in square brackets ([]) and can contain space characters, too.
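
The rules above can be sketched as a small hand-written tokenizer in Java (the language Talend jobs are generated in). This is only a minimal sketch under stated assumptions, not a Talend feature: the class and method names (LogTokenizer, tokenize, findClose) are hypothetical, the enclosing "" and [] are stripped from the result, and, because the format does not say how an embedded " is escaped, a closing " or ] is only accepted when it is followed by a space or the end of the line.

```java
import java.util.ArrayList;
import java.util.List;

class LogTokenizer {

    // Returns the index of the first `close` character at or after `from`
    // that is followed by a space or the end of the line, or line.length()
    // if there is none (unterminated group). This boundary check is an
    // assumption made to tolerate embedded quote characters.
    static int findClose(String line, int from, char close) {
        for (int j = from; j < line.length(); j++) {
            boolean atBoundary = j + 1 == line.length() || line.charAt(j + 1) == ' ';
            if (line.charAt(j) == close && atBoundary) {
                return j;
            }
        }
        return line.length();
    }

    // Splits one log line on single spaces, keeping "..." and [...] groups whole.
    static List<String> tokenize(String line) {
        List<String> fields = new ArrayList<>();
        int n = line.length();
        int i = 0;
        while (true) {
            int end; // index of the delimiter that follows the current field
            if (i < n && (line.charAt(i) == '"' || line.charAt(i) == '[')) {
                char close = line.charAt(i) == '"' ? '"' : ']';
                int closeAt = findClose(line, i + 1, close);
                fields.add(line.substring(i + 1, closeAt)); // drop the enclosing characters
                end = closeAt + 1;
            } else {
                end = line.indexOf(' ', i);
                if (end < 0) {
                    end = n;
                }
                fields.add(line.substring(i, end)); // empty when two spaces are adjacent
            }
            if (end >= n) {
                return fields;
            }
            i = end + 1; // skip the single-space delimiter
        }
    }

    public static void main(String[] args) {
        String line = "192.168.1.12 10.2.33.12  123 4711 \"www.google.com\" TCP_MISS \"something else\"";
        List<String> fields = tokenize(line);
        for (int k = 0; k < fields.size(); k++) {
            System.out.println((k + 1) + ". " + fields.get(k));
        }
    }
}
```

Two adjacent spaces yield an empty field, so the first sample record splits into eight fields with an empty third one. Logic like this could presumably be called from a custom Java step in a job; where exactly to hook it in is left open here.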

How can I tokenize such Log files in Talend? The regular CSV import component is too simple for this, I believe. Any ideas?
Thanks
Matt
3 Replies
Anonymous
Not applicable

Hi,
Can you please give us an example with the expected result?
Best regards
Sabrina
kleinmat
Contributor III
Author

Hi,
I did provide 2 example lines in my post.
The outcome would be a schema in which each field is separated. So for the two examples above:
192.168.1.12 10.2.33.12  123 4711 "www.google.com" TCP_MISS "something else"
1. 192.168.1.12
2. 10.2.33.12
3. (empty)
4. 123
5. 4711
6. www.google.com
7. TCP_MISS
8. "something else"
And the second line:
127.0.0.1:12345 10.3.211.3 4321 53344 "www.domedomain.com/bla/xyz.php?x=1,y=2" TCP_MISS "more text"
1. 127.0.0.1:12345
2. 10.3.211.3
3. (empty)
4. 4321
5. 53344
6. www.domedomain.com/bla/xyz.php?x=1,y=2
7. TCP_MISS
8. "more text"
How can that be done? This is an excerpt from a standard log file, so I assume that Talend must have a means to process these file types?
Thanks
Matt
kleinmat
Contributor III
Author

So I have experimented with Talend a bit and did not come across any feature that could help process this type of file.
Does Talend really have nothing for log files?
I experimented a bit with regular expressions and came up with this:
^(+)\s(+)\s"(.*?)"\s\\s(.*?)\s(\d+)\s(\d+)\s(\d+)\s(.{34})\s(.*?)\s(\d+)\s(.*?)\s(\d+)\s(.*?)\s(.*?)\s(\w+\b)\s(.*?)\s(.*?)\s(.*?)\s(.*?)\s(+)$

It is for a slightly different log format, though:
192.246.180.238 112.225.107.58 "Undefined"  "GET http://something-bea.xyz.de.zz.com:7215/ExternalInformationServices/SAMCS?WSDL HTTP/1.1" 407 343 3793 "Apache-HttpClient/4.1.1 (java 1.5)" "-" 81 "-" 3 "-/-" "" "TCP_MISS" "-" "Authenticate Offer NTLM" "jhgjhg876g87-test" "-" 0.0.0.0

But when I try this in Talend, I get an exception:



Exception in thread "main" java.lang.Error: Unresolved compilation problem:
    Invalid escape sequence (valid ones are \b \t \n \f \r \" \' \\ )

    at csa_pilot.regex_test_0_1.regex_test.tFileInputRegex_1Process(regex_test.java:529)
    at csa_pilot.regex_test_0_1.regex_test.runJobInTOS(regex_test.java:1043)
    at csa_pilot.regex_test_0_1.regex_test.main(regex_test.java:900)


Any idea?
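
One likely cause, assuming the expression ends up inside a Java string literal in the generated job code: the message is a Java compiler error, and in Java source a sequence such as \s or \d is not a valid string escape, so each backslash has to be doubled and each " inside the literal escaped. The sketch below illustrates only that escaping rule. The class name RegexEscapeDemo and the shortened pattern, which covers just the first seven fields of the second log format, are hypothetical and are not a reconstruction of the full expression above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexEscapeDemo {

    // As typed in a regex tester:
    //   ^(\S+)\s(\S+)\s"(.*?)"\s+"(.*?)"\s(\d+)\s(\d+)\s(\d+)\s.*$
    // As a Java string literal: every backslash doubled, every " escaped.
    static final Pattern PREFIX = Pattern.compile(
            "^(\\S+)\\s(\\S+)\\s\"(.*?)\"\\s+\"(.*?)\"\\s(\\d+)\\s(\\d+)\\s(\\d+)\\s.*$");

    // Returns the seven captured leading fields, or an empty array if the line does not match.
    static String[] firstFields(String line) {
        Matcher m = PREFIX.matcher(line);
        if (!m.matches()) {
            return new String[0];
        }
        String[] out = new String[m.groupCount()];
        for (int g = 1; g <= m.groupCount(); g++) {
            out[g - 1] = m.group(g);
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "192.246.180.238 112.225.107.58 \"Undefined\"  "
                + "\"GET http://something-bea.xyz.de.zz.com:7215/ExternalInformationServices/SAMCS?WSDL HTTP/1.1\" "
                + "407 343 3793 \"Apache-HttpClient/4.1.1 (java 1.5)\" \"-\" 81 \"-\" 3 \"-/-\" \"\" "
                + "\"TCP_MISS\" \"-\" \"Authenticate Offer NTLM\" \"jhgjhg876g87-test\" \"-\" 0.0.0.0";
        for (String field : firstFields(line)) {
            System.out.println(field);
        }
    }
}
```

If the compile error disappears once the backslashes are doubled, the remaining work is in the pattern itself, which can be checked in isolation with a small class like this before it goes back into the job.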