<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Tokenizing Log files in Talend Studio</title>
    <link>https://community.qlik.com/t5/Talend-Studio/Tokenizing-Log-files/m-p/2360091#M124759</link>
    <description>Hi,
&lt;BR /&gt;I would like to process log files that come in a McAfee format. Processing means that I have to tokenize the logs before I can do anything meaningful with them. However, I don't know how to tokenize such log files. Any ideas?
&lt;BR /&gt;Here is a (deliberately simplified) example of two records:
&lt;BR /&gt;
&lt;BR /&gt;
&lt;PRE&gt;192.168.1.12 10.2.33.12  123 4711 "www.google.com" TCP_MISS "something else"&lt;BR /&gt;127.0.0.1:12345 10.3.211.3  4321 53344 "www.domedomain.com/bla/xyz.php?x=1,y=2" TCP_MISS "more text"&lt;BR /&gt;&lt;/PRE&gt;
&lt;BR /&gt;
&lt;BR /&gt;As you can see:
&lt;BR /&gt;
&lt;BR /&gt;The fields are delimited by means of a "space" character.
&lt;BR /&gt;Strings are normally enclosed by "" but can contain space characters (and theoretically even " characters)
&lt;BR /&gt;The string containing the TCP status is the exception as it is a String but it is not enclosed by ""
&lt;BR /&gt;Dates are enclosed by [] but can contain space characters, too.
&lt;BR /&gt;
&lt;BR /&gt;How can I tokenize such log files in Talend? I believe the regular CSV import component is too simple for this. Any ideas?
&lt;BR /&gt;Thanks
&lt;BR /&gt;Matt</description>
    <pubDate>Wed, 10 Jun 2015 08:01:27 GMT</pubDate>
    <dc:creator>kleinmat</dc:creator>
    <dc:date>2015-06-10T08:01:27Z</dc:date>
    <item>
      <title>Tokenizing Log files</title>
      <link>https://community.qlik.com/t5/Talend-Studio/Tokenizing-Log-files/m-p/2360091#M124759</link>
      <description>Hi,
&lt;BR /&gt;I would like to process log files that come in a McAfee format. Processing means that I have to tokenize the logs before I can do anything meaningful with them. However, I don't know how to tokenize such log files. Any ideas?
&lt;BR /&gt;Here is a (deliberately simplified) example of two records:
&lt;BR /&gt;
&lt;BR /&gt;
&lt;PRE&gt;192.168.1.12 10.2.33.12  123 4711 "www.google.com" TCP_MISS "something else"&lt;BR /&gt;127.0.0.1:12345 10.3.211.3  4321 53344 "www.domedomain.com/bla/xyz.php?x=1,y=2" TCP_MISS "more text"&lt;BR /&gt;&lt;/PRE&gt;
&lt;BR /&gt;
&lt;BR /&gt;As you can see:
&lt;BR /&gt;
&lt;BR /&gt;The fields are delimited by means of a "space" character.
&lt;BR /&gt;Strings are normally enclosed by "" but can contain space characters (and theoretically even " characters)
&lt;BR /&gt;The string containing the TCP status is the exception as it is a String but it is not enclosed by ""
&lt;BR /&gt;Dates are enclosed by [] but can contain space characters, too.
&lt;BR /&gt;
&lt;BR /&gt;How can I tokenize such log files in Talend? I believe the regular CSV import component is too simple for this. Any ideas?
&lt;BR /&gt;Thanks
&lt;BR /&gt;Matt</description>
      <pubDate>Wed, 10 Jun 2015 08:01:27 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/Tokenizing-Log-files/m-p/2360091#M124759</guid>
      <dc:creator>kleinmat</dc:creator>
      <dc:date>2015-06-10T08:01:27Z</dc:date>
    </item>
    <item>
      <title>Re: Tokenizing Log files</title>
      <link>https://community.qlik.com/t5/Talend-Studio/Tokenizing-Log-files/m-p/2360092#M124760</link>
      <description>Hi,&lt;BR /&gt;Could you please post an example with the expected result for us?&lt;BR /&gt;Best regards&lt;BR /&gt;Sabrina</description>
      <pubDate>Thu, 25 Jun 2015 05:20:53 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/Tokenizing-Log-files/m-p/2360092#M124760</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2015-06-25T05:20:53Z</dc:date>
    </item>
    <item>
      <title>Re: Tokenizing Log files</title>
      <link>https://community.qlik.com/t5/Talend-Studio/Tokenizing-Log-files/m-p/2360093#M124761</link>
      <description>Hi,
&lt;BR /&gt;I did provide 2 example lines in my post.
&lt;BR /&gt;The expected outcome is a schema in which each field is separated. So for the two examples above:
&lt;BR /&gt;192.168.1.12 10.2.33.12  123 4711 "www.google.com" TCP_MISS "something else"
&lt;BR /&gt;1. 192.168.1.12
&lt;BR /&gt;2. 10.2.33.12
&lt;BR /&gt;3. (empty)
&lt;BR /&gt;4. 123
&lt;BR /&gt;5. 4711
&lt;BR /&gt;6. www.google.com
&lt;BR /&gt;7. TCP_MISS
&lt;BR /&gt;8. "something else"
&lt;BR /&gt;And the second line:
&lt;BR /&gt;127.0.0.1:12345 10.3.211.3  4321 53344 "www.domedomain.com/bla/xyz.php?x=1,y=2" TCP_MISS "more text"
&lt;BR /&gt;1. 127.0.0.1:12345
&lt;BR /&gt;2. 10.3.211.3
&lt;BR /&gt;3. (empty)
&lt;BR /&gt;4. 4321
&lt;BR /&gt;5. 53344
&lt;BR /&gt;6. www.domedomain.com/bla/xyz.php?x=1,y=2
&lt;BR /&gt;7. TCP_MISS
&lt;BR /&gt;8. "more text"
&lt;BR /&gt;How can that be done? This is an excerpt from a standard log file, so I assume that Talend must have a means to process these file types?
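&lt;BR /&gt;Outside of Talend's own components, the split listed above could be sketched in plain Java, for example for use in a custom routine. This is only a rough sketch under assumptions: the class name LogTokenizer and the method tokenize are made up for illustration, surrounding quotes and brackets are dropped from every field (the list above keeps them on field 8), and a backslash is assumed to escape a quote inside a string, which the log format may not actually do.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogTokenizer {
    // One field, followed by a single space or the end of the line.
    // Alternatives, tried in this order:
    //   group 1: "quoted string" (backslash-escaped quotes allowed; an assumption)
    //   group 2: [bracketed date], may contain spaces
    //   group 3: bare token, possibly empty (two spaces in a row)
    private static final Pattern FIELD = Pattern.compile(
        "\\G(?:\"((?:[^\"\\\\]|\\\\.)*)\"|\\[([^\\]]*)\\]|([^ ]*))(?: |$)");

    public static String[] tokenize(String line) {
        // A line of n characters can hold at most n + 1 fields.
        String[] out = new String[line.length() + 1];
        int count = 0;
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            String field = m.group(1);
            if (field == null) field = m.group(2);
            if (field == null) field = m.group(3);
            out[count++] = field;
            if (m.end() == line.length()) break; // the last field was consumed
        }
        return Arrays.copyOf(out, count);
    }

    public static void main(String[] args) {
        String line = "192.168.1.12 10.2.33.12  123 4711 "
            + "\"www.google.com\" TCP_MISS \"something else\"";
        for (String field : tokenize(line)) {
            System.out.println("[" + field + "]");
        }
    }
}
```

&lt;BR /&gt;For the first sample line this yields eight fields, with field 3 empty because of the two consecutive spaces.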
&lt;BR /&gt;Thanks
&lt;BR /&gt;Matt</description>
      <pubDate>Mon, 29 Jun 2015 15:55:28 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/Tokenizing-Log-files/m-p/2360093#M124761</guid>
      <dc:creator>kleinmat</dc:creator>
      <dc:date>2015-06-29T15:55:28Z</dc:date>
    </item>
    <item>
      <title>Re: Tokenizing Log files</title>
      <link>https://community.qlik.com/t5/Talend-Studio/Tokenizing-Log-files/m-p/2360094#M124762</link>
      <description>I have experimented with Talend a bit and did not come across any feature that could help process this type of file. 
&lt;BR /&gt;Does Talend really have nothing for log files? 
&lt;BR /&gt;I experimented a bit with regular expressions and came up with this: 
&lt;BR /&gt; 
&lt;PRE&gt;^(+)\s(+)\s"(.*?)"\s\\s(.*?)\s(\d+)\s(\d+)\s(\d+)\s(.{34})\s(.*?)\s(\d+)\s(.*?)\s(\d+)\s(.*?)\s(.*?)\s(\w+\b)\s(.*?)\s(.*?)\s(.*?)\s(.*?)\s(+)$&lt;BR /&gt;&lt;/PRE&gt; 
&lt;BR /&gt;This is for a slightly different log format, though: 
&lt;BR /&gt; 
&lt;PRE&gt;192.246.180.238 112.225.107.58 "Undefined"  "GET http://something-bea.xyz.de.zz.com:7215/ExternalInformationServices/SAMCS?WSDL HTTP/1.1" 407 343 3793 "Apache-HttpClient/4.1.1 (java 1.5)" "-" 81 "-" 3 "-/-" "" "TCP_MISS" "-" "Authenticate Offer NTLM" "jhgjhg876g87-test" "-" 0.0.0.0&lt;BR /&gt;&lt;/PRE&gt; 
&lt;BR /&gt;But when I try this in Talend, I get an exception: 
&lt;BR /&gt; 
&lt;BR /&gt; 
&lt;BR /&gt; 
&lt;PRE&gt;Exception in thread "main" java.lang.Error: Unresolved compilation problem: &lt;BR /&gt;	Invalid escape sequence (valid ones are  \b  \t  \n  \f  \r  \"  \'  \\ )&lt;BR /&gt;&lt;BR /&gt;	at csa_pilot.regex_test_0_1.regex_test.tFileInputRegex_1Process(regex_test.java:529)&lt;BR /&gt;	at csa_pilot.regex_test_0_1.regex_test.runJobInTOS(regex_test.java:1043)&lt;BR /&gt;	at csa_pilot.regex_test_0_1.regex_test.main(regex_test.java:900)&lt;BR /&gt;&lt;/PRE&gt; 
&lt;BR /&gt; 
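&lt;BR /&gt;A possible explanation, assuming the pattern was pasted unchanged into the regex setting of tFileInputRegex: the stack trace points at a compile error in the generated Java code, which suggests that the pattern ends up inside a Java string literal. There, \s and \d are not valid escape sequences, so every backslash would have to be doubled and every inner double quote written as \". A minimal sketch of that rule follows; the class RegexEscapeDemo and the helper toJavaLiteral are hypothetical names, not part of Talend.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexEscapeDemo {
    // Hypothetical helper: turn a raw regex into text that can be pasted
    // between double quotes in Java source code.
    public static String toJavaLiteral(String rawRegex) {
        return rawRegex
            .replace("\\", "\\\\")   // double every backslash first
            .replace("\"", "\\\"");  // then escape every double quote
    }

    public static void main(String[] args) {
        // The raw regex  \s(\d+)\s"(.*?)"  written as a Java string literal:
        // each backslash is doubled and each inner quote is escaped.
        Pattern p = Pattern.compile("\\s(\\d+)\\s\"(.*?)\"");
        Matcher m = p.matcher("x 407 \"Apache-HttpClient/4.1.1 (java 1.5)\"");
        if (m.find()) {
            System.out.println(m.group(1) + " | " + m.group(2));
        }
        // Prints the literal form of the same raw regex.
        System.out.println(toJavaLiteral("\\s(\\d+)\\s\"(.*?)\""));
    }
}
```

&lt;BR /&gt;Applied to the long pattern above, the same doubling would turn each \s into \\s and each \d into \\d before the job is compiled.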
&lt;BR /&gt;Any idea?</description>
      <pubDate>Mon, 29 Jun 2015 18:09:09 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/Tokenizing-Log-files/m-p/2360094#M124762</guid>
      <dc:creator>kleinmat</dc:creator>
      <dc:date>2015-06-29T18:09:09Z</dc:date>
    </item>
  </channel>
</rss>

