
_AnonymousUser
Specialist III

processing a semi-structured text file

Hi
How can I process a bunch of semi-structured TXT files, parse their content, and insert it into a MySQL database?
I need a hand with this. I really have no clue how to start processing this kind of file. Any help, link, or tutorial would be much appreciated.
Thanks in advance
My file is something like this:

Mon Apr 21 00:00:13 2014
Acct-Status-Type = Interim-Update
NAS-Port-Type = Wireless-802.11
User-Name = "user@name.com"
NAS-Port = 2149596816
Acct-Session-Id = "80203e90"
Event-Timestamp = "Apr 21 2014 00:00:13 UTC"
Acct-Input-Octets = 2745995
Acct-Output-Octets = 17889908
Acct-Input-Gigawords = 0
Acct-Output-Gigawords = 0
Acct-Input-Packets = 19376
Acct-Output-Packets = 20912
Acct-Session-Time = 7022
Timestamp = 1398038413
Mon Apr 21 00:00:14 2014
Acct-Status-Type = stop
NAS-Port-Type = Wireless-802.11
User-Name = "user@name.com"
NAS-Port = 2149596816
Acct-Session-Id = "80267e90"
Event-Timestamp = "Apr 21 2014 00:00:13 UTC"
Acct-Input-Octets = 2746795
Acct-Output-Octets = 17885408
Acct-Input-Gigawords = 0
Acct-Output-Gigawords = 0
Acct-Input-Packets = 19345
Acct-Output-Packets = 23342
Acct-Session-Time = 70
Timestamp = 1345668413
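For reference, each record is a run of "key = value" lines introduced by a date line that contains no "=". A minimal plain-Java sketch of the parse-and-insert idea might look like the following; the file name, table name, and column names are placeholders, not a suggested schema.

import java.io.BufferedReader;
import java.io.FileReader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.HashMap;
import java.util.Map;

public class AcctFileLoader {
    public static void main(String[] args) throws Exception {
        try (BufferedReader in = new BufferedReader(new FileReader("acct.txt"));
             Connection con = DriverManager.getConnection(
                     "jdbc:mysql://localhost:3306/test", "user", "pass");
             PreparedStatement ps = con.prepareStatement(
                     "INSERT INTO radius_acct (user_name, session_id, session_time) VALUES (?, ?, ?)")) {
            Map<String, String> rec = new HashMap<>();
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty()) continue;            // skip blank lines
                if (!line.contains("=")) {               // date line: starts a new record
                    insert(rec, ps);
                    rec.clear();
                } else {                                 // "key = value" line
                    int eq = line.indexOf('=');
                    rec.put(line.substring(0, eq).trim(),
                            line.substring(eq + 1).trim().replace("\"", ""));
                }
            }
            insert(rec, ps);                             // flush the last record
        }
    }

    // Absent fields simply stay NULL in the row
    private static void insert(Map<String, String> rec, PreparedStatement ps) throws Exception {
        if (rec.isEmpty()) return;
        ps.setString(1, rec.get("User-Name"));
        ps.setString(2, rec.get("Acct-Session-Id"));
        ps.setString(3, rec.get("Acct-Session-Time"));
        ps.executeUpdate();
    }
}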
10 Replies
Anonymous
Not applicable

There are probably a hundred ways to do this...
Your data looks uniform, judging from the two sample 'records' you provided.
You could try: -
Read the file using tFileInputDelimited as one field per line. Ignore blank lines and trim strings.
Use tMemorizeRows to memorize the last 15 (I think that's the correct number) rows.
Set up a filter to look for the (final) "Timestamp" record.
Only pass this row forward in your flow, to tMap.
Map your record in tMap.
You can use the memorized rows to refer back to the other 'fields', from the date/time string through to "Acct-Session-Time". For most of your data, you can split on "=" and trim to get the data value (see the sketch after this post).
You should then have a row per file 'record'.
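Outside Talend, the tMemorizeRows idea boils down to a 15-line sliding window that is flushed whenever the newest line is the "Timestamp" field. A rough plain-Java sketch of that flow (the file name and the window size of 15 are assumptions based on the sample above):

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayDeque;
import java.util.Deque;

public class MemorizeRowsSketch {
    public static void main(String[] args) throws Exception {
        Deque<String> window = new ArrayDeque<>();
        try (BufferedReader in = new BufferedReader(new FileReader("acct.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty()) continue;        // ignore blank lines
                window.addLast(line);                // memorize the last 15 rows
                if (window.size() > 15) window.removeFirst();
                if (line.startsWith("Timestamp")) {  // final field of a record
                    for (String l : window) {
                        int eq = l.indexOf('=');
                        if (eq >= 0) {               // split on '=' and trim
                            System.out.println(l.substring(0, eq).trim()
                                    + " -> " + l.substring(eq + 1).trim());
                        }
                    }
                    window.clear();
                }
            }
        }
    }
}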
Anonymous
Not applicable

I had a quick go at this.

tFileInputDelimited->tMemorizeRows->tFilter->tMap->tLogRow
I named the input record "theRecord".

Set tFileInputDelimited to skip blank lines and trim strings.
Set tMemorizeRows to 15 and check the input column for memorisation.

Set tFilter advanced to "Timestamp".equals(row2.theRecord.substring(0, 9))

I mapped two columns for testing

AcctStatusType=((String[]) globalMap.get("tMemorizeRows_1_theRecord"))[13].substring(((String[]) globalMap.get("tMemorizeRows_1_theRecord"))[13].indexOf('=') + 2)
Timestamp=row3.theRecord.substring(row3.theRecord.indexOf('=') + 2)
(The [13] index assumes index 0 of the memorized array is the current row, which puts "Acct-Status-Type" 13 rows back from "Timestamp".)

You'd probably want to add some Exception handling.
Everything is assumed to be a String, but you could change the data types in the tMap or use tConvertType.
Output is: -

.--------------+----------.
|        tLogRow_1        |
|=-------------+---------=|
|AcctStatusType|Timestamp |
|=-------------+---------=|
|Interim-Update|1398038413|
|stop          |1345668413|
'--------------+----------'
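For what it's worth, those tMap expressions are plain Java string handling: indexOf('=') + 2 skips the "= " separator and leaves just the value. A tiny standalone demonstration:

public class ExtractDemo {
    public static void main(String[] args) {
        String theRecord = "Acct-Status-Type = Interim-Update";
        // indexOf('=') finds the separator; + 2 steps over "= "
        System.out.println(theRecord.substring(theRecord.indexOf('=') + 2)); // Interim-Update
    }
}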
Anonymous
Not applicable

Hi Jose,
I am wondering what your output metadata, i.e. column structure, would be.
Can you please shed some light on it? What would be the output column format?
Thanks
Vaibhav
_AnonymousUser
Specialist III
Author

thanks tal00000 and sanvaibhav

Right now my problem is processing that kind of file, taking into account that the fields are not always the same. Sometimes I have 15 fields and sometimes 18.
The table schema is not a problem because it has every needed field. Some will be left blank when the corresponding fields are not present in a record.

My first concern is detecting the end of every set of data, because the last field is not always "Timestamp".

Regards
Anonymous
Not applicable

Is the empty line the end of a data block? Is the first line of each data block defined or standard? You need some business or derivable logic to identify the end of a data block, or the start of one, or some sort of delimiter which distinguishes between two data blocks.
Is the data in each block an ordered list of fields?
That should give you some idea for defining or detecting the data blocks.
Thanks
Vaibhav
Anonymous
Not applicable

I don't think this is a Talend question.
If your file is as consistent as your examples showed, then extracting your data is a simple process.
If it is inconsistent, you need to describe the possible scenarios so there's at least a fighting chance of understanding how the data may be extracted.
_AnonymousUser
Specialist III
Author

Hi

The only thing we can take for sure is that every chunk is delimited by an empty line.

thanks
Anonymous
Not applicable

Then maybe you need to not ignore blank lines in your input file. Memorize the maximum number of 'fields', look for the blank line rather than "Timestamp", and then scan back through the memorized rows to see what you've got.
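A rough plain-Java sketch of that blank-line approach: collect lines until an empty one, then look fields up by name, so a block can carry 15 or 18 of them and absent fields simply come out null. The file name and the two fields picked out are just examples.

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.List;

public class BlankLineBlocks {
    public static void main(String[] args) throws Exception {
        List<String> block = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new FileReader("acct.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.trim().isEmpty()) {         // blank line ends the block
                    emit(block);
                    block.clear();
                } else {
                    block.add(line.trim());
                }
            }
            emit(block);                             // file may not end with a blank line
        }
    }

    private static void emit(List<String> block) {
        if (block.isEmpty()) return;
        String user = null, status = null;
        for (String l : block) {                     // look up by name, not position
            if (l.startsWith("User-Name")) user = l.substring(l.indexOf('=') + 2);
            if (l.startsWith("Acct-Status-Type")) status = l.substring(l.indexOf('=') + 2);
        }
        System.out.println(status + " / " + user);   // absent fields stay null
    }
}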
Anonymous
Not applicable

Hi Jose,
Can you try a trick?
Read the input file with
- newline as the field delimiter
- the blank line (a run of white space) as the row delimiter
- and then use the techniques above, as sketched below.
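Read literally, that trick maps to java.util.Scanner quite directly; a short sketch (the blank-line regex is an assumption about how the separator appears in the file):

import java.io.File;
import java.util.Scanner;

public class BlockScanner {
    public static void main(String[] args) throws Exception {
        try (Scanner sc = new Scanner(new File("acct.txt"))) {
            sc.useDelimiter("\\n\\s*\\n");                   // blank line = row delimiter
            while (sc.hasNext()) {
                String[] fields = sc.next().split("\\n");    // newline = field delimiter
                System.out.println(fields.length + " fields, first: " + fields[0].trim());
            }
        }
    }
}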
Vaibhav