
_AnonymousUser
Specialist III

processing a semi-structured text file

Hi
How can I process a bunch of semi-structured TXT files, parse their content, and insert it into a MySQL database?
I need a hand with this. I really have no clue how to start processing this kind of file. Any help, link, or tutorial would be much appreciated.
Thanks in advance
My file is something like this:

Mon Apr 21 00:00:13 2014
Acct-Status-Type = Interim-Update
NAS-Port-Type = Wireless-802.11
User-Name = "user@name.com"
NAS-Port = 2149596816
Acct-Session-Id = "80203e90"
Event-Timestamp = "Apr 21 2014 00:00:13 UTC"
Acct-Input-Octets = 2745995
Acct-Output-Octets = 17889908
Acct-Input-Gigawords = 0
Acct-Output-Gigawords = 0
Acct-Input-Packets = 19376
Acct-Output-Packets = 20912
Acct-Session-Time = 7022
Timestamp = 1398038413
Mon Apr 21 00:00:14 2014
Acct-Status-Type = stop
NAS-Port-Type = Wireless-802.11
User-Name = "user@name.com"
NAS-Port = 2149596816
Acct-Session-Id = "80267e90"
Event-Timestamp = "Apr 21 2014 00:00:13 UTC"
Acct-Input-Octets = 2746795
Acct-Output-Octets = 17885408
Acct-Input-Gigawords = 0
Acct-Output-Gigawords = 0
Acct-Input-Packets = 19345
Acct-Output-Packets = 23342
Acct-Session-Time = 70
Timestamp = 1345668413
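For reference, each record is a run of "key = value" lines introduced by a date line that contains no "=". A minimal plain-Java sketch of the parse-and-insert idea might look like the following; the file name, table name, and column names are placeholders, not a suggested schema.

import java.io.BufferedReader;
import java.io.FileReader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.HashMap;
import java.util.Map;

public class AcctFileLoader {
    public static void main(String[] args) throws Exception {
        try (BufferedReader in = new BufferedReader(new FileReader("acct.txt"));
             Connection con = DriverManager.getConnection(
                     "jdbc:mysql://localhost:3306/test", "user", "pass");
             PreparedStatement ps = con.prepareStatement(
                     "INSERT INTO radius_acct (user_name, session_id, session_time) VALUES (?, ?, ?)")) {
            Map<String, String> rec = new HashMap<>();
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty()) continue;            // skip blank lines
                if (!line.contains("=")) {               // date line: starts a new record
                    insert(rec, ps);
                    rec.clear();
                } else {                                 // "key = value" line
                    int eq = line.indexOf('=');
                    rec.put(line.substring(0, eq).trim(),
                            line.substring(eq + 1).trim().replace("\"", ""));
                }
            }
            insert(rec, ps);                             // flush the last record
        }
    }

    // Absent fields simply stay NULL in the row
    private static void insert(Map<String, String> rec, PreparedStatement ps) throws Exception {
        if (rec.isEmpty()) return;
        ps.setString(1, rec.get("User-Name"));
        ps.setString(2, rec.get("Acct-Session-Id"));
        ps.setString(3, rec.get("Acct-Session-Time"));
        ps.executeUpdate();
    }
}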
10 Replies
Anonymous
Not applicable

There are probably a hundred ways to do this...
Your data looks uniform, judging from the two sample 'records' you provided.
You could try: -
Read the file using tFileInputDelimited as one field per line. Ignore blank lines and trim strings.
Use tMemorizeRows to memorize the last 15 (I think that's the correct number) rows.
Set up a filter to look for the (final) "Timestamp" record.
Only pass this row forward in your flow, to tMap.
Map your record in tMap.
You can use the memorized rows to refer back to the other 'fields', from the date/time string through to "Acct-Session-Time". For most of your data, you can split on "=" and trim to get the data value (see the sketch after this post).
You should then have a row per file 'record'.
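Outside Talend, the tMemorizeRows idea boils down to a 15-line sliding window that is flushed whenever the newest line is the "Timestamp" field. A rough plain-Java sketch of that flow (the file name and the window size of 15 are assumptions based on the sample above):

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayDeque;
import java.util.Deque;

public class MemorizeRowsSketch {
    public static void main(String[] args) throws Exception {
        Deque<String> window = new ArrayDeque<>();
        try (BufferedReader in = new BufferedReader(new FileReader("acct.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty()) continue;        // ignore blank lines
                window.addLast(line);                // memorize the last 15 rows
                if (window.size() > 15) window.removeFirst();
                if (line.startsWith("Timestamp")) {  // final field of a record
                    for (String l : window) {
                        int eq = l.indexOf('=');
                        if (eq >= 0) {               // split on '=' and trim
                            System.out.println(l.substring(0, eq).trim()
                                    + " -> " + l.substring(eq + 1).trim());
                        }
                    }
                    window.clear();
                }
            }
        }
    }
}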
Anonymous
Not applicable

I had a quick go at this.

tFileInputDelimited->tMemorizeRows->tFilter->tMap->tLogRow
I named the input record "theRecord".

Set tFileInputDelimited to skip blank lines and trim strings.
Set tMemorizeRows to 15 and check the input column for memorisation.

Set tFilter advanced to "Timestamp".equals(row2.theRecord.substring(0, 9))

I mapped two columns for testing

AcctStatusType=((String[]) globalMap.get("tMemorizeRows_1_theRecord"))[13].substring(((String[]) globalMap.get("tMemorizeRows_1_theRecord"))[13].indexOf('=') + 2)
Timestamp=row3.theRecord.substring(row3.theRecord.indexOf('=') + 2)
(The [13] index assumes index 0 of the memorized array is the current row, which puts "Acct-Status-Type" 13 rows back from "Timestamp".)

You'd probably want to add some Exception handling.
Everything is assumed to be a String, but you could change the data types in the tMap or use tConvertType.
Output is: -

.--------------+----------.
|        tLogRow_1        |
|=-------------+---------=|
|AcctStatusType|Timestamp |
|=-------------+---------=|
|Interim-Update|1398038413|
|stop          |1345668413|
'--------------+----------'
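For what it's worth, those tMap expressions are plain Java string handling: indexOf('=') + 2 skips the "= " separator and leaves just the value. A tiny standalone demonstration:

public class ExtractDemo {
    public static void main(String[] args) {
        String theRecord = "Acct-Status-Type = Interim-Update";
        // indexOf('=') finds the separator; + 2 steps over "= "
        System.out.println(theRecord.substring(theRecord.indexOf('=') + 2)); // Interim-Update
    }
}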
Anonymous
Not applicable

Hi Jose,
I am wondering what your output metadata, i.e. column structure, would be.
Can you please shed some light on it? What would be the output column format?
Thanks
Vaibhav
_AnonymousUser
Specialist III
Author

thanks tal00000 and sanvaibhav

Right now my problem is processing that kind of file, taking into account that the fields are not always the same. Sometimes I have 15 fields and sometimes 18.
The table schema is not a problem because it has every needed field. Some will be left blank when the corresponding fields are not present in a record.

My first concern is detecting the end of every set of data, because the last field is not always "Timestamp".

Regards
Anonymous
Not applicable

Is the empty line the end of a data block? Is the first line of each data block defined or standard? You need some business or derivable logic to identify the end of a data block, or the start of one, or some sort of delimiter which distinguishes between two data blocks.
Is the data in each block an ordered list of fields?
That should give you some idea for defining or detecting the data blocks.
Thanks
Vaibhav
Anonymous
Not applicable

I don't think this is a Talend question.
If your file is as consistent as your examples showed, then extracting your data is a simple process.
If it is inconsistent, you need to describe the possible scenarios so there's at least a fighting chance of understanding how the data may be extracted.
_AnonymousUser
Specialist III
Author

Hi

The only thing we can take for sure is that every chunk is delimited by an empty line.

thanks
Anonymous
Not applicable

Then maybe you need to not ignore blank lines in your input file. Memorize the maximum number of 'fields', look for the blank line rather than "Timestamp", and then scan back through the memorized rows to see what you've got.
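A rough plain-Java sketch of that blank-line approach: collect lines until an empty one, then look fields up by name, so a block can carry 15 or 18 of them and absent fields simply come out null. The file name and the two fields picked out are just examples.

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.List;

public class BlankLineBlocks {
    public static void main(String[] args) throws Exception {
        List<String> block = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new FileReader("acct.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.trim().isEmpty()) {         // blank line ends the block
                    emit(block);
                    block.clear();
                } else {
                    block.add(line.trim());
                }
            }
            emit(block);                             // file may not end with a blank line
        }
    }

    private static void emit(List<String> block) {
        if (block.isEmpty()) return;
        String user = null, status = null;
        for (String l : block) {                     // look up by name, not position
            if (l.startsWith("User-Name")) user = l.substring(l.indexOf('=') + 2);
            if (l.startsWith("Acct-Status-Type")) status = l.substring(l.indexOf('=') + 2);
        }
        System.out.println(status + " / " + user);   // absent fields stay null
    }
}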
Anonymous
Not applicable

Hi Jose,
Can you try a trick?
Read the input file with
- newline as the field delimiter
- the blank line (a run of white space) as the row delimiter
- and then use the techniques above, as sketched below.
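Read literally, that trick maps to java.util.Scanner quite directly; a short sketch (the blank-line regex is an assumption about how the separator appears in the file):

import java.io.File;
import java.util.Scanner;

public class BlockScanner {
    public static void main(String[] args) throws Exception {
        try (Scanner sc = new Scanner(new File("acct.txt"))) {
            sc.useDelimiter("\\n\\s*\\n");                   // blank line = row delimiter
            while (sc.hasNext()) {
                String[] fields = sc.next().split("\\n");    // newline = field delimiter
                System.out.println(fields.length + " fields, first: " + fields[0].trim());
            }
        }
    }
}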
Vaibhav