Skip to main content
Announcements
A fresh, new look for the Data Integration & Quality forums and navigation! Read more about what's changed.
cancel
Showing results for 
Search instead for 
Did you mean: 
Anonymous
Not applicable

Manipulate JSON input file to remove additional string characters

Hi,

I have the following json file (tfileinputRaw input file )

Input :

[{"_comp":"1","Name":"ABC"} ,
{"_comp":"1","Name":"ZYX"} ,
{"_comp":"1","Name":"MNO"},
{"_comp":"1","Name":"YZX"}]

Output : Expected 

{"_comp":"1","Name":"ABC"}
{"_comp":"1","Name":"ZYX"}
{"_comp":"1","Name":"MNO"}
{"_comp":"1","Name":"YZX"}

 

My requirement is

1. To remove the square brackets [] at the start and end of the file. (There can be square brackets inside the file at other places, it has to be left untouched; only if there is any square brackets at start and end of the file it has to be removed if there is no square brackets it can be left as it is)

2. To remove comma (,) after each record is done.

  New records can be identified whenever {"_comp": is encountered.

So I plan the job to be

tfileinputRaw ==> tjava ==>tfileoutputRaw

in tjava logic planned => if there is , (comma) or [ (open square brackets) before {"_comp": then it has to be replaced with space else do no changes. Also remove ] (close square brackets) at EOF if present, if not present nothing else to be done.

Please help me how to implement this logic in tjava.

 

Labels (4)
1 Solution

Accepted Solutions
Anonymous
Not applicable
Author

Hi,
Everything works fine except for a single scenario,
if there is a json format file with array inside an element; that array (square brackets) is also getting missed, in the below example
see the array [group_1_id ] and [group_2_id] are missed from the resulting json format which we have.
tjava_code used : output_row.line = input_row.line.replaceAll("[\\[\\]]", "").replaceAll("} *,\\r\\n", "}\r\n");

Input :
[{"_com":"com_id1","_time1":{"gmt":"2017-04-04T18:46:46.392Z","ist":"2017-04-04T04:45:09.776Z"},"_time2":{"creator":"creator1","groups":{"grp1":["group_1_id"],"grp2":["group_2_id"]}}}]

 

Excepted :
{"_com":"com_id1","_time1":{"gmt":"2017-04-04T18:46:46.392Z","ist":"2017-04-04T04:45:09.776Z"},"_time2":{"creator":"creator1","groups":{"grp1":["group_1_id"],"grp2":["group_2_id"]}}}

Observed :
{"_com":"com_id1","_time1":{"gmt":"2017-04-04T18:46:46.392Z","ist":"2017-04-04T04:45:09.776Z"},"_time2":{"creator":"creator1","groups":{"grp1":"group_1_id","grp2":"group_2_id"}}}

Expect your inputs to get this problem resolved. Kindly assist.

View solution in original post

11 Replies
TRF
Champion II

Replace tFileInputRaw by tFileInputFullRow to read input file line by line.

Connect to tJavaRow with this line of code:

 

output_row.line = input_row.line.replaceAll("[\\[\\]]", "").replaceAll(",$", "");

For the example I connect tJavaRow to tLogRow.

Here is the result:

Starting job test at 15:31 25/05/2017.
[statistics] connecting to socket on port 3552
[statistics] connected
{"_comp":"1","Name":"ABC"} 
{"_comp":"1","Name":"ZYX"} 
{"_comp":"1","Name":"MNO"}
{"_comp":"1","Name":"YZX"}
[statistics] disconnected
Job test ended at 15:31 25/05/2017. [exit code=0]

 

Anonymous
Not applicable
Author

Just wanted to know,

1. Will this code removes , (comma) where ever it encounters? Because, in my data file there can be comma's in other fields too. I need only to remove comma's while the script encounters {"_comp": . Will the below scenario be handled with this same code ?

2. Also since you give tFileInputFullRow as an input; if there is any \n encountered during json, will that then be affected by this tjava code and make any changes to the data file after \n

 

3. BTW, I am getting the below error, can you assist me to know where I am wrong.

 

 

0683p000009Lv87.jpg

TRF
Champion II

  1. Yes, replaceAll(",$", "") means "replace comma at the end of the line
  2. not sure to understand very well, but I suppose the inpout file is made of multiple lines as shown on your initial post (default row separator "\n" can be changed), then output file will contain the same number of lines and the same separator
  3. change tJava by tJavaRow and just write as it was in the example, using input_row and out_row instead of row1 and row2 (as a best practice)
    output_row.line = input_row.line.replaceAll("[\\[\\]]", "").replaceAll(",$", "");
Anonymous
Not applicable
Author

If I give the row separator as "" instead of "\n" in tFileInputFullRow component nothing is working as you mention. This is what i mentioned in the point no. 2. I basically wanted the whole file to be treated as a single row and do the changes for the requirement mentioned above. Because at the input \n may be not given exactly at each record level.The solution you propose is not working when improper \n are appearing inside the input file for just a single record.

Hence i had to use the input component as tFileInputRaw during my initial Query. Can you please tell me how to manage the above scenario when treating the whole file in a single row?

Hence i wanted to search for the string ,{"_com": and if exists do the change as mentioned above. For this, \n does not matter and it replaces directly.

 

Can you think of any other idea to get my requirement fulfilled?

TRF
Champion II

Was not explained as it in your 1st post.

Well, you can convert object issued from tFileInputRaw to string using tConvert.

After that, you can use regex (as shown in my 1st answer) to arrange the content as desired.

Here is the complete job with the tConvert and its schema:

0683p000009Lv5x.png

 

As you can see, there is only 1 line from tFileInputRaw. It contains the complete file.

Last part of the tJavaRow is a little bit changed like this:

output_row.line = input_row.line.replaceAll("[\\[\\]]", "").replaceAll("} *,", "}");

In the input file is like this:

[{"_comp":"1","Name":"ABC"} ,
{"_comp":"1","Name":"ZYX"} ,
{"_comp":"1","Name":"MNO"},
{"_comp":"1","Name":"YZX"}]

The result is:

Starting job test at 21:20 25/05/2017.
[statistics] connecting to socket on port 3738
[statistics] connected
{"_comp":"1","Name":"ABC"}
{"_comp":"1","Name":"ZYX"}
{"_comp":"1","Name":"MNO"}
{"_comp":"1","Name":"YZX"}
[statistics] disconnected
Job test ended at 21:20 25/05/2017. [exit code=0]

In the input file is like this (with a splitted line):

[{"_comp":"1","Name":"ABC"} ,
{"_comp":"1","Name":"ZYX"} ,
{"_comp":"1","Name
end of the name on a 2nd line":"MNO"},
{"_comp":"1","Name":"YZX"}]

The result is:

Starting job test at 21:31 25/05/2017.
[statistics] connecting to socket on port 3422
[statistics] connected
{"_comp":"1","Name":"ABC"}
{"_comp":"1","Name":"ZYX"}
{"_comp":"1","Name
end of the name on a 2nd line":"MNO"}
{"_comp":"1","Name":"YZX"}
[statistics] disconnected
Job test ended at 21:31 25/05/2017. [exit code=0]

Is it what you expect?

TRF
Champion II

Just seen your PM after my last post, change tJavaRow like this

output_row.line = input_row.line.replaceAll("[\\[\\]]", "").replaceAll("} *,\\r\\n", "}\r\n");
Anonymous
Not applicable
Author

Thank you very much. Will check them up and let you know.
Also to understand the tjava code you've written, what would be the best
place to start with to learn and understand what you' ve proposed.
(Particularly tjava code) i am more interested at talend level coding of
tjava component and various other options available that can be used at
that component.

Will try your solution and come back with my observation. Huge applause for
your timely help and sharing the knowledge you carry. Much appreciate and
keep spreading.
Anonymous
Not applicable
Author

Hi,
Everything works fine except for a single scenario,
if there is a json format file with array inside an element; that array (square brackets) is also getting missed, in the below example
see the array [group_1_id ] and [group_2_id] are missed from the resulting json format which we have.
tjava_code used : output_row.line = input_row.line.replaceAll("[\\[\\]]", "").replaceAll("} *,\\r\\n", "}\r\n");

Input :
[{"_com":"com_id1","_time1":{"gmt":"2017-04-04T18:46:46.392Z","ist":"2017-04-04T04:45:09.776Z"},"_time2":{"creator":"creator1","groups":{"grp1":["group_1_id"],"grp2":["group_2_id"]}}}]

 

Excepted :
{"_com":"com_id1","_time1":{"gmt":"2017-04-04T18:46:46.392Z","ist":"2017-04-04T04:45:09.776Z"},"_time2":{"creator":"creator1","groups":{"grp1":["group_1_id"],"grp2":["group_2_id"]}}}

Observed :
{"_com":"com_id1","_time1":{"gmt":"2017-04-04T18:46:46.392Z","ist":"2017-04-04T04:45:09.776Z"},"_time2":{"creator":"creator1","groups":{"grp1":"group_1_id","grp2":"group_2_id"}}}

Expect your inputs to get this problem resolved. Kindly assist.

TRF
Champion II

Not exactly the same case, but with a little change to the tJavaRow gives what you ask for:

output_row.line = input_row.line.replaceAll("^\\[", "").replaceAll("\\]$", "").replaceAll("} *,\\r\\n", "}\r\n");

The last replaceAll is not mandatory with your sample as there is only 1 line, but if you have more than 1, it should be usefull.