Anonymous
Not applicable

Iterate X rows at a time instead of for each row

Hi,

I use Talend for data extraction and insertion into OTSDB, but I need to split my file into chunks, and a classic row-by-row iteration takes too much time (40 rows/s, and I have 90 million rows).

Do you know how to send, for example, 50 rows at a time instead of each row individually?

 

Best regards,

 

terreprime.

1 Solution

Accepted Solutions
Anonymous
Not applicable
Author

OK, I have put together an example you will need to extrapolate from. It is quite simple. The layout for your job will be.....

 

tFileInputDelimited ---> tJavaFlex ---> tFilterRow ---> tFlowToIterate ---> tRest

 

1) You read the file as normal with the tFileInputDelimited. 

2) The magic happens in the tJavaFlex. The code below shows what I did with my example. You will need to extrapolate from this to put in your JSON build (and combine) code....

Start Code

//Used to count the rows
int count = 0;
//Used to concatenate your Strings
String myConcatenatedVal = "";

Main Code

//Append 1 to each incoming row
count++;

//Concatenate your values (adjust this to concatenate your computed JSON Strings)
myConcatenatedVal = myConcatenatedVal+row1.newColumn;

//A modulus operation to fire on every 50th row. It sets the output "newColumn" to the concatenated value, then resets the myConcatenatedVal and count variables.
if(count%50==0){
	row2.newColumn = myConcatenatedVal;
	myConcatenatedVal = "";
	count=0;
}else{
//The output "newColumn" column is set to null when not the 50th row
	row2.newColumn = null;
}

This code will build up your records and only output a value every 50th record. It will output a null value for every other row. To handle this null value (to filter it out), we use the tFilterRow. Use the Advanced Mode and then set the code to ....

input_row.newColumn!=null

Your tRest will now only run once for every 50 records. 
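
Two caveats worth adding, both as untested sketches on my part rather than definitive code. First, since OpenTSDB's /api/put endpoint accepts a JSON array of datapoints (as in the sample body shown elsewhere in this thread), the concatenation should separate the individual JSON objects with commas and wrap each batch in brackets when it is emitted. Second, if your row count is not a multiple of 50, the code above never outputs the final partial batch; the End code of the tJavaFlex cannot push rows into the flow, but it can stash the remainder in globalMap so a following subjob can send it. Adjusted Main and End code along those lines (same row1/row2/newColumn names as above):

Main Code

//Append 1 to each incoming row
count++;

//Comma-separate the JSON objects so each batch forms a valid JSON array
if(myConcatenatedVal.isEmpty()){
	myConcatenatedVal = row1.newColumn;
}else{
	myConcatenatedVal = myConcatenatedVal + "," + row1.newColumn;
}

//Fire on every 50th row, wrapping the batch in brackets
if(count%50==0){
	row2.newColumn = "[" + myConcatenatedVal + "]";
	myConcatenatedVal = "";
	count=0;
}else{
	row2.newColumn = null;
}

End Code

//Stash any leftover partial batch (fewer than 50 rows) in globalMap
//so a following subjob can retrieve and send it as well
if(!myConcatenatedVal.isEmpty()){
	globalMap.put("leftoverBatch", "[" + myConcatenatedVal + "]");
}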

 

I hope that helps

 


17 Replies
Anonymous
Not applicable
Author

 

Hi,
I have a similar problem: I want to process a big CSV every "x" rows.
I tried to cut my file into several sub-files, but that isn't feasible for the biggest files.
Thanks.

 

TRF
Champion II

Hi,
Is your OpenTSDB installation tuned and the server dimensioned to deliver the expected throughput? Did you validate this point from outside of Talend? What if you replace the access to OpenTSDB with something else, let's say, for example, an output CSV file?
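
If it helps, here is a rough standalone Java sketch (untested; localhost, the default port 4242 and the sample datapoint from this thread are assumptions to adapt) to measure how many raw POSTs per second the OpenTSDB /api/put endpoint sustains, independent of any Talend job:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class OtsdbBenchmark {
    public static void main(String[] args) throws Exception {
        // Assumed host/port: OpenTSDB listens on 4242 by default
        URL url = new URL("http://localhost:4242/api/put");
        // Re-posting the same sample datapoint is fine for a rough throughput check
        String body = "[{\"metric\":\"test_3\",\"value\":\"299\",\"timestamp\":1493805152,"
                + "\"tags\":{\"Spec\":\"Matthieu\"}}]";
        byte[] payload = body.getBytes(StandardCharsets.UTF_8);

        int requests = 1000;
        long start = System.currentTimeMillis();
        for (int i = 0; i < requests; i++) {
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("POST");
            conn.setRequestProperty("Content-Type", "application/json");
            conn.setDoOutput(true);
            try (OutputStream os = conn.getOutputStream()) {
                os.write(payload);
            }
            conn.getResponseCode(); // forces the request to complete
            conn.disconnect();
        }
        long elapsed = Math.max(System.currentTimeMillis() - start, 1);
        System.out.println(requests + " requests in " + elapsed + " ms ("
                + (requests * 1000L / elapsed) + " req/s)");
    }
}

If this also tops out around 40 requests per second, the bottleneck is the per-request round trip rather than Talend itself.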
Anonymous
Not applicable
Author

Hi,

 

Thanks for your answer.

 

Normally, yes, I think.

The main problem is the per-row iteration in Talend.

Maybe it's the processing of my file, because I need to create a JSON file for sending data to OTSDB and I have a lot of components.

I can post a screenshot and explain my job if you want.

I'm a beginner in Talend and databases, so I don't know all the possible methods in Talend.

 

Terreprime

TRF
Champion II

Usually Talend processes files very fast. Try to isolate the component you suspect to be slow.
Maybe share a picture of the job.
Anonymous
Not applicable
Author

Without the tHttpRequest, Talend executes 240 rows/s, so maybe it's OTSDB...

 

This is my job:

[screenshot of the job]

The first sub-job creates the JSON for OTSDB, and in the second I read the JSON back into the flow and modify it, because tFileOutputJSON replaces some characters, turning " into \" for example.

TRF
Champion II

If I understand correctly, the 2nd subjob is the part with low throughput.

You can expect thousands of rows per second when just reading a delimited file and writing the content to another file, so you are far away from a "standard" result.

Can you share the tJavaFlex_3 code? It may explain this result, depending on what happens in this component.

Also, explain why you produce a JSON file but start the following subjob with a delimited file.

Is it the same file you used for the 1st subjob?

In that case you could avoid reading the file a second time by adding a tReplicate after tFileInputDelimited_1.

 

For the initial question concerning the low performance of OTSDB, can you give more details?

How do you push data to OTSDB?

Which component do you use for that?

 

Anonymous
Not applicable
Author

Hi,

 

To send data to OTSDB, I use the tHttpRequest component with the POST method, and the file for the POST body contains:

[
{"metric":"test_3","value":"299","timestamp":1493805152,"tags":{"Spec":"Matthieu"}}
]

I need to create a file with exactly this format, but at the end of my first sub-job the file created is:

[
{"metric":"test_3","value":"299","timestamp":1493805152,"tags":"{\"Spec\":\"Matthieu\"}"}
]

The tags value contains some wrong characters (the \ escapes, and the " at the beginning and end of the tags parameter).

So I use tFileInputDelimited_3 to read the JSON back into the flow and tJavaFlex_3 to replace those characters.

tJavaFlex_3:

// Un-escape the quotes inside the tags value
row5.data = row5.data.replace("\\\"", "\"");
// Remove the quotes wrapping the tags object
row5.data = row5.data.replace("\"{", "{");
row5.data = row5.data.replace("}\"", "}");
// Pass the cleaned line to the output flow
row6.data = row5.data;
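
For reference, on the sample output above these replaces turn

{"metric":"test_3","value":"299","timestamp":1493805152,"tags":"{\"Spec\":\"Matthieu\"}"}

into

{"metric":"test_3","value":"299","timestamp":1493805152,"tags":{"Spec":"Matthieu"}}

One caveat (my assumption, not tested against your data): the first replace also un-escapes any \" that legitimately belongs inside a value, so this only works while the metric and tag values contain no embedded quotes.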

Afterwards I create a text file with the correct format and tHttpRequest_1 picks it up.

Anonymous
Not applicable
Author

I may not understand the full problem here, but I have some concerns over the job structure.

 

You are iterating over a file and producing a new JSON file for each row of the file. Why is this? Is this just to produce some JSON? If so, creating a new file for the JSON you have demonstrated is incredibly inefficient. Why not create the JSON (if it is as simple as shown) as a String in a flow rather than iterating?


From what I am seeing your job could look like below....

tFileInputDelimited ---> tJavaFlex (to produce JSON String) ---> tFlowToIterate ---> tRest (to send the JSON)

The tFileInputDelimited reads the file and sends the data to the tJavaFlex. You carry out the JSON creation there. The tFlowToIterate will cause the tRest to be fired for each row. The tRest's Http Body would be set to the value of the JSON string. Let's say the JSON String is set to column "myJSON" and the row connecting the tJavaFlex to the tFlowToIterate is called "row1", then you would put the following in the tRest's Http Body.....

((String)globalMap.get("row1.myJSON"))
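
(This works because tFlowToIterate stores each column of the current row in globalMap under the key "rowName.columnName", so "row1.myJSON" retrieves the myJSON column of row1.)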

I see you are having an issue with the JSON format. Given what you have shown, it could be created as follows....

String myJSON = "[ {\"metric\":\"test_3\",\"value\":\"299\",\"timestamp\":1493805152,\"tags\":{\"Spec\":\"Matthieu\"}}]";


Of course, you won't want to hard code the values, so if the values are coming in columns from your input file (as I assume), your code might look like this....

 

String myJSON = "[ {\"metric\":\""+row.column1+""\",\"value\":\""+row.column2+"\",\"timestamp\":"+row.column3+",\"tags\":{\""+row.column4+"\":\""+row.column5+"\"}}]";

This is code I have put together without testing, so there may be some bugs, but essentially you just need to be aware of quotes ("). If your String needs quotes, you need to escape them using a backslash (\). So if you want the following String exactly....

Hello "John" how are you?

You would create it.....

String myString = "Hello \"John\" how are you?";


I may have missed something a bit more complex (I can't see inside your component config) but I think the file creation and reading is probably where your code is slowing down. Calling the service is also going to slow things down a little. Is there any way to batch up your requests?

Anonymous
Not applicable
Author

Hi,

 

The problem is just sending data to OTSDB fast.

 

I tried your solution, but the tRest doesn't send data to OTSDB. I think it's a wrong parameter; here are my job and the tRest:

[screenshots of the job and of the tRest configuration]

Talend tells me there are no errors, but OTSDB did not receive the data.

 

I will try to solve the problem tomorrow, but if you have an idea, I'm listening.