talendtester
Creator III

Slowdown while processing 100M+ row file?

I have a file with over 100 million rows of data.

The job processes around 2,780 rows per second when it starts, but after about 5 million rows the speed begins to drop, eventually falling to about 2 rows per second.

 

The job is:

tFileInputDelimited > tMap > tContextLoad

                             ↓

                         tJava > tFileOutputDelimited

 

In the tMap component's Advanced settings, I have Store on disk enabled with a Max buffer size of 1,000,000.

 

In the job's Run tab advanced settings I have: -Xms6256M and -Xmx7024M.

The virtual server I am running the job on has 8 processors, 8 sockets, and 32GB of RAM.

 

What can I do to keep the job running at 2,780 rows per second?

1 Solution

Accepted Solutions
Anonymous
Not applicable

Your routine needs to look something like this (you will need to handle the imports, etc.):

public class GPSConvert {

    public static String ConvertCoords(Double long_, Double lat_) {

        CoordinateConversion cs = new CoordinateConversion();

        // GET THE MGRS VALUE:
        return String.valueOf(cs.latLon2MGRUTM(lat_, long_));
    }
}

You can use this in your tMap by simply placing the code below in the expression of the column you want to output this data in:

 

routines.GPSConvert.ConvertCoords(row1.myLong, row1.myLat)

There may be a bit of tidying up to do, but this will make your job run a lot faster.
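
One optional refinement, since the expression is evaluated once per row: the routine above constructs a new CoordinateConversion for every call, which is avoidable overhead across 100M+ rows. A minimal sketch of a variant that reuses a single instance (this assumes CoordinateConversion keeps no per-call state, which you would want to verify before relying on it):

public class GPSConvert {

    // Reuse one converter rather than constructing one per row.
    // Assumption: CoordinateConversion holds no per-call state.
    private static final CoordinateConversion CS = new CoordinateConversion();

    public static String ConvertCoords(Double long_, Double lat_) {
        // Get the MGRS value for the given point
        return String.valueOf(CS.latLon2MGRUTM(lat_, long_));
    }
}

The tMap expression stays exactly the same either way.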


14 Replies
Anonymous
Not applicable

Can you show us a screenshot of your job? Your job description doesn't make any sense, I'm afraid.

Also, I have just run a job where I generated 100,000,000 rows of data and wrote them to a file. It was writing at 1.3 million rows a second and using just 4GB of RAM. I sense you are doing a little more than just reading and writing; a screenshot might help fill in the blanks.

TRF
Champion II

Also, having an idea of what happens in the tJava would be useful.
Anonymous
Not applicable

It sounds like you're running out of memory. You can test this theory by increasing the Xmx setting by a few GB and seeing if the slowdown occurs later in the process (maybe you get to ~6M rows instead of 5M).

For us to be able to give better advice, it would be very helpful if you could share a screenshot of your job and some detail of what you are doing in both the tMap and your tJava.
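
As a side note, if you want evidence rather than a guess, a standard JVM option (not Talend-specific) is to enable GC logging in the Run tab's JVM arguments alongside the larger heap, for example:

-Xmx10240M -verbose:gc

(-verbose:gc works on Java 8; on Java 9+ the equivalent is -Xlog:gc.) If the console shows near-continuous full garbage collections around the point the job slows down, memory pressure is confirmed.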
talendtester
Creator III
Author

In the tJava I am passing the Latitude/Longitude values to convert each point to its Military Grid Reference System value; here is the code:

 

String myResult = "";
CoordinateConversion cs = new CoordinateConversion();

Double lat_  = Double.parseDouble(context.myLat);
Double long_ = Double.parseDouble(context.myLong);

//GET THE MGRS VALUE:
context.myResult = String.valueOf( cs.latLon2MGRUTM(lat_,long_));

//WRITE TO OUTPUT FILE:
row2.myLat    = context.myLat;
row2.myLong   = context.myLong;
row2.myResult = context.myResult;

talendtester
Creator III
Author

There isn't much to the job: the Lat/Long values are loaded into context variables and passed to the tJava for converting to MGRS, then the MGRS value is output to the results file:

 

[screenshot of the job layout]

talendtester
Creator III
Author

Instead of continuing to throw more memory at it, is there some way to clear the job's buffer/cache every 1M rows processed?

Anonymous
Not applicable

I am still a little confused by the layout. Why are you assigning context variables millions of times? Why are you iterating to a tJava? What is the tJava sending to the tFileOutputDelimited? By the way, the tJava is not really best suited to working with row connectors. Can you give a description of what you are trying to achieve? This does not look like it will be terribly efficient at all.

talendtester
Creator III
Author

I have a file with over 100M unique Latitude, Longitude points.

I need to find out what the corresponding Military Grid Reference System (MGRS) designation is for each point.

 

For example, input:

LATITUDE     LONGITUDE
33.172       -97.069

Output:

MGRS         LATITUDE     LONGITUDE
14SPB800720  33.172       -97.069

 

I pull the latitude and longitude from each row of the file and pass the values to context variables so I can use them in the tJavaRow when I call the function for getting the MGRS value.

Anonymous
Not applicable

OK, that is not necessary and is probably causing horrendous memory and time issues. Here is the layout you will need:

 

Input File -----> tMap -----> Output File

 

The function can be used in a tMap against your column values while they are part of the row. If your function is several lines of code, add it to a Routine. If you are not sure how to do that, post your function here and I can help convert it for you.

 

If you convert it to use the above configuration, it will run significantly faster.
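
To make that concrete, once the function lives in a routine you reference it directly in the output column's expression in the tMap. It would look something like this, assuming a routine named GPSConvert and the column names from your tJava code:

routines.GPSConvert.ConvertCoords(row1.myLong, row1.myLat)

No context variables and no tJava are involved; the conversion happens inline as each row flows through the tMap.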