talendtester
Creator III

Slowdown while processing 100M+ row file?

I have a file with over 100 million rows of data.

The job processes around 2,780 rows per second when it starts, but after about 5 million rows the speed begins to drop, eventually falling to about 2 rows per second.

 

The job is:

tFileInputDelimited > tMap > tContextLoad

                             ↓

                         tJava > tFileOutputDelimited

 

In the tMap component's Advanced settings, I have Store on disk enabled with a Max buffer size of 1,000,000.

 

In the job's Run tab advanced settings I have: -Xms6256M and -Xmx7024M.

The virtual server I am running the job on has 8 processors, 8 sockets, and 32GB of RAM.

 

What can I do to keep the job running at 2,780 rows per second?

1 Solution

Accepted Solutions
Anonymous
Not applicable

Your routine needs to look something like this (you will need to handle the imports, etc.):

public class GPSConvert {

    public static String ConvertCoords(Double long_, Double lat_) {

        CoordinateConversion cs = new CoordinateConversion();

        // GET THE MGRS VALUE:
        return String.valueOf(cs.latLon2MGRUTM(lat_, long_));
    }
}

You can use this in your tMap by simply placing the code below in the expression of the column you want to output this data in:

 

routines.GPSConvert.ConvertCoords(row1.myLong, row1.myLat)

There may be a bit of tidying up to do, but this will make your job run a lot faster.
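
One optional refinement, since the expression is evaluated once per row: the routine above constructs a new CoordinateConversion for every call, which is avoidable overhead across 100M+ rows. A minimal sketch of a variant that reuses a single instance (this assumes CoordinateConversion keeps no per-call state, which you would want to verify before relying on it):

public class GPSConvert {

    // Reuse one converter rather than constructing one per row.
    // Assumption: CoordinateConversion holds no per-call state.
    private static final CoordinateConversion CS = new CoordinateConversion();

    public static String ConvertCoords(Double long_, Double lat_) {
        // Get the MGRS value for the given point
        return String.valueOf(CS.latLon2MGRUTM(lat_, long_));
    }
}

The tMap expression stays exactly the same either way.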


14 Replies
Anonymous
Not applicable

Can you show us a screenshot of your job? Your job description doesn't make any sense, I'm afraid.

Also, I have just run a job where I generated 100,000,000 rows of data and wrote them to a file. It was writing at 1.3 million rows a second and using just 4GB of RAM. I sense you are doing a little more than just reading and writing; a screenshot might help fill in the blanks.

TRF
Champion II

Also, having an idea of what happens in the tJava would be useful.
Anonymous
Not applicable

It sounds like you're running out of memory. You can test this theory by increasing the Xmx setting by a few GB and seeing if the slowdown occurs later in the process (maybe you get to ~6M rows instead of 5M).

For us to be able to give better advice, it would be very helpful if you could share a screenshot of your job and some detail of what you are doing in both the tMap and your tJava.
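
As a side note, if you want evidence rather than a guess, a standard JVM option (not Talend-specific) is to enable GC logging in the Run tab's JVM arguments alongside the larger heap, for example:

-Xmx10240M -verbose:gc

(-verbose:gc works on Java 8; on Java 9+ the equivalent is -Xlog:gc.) If the console shows near-continuous full garbage collections around the point the job slows down, memory pressure is confirmed.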
talendtester
Creator III
Author

In the tJava I am passing the Latitude/Longitude values to convert each point to its Military Grid Reference System value; here is the code:

 

String myResult = "";
CoordinateConversion cs = new CoordinateConversion();

Double lat_  = Double.parseDouble(context.myLat);
Double long_ = Double.parseDouble(context.myLong);

//GET THE MGRS VALUE:
context.myResult = String.valueOf( cs.latLon2MGRUTM(lat_,long_));

//WRITE TO OUTPUT FILE:
row2.myLat    = context.myLat;
row2.myLong   = context.myLong;
row2.myResult = context.myResult;

talendtester
Creator III
Author

There isn't much to the job: the Lat/Long values are loaded into context variables and passed to the tJava for converting to MGRS, then the MGRS value is output to the results file:

 

[screenshot of the job layout]

talendtester
Creator III
Author

Instead of continuing to throw more memory at it, is there some way to clear the job's buffer/cache every 1M rows processed?

Anonymous
Not applicable

I am still a little confused by the layout. Why are you assigning context variables millions of times? Why are you iterating to a tJava? What is the tJava sending to the tFileOutputDelimited? By the way, the tJava is not really best suited to working with row connectors. Can you give a description of what you are trying to achieve? This does not look like it will be terribly efficient at all.

talendtester
Creator III
Author

I have a file with over 100M unique Latitude, Longitude points.

I need to find out what the corresponding Military Grid Reference System (MGRS) designation is for each point.

 

For example, input:

LATITUDE     LONGITUDE
33.172       -97.069

Output:

MGRS         LATITUDE     LONGITUDE
14SPB800720  33.172       -97.069

 

I pull the latitude and longitude from each row of the file and pass the values to context variables so I can use them in the tJavaRow when I call the function for getting the MGRS value.

Anonymous
Not applicable

OK, that is not necessary and is probably causing horrendous memory and time issues. Here is the layout you will need:

 

Input File -----> tMap -----> Output File

 

The function can be used in a tMap against your column values while they are part of the row. If your function is several lines of code, add it to a Routine. If you are not sure how to do that, post your function here and I can help convert it for you.

 

If you convert it to use the above configuration, it will run significantly faster.
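
To make that concrete, once the function lives in a routine you reference it directly in the output column's expression in the tMap. It would look something like this, assuming a routine named GPSConvert and the column names from your tJava code:

routines.GPSConvert.ConvertCoords(row1.myLong, row1.myLat)

No context variables and no tJava are involved; the conversion happens inline as each row flows through the tMap.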