Anonymous
Not applicable

Split a file containing a big string by fixed length

Hi,
I have a file containing rows without any delimiter, but the rows are fixed length. How can I do this job in Talend?

If the file contains
abcdefghijklmno

I need the output as below if the column length is 3:
abc
def
ghi
jkl
mno

Can we do this in Talend (Java)?
9 Replies
Anonymous
Not applicable
Author

I think there is a component for this: tFileInputPositional or tFileInputMSPositional.
Anonymous
Not applicable
Author

Hi Neth,
I have a 300MB file containing one continuous string. I want to split it into records of constant size, say 35 characters. I don't know how many records the file contains, so I don't think tFileInputPositional or tFileInputMSPositional will help here. Any other ideas?
Anonymous
Not applicable
Author

How many fields do you have?
Anonymous
Not applicable
Author

When handling large database type files it is sometimes necessary to split the file into "records" or known line lengths as the file has been output without any delimiters/separators between records.
Is there any feature in Talend that allows a user-specified string to be inserted at a constant, user-specified increment, from some start point in the file to some end point?
alevy
Specialist

Try this code in a tJavaRow, which breaks a string into sets of no more than 66 characters, preserving whole words, delimiting each set by "~|~" and including a set number delimited by ":::".
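int lineNumber = 1;
StringBuilder delimitedLine = new StringBuilder();
String remainingLine = input_row.InputLine;
while (remainingLine.length() > 66) {
    // Break at the last space at or before index 66 so words stay whole
    int endOfLineIndex = remainingLine.lastIndexOf(' ', 66);
    int nextStart = endOfLineIndex + 1; // skip the space itself
    if (endOfLineIndex < 0) {
        // No space found: fall back to a hard break instead of failing on substring(0, -1)
        endOfLineIndex = 66;
        nextStart = 66;
    }
    delimitedLine.append("~|~").append(lineNumber).append(":::").append(remainingLine, 0, endOfLineIndex);
    remainingLine = remainingLine.substring(nextStart);
    lineNumber++;
}
delimitedLine.append("~|~").append(lineNumber).append(":::").append(remainingLine);
// Drop the leading "~|~" before handing the line on
output_row.OutputLine = delimitedLine.substring(3);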

You can then follow the tJavaRow with a tNormalize (don't forget the escape characters in the item separator as it's a regular expression i.e. use "~\\|~") to separate the sets into rows and a tExtractDelimitedFields to separate the set numbers from the actual input subset.
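To make the escaping point concrete, here is a minimal plain-Java sketch of roughly what the tNormalize and tExtractDelimitedFields steps do to the delimited line. The class and method names (DelimitedSetParser, parse) are made up for illustration and are not part of Talend.

```java
import java.util.ArrayList;
import java.util.List;

public class DelimitedSetParser {
    /** Splits a line such as "1:::text~|~2:::more" into {set number, text} pairs. */
    public static List<String[]> parse(String outputLine) {
        List<String[]> rows = new ArrayList<>();
        // String.split takes a regular expression, so the pipe has to be escaped
        for (String set : outputLine.split("~\\|~")) {
            // Limit of 2 keeps any ":::" that occurs inside the text itself
            rows.add(set.split(":::", 2));
        }
        return rows;
    }

    public static void main(String[] args) {
        for (String[] row : parse("1:::first set of words~|~2:::second set")) {
            System.out.println(row[0] + " -> " + row[1]);
        }
    }
}
```

Without the escape, the pattern "~|~" would be read as "tilde OR tilde" and the line would be split on every tilde, leaving the pipes in the data.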
Anonymous
Not applicable
Author

Hi alevy,
Can you provide an example for this? I am reading the data from the file and I always get an "Out of Memory" exception. Can you provide a job for this?
alevy
Specialist

The example I provided works with strings of a "reasonable" length, assuming essentially that your input still has some sort of row/record delimiter and that you just need to break the strings down into smaller chunks. If your entire file is one string, then I'm not surprised you get an out of memory error.
Other than increasing the memory allocated to the run-time environment (see lots of other posts about this), you might have to write your own code to handle the file.
Sorry I can't suggest anything else.
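For anyone who does go the "write your own code" route, here is a minimal sketch of one way to do it: stream the file through a buffer the size of one record and write a newline after each record, so memory use does not depend on file size. This is not a Talend component. The class name FixedLengthSplitter, the command-line arguments, and the ISO-8859-1 charset (one byte per character) are all assumptions; adjust them to your data.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class FixedLengthSplitter {
    /** Copies in to out, adding a newline after every recordLength characters. Returns the record count. */
    public static long split(Reader in, Writer out, int recordLength) throws IOException {
        if (recordLength <= 0) {
            throw new IllegalArgumentException("recordLength must be positive");
        }
        char[] buffer = new char[recordLength]; // only one record is held in memory
        long records = 0;
        int filled = 0;
        int read;
        // read() may return fewer characters than requested, so keep filling until a record is complete
        while ((read = in.read(buffer, filled, recordLength - filled)) != -1) {
            filled += read;
            if (filled == recordLength) {
                out.write(buffer, 0, filled);
                out.write('\n');
                records++;
                filled = 0;
            }
        }
        if (filled > 0) { // trailing partial record, if the file length is not a multiple of recordLength
            out.write(buffer, 0, filled);
            out.write('\n');
            records++;
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        // Assumed arguments: input path, output path, record length
        try (BufferedReader in = Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.ISO_8859_1);
             BufferedWriter out = Files.newBufferedWriter(Paths.get(args[1]), StandardCharsets.ISO_8859_1)) {
            System.out.println(split(in, out, Integer.parseInt(args[2])) + " records written");
        }
    }
}
```

With the input abcdefghijklmno and a record length of 3, the output file contains abc, def, ghi, jkl and mno on separate lines. The resulting file has ordinary line endings, so it should then be readable row by row with a regular file input component.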
Anonymous
Not applicable
Author

Hi alevy,
Thanks for the reply. As my files are larger than 10GB, I am unable to handle them in Talend because of the Out Of Memory problem. I am able to get the first record of the desired length, but after that the rest of the file has to be written to another file without the extracted string. That is where the Out Of Memory error occurs, because in my case ~20GB of data has to be held.
For now I am using the UltraEdit tool to split the DB file into records; it provides a dedicated function for that. I hope the same functionality will be provided in Talend soon.
Anonymous
Not applicable
Author

Hello, may I push this topic? Is there any solution to the original problem? I have the same thing to do: a big file containing one chunk of data that has to be divided into rows of defined length. There is no EOL delimiter; a row is defined only by its number of characters.