Hi,
I have created a Big Data Spark job.
It reads rows from a file.
I generate an ID for each record in the file in a tMap, using the following code:
Numeric.sequence("IDGen", 1000000, 1)
I checked the output file and found that duplicate IDs had been generated.
Why is this happening?
Please note that this is a Big Data Spark job and I am running it on a Spark cluster.
Is there a workaround for this issue?
Thanks
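One workaround sometimes used for this kind of problem is to build each ID from the partition index plus a per-partition counter, so that parallel tasks never need to share a counter. This is the scheme Spark documents for RDD.zipWithUniqueId (items in the k-th of n partitions get k, n+k, 2n+k, ...). The sketch below is plain Java with no Spark or Talend dependency. The class and method names are hypothetical, and it is only an assumption, not something confirmed in this thread, that the duplicates come from the sequence counter not being shared safely across parallel tasks.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PartitionedIdSketch {

    // Hypothetical helper: ID for the localIndex-th record (0-based) of one partition.
    // IDs from different partitions interleave, so they cannot collide as long as
    // 0 <= partitionIndex < numPartitions.
    static long uniqueId(long start, int numPartitions, int partitionIndex, long localIndex) {
        return start + localIndex * numPartitions + partitionIndex;
    }

    // Simulates ID generation for partitions of the given sizes (illustrative only).
    static List<Long> generate(long start, int[] partitionSizes) {
        List<Long> ids = new ArrayList<>();
        int n = partitionSizes.length;
        for (int p = 0; p < n; p++) {
            for (long i = 0; i < partitionSizes[p]; i++) {
                ids.add(uniqueId(start, n, p, i));
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        int[] sizes = {4, 3, 5}; // made-up partition sizes
        List<Long> ids = generate(1_000_000L, sizes);
        Set<Long> distinct = new HashSet<>(ids);
        System.out.println(ids.size() + " ids, " + distinct.size() + " distinct");
    }
}
```

Note that IDs produced this way are unique but not contiguous when partitions have different sizes, so this only fits if gaps in the sequence are acceptable.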
Can you share your job?
How are the duplicates distributed over the records?
Why do you initialize the sequence with 1,000,000?
Out of 10,000 numbers generated by the Talend sequence function, 5 are duplicated.
What I mean is that 5 different numbers each have a count of two.
For example:
ID/number duplication_count
123456 2
324532 2
There is no special reason for initializing with 1,000,000.
It is just an example to help people understand.
The real duplicates are different.
Anyway, the point is that I am getting duplicates.
OK, I will share how the duplicates are distributed shortly.
This is how the duplicates are distributed:
ID / Count of that ID in the output file
1000815 2
1006072 2
1005490 2
1005905 2
1000889 2
1007748 2
1000246 2
Hi,
It looks very strange.
Can you share your job design and the configuration of any component where the sequence is calculated?
Also, how are the duplicates identified? (Please give details.)
Is there any value that has more than 2 duplicates?
Here is the job design.
Basically, the job reads from a source file. It attaches an ID to every record from the source file in a tMap. Finally, the new records from the tMap are stored in another file.
The second subjob checks for duplicates. It groups all records in the output file of the previous subjob by ID, and stores each ID along with its count in an output file.
There is no ID with more than 2 duplicates.
The following are the duplicates, as I shared earlier:
ID Count
1000815 2
1006072 2
1005490 2
1005905 2
1000889 2
1007748 2
1000246 2
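For anyone who wants to reproduce the check outside of Talend, the group-by-ID-and-count step described above can be sketched in plain Java roughly as follows. The class and method names are hypothetical, and the sample input is made up apart from two IDs taken from the list above.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DuplicateIdCheck {

    // Counts how often each ID occurs and keeps only IDs seen more than once.
    static Map<Long, Integer> duplicateCounts(List<Long> ids) {
        Map<Long, Integer> counts = new LinkedHashMap<>();
        for (Long id : ids) {
            counts.merge(id, 1, Integer::sum);
        }
        counts.values().removeIf(c -> c < 2); // drop IDs that occur only once
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample: two IDs from the thread repeated, one unique ID.
        List<Long> ids = Arrays.asList(1000815L, 1000816L, 1000815L, 1006072L, 1006072L);
        System.out.println(duplicateCounts(ids)); // prints {1000815=2, 1006072=2}
    }
}
```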
There is nothing strange in the job design, which is very simple.
Last question: for the duplicate records, are the other fields (apart from the UniqueID) duplicated as well?