Numeric Sequence Generation function giving duplicate numbers - Qlik Community

TomG1 · ‎2017-06-07

Hi,

I have created a Big Data Spark job that reads rows from a file.

I generate an ID for each record in the file in a tMap, using the following expression:

Numeric.sequence("IDGen", 1000000, 1)

I checked the output file and found that duplicate IDs had been generated.

Why is this happening?

Please note that this is a Big Data Spark job and I am running it on a Spark cluster.

Is there a workaround for this issue?

Thanks

TRF · ‎2017-06-07

Can you share your job?

How are the duplicates distributed over the records?

Why do you initialize the sequence with 1,000,000?

TomG1 · ‎2017-06-07

Out of 10,000 numbers generated by the Talend sequence function, 5 are duplicated.

What I mean is that 5 different numbers each appear twice, for example:

ID/number   duplication_count
123456      2
324532      2

There is no special reason for initializing with 1,000,000.

cterenzi · ‎2017-06-07

If you initialize to 1,000,000 how are you getting values less than 1,000,000?

TomG1 · ‎2017-06-08

Those values were just an example to illustrate the problem.

The real duplicates are different.

Anyway, the point is that I am getting duplicates.

TomG1 · ‎2017-06-08

OK, I will share how the duplicates are distributed in a short while.

TomG1 · ‎2017-06-08

This is how the duplicates are distributed:

ID        Count of ID generated in the output file
1000815   2
1006072   2
1005490   2
1005905   2
1000889   2
1007748   2
1000246   2

TRF · ‎2017-06-08

Hi,

It looks very strange.

Can you share your job design and the configuration of every component where the sequence is calculated?

Also, how are the duplicates identified? (Please give details.)

Is there any value that appears more than twice?

TomG1 · ‎2017-06-08

Here is the job design.

Basically, the job reads from a source file. It attaches an ID to every record of the source file in tMap. Finally, the new records from tMap are stored in another file.

The subjob checks for duplicates. It groups all records in the output file of the previous subjob by ID, and stores each ID along with its count in an output file.

No ID appears more than twice.

The following are the duplicates, as I shared earlier:

ID        Count
1000815   2
1006072   2
1005490   2
1005905   2
1000889   2
1007748   2
1000246   2
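
For readers who want to reproduce the duplicate check outside Talend, the group-by-ID-and-count step described above can be sketched in plain Java. This is only an illustration, not the subjob from the thread: the class name `DuplicateIdCheck`, the method `findDuplicates`, and the sample ID list are all hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

class DuplicateIdCheck {

    // Groups the IDs, counts each one, and keeps only the IDs seen more than once.
    static Map<Integer, Long> findDuplicates(List<Integer> ids) {
        Map<Integer, Long> counts = ids.stream()
                .collect(Collectors.groupingBy(
                        Function.identity(), TreeMap::new, Collectors.counting()));
        counts.values().removeIf(count -> count < 2);
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical sample data, not the real output file.
        List<Integer> ids = Arrays.asList(1000246, 1000247, 1000246, 1000815, 1000815);
        findDuplicates(ids).forEach((id, count) -> System.out.println(id + " " + count));
    }
}
```

Each printed line has the same shape as the table above: the ID followed by how many times it occurs.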

TRF · ‎2017-06-08

There is nothing strange in the job design, which is very simple.

Last question: for the duplicate records, apart from the UniqueID, are the other fields duplicated or not?
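
The thread does not establish the cause. One possible explanation, stated here as an assumption and not as a confirmed diagnosis, is that a counter held in JVM memory is not shared safely between the parallel tasks of a Spark job. A scheme that needs no shared counter is to derive each ID from the task's partition index and the record's position within that partition. The sketch below shows only the arithmetic in plain Java. The class `PartitionedIdSketch`, the method `idFor`, and the partition and record counts are hypothetical, and how to wire this into a Talend tMap is not covered. Spark's RDD `zipWithUniqueId` documents a similar interleaving of IDs across partitions.

```java
import java.util.HashSet;
import java.util.Set;

class PartitionedIdSketch {

    // Hypothetical helper: ID of the record at 0-based position localIndex
    // inside the partition numbered partitionIndex (0 .. numPartitions - 1).
    // Partition k receives start + k, start + numPartitions + k, and so on,
    // so two different partitions can never produce the same ID.
    static long idFor(long start, int numPartitions, int partitionIndex, long localIndex) {
        return start + localIndex * numPartitions + partitionIndex;
    }

    public static void main(String[] args) {
        int numPartitions = 4;            // assumed number of parallel tasks
        long recordsPerPartition = 2500;  // assumed, 10,000 records in total
        Set<Long> ids = new HashSet<>();
        for (int p = 0; p < numPartitions; p++) {
            for (long n = 0; n < recordsPerPartition; n++) {
                ids.add(idFor(1_000_000L, numPartitions, p, n));
            }
        }
        // The set size equals the record count only if every ID is distinct.
        System.out.println(ids.size());
    }
}
```

The trade-off is that the IDs are contiguous only when every partition holds the same number of records. Otherwise they stay unique but leave gaps.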

Big Data

Talend Data Integration

v6.x