TomG1
Creator

Numeric Sequence Generation function giving duplicate numbers

Hi,

I have created a Big Data Spark job that reads rows from a file.

I generate an ID for each record in the file, in a tMap, using the following expression:

Numeric.sequence("IDGen", 1000000, 1)

When I checked the output file, I found duplicate IDs.

Why is this happening?

Please note that this is a Big Data Spark job and I am running it on a Spark cluster.

Is there a workaround for this issue?

 

Thanks 
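A possible explanation, not confirmed anywhere in this thread: a sequence that keeps its counter in memory is local to one JVM, so each Spark executor would hold its own counter, and if the read-and-increment step is not atomic, two tasks running in the same executor could occasionally be handed the same value. The sketch below is a hypothetical stand-in, not Talend's `Numeric.sequence` source; `SequenceSketch`, `next` and `drawConcurrently` are made-up names. It only shows what an atomic per-JVM named counter looks like and that several threads drawing from it get distinct values.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a named sequence; NOT Talend's implementation.
final class SequenceSketch {
    // One counter per sequence name, shared by the threads of this JVM only.
    private static final Map<String, AtomicInteger> COUNTERS = new ConcurrentHashMap<>();

    // Returns the next value of the named sequence; read-and-increment is atomic.
    static int next(String name, int start, int step) {
        AtomicInteger counter = COUNTERS.computeIfAbsent(name, k -> new AtomicInteger(start));
        return counter.getAndAdd(step);
    }

    // Draws perThread values from each of `threads` threads and
    // returns the set of distinct values that were handed out.
    static Set<Integer> drawConcurrently(String name, int start, int step,
                                         int threads, int perThread) {
        Set<Integer> seen = ConcurrentHashMap.newKeySet();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    seen.add(next(name, start, step));
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        Set<Integer> ids = drawConcurrently("IDGen", 1_000_000, 1, 4, 2_500);
        // 10000 draws in total; the sizes match when no value was handed out twice.
        System.out.println("distinct ids: " + ids.size());
    }
}
```

Even with an atomic counter, the values are unique only within one JVM, so this would not by itself make IDs unique across several executors.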

22 Replies
TRF
Champion II

Can you share your job?

How are the duplicates distributed over the records?

Why do you initialize the sequence with 1,000,000?

TomG1
Creator
Author

Out of 10,000 numbers generated by the Talend sequence function, 5 are duplicated.

What I mean is that 5 different numbers each appear twice.

For example:

ID/number           duplication_count
123456                   2
324532                   2

There is no special reason for initializing with 1,000,000.

cterenzi
Specialist

If you initialize to 1,000,000, how are you getting values less than 1,000,000?
TomG1
Creator
Author

It was just an example to help people understand.

The real duplicates are different.

Anyway, the point is that I am getting duplicates.

TomG1
Creator
Author

OK, I will share how the duplicates are distributed shortly.

TomG1
Creator
Author

This is how the duplicates are distributed:

ID                      Count of ID in the output file

1000815               2
1006072               2
1005490               2
1005905               2
1000889               2
1007748               2
1000246               2

TRF
Champion II

Hi,

It looks very strange.

Can you share your job design and the configuration of every component where the sequence is calculated?

Also, how are the duplicates identified? (with details)

Is there any value that appears more than twice?

TomG1
Creator
Author

Here is the job design:

 

0683p000009Lucf.jpg

 

Basically, the job reads from a source file and attaches an ID to every record in a tMap. The new records from the tMap are then stored in another file.
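One possible workaround for the ID step, sketched under assumptions: instead of relying on a shared counter, derive each ID from the partition a record belongs to and its position inside that partition. This is the interleaving scheme Spark documents for `zipWithUniqueId` (items in partition k of n get k, n+k, 2n+k, ...). `PartitionedIds` and `idFor` are hypothetical names, and how this would be wired into a Talend Spark job is not covered here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: IDs that are unique across partitions without shared state.
final class PartitionedIds {
    // ID for the record at 0-based position localIndex inside partition
    // partitionIndex, when the data is split into numPartitions partitions.
    static long idFor(long start, int partitionIndex, int numPartitions, long localIndex) {
        return start + localIndex * numPartitions + partitionIndex;
    }

    public static void main(String[] args) {
        int numPartitions = 3;
        List<Long> ids = new ArrayList<>();
        // Simulate 3 partitions holding 4 records each.
        for (int p = 0; p < numPartitions; p++) {
            for (long i = 0; i < 4; i++) {
                ids.add(idFor(1_000_000L, p, numPartitions, i));
            }
        }
        System.out.println(ids);
    }
}
```

The IDs produced this way are unique but not consecutive when partitions hold different numbers of records.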

 

0683p000009LuOz.jpg

This subjob checks for duplicates. It groups all records in the output file of the previous subjob by ID, and stores each ID along with its count in an output file.

No ID appears more than twice.

These are the duplicates, as I shared earlier:

 

ID                   Count

1000815            2
1006072            2
1005490            2
1005905            2
1000889            2
1007748            2
1000246            2
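The check described above (group by ID, count, keep the IDs that occur more than once) can be sketched in plain Java as follows. `DuplicateCheck` and `duplicates` are hypothetical names and the sample input is made up for illustration; the actual check in this thread was done in a Talend subjob.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical helper mirroring the group-and-count duplicate check.
final class DuplicateCheck {
    // Returns ID -> count for every ID that occurs more than once, sorted by ID.
    static Map<Long, Long> duplicates(List<Long> ids) {
        Map<Long, Long> counts = ids.stream()
            .collect(Collectors.groupingBy(Function.identity(), TreeMap::new, Collectors.counting()));
        // Drop the IDs that appear exactly once.
        counts.values().removeIf(c -> c < 2);
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample: two IDs appear twice, one appears once.
        List<Long> ids = Arrays.asList(1000815L, 1000246L, 1000815L, 1000001L, 1000246L);
        System.out.println(duplicates(ids)); // prints {1000246=2, 1000815=2}
    }
}
```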
 

TRF
Champion II

I see nothing strange in the job design, which is very simple.

Last question: for the duplicate records, are the other fields (apart from the UniqueID) duplicated as well?