Anonymous
Not applicable

Best ways to examine large datasets

Hi,

I have a task to cleanse some very dirty, large datasets and have tried a few components such as tReservoirSampling and tSampleRow, among others.

I've run into errors because the datasets are incorrectly formatted, so I've been trying to track down the problems in the data without having to run the whole table.

I want to do a few specific things to check these large datasets, such as the following:

  • Select the distinct values in ONE output column so I can better prepare the metadata length and datatype.
  • Select a specific row that has raised an error because it contains a value that doesn't suit the datatype. For example, I had an error caused by an inconsistent value on row 2778, and I want to extract that row to find the problem.
  • Any other tips for doing this kind of work with other useful components, if you have any.

Many thanks

   

 

1 Solution

Accepted Solutions
Anonymous
Not applicable
Author

Hi Again, I'm doing a trial of Talend & I'm a bit concerned that no-one can answer my questions?

I thought these basic data examination tasks would be pretty easy to answer for someone who had Talend experience. 

I've used several similar products & can usually find solutions by trial and error/Internet searches/forums, but not so far.

I'd like to know if my questions are poorly phrased or confusing, as I've done a lot of research to try & answer these questions myself.

Otherwise, Talend may not be for me.

Thanks


2 Replies
dprot
Contributor II

Hi,
First of all, I'm sorry it took us so long to reply.


If you want to select the distinct values of one column, you can use the tUniqRow component, which should meet your need: https://help.talend.com/reader/FnHYY1jWCvZe5NolmUNMdQ/ZWEHPNtq0AakndnOqHQJOQ
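
To make the goal concrete, here is a minimal sketch in plain Python (outside Talend, and not how tUniqRow is implemented) of what deduplicating on a single column gives you: the distinct values plus the longest one, which is the information needed to size the metadata. The column name `city` and the sample rows are hypothetical.

```python
import csv
import io


def distinct_values(rows, column):
    """Return the distinct values of one column, in first-seen order."""
    seen = {}
    for row in rows:
        seen.setdefault(row[column], None)  # dicts keep insertion order
    return list(seen)


# Hypothetical sample data standing in for the real file.
sample = io.StringIO("id,city\n1,Paris\n2,Lyon\n3,Paris\n4,N/A\n")
values = distinct_values(csv.DictReader(sample), "city")

# The longest distinct value suggests a minimum length for the column.
longest = max(len(v) for v in values)
print(values, longest)
```

Odd entries such as `N/A` in an otherwise clean column tend to stand out in the distinct list, which also helps when choosing the datatype.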

 

If you need to isolate a certain row and have a way to identify it, you can use the tFilterRow component (see https://help.talend.com/reader/Btf8zDsnT4ikhQgFW1plpQ/A8jXysHjNUXgVcJIkOBapg). It works even if you need several columns to identify your row uniquely.
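
If the only identifier you have is the row number from the error message, a different approach from tFilterRow is to pull that line straight out of the source file. The sketch below does this in plain Python without loading the whole file. It assumes the reported number is a 1-based line number in the file; whether the number in a Talend error counts the header line is something to check against your own data. The helper name `get_row` and the file name in the usage note are hypothetical.

```python
import io
from itertools import islice


def get_row(lines, row_number):
    """Return the row_number-th line (1-based), or None if the input is shorter."""
    line = next(islice(lines, row_number - 1, row_number), None)
    return None if line is None else line.rstrip("\n")


# Hypothetical three-line input standing in for the real file.
sample = io.StringIO("id,amount\n1,10.5\n2,ten\n")
print(get_row(sample, 3))
```

With a real file this would be something like `with open("data.csv") as f: print(get_row(f, 2778))`, which reads only as far as the requested line.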


As for the tReservoirSampling component, it can be very useful if you want to extract a sample that is representative of your data. It helps ensure that profiling done on the sample is not biased, which is not guaranteed if you simply take, for example, the first 1000 rows.
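
For reference, the general idea behind reservoir sampling can be sketched in a few lines of Python. This is the textbook single-pass version (often called Algorithm R), shown only to illustrate why every row has the same chance of being kept; whether tReservoirSampling uses exactly this variant is an assumption, not something confirmed here. The function name and the `seed` parameter are hypothetical.

```python
import random


def reservoir_sample(rows, k, seed=None):
    """Pick k rows uniformly at random from an iterable in a single pass."""
    rng = random.Random(seed)
    reservoir = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            reservoir.append(row)
        else:
            # Keep the new row with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = row
    return reservoir


# Sample 1000 rows out of 10000 without holding all of them in memory.
sample = reservoir_sample(range(10000), 1000, seed=42)
print(len(sample))
```

Unlike taking the first 1000 rows, the sample is drawn from the whole input, so problems that only appear late in the file still have a chance of showing up.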


I hope this answers your questions.
Damien