PP_12
Contributor III

How to write a script that can check large CSV/Excel files for data errors? Data quality, e.g. to find extra double quotes or find text in a number field

Hi, 

I'm having trouble finding the best way to detect errors in the CSV/Excel files we receive. They have advised this is the best they can provide, so we are left having to manually fix any data quality issues. Please note the CSVs use double quotes as a text qualifier, as there are some fields like Address that contain commas e.g. "190 Fuller Road, Summer City, USA","",

Is there an easy way to check each large CSV for any errors? Some examples of what I would like to check for are below:

- Street_Number should only have numeric values, no text, and vice versa for fields like Street_Name
- Identify extra double quotes, as these sometimes put details in the wrong fields when loaded
- Any fields with nulls, etc.

I would like a table showing which rows have errors or potential errors to fix. I searched and people have recommended something like data quality rules, but I have no idea where to start.

Any suggestions would be great. Sorry, I'm unable to upload the data.

Thank you

4 Replies
anthonyj
Creator III

Hi,
I can't answer your query completely, but I can offer the settings required to read your CSV in using double quotes as the qualifier.

You'll need to read the file in using UTF-8, as below:

LOAD * FROM [lib://DataFile/YourFile.csv]
(txt, utf8, embedded labels, delimiter is ',', msq);

This will ignore the commas between the double quotes.

If you have fields you expect to be numeric/text, you can create a table after loading and add the functions IsNum() and IsText(), which return the boolean result -1 for true and 0 for false. You could also experiment with the Len() function to spot any fields that are unexpectedly long.
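For example, a minimal sketch of that idea (the file path and field names like Street_Number, Street_Name and Address are assumptions based on your description, not your actual layout):

// Load the CSV with the msq settings above and add flag fields per row
DataChecks:
LOAD
    *,
    IsNum(Street_Number) AS Flag_StreetNumberIsNumeric,            // -1 = numeric, 0 = not numeric
    IsText(Street_Name) AS Flag_StreetNameIsText,                  // -1 = text, 0 = not text
    If(Len(Trim(Street_Name)) > 0, 0, 1) AS Flag_StreetNameEmpty,  // 1 = blank or null
    Len(Address) AS Address_Length                                  // spot unexpectedly long values
FROM [lib://DataFile/YourFile.csv]
(txt, utf8, embedded labels, delimiter is ',', msq);

You could then build a table on the flag fields, or load only the rows where a flag indicates a problem, to get an exception list to fix.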

I hope this helps.

Thanks

Anthony

PP_12
Contributor III
Author

Thank you for your response Anthony. Unfortunately I have tried UTF-8 and still have issues, as the data is quite messy. It was recommended in another thread that it's best to clean the CSVs first. Do you have any suggestions on how to clean these large CSV files? Notepad, Excel, or can we use Qlik to clean the files? I'm just unsure how to clean them and I am a bit limited in which tools I have. Sorry, this is all new to me.

In regard to your suggestions, do I just load the table as usual and add a column, e.g. IsNum(ID)?

Thanks again for your help

anthonyj
Creator III

Hi,
Sorry that didn't help. If the data is as messy as you say, the UTF-8 encoding probably only got so far before the double quotes became unbalanced, throwing out the load. I was expecting that it would create the columns but that some of the data might be aligned to the incorrect columns. Once it was loaded you could create a table in Qlik, and from the headings and the values available in the table you'd be able to see where the inconsistencies are in your CSV.

Depending on the size of the file, you could try just opening it in Excel and seeing where the columns split there. Excel is still a good data-wrangling tool.

Regards

Anthony

marcus_sommer

Handling poor data quality is most often not an easy job, and how expensive it becomes depends on the proportion of poor data relative to the correct data and of course on the kind of wrong/missing data. This is largely independent of the tools you have available.
 
That said, the tool you use to repair the data does matter. If you use a text editor it will be a nearly completely manual job, even though advanced editors have clever search-and-replace features and may offer ways to automate parts with macros. Using Excel, with or without VBA, could already simplify the job significantly, but Excel has its own data interpretation which may cause further challenges.
 
Much more suitable would be to use Qlik for it. Compared to Excel it's far more powerful in regard to the available string/convert/format functions and the ability to apply loop logic as well as mappings. It's very important to do it with a systematic approach. This means:
 
  • loading the file content into the right fields (if the mentioned msq feature didn't return the wanted results, you need a pre-load with fixed lengths to count the quotes and field delimiters and correct them before dividing the lines into fields, as in the sketch after this list)
  • splitting fields that contain multiple pieces of information, like the addresses you mentioned, into single fields
  • applying multiple checks to the fields, such as IsNum()/IsText()/IsNull() and Len()/RangeMin()/RangeMax(), as well as other logic to detect the data/format patterns they contain
  • defining what is a valid and what is an invalid value
  • deciding whether invalid values should be corrected - yes/no? - not all cleaning that is technically possible is also sensible with regard to ALL business requirements
  • cleaning out invalid values in n layers
  • mapping values to unify heterogeneous ones
  • evaluating the entire data set - is it sensible to use it, or is the quality still too poor to work with?
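As a minimal sketch of the first point (the file path is an assumption, and @1:n simply means "the whole line as one field" in a fixed-record load), you could pre-load the raw lines and flag any line whose double quotes are unbalanced:

// Read each raw line of the CSV as a single field
RawLines:
LOAD
    RecNo() AS LineNo,
    @1:n AS LineText
FROM [lib://DataFile/YourFile.csv]
(fix, utf8, no labels);

// Keep only suspicious lines: an odd number of double quotes means they are unbalanced
SuspectLines:
LOAD
    LineNo,
    LineText,
    SubStringCount(LineText, '"') AS QuoteCount,
    SubStringCount(LineText, ',') AS CommaCount   // includes commas inside the quotes
RESIDENT RawLines
WHERE Mod(SubStringCount(LineText, '"'), 2) <> 0;

DROP TABLE RawLines;

The resulting table gives you the line numbers to correct (or to fix via Replace()/mapping logic) before the real load splits the lines into fields.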
This means that after some days/weeks of hard work you may find that **bleep** in still results in **bleep** out ... (but besides that, you will learn a lot). Therefore you should really consider whether you accept the statement that the data couldn't be provided in a better state, or whether there are other ways - maybe direct access to the database which probably sits behind the CSV.
 
Edit: S H I T was automatically replaced by **bleep**
 
- Marcus