Unlock a world of possibilities! Login now and discover the exclusive benefits awaiting you.
Dear everybody,
I'm trying to achieve a mass conversion of QVD file to parquet file for a demo server.
Many of my QVD contain "mixed" data - ie numeric and text data in the same field which leads to a loss of data with parquet file
Storing fields with mixed data types into parquet may result in loss of data
[657748] values dropped in [MY_FIELD]
I can deal with mixed field one by one using the text() function but my purpose is to do a general load & Store with LOAD * from myqvdfile then store * to myparquetfile;
Does anybody have a solution ?
You may read the XML-Header of the QVD's which contain a lot of information about the field-data - at least numeric fields and string-fields are uniquely to identify in <NumberFormat> and <Tags>. By mixed fields I'm not sure but I think the reverse approach of not being a number or text should identify the mixed ones.
A bit simpler may be to load all fields as strings - maybe by using qvdnooffields() and qvdfieldnames() in a loop and creating an appropriate load-statement within a variable which then used in the final table-load:
Hello, any updates regarding this, we have the same issue here.
@NicolasAimain1
Did you get this issue sorted in your store statements?
I have the same challenge and would like to know how you got around this in your example.
I am binary loading an app and looping through the tables and storing them out to our data lake as parquet files.
Many thanks, Carl.
Hi, it's still not solved
I may give a try to Marcus solution within a few week
A qvw contained also various xml meta-data within the header which may be useful to get the needed information.
Beside this you may loop through all fields from a table after the binary load but not as a resident-load else querying the system-tables, maybe with something like this as starting point:
for i = 0 to nooftables() -1
let t = tablename($(i));
for ii = 1 to nooffields('$(t)')
let f = fieldname($(ii), '$(t)');
c: load isnum([$(f)]) as C autogenerate fieldvaluecount([$(f)]);
next
next
Within the loop and directly after the load the fieldvaluecount() of C could be checked which should be 1 for each pure numeric respectively string field and if it's 2 it's a mixed field and the single value of -1 or 0 for C returns also the data-type information.
With a bit on-top logic you may also exclude key-fields or already known fields from the check-loop. By the most fields the above logic is quite fast but by dozens of millions of distinct key-values such approach will take some time ...
Using a loop to load XML information from all my QVD file I can see that there is no corresponding data tag in "String%Table" for problematic field
I created a chart table based on below code using Concat(distinct [String%Table],'|') to find out field without data tags (or with only $key tags)
tagging these fields appears to solve the problem (no more [xxxxx] values dropped in [<field>] message)
next step will be to
for each vFile in filelist('lib://<mylib>')
let vFilename = <keep only filename from vFile>;
String:
LOAD
String%Table,
'$(vFilename)_'&%Key_QvdFieldHeader_59D66ED49CFF179D as %Key_QvdFieldHeader
FROM [lib://<mylib>$(vFilename).qvd]
(XmlSimple, table is [QvdTableHeader/Fields/QvdFieldHeader/Tags/String]);
QvdFieldHeader:
LOAD
'$(vFilename)' as Filename,
"FieldName",
'$(vFilename)_'&%Key_QvdFieldHeader_59D66ED49CFF179D as %Key_QvdFieldHeader
FROM [lib://<mylib>$(vFilename).qvd]
(XmlSimple, table is [QvdTableHeader/Fields/QvdFieldHeader]);
Next