The surprise was that the QVD file was bigger than the CSV file.
Zip file (with csv): 32,169 KB
CSV file: 94,192 KB
QVD file: 121,277 KB
It's around 30% bigger!
This file has four fields: id, latitude, longitude, and postal code. I think the "problem" is in the latitude and longitude fields, which are extremely precise (15 decimal digits). The id is an incremental integer (which takes little space in QlikView) and the postal code is a nine-character string (with many similar blocks).
To improve this, I have a couple of questions:
How many decimal digits in the coordinates are actually needed for a postal code?
Could I split the postal code into two four-character strings? Would that make sense?
I'm not surprised the QVD is larger in this case. A CSV file of locations would likely have no repeating field values. It's likely every value is unique. If that's the case, the smallest storage format would be pure text like a CSV.
A QVD on the other hand contains lists of values (Symbol tables), pointers and metadata. When there are repeating field values, the "de-duplication" process will result in a QVD that takes less disk space than the source data.
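As a quick illustration of that de-duplication effect (a sketch only; the rounding step is an assumption, not something from the original post): rounding the coordinates to, say, 5 decimal places (roughly 1 metre of precision) makes many values share the same symbol, so the symbol tables shrink:

```
// Sketch: rounding the coordinates so that more field values repeat,
// letting the QVD symbol tables de-duplicate them.
// The 0.00001 step (~1 m precision) is an assumption to be validated.
rounded:
LOAD id,
     postcode,
     round(latitude,  0.00001) as latitude,
     round(longitude, 0.00001) as longitude
FROM [ukpostcodes.csv] (txt, codepage is 1252, embedded labels, delimiter is ',');

store rounded into rounded.qvd (qvd);
```

Whether this actually saves space depends on how many coordinates collide after rounding, so it's worth comparing the resulting QVD size before committing to it.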
Rob is right that with mostly distinct values you won't save storage space or RAM, but with a few transformations you could reduce the size significantly, for example with:
test:
LOAD floor(id) as id,
     subfield(postcode, ' ', 1) as postcode1,
     subfield(postcode, ' ', 2) as postcode2,
     floor(subfield(latitude, '.', 1)) as latitude1,
     floor(left(subfield(latitude, '.', 2), 14)) as latitude2,
     // truncating this way causes a small inaccuracy, but I think it's negligible here
     floor(subfield(longitude, '.', 1)) as longitude1,
     floor(left(subfield(longitude, '.', 2), 14)) as longitude2
FROM [ukpostcodes.csv] (txt, codepage is 1252, embedded labels, delimiter is ',');

store test into test.qvd (qvd);
which will reduce it to 58,003 KB. And if you removed the id (is it needed at all? The association within the data model should be the postcode, right? When loading the data into the data model, the complete postcode would be needed, but it could then be an autonumber() field), it would be reduced further to 45,955 KB.
And if you load them with an exists() against the postcodes that actually occur in your data, you should need even less RAM than that.
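A sketch of that exists() idea (the file and field names here are assumptions for illustration): load the fact data first so its postcodes are registered, then keep only the matching coordinate rows:

```
// Assumed: facts.qvd already contains the postcodes used in the data model
facts:
LOAD postcode
FROM [facts.qvd] (qvd);

// Keep only coordinates for postcodes that actually occur in the facts
coordinates:
LOAD postcode,
     latitude,
     longitude
FROM [ukpostcodes.csv] (txt, codepage is 1252, embedded labels, delimiter is ',')
WHERE exists(postcode);
```

Since exists() checks against values already loaded into the field, the order of the two loads matters: the fact table must come first.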
But if you are not hitting the limits of your environment, it might not be worth making these additional efforts ...