Compression Discrepancy in XML Files Generated by Talend vs eprocat
I'm working on a project where I generate a catalog in Talend against the BMEcat 1.2 "new catalog" schema. I imported the BMEcat XSD into Talend Studio, created a structure, and used a tHMap to map elements from the source to the target (BMEcat XML). However, I'm seeing a puzzling difference in how well the output compresses.
The XML file generated by Talend is approximately 3 GB. When I compress it with the Deflate algorithm at the normal compression level, the resulting ZIP file is around 400 MB, so the compressed file is still roughly 12% of the original size.
When I create the same catalog with eprocat, the raw file is also about 3 GB. Compressed with the same Deflate settings, however, the ZIP file is only around 170 MB, roughly 5% of the original size.
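For anyone who wants to reproduce these figures without a ZIP tool, here is a minimal sketch of how the compressed-to-original fraction can be measured. It assumes that "normal" corresponds to zlib level 6 (this varies by tool), and the function name `deflate_fraction` and the file names in the comments are hypothetical.

```python
import zlib


def deflate_fraction(path, level=6, chunk_size=1 << 20):
    """Return compressed_size / original_size for a file, using raw Deflate.

    Assumption: level 6 approximates a ZIP tool's "normal" setting.
    """
    # wbits=-15 selects a raw Deflate stream, the format stored inside ZIP entries
    compressor = zlib.compressobj(level, zlib.DEFLATED, -15)
    original = 0
    compressed = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)  # stream in 1 MiB pieces; a 3 GB file never sits in memory
            if not chunk:
                break
            original += len(chunk)
            compressed += len(compressor.compress(chunk))
    compressed += len(compressor.flush())  # emit whatever the compressor still buffers
    return compressed / original if original else 0.0


# Hypothetical usage:
# print(deflate_fraction("talend_catalog.xml"))   # about 0.12 would match the ZIP result
# print(deflate_fraction("eprocat_catalog.xml"))  # about 0.05 would match the ZIP result
```

Running this on both files should confirm whether the gap comes from the XML content itself rather than from the ZIP tool or its settings.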
Upon inspecting the files, I noticed two main differences:
- Talend: Single-line XML file without indentation.
- eprocat: Indented XML file.
I ran some tests: I changed the XML declaration at the top and indented the Talend-generated file to match the eprocat layout. Despite these adjustments, the Talend-generated file still compresses to only about 12% of its original size.
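Since indentation alone does not explain the gap, the next thing I can think of is comparing what the markup actually contains in each file. The sketch below is only a diagnostic idea, not a known cause: it profiles a sample from the start of each file and counts opening tags, `xmlns` occurrences, and numeric character references, on the assumption that repeated namespace declarations or differently escaped characters could be where the files diverge. The names `read_sample` and `markup_profile` and the file names are hypothetical.

```python
import re
from collections import Counter

# Matches the name of an opening tag; closing tags and the XML declaration are skipped
TAG_RE = re.compile(rb"<([A-Za-z_][\w.:-]*)")
# Matches numeric character references such as &#38; or &#x26;
CHAR_REF_RE = re.compile(rb"&#x?[0-9A-Fa-f]+;")


def read_sample(path, size=8 << 20):
    """Read the first `size` bytes of a file (8 MiB by default)."""
    with open(path, "rb") as f:
        return f.read(size)


def markup_profile(sample):
    """Summarise the markup in a byte sample so two files can be compared."""
    tags = Counter(
        m.group(1).decode("ascii", "replace") for m in TAG_RE.finditer(sample)
    )
    return {
        "bytes": len(sample),
        "tag_counts": tags,
        "xmlns_occurrences": sample.count(b"xmlns"),
        "char_references": len(CHAR_REF_RE.findall(sample)),
    }


# Hypothetical usage:
# talend = markup_profile(read_sample("talend_catalog.xml"))
# eprocat = markup_profile(read_sample("eprocat_catalog.xml"))
# print(talend["xmlns_occurrences"], eprocat["xmlns_occurrences"])
# print(talend["tag_counts"].most_common(10))
# print(eprocat["tag_counts"].most_common(10))
```

If the two profiles look the same, the difference is more likely in the element order or in the text values than in the markup.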
**Objective:**
I'd like the Talend-generated file to compress about as well as the eprocat one (around 5%). Are there any specific configurations or optimizations in Talend that would improve the compression ratio? Any insights or suggestions would be greatly appreciated.