I have an XML file (2 MB) that has various child elements - Employee, chemical, electrical, transport, mechanical - which I am trying to load into a Snowflake table. When the job was executed, tXMLMap seemed to take most of the time processing the data.
The tXMLMap step alone takes around 15-20 minutes to complete for a small set of data. The job design looks as below -
Please advise how to design a job that reads an XML file containing various sub-elements and attributes.
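For reference, the structure I am describing looks roughly like this (the root element, attribute names and values here are only illustrative; the child element names are the real ones):

<Company>
    <Employee id="1001" dept="HR"/>
    <chemical code="CH-01" grade="A"/>
    <electrical code="EL-07" voltage="230"/>
    <transport mode="road" capacity="12"/>
    <mechanical part="MX-3" spec="v2"/>
</Company>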
Hi, maybe you can cut your job in two: first read the XML, transform it, and write the output to intermediate files.
Then read those files to load the data into Snowflake. This way you can also load all your data in parallel if you make each (read --> load) a separate job.
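Something like this (component names are the ones used later in this thread; the layout is just a sketch):

Job 1: tFileInputXML --> tXMLMap --> tFileOutputDelimited (one delimited file per target table)
Jobs 2..N, run in parallel: tFileInputDelimited --> tSnowflakeOutput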
Send Me Love and Kudos
Thanks for the response! Are you suggesting to transform the XML, write it to a file in S3, and load into Snowflake from there? Could you please tell me which components I can use?
So first you write all your data to files, then you run a job that executes all your outputs in parallel as simple jobs (read CSV --> load Snowflake).
You could use the tExtractXMLField component, which is lighter than tXMLMap.
And you could also increase your JVM max memory for better performance.
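For example, in the job's Run tab > Advanced settings you can tick "Use specific JVM arguments" and raise the heap (the values below are only an example, size them to your machine):

-Xms1024M
-Xmx4096M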
Thank you! Let me try this approach. I still feel the tXMLMap is going to take time to process the data. Even in the approach I followed, the Snowflake load happened in seconds, but the tXMLMap processing took a good amount of time. I am wondering if it could be because of the loop conditions set to read the XML.
@guenneguez jeremy - I went with the approach you suggested. The XML read completed in 0.64 seconds, but tXMLMap took 20 minutes to write out the delimited files. What is going wrong here?
So another way: use tExtractXMLField for each output instead of one tXMLMap, and run them in parallel.
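For each output, the tExtractXMLField basic settings would look roughly like this (the XPath values just reuse the illustrative structure from the first post; adjust them to your real element and attribute names):

XML field:         the input column holding the XML document
Loop XPath query:  "/Company/Employee"
Mapping:           id   -> "@id"
                   dept -> "@dept"

Then a second tExtractXMLField looping on "/Company/chemical", and so on for the other outputs.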
@guenneguez jeremy - A small correction. The first approach of tFileInputXML -> tXMLMap -> tFileOutputDelimited
and parallel runs of tFileInputDelimited -> tSnowflakeOutput
took 7 minutes to complete.
Let me try using the tExtractXMLField component and check how long it takes.
@guenneguez jeremy @Shicong Hong Jeremy, in your approach, won't the same XML file be read multiple times, based on the number of tExtractXMLField components in the design? I would want the XML to be read once and written to multiple tables. How can this be achieved?