Unlock a world of possibilities! Login now and discover the exclusive benefits awaiting you.
I have a location column that I'm processing through tExtractDelimitedFields to break out into city, region, country
However, the input isn't consistent, sometimes just containing a country, sometimes a region and country, sometimes a city, region, and country:
|=--------------------------------=| |location |
|=--------------------------------=| |Bangkok, Thailand | |Sydney, New South Wales, Australia| |Atlanta, Georgia, United States | |Zurich, Switzerland | |Peru | |=--------------------------------=|
Which is obviously and correctly split by tExtractDelimitedFields to:
|=----------------+-----------------------------+--------------=| |city |region |country | |=----------------+-----------------------------+--------------=| |Bangkok | Thailand | | |Sydney | New South Wales | Australia | |Atlanta | Georgia | United States | |Zurich | Switzerland | | |Peru | | | |=----------------|-----------------------------|--------------=|
Now, I'd like to correct the locations of the different elements of the location field. My thought was to extract the fields into variables, and shift them to the right until the country field is no longer blank. But it occurred to me that
Here is the solution I propose as a more elegant way (let you decide this point!)
Use a tFileInputFullRow to read the file as it is (schema is one field called "line").
Use tMap to left pad the line field with the missing "," if any:
Here is the magic, get a substring from ",," starting a position depending of the number of "," in the line, then add the line to this substring which contains 0, 1 or 2 "," (for ex: ",Bangkok, Thailand" or ",,Peru").
Now you can use tExtractDelimitedFields as you want (just have to change the Field separator to "," and to define the desired output schema):
Check the "Trim Column" option on Advanced setting tabs.
And the result as expected with just 1 line of Java code:
Hope this helps.
I assume I brute-forced it with tJavaRow:
if ("" == input_row.country && "" == input_row.region) { output_row.city = ""; output_row.region = ""; output_row.country = input_row.city; } else if ("" == input_row.country) { output_row.city = ""; output_row.region = input_row.city; output_row.country = input_row.region; } else { output_row.city = input_row.city; output_row.region = input_row.region; output_row.country = input_row.country; }
If there's a more elegant way to do it, please let me know!
Here is the solution I propose as a more elegant way (let you decide this point!)
Use a tFileInputFullRow to read the file as it is (schema is one field called "line").
Use tMap to left pad the line field with the missing "," if any:
Here is the magic, get a substring from ",," starting a position depending of the number of "," in the line, then add the line to this substring which contains 0, 1 or 2 "," (for ex: ",Bangkok, Thailand" or ",,Peru").
Now you can use tExtractDelimitedFields as you want (just have to change the Field separator to "," and to define the desired output schema):
Check the "Trim Column" option on Advanced setting tabs.
And the result as expected with just 1 line of Java code:
Hope this helps.
@jwhite_tc, did my proposition help you to solve your case?
It did indeed! Basically pad the front of the location field with commas to force it to three total. And the trim in the advanced settings. Lovely!