Unlock a world of possibilities! Login now and discover the exclusive benefits awaiting you.
Hi all. I've got a table with about 50 fields. In my script, after the table is loaded, and some other dimension tables are loaded, I want to create some derived fields based on the data that I've loaded. I've been getting weird results. I tried doing something like this:
LEFT JOIN (MyTable) LOAD
ID,
if(...) AS DerivedField1,
if(...) AS DerivedField2
RESIDENT MyTable;
Is this a good approach? As I said, I've been getting strange results, like where the data doesn't come out right according to the logic. Or sometimes it creates a ton of duplicate rows. So how do you guys create derived fields after the fact?
Cheers,
Eric
The syntax looks good. The question about logic and depends on what you really need.
You can do a load of load...don´t use a join.....if i understood you...something like this...
LOAD
YEAR(xxx) AS YEAR,
MONTH(xxx) AS MONTH,
IF(xxx='xyz',True,false) AS yyyy,
*;
LOAD
xxxx,
xxxx
SQL SELECT
xxxx,
xxxx ;
You have three options:
1) Left Join the table to itself like you did in your example. Which does look right, I would guess if your getting duplicate rows the "ID" is not truely a unique identifer or was duplicated earlier before this left join. When trouble shooting I always recommend output using trace the rowcount of the resulting table after each Join.
2) Temp load and then Resident load. Example:
Table1_Tmp:
Load Field1, Field2;
SQL Select Field1, Field2 From Table1;
Table1:
Noconcatenate
Load *,
Field1 + Field2 AS Field3
Resident Table1_Tmp;
Drop Table Table1_Tmp;
Both 1 & 2 are most useful when you must Join two or more tables in memory before all the fields needed for your calculation are present in one table. However, if all the fields present in the original table loaded I recommend using proceeding loads maybe even multiple. Example:
Table1:
Load *,
SubTotal + Tax AS Total
Load *,
SubTotal * 0.06 AS Tax
Load Labor,
Material,
Labor + Material AS SubTotal;
SQL Select Labor, Material From Table1;
Its kind of funny that most code executes top down but load statements like these kind of go bottom up. Whenever possible I would use mulitple loads chained on one select before I would consider options 1 or 2 above. As far as performance goes I would not know which is faster option 1 or 2.