7 Replies Latest reply: Dec 30, 2013 11:22 AM by Paul Madden RSS

    Aggregate on a String? Or?

    Paul Madden

      I've got a nice little issue I'm trying to solve. First, let me say that I am not at all familiar with the Aggregate Operator yet. I know, I'm embarrassed. But I really just haven't had a need for it with numerical data yet. Second, I don't even know if the Aggregate Operator is going to be the best tool for this job. I've also been trying combinations of Unique/Transforms with a little bit of success.

       

      Here is a sample dataset:

       

      EmployeeIDDepartmentFloorRoomDateOfEntry
      111Accounting1/1/12
      111Accounting2012/2/12
      111Accounting3/3/12
      222HRThree1001/1/12
      222AccountingFour1002/2/12
      222HROne1003/3/12
      222ISTwo5004/4/12

       

      Yes. I know this is a horrible dataset. Unfortunately, it's what the DB designer came up with, and now I have to clean it. I'm not happy about it. But it's what keeps my paychecks coming.

       

      What I need to do is combine these 7 records into 2. The algorithm I will use will be:

       

      For each group of EmployeeIDs:

      1) First, try to select the data with the most occurrences (disregard nils & empties).

      2) If there are no duplicates, select the record with the last date of entry.

       

      So, my target dataset should look like this:

       

      EmployeeIDDepartmentFloorRoomDateOfEntry
      111Accounting201<new timestamp>
      222HRTwo100<new timestamp>

       

      I "THINK" that all of the strings are known values. So if it was necessary, I might be able to assign integers to the string values. I just don't want to default to that if there is a somewhat easy way to do this sort of work on strings.

       

      I'm continuing to play with this, but I welcome ideas at this point.  Thanks in advance!

        • Re: Aggregate on a String? Or?
          Massimo Grossi

          Hope this can help you

           

           

          SET ThousandSep='.';

          SET DecimalSep=',';

          SET MoneyThousandSep='.';

          SET MoneyDecimalSep=',';

          SET MoneyFormat='€ #.##0,00;-€ #.##0,00';

          SET TimeFormat='hh:mm:ss';

          SET DateFormat='DD/MM/YYYY';

          SET TimestampFormat='DD/MM/YYYY hh:mm:ss[.fff]';

          SET MonthNames='gen;feb;mar;apr;mag;giu;lug;ago;set;ott;nov;dic';

          SET DayNames='lun;mar;mer;gio;ven;sab;dom';

           

           

          //

          //load original data and count null for every rows

          //

          Tmp:

          LOAD *, if(len(trim(Floor))=0,1,0) + if(len(trim(Room))=0,1,0) as FlagNullCount;

          LOAD * INLINE [

              EmployeeID, Department, Floor, Room, DateOfEntry

          111, Accounting, , ,1/1/12

          111, Accounting, , 201 ,2/2/12

          111, Accounting, , ,3/3/12

          222, HR, Three, 100, 1/1/12

          222, Accounting, Four, 100, 2/2/12

          222, HR, One, 100, 3/3/12

          222, IS, Two, 500, 4/4/12

          ];

           

           

          //

          // flag record in the target dataset

          //

          Tmp2:

          NoConcatenate

          LOAD

            if(isnull(EmployeeID) or Peek(EmployeeID)<>EmployeeID,1,0) as FlagInTargetDataset,

            *

          Resident Tmp

          Order by EmployeeID, FlagNullCount, DateOfEntry desc;

           

           

          DROP Table Tmp;

           

           

          //

          // make final table

          //

          Target:

          NoConcatenate load * Resident Tmp2 Where FlagInTargetDataset=1;

           

           

          DROP Table Tmp2;

          • Re: Aggregate on a String? Or?
            Nicole Smith

            See the attached.

             

             

            returns

             

            • Re: Aggregate on a String? Or?
              Massimo Grossi

              It would be interesting to see if someone can do the same thing directly in a chart.

              • Re: Aggregate on a String? Or?
                Paul Madden

                I suppose I meant most duplicates. For example:

                 

                EmployeeID 222, the most occurrences of [Department] is "HR".

                EmployeeID 222, the most occurrences of [Room] is "100".

                EmployeeID 111, the most occurrences of [Room] is "201" (because I want to ignore NULLs if there is at least one row with data in it).

                • Re: Aggregate on a String? Or?
                  Juan Gerardo Cabeza Luque

                  Hi Paul,

                      Probably it is possible to use an Aggregation operator and/or a Transformation operator with some special functions to write all this logic with LUA in a simpler dataflow. But I prefer to use as less LUA as possible to ease the maintenance and use different combinations of operators.

                   

                      With your data I have created a dataflow to implement this logic and create the desired result:

                   

                  Capture.PNG.png

                   

                  Although it can look complex, it is not. There is three group of operators to perform the same process with the three fields used (Department, Floor and Room) and two joins at the end to mixed all fields.

                   

                  My process is to calculate the number of iterations for every field value, then get the max iteration for every one, and join the max iteration with the iteration numbers to get only the records with max iterations. At the end, I order by date and get only the last value (last date), in case there are more than one record with the same number of iterations.

                   

                  Attached you have the Expressor project and the source data I have used.

                   

                  Hope this helps.

                    • Re: Aggregate on a String? Or?
                      Paul Madden

                      Wow! Tons of great information from all. Thanks!

                       

                      Juan, your solution is a bit closer to what I was after. Thank you very much! I think I may have also simplified your solution. Have a look:

                       

                      12-30-2013 10-08-13 AM.png

                      The first Aggregate Operators are exactly as you created them. But it appears that your 2nd Aggregate and 1st Join are not necessary when sorting by NumRoom and MaxDate.

                       

                      (FYI - The Transforms in the screenshot aren't doing anything important)

                       

                      Let me know if you think I'm missing something. But it's worked on every combination of data I threw at it. So we'll see how it does on a much larger dataset once I get a snapshot of the production data.

                       

                      Thanks again!