13 Replies Latest reply: Jun 5, 2012 11:46 AM by user5674 RSS

Replace count distinct by sum : bad results

user5674

Hello,

 

I have attached an example.

 

I want to test to replace count distinct big expression with many if conditions by sum with if conditions to compare calculation speed.

 

I used a table with my data (TMP) and an column containing Id

I have added a table with only the distinct Id and a column flag.

 

But with replacing my count distinct by a sum of the flag column, I have bad results.

 

Could you explain me what is the problem?

 

Cause to the filter conditions I can't use the peek style load script because the first line with the Id which obtain the flag may have other fields values which validate only certain condition and not others. But i want my count distinct evaluate expressions dynamically on selected data.

That why I have attempted to use a table (TMPDistinct)

to associate one single Id value to one single flag 1.

 

Regards,

  • Re: Replace count distinct by sum : bad results
    user5674

    I have done some other test with formulas in attached application.

     

    I don't know if the thing I want is possible:

    use the current line value of the "fake" dimension CategoryTtl as value condition on the true aggregated sum  on field Category in the "Set analysis" formula.

     

    The "best" I have done is the yellow column with an explicit if switch. I don't think it's really "better" concerning performance...

     

    And with the If conditions I don't understand why all my Sum result are wrong.

    • Replace count distinct by sum : bad results
      swuehl

      Hi user5674,

       

      as soon as you use something like

      =sum(If(Date=vDate,flag))

       

      in your chart with dimension Category, you tell QV to go through your table containing Date and return flag for every line where Date=vDate.

       

      For example, for your chosen date and Cat A, there are three lines. But you are only interested in the distinct two lines.

       

      If you use the set analysis version,

      =sum( {$<Date={'$(=vDate)'}>} flag)


      you limit the possible values for ID for Cat A to 1 and 4 by your set expression, so if you do the sum over flag, only two values of 1 are summed up (again, QV will look at each line of your flag table, but here there are only two lines to be regarded).

       

      This sounds a bit complicated and I think it is. I hope I made my point a bit clear.

       

      Regards,

      Stefan


      • Re: Replace count distinct by sum : bad results
        user5674

        Hi,

         

        Thanks for your answer.

         

        I understand what is the problem on the simple Sum(if(...)) now. It take all lines of the TMP table validating the condition instead of all lines of the flag table (TMPDistinct ).

         

        The first "set analysis" on the left pivot table (Sum set analysis) seems give the correct results.

         

        But I don't know why the "set analysis" don't take account of the selection. if I select the 20/05/2012 the column :

        -first table: Sum set analysis

        -second table: Sum set analysis with explicit if-switch

        remain the same but I haven't the condition Date=vDate (=25/05/2012) validated as Date=20/05/2012.

        So why I haven't "0" ?I have written the "$" though in order to use current selection.

         

         

        I haven't solution to have the good result on my second table with the "fake" dimension "CategoryTtl".

         

        For this one I'm not sure if  it's possible to heve the correct results without the horrible If/Switch conditions: I think a count distinct with the condition if(Category=CategoryTtl,...) is less heavy in calculation time, so it's meaningless to use set analysis with this method.

        • Replace count distinct by sum : bad results
          swuehl

          >So why I haven't "0" ?I have written the "$" though in order to use current selection.

           

          Sure, you are using the current selection set, but you override your selection in Date by using set modifier

          <Date={'$(=vDate)'}>}


          and vDate equals 25/05/2012, right?

           

          Looking at your second question using the data island, may I ask why do you want to use a data island here?

           

          I think you can make your expression work like:

           

          =If(

              CategoryTtl<>'Total'

              ,sum({$<Date={'$(=vDate)'}>} aggr(if(CategoryTtl=Category,  flag),CategoryTtl,Category,Date,Id))  

               ,sum({$<Date={'$(=vDate)'}>} flag)

          )

           

          Also, maybe this is of use to understand how QV handles set analysis in dimension context:

          http://community.qlik.com/docs/DOC-1335

           

          Regards,

          Stefan

          • Re: Replace count distinct by sum : bad results
            user5674

            About the <Date={'$(=vDate)'}>} I misunderstood set analysis here. I thought that was just a sort of filter on current selection, not a forced selection which override current selection.

             

            But for this type of formula this isn't a big problem here because the variable is automatically updated by macro in the real application when I change the date field.

             

             

            Concerning the goal of the data island this is related to this subject: http://community.qlik.com/message/221295

             

            I have a total aggregated over a "fake" dimension label in order to have two total rows with distinct formula.

             

            I have deleted the two total rows in this dicussion because it isn't directly related to the subject.

             

            Regards,

            Sylvain

            • Replace count distinct by sum : bad results
              user5674

              Hello Stefan,

               

              Thanks for the formula

              =If(

                  CategoryTtl<>'Total'

                  ,sum({$<Date={'$(=vDate)'}>} aggr(if(CategoryTtl=Category,  flag),CategoryTtl,Category,Date,Id))  

                   ,sum({$<Date={'$(=vDate)'}>} flag)

              )

               

              It return good dynamic results on "normal" rows, but the total row isn't dynamic (always "4" and not "0" when the 2012-05-20 date is selected). I have tried to change as sum({$<Date={'$(=vDate)'}>} aggr(flag,Date) ) but it don't give me the good result.

               

              Have you a solution for this ?

               

              Regards,

              Sylvain

              • Replace count distinct by sum : bad results
                swuehl

                You could try something like this (if you are limiting your evaluation to a certain date):

                 

                =If(

                     CategoryTtl<>'Total'

                     ,sum({$<Date={'$(=vDate)'}>} aggr(if(CategoryTtl=Category,  flag),CategoryTtl,Category,Date,Id)) 

                      ,sum({$<Date={'$(=vDate)'}>} aggr(flag,CategoryTtl, Date, Id))

                )

                • Replace count distinct by sum : bad results
                  user5674

                  Thanks!

                   

                  It solves my problem.

                  • Re: Replace count distinct by sum : bad results
                    user5674

                    Hi,

                     

                    I have a question on the method sum( {SET} aggr(, Dimensions) ):

                    I must add each field used in the set analysis formula as a dimensions of the aggr?

                     

                    I have many "IF" condition with text field range exclusion and date field range inclusion to replace by a set analysis formula in order to boost performances in my true application.

                     

                    If I must add them, I think I don't optimize but decrease performance as I aggregate over the ID I want count, and other fields.

                     

                    My full expression is like this:

                    Count(DISTINCT
                    IF(
                      [Category]=[CategoryTtl]
                      AND [Period]=vLastPeriod
                      AND Class='B'
                      AND ((vActive=0 AND Action='Y') OR vActive=1)
                      AND CStatus<>'SOLVED'
                      AND CStatus<>'CANCELED'
                      AND (not isnull([CStatus))
                      AND CStatus<>''
                      AND Record.Date<=vLastDate
                      AND Record.Date>vLastDate_1
                      ,FlagDistinctID
                    )
                    )


                    I have set this expression but it doesn't work (only 0 on each row) :

                     

                    (variable)

                     

                    Sum(

                        {$<
                          Period={'$(=vLastPeriod)'}
                          ,Class={'B'}
                          ,CStatus - = {'SOLVED','CANCELED',''}
                          ,"Record.Date"={'>$(=vLastDate_1)<=$(=vLastDate)'}
                         >}

                         aggr(
                              IF(
                                   [Category]=[CategoryTtl]
                                   AND ((vActive=0 AND Action='Y') OR vActive=1)
                                   ,FlagDistinctID
                              )
                              ,[CategoryTtl],[Category]
                          )
                    )

                     

                    My goal is to decrease expressions calculation time which is too long.

                     

                    I can't do something as sub aggregates/fields on the load script because I have many expressions like this one and my application have many selectable field for users, so my expressions must remain dependant of users selections.

                    And It's to difficult to estimate all possible sub aggregates to flag distinct ID in the load script.

                    ( expression A->distinct ID, expression B->distinct ID,...)

                     

                    Could you help me ?

                    • Replace count distinct by sum : bad results
                      swuehl

                      No, I believe you don't need to add each field in your set expression as dimension to the aggr() function. Not necessarily.

                       

                      Not sure why your expression returns all zero. Are you still using the same 1 dimension CategoryTtl?

                       

                      If you comment out some set modifiers in your set expression or the additional condition in the if() statement, does this change anything?

                       

                      It's quite hard to debug this just looking at the expression (I don't see anything really wrong at the moment) and without knowing your data (model). Is there a chance that you post an updated sample file?

                      • Re: Replace count distinct by sum : bad results
                        user5674

                        Thanks.

                         

                        I will provide an example.

                         

                        Regards.

                        • Re: Replace count distinct by sum : bad results
                          user5674

                          Hello,

                           

                          Here an example with a sample dataset CSV.

                           

                          The issue is to sum on flag tables only, and not on AllRecord table, as long as I understand the bad results when not "0".

                           

                          I don't know if it will really decrease the calculation time, but it's my goal to decrease it.

                           

                          Regards

                          • Re: Replace count distinct by sum : bad results
                            user5674

                            Hi,

                             

                            I have success to have the good results in my application using something like this

                             

                            aggr(

                                 Count(

                                      {SET CLAUSE}

                                      DISTINCT

                                      IF(

                                           TrueDimension=FakeDimension1

                                           AND conditions dependant of variables

                                          , ID

                                      )

                                      ,FakeDimension1,FakeDimension2

                            )

                             

                            Same thing for my Mt formula:

                            SUM(

                                 aggr(

                                      Min(

                                           {SET CLAUSE}

                                           IF(

                                                TrueDimension=FakeDimension1

                                                AND conditions dependant of variables

                                               , Mt

                                            )

                                      )

                                      ,FakeDimension1,FakeDimension2,ID

                                 )

                            )

                             

                             

                            Edit: The statistics give me a good result, calculation time / 2 for  ten expression like this instead of use the "IF" ! Set analysis is really interesting for me here.