Skip to main content
Announcements
NEW: Seamless Public Data Sharing with Qlik's New Anonymous Access Capability: TELL ME MORE!
ArturoMuñoz
Employee
Employee

If you follow this blog then you know that during the last weeks we’ve been working in our Choose your champion‌ app for the FIFA World Cup tournament happening in Russia.

Our data provider, Gracenote, populates and periodically updates several tables in a SQL database with all the tournament data.

As an example, just assume the Players table looks like this:

Player

Actions

Player A

145

Player B

17

Player C

139

 

The field Actions contain “compressed” information for 8 different actions during a game. The value 145 encapsulates all the information about Player A's performance during a certain game.

Combining multiple values into a single field

 

It is a frequent practice to combine multiple flag or binary columns into a single decimal column, instead of creating multiple columns to store these indicators.

One of the main advantages of this system versus the alternative of creating a column for each one of the actions is that if later we decide to record a new type of action we can just add it to the Actions column versus having to create a new column for each field.

But, how can I read what Actions column content actually means?

 

To do so we need more information. We need to know how that column is being created. For example, let’s assume we have this description of Actions column:

 

Actions Legend

game played = 1

goal scored = 2

penalty scored = 4

penalty missed = 8

1st yellow card = 16

2nd yellow card = 32

red card = 64

assists = 128

 

The above's data is the key to understand what’s stored in the Action column. To get a better view of it, let's convert the base 10 integer “Value”, to binary as in the expanded representation table below:

 

Player A (145)

Player B (17)

Player C (139)

Game played (1)

1

1

1

Goal scored (2)

0

0

1

Penalty scored (4)

0

0

0

Penalty missed (8)

0

0

1

1st yellow card (16)

1

1

0

2nd yellow card (32)

0

0

0

Red card (64)

0

0

0

Assist (128)

1

0

1

 

Binary flags representation:

Player

Binary

Decimal

Player A

10010001

145

Player B

00010001

17

Player C

10001011

139

 

Description:

Player A: played the game, got a yellow card and gave an assist to other player to score a goal.

Player B: played the game and got a yellow.

Player C: played the game, scored a goal, missed a penalty and assisted for a goal.

 

How to find what we are looking for?

 

To do so, it’s important to familiarize with the available bitwise or bit operators in Qlik:

 

Bitnot, bitand, bitor, bitxor , >>, <<

 

For more examples and further explanation you can check this page: https://en.wikipedia.org/wiki/Bitwise_operation

 

Continuing with the example, let’s try to find players that have scored a goal.

 

We need to check if the second bit is set, that means 00000010 (or integer 2 in base 10)

 

Player

Value

Value bitand 2

Player A

145

0

Player B

17

0

Player C

139

2

 

Select
  Player,
  Value
From PlayerTable where Value bitand 2 >0;

 

The statement above is selecting from PlayerTable where Value bitand 2 > 0. But, let's go row by row to see what's happening under the hood:

A bitwise AND takes two equal-length binary representations and performs the logical AND operation on each pair of the corresponding bits, by multiplying them. Thus, if both bits in the compared position are 1, the bit in the resulting binary representation is 1 (1 × 1 = 1); otherwise, the result is 0 (1 × 0 = 0 and 0 × 0 = 0)

 

Player A

        10010001 (145 decimal)

bitand  00000010 (2 decimal)

        ---------

returns 00000000 (0 decimal)

 

Player B

        00010001 (17 decimal)

bitand  00000010 (2 decimal)

        ---------

returns 00000000 (0 decimal)

 

Player C

        10001011 (139 decimal)

bitand  00000010 (2 decimal)

        ---------

returns 00000010 (2 decimal)

 

 

Find the players who received a “Yellow Card” and did not “Assist”

 

We need to check if the fifth bit is set,  00010000 (or integer 16 in base 10) and if the eight bit is not set (128 base 10)

 

Player

Value

Value bitand 16

Value bitand 128

Player A

145

16

128

Player B

17

16

0

Player C

139

0

128

 

Select
  Player,
  Value
From PlayerTable where Value bitand 16 > 0 and Value bitand 128 = 0;

Bit operators in expressions

 

Remember you can use the bit operators to create expressions too, so in a object we could do:

=count(distinct {<Value = {"=Value bitand 16 > 0"}>} Player)

 

This counts how many players have seen a yellow card so far, note how we are comparing the Value with the flag yellow card (16).

As you can see in the examples, the use of the bitand operator has allowed us to compare two different values at a binary level and returns whether the two numbers intersect allowing me to select and/or count the correct numbers.

 

Hope you find this interesting.

Arturo (@arturoqv)

 

PS: I'm attaching a QlikView and Qlik Sense app with the example data so you can test it out yourselves.

6 Comments
Or
MVP
MVP

Using this "compressed" method appears to use up nearly the same memory compared splitting it into individual columns, based on some random numbers I had QlikView generate. A million lines 0-1023 take up 1.59MB on disk rather than 1.60MB when split into ten binary flags (hopefully I got the math right) . I suppose if your app has enough lines / enough fields like this and you're clawing for every byte, it's worth doing, but in most use cases I think I'd rather have individual columns for ease of use. Reading "YellowCard AND NOT Assist" is a lot easier on the eyes than "Value bitand 16 > 0 and Value bitand 128 = 0;"

Are there any use cases for this other than compression? Otherwise, if my data came in this format, I would find the bit operators very useful, but only for breaking up the data into individual columns.

3,518 Views
ArturoMuñoz
Employee
Employee

I think you summarized it pretty well Or Shoham, it's a method that the DB Architect could use to potentially reduce the size and maintenance needs of a DB, not really the Qlik app. It is rarely a decision that a Qlik Developer will face her/himself. This post was meant to serve as an explanation for us, Qlik Devs, on how to deal with situations where the data we are working with contains columns like described in the blog post.

3,518 Views
Or
MVP
MVP

In that case, good job explaining - it was crystal clear. Now we just need to replace your DB architect.

3,518 Views
evanplancaster
Contributor III
Contributor III

Though the example given is a little contrived, I could totally see a use for this in things where the number of fields could be extremely large and full of 0s for most entries if you had to have a set number of columns.

Take, for example, a dataset containing every pitch thrown to a batter in baseball. If I want to know what pitches the batter swung at, this could be difficult to fit into a regular table, because there are not a set number of pitches a batter can face per plate appearance. (For those not familiar with baseball, the batter has baseballs thrown to him and he has to either hit the ball before three balls in the strike zone are thrown to him, wherein he strikes out, or get four balls out of the strike zone thrown to him, wherein he gets a walk, getting a free pass to first base. If he has two strikes against him, he can continue to hit balls foul (not in play) for as long as he is able, meaning a plate appearance can result in as few as one pitch -- the batter hits the first pitch he sees or gets hit by the pitch -- or as many as... well... theoretically, an infinite number of pitches, because he could just keep fouling pitches off indefinitely.) If one batter has a crazy plate appearance like this, for instance:

PitchCount   PitchResult   Count

  1. Swing/Miss 0-1
  2. Swing/Miss 0-2
  3. Swing/Foul 0-2
  4. Ball 1-2
  5. Swing/Foul 1-2
  6. Ball 2-2
  7. Swing/Foul 2-2
  8. Swing/Foul 2-2
  9. Swing/Foul 2-2
  10. Swing/Foul 2-2
  11. Swing/Foul 2-2
  12. Swing/Foul 2-2
  13. Ball 3-2
  14. Swing/Foul 3-2
  15. Swing/Foul 3-2
  16. Swing/Foul 3-2
  17. Swing/Foul 3-2
  18. Swing/Foul 3-2
  19. Swing/Foul 3-2
  20. Swing/Miss Strikeout*

... then you'd have to have to have 20 columns for every single plate appearance for every single player, even though the average number of pitches seen in a plate appearance is like under 4. And of course, what happens if someone goes and gets a 21-pitch plate appearance? Then you have to go back and add another column for pitch 21 and populate it with a 0 for every plate appearance except the one that went 21 pitches deep.

Instead, you could have it set up as follows:

Pitch n = 2^(n-1), where n = the pitch number seen in the plate appearance.

So:

  • A record for whether the batter swung at pitch 1 would equal 2^(1-1) = 2^0 = 1, multiplied by either a 1 or 0, depending on whether the batter swung or not.
  • A record for whether the batter swung at pitch 2 would equal 2^(2-1) = 2^1 = 2, multiplied by either a 1 or 0, depending on whether the batter swung or not.
  • A record for whether the batter swung at pitch 2 would equal 2^(3-1) = 2^2 = 4, multiplied by either a 1 or 0, depending on whether the batter swung or not.
  • Etc.

Thus the plate appearance above would look like:

11111110111111010111(base 2) = 1044439 (base 10)

Meanwhile, a more typical plate appearance would look like:

  1. Ball 1-0
  2. Swing/Miss 1-1
  3. Swing/Miss 1-2
  4. Ball 2-2
  5. Ball 3-2
  6. Swing/Miss Strikeout

... which would convert to:

100110(base 2) = 38 (base 10)

No need for a bunch of virtually empty columns and the potential to have to go back and add more to accommodate one ridiculously long (but admittedly pretty epic) plate appearance!

I agree that this would be pretty difficult to explain to the average Qlik developer, and it does make the code is quite difficult to interpret. I also freely admit I'm not a DB Admin by ANY stretch of the imagination, and so there may be a dozen ways to store this type of data just as efficiently of which I am unaware. But I can also see it usefulness in certain situations.

Good article!

Evan Lancaster

(* see 6 Epic At-Bats | Mental Floss‌; this plate appearance actually happened!)

3,518 Views
ArturoMuñoz
Employee
Employee

giphy.gif

Well done Evan!

0 Likes
3,518 Views
Or
MVP
MVP

The only problem here is that the proposed dataset doesn't actually let you use common filters - such as "Show me how a hitter performs on a 3-2 count" - without having to loop through the entire field and break down each individual bit. I'd just store this as a non-binary table containing the pitch number, count, and outcome, and I imagine that's how MLB stores it too. It wouldn't actually cost any significant amount of memory in Qlik because the number of distinct values is quite low - I think in this case it would actually be more efficient than using bits. Might check it later. 

0 Likes
3,518 Views