<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Counting word occurrences from a text field in Talend Studio</title>
    <link>https://community.qlik.com/t5/Talend-Studio/Counting-word-occurances-from-a-text-field/m-p/2315514#M86159</link>
    <description>&lt;BLOCKQUOTE&gt; 
 &lt;TABLE border="1"&gt; 
  &lt;TBODY&gt; 
   &lt;TR&gt; 
    &lt;TD&gt;1. Pass the input row to tJavaRow with the following code to remove all non-space non-alpha characters and convert to lower-case:&lt;BR /&gt; output_row.ColumnName = input_row.ColumnName.replaceAll("[^a-zA-Z\\s]","").toLowerCase();&lt;BR /&gt;2. Then use tNormalize to convert it to one row for each word.&lt;BR /&gt;3. Then use tAggregateRow to group by and count the words.&lt;/TD&gt; 
   &lt;/TR&gt; 
  &lt;/TBODY&gt; 
 &lt;/TABLE&gt; 
&lt;/BLOCKQUOTE&gt; 
&lt;BR /&gt;Thanks so much for the help, alevy. I can't believe it's so simple, and it worked like a charm. Just a note: tAggregateRow doesn't like counting strings, so I included an ID field, counted its occurrences, and grouped by the word column. Everything works great. The output looks like: 
&lt;BR /&gt;was|51 
&lt;BR /&gt;stage|1 
&lt;BR /&gt;becoming|1 
&lt;BR /&gt;way|22 
&lt;BR /&gt;experience|8 
&lt;BR /&gt;Would you have any tips or ideas for an equally fantastic solution to filter out words I've compiled into a DB table? For instance, let's say the word "was" is in a table and should not appear in the output. 
&lt;BR /&gt;Thanks again for all your help! I'm starting to love this product.</description>
    <pubDate>Wed, 17 Aug 2011 04:02:09 GMT</pubDate>
    <dc:creator>Anonymous</dc:creator>
    <dc:date>2011-08-17T04:02:09Z</dc:date>
    <item>
      <title>Counting word occurrences from a text field</title>
      <link>https://community.qlik.com/t5/Talend-Studio/Counting-word-occurances-from-a-text-field/m-p/2315512#M86157</link>
      <description>Hi all, 
&lt;BR /&gt;Pardon my ignorance, as I'm just getting started with TOS and learning as I go. I've written PHP code that extracts all the words from a MySQL text column, strips unwanted characters such as "\/,-()", and makes every word lower case. It then compares each word against another table of excluded words and drops any matches. Finally, it updates an array with each remaining word and its count. I can then output the array to the screen, a file, or wherever. 
&lt;BR /&gt;For example, the paragraph: 
&lt;BR /&gt;"The brown cow jumped over the moon in a great hurry. He is in such a hurry!" 
&lt;BR /&gt;would have its case adjusted and unwanted characters stripped. It now becomes: 
&lt;BR /&gt;"the brown cow jumped over the moon in a great hurry he is in such a hurry" 
&lt;BR /&gt; 
&lt;BR /&gt;Then it filters out the excluded words, in this case say "he, the, a, in, such, is", so the output is now: 
&lt;BR /&gt;"brown cow jumped over moon great hurry hurry" 
&lt;BR /&gt; 
&lt;BR /&gt;Then it builds the array and counts occurrences, so it now looks like this: 
&lt;BR /&gt;Word | Count 
&lt;BR /&gt;_____ |______ 
&lt;BR /&gt;brown | 1 
&lt;BR /&gt;cow | 1 
&lt;BR /&gt;jumped | 1 
&lt;BR /&gt;over | 1 
&lt;BR /&gt;moon | 1 
&lt;BR /&gt;great | 1 
&lt;BR /&gt;hurry | 2 
&lt;BR /&gt; 
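&lt;BR /&gt;For reference, the processing described above could be sketched in plain Java roughly as follows. This is only an illustration under assumptions, not Talend code and not the original PHP: the class name WordCountSketch, the method countWords, and the hard-coded stop-word array are hypothetical stand-ins (the real excluded words would come from the second table). It sorts the cleaned words and counts runs of equal words, so the result comes out in alphabetical order.

```java
import java.util.Arrays;

class WordCountSketch {

    // Returns one "word|count" line per distinct word, in alphabetical order.
    static String countWords(String text, String[] stopWords) {
        // Strip everything except ASCII letters and whitespace, then lower-case.
        String cleaned = text.replaceAll("[^a-zA-Z\\s]", "").toLowerCase();

        // Split into words and sort so that equal words sit next to each other.
        String[] words = cleaned.trim().split("\\s+");
        Arrays.sort(words);

        StringBuilder out = new StringBuilder();
        String current = null;
        int count = 0;
        for (String word : words) {
            // Skip the empty token produced by blank input and any excluded word.
            if (word.isEmpty() || Arrays.asList(stopWords).contains(word)) {
                continue;
            }
            // A new word starts: emit the finished run and reset the counter.
            if (!word.equals(current)) {
                if (current != null) {
                    out.append(current).append('|').append(count).append('\n');
                }
                current = word;
                count = 0;
            }
            count++;
        }
        // Emit the last run, if there was one.
        if (current != null) {
            out.append(current).append('|').append(count).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "The brown cow jumped over the moon in a great hurry. He is in such a hurry!";
        // Hypothetical stop words; in practice these would be read from the DB table.
        String[] stopWords = {"he", "the", "a", "in", "such", "is"};
        System.out.print(countWords(text, stopWords));
    }
}
```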
&lt;BR /&gt;Can anyone give me some guidance on which components I need to work with and how the flow would be laid out? I've been looking at some ideas in the forums, as well as tExtractDelimitedFields, tArray, and tJava, and I noticed a pivot component in the Exchange area, but I'm not clear on which components to use or even whether this is possible. Any ideas would be greatly appreciated. 
&lt;BR /&gt; 
&lt;BR /&gt;Thanks in advance!</description>
      <pubDate>Sat, 16 Nov 2024 12:44:19 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/Counting-word-occurances-from-a-text-field/m-p/2315512#M86157</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2024-11-16T12:44:19Z</dc:date>
    </item>
    <item>
      <title>Re: Counting word occurrences from a text field</title>
      <link>https://community.qlik.com/t5/Talend-Studio/Counting-word-occurances-from-a-text-field/m-p/2315513#M86158</link>
      <description>1. Pass the input row to tJavaRow with the following code to remove all non-space non-alpha characters and convert to lower-case:
&lt;BR /&gt; output_row.ColumnName = input_row.ColumnName.replaceAll("[^a-zA-Z\\s]","").toLowerCase();
&lt;BR /&gt;2. Then use tNormalize to convert it to one row for each word.
&lt;BR /&gt;3. Then use tMap or tFilterRow to remove the words you don't want.
&lt;BR /&gt;4. Then use tAggregateRow to group by and count the remaining words.</description>
      <pubDate>Wed, 17 Aug 2011 00:45:12 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/Counting-word-occurances-from-a-text-field/m-p/2315513#M86158</guid>
      <dc:creator>alevy</dc:creator>
      <dc:date>2011-08-17T00:45:12Z</dc:date>
    </item>
    <item>
      <title>Re: Counting word occurrences from a text field</title>
      <link>https://community.qlik.com/t5/Talend-Studio/Counting-word-occurances-from-a-text-field/m-p/2315514#M86159</link>
      <description>&lt;BLOCKQUOTE&gt; 
 &lt;TABLE border="1"&gt; 
  &lt;TBODY&gt; 
   &lt;TR&gt; 
    &lt;TD&gt;1. Pass the input row to tJavaRow with the following code to remove all non-space non-alpha characters and convert to lower-case:&lt;BR /&gt; output_row.ColumnName = input_row.ColumnName.replaceAll("[^a-zA-Z\\s]","").toLowerCase();&lt;BR /&gt;2. Then use tNormalize to convert it to one row for each word.&lt;BR /&gt;3. Then use tAggregateRow to group by and count the words.&lt;/TD&gt; 
   &lt;/TR&gt; 
  &lt;/TBODY&gt; 
 &lt;/TABLE&gt; 
&lt;/BLOCKQUOTE&gt; 
&lt;BR /&gt;Thanks so much for the help, alevy. I can't believe it's so simple, and it worked like a charm. Just a note: tAggregateRow doesn't like counting strings, so I included an ID field, counted its occurrences, and grouped by the word column. Everything works great. The output looks like: 
&lt;BR /&gt;was|51 
&lt;BR /&gt;stage|1 
&lt;BR /&gt;becoming|1 
&lt;BR /&gt;way|22 
&lt;BR /&gt;experience|8 
&lt;BR /&gt;Would you have any tips or ideas for an equally fantastic solution to filter out words I've compiled into a DB table? For instance, let's say the word "was" is in a table and should not appear in the output. 
&lt;BR /&gt;Thanks again for all your help! I'm starting to love this product.</description>
      <pubDate>Wed, 17 Aug 2011 04:02:09 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/Counting-word-occurances-from-a-text-field/m-p/2315514#M86159</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2011-08-17T04:02:09Z</dc:date>
    </item>
    <item>
      <title>Re: Counting word occurrences from a text field</title>
      <link>https://community.qlik.com/t5/Talend-Studio/Counting-word-occurances-from-a-text-field/m-p/2315515#M86160</link>
      <description>&lt;BLOCKQUOTE&gt;&lt;TABLE border="1"&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD&gt;tAggregateRow doesn't like counting strings&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;/BLOCKQUOTE&gt;&lt;BR /&gt;I had no problem with this in 4.1.3 or 4.2.2...&lt;BR /&gt;&lt;BLOCKQUOTE&gt;&lt;TABLE border="1"&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD&gt;Would you have any tips or ideas for an equally fantastic solution to filter out words I've compiled into a DB table&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;/BLOCKQUOTE&gt;&lt;BR /&gt;Pass the flow from tNormalize into tMap. Use e.g. tMSSqlInput to provide a lookup flow containing the words to exclude. Set the join to inner join and, on the output, catch the inner-join lookup rejects, so that only words with no match in the lookup table pass through.</description>
      <pubDate>Wed, 17 Aug 2011 04:54:43 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/Counting-word-occurances-from-a-text-field/m-p/2315515#M86160</guid>
      <dc:creator>alevy</dc:creator>
      <dc:date>2011-08-17T04:54:43Z</dc:date>
    </item>
  </channel>
</rss>

