topic Re: Counting word occurances from a text field in Talend Studio

Counting word occurances from a text field

Anonymous — Sat, 16 Nov 2024 12:44:19 GMT

Hi all,
Pardon my ignorance as I'm just getting started with TOS and learning as I go. I've written code in PHP that basically extracts all the words from a MYSQL db text column, strips unwanted characters and makes all the words lower case, so characters like "\/,-()" etc get stripped, then compares and filters the word against another table so they do not get included, finally, it updates an array with the word and adds a count to it. I can then output the array to the screen, a file or whatever.
For example, the paragraph:
"The brown cow jumped over the moon in a great hurry. He is in such a hurry!"
Would have case adjusted and unwanted characters striped. It now becomes:
"the brown cow jumped over the moon in a great hurry he is in such a hurry"

Then, filters for words, in this case say the words, "he, the, a, in, such, is" so output is now:
"brown cow jumped over moon great hurry hurry"

Then builds the array and counts occurrences, so it now looks like this:
Word Count
_____ |______
brown | 1
cow | 1
jumped | 1
over | 1
moon | 1
great | 1
hurry | 2

Can anyone give me some guidance what components I need to work with and how the flow would pattern out? I've been looking at some ideas in the forums, as well as the tExtractDelimitedFields, tArray, tJava and noticed a pivot component in the exchange area, but again, I'm not clear on which components to use or even if this is possible? Any ideas would be greatly appreciated.

Thanks in advance!

Re: Counting word occurances from a text field

alevy — Wed, 17 Aug 2011 00:45:12 GMT

1. Pass the input row to tJavaRow with the following code to remove all non-space non-alpha characters and convert to lower-case:
output_row.ColumnName = input_row.ColumnName.replaceAll("|\\d","").toLowerCase();
2. Then use tNormalize to convert it to one row for each word.
3. Then use tMap or tFilterRow to remove the words you don't want.
4. Then use tAggregateRow to group by and count the remaining words.

Re: Counting word occurances from a text field

Anonymous — Wed, 17 Aug 2011 04:02:09 GMT

1. Pass the input row to tJavaRow with the following code to remove all non-space non-alpha characters and convert to lower-case:
output_row.ColumnName = input_row.ColumnName.replaceAll("|\\d","").toLowerCase();
2. Then use tNormalize to convert it to one row for each word.
3. Then use tAggregateRow to group by and count the words.

Thanks so much for the help alevy. Can't believe it's so simple and it worked like a charm. Just a note, the tAggregateRow doesn't like counting strings. I included an ID field and counted those occurrences and grouped up by the string word and everything works great. Output looks like:
was|51
stage|1
becoming|1
way|22
experience|8
Would you have any tips or ideas for an equally fantastic solution to filter out words I've compiled into a DB table? For instance, lets say the word "was" is in a table and should not appear??
Thanks again for all your help! I'm starting to love this product.....

Re: Counting word occurances from a text field

alevy — Wed, 17 Aug 2011 04:54:43 GMT

tAggregateRow doesn't like counting strings

I had no problem with this in 4.1.3 or 4.2.2...

Would you have any tips or ideas for an equally fantastic solution to filter out words I've compiled into a DB table

Pass the flow from tNormalize into tMap. Use e.g. tMSSqlInput to provide a lookup flow. Set to inner-join and Catch inner-join lookup rejects for the output.