Anonymous
Not applicable

Export to UTF-8 CSV using Data Preparation Tool Free

Hello

 

We are using the Data Preparation Free tool to manage an Excel file that will then be imported into a Marketing Automation tool called Marketo.

 

Unfortunately, when we export the data from the Data Preparation tool using CSV UTF-8 encoding, any Latin characters with accents or umlauts (é, ü, etc.) come out incorrectly. When we do the same thing with MS Excel, everything comes out fine.

 

What is it that we are doing wrong when exporting data from the Data Preparation Free tool?

 

Thanks for the help

9 Replies
Anonymous
Not applicable
Author

Hi,

 

It seems surprising at first glance ... especially since Data Prep only supports UTF-8 when exporting to CSV.

 

So, a few questions:

  • Can you clarify in which tool you have faced the encoding issue when opening the CSV file generated by Data Prep?
  • Could you share a sample of the source file and of the resulting file exported from Data Prep? It can of course be dummy data instead of your real data, as long as it reproduces the issue.

 

 

Thank you

Anonymous
Not applicable
Author

Hi Gwendal

 

Thanks for replying so quickly.

 

We use the Talend Data Preparation Free Desktop tool, then open the file in MS Excel 2016.

 

Here are two examples

Company | First Name | Last Name | Function Title | Address            | Postal Code | Post Office       | Country
ABC     | John       | Smith     | CIO            | Stephensonstraße 1 | 12345       | Frankfurt a. Main | Germany
XYZ     | Jörg       | Munch     | CDO            | Arabellastraße 4   | 12345       | München           | Germany

 

Hope this helps

Thanks

Axel

Anonymous
Not applicable
Author

Hi Axel,

 

That is what I assumed ... it's actually an Excel issue: Excel doesn't automatically detect the encoding of CSV files and always assumes Windows-1252 ... hence the issue you see when opening the CSV file generated by Data Prep. If you open the CSV file in Notepad++ or any other decent text editor, you will see that the file generated by Data Prep is correctly encoded in UTF-8.

 

Note that you can get Excel to display a UTF-8 file correctly ... it's anything but user-friendly, though. See https://www.youtube.com/watch?v=GcYt1mJbwk4 for instance (it is not on Excel 2016, but the sequence is essentially the same). And it has been this way forever ... so it is unlikely to be fixed by Microsoft.
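For what it's worth, Excel (including 2016) will also auto-detect UTF-8 if the file starts with a byte-order mark (the bytes EF BB BF). Data Prep doesn't expose that option, but a tiny post-processing step can write it. A minimal sketch in plain Java; the file name and columns are just made-up examples:

```java
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class BomCsvWriter {
    public static void main(String[] args) throws Exception {
        try (Writer w = new OutputStreamWriter(
                new FileOutputStream("export.csv"), StandardCharsets.UTF_8)) {
            // Writing U+FEFF first puts the bytes EF BB BF at the start of
            // the file, which Excel uses to recognize UTF-8
            w.write('\uFEFF');
            w.write("Company;First Name;Last Name\n");
            w.write("XYZ;Jörg;Munch\n"); // the ö now survives in Excel
        }
    }
}
```

Double-clicking the resulting file should then show the accented characters correctly without going through the import wizard.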

 

Good news: in the next Data Prep release (planned for January 2018), you will be able to select the encoding when exporting to CSV. So you'll be able to export in Windows-1252 so that Excel can read it correctly natively.

 

Hope this helps,

 

Gwendal

Anonymous
Not applicable
Author


Good news: in the next Data Prep release (planned for January 2018), you will be able to select the encoding when exporting to CSV. So you'll be able to export in Windows-1252 so that Excel can read it correctly natively.

Hello @gvaznunes,
I have the exact same problem in Talend Open Studio. Do you know if there is another way to solve it there?

Thank you, and sorry for posting about it in this conversation!

Anonymous
Not applicable
Author

Hi,

 

Encoding is configurable in tFileOutputDelimited ... in the "Advanced" tab:

[screenshot: tFileOutputDelimited "Advanced" tab, showing the Encoding option]

 

So just select the appropriate encoding here ... and job done!
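To see why that setting matters: the same string produces different bytes depending on the charset the writer uses, which is all the Encoding option controls. A quick illustration in plain Java (the language Talend Jobs compile to):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    public static void main(String[] args) {
        String city = "München";
        // UTF-8 encodes ü as two bytes (0xC3 0xBC) -> 8 bytes total
        byte[] utf8 = city.getBytes(StandardCharsets.UTF_8);
        // windows-1252 encodes ü as one byte (0xFC) -> 7 bytes total
        byte[] cp1252 = city.getBytes(Charset.forName("windows-1252"));
        System.out.println(utf8.length + " / " + cp1252.length); // 8 / 7
    }
}
```

Whichever charset writes the file must also be the one the consumer reads it with; the bytes themselves carry no label.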

 

Regards,

 

Gwendal

Anonymous
Not applicable
Author

@gvaznunes, thank you for your help. I already knew about that, but the problem is that I have already set the encoding and I'm still having problems with some characters.

I tried changing the encoding from UTF-8 to the others, but I still have the problem...

The source is an Oracle DB with CP-1252 encoding (i.e. Windows-1252), and the issues are with characters like Ä / Ü and similar...

Do you have some idea on how to resolve this?
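One common way Ä / Ü get corrupted in a setup like this is some stage reading the CP-1252 bytes as if they were UTF-8: the single byte 0xC4 (Ä in Windows-1252) is not a valid UTF-8 sequence on its own, so the decoder substitutes the replacement character U+FFFD (shown as ?). A small sketch of that failure mode, assuming this is the mismatch happening somewhere in the Job:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MismatchDemo {
    public static void main(String[] args) {
        // "Ä" encoded as windows-1252 is the single byte 0xC4
        byte[] cp1252Bytes = "Ä".getBytes(Charset.forName("windows-1252"));
        // Decoding that byte as UTF-8 fails: 0xC4 is a UTF-8 lead byte
        // with no continuation byte, so Java substitutes U+FFFD
        String misread = new String(cp1252Bytes, StandardCharsets.UTF_8);
        System.out.println(misread);
    }
}
```

If you see ? (or diamonds with question marks) in the output, look for the component that reads the Oracle data and make sure it is told the source is CP-1252, not UTF-8.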

Anonymous
Not applicable
Author

@gvaznunes @abaran hi guys!

Do you have any advice to solve this problem?

I will write down some of the words that are causing this problem (on the left the correct word, on the right what I find in the CSV after the extraction):
- Querétaro → QuerÃ©taro
- Cuautitlán → CuautitlÃ¡n
- Zürich → ZÃ¼rich
- KLÖCKNER → KLÃ–CKNER
- Chéraga → ChÃ©raga

 

Right now the word on the left is the one extracted directly from the table (into a CSV file) from the Oracle DB with CP-1252 encoding, and the word on the right is the one extracted by the Talend procedure with UTF-8 encoding, and also with the custom CP-1252 encoding...

Just a little background about my Talend Job... Before the extraction to the CSV file, I use a REST call (tRESTClient) via an Application Server to fill the table on the destination database. The data are stored with the correct encoding in the source DB, but when I check the destination DB the data are already stored incorrectly...
Is it possible that there is a different encoding on the Application Server?
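Incidentally, accented characters coming out as two-character sequences like Ã© or Ã¼ is the classic signature of UTF-8 bytes being decoded as Windows-1252 somewhere in the chain. The root cause should be fixed at the stage that mis-decodes, but strings garbled this way can usually be repaired by reversing the mistake. A sketch in plain Java, assuming this is indeed the corruption pattern:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeRepair {
    public static void main(String[] args) {
        // "ZÃ¼rich" is what "Zürich" becomes when its UTF-8 bytes
        // (0xC3 0xBC for ü) are decoded as windows-1252
        String garbled = "ZÃ¼rich";
        // Reverse it: re-encode as windows-1252 to recover the original
        // byte sequence, then decode those bytes as UTF-8
        String repaired = new String(
                garbled.getBytes(Charset.forName("windows-1252")),
                StandardCharsets.UTF_8);
        System.out.println(repaired); // Zürich
    }
}
```

This round-trip only works while every garbled byte still maps to a Windows-1252 character; if a later stage has already replaced characters with ?, the information is gone.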


Thank you!

Anonymous
Not applicable
Author

Hello,
I'm here again, still with the same issue...
I checked in detail what happens in my Job and saw that the REST call on the Application Express works properly; the extracted data are in the correct encoding.

I was looking around on the web and saw that a lot of people have this issue, as you can see in the following link: https://community.talend.com/t5/Design-and-Development/UTF-8-ANSI-problem/td-p/190040 .

How can we solve this?

It seems that it is all related to this UTF-8 encoding problem: http://www.i18nqa.com/debug/utf8-debug.html , but I cannot find a way to solve it...
I hope that someone can help me.

Thank you.

Anonymous
Not applicable
Author

Hello,
I managed to partially resolve my problem. I added two parameters to my Job:
- one parameter on the tOracleOutput component, in the Advanced settings: "characterEncoding=UTF-8"

- one JVM argument in the execution section of the main Job: -Dfile.encoding=UTF-8.

[screenshot: Run tab with the JVM argument -Dfile.encoding=UTF-8]

Now I have just one issue left: the characters - and ' are inserted into my Oracle DB as .

Do you have any idea on how to resolve this last issue?

Thank You.