Do not input private or sensitive data. View Qlik Privacy & Cookie Policy.
Skip to main content

Announcements
Write Table now available in Qlik Cloud Analytics: Read Blog
cancel
Showing results for 
Search instead for 
Did you mean: 
Anonymous
Not applicable

How to create an AVRO schema for my JSON data?

I'm a data scientist and have been tasked with trying out Pipeline Designer by my company. I have some JSON data that I was hoping to just process. When using a Kinesis source I need to read it in an AVRO format. I have done a bit of reading about this and have found a bit of Python that will convert JSON to an AVRO format, but first I need to come up with a suitable AVRO schema for the conversion. Is there an easy way to do this? My JSON is quite big and I have no idea about Apache AVRO. Maybe there is an online tool that can be used? It would be really useful if someone can help as I think Pipeline Designer might be quite useful for us, but I can't test it without first converting my JSON.

Labels (4)
1 Solution

Accepted Solutions
Anonymous
Not applicable
Author

Hi @JackTaylor,

 

There are several AVRO tools online, but I haven't found one which solves the problem that you have. In fact, I had the same issue when trying to process Twitter data. One of the guys from R&D let me in on a trick which is really useful. 

 

1) Create a Test Connection and a Dataset from that, you can hard code your JSON into the values section after setting the type to JSON. Below is a screenshot of my Twitter data being hardcoded into mine.

0683p000009M4k6.png

 

2) I assume you have access to create a S3 Bucket in Amazon (given you are using Kinesis)? If so, create a connection to S3 and create a S3 data Dataset.

 

3) Now all you need to do is create a Pipeline between your Test Connection source Dataset (containing your hard coded JSON) and your S3 bucket target Dataset. Below is a screenshot of the one I created. You can see the data preview showing the twitter data.

0683p000009M4p5.png

 

4) When you run it, there will be a file which is created in your S3 Bucket. This will contain your JSON data in an AVRO format with the AVRO schema that was used at the very beginning of the file. Below you can see the contents of the file that was created. I have coloured the AVRO schema in red...

Objavro.schemaä]{"type":"record","name":"outer_record_3231340614480644067","namespace":"org.talend","fields":[{"name":"created_at","type":["string","null"]},{"name":"id","type":["long","null"]},{"name":"id_str","type":["string","null"]},{"name":"text","type":["string","null"]},{"name":"source","type":["string","null"]},{"name":"truncated","type":["boolean","null"]},{"name":"in_reply_to_status_id","type":["string","null"]},{"name":"in_reply_to_status_id_str","type":["string","null"]},{"name":"in_reply_to_user_id","type":["string","null"]},{"name":"in_reply_to_user_id_str","type":["string","null"]},{"name":"in_reply_to_screen_name","type":["string","null"]},{"name":"user","type":{"type":"record","name":"subrecord_1333510210252506782","namespace":"","fields":[{"name":"id","type":["int","null"]},{"name":"id_str","type":["string","null"]},{"name":"name","type":["string","null"]},{"name":"screen_name","type":["string","null"]},{"name":"location","type":["string","null"]},{"name":"url","type":["string","null"]},{"name":"description","type":["string","null"]},{"name":"translator_type","type":["string","null"]},{"name":"protected","type":["boolean","null"]},{"name":"verified","type":["boolean","null"]},{"name":"followers_count","type":["int","null"]},{"name":"friends_count","type":["int","null"]},{"name":"listed_count","type":["int","null"]},{"name":"favourites_count","type":["int","null"]},{"name":"statuses_count","type":["int","null"]},{"name":"created_at","type":["string","null"]},{"name":"utc_offset","type":["string","null"]},{"name":"time_zone","type":["string","null"]},{"name":"geo_enabled","type":["boolean","null"]},{"name":"lang","type":["string","null"]},{"name":"contributors_enabled","type":["boolean","null"]},{"name":"is_translator","type":["boolean","null"]},{"name":"profile_background_color","type":["string","null"]},{"name":"profile_background_image_url","type":["string","null"]},{"name":"profile_background_image_url_https","type":["string","null"]},{"name":"profile_background_tile","type":["boolean","null"]},{"name":"profile_link_color","type":["string","null"]},{"name":"profile_sidebar_border_color","type":["string","null"]},{"name":"profile_sidebar_fill_color","type":["string","null"]},{"name":"profile_text_color","type":["string","null"]},{"name":"profile_use_background_image","type":["boolean","null"]},{"name":"profile_image_url","type":["string","null"]},{"name":"profile_image_url_https","type":["string","null"]},{"name":"profile_banner_url","type":["string","null"]},{"name":"default_profile","type":["boolean","null"]},{"name":"default_profile_image","type":["boolean","null"]},{"name":"following","type":["string","null"]},{"name":"follow_request_sent","type":["string","null"]},{"name":"notifications","type":["string","null"]}]}},{"name":"geo","type":["string","null"]},{"name":"coordinates","type":["string","null"]},{"name":"place","type":{"type":"record","name":"subrecord_2202783987613205576","namespace":"","fields":[{"name":"id","type":["string","null"]},{"name":"url","type":["string","null"]},{"name":"place_type","type":["string","null"]},{"name":"name","type":["string","null"]},{"name":"full_name","type":["string","null"]},{"name":"country_code","type":["string","null"]},{"name":"country","type":["string","null"]},{"name":"bounding_box","type":{"type":"record","name":"subrecord6297473197442324111","namespace":"org.talend","fields":[{"name":"type","type":["string","null"]},{"name":"coordinates","type":{"type":"array","items":{"type":"record","name":"subrecord_3004235398744729621","fields":[]}}}]}},{"name":"attributes","type":"org.talend.subrecord_3004235398744729621"}]}},{"name":"contributors","type":["string","null"]},{"name":"is_quote_status","type":["boolean","null"]},{"name":"extended_tweet","type":{"type":"record","name":"subrecord_6104836229700222151","namespace":"","fields":[{"name":"full_text","type":["string","null"]},{"name":"display_text_range","type":{"type":"array","items":["int","null"]}},{"name":"entities","type":{"type":"record","name":"subrecord5678571617538413247","namespace":"org.talend","fields":[{"name":"hashtags","type":{"type":"array","items":{"type":"record","name":"subrecord5761850113934372676","fields":[{"name":"text","type":["string","null"]},{"name":"indices","type":{"type":"array","items":["int","null"]}}]}}},{"name":"urls","type":{"type":"array","items":{"type":"record","name":"subrecord_2167212455227691253","fields":[{"name":"url","type":["string","null"]},{"name":"expanded_url","type":["string","null"]},{"name":"display_url","type":["string","null"]},{"name":"indices","type":{"type":"array","items":["int","null"]}}]}}},{"name":"user_mentions","type":{"type":"array","items":{"type":"record","name":"subrecord_953558199486337988","fields":[{"name":"screen_name","type":["string","null"]},{"name":"name","type":["string","null"]},{"name":"id","type":["long","null"]},{"name":"id_str","type":["string","null"]},{"name":"indices","type":{"type":"array","items":["int","null"]}}]}}},{"name":"symbols","type":{"type":"array","items":["string","null"]}}]}}]}},{"name":"quote_count","type":["int","null"]},{"name":"reply_count","type":["int","null"]},{"name":"retweet_count","type":["int","null"]},{"name":"favorite_count","type":["int","null"]},{"name":"entities","type":{"type":"record","name":"subrecord4863175834995993793","namespace":"","fields":[{"name":"hashtags","type":{"type":"array","items":["string","null"]}},{"name":"urls","type":{"type":"array","items":"org.talend.subrecord_2167212455227691253"}},{"name":"user_mentions","type":{"type":"array","items":["string","null"]}},{"name":"symbols","type":{"type":"array","items":["string","null"]}}]}},{"name":"favorited","type":["boolean","null"]},{"name":"retweeted","type":["boolean","null"]},{"name":"possibly_sensitive","type":["boolean","null"]},{"name":"filter_level","type":["string","null"]},{"name":"lang","type":["string","null"]},{"name":"timestamp_ms","type":["string","null"]}]}avro.codecsnappyuÿ˙”ABÂ8úh∏
m`Úî
Ù<Wed Apr 24 11:09:06 +0000 2019Ä¿∂Ω»æœé&1121008154978996224ÜHow can Think tanks inform government over transitions/elections? Here are examples from many countries:‚Ķ https://t.co/5Ch96iQEoX§<a href="http://twitter.com/download/iphone" rel="nofollow">T	,D for iPhone</a>P‘ºé–218222378On
ÌTÌont¸t46http://www..8.orgˆIndepend!dresearch, ideas and adviceåa better-OXed world | Global platf!i∫Ñ-Äpbased at @udelpacificononeê⁄rÆ0‡¬†ä<Sun Nov 21 19:22:52 U 0en;$022330b
È`abs.twimg.com/images/them(15/bg.pngd3!¡™4q084B4F(C0DFEC3îY://p2åprofile_
î†665054717790162944/8nRE5cuT_normal.pngñM¶˛NNtvMbanners/U_(/1553553770!nAÉP 01e215db7136a37epWapi!~A(.!Ä(1.1/geo/id/>5T.jsoncityNairobi
	0, KenyaKE
	8Polygon÷˛Í¢Í h.ÁhXr11hoVUlW @ACBF_Official #ŰATTS #E˝eÅma!±$ †≤i)2,$¥÷.Zz÷>”/articlEÏink-Y-and-ï‚»-strategies-to-raise-the-quality-of-democracies/DoùRÅ8:e ‚͓ĵ2\ «ÜÖ3092304357Çû!…:ΩQhÔt©E[<i/web/status/112B
:f.≥‘ÖlowÖW41556104146128&spuÿ˙”ABÂ8úh∏
m`

If you use the schema you see here when you repeat these steps using your JSON, you will be able to use the AVRO schema to convert your JSON into AVRO using your Python code. 

View solution in original post

3 Replies
Anonymous
Not applicable
Author

Hi @JackTaylor,

 

There are several AVRO tools online, but I haven't found one which solves the problem that you have. In fact, I had the same issue when trying to process Twitter data. One of the guys from R&D let me in on a trick which is really useful. 

 

1) Create a Test Connection and a Dataset from that, you can hard code your JSON into the values section after setting the type to JSON. Below is a screenshot of my Twitter data being hardcoded into mine.

0683p000009M4k6.png

 

2) I assume you have access to create a S3 Bucket in Amazon (given you are using Kinesis)? If so, create a connection to S3 and create a S3 data Dataset.

 

3) Now all you need to do is create a Pipeline between your Test Connection source Dataset (containing your hard coded JSON) and your S3 bucket target Dataset. Below is a screenshot of the one I created. You can see the data preview showing the twitter data.

0683p000009M4p5.png

 

4) When you run it, there will be a file which is created in your S3 Bucket. This will contain your JSON data in an AVRO format with the AVRO schema that was used at the very beginning of the file. Below you can see the contents of the file that was created. I have coloured the AVRO schema in red...

Objavro.schemaä]{"type":"record","name":"outer_record_3231340614480644067","namespace":"org.talend","fields":[{"name":"created_at","type":["string","null"]},{"name":"id","type":["long","null"]},{"name":"id_str","type":["string","null"]},{"name":"text","type":["string","null"]},{"name":"source","type":["string","null"]},{"name":"truncated","type":["boolean","null"]},{"name":"in_reply_to_status_id","type":["string","null"]},{"name":"in_reply_to_status_id_str","type":["string","null"]},{"name":"in_reply_to_user_id","type":["string","null"]},{"name":"in_reply_to_user_id_str","type":["string","null"]},{"name":"in_reply_to_screen_name","type":["string","null"]},{"name":"user","type":{"type":"record","name":"subrecord_1333510210252506782","namespace":"","fields":[{"name":"id","type":["int","null"]},{"name":"id_str","type":["string","null"]},{"name":"name","type":["string","null"]},{"name":"screen_name","type":["string","null"]},{"name":"location","type":["string","null"]},{"name":"url","type":["string","null"]},{"name":"description","type":["string","null"]},{"name":"translator_type","type":["string","null"]},{"name":"protected","type":["boolean","null"]},{"name":"verified","type":["boolean","null"]},{"name":"followers_count","type":["int","null"]},{"name":"friends_count","type":["int","null"]},{"name":"listed_count","type":["int","null"]},{"name":"favourites_count","type":["int","null"]},{"name":"statuses_count","type":["int","null"]},{"name":"created_at","type":["string","null"]},{"name":"utc_offset","type":["string","null"]},{"name":"time_zone","type":["string","null"]},{"name":"geo_enabled","type":["boolean","null"]},{"name":"lang","type":["string","null"]},{"name":"contributors_enabled","type":["boolean","null"]},{"name":"is_translator","type":["boolean","null"]},{"name":"profile_background_color","type":["string","null"]},{"name":"profile_background_image_url","type":["string","null"]},{"name":"profile_background_image_url_https","type":["string","null"]},{"name":"profile_background_tile","type":["boolean","null"]},{"name":"profile_link_color","type":["string","null"]},{"name":"profile_sidebar_border_color","type":["string","null"]},{"name":"profile_sidebar_fill_color","type":["string","null"]},{"name":"profile_text_color","type":["string","null"]},{"name":"profile_use_background_image","type":["boolean","null"]},{"name":"profile_image_url","type":["string","null"]},{"name":"profile_image_url_https","type":["string","null"]},{"name":"profile_banner_url","type":["string","null"]},{"name":"default_profile","type":["boolean","null"]},{"name":"default_profile_image","type":["boolean","null"]},{"name":"following","type":["string","null"]},{"name":"follow_request_sent","type":["string","null"]},{"name":"notifications","type":["string","null"]}]}},{"name":"geo","type":["string","null"]},{"name":"coordinates","type":["string","null"]},{"name":"place","type":{"type":"record","name":"subrecord_2202783987613205576","namespace":"","fields":[{"name":"id","type":["string","null"]},{"name":"url","type":["string","null"]},{"name":"place_type","type":["string","null"]},{"name":"name","type":["string","null"]},{"name":"full_name","type":["string","null"]},{"name":"country_code","type":["string","null"]},{"name":"country","type":["string","null"]},{"name":"bounding_box","type":{"type":"record","name":"subrecord6297473197442324111","namespace":"org.talend","fields":[{"name":"type","type":["string","null"]},{"name":"coordinates","type":{"type":"array","items":{"type":"record","name":"subrecord_3004235398744729621","fields":[]}}}]}},{"name":"attributes","type":"org.talend.subrecord_3004235398744729621"}]}},{"name":"contributors","type":["string","null"]},{"name":"is_quote_status","type":["boolean","null"]},{"name":"extended_tweet","type":{"type":"record","name":"subrecord_6104836229700222151","namespace":"","fields":[{"name":"full_text","type":["string","null"]},{"name":"display_text_range","type":{"type":"array","items":["int","null"]}},{"name":"entities","type":{"type":"record","name":"subrecord5678571617538413247","namespace":"org.talend","fields":[{"name":"hashtags","type":{"type":"array","items":{"type":"record","name":"subrecord5761850113934372676","fields":[{"name":"text","type":["string","null"]},{"name":"indices","type":{"type":"array","items":["int","null"]}}]}}},{"name":"urls","type":{"type":"array","items":{"type":"record","name":"subrecord_2167212455227691253","fields":[{"name":"url","type":["string","null"]},{"name":"expanded_url","type":["string","null"]},{"name":"display_url","type":["string","null"]},{"name":"indices","type":{"type":"array","items":["int","null"]}}]}}},{"name":"user_mentions","type":{"type":"array","items":{"type":"record","name":"subrecord_953558199486337988","fields":[{"name":"screen_name","type":["string","null"]},{"name":"name","type":["string","null"]},{"name":"id","type":["long","null"]},{"name":"id_str","type":["string","null"]},{"name":"indices","type":{"type":"array","items":["int","null"]}}]}}},{"name":"symbols","type":{"type":"array","items":["string","null"]}}]}}]}},{"name":"quote_count","type":["int","null"]},{"name":"reply_count","type":["int","null"]},{"name":"retweet_count","type":["int","null"]},{"name":"favorite_count","type":["int","null"]},{"name":"entities","type":{"type":"record","name":"subrecord4863175834995993793","namespace":"","fields":[{"name":"hashtags","type":{"type":"array","items":["string","null"]}},{"name":"urls","type":{"type":"array","items":"org.talend.subrecord_2167212455227691253"}},{"name":"user_mentions","type":{"type":"array","items":["string","null"]}},{"name":"symbols","type":{"type":"array","items":["string","null"]}}]}},{"name":"favorited","type":["boolean","null"]},{"name":"retweeted","type":["boolean","null"]},{"name":"possibly_sensitive","type":["boolean","null"]},{"name":"filter_level","type":["string","null"]},{"name":"lang","type":["string","null"]},{"name":"timestamp_ms","type":["string","null"]}]}avro.codecsnappyuÿ˙”ABÂ8úh∏
m`Úî
Ù<Wed Apr 24 11:09:06 +0000 2019Ä¿∂Ω»æœé&1121008154978996224ÜHow can Think tanks inform government over transitions/elections? Here are examples from many countries:‚Ķ https://t.co/5Ch96iQEoX§<a href="http://twitter.com/download/iphone" rel="nofollow">T	,D for iPhone</a>P‘ºé–218222378On
ÌTÌont¸t46http://www..8.orgˆIndepend!dresearch, ideas and adviceåa better-OXed world | Global platf!i∫Ñ-Äpbased at @udelpacificononeê⁄rÆ0‡¬†ä<Sun Nov 21 19:22:52 U 0en;$022330b
È`abs.twimg.com/images/them(15/bg.pngd3!¡™4q084B4F(C0DFEC3îY://p2åprofile_
î†665054717790162944/8nRE5cuT_normal.pngñM¶˛NNtvMbanners/U_(/1553553770!nAÉP 01e215db7136a37epWapi!~A(.!Ä(1.1/geo/id/>5T.jsoncityNairobi
	0, KenyaKE
	8Polygon÷˛Í¢Í h.ÁhXr11hoVUlW @ACBF_Official #ŰATTS #E˝eÅma!±$ †≤i)2,$¥÷.Zz÷>”/articlEÏink-Y-and-ï‚»-strategies-to-raise-the-quality-of-democracies/DoùRÅ8:e ‚͓ĵ2\ «ÜÖ3092304357Çû!…:ΩQhÔt©E[<i/web/status/112B
:f.≥‘ÖlowÖW41556104146128&spuÿ˙”ABÂ8úh∏
m`

If you use the schema you see here when you repeat these steps using your JSON, you will be able to use the AVRO schema to convert your JSON into AVRO using your Python code. 

Anonymous
Not applicable
Author

It took a while to figure this out, but now I have got it working. Thanks. It would be nice for users to be able to just submit their JSON and get the AVRO schema returned without having to go via S3.

Anonymous
Not applicable
Author

<p>I have started writing a few blogs regarding this sort of requirement. You can take a look here: https://www.talend.com/blog/author/rhall/