Unlock a world of possibilities! Login now and discover the exclusive benefits awaiting you.
I'm a data scientist and have been tasked with trying out Pipeline Designer by my company. I have some JSON data that I was hoping to just process. When using a Kinesis source I need to read it in an AVRO format. I have done a bit of reading about this and have found a bit of Python that will convert JSON to an AVRO format, but first I need to come up with a suitable AVRO schema for the conversion. Is there an easy way to do this? My JSON is quite big and I have no idea about Apache AVRO. Maybe there is an online tool that can be used? It would be really useful if someone can help as I think Pipeline Designer might be quite useful for us, but I can't test it without first converting my JSON.
Hi @JackTaylor,
There are several AVRO tools online, but I haven't found one which solves the problem that you have. In fact, I had the same issue when trying to process Twitter data. One of the guys from R&D let me in on a trick which is really useful.
1) Create a Test Connection and a Dataset from that, you can hard code your JSON into the values section after setting the type to JSON. Below is a screenshot of my Twitter data being hardcoded into mine.
2) I assume you have access to create a S3 Bucket in Amazon (given you are using Kinesis)? If so, create a connection to S3 and create a S3 data Dataset.
3) Now all you need to do is create a Pipeline between your Test Connection source Dataset (containing your hard coded JSON) and your S3 bucket target Dataset. Below is a screenshot of the one I created. You can see the data preview showing the twitter data.
4) When you run it, there will be a file which is created in your S3 Bucket. This will contain your JSON data in an AVRO format with the AVRO schema that was used at the very beginning of the file. Below you can see the contents of the file that was created. I have coloured the AVRO schema in red...
Objavro.schemaä]{"type":"record","name":"outer_record_3231340614480644067","namespace":"org.talend","fields":[{"name":"created_at","type":["string","null"]},{"name":"id","type":["long","null"]},{"name":"id_str","type":["string","null"]},{"name":"text","type":["string","null"]},{"name":"source","type":["string","null"]},{"name":"truncated","type":["boolean","null"]},{"name":"in_reply_to_status_id","type":["string","null"]},{"name":"in_reply_to_status_id_str","type":["string","null"]},{"name":"in_reply_to_user_id","type":["string","null"]},{"name":"in_reply_to_user_id_str","type":["string","null"]},{"name":"in_reply_to_screen_name","type":["string","null"]},{"name":"user","type":{"type":"record","name":"subrecord_1333510210252506782","namespace":"","fields":[{"name":"id","type":["int","null"]},{"name":"id_str","type":["string","null"]},{"name":"name","type":["string","null"]},{"name":"screen_name","type":["string","null"]},{"name":"location","type":["string","null"]},{"name":"url","type":["string","null"]},{"name":"description","type":["string","null"]},{"name":"translator_type","type":["string","null"]},{"name":"protected","type":["boolean","null"]},{"name":"verified","type":["boolean","null"]},{"name":"followers_count","type":["int","null"]},{"name":"friends_count","type":["int","null"]},{"name":"listed_count","type":["int","null"]},{"name":"favourites_count","type":["int","null"]},{"name":"statuses_count","type":["int","null"]},{"name":"created_at","type":["string","null"]},{"name":"utc_offset","type":["string","null"]},{"name":"time_zone","type":["string","null"]},{"name":"geo_enabled","type":["boolean","null"]},{"name":"lang","type":["string","null"]},{"name":"contributors_enabled","type":["boolean","null"]},{"name":"is_translator","type":["boolean","null"]},{"name":"profile_background_color","type":["string","null"]},{"name":"profile_background_image_url","type":["string","null"]},{"name":"profile_background_image_url_https","type":["string","null"]},{"name":"profile_background_tile","type":["boolean","null"]},{"name":"profile_link_color","type":["string","null"]},{"name":"profile_sidebar_border_color","type":["string","null"]},{"name":"profile_sidebar_fill_color","type":["string","null"]},{"name":"profile_text_color","type":["string","null"]},{"name":"profile_use_background_image","type":["boolean","null"]},{"name":"profile_image_url","type":["string","null"]},{"name":"profile_image_url_https","type":["string","null"]},{"name":"profile_banner_url","type":["string","null"]},{"name":"default_profile","type":["boolean","null"]},{"name":"default_profile_image","type":["boolean","null"]},{"name":"following","type":["string","null"]},{"name":"follow_request_sent","type":["string","null"]},{"name":"notifications","type":["string","null"]}]}},{"name":"geo","type":["string","null"]},{"name":"coordinates","type":["string","null"]},{"name":"place","type":{"type":"record","name":"subrecord_2202783987613205576","namespace":"","fields":[{"name":"id","type":["string","null"]},{"name":"url","type":["string","null"]},{"name":"place_type","type":["string","null"]},{"name":"name","type":["string","null"]},{"name":"full_name","type":["string","null"]},{"name":"country_code","type":["string","null"]},{"name":"country","type":["string","null"]},{"name":"bounding_box","type":{"type":"record","name":"subrecord6297473197442324111","namespace":"org.talend","fields":[{"name":"type","type":["string","null"]},{"name":"coordinates","type":{"type":"array","items":{"type":"record","name":"subrecord_3004235398744729621","fields":[]}}}]}},{"name":"attributes","type":"org.talend.subrecord_3004235398744729621"}]}},{"name":"contributors","type":["string","null"]},{"name":"is_quote_status","type":["boolean","null"]},{"name":"extended_tweet","type":{"type":"record","name":"subrecord_6104836229700222151","namespace":"","fields":[{"name":"full_text","type":["string","null"]},{"name":"display_text_range","type":{"type":"array","items":["int","null"]}},{"name":"entities","type":{"type":"record","name":"subrecord5678571617538413247","namespace":"org.talend","fields":[{"name":"hashtags","type":{"type":"array","items":{"type":"record","name":"subrecord5761850113934372676","fields":[{"name":"text","type":["string","null"]},{"name":"indices","type":{"type":"array","items":["int","null"]}}]}}},{"name":"urls","type":{"type":"array","items":{"type":"record","name":"subrecord_2167212455227691253","fields":[{"name":"url","type":["string","null"]},{"name":"expanded_url","type":["string","null"]},{"name":"display_url","type":["string","null"]},{"name":"indices","type":{"type":"array","items":["int","null"]}}]}}},{"name":"user_mentions","type":{"type":"array","items":{"type":"record","name":"subrecord_953558199486337988","fields":[{"name":"screen_name","type":["string","null"]},{"name":"name","type":["string","null"]},{"name":"id","type":["long","null"]},{"name":"id_str","type":["string","null"]},{"name":"indices","type":{"type":"array","items":["int","null"]}}]}}},{"name":"symbols","type":{"type":"array","items":["string","null"]}}]}}]}},{"name":"quote_count","type":["int","null"]},{"name":"reply_count","type":["int","null"]},{"name":"retweet_count","type":["int","null"]},{"name":"favorite_count","type":["int","null"]},{"name":"entities","type":{"type":"record","name":"subrecord4863175834995993793","namespace":"","fields":[{"name":"hashtags","type":{"type":"array","items":["string","null"]}},{"name":"urls","type":{"type":"array","items":"org.talend.subrecord_2167212455227691253"}},{"name":"user_mentions","type":{"type":"array","items":["string","null"]}},{"name":"symbols","type":{"type":"array","items":["string","null"]}}]}},{"name":"favorited","type":["boolean","null"]},{"name":"retweeted","type":["boolean","null"]},{"name":"possibly_sensitive","type":["boolean","null"]},{"name":"filter_level","type":["string","null"]},{"name":"lang","type":["string","null"]},{"name":"timestamp_ms","type":["string","null"]}]}avro.codecsnappyuÿ˙”ABÂ8úh∏
m`Úî
Ù<Wed Apr 24 11:09:06 +0000 2019Ä¿∂Ω»æœé&1121008154978996224ÜHow can Think tanks inform government over transitions/elections? Here are examples from many countries:‚Ķ https://t.co/5Ch96iQEoX§<a href="http://twitter.com/download/iphone" rel="nofollow">T ,D for iPhone</a>P‘ºé–218222378On
ÌTÌont¸t46http://www..8.orgˆIndepend!dresearch, ideas and adviceåa better-OXed world | Global platf!i∫Ñ-Äpbased at @udelpacificononeê⁄rÆ0‡¬†ä<Sun Nov 21 19:22:52 U 0en;$022330b
È`abs.twimg.com/images/them(15/bg.pngd3!¡™4q084B4F(C0DFEC3îY://p2åprofile_
î†665054717790162944/8nRE5cuT_normal.pngñM¶˛NNtvMbanners/U_(/1553553770!nAÉP 01e215db7136a37epWapi!~A(.!Ä(1.1/geo/id/>5T.jsoncityNairobi
0, KenyaKE
8Polygon÷˛Í¢Í h.ÁhXr11hoVUlW @ACBF_Official #ŰATTS #E˝eÅma!±$ †≤i)2,$¥÷.Zz÷>”/articlEÏink-Y-and-ï‚»-strategies-to-raise-the-quality-of-democracies/DoùRÅ8:e ‚Ķ“ĵ2\ «ÜÖ3092304357Çû!…:ΩQhÔt©E[<i/web/status/112B
:f.≥‘ÖlowÖW41556104146128&spuÿ˙”ABÂ8úh∏
m`
If you use the schema you see here when you repeat these steps using your JSON, you will be able to use the AVRO schema to convert your JSON into AVRO using your Python code.
Hi @JackTaylor,
There are several AVRO tools online, but I haven't found one which solves the problem that you have. In fact, I had the same issue when trying to process Twitter data. One of the guys from R&D let me in on a trick which is really useful.
1) Create a Test Connection and a Dataset from that, you can hard code your JSON into the values section after setting the type to JSON. Below is a screenshot of my Twitter data being hardcoded into mine.
2) I assume you have access to create a S3 Bucket in Amazon (given you are using Kinesis)? If so, create a connection to S3 and create a S3 data Dataset.
3) Now all you need to do is create a Pipeline between your Test Connection source Dataset (containing your hard coded JSON) and your S3 bucket target Dataset. Below is a screenshot of the one I created. You can see the data preview showing the twitter data.
4) When you run it, there will be a file which is created in your S3 Bucket. This will contain your JSON data in an AVRO format with the AVRO schema that was used at the very beginning of the file. Below you can see the contents of the file that was created. I have coloured the AVRO schema in red...
Objavro.schemaä]{"type":"record","name":"outer_record_3231340614480644067","namespace":"org.talend","fields":[{"name":"created_at","type":["string","null"]},{"name":"id","type":["long","null"]},{"name":"id_str","type":["string","null"]},{"name":"text","type":["string","null"]},{"name":"source","type":["string","null"]},{"name":"truncated","type":["boolean","null"]},{"name":"in_reply_to_status_id","type":["string","null"]},{"name":"in_reply_to_status_id_str","type":["string","null"]},{"name":"in_reply_to_user_id","type":["string","null"]},{"name":"in_reply_to_user_id_str","type":["string","null"]},{"name":"in_reply_to_screen_name","type":["string","null"]},{"name":"user","type":{"type":"record","name":"subrecord_1333510210252506782","namespace":"","fields":[{"name":"id","type":["int","null"]},{"name":"id_str","type":["string","null"]},{"name":"name","type":["string","null"]},{"name":"screen_name","type":["string","null"]},{"name":"location","type":["string","null"]},{"name":"url","type":["string","null"]},{"name":"description","type":["string","null"]},{"name":"translator_type","type":["string","null"]},{"name":"protected","type":["boolean","null"]},{"name":"verified","type":["boolean","null"]},{"name":"followers_count","type":["int","null"]},{"name":"friends_count","type":["int","null"]},{"name":"listed_count","type":["int","null"]},{"name":"favourites_count","type":["int","null"]},{"name":"statuses_count","type":["int","null"]},{"name":"created_at","type":["string","null"]},{"name":"utc_offset","type":["string","null"]},{"name":"time_zone","type":["string","null"]},{"name":"geo_enabled","type":["boolean","null"]},{"name":"lang","type":["string","null"]},{"name":"contributors_enabled","type":["boolean","null"]},{"name":"is_translator","type":["boolean","null"]},{"name":"profile_background_color","type":["string","null"]},{"name":"profile_background_image_url","type":["string","null"]},{"name":"profile_background_image_url_https","type":["string","null"]},{"name":"profile_background_tile","type":["boolean","null"]},{"name":"profile_link_color","type":["string","null"]},{"name":"profile_sidebar_border_color","type":["string","null"]},{"name":"profile_sidebar_fill_color","type":["string","null"]},{"name":"profile_text_color","type":["string","null"]},{"name":"profile_use_background_image","type":["boolean","null"]},{"name":"profile_image_url","type":["string","null"]},{"name":"profile_image_url_https","type":["string","null"]},{"name":"profile_banner_url","type":["string","null"]},{"name":"default_profile","type":["boolean","null"]},{"name":"default_profile_image","type":["boolean","null"]},{"name":"following","type":["string","null"]},{"name":"follow_request_sent","type":["string","null"]},{"name":"notifications","type":["string","null"]}]}},{"name":"geo","type":["string","null"]},{"name":"coordinates","type":["string","null"]},{"name":"place","type":{"type":"record","name":"subrecord_2202783987613205576","namespace":"","fields":[{"name":"id","type":["string","null"]},{"name":"url","type":["string","null"]},{"name":"place_type","type":["string","null"]},{"name":"name","type":["string","null"]},{"name":"full_name","type":["string","null"]},{"name":"country_code","type":["string","null"]},{"name":"country","type":["string","null"]},{"name":"bounding_box","type":{"type":"record","name":"subrecord6297473197442324111","namespace":"org.talend","fields":[{"name":"type","type":["string","null"]},{"name":"coordinates","type":{"type":"array","items":{"type":"record","name":"subrecord_3004235398744729621","fields":[]}}}]}},{"name":"attributes","type":"org.talend.subrecord_3004235398744729621"}]}},{"name":"contributors","type":["string","null"]},{"name":"is_quote_status","type":["boolean","null"]},{"name":"extended_tweet","type":{"type":"record","name":"subrecord_6104836229700222151","namespace":"","fields":[{"name":"full_text","type":["string","null"]},{"name":"display_text_range","type":{"type":"array","items":["int","null"]}},{"name":"entities","type":{"type":"record","name":"subrecord5678571617538413247","namespace":"org.talend","fields":[{"name":"hashtags","type":{"type":"array","items":{"type":"record","name":"subrecord5761850113934372676","fields":[{"name":"text","type":["string","null"]},{"name":"indices","type":{"type":"array","items":["int","null"]}}]}}},{"name":"urls","type":{"type":"array","items":{"type":"record","name":"subrecord_2167212455227691253","fields":[{"name":"url","type":["string","null"]},{"name":"expanded_url","type":["string","null"]},{"name":"display_url","type":["string","null"]},{"name":"indices","type":{"type":"array","items":["int","null"]}}]}}},{"name":"user_mentions","type":{"type":"array","items":{"type":"record","name":"subrecord_953558199486337988","fields":[{"name":"screen_name","type":["string","null"]},{"name":"name","type":["string","null"]},{"name":"id","type":["long","null"]},{"name":"id_str","type":["string","null"]},{"name":"indices","type":{"type":"array","items":["int","null"]}}]}}},{"name":"symbols","type":{"type":"array","items":["string","null"]}}]}}]}},{"name":"quote_count","type":["int","null"]},{"name":"reply_count","type":["int","null"]},{"name":"retweet_count","type":["int","null"]},{"name":"favorite_count","type":["int","null"]},{"name":"entities","type":{"type":"record","name":"subrecord4863175834995993793","namespace":"","fields":[{"name":"hashtags","type":{"type":"array","items":["string","null"]}},{"name":"urls","type":{"type":"array","items":"org.talend.subrecord_2167212455227691253"}},{"name":"user_mentions","type":{"type":"array","items":["string","null"]}},{"name":"symbols","type":{"type":"array","items":["string","null"]}}]}},{"name":"favorited","type":["boolean","null"]},{"name":"retweeted","type":["boolean","null"]},{"name":"possibly_sensitive","type":["boolean","null"]},{"name":"filter_level","type":["string","null"]},{"name":"lang","type":["string","null"]},{"name":"timestamp_ms","type":["string","null"]}]}avro.codecsnappyuÿ˙”ABÂ8úh∏
m`Úî
Ù<Wed Apr 24 11:09:06 +0000 2019Ä¿∂Ω»æœé&1121008154978996224ÜHow can Think tanks inform government over transitions/elections? Here are examples from many countries:‚Ķ https://t.co/5Ch96iQEoX§<a href="http://twitter.com/download/iphone" rel="nofollow">T ,D for iPhone</a>P‘ºé–218222378On
ÌTÌont¸t46http://www..8.orgˆIndepend!dresearch, ideas and adviceåa better-OXed world | Global platf!i∫Ñ-Äpbased at @udelpacificononeê⁄rÆ0‡¬†ä<Sun Nov 21 19:22:52 U 0en;$022330b
È`abs.twimg.com/images/them(15/bg.pngd3!¡™4q084B4F(C0DFEC3îY://p2åprofile_
î†665054717790162944/8nRE5cuT_normal.pngñM¶˛NNtvMbanners/U_(/1553553770!nAÉP 01e215db7136a37epWapi!~A(.!Ä(1.1/geo/id/>5T.jsoncityNairobi
0, KenyaKE
8Polygon÷˛Í¢Í h.ÁhXr11hoVUlW @ACBF_Official #ŰATTS #E˝eÅma!±$ †≤i)2,$¥÷.Zz÷>”/articlEÏink-Y-and-ï‚»-strategies-to-raise-the-quality-of-democracies/DoùRÅ8:e ‚Ķ“ĵ2\ «ÜÖ3092304357Çû!…:ΩQhÔt©E[<i/web/status/112B
:f.≥‘ÖlowÖW41556104146128&spuÿ˙”ABÂ8úh∏
m`
If you use the schema you see here when you repeat these steps using your JSON, you will be able to use the AVRO schema to convert your JSON into AVRO using your Python code.
It took a while to figure this out, but now I have got it working. Thanks. It would be nice for users to be able to just submit their JSON and get the AVRO schema returned without having to go via S3.
<p>I have started writing a few blogs regarding this sort of requirement. You can take a look here: https://www.talend.com/blog/author/rhall/