kinmo
Contributor

Duplicate Data: CloudPostgreSQL --> Datastream --> BigQuery

I have a basic pipeline set up where I use Python to scrape data from the web, push it to a SQL server, and use Google Datastream to replicate it in BigQuery so I can consume it efficiently in other apps.

My issue is that I am accumulating duplicates in my BigQuery tables. I actually know what's causing this, but I don't have a good solution. When I update my SQL tables, I truncate them and then append a new set of data to the table. I have to do this because Datastream can't interface with SQL views.

BigQuery isn't mirroring the SQL tables. Datastream is taking my appended data and simply adding it to my BigQuery tables, instead of mirroring my SQL tables 1:1.

How can I get BigQuery to reflect these tables exactly?
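To make the symptom concrete, here is a minimal toy model in plain Python, not Datastream itself. It assumes, as the post describes, that only the appended rows reach the replica and the truncate does not. The function names and the sample rows are hypothetical.

```python
def reload_source(rows):
    """Truncate-and-append on the source: the table ends up holding only `rows`."""
    return list(rows)


def replicate_append_only(replica, inserted_rows):
    """Assumed behaviour: inserts reach the replica, the truncate does not."""
    return replica + list(inserted_rows)


# Hypothetical scraped batch; the same rows are reloaded on every cycle.
batch = [{"id": 1, "price": 10}, {"id": 2, "price": 20}]

source, replica = [], []
for _ in range(3):  # three scrape-and-reload cycles
    source = reload_source(batch)
    replica = replicate_append_only(replica, batch)

print(len(source))   # 2: the source always holds one copy of the batch
print(len(replica))  # 6: the replica holds one copy per cycle
```

Under that assumption the replica grows by one full copy of the data on every reload, which matches the duplicates described above.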

1 Reply
SushilKumar
Support

Hello @kinmo 

Does BigQuery have any concept of constraints, as Oracle or other SQL databases do? If it does, that may help you. Otherwise, you will need to find an alternative way to deal with the duplicates; the GCP documentation may help.

For comparison, Databricks has a table merge operation that avoids duplicates.
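The merge idea can be sketched in plain Python. This is only an illustration of merge-by-key semantics, not Databricks or BigQuery code, and it assumes each row carries a unique key column (here a hypothetical `id`); without such a key there is nothing to match on.

```python
def merge_by_key(target, incoming, key="id"):
    """Upsert `incoming` into `target`: matching keys are overwritten, new keys are added."""
    merged = {row[key]: row for row in target}
    for row in incoming:
        merged[row[key]] = row  # update on match, insert otherwise
    return list(merged.values())


# Hypothetical rows: id 2 changes price, id 3 is new.
target = [{"id": 1, "price": 10}, {"id": 2, "price": 20}]
incoming = [{"id": 2, "price": 25}, {"id": 3, "price": 30}]

result = merge_by_key(target, incoming)
print(result)
# [{'id': 1, 'price': 10}, {'id': 2, 'price': 25}, {'id': 3, 'price': 30}]
```

Note that a merge like this keeps rows that no longer exist in the source, so an exact 1:1 mirror would also need deletes to be handled.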

Regards,

Sushil Kumar