agentgill
Contributor

[resolved] Salesforce Bulk Queries for Large data sets

I want to query all the contacts in my org using the Salesforce Bulk API. The largest batch size is 10K; I would have expected Talend to send multiple batches. How can I achieve this?
1 Solution

Accepted Solutions
Anonymous
Not applicable

I still have this problem where Talend DI will only send one batch request.
I'm trying to track down where the batching logic should live, but I can only find it for the upload path, in SalesforceBulkAPI.java:
    private List<BatchInfo> createBatchesFromCSVFile() throws IOException, AsyncApiException {
        List<BatchInfo> batchInfos = new ArrayList<BatchInfo>();
        BufferedReader rdr = new BufferedReader(
                new InputStreamReader(new FileInputStream(bulkFileName), FILE_ENCODING));
        // read the CSV header row
        byte[] headerBytes = (rdr.readLine() + "\n").getBytes("UTF-8");
        int headerBytesLength = headerBytes.length;
        File tmpFile = File.createTempFile("sforceBulkAPI", ".csv");
        // Split the CSV file into multiple batches
        try {
            FileOutputStream tmpOut = new FileOutputStream(tmpFile);
            int currentBytes = 0;
            int currentLines = 0;
            String nextLine;
            boolean needStart = true;
            boolean needEnds = true;
            while ((nextLine = rdr.readLine()) != null) {
                int num = countQuotes(nextLine);
                if (num % 2 == 1) {
                    // nextLine is the start or end of a record spanning several lines
                    if (!needStart) {
                        needEnds = false;
                    } else {
                        needStart = false;
                    }
                } else {
                    // nextLine is a whole record or the middle of a record
                    if (needEnds && needStart) {
                        needEnds = false;
                        needStart = false;
                    }
                }
                byte[] bytes = (nextLine + "\n").getBytes("UTF-8");
                // Create a new batch when our batch size limit is reached
                if (currentBytes + bytes.length > maxBytesPerBatch || currentLines > maxRowsPerBatch) {
                    createBatch(tmpOut, tmpFile, batchInfos);
                    currentBytes = 0;
                    currentLines = 0;
                }
                if (currentBytes == 0) {
                    tmpOut = new FileOutputStream(tmpFile);
                    tmpOut.write(headerBytes);
                    currentBytes = headerBytesLength;
                    currentLines = 1;
                }
                tmpOut.write(bytes);
                currentBytes += bytes.length;
                if (!needStart && !needEnds) {
                    currentLines++;
                    needStart = true;
                    needEnds = true;
                }
            }
            // Finished processing all rows:
            // create a final batch for any remaining data
            if (currentLines > 1) {
                createBatch(tmpOut, tmpFile, batchInfos);
            }
        } finally {
            tmpFile.delete();
        }
        return batchInfos;
    }
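The snippet above calls a countQuotes helper that is not shown. From its use, it appears to count the double-quote characters in a line, so the splitter can tell whether a quoted CSV field continues across a line break (an odd count means the line opens or closes a multi-line field). A minimal sketch of what such a helper might look like; the body here is my assumption, not the actual Talend source:

```java
public class QuoteCounter {

    // Assumed reconstruction of countQuotes: an odd result means the line
    // starts or ends a quoted field that spans multiple physical lines.
    static int countQuotes(String line) {
        int count = 0;
        for (int i = 0; i < line.length(); i++) {
            if (line.charAt(i) == '"') {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countQuotes("a,b,c"));              // 0: complete record
        System.out.println(countQuotes("\"first line of a"));  // 1: record continues
        System.out.println(countQuotes("\"quoted\",\"row\"")); // 4: complete record
    }
}
```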


12 Replies
agentgill
Contributor
Author

I want the same ability you get with the BulkExec import: to query all 7M contacts in one go and have Salesforce output the file back to Talend.
Anonymous
Not applicable

Hi,
I couldn't find the 'query' action for the salesforceBulkExec component, so I ended up with the same question you raised.
Did you find any workaround for this?
Thanks
agentgill
Contributor
Author

You have to use the tSalesforceConnection + tSalesforceInput components. Configure them to work together, making sure you select "For salesforce bulk component" inside the tSalesforceConnection component, then output to CSV as the last step.
The issue I have is that this configuration only allows you to export up to the Salesforce maximum batch size of 10,000 records.
I was hoping the component would automatically create (n) batches if required, but it looks like I have to build that logic myself. Waiting on Talend.
Anonymous
Not applicable

Hi, thanks a lot for the reply. I just found that my Talend version 5.1 was outdated, as bulk query is only available in 5.2.x.
Referring to this guideline: http://www.salesforce.com/us/developer/docs/api_asynch/index.htm
- A bulk query is limited to a maximum of 10 result files, each up to 1 GB in size
- If the query needs to return more than 10 files, the query should be filtered to return less data. Bulk batch sizes are not used for bulk queries.
So I am a bit confused by the 10,000 limitation. I would think this is the batch size limit for uploading data to Salesforce, but not the other way around, as in a bulk query.
agentgill
Contributor
Author

Thanks for replying. I understand the limitations of the Salesforce API, but I am still struggling to configure Talend to query large data volumes (millions of records) using Bulk. Non-bulk is not an option.
If I fire the same query using Jitterbit/Workbench, those apps simply create additional batches automatically and return the records in the query. In my case 7M+ records are returned in a single file within 10-15 minutes.
Similarly, when importing with the Bulk API, Talend just keeps sending batches of 10K records until it hits the limit of 2,000 batches in 24 hours.
Any ideas?
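As a sanity check on the numbers in this thread (10K rows per batch, 2,000 batches per 24 hours), the batch arithmetic for a 7M-record load works out as follows; note that on these figures alone, 7M rows fit within one day's allowance, so hitting the 2,000-batch limit would imply more than 20M rows per day or smaller effective batches:

```java
public class BatchMath {
    public static void main(String[] args) {
        long records = 7_000_000L;   // total contacts to move (from this thread)
        int rowsPerBatch = 10_000;   // per-batch row limit discussed above
        int dailyBatchLimit = 2_000; // batches allowed per 24 hours

        // ceiling division: number of 10K-row batches needed for 7M rows
        long batchesNeeded = (records + rowsPerBatch - 1) / rowsPerBatch;
        System.out.println(batchesNeeded);                    // 700
        System.out.println(batchesNeeded <= dailyBatchLimit); // true
    }
}
```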
Anonymous
Not applicable

Hi,
Thanks a lot for pointing out that the limitation was something 'implemented' by Talend rather than Salesforce; I am clear on what you mean now.
I think the existing Talend, with its 10K batches, may never meet your requirement to get 7M+ records.
Is it possible to filter your data by certain criteria, e.g. lastModifiedDate?
I am working on a flow something like this:
get lastSyncDate from local DB -> add this as a filter criterion in the query condition -> populate data -> update lastSyncDate in local DB.
Not too sure if this flow is applicable to your scenario...
Again, thanks a lot for your info.
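A minimal sketch of the filtered-query step in that flow. The class and method names here are illustrative, not Talend or Salesforce APIs; only the SOQL syntax (unquoted ISO-8601 datetime literals) is real:

```java
public class IncrementalQuery {

    // Build a SOQL query that fetches only records modified since the last sync.
    // SOQL datetime literals are written unquoted, e.g. 2013-05-01T00:00:00Z.
    static String buildQuery(String lastSyncDate) {
        return "SELECT Id, LastName, Email FROM Contact"
                + " WHERE LastModifiedDate > " + lastSyncDate
                + " ORDER BY LastModifiedDate";
    }

    public static void main(String[] args) {
        // In the flow above, lastSyncDate would be read from the local DB
        System.out.println(buildQuery("2013-05-01T00:00:00Z"));
    }
}
```

After the load succeeds, the flow would write the new high-water mark back to the local DB so the next run picks up where this one left off.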
agentgill
Contributor
Author

That's one solution. Thanks for sharing.
I still feel this should be natively handled by the Talend component. I will raise an enhancement request.
I am planning to build a process that first bulk-exports using Workbench and then uses Talend to process the file. It's not a critical process, only something we run each time we refresh a sandbox, so it doesn't really matter if it has a few more steps.
agentgill
Contributor
Author

I'm now working with another large salesforce.com client and the latest version of Talend, 5.3.0r101800.
Same as before, we are attempting to perform large bulk queries and hitting the 10K record limit. I would expect Talend to be able to spool multiple batches out to CSV. The full data set is being queried (see attachment), but Talend cannot process the output.
Any advice / pointers?
Anonymous
Not applicable

Hi agentgill,
I have seen the same issue reported on Jira:
https://jira.talendforge.org/browse/TDI-7779
It was fixed in v4.0.4. I don't know whether tSalesforceBulkExec has been updated in the latest version, but I would suggest you download v4.1.0, which is closest to v4.0.4, and test whether it works there. You can download v4.1.0 from this link:
http://sourceforge.net/projects/talend-studio/files/Talend%20Open%20Studio/4.1.0/
Shong