
🐛 Source HubSpot: fix infinite loop when iterating through search results #44899

Open · wants to merge 11 commits into base: master
Conversation


@ehearty ehearty commented Aug 29, 2024


What

Fixes airbytehq/airbyte#43317

How

I updated our custom HS connector to filter on both the primary key AND the timestamp, then sort by the object's primary key. The timestamps now return out of order, but once we finish iterating through all the results we will have all the data (similar to full-refresh processing).

Review guide

  1. airbyte-integrations/connectors/source-hubspot/source_hubspot/streams.py
    • updates to the "process_search" function:
      • added a "last_id" parameter
      • updated filter logic to include:
        • self.primary_key >= last_id
        • self.last_modified_field >= self._state.timestamp()
        • self.last_modified_field <= self._init_sync.timestamp()
          note: I added this logic because I don't like "open-ended queries", where the result set might change depending on whether new records were modified after the start of the pull.
      • updated sort logic to use the primary key instead of the last_modified_field to ensure that we're always grabbing a unique result set
    • added a "get_max" function that first attempts a numeric comparison, falling back to a string comparison if int conversion isn't possible. This mostly accounts for unit tests where the "id" field is a string instead of an integer, and for any future object searches where we might want a non-numeric primary key.
    • updated read_records to:
      • grab the maximum primary key returned in the result set, and pass that as the last_id value to the process_search function every time we go over 10k results
      • no longer update the _state with the last_modified_timestamp now that we're sorting by primary key instead
      • once the search completes, update the state with the init_sync_timestamp to ensure that the next run starts from exactly where we left off
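The read_records changes above can be sketched roughly as follows. This is a minimal, hypothetical illustration, not the connector's actual code: `fetch_page` and `iterate_search` are stand-ins I invented for the connector's search request and outer loop, and the `get_max` signature here is an assumption based on the description above.

```python
# Sketch (not connector code): restart the search from the largest primary
# key seen so far, instead of the largest timestamp.

def get_max(a, b):
    """Prefer a numeric comparison; fall back to strings if ids aren't ints."""
    try:
        return str(max(int(a), int(b)))
    except (TypeError, ValueError):
        return max(str(a), str(b))

def iterate_search(fetch_page, page_limit, primary_key="id"):
    """Yield every record, restarting at the max id whenever a page fills up."""
    last_id, seen = "0", set()
    while True:
        # fetch_page applies: primary_key >= last_id, sorted by primary_key ASC
        page = fetch_page(last_id, page_limit)
        for record in page:
            # GTE re-fetches the boundary row, so dedupe on the primary key
            if record[primary_key] not in seen:
                seen.add(record[primary_key])
                yield record
        if len(page) < page_limit:
            return  # short page: no more results
        for record in page:
            last_id = get_max(last_id, record[primary_key])
```

Because the restart cursor is the primary key (unique and strictly increasing under ascending sort) rather than the timestamp, the loop always advances even when every record shares one modification timestamp.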

User Impact

  • What is the end result perceived by the user? None.
  • If there are negative side effects, please list them.
    • Now that we're no longer updating the state with the last modified date after each iteration, we'll lose any partial progress if an error occurs during the sync. Given how rarely this scenario occurs, and how quickly and easily we can re-pull this data, impact should be minimal.

Can this PR be safely reverted and rolled back?

  • YES 💚
  • NO ❌


@CLAassistant

CLAassistant commented Aug 29, 2024

CLA assistant check
All committers have signed the CLA.

@octavia-squidington-iii octavia-squidington-iii added the area/connectors Connector related issues label Aug 29, 2024
@ehearty ehearty force-pushed the ehearty/source-hubspot-search-stream-fix branch from 777f1d8 to 19a3a26 on August 29, 2024 20:25
@octavia-squidington-iii octavia-squidington-iii added the area/documentation Improvements or additions to documentation label Aug 29, 2024
…n iterating through search results to avoid infinite loop

fixes airbytehq/airbyte#43317
@aldogonzalez8
Contributor

aldogonzalez8 commented Sep 9, 2024

I will leave this here as a comment in case somebody in the future wants to understand the change and has questions:

Before:

  1. Request data from HS on the search endpoint:
    • filter on the latest state (timestamp)
    • ordered by timestamp
  2. We use the token returned in the response to paginate.
  3. If the token is None, pagination is over.
  4. If pagination exceeds 10,000 records, we start a new request, because HS doesn't allow paginating further.
  5. The new request uses the latest/greatest timestamp.
  6. If all data has identical timestamps because some batch update occurred, the new request reuses the same initial timestamp, and we get the infinite loop.
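Step 6 is the whole bug. A toy illustration (hypothetical, not connector code) of the old failure mode: when every record in a full page shares one timestamp, restarting from the page's maximum timestamp repeats the exact same request forever.

```python
# Toy reproduction of the pre-fix infinite loop.
PAGE_LIMIT = 3  # stand-in for HubSpot's 10,000-result search cap

# Imagine a batch update stamped every record with the same timestamp.
records = [{"id": i, "lastmodifieddate": 1700000000} for i in range(10)]

def fetch_page(cursor_ts):
    """Old filter: lastmodifieddate >= cursor, capped at PAGE_LIMIT results."""
    eligible = [r for r in records if r["lastmodifieddate"] >= cursor_ts]
    return eligible[:PAGE_LIMIT]

cursor = 1700000000
page = fetch_page(cursor)
# Old restart logic: resume from the greatest timestamp seen in the page.
new_cursor = max(r["lastmodifieddate"] for r in page)
# new_cursor == cursor, so the next request is byte-for-byte identical.
```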

After:

  1. Request data from HS on the search endpoint:
    • filter on the latest state (timestamp)
    • filter on the latest object id/primary_key, initially None
    • filter on timestamp <= current_sync so we don't have "open-ended queries"
  2. We use the token returned in the response to paginate.
  3. If the token is None, pagination is over.
  4. If pagination exceeds 10,000 records, we start a new request, because HS doesn't allow paginating further.
  5. New requests filter like this:
    • still filter on the latest state before the sync started
    • use the latest object ID/primary_key obtained from the previous pagination pass (technically it's the same loop, so pagination isn't over, but we restart because of the 10,000-record limit)
    • again, filter on timestamp <= current_sync to avoid "open-ended queries"
  6. If all data has identical timestamps because some batch update occurred, the pagination process is unaffected, since we now filter and sort by object ID rather than timestamp.
  7. We are happy because it is no longer an infinite loop.
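The request body described in the "After" steps might be built like this. A sketch only: the helper name is mine, the default field names are assumptions, and I'm assuming timezone-aware UTC datetimes for the state and sync-start cursors; the filter/sort shape mirrors the PR's diff.

```python
from datetime import datetime, timezone

def build_search_payload(state, init_sync, last_id,
                         last_modified_field="lastmodifieddate",
                         primary_key="id"):
    """Sketch of the three-filter search body: bounded timestamp window
    plus a primary-key lower bound, sorted by primary key ascending."""
    def to_ms(dt):
        # HubSpot's search API expects epoch-millisecond timestamps.
        return int(dt.timestamp() * 1000)

    return {
        "filters": [
            {"value": to_ms(state), "propertyName": last_modified_field, "operator": "GTE"},
            {"value": to_ms(init_sync), "propertyName": last_modified_field, "operator": "LTE"},
            {"value": last_id, "propertyName": primary_key, "operator": "GTE"},
        ],
        "sorts": [{"propertyName": primary_key, "direction": "ASCENDING"}],
    }
```

The LTE filter on `init_sync` closes the window, so records modified after the sync started can't shift the result set mid-pull.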

Comment on lines +1137 to +1142
"filters": [
{"value": int(self._state.timestamp() * 1000), "propertyName": self.last_modified_field, "operator": "GTE"},
{"value": int(self._init_sync.timestamp() * 1000), "propertyName": self.last_modified_field, "operator": "LTE"},
{"value": last_id, "propertyName": self.primary_key, "operator": "GTE"},
],
"sorts": [{"propertyName": self.primary_key, "direction": "ASCENDING"}],
@ehearty, would you mind adding some documentation here about why this is needed? It doesn't need to be as long as my comment in the PR, but a TL;DR on why we have a second filter would complement the comment in the read_records section that talks about the 10,000-record limitation.

My only concern is that we don't have a specific test for this scenario. Would you mind looking into adding one? Please let me know what you think.


*documentation = comments

@@ -331,7 +331,8 @@ The connector is restricted by normal HubSpot [rate limitations](https://legacyd
<summary>Expand to review</summary>

| Version | Date | Pull Request | Subject |
|:--------|:-----------|:---------------------------------------------------------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 4.2.19 | 2024-08-29 | [42688](https://github.com/airbytehq/airbyte/pull/44899) | Fix incremental search to use primary key as placeholder instead of lastModifiedDate |

We need to update the date here.

@aldogonzalez8
Contributor

aldogonzalez8 commented Sep 9, 2024

@ehearty LGTM. I left some comments, but I'm happy the unit test is good, and regression testing is returning good results.

@ehearty
Author

ehearty commented Sep 10, 2024

@ehearty LGTM. I left some comments, but I'm happy the unit test is good, and regression testing is returning good results.

Thanks @aldogonzalez8 - time permitting, I'll try to have the changes you requested in by the end of the week.

@aldogonzalez8
Contributor

@ehearty If you make changes before the weekend, can you put the release date on Monday? I prefer not to ship on Thursday afternoon-Friday. If all tests pass and everything looks good, I will approve and later merge that day. Again, just if you do it before Friday. Thanks.

@ehearty
Author

ehearty commented Sep 16, 2024

@ehearty If you make changes before the weekend, can you put the release date on Monday? I prefer not to ship on Thursday afternoon-Friday. If all tests pass and everything looks good, I will approve and later merge that day. Again, just if you do it before Friday. Thanks.

Sorry for the delay. I've specifically blocked off time this week to make those final updates and will have them submitted in the next couple of days.

@ehearty
Author

ehearty commented Sep 19, 2024

@aldogonzalez8 Just letting you know that I've jumped back into the PR today and am working on a unit test for the new sort/filter parameters. Hoping to have the changes in by tomorrow, but will set the release date to Monday as requested. Thanks again for your patience.

@aldogonzalez8
Contributor

@aldogonzalez8 Just letting you know that I've jumped back into the PR today and am working on a unit test for the new sort/filter parameters. Hoping to have the changes in by tomorrow, but will set the release date to Monday as requested. Thanks again for your patience.

@ehearty That makes sense to me, thanks!

Labels
area/connectors Connector related issues area/documentation Improvements or additions to documentation community connectors/source/hubspot
4 participants