
🐛 Source HubSpot: fix infinite loop when iterating through search results #44899

Open · wants to merge 11 commits into base: master
Conversation


@ehearty ehearty commented Aug 29, 2024


What

Fixes airbytehq/airbyte#43317

How

I updated our custom HS connector to filter on both the primary key AND the timestamp, then sort by the object's primary key. The timestamps now return out of order, but once we finish iterating through all the results we will have all the data (similar to full-refresh processing).

Review guide

  1. airbyte-integrations/connectors/source-hubspot/source_hubspot/streams.py
    • updates to the "process_search" function:
      • added a "last_id" parameter
      • updated filter logic to include:
        • self.primary_key >= last_id
        • self.last_modified_field >= self._state.timestamp()
        • self.last_modified_field <= self._init_sync.timestamp()
          note: I added this logic because I don't like "open-ended queries", where the result set might change depending on whether new records were modified after the start of the pull.
      • updated sort logic to use the primary key instead of the last_modified_field to ensure that we're always grabbing a unique result set
    • added a "get_max" function that first attempts a numeric comparison, falling back to a string comparison if int conversion isn't possible. This mostly accounts for unit tests where the "id" field is a string instead of an integer, and for any future object searches where we might want a non-numeric primary key.
    • updated read_records to:
      • grab the maximum primary key returned in the result set, and pass that as the last_id value to the process_search function every time we go over 10k results
      • no longer update the _state with the last_modified_timestamp now that we're sorting by primary key instead
      • once the search completes, update the state with the init_sync_timestamp to ensure that the next run starts from exactly where we left off
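The read_records changes above can be sketched roughly as follows. This is a minimal, hypothetical illustration, not the connector's actual code: `fetch_page` and `iterate_search` are stand-ins I invented for the connector's search request and outer loop, and the `get_max` signature here is an assumption based on the description above.

```python
# Sketch (not connector code): restart the search from the largest primary
# key seen so far, instead of the largest timestamp.

def get_max(a, b):
    """Prefer a numeric comparison; fall back to strings if ids aren't ints."""
    try:
        return str(max(int(a), int(b)))
    except (TypeError, ValueError):
        return max(str(a), str(b))

def iterate_search(fetch_page, page_limit, primary_key="id"):
    """Yield every record, restarting at the max id whenever a page fills up."""
    last_id, seen = "0", set()
    while True:
        # fetch_page applies: primary_key >= last_id, sorted by primary_key ASC
        page = fetch_page(last_id, page_limit)
        for record in page:
            # GTE re-fetches the boundary row, so dedupe on the primary key
            if record[primary_key] not in seen:
                seen.add(record[primary_key])
                yield record
        if len(page) < page_limit:
            return  # short page: no more results
        for record in page:
            last_id = get_max(last_id, record[primary_key])
```

Because the restart cursor is the primary key (unique and strictly increasing under ascending sort) rather than the timestamp, the loop always advances even when every record shares one modification timestamp.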

User Impact

  • What is the end result perceived by the user? None.
  • If there are negative side effects, please list them.
    • Now that we're no longer updating the state with the last modified date after each iteration, we'll lose any partial progress if an error occurs during the sync. Given how rarely this scenario occurs, and how quickly and easily we can re-pull this data, impact should be minimal.

Can this PR be safely reverted and rolled back?

  • YES 💚
  • NO ❌


@CLAassistant

CLAassistant commented Aug 29, 2024

CLA assistant check
All committers have signed the CLA.

@octavia-squidington-iii octavia-squidington-iii added the area/connectors Connector related issues label Aug 29, 2024
@ehearty ehearty force-pushed the ehearty/source-hubspot-search-stream-fix branch from 777f1d8 to 19a3a26 on August 29, 2024 20:25
@octavia-squidington-iii octavia-squidington-iii added the area/documentation Improvements or additions to documentation label Aug 29, 2024
…n iterating through search results to avoid infinite loop

fixes airbytehq/airbyte#43317
@aldogonzalez8
Contributor

aldogonzalez8 commented Sep 9, 2024

I will leave this here as a comment in case somebody in the future wants to understand the change and has questions:

Before:

  1. Request data from HS on the search endpoint:
    • filter on the latest state (timestamp)
    • ordered by timestamp
  2. We use the token returned in the response to paginate.
  3. If the token is None, pagination is over.
  4. If pagination exceeds 10,000 records, we start a new request, because HS doesn't allow paginating further.
  5. The new request uses the latest/greatest timestamp.
  6. If all data has identical timestamps because some batch update occurred, the new request reuses the same initial timestamp, and we get the infinite loop.
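Step 6 is the whole bug. A toy illustration (hypothetical, not connector code) of the old failure mode: when every record in a full page shares one timestamp, restarting from the page's maximum timestamp repeats the exact same request forever.

```python
# Toy reproduction of the pre-fix infinite loop.
PAGE_LIMIT = 3  # stand-in for HubSpot's 10,000-result search cap

# Imagine a batch update stamped every record with the same timestamp.
records = [{"id": i, "lastmodifieddate": 1700000000} for i in range(10)]

def fetch_page(cursor_ts):
    """Old filter: lastmodifieddate >= cursor, capped at PAGE_LIMIT results."""
    eligible = [r for r in records if r["lastmodifieddate"] >= cursor_ts]
    return eligible[:PAGE_LIMIT]

cursor = 1700000000
page = fetch_page(cursor)
# Old restart logic: resume from the greatest timestamp seen in the page.
new_cursor = max(r["lastmodifieddate"] for r in page)
# new_cursor == cursor, so the next request is byte-for-byte identical.
```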

After:

  1. Request data from HS on the search endpoint:
    • filter on the latest state (timestamp)
    • filter on the latest object id/primary_key, initially None
    • filter on timestamp <= current_sync so we don't have "open-ended queries"
  2. We use the token returned in the response to paginate.
  3. If the token is None, pagination is over.
  4. If pagination exceeds 10,000 records, we start a new request, because HS doesn't allow paginating further.
  5. New requests filter like this:
    • still filter on the latest state before the sync started
    • use the latest object ID/primary_key obtained from the previous pagination pass (technically it's the same loop, so pagination isn't over, but we restart because of the 10,000-record limit)
    • again, filter on timestamp <= current_sync to avoid "open-ended queries"
  6. If all data has identical timestamps because some batch update occurred, the pagination process is unaffected, since we now filter and sort by object ID rather than timestamp.
  7. We are happy because it is no longer an infinite loop.
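The request body described in the "After" steps might be built like this. A sketch only: the helper name is mine, the default field names are assumptions, and I'm assuming timezone-aware UTC datetimes for the state and sync-start cursors; the filter/sort shape mirrors the PR's diff.

```python
from datetime import datetime, timezone

def build_search_payload(state, init_sync, last_id,
                         last_modified_field="lastmodifieddate",
                         primary_key="id"):
    """Sketch of the three-filter search body: bounded timestamp window
    plus a primary-key lower bound, sorted by primary key ascending."""
    def to_ms(dt):
        # HubSpot's search API expects epoch-millisecond timestamps.
        return int(dt.timestamp() * 1000)

    return {
        "filters": [
            {"value": to_ms(state), "propertyName": last_modified_field, "operator": "GTE"},
            {"value": to_ms(init_sync), "propertyName": last_modified_field, "operator": "LTE"},
            {"value": last_id, "propertyName": primary_key, "operator": "GTE"},
        ],
        "sorts": [{"propertyName": primary_key, "direction": "ASCENDING"}],
    }
```

The LTE filter on `init_sync` closes the window, so records modified after the sync started can't shift the result set mid-pull.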

Comment on lines +1137 to +1142
"filters": [
{"value": int(self._state.timestamp() * 1000), "propertyName": self.last_modified_field, "operator": "GTE"},
{"value": int(self._init_sync.timestamp() * 1000), "propertyName": self.last_modified_field, "operator": "LTE"},
{"value": last_id, "propertyName": self.primary_key, "operator": "GTE"},
],
"sorts": [{"propertyName": self.primary_key, "direction": "ASCENDING"}],
@ehearty, would you mind adding some documentation here about why this is needed? It doesn't need to be as long as my comment in the PR, but a TL;DR on why we have a second filter would complement the comment in the read_records section that talks about the 10,000-record limitation.

My only concern is that we don't have a specific test for this scenario. Would you mind looking into adding one? Please let me know what you think.


*documentation = comments

@@ -331,7 +331,8 @@ The connector is restricted by normal HubSpot [rate limitations](https://legacyd
<summary>Expand to review</summary>

| Version | Date | Pull Request | Subject |
|:--------|:-----------|:---------------------------------------------------------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 4.2.19 | 2024-08-29 | [42688](https://github.com/airbytehq/airbyte/pull/44899) | Fix incremental search to use primary key as placeholder instead of lastModifiedDate |

We need to update the date here.

@aldogonzalez8
Contributor

aldogonzalez8 commented Sep 9, 2024

@ehearty LGTM. I left some comments, but I'm happy the unit test is good, and regression testing is returning good results.

@ehearty
Author

ehearty commented Sep 10, 2024

@ehearty LGTM. I left some comments, but I'm happy the unit test is good, and regression testing is returning good results.

Thanks @aldogonzalez8 - time permitting, I'll try to have the changes you requested in by the end of the week.

@aldogonzalez8
Contributor

@ehearty If you make changes before the weekend, can you put the release date on Monday? I prefer not to ship on Thursday afternoon-Friday. If all tests pass and everything looks good, I will approve and later merge that day. Again, just if you do it before Friday. Thanks.

@ehearty
Author

ehearty commented Sep 16, 2024

@ehearty If you make changes before the weekend, can you put the release date on Monday? I prefer not to ship on Thursday afternoon-Friday. If all tests pass and everything looks good, I will approve and later merge that day. Again, just if you do it before Friday. Thanks.

Sorry for the delay. I've specifically blocked off time this week to make those final updates and will have them submitted in the next couple of days.

@ehearty
Author

ehearty commented Sep 19, 2024

@aldogonzalez8 Just letting you know that I've jumped back into the PR today and am working on a unit test for the new sort/filter parameters. Hoping to have the changes in by tomorrow, but will set the release date to Monday as requested. Thanks again for your patience.

@aldogonzalez8
Contributor

@aldogonzalez8 Just letting you know that I've jumped back into the PR today and am working on a unit test for the new sort/filter parameters. Hoping to have the changes in by tomorrow, but will set the release date to Monday as requested. Thanks again for your patience.

@ehearty That makes sense to me, thanks!

Labels
area/connectors Connector related issues area/documentation Improvements or additions to documentation community connectors/source/hubspot
4 participants