Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add address_getAppearances #453

Open
wants to merge 12 commits into
base: main
Choose a base branch
from

Conversation

perama-v
Copy link
Contributor

@perama-v perama-v commented Jul 29, 2023

Description

Introduction of a method that returns all historical transactions associated with an address on Ethereum.

Motivation

Upon installation of an archive node, a user is unable to get a basic history of their own wallet history. This history might
include a number of kinds of interactions including receiving ether by CALL-style opcodes (so called "internal transactions"). These are not accessible to an archive node user without re-executing the entire chain or using external sources of data.

This PR introduces a method that fixes that.

Nodes that opt in to implement address_getAppearances enable dapp developers to create local user interfaces that contain historical information as well as current state.

Background

The UnchainedIndex consists of "address appearances". It is a distributed index constructed by calling an Ethereum node using the trueblocks-core software (https://github.com/TrueBlocks/trueblocks-core). It finds addresses as they appear in each block and publishes the index.
Users may then find and obtain a part of that index and then query their node to get more information.

This system has proven that there is real demand for a user to be able start with an address, and discover specific transactions. By introducing this method into an archive node, users would gain this ability out of the box.

Changes made

  • New address_ namespace
  • New address_getAppearances method for fetching every "address appearance" of a single address.
  • A README explaining the rationale for and explanation of the address_ namespace
  • Test cases

Test case generator

A test case generator has been written (https://github.com/perama-v/unchained-reader). It uses data from the trueblocks-core (chifra list <address>) to create test cases for individual addresses (across all blocks).

Dependencies

The concept of "address appearances" is introduced and specified in the following PR:

Future considerations

The data supporting the address_ namespace allows other methods to be proposed. For example

  • address_getInteractions(wallet_address, dapp_address) which could return the intersection of transactions that contain a user wallet and a dapp contract. Such an endpoint would not create much additional work on top of the base address_getAppearances.

@perama-v perama-v marked this pull request as draft July 29, 2023 10:23
@tjayrush
Copy link

tjayrush commented Aug 3, 2023

Why are you forcing the user to use hex numbers? It makes the endpoint harder to use for no good reason. I think we should explicitly say that both integers and hex are accepted. I understand that some of the eth_ routines for some of the clients only accept hex, but this is not consistent. We should make it consistent, and make it required to accept both.

@perama-v
Copy link
Contributor Author

perama-v commented Aug 3, 2023

We should make it consistent, and make it required to accept both.

Good suggestion , I will look to include it in the PR.

Edit: The acceptance of a block number as a JSON number type introduces behaviour that I consider to be beyond the scope of this PR. See the following separate PR, which can introduce this behaviour via a change from uint to uintParam. This method can then be modified to include that broader class of inputs:

@perama-v
Copy link
Contributor Author

As raised by @sslivkoff, the address_ namespace may not be semantically appropriate for derivative calls:

down the road I would love to see additional advanced querying rpc methods (like getBlockAccessList / getTransactionAccessList). not all of these additional methods will be address-related, but they have a similar enough flavor that they might go in the same namespace

so maybe something like query_blockAddresses / query_AddressAppearances would be more future proof?

In this PR I have a section for the rationale for the namespace being a new one. Which is that it makes explicit that a client must create a new index/indices to serve methods in that namespace.

  • address_
  • query_
  • ...

One thought is that whatever namespace is chosen, it should not be used for methods that do not require the index that the namespace implies.

So for a trite example query_getRecentBlockHashes would be discouraged because a client that wishes to serve this method (and not any methods that require the above index), would not have a simple way for the user to toggle this method on.

@perama-v
Copy link
Contributor Author

Here is a classification scheme that might be useful: derivative vs non-derivative

Non-derivative

Note that the following methods can be delivered by a node that doesn't implement an additional address (address -> transactions) database:

  • getTransactionAccessList: fetch transaction, get and return accessList field
  • getBlockAccessList: fetch block, for each transaction use getTransactionAccessList

Another example is

  • New eth_getAddressesInBlock method #452
  • Fetch block, parse for address appearances, return
  • Perhaps this should be query_getAddressesInBlock if it is felt that the eth_ namespace really implies that nodes must provide all eth_ endpoints in order to be part of the Ethereum protocol.

To include any of the above methods under the same namespace would prevent their use in a node configured to not create the additional address database.

Potential appropriate namespaces:

  • query_
  • eth_

Derivative

Some examples of possible derivative methods that would require the address database:

  • address_getInteractions(wallet_address, dapp_address)
    • Look up each address to get appearances, return intersection
  • address_getTransactionsSent(address) or address_getTransactionsReceived(address)
    • Look up address to get appearances, filter transactions for those where the address is sender/recipient
  • address_getCreator(address)
    • contract address -> deployer address
    • Look up address to get appearances, filter for transactions where contract creation occurs, return deployer address

For methods that are derivatives, like the last three above, the address_ namespace is likely to be appropriate because the query is linked to lookups in the address -> transactions database.

Potential appropriate namespaces:

  • address_ (this PR)

@sslivkoff
Copy link

not sure if this is what youre suggesting but I believe a 1:1 mapping of namespaces to indexes would be too restrictive. for example there will be different indexes to support address_getCreator vs address_getAppearances vs address_getTransactionsSent. a user could potentially want just a subset of these indices (especially if more rpc methods are added down the road), so they would want to enable just a part of the address_ namespace (which seems ok). needing to enable an entire namespace will be too coarse

the reason why I suggested a more general namespace like query_ is that down the road there might be additional RPC methods for many types of objects including addresses, blocks, txs, traces, slots, etc. I don't think it would be great to create new namespaces for each of these. from a node operator perspective I think the ideal UX for controlling these optional indices is letting node operators directly select the set of optional indices / RPC methods they want to enable. and if node operators can select the particular set of indices that they want, then there will be additional operational overhead if they have to also figure out the subset of these 5+ new namespaces they need. creating 5+ new optional namespaces also creates more implementation overhead for node client teams. I think the lowest friction way to add new optional indices over time is to stick them all in the same namespace

@perama-v
Copy link
Contributor Author

As these methods apply to different clients, the new namespace seemed like the clearest system to introduce the standard. If introducing new namespaces is really more of a burden for clients, that is good to know.

I think the ideal UX for controlling these optional indices is letting node operators directly select the set of optional indices / RPC methods they want to enable.

Would be interested to hear more about this. What mechanism do you think is best for node operators to enact this through? Perhaps leave it to client implementations to sort out (e.g. CLI or json_rpc_methods.json config file listing desired methods).

Perhaps the following could be part of the spec in this PR?:

## New `query` namespace
- Namespace: `query`
- Ethereum protocol requirements: Not required for participation in the Ethereum protocol
- Definition: A collection of methods for retrieving chain data for a variety of applications.
- Method requirements:
    - Specification: method name, method paramaters, response format
    - Dependencies: A list of methods that the method depends upon, if any. 
    - Implicit data requirements: A description of any indexes or data structures the method requires. 
          Do not include indexes implied by the `eth` namespace or already covered by a dependency. 

Users can read client docs and see that the available methods, their requirements and dependencies. There could be a few groups that tend to be turned on or off together.

there will be different indexes to support address_getCreator vs address_getAppearances vs address_getTransactionsSent.

Once support for query_getAppearances (previously address_getAppearances) is enabled, the other methods I saw as feasible candidates for on-the-fly computation. That is, not requiring different indices. So in this way they are methods you get for free (no additional database). Provided this is true, a node configured to enable query_getAppearances should probably automatically enable query_getCreator and query_getTransactionsSent (and any other "free" method in that group).

If the database claim above is true, the specs for the following methods would be as follows:

- `query_getCreator`
    - Dependencies: `query_getAppearances`
- `query_getTransactionsSent`
    - Dependences: `query_getAppearances`

A user that starts some client with query_getTransactionsSent toggled on, but query_getAppearances toggled off would get a warning on startup.

@sslivkoff
Copy link

yes as a node operator I would want to either use a json config file or a cli arg to specify a list of optional index names, e.g. --extra-indices address_appearances,block_timestamps,X,Y,Z

maybe I missed it, why is it that getAppearances would enable those other methods? you could build an index that supports getAppearances(address) --> list[transaction] without tracking things like contract creator and outbound txs sent. regardless though, down the road there might be new address-related RPC methods that are supported by a different set of indices than the appearance index. I would like there to be the possibility of adding these in the future without breaking any conventions that we come up here

@tjayrush
Copy link

tjayrush commented Aug 13, 2023

There's something I think that is SUPER important that you'all should not ignore. That's the issue of what I call distribution. I feel like the best index would be (a) complete (i.e., include every appearance of any type), (b) compact (be as absolutely as small as possible -- this means only <blknum.txid>, (c) be a simple as is conceivable, and (d) easily sharable in a permissionless way.

(a-complete) is important because the solution that's needed is reconcilable state re-creation. All other use cases are subsets of this (other than indexes related to other data types -- I'm not commenting on that here). (For example, we have a --factory option to chifra export that exports just contract creations. Yes, it spins through all appearances for a given address, but for 99% of cases (regular humans) it's perfectly fine, and for the other cases (Uniswap) the end user would be scanning once and caching the result.

(b-compact) the use case I think that's most important is the small entity/person on a smaller machine. Size of hard drive storage on disc is very important. The smaller the better because that increases the number of people who can afford to have permissionless access.

(c-simple) there's a very low probability that this will ever make it into the node software. To the extent it gets more and more complicated, that probability decreases. It will be hard enough to convince people to do the weird things you have to do to make it complete. Overcomplicating it will make it that much harder. Plus--two things: (1) sub-options can be accomplished (like --factory) as options on top of a complete index, and (2) the existence of sub-options may make it more likely that people stop before finishing completeness. I would start with completeness because that's the pressing need and add complication later.

(4-sharable) this, to me, is the most important thing. If you build a complicated, heavy, hard to implement, hard to maintain index, only a few super-technical people will "have it." We want every human to have it (in the best world) or every human to have undeniable, permissionless access to it (in an okay-but-not-great world). This is what Unchained Index is all about. With the complete index, if it's chunked, one can place a Bloom filter in front of each chunk and share only the Bloom filters with the entire world -- through, for example, a smart contract. Then, each user can download and pin only those chunks of the index they need. Small guy 1 gets small portion 1. Small guy 2 gets small portion 2. Giant Uniswap user gets the entire index. All of them pin the Bloom filters and the index portions they hold. In this way, over time the entire world has permissionless, uncensorable (and thereby zero-cost) access to the entire index. This should be built in from the start.

@perama-v
Copy link
Contributor Author

yes as a node operator I would want to either use a json config file or a cli arg to specify a list of optional index names, e.g. --extra-indices address_appearances,block_timestamps,X,Y,Z

👍

maybe I missed it, why is it that getAppearances would enable those other methods? you could build an index that supports getAppearances(address) --> list[transaction] without tracking things like contract creator and outbound txs sent.

Suppose an address has appeared in 100 transactions.

  1. The user calls getTransactionsSent(wallet_address) to their client
  2. The client internally calls the equivalent of getAppearances, which produces a list of 100 transaction ids very quickly.
  3. For each, it internally calls the equivalent of eth_getTransactionByBlockNumberAndIndex and looks at the "from" field
  4. Any that match wallet_address are returned to the user

One cannot have getTransactionsSent today because you would have to check every transaction ever. Post-getAppearances its possible, and requires very little work and no DB, hence it is enabled/"free" once its dependency (getAppearances) is implemented.

I would like there to be the possibility of adding these in the future without breaking any conventions that we come up here

❤️

@perama-v
Copy link
Contributor Author

perama-v commented Aug 14, 2023

You make a good case for query, I think I'll update everything to reflect that soonish.

The real win here is two different clients being able to come to this repo and see a spec clearly layed out and implement it using this standard.

Which is then very handy for users and devs who may use and swap between multiple clients. A unified query namespace allows for simpler integration. No need to call X for client A and Y for client B

@sslivkoff
Copy link

Suppose an address has appeared in 100 transactions. ...

ok I see now. so an appearances index supports multiple methods by performing some extra steps and leveraging other indices that already exist on the node

You make a good case for query, I think I'll update everything to reflect that soonish.

another possible name besides query_ is index_, derived from the fact that the methods are related to optional indices. not a perfect name tho because rpc methods of other namespaces also utilize some form of index. I would be happy with either name, or some other name that is similarly general/inclusive

The real win here is two different clients being able to come to this repo and see a spec clearly layed out and implement it using this standard.

💪

has anyone temperature checked this with the erigon devs?

@perama-v
Copy link
Contributor Author

Good idea. I have posted a link to this PR in the Erigon discord

@tjayrush
Copy link

I sort of tempature-checked these ideas with Eriogn about three years ago. (If fact, I used to work with that team.)

Back then, they were talking about adding the ability to do optional "modules," and I was exploring that idea, but they sort of abandoned that.

If you mention the work we've done on TrueBlocks, the entire Erigon team will be familiar with the idea of an "appearance." I've spoken to all of them about it numerous times. @ledgerwatch definitely knows what I mean by an appearance.

@0xSileo
Copy link

0xSileo commented Apr 29, 2024

I found this PR while looking for previous efforts to add a method fetching all transaction hashes of an address. It's surprising that we have a method returning the nonce but no method giving the nth tx hash.

What would be needed to take this a step further ? Would love to see it included

@tjayrush
Copy link

I'm not being flippant, but I really think the first thing that's needed is a champion. The feature is very clearly needed, and (at least to me), it's very clearly needed inside the node software. But the lift is pretty heavy. A champion would be the first thing required, I think.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants