data structure for particle pair #275

Open
miniufo opened this issue Sep 20, 2023 · 7 comments
Labels
question Further information is requested

Comments

@miniufo

miniufo commented Sep 20, 2023

Hi, I've recently been doing relative dispersion analysis, in which particles are grouped into pairs like:

pair = [p1, p2] # p1 is a particle represented as xr.Dataset, containing x, y, u, v

Then I write some diagnostics calculation such as relative dispersion:

import numpy as np

def relative_dispersion(pair, power=2):
    p1, p2 = pair
    dx = p1.x - p2.x
    dy = p1.y - p2.y
    return np.hypot(dx, dy) ** power

If there are 100 particles, the combinations give ~5,000 (100*99/2) possible pairs. So a simple loop like:

rd = [relative_dispersion(pair) for pair in pairs]

would be quite slow due to unacceptable overhead. Although I can do parallel mapping in this case, I'm just wondering if there is a better solution for this. Do you have any suggestions on dealing with pair-particle calculations?
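For context, a minimal sketch of how I currently build the pairs (assuming particles is a plain Python list of per-particle xr.Datasets; the names are just illustrative):

from itertools import combinations

# all unique particle pairs: n * (n - 1) / 2 of them
pairs = list(combinations(particles, 2))

rd = [relative_dispersion(pair) for pair in pairs]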

@selipot
Member

selipot commented Sep 21, 2023

This project has not looked at two-particle statistics yet! We've been focusing on single-particle statistics/analysis for now, especially through our apply_ragged function in the clouddrift.analysis module. In any case, it is a great topic. I am wondering if @dhruvbalwada has some ideas/pointers from his SF3To_KEflux repo?

@miniufo
Author

miniufo commented Sep 21, 2023

Thanks @selipot. I just found an old issue here. At that time, I was trying to do something like this project, but I didn't know the best data structure even for single-particle statistics.

I find this project solved my previous issue by introducing the ragged array, which is effective for dealing with particles of different record lengths (a regular array is more effective when all records have the same length). Following your basic design, I just define a pair as a list of two xr.Datasets. Then I can define a series of statistical functions as:

def relative_dispersion(pair):
    ...  # returns rd

def velocity_structure(pair):
    ...  # returns S2

def FSLE(pair):
    ...  # returns fsle

Then a unified apply_to_pairs function, like apply_to_ragged, can be defined:

import xarray as xr

def apply_to_pairs(func, pairs, **kwargs):
    res = [func(pair, **kwargs) for pair in pairs]
    return xr.concat(res, dim='pair')

which is convenient for parallelism and also gives rise to a new pair dimension for later averaging.
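For example, a rough sketch of the parallel mapping I have in mind (just swapping the list comprehension for a process pool; nothing here is an existing clouddrift function):

from concurrent.futures import ProcessPoolExecutor
from functools import partial
import xarray as xr

def apply_to_pairs_parallel(func, pairs, n_workers=4, **kwargs):
    # map the per-pair function over all pairs in parallel worker processes
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        res = list(pool.map(partial(func, **kwargs), pairs))
    return xr.concat(res, dim='pair')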

The thing is that when there are many pairs, e.g. ~1000 particles for which FSLE uses all possible pairs (~500,000 pairs), this simple design of a pair as [xr.Dataset, xr.Dataset] induces unacceptably large overhead and takes about half a day to finish the FSLE calculation. But if an explicit loop is used in numba fashion, this can be done in minutes. Currently I do not use any parallelism. So I guess we need a better data structure for this, since two-particle statistics should be an important part of this exciting project.
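For comparison, this is roughly what I mean by the explicit numba-style loop (a sketch assuming the positions have already been extracted into contiguous NumPy arrays x, y of shape (n_particles, n_time)):

import numpy as np
from numba import njit

@njit
def relative_dispersion_all_pairs(x, y, power=2.0):
    # x, y: (n_particles, n_time) position arrays
    n, nt = x.shape
    out = np.empty((n * (n - 1) // 2, nt))
    k = 0
    for i in range(n):
        for j in range(i + 1, n):
            for t in range(nt):
                dx = x[i, t] - x[j, t]
                dy = y[i, t] - y[j, t]
                out[k, t] = np.sqrt(dx * dx + dy * dy) ** power
            k += 1
    return out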

I just want to know if you have any better solution here. I would like to try some benchmarks with the data I have on hand. And if you feel the design is compatible with the project, I'd definitely love to make a little contribution here.

@philippemiron
Contributor

philippemiron commented Sep 21, 2023

Maybe you could store everything in a single large xr.Dataset. That way, you can be sure that each variable is contiguous in memory, which should speed up the calculation considerably.

import xarray as xr
import numpy as np

nb_particles = 2
nb_pairs = 10**6 # for example
x = np.random.rand(nb_particles, nb_pairs)
y = np.random.rand(nb_particles, nb_pairs)

# Create the xarray Dataset
ds = xr.Dataset({
    'x': (('particle', 'pair'), x),
    'y': (('particle', 'pair'), y),
})

Then you could use clouddrift.analysis.distance to get the distance between all the particles (that function returns the great-circle distance in meters, but we (or you) could modify it if you are on a regular 2D plane).

%%time
from clouddrift.analysis import distance
d = distance(ds.y[0], ds.x[0], ds.y[1], ds.x[1])
> CPU times: user 36 ms, sys: 15.2 ms, total: 51.2 ms
> Wall time: 49.9 ms
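If you are on a regular 2D plane instead, a minimal planar alternative (plain NumPy on the dataset above, not a clouddrift function) would be:

import numpy as np

# planar separation in the same units as x and y
d_plane = np.hypot(ds.x[0] - ds.x[1], ds.y[0] - ds.y[1])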

@miniufo
Author

miniufo commented Sep 21, 2023

Thanks @philippemiron. We may still require a time axis, which is generally ragged across different pairs. Note that a pair is grouped according to some criterion (e.g., separation smaller than $r_0$). So within a pair, particle p1 is usually copied or subset to fit the other particle, while in a second pair, p1 can be subset differently.
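To make that concrete, here is a rough sketch of how p1 gets subset differently in each pair (assuming each particle Dataset carries a time coordinate; the helper name is made up):

def subset_to_overlap(p1, p2):
    # restrict both particles of a pair to their common time window
    t0 = max(p1.time.values[0], p2.time.values[0])
    t1 = min(p1.time.values[-1], p2.time.values[-1])
    return p1.sel(time=slice(t0, t1)), p2.sel(time=slice(t0, t1))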

I opened an issue here, but their suggestion of a list of indices, similar to the above solution, does not make it much faster (and it also makes the dataset much larger).

miniufo mentioned this issue Oct 12, 2023
@dhruvbalwada

I don't know if I have the most optimal solution to the problem (sorry, I didn't have a chance to try all your suggestions above).

But here are some things that I have done in the past:

  • When I used MATLAB I would store the pairs as structs, shown here.
  • Now with Python I usually create xarray datasets. For example, I might store something like the following (where dul is the velocity difference between two samples, and I store all associated coordinates):
[screenshot: example xarray Dataset with a dul variable and its associated coordinates]
  • I also usually use pdist, where the norm can be defined as distance (or zonal distance, meridional distance, velocity difference, etc.). I also use the @jit decorator for the functions that calculate this norm, which helps some (see the sketch below).
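A rough sketch of that pdist approach, with a jitted custom norm passed as the metric (the names and the 2D positions array are just for illustration):

import numpy as np
from numba import njit
from scipy.spatial.distance import pdist

@njit
def separation(a, b):
    # custom "norm" between two samples; here the plain 2D distance, but it
    # could be a zonal/meridional distance or a velocity difference instead
    return np.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)

positions = np.random.rand(100, 2)          # (n_particles, 2) x/y at one time
r = pdist(positions, metric=separation)     # all n*(n-1)/2 pairwise separations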

@miniufo
Author

miniufo commented Oct 12, 2023

Hi @dhruvbalwada, that's interesting. How do you calculate relative-dispersion (RD) statistics when you use this data structure?

I am really new to this, but I find that the data structure for particle pairs really affects what the APIs look like. I had a very preliminary draft of the APIs like:

class RelativeDispersion:
    def __init__(self, xpos, ypos, uvel, vvel, time, coord='cartesian', Rearth=6371.2):
        
    # get pairs
    def group_pairs(self, particles) -> pairs:
    def find_pairs(self, pairs, r0=[lower, upper], chancePair=True) -> originalPairs, chancePairs:
    
    # time- and separation-based statistics
    def PDF(self, r, bins) -> PDFs[bins, time]:
    def CDF(self, PDFs, bins) -> CDFs[bins, time]:
    
    # time-based statistics
    def stat_r(self, pairs, power=2.0, mean=True) -> rx2[pairs, time], ry2[pairs, time], r2[pairs, time]:
    def stat_v(self, pairs, power=2.0, mean=True) -> u2[pairs, time], v2[pairs, time], mag2[pairs, time]:
        
    # separation-based statistics
    def FSLE(self, pairs, bins, interpT=1, mean=True) -> FSLE[pairs, bins]:
    def CIST(self, CDF, lower, upper) -> CIST[pairs, bins]:

    # helper functions
    def get_separation_bins(self, r_lower, r_upper, alpha=np.sqrt(2.0)) -> bins[bins]:

I just feel that this fits well with how scientists think (it may be improved further), at least for a newbie like me:

  1. pair is the core data structure, a list of two particles: [xr.Dataset, xr.Dataset] (different pairs have different time lengths). So group_pairs gets all possible pairs, and find_pairs (basically a filter) gives you original pairs and/or chance pairs within r0 (a minimal sketch follows at the end of this comment);
  2. Most of the statistics accept pairs (a list of pair) as input and return xr.DataArrays as outputs, whose dimensions are [pair, time] if they are time-based and [pair, bins] if they are separation-based.

The upside is that this is easy to parallelize (apply a core function to a list of pair). The downside is that the overhead is large. I'm just sharing some practice here to see if we can find better solutions.
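As a concrete illustration of item 1, a minimal sketch of what group_pairs and find_pairs could look like (original pairs only, chance pairs omitted; everything here is a placeholder, not an existing clouddrift API):

from itertools import combinations
import numpy as np

def group_pairs(particles):
    # all unique pairs from a list of per-particle xr.Datasets
    return list(combinations(particles, 2))

def find_pairs(pairs, r0):
    # keep pairs whose initial separation falls within [r0[0], r0[1]]
    lower, upper = r0
    kept = []
    for p1, p2 in pairs:
        r_init = np.hypot(float(p1.x[0] - p2.x[0]), float(p1.y[0] - p2.y[0]))
        if lower <= r_init <= upper:
            kept.append([p1, p2])
    return kept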

@miniufo
Author

miniufo commented Oct 31, 2023

I just found this nice plot of the ragged array:
[figure: schematic of the ragged array layout]

I am wondering if we could similarly get a ragged pair array like:
[figure: sketch of a proposed ragged pair array]

Then, possibly, many existing functions could be designed in a similar fashion. The downside is that a particle will be subset many times to fit other particles, so the total size of the dataset will increase many times over.
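To make the idea concrete, a hypothetical layout for such a ragged pair array, following the rowsize convention of the existing ragged array (all variable names here are placeholders):

import numpy as np
import xarray as xr

# two pairs with different record lengths (3 and 2), concatenated along "obs"
rowsize = np.array([3, 2])                # number of records per pair
x1 = np.array([0.0, 0.1, 0.2, 5.0, 5.1])  # particle 1 of each pair
x2 = np.array([1.0, 1.1, 1.3, 6.0, 6.3])  # particle 2 of each pair

ds_pairs = xr.Dataset({
    'rowsize': (('pairs',), rowsize),
    'x1': (('obs',), x1),
    'x2': (('obs',), x2),
})

# per-record separation, computed contiguously over all pairs at once
dx = np.abs(ds_pairs.x2 - ds_pairs.x1)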

kevinsantana11 added the question (Further information is requested) label Apr 11, 2024