Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wrap gmtselect #1429

Merged
merged 16 commits into from
Oct 29, 2021
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions doc/api/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -84,6 +84,7 @@ Operations on tabular data:
blockmode
nearneighbor
project
select
sph2grd
sphdistance
sphinterpolate
Expand Down
1 change: 1 addition & 0 deletions pygmt/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -51,6 +51,7 @@
makecpt,
nearneighbor,
project,
select,
sph2grd,
sphdistance,
sphinterpolate,
Expand Down
1 change: 1 addition & 0 deletions pygmt/src/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -38,6 +38,7 @@
from pygmt.src.plot3d import plot3d
from pygmt.src.project import project
from pygmt.src.rose import rose
from pygmt.src.select import select
from pygmt.src.solar import solar
from pygmt.src.sph2grd import sph2grd
from pygmt.src.sphdistance import sphdistance
Expand Down
172 changes: 172 additions & 0 deletions pygmt/src/select.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,172 @@
"""
select - Select data table subsets based on multiple spatial criteria.
"""
import pandas as pd
from pygmt.clib import Session
from pygmt.helpers import (
GMTTempFile,
build_arg_string,
fmt_docstring,
kwargs_to_strings,
use_alias,
)


@fmt_docstring
@use_alias(
A="area_thresh",
D="resolution",
G="gridmask",
I="reverse",
J="projection",
N="mask",
R="region",
V="verbose",
Z="z_subregion",
b="binary",
d="nodata",
e="find",
f="coltypes",
g="gap",
h="header",
i="incols",
o="outcols",
r="registration",
s="skiprows",
w="wrap",
)
@kwargs_to_strings(M="sequence", R="sequence", i="sequence_comma", o="sequence_comma")
def select(data=None, outfile=None, **kwargs):
r"""
Select data table subsets based on multiple spatial criteria.

This is a filter that reads (x, y) or (longitude, latitude) positions from
the first 2 columns of *data* and uses a combination of 1-7 criteria to
pass or reject the records. Records can be selected based on whether or not
they are:

1. inside a rectangular region (**region** [and **projection**])
2. within *dist* km of any point in *pointfile*
3. within *dist* km of any line in *linefile*
4. inside one of the polygons in the *polygonfile*
5. inside geographical features (based on coastlines)
6. has z-values within a given range, or
7. inside bins of a grid mask whose nodes are non-zero

The sense of the tests can be reversed for each of these 7 criteria by
using the **reverse** option.

Full option list at :gmt-docs:`gmtselect.html`

{aliases}

Parameters
----------
data : str or {table-like}
Pass in either a file name to an ASCII data table, a 2D
{table-classes}.
Comment on lines +66 to +67
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Pass in either a file name to an ASCII data table, a 2D
{table-classes}.
Pass in either a file name to an ASCII data table or a 2D
{table-classes}.

I"m not familiar with using gmtselect, but can't it accept 3D tables (as indicated by z_subregion?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This docstring gets rendered like so:

image

So 2D numpy.array is correct. I don't think it's common to speak of a 3D table (though I did find https://stackoverflow.com/questions/24290495/constructing-3d-pandas-dataframe), but I know what you mean by having a z-column.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe something like "2D or 3D {table-classes}"? I understand that 3D tables aren't the norm, but it doesn't make sense to have a parameter specifically for a 3D table but then say that gmtselect only accepts a 2D table.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what I get when I search for 3D data table:

image

If we are talking about a 3D numpy array, the shape will be something like (1, 2, 3), but in this case, our input cannot be like that. It can only be a 2D shape like (2, 3). The extra z-column isn't considered an extra dimension, more like an extra attribute.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay! That makes sense to me; I was thinking a 3D table meant one with data for the third dimension.

outfile : str
The file name for the output ASCII file.
{A}
resolution : str
*resolution*\ [**+f**].
Ignored unless **mask** is set. Selects the resolution of the coastline
data set to use ((**f**)ull, (**h**)igh, (**i**)ntermediate, (**l**)ow,
or (**c**)rude). The resolution drops off by ~80% between data sets.
[Default is **l**]. Append (**+f**) to automatically select a lower
resolution should the one requested not be available [Default is abort
if not found]. Note that because the coastlines differ in details it is
not guaranteed that a point will remain inside [or outside] when a
different resolution is selected.
gridmask : str
Pass all locations that are inside the valid data area of the grid
*gridmask*. Nodes that are outside are either NaN or zero.
reverse : str
[**cflrsz**].
Reverses the sense of the test for each of the criteria specified:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Reverses the sense of the test for each of the criteria specified:
Reverses the criteria of the test for each of the criteria specified:

The use of "sense" doesn't make much sense to me. I know it's from the GMT docs, but would there be a better way to phrase this?

Copy link
Member Author

@weiji14 weiji14 Oct 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it doesn't sound nice, but I think the following statements ("select records NOT inside/within ...") should (hopefully) make it clear to users what is meant by reverse. So maybe just keep it as with the GMT docs?

Copy link
Contributor

@willschlitzer willschlitzer Oct 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's easier to keep it in line with the GMT docs, I'm fine with it. But I might open a PR over at GMT; I don't think the current version makes much sense.


- **c** select records NOT inside any point's circle of influence.
- **f** select records NOT inside any of the polygons.
- **g** will pass records inside the cells with z equal zero of the
grid mask in **gridmask**.
- **l** select records NOT within the specified distance of any line.
- **r** select records NOT inside the specified rectangular region.
- **s** select records NOT considered inside as specified by **mask**
(and **area_thresh**, **resolution**).
- **z** select records NOT within the range specified by
**z_subregion**.
{J}
mask : str or list
Pass all records whose location is inside specified geographical
features. Specify if records should be skipped (s) or kept (k) using
1 of 2 formats:

- *wet/dry*.
- *ocean/land/lake/island/pond*.

[Default is s/k/s/k/s (i.e., s/k), which passes all points on dry
land].
{R}
{V}
z_subregion : str
*min*\ [/*max*]\ [**+a**]\ [**+c**\ *col*]\ [**+i**].
Pass all records whose 3rd column (*z*; *col* = 2) lies within the
given range or is NaN (use **skiprows** to skip NaN records). If *max*
is omitted then we test if *z* equals *min* instead. This means
equality within 5 ULPs (unit of least precision;
http://en.wikipedia.org/wiki/Unit_in_the_last_place). Input file must
have at least three columns. To indicate no limit on min or max,
specify a hyphen (-). If your 3rd column is absolute time then remember
to supply ``coltypes="2T"``. To specify another column, append
**+c**\ *col*, and to specify several tests just repeat the
**z_subregion** option as many times as you have columns to test.
**Note**: When more than one **z_subregion** option is given then the
``reverse="z"`` option cannot be used. In the case of multiple tests
you may use these modifiers as well: **+a** passes any record that
passes at least one of your *z* tests [Default is all tests must pass],
and **+i** reverses the tests to pass record with *z* value NOT in the
given range. Finally, if **+c** is not used then it is automatically
incremented for each new **z_subregion** option, starting with 2.
{b}
{d}
{e}
{f}
{g}
{h}
{i}
{o}
{r}
{s}
{w}

Returns
-------
output : pandas.DataFrame or None
Return type depends on whether the ``outfile`` parameter is set:

- :class:`pandas.DataFrame` table if ``outfile`` is not set.
- None if ``outfile`` is set (filtered output will be stored in file
set by ``outfile``).
"""

with GMTTempFile(suffix=".csv") as tmpfile:
with Session() as lib:
# Choose how data will be passed into the module
table_context = lib.virtualfile_from_data(check_kind="vector", data=data)
with table_context as infile:
if outfile is None:
outfile = tmpfile.name
arg_str = " ".join([infile, build_arg_string(kwargs), "->" + outfile])
lib.call_module(module="gmtselect", args=arg_str)

# Read temporary csv output to a pandas table
if outfile == tmpfile.name: # if user did not set outfile, return pd.DataFrame
try:
column_names = data.columns.to_list()
result = pd.read_csv(tmpfile.name, sep="\t", names=column_names)
except AttributeError: # 'str' object has no attribute 'columns'
result = pd.read_csv(tmpfile.name, sep="\t", header=None, comment=">")
elif outfile != tmpfile.name: # return None if outfile set, output in outfile
result = None

return result
64 changes: 64 additions & 0 deletions pygmt/tests/test_select.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,64 @@
"""
Tests for select.
"""
import os

import numpy.testing as npt
import pandas as pd
import pytest
from pygmt import select
from pygmt.datasets import load_sample_bathymetry
from pygmt.helpers import GMTTempFile


@pytest.fixture(scope="module", name="dataframe")
def fixture_dataframe():
"""
Load the table data from the sample bathymetry dataset.
"""
return load_sample_bathymetry()


def test_select_input_dataframe(dataframe):
"""
Run select by passing in a pandas.DataFrame as input.
"""
output = select(data=dataframe, region=[250, 251, 26, 27])
assert isinstance(output, pd.DataFrame)
assert all(dataframe.columns == output.columns)
assert output.shape == (65, 3)
npt.assert_allclose(output.median(), [250.31464, 26.33893, -270.0])


def test_select_input_table_matrix(dataframe):
"""
Run select using table input that is not a pandas.DataFrame but still a
matrix.

Also testing the reverse (I) alias.
"""
data = dataframe.values
output = select(data=data, region=[245.5, 254.5, 20.5, 29.5], reverse="r")
assert isinstance(output, pd.DataFrame)
assert output.shape == (9177, 3)
npt.assert_allclose(output.median(), [247.235, 20.48624, -3241.0])


def test_select_input_filename():
"""
Run select by passing in an ASCII text file as input.

Also testing the z_subregion (Z) alias.
"""
with GMTTempFile() as tmpfile:
output = select(
data="@tut_ship.xyz",
region=[250, 251, 26, 27],
z_subregion=["-/-630", "-120/0+a"],
outfile=tmpfile.name,
)
assert output is None # check that output is None since outfile is set
assert os.path.exists(path=tmpfile.name)
output = pd.read_csv(tmpfile.name, sep="\t", header=None)
assert output.shape == (5, 3)
npt.assert_allclose(output.median(), [250.12149, 26.04296, -674.0])