Demo #1589

vasudeva8 · 2023-03-28T11:08:40Z

This pull request adds a number of sample programs to demonstrate the usage of HTSLib apis.
https://docs.google.com/document/d/1d6EECYkezwZWW4zWpyqGNHK4LIrcai8anJAvNbtxiVo/edit# gives a documentation of the same.

jkbonfield

Just a cursory glance through flags_demo.c and split.c so far. More tomorrow.

We need to also decide on a house style for bracing. We have multiple ones in use in this code, but we should be consistent. Personally my preference is this as it's fewer lines of code.

if (expr)
    statement;
else
    statement;
    
if (multi_line_expr) {
    do1;
    do2;
} else {
    do3;
}

I know others have open braces in the same column as the closing one (but as I say I see both here).

samples/flags_demo.c

samples/split.c

jkbonfield · 2023-03-28T15:57:20Z

samples/split.c

+    //open input file
+    infile = sam_open(inname, "r");
+    if (!infile) {
+        fprintf(stderr, "Could not open %s\n", inname);


sam_open will complain anyway, so this is redundant - for both input and output files. Eg:

@ seq4-head2[htslib/samples]; ./a.out notfound -o /tmp/nonexist [E::hts_open_format] Failed to open file "notfound" : No such file or directory Could not open notfound

All samtools code that invokes sam_open() writes its own error message itself.

Personally I think it's a bug that sam_open prints its own diagnostic. It is a library routine similar to fopen so should be quiet by default, leaving it to the invoking code to print a diagnostic — which can print a better diagnostic because it knows what the purpose of the file is (e.g. it can write “can't open reference file” or similar to distinguish further which of several input files is missing).

Some htslib library functions print their own diagnostics because it's likely that invoking code won't print its own diagnostic. However file opening functions like sam_open are the one case where HTSlib can depend on the invoking application code universally printing the diagnostic itself!

Agreed on the library not reporting errors this way, but right now it does so. Ideally it needs an hts_strerror type of interface, maybe optionally taking a file pointer (like ferror), or we split those into two APIs. (See also comments in #1049). A long standing wish-list!

This code serves as two purposes really. One is standalone basic starting points that people may wish to copy and build on top of - a simple template design. In that things like error checking and reporting are important to put in as they foster good design principles.

The second is as text quoted in a tutorial or how-to guide on using htslib. This isn't just doxygen type manuals, which don't easily lend themselves to guidance. For those I think we could probably do some liberal pruning of error reporting (but not detection) for the sake of brevity. We already have some of this with liberal "..." to jump to the relevant code blocks. It may be better to avoid putting too much in main so we can document entire functions as isolated code fragments. That also lends itself better to just returning error codes instead of printing up messages. That depends on the application though.

jkbonfield · 2023-03-28T16:03:17Z

samples/split.c

+        inname = "-";
+    //output directory
+    if (!outdir) {
+        fprintf(stderr, "Output directory not specified\n");


One for @daviesrob to consider maybe, but we never use hts_log functions from samtools and I don't know why. We could have instead:

hts_log_warning("no output directory specified");

or

hts_log(HTS_LOG_WARNING, "split","no output directory specified");

I'm not sure why we always use fprintf here, but we even simulate the same style of output:

https://github.com/samtools/samtools/blob/develop/sam_view.c#L1164

I suspect it's just history.

That said, it can be useful to not include this in the initial demonstration programs as we don't want to bamboozle people with a large API to start with.

samples/split.c

jkbonfield

General comments for header.c.

The header API is complex and has a lot of features. We probably want to focus on the most common and basic operations people may want to do. That's things like checking the length of a reference sequence, or a general purpose find value X for key Y in header type @ZZ.

This code specifies the -nth line, and it then converts that to a string form using sam_hdr_line_name and the idf ("SN" for SQ and "ID" for PG/RG) to a string, which is then looked up using sam_hdr_find_tag_id. This doesn't really feel like something the user is ever going to want to do. We're unlikely to want to know about the 85th reference, but we may want to know about ref "hs37d5".

Furthermore there's no specific block on sam_hdr_find_tag_id working with any key type. We can query SQ by SN (the most likely) but also by M5 if we wished to. The only real difference is that SN has an internal index (a hash table) for fast lookup while the others will do a full scan to find the line. However that's something we could describe in text if needs be.

Also as a demonstrationtool it feels a little muddled by trying to do many things. I don't think it needs the whole complexity of reporting all header lines types unless the user specifies one, which then gives rise to the whole headercnt and line types arrays etc. Viewing the entire header is easier via other mechanisms, so as demonstration of the API we should focus on the querying specific components aspect. This will make the code much simpler to follow, and hence focus on the key points we want to explain which is our API, rather than the program flow and logic.

Eg:

$ read_header
Usage: read_header file line-type key value [tag]

# Uses sam_hdr_find_line_id
$ read_header in.bam SQ SN hs37d5
@SQ	SN:hs37d5	LN:35477943	M5:5b6a4b3a81a2d3c134b7d14bf6ad39f1	UR:/nfs/srpipe_references/references/Human/1000Genomes_hs37d5/all/fasta/hs37d5.fa

# Uses sam_hdr_find_tag_id
$ read_header in.bam SQ SN hs37d5 LN
35477943

We can do that already (-h SQ -n 85 -t LN ~/scratch/data/novaseq.10m.cram) but with a -n option instead.

samples/read_header.c

jkbonfield · 2023-03-29T14:10:09Z

samples/read_header.c

+    //iterate and show 1st of each type or requested one
+    for (c = 0; headercnt && (c < (sizeof(types) / sizeof(types[0]))); ++c) {
+        //get count of given header type
+        linecnt = sam_hdr_count_lines(in_samhdr, *(header + c));


*(header + c) is better just written as header[c] and makes it clearer we're just using an array of header types.

jkbonfield · 2023-03-29T14:24:37Z

samples/read_header.c

+            //find the id_key field for the requested header type - uses NULL for HD
+            for (c = 0; c < (sizeof(types) / sizeof(types[0])); ++c) {
+                if (!strcmp(types[c], optarg)) {
+                    idf = &idfield[c];
+                    break;
+                }
+            }


This comment is confusing.

jkbonfield

I think both read_bam and read_aux would benefit from deliberately not trying to look like SAM. Maybe it more of a debug or "dump" tool.

Also for the aux handling, demonstrating the aux iterator is nice, but we should also demonstrate querying with eg

char *c = bam_aux_get(b, "NM");
int NM = c ? bam_aux2i(c) : 0;

So maybe the tool can print up entire lines or just report on specific tag types.

samples/read_bam.c

samples/read_aux.c

jkbonfield · 2023-04-03T11:45:34Z

I still have specifics to go through with the demos, but I decided to take a more high level overview of things. I should have started with this before, sorry.

Basically we have a few categories of tools, and their closest matching samtools equivalent where possible:

Samtools flagstat: flags_demo / flags_htsopt_field / flags_tpool.
Samtools view read/write loop: read_bam / index_read / index_reg_read / index_multi_read / index_multireg_read / index_write.
Samtools index: index_create
Pileup API: pileup.c / mpileup.c
SAM aux tag manipulation: read_aux.c
Header API: read_header / update_header
split.c ?

So we have a lot of tools, but not too many themes. That's good as it permits a better narrative in a tutorial or sets of related worked examples. Some of the APIs though I think are in the wrong place.

Flagstat is a read-only tool, which is good for demonstrating decoding of the basic sequence API. This may mean read_bam is irrelevant. Or perhaps read_bam belongs in the same chapter / topic as the flagstat one. Flagstat is a perfect example of the basic hts_set_opt API with HTS_OPT_NTHREADS and CRAM_OPT_REQUIRED_FIELDS. It's not appropriate for the thread pool, as we only have one thing to do.

The read/write loops could may be considered as a file format conversion tool, so we can introduce filename handling and setting format based on suffixes. All tools currently just use sam_open(..., "w") to write SAM, so we're missing useful functionality here. A read/write loop is the perfect place to bring in using a shared thread pool, instead of explicitly controlling a per-file-descriptor thread count.

Pileup is its own thing, and definitely needs its own tutorial section. I'm happy with pileup vs mpileup, but the constructor/destructor interface is an extra complexity that has no purpose in the examples. We probably need something more advanced, such as optimising away an overhead from being computed per-column to per-read. There are choices there too - caching numerics in pileup1_t->cd.i vs caching complex things needing memory allocation, which is where the constructor/destructor come in. I'd need to think of a concise example, but test/pileup_mod.c is one such use where bam_parse_basemod is called during the constructor.

The index querying can come in at any point I guess, as it applies equallty to both a read-only scenario and a read-write scenario. I'm not sure which topic is the best one to add it to, but it's OK in the read/write code.

The header API feels a bit disjoint, as well as split.c which I can't follow the intention of. Maybe these two can be merged into the same overall topic. Eg split by read-group requires processing of the header, and maybe we'd want to show how we can add new PG entries or remove other RG lines when splitting. Again this is a perfect example of where the thread pool can be useful as we're writing out several files at once.

Aux tag manipulation could do with extending. Right now it's just regurgitating the aux tag data, which isn't so helpful (and also not the ideal way to do it). Maybe we need something to summarise tag utilisation (tagstats), or to add a new tag computed from the read data. Eg adding a cg:f tag to report the CG %age for each read. Maybe also for fun gather statistics on CG content vs MQUAL. Infact the latter half of that could be a read-BAM-only while the former is a read-write loop, so perhaps it's better than the flagstat and view alternatives? Thoughts?

The README.md file is a basic introduction to the sample code in this directory, while DEMO.md has more detailed information per program and API calls introduced. This is a work in progress and updated documentation may be available on www.htslib.org.

daviesrob assigned jkbonfield Mar 28, 2023

jkbonfield reviewed Mar 28, 2023

View reviewed changes

jkbonfield reviewed Mar 29, 2023

View reviewed changes

vasudeva8 force-pushed the demo branch from 6275486 to e8b2501 Compare May 9, 2023 12:55

vasudeva8 force-pushed the demo branch 2 times, most recently from 86b3809 to d474a19 Compare June 2, 2023 09:04

vasudeva8 force-pushed the demo branch 4 times, most recently from ffd8119 to 57bac5c Compare June 9, 2023 16:03

vasudeva8 force-pushed the demo branch from 6898cd8 to bda88fa Compare July 7, 2023 16:47

vasudeva8 force-pushed the demo branch 2 times, most recently from 0347a59 to 22ca20a Compare July 20, 2023 16:09

jkbonfield force-pushed the demo branch from 22ca20a to bc37946 Compare July 21, 2023 09:17

jkbonfield merged commit d06b598 into samtools:develop Jul 21, 2023
8 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Demo #1589

Demo #1589

vasudeva8 commented Mar 28, 2023 •

edited

Loading

jkbonfield left a comment

jkbonfield Mar 28, 2023

jmarshall Mar 28, 2023

jkbonfield Mar 29, 2023

jkbonfield Mar 28, 2023 •

edited

Loading

jkbonfield left a comment

jkbonfield Mar 29, 2023

jkbonfield Mar 29, 2023

jkbonfield left a comment

jkbonfield commented Apr 3, 2023

Demo #1589

Demo #1589

Conversation

vasudeva8 commented Mar 28, 2023 • edited Loading

jkbonfield left a comment

Choose a reason for hiding this comment

jkbonfield Mar 28, 2023

Choose a reason for hiding this comment

jmarshall Mar 28, 2023

Choose a reason for hiding this comment

jkbonfield Mar 29, 2023

Choose a reason for hiding this comment

jkbonfield Mar 28, 2023 • edited Loading

Choose a reason for hiding this comment

jkbonfield left a comment

Choose a reason for hiding this comment

jkbonfield Mar 29, 2023

Choose a reason for hiding this comment

jkbonfield Mar 29, 2023

Choose a reason for hiding this comment

jkbonfield left a comment

Choose a reason for hiding this comment

jkbonfield commented Apr 3, 2023

vasudeva8 commented Mar 28, 2023 •

edited

Loading

jkbonfield Mar 28, 2023 •

edited

Loading