Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Demo #1589

Merged
merged 1 commit into from
Jul 21, 2023
Merged

Demo #1589

merged 1 commit into from
Jul 21, 2023

Conversation

vasudeva8
Copy link
Contributor

@vasudeva8 vasudeva8 commented Mar 28, 2023

This pull request adds a number of sample programs to demonstrate the usage of HTSLib apis.
https://docs.google.com/document/d/1d6EECYkezwZWW4zWpyqGNHK4LIrcai8anJAvNbtxiVo/edit# gives a documentation of the same.

Copy link
Contributor

@jkbonfield jkbonfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a cursory glance through flags_demo.c and split.c so far. More tomorrow.

We need to also decide on a house style for bracing. We have multiple ones in use in this code, but we should be consistent. Personally my preference is this as it's fewer lines of code.

if (expr)
    statement;
else
    statement;
    
if (multi_line_expr) {
    do1;
    do2;
} else {
    do3;
}

I know others have open braces in the same column as the closing one (but as I say I see both here).

samples/flags_demo.c Outdated Show resolved Hide resolved
samples/flags_demo.c Outdated Show resolved Hide resolved
samples/flags_demo.c Outdated Show resolved Hide resolved
samples/flags_demo.c Outdated Show resolved Hide resolved
samples/split.c Outdated Show resolved Hide resolved
samples/split.c Outdated
//open input file
infile = sam_open(inname, "r");
if (!infile) {
fprintf(stderr, "Could not open %s\n", inname);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sam_open will complain anyway, so this is redundant - for both input and output files. Eg:

@ seq4-head2[htslib/samples]; ./a.out notfound -o /tmp/nonexist
[E::hts_open_format] Failed to open file "notfound" : No such file or directory
Could not open notfound

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All samtools code that invokes sam_open() writes its own error message itself.

Personally I think it's a bug that sam_open prints its own diagnostic. It is a library routine similar to fopen so should be quiet by default, leaving it to the invoking code to print a diagnostic — which can print a better diagnostic because it knows what the purpose of the file is (e.g. it can write “can't open reference file” or similar to distinguish further which of several input files is missing).

Some htslib library functions print their own diagnostics because it's likely that invoking code won't print its own diagnostic. However file opening functions like sam_open are the one case where HTSlib can depend on the invoking application code universally printing the diagnostic itself!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed on the library not reporting errors this way, but right now it does so. Ideally it needs an hts_strerror type of interface, maybe optionally taking a file pointer (like ferror), or we split those into two APIs. (See also comments in #1049). A long standing wish-list!

This code serves as two purposes really. One is standalone basic starting points that people may wish to copy and build on top of - a simple template design. In that things like error checking and reporting are important to put in as they foster good design principles.

The second is as text quoted in a tutorial or how-to guide on using htslib. This isn't just doxygen type manuals, which don't easily lend themselves to guidance. For those I think we could probably do some liberal pruning of error reporting (but not detection) for the sake of brevity. We already have some of this with liberal "..." to jump to the relevant code blocks. It may be better to avoid putting too much in main so we can document entire functions as isolated code fragments. That also lends itself better to just returning error codes instead of printing up messages. That depends on the application though.

samples/split.c Outdated
inname = "-";
//output directory
if (!outdir) {
fprintf(stderr, "Output directory not specified\n");
Copy link
Contributor

@jkbonfield jkbonfield Mar 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One for @daviesrob to consider maybe, but we never use hts_log functions from samtools and I don't know why. We could have instead:

hts_log_warning("no output directory specified");

or

hts_log(HTS_LOG_WARNING, "split","no output directory specified");

I'm not sure why we always use fprintf here, but we even simulate the same style of output:

https://github.com/samtools/samtools/blob/develop/sam_view.c#L1164

I suspect it's just history.

That said, it can be useful to not include this in the initial demonstration programs as we don't want to bamboozle people with a large API to start with.

samples/split.c Outdated Show resolved Hide resolved
Copy link
Contributor

@jkbonfield jkbonfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General comments for header.c.

The header API is complex and has a lot of features. We probably want to focus on the most common and basic operations people may want to do. That's things like checking the length of a reference sequence, or a general purpose find value X for key Y in header type @ZZ.

This code specifies the -nth line, and it then converts that to a string form using sam_hdr_line_name and the idf ("SN" for SQ and "ID" for PG/RG) to a string, which is then looked up using sam_hdr_find_tag_id. This doesn't really feel like something the user is ever going to want to do. We're unlikely to want to know about the 85th reference, but we may want to know about ref "hs37d5".

Furthermore there's no specific block on sam_hdr_find_tag_id working with any key type. We can query SQ by SN (the most likely) but also by M5 if we wished to. The only real difference is that SN has an internal index (a hash table) for fast lookup while the others will do a full scan to find the line. However that's something we could describe in text if needs be.

Also as a demonstrationtool it feels a little muddled by trying to do many things. I don't think it needs the whole complexity of reporting all header lines types unless the user specifies one, which then gives rise to the whole headercnt and line types arrays etc. Viewing the entire header is easier via other mechanisms, so as demonstration of the API we should focus on the querying specific components aspect. This will make the code much simpler to follow, and hence focus on the key points we want to explain which is our API, rather than the program flow and logic.

Eg:

$ read_header
Usage: read_header file line-type key value [tag]

# Uses sam_hdr_find_line_id
$ read_header in.bam SQ SN hs37d5
@SQ	SN:hs37d5	LN:35477943	M5:5b6a4b3a81a2d3c134b7d14bf6ad39f1	UR:/nfs/srpipe_references/references/Human/1000Genomes_hs37d5/all/fasta/hs37d5.fa

# Uses sam_hdr_find_tag_id
$ read_header in.bam SQ SN hs37d5 LN
35477943

We can do that already (-h SQ -n 85 -t LN ~/scratch/data/novaseq.10m.cram) but with a -n option instead.

samples/read_header.c Outdated Show resolved Hide resolved
samples/read_header.c Outdated Show resolved Hide resolved
samples/read_header.c Outdated Show resolved Hide resolved
//iterate and show 1st of each type or requested one
for (c = 0; headercnt && (c < (sizeof(types) / sizeof(types[0]))); ++c) {
//get count of given header type
linecnt = sam_hdr_count_lines(in_samhdr, *(header + c));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

*(header + c) is better just written as header[c] and makes it clearer we're just using an array of header types.

Comment on lines 78 to 84
//find the id_key field for the requested header type - uses NULL for HD
for (c = 0; c < (sizeof(types) / sizeof(types[0])); ++c) {
if (!strcmp(types[c], optarg)) {
idf = &idfield[c];
break;
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is confusing.

Copy link
Contributor

@jkbonfield jkbonfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think both read_bam and read_aux would benefit from deliberately not trying to look like SAM. Maybe it more of a debug or "dump" tool.

Also for the aux handling, demonstrating the aux iterator is nice, but we should also demonstrate querying with eg

char *c = bam_aux_get(b, "NM");
int NM = c ? bam_aux2i(c) : 0;

So maybe the tool can print up entire lines or just report on specific tag types.

samples/read_bam.c Outdated Show resolved Hide resolved
samples/read_bam.c Outdated Show resolved Hide resolved
samples/read_bam.c Outdated Show resolved Hide resolved
samples/read_bam.c Outdated Show resolved Hide resolved
samples/read_aux.c Outdated Show resolved Hide resolved
samples/read_aux.c Outdated Show resolved Hide resolved
samples/read_aux.c Show resolved Hide resolved
@jkbonfield
Copy link
Contributor

I still have specifics to go through with the demos, but I decided to take a more high level overview of things. I should have started with this before, sorry.

Basically we have a few categories of tools, and their closest matching samtools equivalent where possible:

  • Samtools flagstat: flags_demo / flags_htsopt_field / flags_tpool.
  • Samtools view read/write loop: read_bam / index_read / index_reg_read / index_multi_read / index_multireg_read / index_write.
  • Samtools index: index_create
  • Pileup API: pileup.c / mpileup.c
  • SAM aux tag manipulation: read_aux.c
  • Header API: read_header / update_header
  • split.c ?

So we have a lot of tools, but not too many themes. That's good as it permits a better narrative in a tutorial or sets of related worked examples. Some of the APIs though I think are in the wrong place.

Flagstat is a read-only tool, which is good for demonstrating decoding of the basic sequence API. This may mean read_bam is irrelevant. Or perhaps read_bam belongs in the same chapter / topic as the flagstat one. Flagstat is a perfect example of the basic hts_set_opt API with HTS_OPT_NTHREADS and CRAM_OPT_REQUIRED_FIELDS. It's not appropriate for the thread pool, as we only have one thing to do.

The read/write loops could may be considered as a file format conversion tool, so we can introduce filename handling and setting format based on suffixes. All tools currently just use sam_open(..., "w") to write SAM, so we're missing useful functionality here. A read/write loop is the perfect place to bring in using a shared thread pool, instead of explicitly controlling a per-file-descriptor thread count.

Pileup is its own thing, and definitely needs its own tutorial section. I'm happy with pileup vs mpileup, but the constructor/destructor interface is an extra complexity that has no purpose in the examples. We probably need something more advanced, such as optimising away an overhead from being computed per-column to per-read. There are choices there too - caching numerics in pileup1_t->cd.i vs caching complex things needing memory allocation, which is where the constructor/destructor come in. I'd need to think of a concise example, but test/pileup_mod.c is one such use where bam_parse_basemod is called during the constructor.

The index querying can come in at any point I guess, as it applies equallty to both a read-only scenario and a read-write scenario. I'm not sure which topic is the best one to add it to, but it's OK in the read/write code.

The header API feels a bit disjoint, as well as split.c which I can't follow the intention of. Maybe these two can be merged into the same overall topic. Eg split by read-group requires processing of the header, and maybe we'd want to show how we can add new PG entries or remove other RG lines when splitting. Again this is a perfect example of where the thread pool can be useful as we're writing out several files at once.

Aux tag manipulation could do with extending. Right now it's just regurgitating the aux tag data, which isn't so helpful (and also not the ideal way to do it). Maybe we need something to summarise tag utilisation (tagstats), or to add a new tag computed from the read data. Eg adding a cg:f tag to report the CG %age for each read. Maybe also for fun gather statistics on CG content vs MQUAL. Infact the latter half of that could be a read-BAM-only while the former is a read-write loop, so perhaps it's better than the flagstat and view alternatives? Thoughts?

The README.md file is a basic introduction to the sample code in this
directory, while DEMO.md has more detailed information per program and
API calls introduced.

This is a work in progress and updated documentation may be available
on www.htslib.org.
@jkbonfield jkbonfield merged commit d06b598 into samtools:develop Jul 21, 2023
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants