Problems with empty annotation intersection #150

Closed
ZabalaAitor opened this issue Jun 18, 2024 · 12 comments · Fixed by #159
Labels
bug Something isn't working

Comments

@ZabalaAitor

Description of the bug

Hello,

I am trying to run nf-core/circRNA on sncRNA samples, and I encountered an error during the annotation part for some of the samples. I noticed that the samples with errors have an empty intersect.bed file.

I am wondering what information is supposed to be in the intersect.bed file and what biological reasons could cause it to be empty.

Thank you very much,

Aitor Zabala

Command used and terminal output

nextflow run nf-core/circRNA \
	-r dev \
	-profile apptainer \
	--input /data/azabala/NIM_005/samplesheet.csv \
	--phenotype /data/azabala/NIM_005/phenotype.csv \
	--module circrna_discovery,mirna_prediction \
	--outdir /scratch/azabala/sncRNA/results_circRNA \
	--tool 'circrna_finder' \
	--max_cpus 36 \
	--max_memory 512GB \
	-w /scratch/azabala/work_sncRNA_circRNA \
	--genome GRCh38 \
	--save_reference false \
	-resume

...............................


Caused by:
  Process `NFCORE_CIRCRNA:CIRCRNA:CIRCRNA_DISCOVERY:ANNOTATION (HC19)` terminated with an error exit status (1)

Command executed:

  annotation.py --input HC19.intersect.bed --exon_boundary 200 --output HC19.annotation.bed
  
  cat <<-END_VERSIONS > versions.yml
  "NFCORE_CIRCRNA:CIRCRNA:CIRCRNA_DISCOVERY:ANNOTATION":
      python: $(python --version | sed 's/Python //g')
      pandas: $(python -c "import pandas; print(pandas.__version__)")
      numpy: $(python -c "import numpy; print(numpy.__version__)")
  END_VERSIONS

Command exit status:
  1

Command output:
  (empty)

Command error:
  Traceback (most recent call last):
    File "/home/azabala/.nextflow/assets/nf-core/circRNA/bin/annotation.py", line 55, in <module>
      df = df.groupby(['chr', 'start', 'end', 'strand']).aggregate({
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/usr/local/lib/python3.11/site-packages/pandas/core/groupby/generic.py", line 894, in aggregate
      result = op.agg()
               ^^^^^^^^
    File "/usr/local/lib/python3.11/site-packages/pandas/core/apply.py", line 169, in agg
      return self.agg_dict_like()
             ^^^^^^^^^^^^^^^^^^^^
    File "/usr/local/lib/python3.11/site-packages/pandas/core/apply.py", line 478, in agg_dict_like
      arg = self.normalize_dictlike_arg("agg", selected_obj, arg)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/usr/local/lib/python3.11/site-packages/pandas/core/apply.py", line 601, in normalize_dictlike_arg
      raise KeyError(f"Column(s) {cols_sorted} do not exist")
  KeyError: "Column(s) ['gene_id', 'transcript_id'] do not exist"

Work dir:
  /scratch/azabala/work_sncRNA_circRNA/83/3b958d1d7194efaa23a82450c6e7f5

Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`

 -- Check '.nextflow.log' file for details

Relevant files

No response

System information

Nextflow: 23.04.2
Hardware: HPC
Executor: slurm
Container: Apptainer
OS: Linux
nf-core/circrna: dev

@ZabalaAitor added the bug label on Jun 18, 2024
@nictru
Contributor

nictru commented Jun 18, 2024

Hey,
This happens if the GTF file does not meet the expectations. In this case, the gene_id and transcript_id fields in the attributes column are missing. Please make sure to use an appropriate GTF file.
Also the pipeline version seems to be a bit outdated - please update using nextflow pull nf-core/circrna

@ZabalaAitor
Author

Hey,

I used the default GTF file provided by iGenomes, which I believe should have the correct format. Regarding the pipeline, I did update it using nextflow pull nf-core/circrna, but it's possible that the update didn't complete properly due to issues with the HPC environment. I'll look into it to ensure the pipeline is fully updated.

Thanks,

@nictru
Contributor

nictru commented Jun 19, 2024

I am sure the GTF will have the correct format; otherwise, errors will look different. The problem occurs because the GTF contains regions on sequences not present in the FASTA file.

This problem will also occur on the latest pipeline version, as I have not yet had time to fix it - this was just a side note.

EDIT: This message was a mixup - forget about it

@ZabalaAitor
Author

The FASTA file is also provided by iGenomes...

@nictru
Contributor

nictru commented Jun 24, 2024

Oh I'm sorry, I got mixed up between two issues. This issue does not have anything to do with the FASTA file. The one with the FASTA file compatibility problems is #151.

Still, the error you encounter is due to missing gene_id and transcript_id entries in the GTF file. nf-core also discourages the usage of iGenomes as stated here. Maybe look inside the GTF file and see for yourself, but I can also add a check to the pipeline, which will give a user-friendly message if this happens again. To fix this I can recommend reference data from here.
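
One way to "look inside the GTF file", as suggested above, is to count the records that lack these attributes. The following is a minimal sketch and not part of the pipeline; the function name `missing_id_counts` and the two example records are hypothetical. Counts are broken down by feature type because gene-level records in Ensembl-style GTFs normally carry no transcript_id, so a missing transcript_id on a `gene` row is not by itself a sign of a malformed file.

```python
import re
from collections import Counter

# Matches quoted GTF attribute pairs such as: gene_id "ENSG00000078808";
# (unquoted numeric attributes are ignored, which is fine for the two IDs checked here)
ATTR_RE = re.compile(r'(\S+)\s+"([^"]*)"')


def missing_id_counts(gtf_lines, required=("gene_id", "transcript_id")):
    """Count records lacking each required attribute, keyed by (feature, attribute)."""
    missing = Counter()
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete 9-column GTF record
        feature = fields[2]
        keys = {key for key, _ in ATTR_RE.findall(fields[8])}
        for name in required:
            if name not in keys:
                missing[(feature, name)] += 1
    return missing


# Hypothetical two-record input; real usage would pass an open file handle instead.
example = [
    '1\tensembl_havana\tgene\t1216908\t1232067\t.\t-\t.\t'
    'gene_id "ENSG00000078808"; gene_name "SDF4";\n',
    '1\tensembl_havana\ttranscript\t1216908\t1232067\t.\t-\t.\t'
    'gene_id "ENSG00000078808"; transcript_id "TX_PLACEHOLDER";\n',
]
print(missing_id_counts(example))
```

If the counts show gene_id or transcript_id missing on `exon` or `transcript` records as well, the GTF itself is the likely culprit; if only `gene` rows lack transcript_id, the cause probably lies elsewhere.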

@ZabalaAitor
Author

I tried using another GTF file and encountered an error while running CIRIquant: it is unable to find the GTF file, whereas other tools, such as circRNA_finder, are able to find it.

I have written about the issue in #155 . Please feel free to delete or close that entry if you prefer to resolve the issue here.

Thank you very much for your time and assistance.

@ZabalaAitor
Author

This error persists despite using different GTF files. Could it be because there are no circRNAs in those samples?

@nictru
Contributor

nictru commented Jul 2, 2024

You are absolutely right, this can also occur if no circRNAs are found. I should have thought about this earlier. You can confirm this is the case by switching to /scratch/azabala/work_sncRNA_circRNA/83/3b958d1d7194efaa23a82450c6e7f5 and investigating the GTF file there.

If it is really the case, I will implement a clear error message pointing this out for future users.
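
A guard of the kind described here could be sketched as follows. This is only an illustration under stated assumptions: the helper names `count_records` and `require_records` are hypothetical, the wording of the message is invented, and the thread does not show how the actual fix in #159 is implemented. The idea is to count the records in the intersect file before handing it to the annotation step, so that an empty file produces a readable message rather than a pandas KeyError.

```python
import tempfile
from pathlib import Path


def count_records(bed_path):
    """Count non-blank lines that are not comments or track/browser headers."""
    skipped = ("#", "track", "browser")
    with open(bed_path) as handle:
        return sum(
            1 for line in handle if line.strip() and not line.startswith(skipped)
        )


def require_records(bed_path):
    """Raise a descriptive error if the intersect file holds no records."""
    if count_records(bed_path) == 0:
        raise ValueError(
            f"{bed_path} contains no records; possibly no circRNAs were detected "
            "for this sample, or none overlapped the annotation."
        )


# Demo with throwaway files (file names are hypothetical, not pipeline outputs).
with tempfile.TemporaryDirectory() as tmp:
    empty = Path(tmp) / "empty.intersect.bed"
    empty.write_text("")
    filled = Path(tmp) / "filled.intersect.bed"
    filled.write_text("1\t1223243\t1223968\t1:1223243-1223968:-\t11.0\t-\n")

    empty_count = count_records(empty)
    filled_count = count_records(filled)
    try:
        require_records(empty)
        guard_message = None
    except ValueError as err:
        guard_message = str(err)

print(empty_count, filled_count)
print(guard_message)
```

The same `count_records` call can be pointed at the intersect.bed file in the process work directory to confirm whether it is empty.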

@ZabalaAitor
Author

I cannot find the GTF file in that directory, but the intersect.bed file is empty.

@nictru
Contributor

nictru commented Jul 3, 2024

Yes okay, this is the reason then. Is the data you used confidential? Otherwise I would like to use it as test data for coming up with a clean solution

@nictru changed the title from "ERROR ~ Error executing process > 'NFCORE_CIRCRNA:CIRCRNA:CIRCRNA_DISCOVERY:ANNOTATION'" to "Problems with empty annotation intersection" on Jul 12, 2024
@nictru linked a pull request on Jul 12, 2024 that will close this issue
@nictru
Contributor

nictru commented Jul 12, 2024

Hey @ZabalaAitor, please re-execute the pipeline with the branch connected to the PR I just opened (#159) and provide me with the updated error message

@xfk274280

> Oh I'm sorry, I got mixed up between two issues. This issue does not have anything to do with the FASTA file. The one with the FASTA file compatibility problems is #151.
>
> Still, the error you encounter is due to missing gene_id and transcript_id entries in the GTF file. nf-core also discourages the usage of iGenomes as stated here. Maybe look inside the GTF file and see for yourself, but I can also add a check to the pipeline, which will give a user-friendly message if this happens again. To fix this I can recommend reference data from here.

The error occurs because transcript_id is absent from the rows of the GTF (Gene Transfer Format) file whose feature type is gene. Also, which branch should be regarded as the most up to date: dev or 150-problems-with-empty-annotation-intersection?

df_incomplete = df_incomplete[df_incomplete != ""]
if len(df_incomplete) > 0:

1 1223243 1223968 1:1223243-1223968:- 11.0 - 1 ensembl_havana gene 1216908 1232067 . - . gene_id "ENSG00000078808"; gene_version "18"; gene_name "SDF4"; gene_source "ensembl_havana"; gene_biotype "protein_coding";
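
To make the structure of such a row explicit, the sketch below splits it into its circRNA part and its GTF part and parses the attribute column. It rests on assumptions that the thread does not confirm: that the real file is tab-separated, that the first six fields describe the circRNA and the remaining nine are the overlapping GTF record, and that the GTF score, strand and frame fields are ".", "-" and ".". The column names are made up for illustration and may differ from those used in annotation.py.

```python
import re

# Hypothetical column names: six BED-like circRNA fields, then nine GTF fields.
BED_COLS = ["chr", "start", "end", "name", "score", "strand"]
GTF_COLS = [
    "seqname", "source", "feature", "gtf_start", "gtf_end",
    "gtf_score", "gtf_strand", "frame", "attributes",
]

# Assumption: tab-separated fields, with GTF score ".", strand "-" and frame ".".
row = "\t".join([
    "1", "1223243", "1223968", "1:1223243-1223968:-", "11.0", "-",
    "1", "ensembl_havana", "gene", "1216908", "1232067", ".", "-", ".",
    'gene_id "ENSG00000078808"; gene_version "18"; gene_name "SDF4"; '
    'gene_source "ensembl_havana"; gene_biotype "protein_coding";',
])

fields = row.split("\t")
record = dict(zip(BED_COLS + GTF_COLS, fields))

# Parse the quoted key/value pairs of the GTF attribute column.
attributes = dict(re.findall(r'(\S+)\s+"([^"]*)"', record["attributes"]))

print(record["feature"], attributes.get("gene_id"), attributes.get("transcript_id"))
```

For this gene-level row the attribute dictionary has a gene_id but no transcript_id key, which is the situation described in the comment above.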

This issue was closed.