Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Testing performance of external graph usage #3823

Open
chenejac opened this issue Feb 20, 2023 · 4 comments
Open

Testing performance of external graph usage #3823

chenejac opened this issue Feb 20, 2023 · 4 comments
Assignees

Comments

@chenejac
Copy link
Contributor

chenejac commented Feb 20, 2023

Is your feature request related to a problem? Please describe.
It looks performance in communication with external/remote graph bases is quite low. We need to test those performance and create baseline for comparation with further improvements.

Describe the solution you'd like
A wiki page for performance testing should be updated - https://wiki.lyrasis.org/display/VIVO/Performance+Testing. The page is documenting the scenario of testing and results. Taking into account that ingestion has been improved recently in this PR, all existing tests at wiki page should be repeated. The definition of the environment should be also included (operating system, processor, RAM, SDD or HDD disk).

Additional context
Wiki page - https://wiki.lyrasis.org/display/VIVO/Performance+Testing
Take a look in another Lyrasis communities testing performance procedures and persentation of results:

@brianjlowe
Copy link
Member

VIVO's internal logging of SPARQL query timings may be useful here: https://wiki.lyrasis.org/display/VIVO/The+Developer+Panel See sections about SPARQL Query Settings and developer.properties.

@michel-heon
Copy link
Member

@chenejac , @brianjlowe , @wwelling , @kaladay

VIVO (TDB/Neptune) Performance measurement - Preliminary results-

Here are some of the results of the analysis to date

Prerequisite

read https://wiki.lyrasis.org/display/VIVO/Performance+Testing

System details

developer.enabled = true
developer.loggingRDFService.enable = true
developer.loggingRDFService.queryRestriction = .*
developer.loggingRDFService.stackRestriction = .*
  • JAVA- Oracle JDK 11
  • Env. Vars
export CATALINA_OPTS="-Xms6g -Xmx14g"
export JAVA_OPTIONS="-XX:+UnlockCommercialFeatures -XX:+FlightRecorder \
-Dcom.sun.management.jmxremote=true \
-Dcom.sun.management.jmxremote.port=3614 \
-Dcom.sun.management.jmxremote.authenticate=false \
-Dcom.sun.management.jmxremote.ssl=false \
-Dcom.sun.management.jmxremote.autodiscovery=true "
  • Data:
    - From a UQAM triplestore in pre-production
    - Contains 396047 triples
    - Data are internationalized fr-CA and en-US
    - Represents about 1300 persons, 600 organizations, 360 research entries, 3600 expert concepts and 2070 events
  • Computational resources
    • Tomcat host: AWS EC2 m6i.2xlarge
      • 8 CPU, 32 GB, 3.5 GhZ, Network up To 12.5 GB/s
    • AWS-Neptune instance: db.t3.medium
      • 2 CPU, 4 GB, Moderate Network

Testing Procedure

  1. Configure applicationSetup.n3 for appropriate situation: tdbConfigurationTripleSource or sparqlConfigurationTripleSource with appropriate sparql-endpoint URL
  2. Copy data-graph in $VIVO_HOME/rdf/abox/filegraph
  3. Start Tomcat (make sure solr is running)
  4. Wait for the VIVO boot and data loading to be completed
  5. Log in as administrator in VIVO and go to the administration window
  6. Empty the contents of the tomcat log files => Echoing "" in all vivo.all.log's file in $TOMCAT_HOME/logs/
  7. Start a VIVO reindexing
  8. At the end of the indexing, start the performance evaluation with the attached script
  9. Save the result in a file
  10. Select the 'Person' tab in VIVO
  11. Make steps 8-9

Results analysis

Test with TDB

Script result after reindexing (step 8):

   calls      sec   sec/call               method
==================================================
   52915    21.58     0.0004 sparqlConstructQuery
  211517    85.92     0.0004    sparqlSelectQuery
   10578     2.89     0.0003       sparqlAskQuery
Total time: 110.394 sec (~1 mins, or ~0 hrs)	

Script result after reindexing (step 11):

   calls      sec   sec/call               method
==================================================
   53313    21.64     0.0004 sparqlConstructQuery
  212285    86.02     0.0004    sparqlSelectQuery
   10958     2.93     0.0003       sparqlAskQuery
Total time: 110.596 sec (~1 mins, or ~0 hrs)

Remarks:

A total of 1546 SPARQ operations are requested to refresh the Person tab form
According to our calculations, these operations are performed in .202 seconds

Test with NEPTUNE

Script result after reindexing (step 8):

   calls      sec   sec/call               method
==================================================
   51760   685.09     0.0132 sparqlConstructQuery
  205288  3036.38     0.0148    sparqlSelectQuery
   10269   126.92     0.0124       sparqlAskQuery
Total time: 3848.392 sec (~64 mins, or ~1 hrs)

Script result after reindexing (step 11):

   calls      sec   sec/call               method
==================================================
   52156   685.89     0.0132 sparqlConstructQuery
  206056  3038.81     0.0147    sparqlSelectQuery
   10647   127.57     0.0120       sparqlAskQuery
Total time: 3852.273 sec (~64 mins, or ~1 hrs)

Remarks:

A total of 1544 SPARQ operations are requested to refresh the Person tab form
According to our calculations, these operations are performed in 3,31 seconds

General remarks

It appears that changing pages (person, organization, search, etc.) requires a number of queries proportional to the number of triples in the TriplesStore. For a networked database detached from the VIVO application, the consequences are quite small for a little database. On the other hand, the sensitivity to latency becomes more perceptible for larger databases.

The hypothesis that emerges from the observation is that the high latency is mainly due to repeated sending of small queries. We believe that it would be advantageous to group small queries into larger ones to reduce the number of calls to the triplestore.

@chenejac
Copy link
Contributor Author

chenejac commented Mar 3, 2023

@michel-heon thanks for this. This is a nice first step. I am wondering can we make more concrete definition for further steps including code analysis and listing links to code lines where the improvement is needed (for instance link to a method which should use bulk fetching of data).

@michel-heon
Copy link
Member

A comprehensive investigation

This commentary describes the investigations that have been conducted to identify the source of the network latency. In addition, a first correction is proposed in the Vitro PR 378

The investigation's goal is to determine the entry point requiring optimization. Inspired by the defined procedure VIVO-Wiki/Performance+Testing, several executions with a step-by-step trace have been performed.

Experimental procedure to evaluate the performance of the solution

The investigation is carried out by performing five iterations, each comprising two tests

Test setup

Test flow

The tests are produced in the following sequence:

flowchart TD 
    Z[Clean tomcat logs ] --> A[Set parameters]
    A -->|STGE-Type, #-indv, #-Thread| B(compile/start-vivo)
    BB[(Data set)] -. Loading data set .-> C
    B --> C{{Wait for the reindexing to be completed}}
    C -.  trace log .-> E[(test-#.txt)]
    C --> F{{Select Person in VIVO}}
    F-. trace Log .-> G[(test-#+1.txt)]
    F--> H[Stop tomcat]
Loading

Test parameters

Iteration are parameterized according to the following three variables

  • the graph storage type (Local with tdb (T) or remote with Neptune (N))
  • the number of individuals per page (15 or 30)
  • the execution approach: monotrhead (1) or multithread (X)

Log compilers

The result of the tests is realized by the execution of this script which is an adaptation of the script proposed in parse
by

#!/bin/bash
###################################################################
# Script Name   :
# Description   :
# Args          : extract from https://wiki.lyrasis.org/download/attachments/183076653/parse.sh?version=1&modificationDate=1583178696512&api=v2&download=true
# Author        : Michel Héon PhD
# Institution   : Université du Québec à Montréal
# Copyright     : Université du Québec à Montréal (c) 2022
# Email         : heon.michel@uqam.ca
###################################################################
export SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" &> /dev/null && pwd -P)"
source $SCRIPT_DIR/../conf/vs-conf.sh
source $BASH_LIB/cleanup.sh
LOG_FILES="$TOMCAT_HOME/logs/$APP_NAME.all.*"
log_file=$TMPDIR/file.log

for f in $LOG_FILES
do
  echo -e '\n' >> $log_file
  cat $f >> $log_file
done
set +x


grep  'ingest:' < $log_file

# Output formatting
divider===================================
divider=$divider$divider

header="\n %7s %8s %10s %20s\n"
format=" %7d %8.2f %10.4f %20s\n"

width=50

printf "$header" "calls" "sec" "sec/call" "method" ;
printf "%$width.${width}s\n" "$divider"
# Output formatting - END

total=0
total_count=0
for method in {changeSetUpdate,sparqlConstructQuery,sparqlDescribeQuery,sparqlSelectQuery,sparqlAskQuery,serializeAll,serializeGraph,isEquivalentGraph};
  do 
  count=`grep RDFServiceLogger $log_file | cut -d' ' -f9,10 | grep -c $method;`
  sum=`grep RDFServiceLogger $log_file | cut -d' ' -f9,10 | grep $method | cut -d' ' -f1 | paste -sd+ - | bc -l ;`
  if [ ! -z "$sum" ]
    then
      rate=`echo "${sum}/${count}" | bc -l`
      printf "$format" "$count" "$sum" "$rate" "$method";
      math="${total}+${sum}"
      total=`echo $math | bc -l`
      total_count=`echo "${total_count}+${count}"|bc`
  fi
done

mins=`echo "${total}/60.0" | bc -l`
hours=`echo "${total}/3600.0" | bc -l`
printf "Total time/calls: %.2f/%d sec (~%.2f mins, or ~%.2f hrs) avrg %f sec.\n"  "${total}" "$total_count" "${mins}" "${hours}" "$(echo "${total}/$total_count" | bc -l)"
echo "Number of lines $(wc -l < $log_file)"

Data set

  • From a UQAM triplestore in pre-production
  • Contains 396047 triples
  • Data are internationalized fr-CA and en-US
  • Represents about 1300 persons, 600 organizations, 360 research entries, 3600 expert concepts and 2070 events

Java classes affected by the tests

see details in Vitro PR 378

Test results

Summary of experiments

Test # STGE Type nbrindv/page Threading after-reindex after-touch-person
test-1 T 30 1 X
test-2 T 30 1 X
test-3 T 15 X X
test-4 T 15 X X
test-5 N 30 1 X
test-6 N 30 1 X
test-7 N 15 X X
test-8 N 15 X X
test-9 N 15 1 X
test-10 N 15 1 X

Test details

test-1

   calls      sec   sec/call               method
==================================================
   59468    27.36     0.0005 sparqlConstructQuery
  236584   108.38     0.0005    sparqlSelectQuery
   11829     3.80     0.0003       sparqlAskQuery
Total time/calls: 139.55/307881 sec (~2.33 mins, or ~0.04 hrs) avrg 0.000453 sec.
Number of lines 307929

test-2

   calls      sec   sec/call               method
==================================================
   59864    27.44     0.0005 sparqlConstructQuery
  237352   108.50     0.0005    sparqlSelectQuery
   12207     3.84     0.0003       sparqlAskQuery
Total time/calls: 139.78/309423 sec (~2.33 mins, or ~0.04 hrs) avrg 0.000452 sec.
Number of lines 309471

test-3

   calls      sec   sec/call               method
==================================================
   51926    24.37     0.0005 sparqlConstructQuery
  207613    98.01     0.0005    sparqlSelectQuery
   10379     3.70     0.0004       sparqlAskQuery
Total time/calls: 126.08/269918 sec (~2.10 mins, or ~0.04 hrs) avrg 0.000467 sec.
Number of lines 269963

test-4

   calls      sec   sec/call               method
==================================================
   52109    24.54     0.0005 sparqlConstructQuery
  207997    98.19     0.0005    sparqlSelectQuery
   10551     3.74     0.0004       sparqlAskQuery
Total time/calls: 126.47/270657 sec (~2.11 mins, or ~0.04 hrs) avrg 0.000467 sec.
Number of lines 270703

test-5

   calls      sec   sec/call               method
==================================================
   55176   352.31     0.0064 sparqlConstructQuery
  209343  1613.03     0.0077    sparqlSelectQuery
   10465    58.37     0.0056       sparqlAskQuery
Total time/calls: 2023.71/274984 sec (~33.73 mins, or ~0.56 hrs) avrg 0.007359 sec.
Number of lines 275029

test-6

   calls      sec   sec/call               method
==================================================
   55574   353.08     0.0064 sparqlConstructQuery
  210133  1615.31     0.0077    sparqlSelectQuery
   10845    59.00     0.0054       sparqlAskQuery
Total time/calls: 2027.39/276552 sec (~33.79 mins, or ~0.56 hrs) avrg 0.007331 sec.
Number of lines 276599

test-7

   calls      sec   sec/call               method
==================================================
   56134   351.02     0.0063 sparqlConstructQuery
  209061  1598.22     0.0076    sparqlSelectQuery
   10453    57.96     0.0055       sparqlAskQuery
Total time/calls: 2007.19/275648 sec (~33.45 mins, or ~0.56 hrs) avrg 0.007282 sec.
Number of lines 275693

test-8

   calls      sec   sec/call               method
==================================================
   56301   351.57     0.0062 sparqlConstructQuery
  209408  1599.73     0.0076    sparqlSelectQuery
   10607    58.34     0.0055       sparqlAskQuery
Total time/calls: 2009.64/276316 sec (~33.49 mins, or ~0.56 hrs) avrg 0.007273 sec.
Number of lines 276362

test-9

   calls      sec   sec/call               method
==================================================
   58137   361.48     0.0062 sparqlConstructQuery
  217714  1664.85     0.0076    sparqlSelectQuery
   10887    61.15     0.0056       sparqlAskQuery
Total time/calls: 2087.49/286738 sec (~34.79 mins, or ~0.58 hrs) avrg 0.007280 sec.
Number of lines 286784

test-10

   calls      sec   sec/call               method
==================================================
   58328   361.86     0.0062 sparqlConstructQuery
  218091  1666.02     0.0076    sparqlSelectQuery
   11068    61.46     0.0056       sparqlAskQuery
Total time/calls: 2089.34/287487 sec (~34.82 mins, or ~0.58 hrs) avrg 0.007268 sec.
Number of lines 2875344

Discussion

Performance between the local TDB type and the remote triples server (Neptune)

The two tests representing this test case are test-2 and test-6:

Test-2: Total time/calls: 139.78/309423 sec (~2.33 mins, or ~0.04 hrs) avrg 0.000452 sec.
Test-3: Total time/calls: 2027.39/276552 sec (~33.79 mins, or ~0.56 hrs) avrg 0.007331 sec.

We observe an average resolution of obtaining information of 0.000452 sec for the TdbLocal mode and 0.007331 sec for the TdbDistant mode, that is to say a ratio for local TDB of 16 times faster than the distant mode.

Impact of reducing the number of individuals to read per page

This test case aims at analyzing the performance of the reduction of the individual number to be read to obtain the necessary information to supply the page. In its initial state, Vitro reads the information of 30 individuals. Each individual uses sometimes up to 60 SPARQL queries, which for 30 individuals is equivalent to nearly 1800 queries.
More specifically, by referring to the results below for these particular tests (309423-307881 = 1542):

Test-1: Total time/calls: 139.55/307881 sec (~2.33 mins, or ~0.04 hrs) avrg 0.000453 sec.
Test-2: Total time/calls: 139.78/309423 sec (~2.33 mins, or ~0.04 hrs) avrg 0.000452 sec.

To evaluate the effect of the decrease in the number of individuals we will review the results of the test-6 and test-10 for a reduction from 30 to 15 individuals on Neptune.

To make this comparison, it is necessary to consider only the totals obtained when the person page is called. The totals from the indexing must be subtracted from the totals after the call to the person's page. The result of the test-5 and test-9 must be part of the calculation

test-5: Total time/calls: 2023.71/274984 sec (~33.73 mins, or ~0.56 hrs) avrg 0.007359 sec.
test-6: Total time/calls: 2027.39/276552 sec (~33.79 mins, or ~0.56 hrs) avrg 0.007331 sec.
test-9: Total time/calls: 2087.49/286738 sec (~34.79 mins, or ~0.58 hrs) avrg 0.007280 sec.
Test-10: Total time/calls: 2089.34/287487 sec (~34.82 mins, or ~0.58 hrs) avrg 0.007268 sec.

Result
Total Time (TT) from test-6 (2027.39 sec) minus TT-5 (2023.71 sec) = 3.68 seconds to refresh page of (30 indvs)
TT-10 (2089.34 sec ) minus TT-9 (2087.49 sec) = 1.85 seconds to refresh page of (15 indvs)

So the page containing 15 individuals refreshes within a factor of 2... which makes sense!

Impact of parallel processing (Multithreading) of individuals

test-7: Total time/calls: 2007.19/275648 sec (~33.45 mins, or ~0.56 hrs) avrg 0.007282 sec.
test-8: Total time/calls: 2009.64/276316 sec (~33.49 mins, or ~0.56 hrs) avrg 0.007273 sec.
test-9: Total time/calls: 2087.49/286738 sec (~34.79 mins, or ~0.58 hrs) avrg 0.007280 sec.
Test-10: Total time/calls: 2089.34/287487 sec (~34.82 mins, or ~0.58 hrs) avrg 0.007268 sec.

TT-8 (2009.64 sec) minus TT-7 (2007.19 sec) = 2.45 seconds to refresh page Multi-Thread
TT-10 (2089.34 sec ) mins TT-9 (2087.49 sec) = 1.85 seconds to refresh page of Mono-Thread

It appears in this experiment that the results obtained are in the opposite direction of the expected results. This result can be explained in several ways:

  • The generation of threads takes more time than the transmission of the request
  • A possible overload of the Neptune triplet server
  • Unexpected network overload
  • Other possible factors.

Without any doubt, this is a case to be reproduced several times in order to extract more significant statistical results.

Tracks of future investigations

Here are some of the important classes involved in finding the information needed to produce the list of individuals. Classes that are more or less in order of call.

  • GetRenderedSearchIndividualsByVClass
  • IndividualTemplateModelBuilder
  • ShortViewServiceImpl
  • IndividualSDB
  • RequestModelAccessImpl
  • RDFServiceSparql

Two approaches are possible to reduce latency:
1- Redefine the call logic to reduce the number of queries sent to the network.
2- Buffering the treatment by making a first request extracting all the information specific to the individual to form a model in local memory and then applying the treatment logic on the model in local memory.

FYI: @chenejac , @brianjlowe , @wwelling , @litvinovg

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: In Progress
Development

No branches or pull requests

4 participants