Testing performance of external graph usage #3823

chenejac · 2023-02-20T07:22:51Z

Is your feature request related to a problem? Please describe.
It looks performance in communication with external/remote graph bases is quite low. We need to test those performance and create baseline for comparation with further improvements.

Describe the solution you'd like
A wiki page for performance testing should be updated - https://wiki.lyrasis.org/display/VIVO/Performance+Testing. The page is documenting the scenario of testing and results. Taking into account that ingestion has been improved recently in this PR, all existing tests at wiki page should be repeated. The definition of the environment should be also included (operating system, processor, RAM, SDD or HDD disk).

Additional context
Wiki page - https://wiki.lyrasis.org/display/VIVO/Performance+Testing
Take a look in another Lyrasis communities testing performance procedures and persentation of results:

brianjlowe · 2023-02-28T15:32:33Z

VIVO's internal logging of SPARQL query timings may be useful here: https://wiki.lyrasis.org/display/VIVO/The+Developer+Panel See sections about SPARQL Query Settings and developer.properties.

michel-heon · 2023-03-01T19:54:44Z

@chenejac , @brianjlowe , @wwelling , @kaladay

VIVO (TDB/Neptune) Performance measurement - Preliminary results-

Here are some of the results of the analysis to date

Prerequisite

read https://wiki.lyrasis.org/display/VIVO/Performance+Testing

System details

VIVO & Vitro code extract from
- Infrastructure improvements - Migration to Java 11 from Java 8 #3836
- Infrastructure improvements - Migration to Java 11 from Java 8 Vitro#368
VIVO/Vitro configuration based-on https://wiki.lyrasis.org/display/VIVO/Performance+Testing
- with logging patch
- Inference and search NOT disable
- $VIVO_HOME/config/developer.properties configured as:

developer.enabled = true
developer.loggingRDFService.enable = true
developer.loggingRDFService.queryRestriction = .*
developer.loggingRDFService.stackRestriction = .*

JAVA- Oracle JDK 11
Env. Vars

export CATALINA_OPTS="-Xms6g -Xmx14g"
export JAVA_OPTIONS="-XX:+UnlockCommercialFeatures -XX:+FlightRecorder \
-Dcom.sun.management.jmxremote=true \
-Dcom.sun.management.jmxremote.port=3614 \
-Dcom.sun.management.jmxremote.authenticate=false \
-Dcom.sun.management.jmxremote.ssl=false \
-Dcom.sun.management.jmxremote.autodiscovery=true "

Data:
- From a UQAM triplestore in pre-production
- Contains 396047 triples
- Data are internationalized fr-CA and en-US
- Represents about 1300 persons, 600 organizations, 360 research entries, 3600 expert concepts and 2070 events
Computational resources
- Tomcat host: AWS EC2 m6i.2xlarge
  - 8 CPU, 32 GB, 3.5 GhZ, Network up To 12.5 GB/s
- AWS-Neptune instance: db.t3.medium
  - 2 CPU, 4 GB, Moderate Network

Testing Procedure

Configure applicationSetup.n3 for appropriate situation: tdbConfigurationTripleSource or sparqlConfigurationTripleSource with appropriate sparql-endpoint URL
Copy data-graph in $VIVO_HOME/rdf/abox/filegraph
Start Tomcat (make sure solr is running)
Wait for the VIVO boot and data loading to be completed
Log in as administrator in VIVO and go to the administration window
Empty the contents of the tomcat log files => Echoing "" in all vivo.all.log's file in $TOMCAT_HOME/logs/
Start a VIVO reindexing
At the end of the indexing, start the performance evaluation with the attached script
Save the result in a file
Select the 'Person' tab in VIVO
Make steps 8-9

Results analysis

Test with TDB

Script result after reindexing (step 8):

   calls      sec   sec/call               method
==================================================
   52915    21.58     0.0004 sparqlConstructQuery
  211517    85.92     0.0004    sparqlSelectQuery
   10578     2.89     0.0003       sparqlAskQuery
Total time: 110.394 sec (~1 mins, or ~0 hrs)

Script result after reindexing (step 11):

   calls      sec   sec/call               method
==================================================
   53313    21.64     0.0004 sparqlConstructQuery
  212285    86.02     0.0004    sparqlSelectQuery
   10958     2.93     0.0003       sparqlAskQuery
Total time: 110.596 sec (~1 mins, or ~0 hrs)

Remarks:

A total of 1546 SPARQ operations are requested to refresh the Person tab form
According to our calculations, these operations are performed in .202 seconds

Test with NEPTUNE

Script result after reindexing (step 8):

   calls      sec   sec/call               method
==================================================
   51760   685.09     0.0132 sparqlConstructQuery
  205288  3036.38     0.0148    sparqlSelectQuery
   10269   126.92     0.0124       sparqlAskQuery
Total time: 3848.392 sec (~64 mins, or ~1 hrs)

Script result after reindexing (step 11):

   calls      sec   sec/call               method
==================================================
   52156   685.89     0.0132 sparqlConstructQuery
  206056  3038.81     0.0147    sparqlSelectQuery
   10647   127.57     0.0120       sparqlAskQuery
Total time: 3852.273 sec (~64 mins, or ~1 hrs)

Remarks:

A total of 1544 SPARQ operations are requested to refresh the Person tab form
According to our calculations, these operations are performed in 3,31 seconds

General remarks

It appears that changing pages (person, organization, search, etc.) requires a number of queries proportional to the number of triples in the TriplesStore. For a networked database detached from the VIVO application, the consequences are quite small for a little database. On the other hand, the sensitivity to latency becomes more perceptible for larger databases.

The hypothesis that emerges from the observation is that the high latency is mainly due to repeated sending of small queries. We believe that it would be advantageous to group small queries into larger ones to reduce the number of calls to the triplestore.

chenejac · 2023-03-03T10:28:16Z

@michel-heon thanks for this. This is a nice first step. I am wondering can we make more concrete definition for further steps including code analysis and listing links to code lines where the improvement is needed (for instance link to a method which should use bulk fetching of data).

michel-heon · 2023-03-09T16:38:52Z

A comprehensive investigation

This commentary describes the investigations that have been conducted to identify the source of the network latency. In addition, a first correction is proposed in the Vitro PR 378

The investigation's goal is to determine the entry point requiring optimization. Inspired by the defined procedure VIVO-Wiki/Performance+Testing, several executions with a step-by-step trace have been performed.

Experimental procedure to evaluate the performance of the solution

The investigation is carried out by performing five iterations, each comprising two tests

Test setup

Test flow

The tests are produced in the following sequence:

flowchart TD 
    Z[Clean tomcat logs ] --> A[Set parameters]
    A -->|STGE-Type, #-indv, #-Thread| B(compile/start-vivo)
    BB[(Data set)] -. Loading data set .-> C
    B --> C{{Wait for the reindexing to be completed}}
    C -.  trace log .-> E[(test-#.txt)]
    C --> F{{Select Person in VIVO}}
    F-. trace Log .-> G[(test-#+1.txt)]
    F--> H[Stop tomcat]

Test parameters

Iteration are parameterized according to the following three variables

the graph storage type (Local with tdb (T) or remote with Neptune (N))
the number of individuals per page (15 or 30)
the execution approach: monotrhead (1) or multithread (X)

Log compilers

The result of the tests is realized by the execution of this script which is an adaptation of the script proposed in parse
by

#!/bin/bash
###################################################################
# Script Name   :
# Description   :
# Args          : extract from https://wiki.lyrasis.org/download/attachments/183076653/parse.sh?version=1&modificationDate=1583178696512&api=v2&download=true
# Author        : Michel Héon PhD
# Institution   : Université du Québec à Montréal
# Copyright     : Université du Québec à Montréal (c) 2022
# Email         : heon.michel@uqam.ca
###################################################################
export SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" &> /dev/null && pwd -P)"
source $SCRIPT_DIR/../conf/vs-conf.sh
source $BASH_LIB/cleanup.sh
LOG_FILES="$TOMCAT_HOME/logs/$APP_NAME.all.*"
log_file=$TMPDIR/file.log

for f in $LOG_FILES
do
  echo -e '\n' >> $log_file
  cat $f >> $log_file
done
set +x


grep  'ingest:' < $log_file

# Output formatting
divider===================================
divider=$divider$divider

header="\n %7s %8s %10s %20s\n"
format=" %7d %8.2f %10.4f %20s\n"

width=50

printf "$header" "calls" "sec" "sec/call" "method" ;
printf "%$width.${width}s\n" "$divider"
# Output formatting - END

total=0
total_count=0
for method in {changeSetUpdate,sparqlConstructQuery,sparqlDescribeQuery,sparqlSelectQuery,sparqlAskQuery,serializeAll,serializeGraph,isEquivalentGraph};
  do 
  count=`grep RDFServiceLogger $log_file | cut -d' ' -f9,10 | grep -c $method;`
  sum=`grep RDFServiceLogger $log_file | cut -d' ' -f9,10 | grep $method | cut -d' ' -f1 | paste -sd+ - | bc -l ;`
  if [ ! -z "$sum" ]
    then
      rate=`echo "${sum}/${count}" | bc -l`
      printf "$format" "$count" "$sum" "$rate" "$method";
      math="${total}+${sum}"
      total=`echo $math | bc -l`
      total_count=`echo "${total_count}+${count}"|bc`
  fi
done

mins=`echo "${total}/60.0" | bc -l`
hours=`echo "${total}/3600.0" | bc -l`
printf "Total time/calls: %.2f/%d sec (~%.2f mins, or ~%.2f hrs) avrg %f sec.\n"  "${total}" "$total_count" "${mins}" "${hours}" "$(echo "${total}/$total_count" | bc -l)"
echo "Number of lines $(wc -l < $log_file)"

Data set

From a UQAM triplestore in pre-production
Contains 396047 triples
Data are internationalized fr-CA and en-US
Represents about 1300 persons, 600 organizations, 360 research entries, 3600 expert concepts and 2070 events

Java classes affected by the tests

see details in Vitro PR 378

Test results

Summary of experiments

Test #	STGE Type	nbrindv/page	Threading	after-reindex	after-touch-person
test-1	T	30	1	X
test-2	T	30	1		X
test-3	T	15	X	X
test-4	T	15	X		X
test-5	N	30	1	X
test-6	N	30	1		X
test-7	N	15	X	X
test-8	N	15	X		X
test-9	N	15	1	X
test-10	N	15	1		X

Test details

test-1

   calls      sec   sec/call               method
==================================================
   59468    27.36     0.0005 sparqlConstructQuery
  236584   108.38     0.0005    sparqlSelectQuery
   11829     3.80     0.0003       sparqlAskQuery
Total time/calls: 139.55/307881 sec (~2.33 mins, or ~0.04 hrs) avrg 0.000453 sec.
Number of lines 307929

test-2

   calls      sec   sec/call               method
==================================================
   59864    27.44     0.0005 sparqlConstructQuery
  237352   108.50     0.0005    sparqlSelectQuery
   12207     3.84     0.0003       sparqlAskQuery
Total time/calls: 139.78/309423 sec (~2.33 mins, or ~0.04 hrs) avrg 0.000452 sec.
Number of lines 309471

test-3

   calls      sec   sec/call               method
==================================================
   51926    24.37     0.0005 sparqlConstructQuery
  207613    98.01     0.0005    sparqlSelectQuery
   10379     3.70     0.0004       sparqlAskQuery
Total time/calls: 126.08/269918 sec (~2.10 mins, or ~0.04 hrs) avrg 0.000467 sec.
Number of lines 269963

test-4

   calls      sec   sec/call               method
==================================================
   52109    24.54     0.0005 sparqlConstructQuery
  207997    98.19     0.0005    sparqlSelectQuery
   10551     3.74     0.0004       sparqlAskQuery
Total time/calls: 126.47/270657 sec (~2.11 mins, or ~0.04 hrs) avrg 0.000467 sec.
Number of lines 270703

test-5

   calls      sec   sec/call               method
==================================================
   55176   352.31     0.0064 sparqlConstructQuery
  209343  1613.03     0.0077    sparqlSelectQuery
   10465    58.37     0.0056       sparqlAskQuery
Total time/calls: 2023.71/274984 sec (~33.73 mins, or ~0.56 hrs) avrg 0.007359 sec.
Number of lines 275029

test-6

   calls      sec   sec/call               method
==================================================
   55574   353.08     0.0064 sparqlConstructQuery
  210133  1615.31     0.0077    sparqlSelectQuery
   10845    59.00     0.0054       sparqlAskQuery
Total time/calls: 2027.39/276552 sec (~33.79 mins, or ~0.56 hrs) avrg 0.007331 sec.
Number of lines 276599

test-7

   calls      sec   sec/call               method
==================================================
   56134   351.02     0.0063 sparqlConstructQuery
  209061  1598.22     0.0076    sparqlSelectQuery
   10453    57.96     0.0055       sparqlAskQuery
Total time/calls: 2007.19/275648 sec (~33.45 mins, or ~0.56 hrs) avrg 0.007282 sec.
Number of lines 275693

test-8

   calls      sec   sec/call               method
==================================================
   56301   351.57     0.0062 sparqlConstructQuery
  209408  1599.73     0.0076    sparqlSelectQuery
   10607    58.34     0.0055       sparqlAskQuery
Total time/calls: 2009.64/276316 sec (~33.49 mins, or ~0.56 hrs) avrg 0.007273 sec.
Number of lines 276362

test-9

   calls      sec   sec/call               method
==================================================
   58137   361.48     0.0062 sparqlConstructQuery
  217714  1664.85     0.0076    sparqlSelectQuery
   10887    61.15     0.0056       sparqlAskQuery
Total time/calls: 2087.49/286738 sec (~34.79 mins, or ~0.58 hrs) avrg 0.007280 sec.
Number of lines 286784

test-10

   calls      sec   sec/call               method
==================================================
   58328   361.86     0.0062 sparqlConstructQuery
  218091  1666.02     0.0076    sparqlSelectQuery
   11068    61.46     0.0056       sparqlAskQuery
Total time/calls: 2089.34/287487 sec (~34.82 mins, or ~0.58 hrs) avrg 0.007268 sec.
Number of lines 2875344

Discussion

Performance between the local TDB type and the remote triples server (Neptune)

The two tests representing this test case are test-2 and test-6:

Test-2: Total time/calls: 139.78/309423 sec (~2.33 mins, or ~0.04 hrs) avrg 0.000452 sec.
Test-3: Total time/calls: 2027.39/276552 sec (~33.79 mins, or ~0.56 hrs) avrg 0.007331 sec.

We observe an average resolution of obtaining information of 0.000452 sec for the TdbLocal mode and 0.007331 sec for the TdbDistant mode, that is to say a ratio for local TDB of 16 times faster than the distant mode.

Impact of reducing the number of individuals to read per page

This test case aims at analyzing the performance of the reduction of the individual number to be read to obtain the necessary information to supply the page. In its initial state, Vitro reads the information of 30 individuals. Each individual uses sometimes up to 60 SPARQL queries, which for 30 individuals is equivalent to nearly 1800 queries.
More specifically, by referring to the results below for these particular tests (309423-307881 = 1542):

Test-1: Total time/calls: 139.55/307881 sec (~2.33 mins, or ~0.04 hrs) avrg 0.000453 sec.
Test-2: Total time/calls: 139.78/309423 sec (~2.33 mins, or ~0.04 hrs) avrg 0.000452 sec.

To evaluate the effect of the decrease in the number of individuals we will review the results of the test-6 and test-10 for a reduction from 30 to 15 individuals on Neptune.

To make this comparison, it is necessary to consider only the totals obtained when the person page is called. The totals from the indexing must be subtracted from the totals after the call to the person's page. The result of the test-5 and test-9 must be part of the calculation

test-5: Total time/calls: 2023.71/274984 sec (~33.73 mins, or ~0.56 hrs) avrg 0.007359 sec.
test-6: Total time/calls: 2027.39/276552 sec (~33.79 mins, or ~0.56 hrs) avrg 0.007331 sec.
test-9: Total time/calls: 2087.49/286738 sec (~34.79 mins, or ~0.58 hrs) avrg 0.007280 sec.
Test-10: Total time/calls: 2089.34/287487 sec (~34.82 mins, or ~0.58 hrs) avrg 0.007268 sec.

Result
Total Time (TT) from test-6 (2027.39 sec) minus TT-5 (2023.71 sec) = 3.68 seconds to refresh page of (30 indvs)
TT-10 (2089.34 sec ) minus TT-9 (2087.49 sec) = 1.85 seconds to refresh page of (15 indvs)

So the page containing 15 individuals refreshes within a factor of 2... which makes sense!

Impact of parallel processing (Multithreading) of individuals

test-7: Total time/calls: 2007.19/275648 sec (~33.45 mins, or ~0.56 hrs) avrg 0.007282 sec.
test-8: Total time/calls: 2009.64/276316 sec (~33.49 mins, or ~0.56 hrs) avrg 0.007273 sec.
test-9: Total time/calls: 2087.49/286738 sec (~34.79 mins, or ~0.58 hrs) avrg 0.007280 sec.
Test-10: Total time/calls: 2089.34/287487 sec (~34.82 mins, or ~0.58 hrs) avrg 0.007268 sec.

TT-8 (2009.64 sec) minus TT-7 (2007.19 sec) = 2.45 seconds to refresh page Multi-Thread
TT-10 (2089.34 sec ) mins TT-9 (2087.49 sec) = 1.85 seconds to refresh page of Mono-Thread

It appears in this experiment that the results obtained are in the opposite direction of the expected results. This result can be explained in several ways:

The generation of threads takes more time than the transmission of the request
A possible overload of the Neptune triplet server
Unexpected network overload
Other possible factors.

Without any doubt, this is a case to be reproduced several times in order to extract more significant statistical results.

Tracks of future investigations

Here are some of the important classes involved in finding the information needed to produce the list of individuals. Classes that are more or less in order of call.

GetRenderedSearchIndividualsByVClass
IndividualTemplateModelBuilder
ShortViewServiceImpl
IndividualSDB
RequestModelAccessImpl
RDFServiceSparql

Two approaches are possible to reduce latency:
1- Redefine the call logic to reduce the number of queries sent to the network.
2- Buffering the treatment by making a first request extracting all the information specific to the individual to form a model in local memory and then applying the treatment logic on the model in local memory.

FYI: @chenejac , @brianjlowe , @wwelling , @litvinovg

chenejac added the High label Feb 20, 2023

chenejac assigned MichelHeon and michel-heon and unassigned MichelHeon Feb 21, 2023

chenejac removed the High label Feb 22, 2023

This was referenced Feb 22, 2023

Migration to Java 11 from Java 8 #3834

Open

Reorganizing sample data #3825

Open

This was referenced Mar 4, 2023

Testing performance of external graph usage - performance patch vivo-project/Vitro#377

Closed

Network latency correction between Vitro and the outsourced tripplet vivo-project/Vitro#378

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Testing performance of external graph usage #3823

Testing performance of external graph usage #3823

chenejac commented Feb 20, 2023 •

edited

Loading

brianjlowe commented Feb 28, 2023

michel-heon commented Mar 1, 2023

chenejac commented Mar 3, 2023

michel-heon commented Mar 9, 2023

Testing performance of external graph usage #3823

Testing performance of external graph usage #3823

Comments

chenejac commented Feb 20, 2023 • edited Loading

brianjlowe commented Feb 28, 2023

michel-heon commented Mar 1, 2023

VIVO (TDB/Neptune) Performance measurement - Preliminary results-

Prerequisite

System details

Testing Procedure

Results analysis

Test with TDB

Test with NEPTUNE

General remarks

chenejac commented Mar 3, 2023

michel-heon commented Mar 9, 2023

A comprehensive investigation

Experimental procedure to evaluate the performance of the solution

Test setup

Test flow

Test parameters

Log compilers

Data set

Java classes affected by the tests

Test results

Summary of experiments

Test details

test-1

test-2

test-3

test-4

test-5

test-6

test-7

test-8

test-9

test-10

Discussion

Performance between the local TDB type and the remote triples server (Neptune)

Impact of reducing the number of individuals to read per page

Impact of parallel processing (Multithreading) of individuals

Tracks of future investigations

chenejac commented Feb 20, 2023 •

edited

Loading