Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create a string library that works on encrypted data using TFHE-rs #80

Closed
zaccherinij opened this issue Sep 28, 2023 · 63 comments
Closed
Assignees
Labels
💰 Awarded This project is now completed and awarded 🎯 Bounty This bounty is currently open 📁 TFHE-rs library targeted: TFHE-rs
Milestone

Comments

@zaccherinij
Copy link
Collaborator

zaccherinij commented Sep 28, 2023

Winners

🥇 1st place: A submission by JoseSK999
🥈 2nd place A submission by Tomtau
🥉 3rd place : A submission by M-Bln


Overview

This TFHE-rs bounty looks to provide a set of APIs working over encrypted string reproducing APIs from the rust std library str type (see documentation at https://doc.rust-lang.org/std/primitive.str.html)

This document will specify the expected structure of the example that you will produce and specify various constraints on the types that will be introduced and the APIs that are expected on the primitives.

How to participate?

1️⃣ Register here.
2️⃣ When ready, submit your code here.
🗓️ Submission deadline: December 17, 2023.

Description

String encoding

The input string encoding is expected to be ASCII.

To avoid leaking the length of the input string we may want to encrypt zeros after the string actual content, to easily mark the end of the string we will make the FheString null terminated, i.e., it must always end by the value 0u8 encrypted, like C Strings are null terminated.

Note
[ADDED 23.10.2023]
A null character "\0" may mark the early end of a string but is not required (as was first asked in the bounty) as it can make some algorithms more complicated and slower.
The option to have encrypted strings with a padding made with 0s is still required (to leave the option to the user to obfuscate the string's length), differentiating between the two kind of strings (padded or not) can and should be used to chose better algorithms when possible.

Functions to implement

Your submission should at least implement, the following method for an encrypted string:

  • contains with clear / encrypted pattern
  • ends_with with clear pattern / encrypted pattern
  • eq_ignore_case
  • find with clear pattern / encrypted pattern
  • is_empty
  • len
  • repeat with clear / encrypted number of repetitions
  • replace with clear pattern / encrypted pattern
  • replacen with clear pattern / encrypted pattern
  • rfind with clear pattern / encrypted pattern
  • rsplit with clear pattern / encrypted pattern
  • rsplit_once with clear pattern / encrypted pattern
  • rsplitn with clear pattern / encrypted pattern
  • rsplit_terminator with clear pattern / encrypted pattern
  • split with clear pattern / encrypted pattern
  • split_ascii_whitespace
  • split_inclusive with clear pattern / encrypted pattern
  • split_terminator with clear pattern / encrypted pattern
  • splitn with clear pattern / encrypted pattern
  • starts_with with clear pattern / encrypted pattern
  • strip_prefix with clear pattern / encrypted pattern
  • strip_suffix with clear pattern / encrypted pattern
  • to_lowercase
  • to_uppercase
  • trim
  • trim_end
  • trim_start
  • + (concatenation)
  • Comparisons between strings >=, <=, !=, ==

API to use for the bounty

This bounty should make use of the integer API from TFHE-rs, more precisely we expect you to use RadixCiphertexts with the default parallelized functions available in the crate.

Structure of the example directory

You will put your code in the directory tfhe/src/examples/fhe_strings

This directory will contain the following:

main.rs:

The code for a command line executable that will take an input string to be encrypted and a pattern for string functions which require it. This executable will encrypt the first string and run all the available APIs on the input string using the pattern when necessary. The results will be compared to the clear APIs provided by rust’s std::str, the ouptuts will be nicely formatted for the user to see that the results match (if there are errors then the bounty would not be considered valid) with timing information for the FHE version.

Use clap (the version pinned in TFHE-rs) to build the command line.

ciphertext.rs:

This module will contain:

  • FheAsciiChar: a wrapper type that will hold a RadixCiphertext from integer which must be able to store at least 8 bits of data to be able to fit a single ASCII char;
  • FheString: a wrapper type around a Vec<FheAsciiChar>, the last FheAsciiChar of the string always encrypts a 0u8, it is possible to have 0u8 earlier than the last char, this would allow the user to hide the actual length of the string that is encrypted, accessors should be made to be able to iterate easily on the inner vec both mutably and immutably, do not provide a from_blocks primitives as it would be easy to misuse, a new function should be enough to construct the type at encryption time with a client key, see for example tfhe/src/integer/ciphertext/mod.rs to see how Integer RadixCiphertext are built to give access to their content for use by algorithms.

client_key.rs:

Provide a ClientKey type that can be built from shortint parameters (that are also used by the integer API) or from an IntegerClientKey directly, realistically this ClientKey type will wrap the IntegerClientKey and provide primitives to encrypt a rust str (validating before encryption that the str is a valid ASCII str), decrypt it in a fresh String. There should be an encryption primitive that allows the user to specify how many “padding” encryption of zeros should be appended after the 0-terminated string that will be encrypted from the provided input string.

ClientKey must derive serde::Serialize, serde::Deserialize and Clone.

directory server_key:

mod.rs:
Provide a ServerKey that wraps an Integer ServerKey that can be built from a ClientKey from client_key.rs or directly from an Integer ServerKey.

ServerKey must derive serde::Serialize, serde::Deserialize and Clone.

Create files for families of functions that share similar algorithmic needs, e.g., IF IT MAKES SENSE (here we are speculating, this is just an example supposing similar functions will require similar algorithms)

split.rs

Will contain an impl Block for the above ServerKey dedicated to split functions like

  • rsplit
  • rsplit_terminator
  • split
  • split_inclusive
  • split_terminator

and all relevant helper functions. If some functions are needed everywhere then those functions can be put in the mod.rs file from the directory.

trim.rs

Will contain an impl Block for the above ServerKey dedicated to trim functions like

  • trim
  • trim_end
  • trim_start

Note that the above functions will not literally trim blocks off of the provided FheAsciiString but rather will zero out the correct blocks to make the null terminated string the trimmed version as appropriate.

Also, it could make sense to separate functions depending on the type of the pattern (i.e., encrypted or clear) in separate files.

Other requirements

Each function should have a docstring describing what it does (those can be adapted from the rust std docs) with a doctest demonstrating the usage on simple hard coded cases.
We expect standard algorithms if it makes sense (with links to the papers describing them or publicly available content on the web like a Wikipedia page for example), if some tricks are used proper comments are expected to be provided in the code.

Note

As submissions are evaluated on a 64 cores machine, your code should heavily use parallelization

A Makefile target (see Makefile for the template of our targets) for running the tests of the example is expected. A command to run the binary easily from the terminal is also requested.
The code must compile on the latest stable rust. No #[allow(clippy::...)] are authorized unless absolutely necessary but that should never be required.
The code must pass the make pcc (pre commit check) available from our Makefile, if you get errors then the code needs to be fixed to satisfy the pcc checks.
Good luck!

Reward

🥇Best submission: up to €10,000.

To be considered best submission, a solution must be efficient, effective and demonstrate a deep understanding of the core problem. Alongside the technical correctness, it should also be submitted with a clean code, clear explanations and a complete documentation.

🥈Second-best submission: up to €3,500.

For a solution to be considered the second best submission, it should be both efficient and effective. The code should be neat and readable, while its documentation might not be as exhaustive as the best submission, it should cover the key aspects of the solution.

🥉Third-best submission: up to €1,500.

The third best submission is one that presents a solution that effectively tackles the challenge at hand, even if it may have certain areas of improvement in terms of efficiency or depth of understanding. Documentation should be present, covering the essential components of the solution.

Reward amounts are decided based on code quality and speed performance on a m6i.metal AWS server.

Related links and references

Submission

Apply directly to this bounty by opening an application here.

Questions?

Do you have a specific question about this bounty? Join the live conversation on the FHE.org discord server here. You can also send us an email at: bounty@zama.ai

@zaccherinij zaccherinij added 🎯 Bounty This bounty is currently open 📁 TFHE-rs library targeted: TFHE-rs labels Sep 28, 2023
@aquint-zama aquint-zama added this to the Season 4 milestone Sep 28, 2023
@zaccherinij zaccherinij changed the title Create a String Library That Works on Encrypted Data Using TFHE-rs Create a string library that works on encrypted data using TFHE-rs Oct 2, 2023
@leyuskckiran1510
Copy link

Could you provide me the time frame,I will be working on it after two weeks; will that be okay?

@aquint-zama
Copy link
Collaborator

Could you provide me the time frame,I will be working on it after two weeks; will that be okay?

Deadline for submission is 17th December (you could see it with the associated milestone

@tomtau
Copy link

tomtau commented Oct 5, 2023

I have a few questions:

  1. patterns: I assume it should compile on the stable Rust, right? So in that case, it's not possible to use the pattern feature Tracking issue for string patterns rust-lang/rust#27721 so we'd just need to make a custom type or trait for it -- I also assume the patterns could only be ASCII (like strings in that bounty), is that correct? And should empty patterns be allowed?
  2. All those functions should be on the ServerKey, or should they also be on FheString? And they will all need ServerKey anyway, so it's not possible to implement e.g. std::ops traits (like for +) directly on FheString... hence those functions would be just standalone methods on ServerKey, is that correct or should those traits be somehow implemented (e.g. implement them for (ServerKey, FheString) or something like that)?
  3. split: similarly to patterns, I guess it's ok to create custom types for those different Split return types / iterators (RSplit, RSplitN, SplitAsciiWhitespace, SplitInclusive...), is that correct? For splitn and rsplitn, only the pattern can either be clear or encrypted, n is always clear, right?
  4. outputs: (I'm new to tfhe and) I assume outputs of all those functions would be encrypted, right? Should they use some special types from tfhe, or ok just to use a type alias or a wrapper around RadixCiphertext? And for methods that "branch" (i.e. I assume one cannot know with a ServerKey if a pattern was matched or not in a ciphertext, so needs to compute both options), should it return a vector of all computed ciphertexts (while ClientKey would have a method that can process it and pick the correct output)? I saw that in the regex tutorial: https://docs.zama.ai/tfhe-rs/application-tutorials/regex#matching. so I was wondering whether that's what's expected... and if not, do you have an example of a way how to handle it, e.g. somehow combine boolean and integer circuits for this kind of branching output aggregation?
  5. trim:

Note that the above functions will not literally trim blocks off of the provided FheAsciiString but rather will zero out the correct blocks to make the null terminated string the trimmed version as appropriate.

this makes sense for me with trim_end, but I'm not sure about trim or trim_start... is it ok to just treat 0-byte as empty anywhere in the string (rather than just at its end / null-termination), so ClientKey's decryption would just filter out these 0s to get a string? Or should it still respect C-style and what to do in that case? Should it use a special value to mark empty places (e.g. 256 outside of u8's range) before 0-termination?

@tomtau
Copy link

tomtau commented Oct 6, 2023

@aquint-zama one follow-up question depending on the answer to 4. about function outputs:

  1. repeat: how should the encrypted number of repetitions be handled? Again, based on my newbie understanding, one cannot know with a ServerKey whether a given string should be repeated 1-time or 1000000000-times... is it in that situation ok to return a "lazy" repeated string iterator (which will most likely return two encrypted values: a flag whether the number of repetitions was reached and the repeated encrypted string constructed up to that point)? ClientKey can then get the final string by iterating and checking the decrypted flags?

@IceTDrinker
Copy link
Member

hello @tomtau

  1. Yes this should compile on stable rust, ASCII patterns indeed, empty patterns should be handled the same way the std rust library handles them I guess ?
  2. On the ServerKey is perfectly fine, managing traits are not in the scope of this bounty
  3. n could be encrypted but with obvious limitations to small integer types (probably u8) as oblivious execution would mean a worst case scenario with respect to n which becomes impossible for large integers, if it proves impossible to implement or has very bad runtime this would be taken into account. Custom return types are fine as long as they are easy to manipulate and make sense (the std rust library has been built the way it is for good reasons), but in the FHE world some of those may not be expressible as you don't know where some of the sub strings end, this is part of why this is a bounty, some unknowns around the algorithms/data handling
  4. For the branching you have the if_then_else function for integer ciphertexts, we expect the return types to be FheStrings where it makes sense, and RadixCiphertexts for boolean values which will in the end (if the algorithm is correct) just encrypt a 0 or a 1
  5. This is a good point, so trim_end can fill with 0s while trim_start may probably need to shuffle blocks back to the start of the string to be valid, there may be something smarter to do with special characters but that may not be encodable if rust uses extended ascii (8 bits) vs standard ascii (7 bits)
  6. For encrypted n several ways, the one you propose seems interesting, to note that depending on the type of n (u8 is doable, u16 less so already, u32 is probably impossible on commodity hardware and would be way too slow) it could be doable fully

@tomtau
Copy link

tomtau commented Oct 8, 2023

Thanks @IceTDrinker . And for "replace" / "replacen" functions, should the "to" argument be FheString or a plain string/slice?

@IceTDrinker
Copy link
Member

I believe both are doable, with a complication/performance degradation if "to" is also encrypted

@aquint-zama aquint-zama pinned this issue Oct 9, 2023
@Lcressot
Copy link

@aquint-zama Some
You wrote about the main : "This executable will encrypt the first string and run all the available APIs on the input string using the pattern when necessary"
Should the pattern be encrypted or not, or both ? This would change the speed of the computation so it is important every one does the same for fairness.
Thanks

@IceTDrinker
Copy link
Member

Hello @Lcressot
submissions providing both clear and encrypted patterns would be rated better than a solution with only clear patterns, if some cases look impossible (see encrypted n for repeatn for large integer types which does not look like it can realistically work beyond u8 (and even then will be most likely very slow)) then obviously we would take that into account and in that case the relative speed/technical solutions would be evaluated. If you manage to do something clever where other submissions just ignore the difficulty by marking it as « won’t implement » then the submission proposing a solution will likely be rated better

@Lcressot
Copy link

@IceTDrinker Ok great thanks, so we need to measure and display the computation time for each operation and not just globally

@IceTDrinker
Copy link
Member

Absolutely @Lcressot

@Lcressot
Copy link

Hi, should the encrypted pattern also have a padding ? I think using padding to the pattern would lead to a huge additional complexity to some algorithms. Maybe we can implement both padding and no padding functions.

@IceTDrinker
Copy link
Member

IceTDrinker commented Oct 17, 2023

This is a good question, in the general case the pattern length can leak some amount of information about what users would be searching for.

With that said the performance could be abysmal indeed, potential ideas, we will be discussing on our end at Zama but if you have some inputs on that please feel free to share :

  • also have the length of the string as an encrypted value in the FheString struct (we can make it a 32 bits integer), not sure if that helps or not
  • have an enum in the FheString, to know if the encrypted string is padded with 0s at the end or not allowing to match on both the encrypted string and the pattern to use better algorithms if they exist in the various implementations (less error prone for the users as this enum would be set during encryption)
  • have different functions to manage all the cases, I'm not a fan of this option as it would clutter the API with a lot of function variations in addition to being error prone for the users
  • something else ?

@Lcressot
Copy link

Lcressot commented Oct 17, 2023 via email

@IceTDrinker
Copy link
Member

So after discussion @Lcressot and @tomtau it is acceptable to have a flag indicating if the string is padded with zeros at the end or if it has a single 0, meaning the string length can be deduced by the vector it stores.

Having better algorithms for when there is no padding is accepted, submissions which cover the most difficult cases (e.g. padded input and padded pattern) would likely be rated better as the other algorithms should be more straightforward

@Lcressot
Copy link

Lcressot commented Oct 18, 2023

Instead of duplicating all functions for both types FheString and String (encrypted or clear pattern), can we make template function that take FheString where T can be CipherText or char ? This will require in the main.rs to build both encrypted and clear FheString versions of the pattern string.
Or do you guys absolutly want duplicate functions that can take directly a String as argument for the pattern ? I am asking because doing templates would greatly simplify the code.

@IceTDrinker
Copy link
Member

having generic functions would be a nice touch I would say, they can then dispatch to the proper algorithm

@matthiasgeihs
Copy link

matthiasgeihs commented Nov 4, 2023

[ADDED 23.10.2023]
A null character "\0" may mark the early end of a string but is not required (as was first asked in the bounty) as it can make some algorithms more complicated and slower.

Regarding this note: Is it OK if the FheString has a cleartext flag isZeroPadded (as suggested in #80 (comment)) to signal whether it is padded with zeros or not? Without a cleartext flag, I don't think we can hope for much algorithmic improvement, as we can't branch on encrypted values. @IceTDrinker

@IceTDrinker
Copy link
Member

@matthiasgeihs a flag indicating the padding status is fine yes, or an enum works as well

@JoseSK999
Copy link

JoseSK999 commented Nov 8, 2023

strip_prefix and strip_suffix return the string without the matched pattern wrapped in Some, but if there’s no match they return None. Should we just return the string without prefix/suffix (if there’s match) and the original string (if there isn’t)? If we returned special characters representing None we would probably need to decrypt the output to handle a potential None before continuing with subsequent FHE operations.

@LokartBC
Copy link

LokartBC commented Nov 8, 2023

In my opinion you can return the result along with a boolean telling wether the prefix pattern was found or not

@IceTDrinker
Copy link
Member

Hello @MakisChristou so for the version we would recommend rebasing on top of the main commit zama-ai/tfhe-rs@ad41fdf (or later) the reason being that main has had improvements in low level performance and some other algorithms which should improve the runtime of your solution

Cheers

@MakisChristou
Copy link

So should we use the BooleanBlock instead of BaseRadixCiphertext<Ciphertext>? This is what I was currently using as an inner to the FheAsciiChar implementation.

@IceTDrinker
Copy link
Member

So should we use the BooleanBlock instead of BaseRadixCiphertext<Ciphertext>? This is what I was currently using as an inner to the FheAsciiChar implementation.

@MakisChristou No BooleanBlock is just a type we introduced that you can easily convert to Radix, the new type is to convey that some values encode a boolean value (i.e. a 0 or a 1) and should make some API usage less error prone, I think some functions may have had optimizations linked to BooleanBlock but don't take my word for it.

@IceTDrinker
Copy link
Member

@MakisChristou Though for some functions if it makes sense to return a Boolean then feel free to return a BooleanBlock !

@MakisChristou
Copy link

I see, is the convertion one way or 2 way? Like can I convert a BaseRadixCiphertext<Ciphertext> to a BooleanBlock? My if_then_else implementation requires it for example unless I will make changes to the API.

@IceTDrinker
Copy link
Member

IceTDrinker commented Dec 6, 2023

So you should be able to convert the boolean block to a radix with into_radix, you can use convert on a RadixCiphertext to a BooleanBlock, this will do the C-like conversion where

bool my_bool = some_int != 0; 

for further support please use the FHE.org Discord channel for TFHE-rs :)

@MakisChristou
Copy link

So this has been somewhat discussed before but wanted to make sure I am getting the requirements right. For example in many functions like repeat or trim_start we have 2 options after applying the algorithm in the padded case. Either shuffle zeroes at the end of the string or carefully place some special character such that the client can trivially get the correct result at decryption time. Just wanted to make sure that option 2 is considered valid since it's a lot faster than the former.

@MakisChristou
Copy link

For example we can use a non ascii byte as a placeholder for bytes to be removed by the client. The resulting string will be the valid unpadded result.

@IceTDrinker
Copy link
Member

well the special character of option 2 is a \0 right ?

@MakisChristou
Copy link

well the special character of option 2 is a \0 right ?

Not nessesarily. Lets say we run repeat on “hello/0” 2 times. Instead of for example having “hello/0hello/0” and applying shuffling to get “hellohello/0/0” we can do “helloxhellox” where x can be 255u8 or something out of the ASCII range. Then the client can trivially remove the 255u8s getting “hellohello”. In my code /0 marks the end of the string whereas this special char can mark “remove this byte on the client side”.

@IceTDrinker
Copy link
Member

can you call another of the API on the string with the special character and have a correct result ? 🤔 feels like a \0 with extra steps as other algorithms will need to handle that anyways ? Maybe I'm wrong there

@IceTDrinker
Copy link
Member

And so yeah I was just checking the ASCII managed by rust is the non extended one, so all characters with code < 128, so you have 128 "free" values, though, one could argue you could use a non printable value from the non extended ASCII so that you could easily support the extended ASCII (or a non printable value from the extended ASCII).

The problem of managing that special character in other algorithm still remains I believe as you could chain some methods on an FheString and each algorithm should be able to work properly I would say and it has to account for that special character that could be there as other algorithms could have put it there

@MakisChristou
Copy link

I see. Indeed if chaining has to be supported then option 2 is a no go unless I refactor all algorithms to handle the special character

@IceTDrinker
Copy link
Member

Sorry if this was not clear, but I think you can see how, the same way you could chain algorithms in rust, you should be able to chain algorithms in FHE

@Lcressot
Copy link

In rust "".split("") gives a different result than "/0/0/0".split("/0").
In FHE, should we consider server_key.split( &FheString("/0/0/0"), &FheString("/0")) to act like rust in the first or second case ? In other words, shoud a sequence of pure padding be considered a sequence or empty characters or an empty string ?
The issue in the first case (sequence of empty characters) is that something like "abcd\0".ends_with("cd") is false in rust and should be true in FHE.
In the second case (empty string), we have different behaviors with rust for functions like split, replace, when some inputs are pure padding.

@IceTDrinker
Copy link
Member

@Lcressot \0 is not a printable character and given that it is used as a terminator/padding you can consider that a string containing \0\0\0 is in effect the empty string ""

@IceTDrinker
Copy link
Member

which is not that crazy of an assumption as a lot of the world runs on C/C++ which means that \0 ends a string

@Lcressot
Copy link

As for the submission do we just make our repo public after we add the link? Also I have cloned the state of tfhe-rs as it was 2 or so months ago and added my tfhe_strings folder. Is that the expectation?

Forking is indeed the way to go, you could keep the repository private until the deadline (17th of December) but still submit it through the submission form. Then, on the 18th of December you turn it public.

We cannot set a fork to private :p

@aquint-zama
Copy link
Collaborator

We cannot set a fork to private :p

I know 😞 , you could just init a private repo with the tfhe-rs referenced commit here and add your own commits.

An even better structure for reviewing would be to keep main branch for tfhe-rs code and add your own code in an other branch so that you could init a PR in your repo with your submission.

@MakisChristou
Copy link

One final question, besides Docstrings on individual functions is there any preferred way to document our solution. I was thinking of a Readme or is that not necessary?

@aquint-zama
Copy link
Collaborator

One final question, besides Docstrings on individual functions is there any preferred way to document our solution. I was thinking of a Readme or is that not necessary?

Readme would be nice (sorry for late answer)

@zaccherinij zaccherinij added 💰 Awarded This project is now completed and awarded and removed 🎯 Bounty This bounty is currently open labels Feb 9, 2024
@zaccherinij zaccherinij unpinned this issue Feb 9, 2024
@matthiasgeihs
Copy link

Will the winners be announced?

@zaccherinij
Copy link
Collaborator Author

@matthiasgeihs Yes we will share more information about winning solutions of S4 as soon as
next week! Cheers

@zaccherinij zaccherinij added the 🎯 Bounty This bounty is currently open label Feb 12, 2024
@aquint-zama
Copy link
Collaborator

Winners

🥇 1st place: A submission by JoseSK999
🥈 2nd place A submission by Tomtau
🥉 3rd place : A submission by M-Bln

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
💰 Awarded This project is now completed and awarded 🎯 Bounty This bounty is currently open 📁 TFHE-rs library targeted: TFHE-rs
Projects
Status: Awarded Contributions
Development

No branches or pull requests

10 participants