TLSH - Technical Overview

by Jonathan Oliver
12 / April / 2021.

Introduction

Intended Audience

In this page, we explain the technical details behind TLSH. We hope that this content is useful to researchers / academics and security experts who wish to understand the technical details for TLSH. In addition, we offer technical rationales for the design decisions in TLSH.

List of Papers/Presentations

This overview is based on content from the following conference presentations:

Title	Conference	Topic
TLSH--a locality sensitive hash	CTC 2013	Defines TLSH. Including (i) How to calculate TLSH for a file and (ii) How to calculate the distance between two TLSH digests.
Using randomization to attack similarity digests	ATIS 2014	Does experiments attacking TLSH, SSDEEP and SDHASH.
HAC-T and fast search for similarity in security	COINS 2020	Considers how you do Nearest Neighbour search for matches to a TLSH digest. Uses the Nearest Neighbour searech for a fast clustering algorithm (HAC-T).
Scalable Malware Clustering using Multi-Stage Tree Parallelization	ISI 2020	Scalable Clustering of TLSH digests.
Designing the Elements of a Fuzzy Hashing Scheme	Presented at EUC/Trustcom 2021	Explains the design of TLSH focussing on why TLSH uses the parameters it does.
Fast Clustering of High Dimensional Data Clustering the Malware Bazaar Dataset	Needs to be submitted	Detailed explanation of clustering using the TLSH distance score and the HAC-T algorithm. Analysis of the Malware Bazaar dataset.

TSLH Technical Overview

TLSH is a fuzzy hash. At first glance it would appear to be similar to other fuzzy hashes like SSDEEP and SDHASH. When selecting a fuzzy hash, some of the criteria that should be considered are:

1. Accuracy at identifying similar content,
2. Robustness to attack,
3. Time to calculate digest,
4. Time to perform a nearest neighbor search,
5. Time and suitability for large scale clustering, and
6. Size of the digest (fixed size is preferred).

According to these criteria, TLSH rates well:

1. TLSH has been evaluated a number of times for peer reviewed conferences, and it rates very well.
2. Robustness to adversarial attack is an area where TLSH excels in comparison to other fuzzy hashing schemes. We will discuss in further detail in the coming weeks.
3. The time to calculate the TLSH digest is very fast. You can see timing compirison at Version 3.11.0 of the Change History.
4. TLSH is very fast at nearest neighbor search at scale. This is due to TLSH being a distance metric (as per the mathematical definition) and hence has logarithmic search times. Note: other fuzzy hashes such as SDHASH and SSDEEP do not have this property We will discuss further below.
5. Large scale clustering is another area where TLSH excels. The logarithmic search can be used to perform hierarchical clustering in O(N log N) time.
6. TLSH has a fixed length 70 hex character digest with a "T1" prefix versioning string at the start for a total of 72 characters.

TLSH is different from other fuzzy hashes in the following ways:

Nearly all fuzzy hashes use a similarity score. TLSH uses a distance score. Using a distance score, and in particular obeying the triangle inequality (and hence being a distance metric) enables fast nearest neighbour search and scalable clustering.
Nearly all fuzzy hashes use byte sequences (ngram like). TLSH uses a k-skip ngrams (ngrams with holes in them). Using k-skip ngrams significantly reduces the attack surface of the hash.

Fast Nearest Neighbour Search and Scalable Clustering

The key to nearly all fast search methods is to use an index which will typically take the form of a binary tree. An example TLSH tree is shown in figure 1.

Figure 1. A TLSH Tree

Each node has a TLSH node and a distance threshold. The meaning of the root node from Figure 1 (with contents TLSH_1 and threshold T=225) is:

All TLSH values in the left subtree have a distance < 225 from TLSH_1
All TLSH values in the right subtree have a distance >= 225 from TLSH_1

If we are searching for the nearest neighbour for a search TLSH value (S), then we can traverse a TLSH tree and find a leaf using the Search algorithm in Figure 2 (from HAC-T and fast search for similarity in security).

Figure 2. Search Algorithm

As described in that paper, there are 2 approaches to using TLSH for fast search:

A backtrack policy can be used to guarantee that the closest item(s) in the tree are found; or
A forest of such trees can be built to arrive at an approximate nearest neighbour search. Experiments have found that using forests with of the order of 10 trees achieves good results.

Scalable Clustering

Consider clustering a data set D with N samples in it with a Cluster Distance CDist (CDist means that we will merge clusters which have items within distance CDist of each other).

Normally hierarchical clustering is an infeasible approach to use for clustering since it requires a pre-processing step of finding the closest neighbour for each item which is O(N²).
It is possible to use the fast search capabilities of a TLSH tree (or forest) to do fast hierarchical clustering. Using the TLSH tree, we can find the closest neighbours in O(log N). Thus to find the closest item for all N items is a O(N log N) operation which scales well to large datasets.
We have called the algorithm to do this HAC-T and it is shown in Figure 3 below.

Figure 3. The HAC-T Clustering Algorithm

Using trees and the HAC-T algorithm for other Similarity Digests

The preceeding sections lead us to ask the question, "Why is not a similar approach used for other fuzzy hashes such as SSDEEP and SDHASH?". The reason is that (for the vast majority of datasets) similarity measures will identify a small number of items as having a high similarity score and everything else has a similarity score of 0 (or the score which defines "not similar"). So if such similarity based methods are used to build a tree, it will be unbalanced as shown in Figure 4:

Figure 4. Similarity measures lead to unbalanced trees

Once the tree is unbalanced, the search reduces to being like a linear search through a linked list. This is not a scalable solution.

In addition, to really use the capabilities of such a tree for fast search, the criteria being used should be a distance metric to enable the use of Vantage Point Trees.

But isn't a Similarity Score just a Negative Distance Score?

In many papers where authors wished to compare and evaluate TLSH with similarity digests such as SSDEEP and SDHASH, authors would negate the TLSH distance in an attempt to make the score compatible using functions such as:
Similarity(TLSH_1, TLSH_2) = 100 - distance(TLSH_1, TLSH_2) and if the Simiarity was negative then treat it as zero.

We suggest that cutting off scores in this way should be avoided (or only used in a careful manner). This is because fast search and fast clustering are such valuable capabilities.

Robustness to Attack

During the design of TLSH, we made attacking the hash a part of the design process. We give more extensive details in 2 papers,

and will summarise it here.

We considered a range of file types:

text files,
executable files,
source code,
HTML files and
image files.

We considered modifications which were consistent with these files types. So for text files, we looked at changes such as changing the case of letters, changing how lines were split and substituting one word for another. For executable files, we considered changes such as changing the ordering of functions, swapping the arrangement of IF statements, and adding NO-OPs. For image files, we considered changes such as cropping the image, adding small dots and rotating the image. We implemented programs which would automatically modify content in these ways and/or collected data sets where adversaries had used such techniques.

At the high level, we optimised the various parameters of TLSH to be resistant to attack. We will now dig deeper to explain this.

Deep Dive into Robustness

Quick review of the design of TLSH

There are 2 key components for designing a fuzzy hash:

The algorithm to calculate the hash
The form and parameters scoring function that scores the distance between 2 hashes.

Algorithm to Calculate TLSH

The algorithm takes the typical form of a locality sensitive hash. It has 3 notable elements:

TLSH uses kskip-ngrams rather than ngram like features;
TLSH uses a 4 valued vector rather than a 2 valued vector; and
TLSH puts the feature counts into a vector with 128 buckets.

The choice of a 4 valued vector and 128 bucket counts was to optimise a ROC curve.

Figure 5. ROC Curve

Note: optimising the parameters using the area under a ROC curve means that the method will work with multiple thresholds enabling the fast search described above.

The Choice of Kskip-ngram

We chose to use kskip-ngrams in TLSH because it resulted in an overall design which was as accurate an ngram approaches on baseline data, but was far more resistant to attack when we applied the automated attack methods (as described in Designing the Elements of a Fuzzy Hashing Scheme). we also analysed the computational cost of using kskip-ngrams which grows very quickly as k and n grow. We determined that n=5 and k=2 resulted in an overall design which is both computationally efficient and highly resistant to attack. The resultant features consist of triplets of bytes out of a window of size 5 (features with holes in them) results in an overall design where the impact of minor changes to bytes is reflected in only minor changes in score.

The Scoring Function

The Form of the Scoring Function

We choose a scoring function which takes the form of a distance metric and not similarity measure so that we could achieve the fast search and scalable clustering as described above.

The Parameters of the Scoring Function

The scoring function has quite a few parameters which people have questioned why they take the value they do (such as the length scoring parameter of 12 and the various distance parmeters).
We choose the parameters of the scoring function (such as 12 for the score per length difference) based on optimising the ROC curve shown above in Figure 5 above. (further details in this paper).

Conclusion

TLSH was designed based on the criteria at the top of this article.

Most parameters were selected to optimise (1) accuracy;
Kskip-ngrams were selected to make the scheme (2) robust to attack;
We selected a distance metric and used an AUC-ROC optimisation procedure to give us (4) fast nearest neighbor search and (5) scalable clustering.

Appendix Summary of Papers

The CTC 2013 paper gives the algorithms for (i) calculating a TLSH hash, and (ii) calculating the distance between two TLSH hashes.
The ATIS 2014 paper looks at evading TLSH, SSDEEP and SDHASH. This paper looks at the effectiveness of these similarity digests at identifying files when the content of the file is deliberately changed. The paper looks at multiple files types including binary executables, image files, source code and HTML files. For the SSDEEP and SDHASH digest schemes, we were able to evade the scheme in a fairly straight forward way. In particular, we were able to construct very short SED scripts which would break these schemes for source code (1 line SED script) and HTML files (4 line SED script), while maintaining the orginal functionality of the file. TLSH proved a lot harder to break.
We sent a responsible disclosure to the authors of these schemes before the paper was published.