Storage of Vectorisation and Effective Querying

A while back I got to play with vectorisation of terms using a transformer model and Facebook Faiss, and I think there is a much better way.

The idea is that each document (a block of unstructured text) within a corpus (a collection of documents) is tokenised (split into groups of terms). For example, take the following document:

The cat in the hat sat on the mat and drank milk from a jug. The child stared in alarm at the cat in the hat as that was his milk!

In Natural Language Processing the first step is to remove stop words. Stop words are commonly used words within a language. They are typically used to join adjectives, nouns, etc., and so quickly dominate statistical analysis. For example "the" is a stop word. Removing stop words from our example gives the following:

Cat hat sat mat drank milk jug. Child stared alarm cat hat milk

Now we want to convert this into a series of tokens. The token size depends on your documents (you don't want it to be larger than a sentence). I will pick 4 terms per token:
  • cat hat sat mat
  • hat sat mat drank
  • sat mat drank milk
  • etc.
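The steps so far (stop-word removal, then a sliding window of terms) can be sketched in a few lines of Python. The stop-word list here is a tiny illustrative subset, not a real library's list:

```python
# Toy stop-word list for illustration only; real NLP toolkits ship much
# larger, curated lists.
STOP_WORDS = {"the", "in", "on", "and", "a", "from", "at", "as", "that", "was", "his"}

def tokenise(text: str, terms_per_token: int = 4) -> list[str]:
    terms = [t.strip(".,!").lower() for t in text.split()]
    terms = [t for t in terms if t and t not in STOP_WORDS]
    # Sliding window: each token overlaps the previous one by all but one term.
    return [" ".join(terms[i:i + terms_per_token])
            for i in range(len(terms) - terms_per_token + 1)]

tokens = tokenise("The cat in the hat sat on the mat and drank milk from a jug.")
# First tokens: 'cat hat sat mat', 'hat sat mat drank', 'sat mat drank milk'
```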

The next step is to feed them into a transformer model. A transformer model (e.g. Sentence-BERT) is trained on tokens and assigns each one a location in vector space. The location is known as a 'vector'.

Vector space is basically an N-dimensional array (e.g. a 2048-dimensional array). We can think of vectors as array positions. The transformer model is trained to place tokens "near" each other based on similarity. So if we took these tokens:

  • cat hat sat mat
  • pet hat stood carpet
A "cat" is a "pet" and a "mat" is related to a "carpet", so the transformer model should place both tokens near each other within vector space.
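"Near" here is just a distance metric over the vectors. A minimal sketch with made-up 4-dimensional vectors (a real transformer would output hundreds or thousands of dimensions, and these numbers are invented for illustration):

```python
import math

def euclidean(a: list[float], b: list[float]) -> float:
    """Straight-line distance between two points in vector space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Made-up toy vectors; the values carry no real meaning.
cat_hat = [223, 445, 123, 543]
pet_hat = [224, 445, 156, 499]
dog_car = [810, 12, 700, 33]

# Similar tokens sit closer together than unrelated ones.
assert euclidean(cat_hat, pet_hat) < euclidean(cat_hat, dog_car)
```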

This is where Facebook Faiss comes in: it holds all the tokens/vectors generated from a corpus and builds an index. So if we have a corpus that references pets and we generate the search token 'cat hat sat mat', Faiss recognises that the token "pet hat stood carpet" is similar and returns it.

The problem is that the Faiss datastores I was working with couldn't be updated: you have to mount the tokens within a Docker image, then Faiss starts up and begins training an index on the supplied data. Index training requires 8-32 GiB of storage, and manually iterating over files is computationally and I/O intensive.

There has to be a better way

Databases like PostgreSQL and MariaDB easily scale into millions of records and tend to perform quite well with numeric querying.

If we stored the token as a Unicode text string and each array position as a field within a table, we would be able to dynamically update the data and our memory requirements would be drastically reduced. e.g.

Token                  VectorPos1  VectorPos2  ...  VectorPos2048
cat hat sat mat        223         445         123  543
pet hat stood carpet   224         445         156  499
 
SQL has the concept of BETWEEN, so once we have 'vectorised' a token we can construct a query by supplying a range around each array position, e.g.
 
SELECT * FROM tokens WHERE VectorPos1 BETWEEN 221 AND 224

This should return a subset of the corpus. We can then calculate the distance between our token's vector and each result, and order the results from closest to furthest.
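A toy end-to-end version of this scheme, using SQLite in place of Postgres/MariaDB and 4 vector positions instead of 2048 (the table name, margin, and vector values are all made up for illustration):

```python
import math
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE tokens (
    token TEXT,
    VectorPos1 INTEGER, VectorPos2 INTEGER,
    VectorPos3 INTEGER, VectorPos4 INTEGER)""")
db.executemany("INSERT INTO tokens VALUES (?, ?, ?, ?, ?)", [
    ("cat hat sat mat",      223, 445, 123, 543),
    ("pet hat stood carpet", 224, 445, 156, 499),
    ("dog car flew moon",    810,  12, 700,  33),
])

query_vec = [222, 446, 130, 520]  # the 'vectorised' search token
margin = 50                       # allowed range around each position

# Candidate fetch: one BETWEEN clause per vector position.
where = " AND ".join(f"VectorPos{i + 1} BETWEEN ? AND ?" for i in range(4))
params = [bound for v in query_vec for bound in (v - margin, v + margin)]
rows = db.execute(f"SELECT * FROM tokens WHERE {where}", params).fetchall()

# Re-rank the candidates from closest to furthest.
def distance(row):
    return math.sqrt(sum((row[i + 1] - query_vec[i]) ** 2 for i in range(4)))

rows.sort(key=distance)
# 'cat hat sat mat' ranks first; 'dog car flew moon' never matches the ranges.
```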

This gets around the need to initialise Faiss, means we don't need to create a new Faiss instance as new data is supplied, and means we lose the machine learning model's RAM requirements.

But this is Big Data!

In data engineering we typically look to break the data down into smaller subsets, ideally in ways which aren't computationally expensive. If we keep with our Facebook example, a user represents a good 'corpus'.

Facebook has a limit of 25 posts per day and a post size limit of 33,000 characters, and Facebook was founded in 2004. This means the worst-case scenario is a user who has posted 173,375 times, resulting in 5,721,375,000 characters of text. The average English word length is 4.7 characters (rounded up to 6 to include a space), which gives 953,562,500 words.

The average number of words in a sentence is 15, so we set our tokens to roughly half that (7 words per token). This results in 136,223,215 tokens.

If we assume a vector space of 2048 dimensions, the Postgres 'smallint' type can hold each position in 2 bytes, and there are 2048 smallints for each token. This results in 557,970,288,640 bytes of data needed to hold all the vectors, or roughly 519 GiB.
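The worst-case arithmetic can be checked in a few lines (the 6,935 days is inferred from the post count, i.e. roughly 19 years after Facebook's 2004 founding):

```python
# Worst-case storage arithmetic from the text.
posts = 6_935 * 25               # 25 posts/day limit over ~19 years
chars = posts * 33_000           # 33,000-character post limit
words = chars // 6               # ~4.7 chars/word, rounded to 6 with a space
tokens = -(-words // 7)          # 7 words per token, rounded up
bytes_needed = tokens * 2048 * 2 # 2048 smallint (2-byte) positions per token

assert posts == 173_375
assert words == 953_562_500
assert tokens == 136_223_215
assert bytes_needed == 557_970_288_640  # ~519 GiB
```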

This sounds like a lot, but Postgres is designed to handle up to 32 TiB of data per table, and this is our worst possible case.

So if we leverage cloud technologies (e.g. Database as a service like RDS, or Kubernetes), we can dynamically create a database for each user and each database should be fairly responsive when querying.

The nice part here is we have made it easy to search across users concurrently: we can have a separate query running for each user the searching user is 'friends' with.

We can do more..

Our example is open to additional improvements. It is highly likely that when a user searches they are interested in their latest posts only; searches further back in history can be slower.

If we decided to only store the last 2 years in a Postgres/MariaDB instance, we cut the number of tokens down to 14,339,286, or 54.7 GiB. This is a fairly typical size for a Postgres/MariaDB instance.
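The same arithmetic as the worst case, rerun for a 2-year window:

```python
# Two-year window: 25 posts/day over ~730 days, same chain as before.
posts = 2 * 365 * 25
words = (posts * 33_000) // 6    # chars -> words as before
tokens = -(-words // 7)          # 7 words per token, rounded up
gib = tokens * 2048 * 2 / 2**30  # 2048 two-byte smallints per token

assert tokens == 14_339_286
assert round(gib, 1) == 54.7
```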
 
We can then shunt all older records into a batch-querying platform like Hadoop. Vectors, being sequences of numbers, should suit accumulator-style MapReduce jobs, which are fairly efficient. Although once we move to a clustered design, it probably makes sense to run large clusters holding hundreds or thousands of users.

Thoughts?
