Storage of Vectorisation and Effective Querying
A while back I got to play with vectorising terms using a transformer model and Facebook Faiss, and I think there is a much better way to store and query the results.
The idea is that each document (a block of unstructured text) within a corpus (a collection of documents) is tokenised (split into groups of terms). For example, if we had the following document:
The cat in the hat sat on the mat and drank milk from a jug. The child stared in alarm at the cat in the hat as that was his milk!
In Natural Language Processing the first step is to remove stop words. Stop words are the most commonly used words within a language; they typically join adjectives, nouns, etc., and so quickly dominate statistical analysis. For example, "the" is a stop word. Removing stop words from our example gives the following:
Cat hat sat mat drank milk from jug. Child stared alarm cat hat his milk
Now we want to convert this into a series of tokens. The token size is dependent on your documents (you don't want it to be larger than a sentence). I will pick 4 terms per token:
- cat hat sat mat
- hat sat mat drank
- sat mat drank milk
- etc.
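The steps above can be sketched in a few lines of Python. This is a toy illustration: the stop-word list is a tiny hand-picked subset (a real NLP pipeline would use a full list), and the sliding window emits overlapping 4-term tokens as in the bullets above.

```python
# Toy sketch: strip stop words, then emit overlapping 4-term tokens
# with a sliding window. STOP_WORDS is an illustrative subset only.
STOP_WORDS = {"the", "in", "on", "and", "a", "at", "as", "that", "was", "his"}

def tokenise(text: str, terms_per_token: int = 4) -> list[str]:
    words = [w.strip(".,!").lower() for w in text.split()]
    kept = [w for w in words if w and w not in STOP_WORDS]
    # Slide a window of terms_per_token words across the remaining terms.
    return [" ".join(kept[i:i + terms_per_token])
            for i in range(len(kept) - terms_per_token + 1)]

doc = "The cat in the hat sat on the mat and drank milk from a jug."
print(tokenise(doc)[:3])
# first tokens: 'cat hat sat mat', 'hat sat mat drank', 'sat mat drank milk'
```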
The next step is to feed them into a transformer model. A transformer model (e.g. Sentence-BERT) is trained on tokens and assigns each one a location in vector space. The location is known as a 'vector'.
Vector space is basically an N-dimensional space (e.g. 2048 dimensions), and we can think of each vector as an array of coordinates. The transformer model is trained to place tokens "near" each other based on similarity. So if we took these tokens:
- cat hat sat mat
- pet hat stood carpet

the model would place them close together in vector space, because they mean similar things.
This is where Facebook Faiss comes in: it holds all the tokens/vectors generated from a corpus and builds an index over them. So if we have a corpus that references pets and we generate the search token 'cat hat sat mat', Faiss recognises that the token "pet hat stood carpet" is similar and returns it.
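The "nearness" lookup can be illustrated without Faiss at all. In this sketch the 4-dimensional vectors are made up for illustration (a real system would get high-dimensional vectors from the transformer model and index them with Faiss); the point is that the closest vector wins.

```python
import math

# Made-up toy vectors standing in for transformer output.
corpus = {
    "pet hat stood carpet": [224, 445, 156, 499],
    "child stared alarm":   [900, 120, 700, 300],
}
query_vec = [223, 445, 123, 543]  # toy vector for 'cat hat sat mat'

def distance(a, b):
    # Euclidean distance between two vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

nearest = min(corpus, key=lambda tok: distance(query_vec, corpus[tok]))
print(nearest)  # 'pet hat stood carpet'
```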
The problem is that Faiss datastores can't be updated in place: you have to mount the tokens within a Docker image, after which Faiss starts up and begins training a model on the supplied data. Model training requires 8GiB-32GiB of storage, and manually iterating over files is computationally and storage I/O intensive.
There has to be a better way
Databases like PostgreSQL and MariaDB easily scale into millions of records and tend to perform quite well with numeric querying.
If we stored the token as a Unicode text string and each vector position as a field within a table, we would be able to update the data dynamically and our memory requirements would be drastically reduced. e.g.
| Token | VectorPos1 | VectorPos2 | ... | VectorPos2048 |
| --- | --- | --- | --- | --- |
| cat hat sat mat | 223 | 445 | 123 | 543 |
| pet hat stood carpet | 224 | 445 | 156 | 499 |
SELECT * FROM tokens WHERE VectorPos1 BETWEEN 221 AND 224
This should return a subset of data within the corpus; we can then calculate the distance between our token's vector and each result and reorder the results from closest to furthest.
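The filter-then-rerank step can be sketched as follows. The rows stand in for the result set of the `BETWEEN` query above (the same toy values as the table), and the field names mirror its columns; a real implementation would pull these from Postgres.

```python
import math

# Stand-in for the rows returned by the coarse SQL filter.
rows = [
    {"token": "pet hat stood carpet", "vec": [224, 445, 156, 499]},
    {"token": "cat hat sat mat",      "vec": [223, 445, 123, 543]},
]
query_vec = [223, 445, 123, 543]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Re-rank the coarse result set by true distance to the query vector.
ranked = sorted(rows, key=lambda r: euclidean(query_vec, r["vec"]))
print([r["token"] for r in ranked])
# ['cat hat sat mat', 'pet hat stood carpet']
```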
This lets us get around the need to initialise Faiss, means we don't need to create a new Faiss instance as new data is supplied, and means we lose the machine-learning model's RAM requirements.
But this is Big Data!
In data engineering we typically look to break the data down into smaller subsets, ideally in ways which aren't computationally expensive. If we keep with our Facebook example, a user represents a good 'corpus'.
Facebook has a limit of 25 posts per day and a post size limit of 33,000 characters, and Facebook was founded in 2004. This means the worst-case scenario is a user who has posted 173,375 times, resulting in 5,721,375,000 characters of text. The average English word length is 4.7 characters (rounded to 6 to allow for spaces), which gives 953,562,500 words.
The average number of words in a sentence is 15, so we set our token size to roughly half that (7 words per token). This results in 136,223,215 tokens.
If we assume a vector space of 2048 dimensions, the Postgres 'smallint' type can hold each position in 2 bytes, and there are 2048 smallints for each token. This results in 557,970,288,640 bytes of data needed to hold the vectors, or roughly 519 GiB.
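The worst-case arithmetic above can be checked step by step. The 19-year span is an assumption chosen to match the 173,375-post figure (25 posts/day since 2004, ignoring leap days).

```python
posts_per_day = 25
chars_per_post = 33_000
days = 19 * 365               # assumed span: 2004 onwards, no leap days

posts = posts_per_day * days
chars = posts * chars_per_post
words = chars // 6             # ~4.7 chars/word, rounded to 6 with spaces
tokens = -(-words // 7)        # 7 words per token, rounded up
vector_bytes = tokens * 2048 * 2  # 2048 smallint positions, 2 bytes each

print(posts, chars, words, tokens, vector_bytes)
# 173375 5721375000 953562500 136223215 557970288640  (~519 GiB)
```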
This sounds like a lot, but Postgres is designed to handle up to 32 TiB of data, and this is our worst possible case.
So if we leverage cloud technologies (e.g. a database-as-a-service like RDS, or Kubernetes), we can dynamically create a database for each user, and each database should be fairly responsive when querying.
The nice part here is that we have made it easy to search across users concurrently: we can have separate queries running for each user that the searching user is 'friends' with.
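A minimal sketch of that fan-out, assuming one queryable store per user. `query_user_db` is a hypothetical stand-in for a real per-database query; here it just searches an in-memory dict so the example is self-contained.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for one database per user.
USER_DBS = {
    "alice": ["cat hat sat mat"],
    "bob":   ["pet hat stood carpet"],
}

def query_user_db(user: str, term: str) -> list[str]:
    # Hypothetical per-user query; a real version would hit that user's DB.
    return [t for t in USER_DBS[user] if term in t]

friends = ["alice", "bob"]
with ThreadPoolExecutor() as pool:
    # One query per friend, running concurrently.
    results = list(pool.map(lambda u: (u, query_user_db(u, "hat")), friends))
print(results)
```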
We can do more..
We can then shunt all other records into a batch-querying platform like Hadoop. Vectors, being sequences of numbers, should suit accumulator-based MapReduce jobs, which are fairly efficient. Although once we move to a clustered design, it probably makes sense to run large clusters holding hundreds or thousands of users.
Thoughts?