Storage of Vectorisation and Effective Querying
A while back I got to play with vectorisation of terms using a transformer model and Facebook Faiss, and I think there is a much better way to go about it. The idea is that each document (a block of unstructured text) within a corpus (a collection of documents) is tokenised (split into groups of terms). For example, suppose we had the following document:

"The cat in the hat sat on the mat and drank milk from a jug. The child stared in alarm at the cat in the hat as that was his milk!"

In Natural Language Processing the first step is to remove stop words. Stop words are commonly used words within a language; they typically join adjectives, nouns and so on, and so quickly come to dominate any statistical analysis. For example, "the" is a stop word. Removing the stop words from our example gives the following (a short code sketch of this step appears at the end of this section):

"Cat hat sat mat drank milk from jug. Child stared alarm cat hat his milk"

Now we want to convert this into a series of tokens. The token size is dependent on your document (you don't want it to be larger than ...
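To make the stop-word step concrete, here is a minimal Python sketch. The tiny STOP_WORDS set is a hand-rolled assumption chosen to reproduce the example above; a real pipeline would use a fuller list (for instance NLTK's or spaCy's English stop words).

```python
import re

# Hand-rolled stop-word list, kept deliberately small so the output matches
# the worked example above. A real pipeline would use a fuller list.
STOP_WORDS = {"the", "in", "on", "and", "a", "at", "as", "that", "was"}

def remove_stop_words(document: str) -> str:
    """Return the document with stop words stripped out."""
    terms = re.findall(r"[A-Za-z']+", document)          # split into word terms
    kept = [t for t in terms if t.lower() not in STOP_WORDS]
    return " ".join(kept)

doc = ("The cat in the hat sat on the mat and drank milk from a jug. "
       "The child stared in alarm at the cat in the hat as that was his milk!")
print(remove_stop_words(doc))
# -> "cat hat sat mat drank milk from jug child stared alarm cat hat his milk"
# (the exact output depends on which stop-word list you choose)
```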