From raw data to embeddings in vector stores
Embeddings convert any form of data (text, images, or audio) into vectors of real numbers, so a document becomes a vector. These mathematical representations of documents allow us to calculate the distances between documents and retrieve similar data.
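To make the idea of distance between documents concrete, here is a minimal sketch that ranks a few documents by cosine similarity to a query. The three-dimensional vectors and the document names are made up for illustration; real embeddings come from a model and have hundreds or thousands of dimensions. Cosine similarity is one common choice of measure, not the only one.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings (assumption: invented values, not model output).
doc_vectors = {
    "space exploration": [0.9, 0.1, 0.0],
    "rocket launches": [0.8, 0.2, 0.1],
    "baking bread": [0.0, 0.2, 0.9],
}

# Hypothetical embedding of a query about space travel.
query = [0.85, 0.15, 0.05]

# Rank documents from most to least similar to the query.
ranked = sorted(
    doc_vectors,
    key=lambda name: cosine_similarity(query, doc_vectors[name]),
    reverse=True,
)

for name in ranked:
    print(f"{name}: {cosine_similarity(query, doc_vectors[name]):.3f}")
```

With these toy values, the two space-related documents score close to 1 and the baking document ranks last, which is the behavior retrieval relies on: vectors that point in similar directions represent similar content.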
The raw data (books, articles, blogs, pictures, or songs) is first collected and cleaned to remove noise. The prepared data is then fed into a model such as OpenAI text-embedding-3-small, which will embed the data. Activeloop Deep Lake, for example, which we will implement in this chapter, will break a text down into chunks of a pre-defined number of characters. The size of a chunk could be 1,000 characters, for instance. We can also let the system optimize these chunks, as we will do in the Optimizing chunking section of the next chapter. These chunks of text make it easier to process large amounts of data and provide more detailed embeddings of a document, as...