Clustering and similarity - retrieving documents of interest
The main questions are how we measure similarity and how we query over a collection of articles. If you think about it, we need some kind of scale (or a model, if you like) to decide whether a specific document is close enough to our selected article to count as similar.
Perhaps the simplest way to measure similarity between articles is to count the words they have in common. We can create an object for each document in which each word in the document is a key and the number of occurrences of that word in the document is the value:
similarity-factor: { word1: 5 times, word2: 3 times, ...}
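A minimal Python sketch of such an object follows. The function name `similarity_factor` simply mirrors the label above, and the tokenization rule (lowercase runs of letters) is an assumption; the text does not say how words are split or normalized.

```python
import re
from collections import Counter

def similarity_factor(text):
    """Return a dict mapping each word in text to how often it occurs."""
    # Assumption: words are runs of letters, compared case-insensitively.
    words = re.findall(r"[a-z]+", text.lower())
    return dict(Counter(words))

# Placeholder text for illustration, not taken from any real article.
print(similarity_factor("the cat and the hat"))
# {'the': 2, 'cat': 1, 'and': 1, 'hat': 1}
```

Punctuation and letter case are discarded here so that "The" and "the" land on the same key; a real pipeline might also drop very common words, which this sketch does not attempt.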
Then we can keep an array of those objects, one per document, and use it as a factor for measuring similarity between two documents. For example, take the following paragraph (from CNN news):
"Billy Bush ashamed of Donald Trump tape. Angry comments piled up on Billy Bush's Facebook page …"
--CNN
Then we organize it into a data structure like the following figure:
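As a rough code counterpart to that structure, the Python sketch below builds one word-count object per document and compares two of them. Several things here are assumptions rather than the author's method: the helper names `word_counts` and `shared_word_count` are hypothetical, possessives such as "Bush's" are folded into "bush", and, purely for illustration, the two visible sentences of the excerpt are treated as two separate documents.

```python
import re
from collections import Counter

def word_counts(text):
    """Map each lowercased word in text to its number of occurrences."""
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    # Assumption: fold possessives such as "bush's" into "bush".
    words = [w[:-2] if w.endswith("'s") else w for w in words]
    return Counter(words)

def shared_word_count(a, b):
    """Count the words two documents have in common, respecting repeats."""
    # Counter & Counter keeps each shared word with the smaller of its two counts.
    return sum((a & b).values())

# The two visible sentences of the excerpt, used as stand-in documents.
sentences = [
    "Billy Bush ashamed of Donald Trump tape.",
    "Angry comments piled up on Billy Bush's Facebook page",
]

# One word-count object per document, as described in the text.
index = [word_counts(s) for s in sentences]

print(shared_word_count(index[0], index[1]))  # 2 ("billy" and "bush")
```

A raw shared-word count favors long documents, since they contain more words overall, so it is best read as a starting point rather than a finished similarity measure.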
Looking...