Similarity Measure Tool
Text similarity determines how ‘close’ the text or keywords of two documents are. Similarity measure functions are used to calculate the similarity of two documents.
A similarity measure is a function that assigns a real number between 0 and 1 to a pair of documents. A value of zero means the documents are completely dissimilar, whereas a value of one indicates that they are practically identical.
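- Cosine similarity calculates similarity by measuring the cosine of the angle between two vectors. This is calculated as: cos(θ) = (A · B) / (‖A‖ ‖B‖), where A and B are the two document vectors, A · B is their dot product, and ‖A‖ and ‖B‖ are their magnitudes.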
- With cosine similarity, we first need to convert sentences into vectors. One way to do that is to use bag of words with either TF (term frequency) or TF-IDF (term frequency-inverse document frequency). The choice between TF and TF-IDF depends on the application and is immaterial to how cosine similarity itself is computed, since it only needs vectors. TF generally works well for text similarity, whereas TF-IDF is better suited to search query relevance.
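The point that cosine similarity only needs vectors can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the whitespace tokenizer, the helper names (`tokenize`, `build_vocab`, `tf_vector`, `tfidf_vector`, `cosine`), the sample sentences, and the smoothed IDF weighting, which is only one of several common variants.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive tokenizer (an assumption): lowercase and split on whitespace.
    return text.lower().split()


def build_vocab(docs):
    # Shared, sorted vocabulary so every vector uses the same dimensions.
    return sorted({tok for doc in docs for tok in tokenize(doc)})


def tf_vector(doc, vocab):
    # Bag of words: raw count of each vocabulary term in the document.
    counts = Counter(tokenize(doc))
    return [counts[term] for term in vocab]


def tfidf_vector(doc, vocab, docs):
    # TF weighted by a smoothed IDF, log((1 + N) / (1 + df)) + 1.
    # This particular weighting is one common variant, chosen for illustration.
    n = len(docs)
    token_sets = [set(tokenize(d)) for d in docs]
    vec = []
    for term, freq in zip(vocab, tf_vector(doc, vocab)):
        df = sum(1 for tokens in token_sets if term in tokens)
        idf = math.log((1 + n) / (1 + df)) + 1
        vec.append(freq * idf)
    return vec


def cosine(a, b):
    # cos(theta) = (A . B) / (|A| |B|); returns 0.0 for an all-zero vector.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


docs = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "stock prices fell sharply",
]
vocab = build_vocab(docs)

# The same cosine function works on either representation.
tf_sim = cosine(tf_vector(docs[0], vocab), tf_vector(docs[1], vocab))
tfidf_sim = cosine(
    tfidf_vector(docs[0], vocab, docs),
    tfidf_vector(docs[1], vocab, docs),
)
unrelated_sim = cosine(tf_vector(docs[0], vocab), tf_vector(docs[2], vocab))
```

In this sketch the TF-IDF score for the first two sentences comes out lower than the TF score, because the words they share also appear in more documents and so receive less weight than the words unique to each sentence.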
Steps to calculate cosine similarity
Step 1: Calculate the term frequency of each document using bag of words.
Step 2: The main issue with term frequency counts is that they favor longer documents or sentences. One way to solve this is to normalize each term frequency vector by its magnitude, which is obtained by summing the squares of the frequencies and taking the square root.
Step 3: Because the two vectors have already been normalized to a length of 1, the cosine similarity can be calculated as their dot product.
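The three steps can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function names (`term_frequencies`, `l2_normalize`, `dot`), the whitespace tokenization, and the two example sentences are all assumptions. The second sentence simply repeats the first, to show how normalization removes the bias toward longer text.

```python
import math
from collections import Counter


def term_frequencies(sentence, vocab):
    # Step 1: bag-of-words term frequencies over a shared vocabulary.
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocab]


def l2_normalize(vec):
    # Step 2: divide each frequency by the vector's magnitude
    # (square root of the sum of squared frequencies).
    magnitude = math.sqrt(sum(x * x for x in vec))
    if magnitude == 0:
        return list(vec)  # leave an all-zero vector unchanged
    return [x / magnitude for x in vec]


def dot(a, b):
    # Step 3: dot product of two equal-length vectors.
    return sum(x * y for x, y in zip(a, b))


s1 = "data science is fun"
s2 = "data science is fun data science is fun"  # same content, twice as long

vocab = sorted(set(s1.split()) | set(s2.split()))
tf1 = term_frequencies(s1, vocab)
tf2 = term_frequencies(s2, vocab)

# Without normalization the score grows with sentence length.
raw_score = dot(tf1, tf2)

# With unit-length vectors the dot product is the cosine similarity.
similarity = dot(l2_normalize(tf1), l2_normalize(tf2))
```

Here the raw dot product grows with the length of the second sentence, while the normalized dot product stays at 1 because both sentences point in the same direction in term space.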
Flow of the cosine similarity measure