For parts 3 and 4, we’ll develop a method called summarizeURL:
def summarizeURL(url, total_pars):
    # Strip stray mis-encoded characters (e.g. "Â") left over from the HTML.
    url_text = getTextFromURL(url).replace(u"Â", u"").replace(u"â", u"")
    fs = FrequencySummarizer()
    # Replace newlines with spaces before summarizing.
    final_summary = fs.summarize(url_text.replace("\n", " "), total_pars)
    return " ".join(final_summary)
The method calls getTextFromURL (defined above) to retrieve the text and clean it of stray mis-encoded HTML characters, then replaces newline characters (\n) with spaces.
Next, we run the FrequencySummarizer algorithm on the cleaned text. The algorithm tokenizes the input into sentences and computes a term-frequency map of the words. The frequency map is then filtered to ignore very low-frequency and very high-frequency words. This discards both noisy words (such as determiners, which are very common but carry little information) and words that occur only rarely. To see the source code, head to GitHub.
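To make the idea concrete, here is a minimal, self-contained sketch of that frequency-based approach. It is not the FrequencySummarizer source itself: the sentence splitter is a simple regex stand-in for a proper tokenizer, and the low/high cutoff values are illustrative assumptions.

```python
import re
from collections import Counter
from heapq import nlargest

def frequency_summarize(text, n_sentences, low=0.1, high=0.9):
    # Split into sentences on ., !, ? (a stand-in for a real sentence tokenizer).
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    # Build a term-frequency map over lowercased words.
    words = re.findall(r'[a-z]+', text.lower())
    freq = Counter(words)
    max_freq = max(freq.values())
    # Normalize by the top count, then drop words whose relative frequency
    # is too high (noisy, determiner-like words) or too low (rare words).
    freq = {w: f / max_freq for w, f in freq.items()
            if low < f / max_freq < high}
    # Score each sentence by summing the frequencies of its words.
    scores = {}
    for i, s in enumerate(sentences):
        scores[i] = sum(freq.get(w, 0) for w in re.findall(r'[a-z]+', s.lower()))
    # Keep the top-ranked sentences, returned in document order.
    top = nlargest(n_sentences, scores, key=scores.get)
    return [sentences[i] for i in sorted(top)]
```

Returning the winning sentences in their original document order (rather than by score) keeps the summary readable.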
Finally, we return a list of the highest-ranked sentences, which is our final summary.
The full source code is available on GitHub.