Visualizing Your Data from MongoDB to Word Cloud with Python 3
31 July, 2020
by Xiaonan
Introduction
Word clouds have become increasingly popular because they are a straightforward and efficient way to communicate data. Their main advantage is that they spare readers the effort of interpreting complex diagrams, which usually require some prerequisite knowledge to digest.
A single rule accounts for the word cloud's simplicity: text size. The bigger the text, the more important the word. This obvious feature means that people from a variety of backgrounds can understand the content easily.
In this article, we introduce an example of converting data collected from a toddler's daily activities into a word cloud. The word cloud presents the toddler's activities in a way that parents and carers can easily take in and then use to shape how they educate the child.
Before diving into the technical details, let us check out the result first.
The word clouds above are based on teachers' comments about a child over a fixed period of time. The left cloud uses the original language of the comments and the right cloud is its Chinese translation. Words in a larger font highlight what the child focuses on and how the child communicates feelings with others.
Technology Stack
- MongoDB : Store collected data
- Python 3 : Process data and generate word cloud images
- Libraries : spaCy, NumPy, WordCloud, googletrans
Connecting to MongoDB
MongoDB Atlas, a managed MongoDB cloud service, is chosen because the M0 Sandbox is free of charge and it saves time for deploying and maintaining a local database.
The connection details can be put into environment variables and read from Python as follows. Using environment variables is good practice in case the application later needs to be deployed to the cloud as a service with Docker and Kubernetes.
import os

MONGO_USERNAME = os.getenv('MONGO_USERNAME')
MONGO_PASSWORD = os.getenv('MONGO_PASSWORD')
MONGO_URL = os.getenv('MONGO_URL')
Then the MongoDB client needs to be initialized. All the MongoDB-related code can be put into a separate Python file to keep it organized and reusable.
from pymongo import MongoClient

client = MongoClient('mongodb+srv://{}:{}@{}/admin?retryWrites=true&w=majority'
                     .format(MONGO_USERNAME, MONGO_PASSWORD, MONGO_URL))
# Placeholder; replaced with a real database handle at startup
db = {}
In this example the db will be initialized from the command line so it is left empty here. A simple function for reading the collections is added as follows.
def find_document(collection, query):
    documents = []
    cursor = db[collection].find(query)
    for document in cursor:
        documents.append(document)
    return documents
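Usernames and passwords that contain characters such as @, : or / have to be percent-encoded before they are placed in a MongoDB connection string. The following is a minimal sketch using only the standard library. The helper name build_mongo_uri and the sample credentials are made up for illustration and are not part of pymongo or the original project.

```python
from urllib.parse import quote_plus


def build_mongo_uri(username, password, host):
    """Build a mongodb+srv URI with percent-encoded credentials.

    build_mongo_uri is a hypothetical helper, not part of pymongo.
    """
    return 'mongodb+srv://{}:{}@{}/admin?retryWrites=true&w=majority'.format(
        quote_plus(username), quote_plus(password), host)


# Made-up values for illustration only.
uri = build_mongo_uri('app_user', 'p@ss:word/1', 'cluster0.example.mongodb.net')
print(uri)
```

The resulting string can be passed to MongoClient in place of the formatted string used when the client is created.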
Retrieve the data from the main script
def load_data(collection, data_entry):
    docs = MongoRepository.find_document(collection, {})
    details = []
    for doc in docs:
        details.append(doc[data_entry])
    return details
The collection is the MongoDB collection name and the data_entry is the key whose values are extracted. Here is a piece of sample data. The data_entry used in this example is detail.
{
  "date": "2020-03-02",
  "author": "Mathew Murray",
  "title": "Big Sing",
  "detail": "In our session with Mrs Clark of Big Sing,our children had lots of fun singing to their favourite songs and some new song,and performing a variety of dance style as the music required so."
}
Tokenizing Words
In the previous section, the paragraphs are loaded into an array. Streaming can be considered if the data size is huge.
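One way to stream is to yield values lazily from the cursor rather than collecting every document in a list first. The sketch below uses a plain list of dictionaries to stand in for a pymongo cursor, which is also iterable. The name iter_field and the sample documents are assumptions made for illustration.

```python
def iter_field(cursor, data_entry):
    """Yield the data_entry value of each document, one at a time.

    Documents that lack the key are skipped instead of raising KeyError.
    """
    for document in cursor:
        if data_entry in document:
            yield document[data_entry]


# A list stands in for db[collection].find({}) in this illustration.
fake_cursor = [
    {'title': 'Big Sing', 'detail': 'Singing and dancing.'},
    {'title': 'No detail here'},
    {'title': 'Painting', 'detail': 'Exploring paint.'},
]
details = list(iter_field(fake_cursor, 'detail'))
print(details)
```

Because iter_field is a generator, a caller can process each paragraph as it arrives and never needs to hold the whole collection in memory.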
The paragraphs need to be broken down into words. This sounds like a straightforward job, but a lot of effort is needed to ensure the accuracy of the final result. Natural language is a challenge for computers because they cannot understand the words directly.
Natural Language Processing and Machine Learning are big topics. As this article focuses on the word cloud, only a few steps are considered to process these texts.
- Discard unwanted symbols
- Remove the organization and person names
- Normalize words (lemmatization)
- Filter stop-words
spaCy is a powerful tool for Natural Language Processing. We use it for tagging names and for lemmatization (reducing words to their base form), which are just the tip of the iceberg.
The first step is to initialize spaCy and strip unwanted symbols from the data. The data is held in a variable named content, which must be a single string at this point because str.translate operates on strings, and spaCy is initialized by en_core_web_sm.load()
import en_core_web_sm

nlp = en_core_web_sm.load()
symbols_to_replace = {'&': ' ', '#': ' ', '$': ' ', '£': ' ', '(': ' ', ')': ' ', '%': ' ', ':': ' ', '+': ' ',
                      '-': ' ', '*': ' ', '/': ' ', '<': ' ', '=': ' ', '>': ' ', '?': ' ', '@': ' ', '[': ' ',
                      ']': ' ', "\\": ' ', '^': ' ', '_': ' ', '`': ' ', '{': ' ', '}': ' ', '|': ' ', '~': ' ',
                      '”': ' ', '\t': ' ', '\n': ' ', '\r': ' ', '\v': ' ', '\f': ' '}
symbols_translator = str.maketrans(symbols_to_replace)
content = content.translate(symbols_translator)
doc = nlp(content)
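The symbol replacement can be tried on its own, without spaCy. The sketch below builds the same kind of translation table from a string of characters and applies it to a made-up sentence. The extra whitespace-collapsing step is an addition for readability and is not part of the original script.

```python
# The characters to blank out, written as one string instead of a dictionary.
symbols = '&#$£()%:+-*/<=>?@[]\\^_`{}|~”\t\n\r\v\f'
symbols_translator = str.maketrans({ch: ' ' for ch in symbols})

# Made-up sample text for illustration.
sample = 'Paint & glue (morning):\tstamp+roll'
cleaned = sample.translate(symbols_translator)

# Collapse the runs of spaces left behind by the replacement.
normalized = ' '.join(cleaned.split())
print(normalized)
```

Note that the hyphen is in the table, so a hyphenated word is split into two separate words before tokenizing.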
Secondly, we use spaCy to filter out organization and person names. These entities have a label_ equal to PERSON or ORG. A complete list of the named entity labels can be found in the spaCy documentation.
# Retrieve all ORGs and PERSONs
org_and_persons = []
for ent in doc.ents:
    if ent.label_ == 'PERSON' or ent.label_ == 'ORG':
        org_and_persons.append(ent.text)

# Tokenize ORGs and PERSONs for filtering
all_org_persons = ''
for org_and_person in org_and_persons:
    all_org_persons += org_and_person + ' '
org_persons_doc = nlp(all_org_persons)
org_persons_tokens = []
for token in org_persons_doc:
    org_persons_tokens.append(token.text)
Then we do lemmatization, punctuation removal, and stop word removal in a single loop. The base form of a token is available through its lemma_ attribute.
result = []
for token in doc:
    if not token.is_punct and not token.is_stop:
        if token.text not in org_persons_tokens:
            if (len(token.text) == 1 and token.pos_ == 'NUM') or len(token.text) > 1:
                result.append(token.lemma_.lower().strip())
Based on a study of the data, we decided to also keep a customized list of stop words. A file named stop-words.txt is introduced for the script to load dynamically.
import logging
from os import path

def load_stop_words():
    # base_dir is assumed to be defined elsewhere in the script
    stop_words_file = path.join(base_dir, "analyzer/resources/stop-words.txt")
    with open(stop_words_file) as f:
        stop_words_content = f.readlines()
    stop_words = [x.strip() for x in stop_words_content]
    logging.info(f'Loaded stop words: {stop_words}')
    return stop_words

# Exclude stop words
stop_words = load_stop_words()
result = [k.lower().strip() for k in result if k not in stop_words and k != '']
Here is a sample of the tokenized words.
['play', 'dough', 'disco', 'time', 'construction', 'activity', 'afternoon', 'engaged', 'stamp', 'pat', 'play', 'piano', 'play', 'dough', 'piece', 'spend', 'lot', 'morning', 'explore', 'painting', 'equipment', 'stamp', 'roll', 'create', 'different', 'print', 'friend', 'time', 'decorate', 'carefully', 'gingerbread', 'biscuit', 'colourful', 'flake', 'explore', 'conker', 'paint', 'roll', 'shake', 'box', 'show', 'able', 'independently', 'find', 'way', 'suit', 'conker', 'create', 'picture', 'watch', 'closely', 'see', 'different', 'mark', 'conker', 'make', 'move', 'keen', 'help', 'teacher', 'create', 'gloop', 'morning', 'spoon', 'mix', 'water', 'corn', 'flour']
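The custom stop-word filter can be exercised without the stop-words.txt file. The sketch below is one possible variant, not the project's code: it normalizes each token before comparing it, and it stores the stop words in a set so that each lookup takes constant time. The function name filter_stop_words and the three stop words are hypothetical.

```python
def filter_stop_words(tokens, stop_words):
    """Return normalized tokens that are non-empty and not in stop_words."""
    stop_set = {word.lower().strip() for word in stop_words}
    normalized = (token.lower().strip() for token in tokens)
    return [token for token in normalized if token and token not in stop_set]


# Hypothetical custom stop words, standing in for the contents of stop-words.txt.
custom_stop_words = ['time', 'lot', 'able']
tokens = ['Play', 'dough', 'time', ' ', 'spend', 'lot', 'morning']
filtered = filter_stop_words(tokens, custom_stop_words)
print(filtered)
```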
Generating WordCloud Images
Now that the data has been collected and processed, the next step is to visualize it as word cloud images. The NumPy and WordCloud libraries make this step straightforward.
from os import path

import numpy as np
from PIL import Image
from wordcloud import WordCloud

def generate_wordcloud(tokens):
    result = ','.join(tokens)
    # The mask image defines the shape of the cloud
    cloud_mask = np.array(Image.open(path.join(base_dir, "analyzer/resources/masks/mask_cloud.png")))
    word_cloud = WordCloud(background_color="white", mask=cloud_mask, font_path='analyzer/msyh.ttf').generate(result)
    word_cloud.to_file('word_cloud.png')
Similar to other online word cloud sites, a mask image can be chosen as the base image. Quite a few parameters, such as font_path and background_color, can then be tweaked to produce an image that suits your specific needs. Examples can be found on the official wordcloud library site.
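Since font size follows word frequency, counting the tokens first gives a quick preview of which words will dominate the image. The following is a minimal sketch using collections.Counter on a short, hand-picked token list.

```python
from collections import Counter

# A few tokens taken from the sample output, for illustration.
tokens = ['play', 'dough', 'disco', 'time', 'play', 'piano', 'play', 'dough']

# Map each word to the number of times it occurs.
frequencies = Counter(tokens)

# The two most frequent words, which would be drawn largest.
top_words = frequencies.most_common(2)
print(top_words)
```

A mapping like this can also be handed to the WordCloud object's generate_from_frequencies method instead of a joined string, which skips the library's own text splitting. Whether that suits a given data set is a judgment call.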
Translating to another language
As the data may be from various sources and the word cloud audience may be from different countries, it may be necessary to translate the words into another language for the best understanding.
googletrans is an unofficial Python library that implements the Google Translate API. We initialize the library in our code and send a group of words to Google Translate at a time.
from googletrans import Translator
translator = Translator()
translation = translator.translate(content, src='en', dest=language_code)
The translation looks simple, but there are some concerns that need to be taken care of.
- The library is not from Google so the stability is not guaranteed.
- Maximum text size in a single submission is 15k characters
- Words may have different translations in various contexts
- Some translations need to be corrected
If stability is a concern, the official Google Translate API is recommended.
Since our data is already tokenized, the size limit can be handled simply by segmenting the array into multiple submissions.
The most difficult issue is the third point above. If words must be translated according to their context, more complex processing needs to be in place.
In this article, we take a simple approach and translate words without context. As each remote Google Translate API call is slow, we cannot submit one word at a time. The easiest and most straightforward solution is to find a good separator and insert it between the words, which keeps the translation of each word independent of its neighbors. In the code snippet below, the tokens are submitted in groups of 500 with > as the separator.
def translate(tokens, language_code):
    num_of_tokens = len(tokens)
    result = []
    i = 0
    step = 500
    while i < num_of_tokens:
        end_index = i + step
        if end_index > num_of_tokens:
            end_index = num_of_tokens
        content = '>'.join(tokens[i:end_index])
        translation = translator.translate(content, src='en', dest=language_code)
        logging.debug(f'Translated tokens {translation}')
        translated_tokens = translation.text.split('>')
        result = result + [k.strip() for k in translated_tokens]
        i = end_index
    return correct_translation(language_code, result)
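The function above batches by token count. If the tokens are long, batching by character count is a more direct way to respect the 15k limit mentioned earlier. The sketch below shows one way to do that. The name chunk_by_length is hypothetical, and the small max_chars value in the example is chosen only to make the grouping visible.

```python
def chunk_by_length(tokens, separator='>', max_chars=15000):
    """Group tokens so that each joined chunk stays within max_chars.

    A single token longer than max_chars still gets a chunk of its own.
    """
    chunks = []
    current = []
    current_length = 0
    for token in tokens:
        # Length added by this token, plus a separator if the chunk is non-empty.
        added = len(token) + (len(separator) if current else 0)
        if current and current_length + added > max_chars:
            # The current chunk is full, so start a new one with this token.
            chunks.append(current)
            current = []
            current_length = 0
            added = len(token)
        current.append(token)
        current_length += added
    if current:
        chunks.append(current)
    return chunks


chunks = chunk_by_length(['play', 'dough', 'disco', 'time'], max_chars=10)
print(chunks)
```

Each inner list can then be joined with the separator and submitted as one request.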
The final step is to check the translations and correct those that are irrelevant to the overall context or simply not good enough. A translation mapping file, translation-map-zh-cn.csv, is used for the correction as follows. Words on the left are replaced by the words on the right.
股票,分享
铅,带头
匙,勺子
The code for reading the file and making the correction is as follows.
def correct_translation(language_code, tokens):
    translation_map_file = path.join(base_dir, f'analyzer/resources/translation-map-{language_code}.csv')
    translation_map = {}
    if path.exists(translation_map_file):
        # Read as UTF-8 so that non-ASCII entries load the same way on every platform
        with open(translation_map_file, encoding='utf-8') as f:
            translation_map_lines = f.readlines()
            for line in translation_map_lines:
                split_line = line.strip().split(',')
                translation_map[split_line[0]] = split_line[1]
        logging.info(f'Loaded {language_code} translation map: {translation_map}')
    for i, token in enumerate(tokens):
        if translation_map.get(token) is not None:
            tokens[i] = translation_map.get(token)
    return tokens
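The same correction can be sketched with the standard csv module, which handles the splitting and makes it easy to skip blank or malformed lines. This is an alternative illustration, not the project's code. The function names are hypothetical and an in-memory string stands in for the mapping file.

```python
import csv
import io


def load_translation_map(lines):
    """Build a {wrong: corrected} dict from CSV rows, skipping malformed rows."""
    translation_map = {}
    for row in csv.reader(lines):
        if len(row) >= 2:
            translation_map[row[0].strip()] = row[1].strip()
    return translation_map


def apply_corrections(tokens, translation_map):
    """Return a new list with mapped tokens replaced; others are unchanged."""
    return [translation_map.get(token, token) for token in tokens]


# In-memory stand-in for translation-map-zh-cn.csv, including a blank line.
sample_file = io.StringIO('股票,分享\n\n匙,勺子\n')
mapping = load_translation_map(sample_file)
corrected = apply_corrections(['股票', '水', '匙'], mapping)
print(corrected)
```

Unlike the in-place loop, apply_corrections returns a new list, so the translated tokens remain available for comparison.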
In conclusion, four topics are covered in this article: connecting to a remote MongoDB, tokenizing words, generating word cloud images, and finally translating the words into other languages. The goal is to produce a word cloud with the most meaningful words. Depending on the collected data and the purpose of the word cloud, an in-depth study of the raw data is crucial for drawing up your own plan for tokenizing words.
Here is the GitLab link to this showcase project. Please see the README for how to generate a word cloud for your own data. PRs are welcome!