Using ML to Process Large Sets of Feedback Data and Present Findings
A good analyst can do wonders for any business. She can work with a wide range of data points and turn them into insights. But even a good data analyst needs a great process, the right set of tools, and access to resources. In essence, good insights are a function of an exceptional, tested, and systematic process.
Intellica solved a major problem for an enterprise client by building a comprehensive data analysis solution that used systematic clustering to derive valuable insights from a very large number of data points. A good manager may land on the right insight once or twice, but when feedback data has to be categorized across several clusters and categories before any insight can be derived from it, the enterprise needs a structured and tested process. And that is where the team's human capital came in handy!
Defining the Problem Statement
The challenge was to collect data from several points, aggregate it, normalize it, categorize it, and derive valuable insights from it. The quality of insights was the priority and the process had to be efficient & repeatable. Does that sound familiar?
Working on a large set of feedback data is not a new problem. Such solutions have already been deployed across various use-cases:
a. A restaurant that wants to test different price-points from customer reviews.
b. An e-commerce platform that wants to test the response for its new product range by analysing initial ratings on the site.
c. A media house trying to get the traction of a video it posted based on the social media engagement it received.
The fundamental nature of the problem is common across these use-cases: the sources of data points are varied and it might be inefficient, or nearly impossible, to process them manually. For this particular client, the problem arose in the context of analysing employee engagement data.
Intellica worked with a team of product owners to solve this problem. The firm under consideration here had an employee engagement platform that collected responses from all the employees at the firm and put them across 80 categories and sub-categories.
Goal: To group responses into the maximum number of unique clusters, irrespective of the category each answer belonged to. The clusters help provide context to the decision-makers and analysts.
The output had to be accessible to decision-makers and the management team, so they could get the insights without having to put valuable time into deciphering graphs and charts.
Solution Design and Execution: Overview
In line with the problem statement and the objective in place, the Intellica team designed an unsupervised machine learning solution using Natural Language Processing. The entire solution was divided into three phases: pre-processing, feature engineering, and unsupervised modeling.
The Intellica team architected a comprehensive process based on its expertise and experience accumulated from similar engagements. Here is what the overall process resembled:
As illustrated in the architectural flow diagram, the entire process was channeled into two clustering exercises, with the end goal of executing a fully unsupervised approach. The output was to be a word cloud that sizes each term in proportion to its prevalence in employee responses.
The first clustering exercise ends with a post-processing step that outputs terms with a high frequency of mentions and passes them on to the next analysis layer. The second layer of the process focuses on feature encoding, using the earlier layer's output as its input. To identify similarities between terms, the cosine similarity technique was used.
Following this, a pivot table was engineered, with the kneed library used to map the similarity between each word and the overall cluster of words and to arrive at an optimal set of clusters. As the clustering process approaches its end, we have a final list of frequently mentioned words along with their frequency across clusters. This list is used as the foundation for generating a word cloud for each cluster, keeping the analysis accessible and accurate.
Getting into the Details
While the overview gives the broad strokes of the entire solution, the business logic only becomes apparent when you tap into the details. Here is how the process unfolded:
1. Pre-Processing Data: One of the easiest ways to get incorrect results from an ML model is to rush into building the model without cleaning and processing the data. No dataset is perfect, not even one a team collects for its own analysis. The data, in this case, was free text entered by humans. Since language can have idiosyncratic usage, the dataset required considerable processing (a sketch of this step follows the list below):
a. Removing sentences not written in English.
b. Converting all the text to lower-case to avoid counting the same word twice because of case differences. For the same reason, the team also reduced plural terms to their singular form.
c. Clearing out all external and internal URLs, punctuation marks, stop words, colloquial terms, and extra spaces.
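The case study does not name the exact libraries used for this cleaning step. The minimal sketch below assumes Python with pandas, NLTK, and langdetect, and uses an illustrative DataFrame and column name; it simply walks through the cleaning rules listed above.

```python
import re
import nltk
import pandas as pd
from langdetect import detect              # one possible language detector; the write-up does not name one
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def is_english(text: str) -> bool:
    try:
        return detect(text) == "en"
    except Exception:
        return False

def clean_response(text: str) -> str:
    text = text.lower()                                  # avoid case-based double counting
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # strip URLs
    text = re.sub(r"[^a-z\s]", " ", text)                # strip punctuation and digits
    tokens = [
        lemmatizer.lemmatize(tok)                        # reduce plurals to singular form
        for tok in text.split()                          # split() also collapses extra spaces
        if tok not in stop_words
    ]
    return " ".join(tokens)

# Illustrative input: one free-text employee response per row.
df = pd.DataFrame({"response": [
    "Our team meetings run too long and waste valuable time",
    "Me gustaría tener horarios más flexibles",          # non-English, will be dropped
]})
df = df[df["response"].apply(is_english)]                # keep English sentences only
df["clean"] = df["response"].apply(clean_response)
print(df["clean"].tolist())
```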
2. Feature Extraction: This is the step where the human inputs are converted into a machine-interpretable format. This makes the machine learning algorithm more efficient and accurate with computations.
Popular extraction techniques include Count Vectors, TF-IDF, Word2Vec, and many more. The central challenge with these techniques is that they are not well-suited to establishing context within textual datasets. As a workaround, the Intellica team decided to proceed with Bidirectional Encoder Representations from Transformers (BERT), engineered by Google AI.
Google has given some interesting details about the BERT model here. The simplest way to understand how BERT operates is to contrast it with how other NLP models interpret data. Before BERT became a popular option, most NLP models read text either from left to right or from right to left. BERT, on the other hand, reads the entire sequence at once. So, unlike techniques that derive a word's context only from the words on one side of it, BERT establishes each word's context using all the surrounding words in the text.
The Intellica Team used BERT for feature engineering that would later become the input for the model.
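The write-up does not specify which BERT implementation the team used. As an illustration, the sentence-transformers library is one common way to obtain a fixed-size BERT embedding per response; the model name below is an assumption chosen purely for the sketch.

```python
from sentence_transformers import SentenceTransformer

# Any BERT-base model yields a 768-dimensional vector per response,
# which is the feature size referenced later in this write-up.
model = SentenceTransformer("bert-base-nli-mean-tokens")

responses = [
    "team meeting run too long waste valuable time",
    "would love more flexible working hour",
]
embeddings = model.encode(responses)   # shape: (n_responses, 768)
print(embeddings.shape)
```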
3. Unsupervised Modeling: Unsupervised modeling is a great way to establish early patterns and structures within a dataset. The team considered two of the more popular unsupervised learning approaches for the process — Clustering and Topic Modeling.
Topic Modeling gives great insights into the density of the textual data. That said, it is quite inefficient from an operational standpoint, as it requires a lot of manual input. Clustering was therefore the preferred option, as it could generate useful output with far less intervention. K-Means, Hierarchical, and Density-Based Clustering are the most commonly used approaches.
4. K-Means Clustering: In the simplest terms, K-Means partitions the data into a chosen number of clusters ('K') using the average value of the points assigned to each cluster. It is considered one of the most efficient unsupervised machine learning techniques and can streamline the analysis without major manual input. It covers three major steps:
a. The team initializes ‘K’ centroids. Alongside this, each data point from the dataset is allocated to the nearest centroid. With several data-points around the centroid, a cluster is formed.
b. Post this, each centroid's value is re-calculated as the average of the data points in its cluster.
c. Steps (a) and (b) are repeated until the centroids stop moving and the cluster assignments no longer change, at which point the algorithm has converged.
Google identifies K-Means as one of the few clustering methods that is both efficient to implement and able to scale to large datasets.
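A minimal sketch of this clustering step, assuming scikit-learn and a stand-in feature matrix (random numbers here, in place of the real BERT features); the value of K is illustrative, since the next section describes how it was actually chosen.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for the BERT feature matrix: one 768-dimensional vector per
# employee response (random values here, purely for demonstration).
rng = np.random.default_rng(42)
embeddings = rng.normal(size=(500, 768))

k = 12  # illustrative; see the kneed step below for how K was actually picked
kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)

# fit_predict runs the loop described above: assign each point to its nearest
# centroid, recompute each centroid as the mean of its cluster, and repeat
# until the centroids stabilize.
labels = kmeans.fit_predict(embeddings)
print(labels[:10])
```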
5. Kneed Algorithm and PCA: The appeal of K-Means clustering is that it is relatively easy to execute. The catch, as several experts would point out, is that K-Means requires the 'K' value, i.e. the number of clusters, to be selected manually. Moreover, if the pre-processing step is not performed accurately, outliers that contribute nothing to the clustering may compromise the model's accuracy instead of being eliminated.
To solve these challenges, the Intellica team used the Kneed algorithm, which is considered one of the most accurate methods of determining the optimal number of clusters for a given dataset. The team executed it and found the results to be very accurate.
Another challenge with K-Means is the scalability of the process. While it is generally considered quite scalable, it struggles as new dimensions are added: as the number of dimensions grows, distance-based similarity measures converge towards a constant value between any pair of points, which makes the clusters less meaningful. This dimensionality challenge was controlled using Principal Component Analysis (PCA) on the feature data.
As mentioned earlier, the BERT model provides a vector of 768 features for each unit of input text. PCA was deployed to reduce the dimensionality of these vectors before clustering, cutting the number of features by 50%, from 768 to 384.
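A sketch of these two steps, assuming scikit-learn for PCA and K-Means and the kneed library for elbow detection; the candidate K range and the stand-in feature matrix are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from kneed import KneeLocator

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 768))   # stand-in for the BERT feature matrix

# Halve the dimensionality from 768 to 384 features, as described above.
pca = PCA(n_components=384, random_state=0)
reduced = pca.fit_transform(embeddings)

# Compute the within-cluster inertia for a range of candidate K values,
# then let kneed locate the elbow ("knee") of the resulting curve.
k_values = list(range(2, 21))
inertias = [
    KMeans(n_clusters=k, random_state=0, n_init=10).fit(reduced).inertia_
    for k in k_values
]
knee = KneeLocator(k_values, inertias, curve="convex", direction="decreasing")
print("Optimal number of clusters:", knee.knee)
```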
6. Cosine Similarity: The process so far is fairly textbook, with some novelty attributable to the team's experience. The data has been processed and the clusters have been formed. But, at a deeper level, the team is still trying to understand the underlying context of the data. This is where Cosine Similarity can be of great help.
Cosine Similarity measures the similarity between two words or sentences as the dot product of their vector representations, normalized by the product of the vectors' magnitudes. Beyond straightforward text comparison, it can also support more sophisticated processes such as sentiment analysis.
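A small, self-contained illustration of the measure: the first function spells out the formula, while scikit-learn's cosine_similarity computes it for every pair of rows in an embedding matrix. The toy vectors are, of course, placeholders.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def cosine_sim(a, b):
    # Dot product of the two vectors, normalized by the product of their magnitudes.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])
print(cosine_sim(a, b))   # 1.0 -> the vectors point in exactly the same direction

# For a matrix of response embeddings (one row per response), sklearn
# computes all pairwise similarities at once.
similarity_matrix = cosine_similarity(np.vstack([a, b]))
print(similarity_matrix)
```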
7. Output Data (Word Clouds): There are several ways to present the output of a machine learning process. Graphs and charts can showcase the frequency of mentions, and the implicit intelligence can be captured in analyst reports. However, making the results accessible to decision-makers was also one of the key challenges for the Intellica team. How do you run a sophisticated process and then simplify the results without losing the nuance? The answer to these challenges is Word Clouds.
Word Clouds provide a visual way of accessing the results of an ML process, or for that matter any analytical process. Each word's size and weight depend on its prominence in the dataset and the analyzed results: the smaller, thinner words at the edges are usually less important, while the big, bold, central words are significant. This way, results that would otherwise take several pages of summary statistics or charts to communicate are made available in a single visual illustration, accessible even to individuals who do not have a strong grip on analytical processes.
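As an illustration of this final step, the sketch below builds a word cloud for one cluster using the Python wordcloud library; the term frequencies are made up, but in the actual pipeline they would come from the cleaned responses assigned to that cluster.

```python
from collections import Counter
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Illustrative cleaned responses belonging to a single cluster.
cluster_texts = [
    "flexible work hour appreciated",
    "work hour too long",
    "appreciate flexible remote work",
]
frequencies = Counter(" ".join(cluster_texts).split())

# Term size in the cloud is proportional to how often the term is mentioned.
cloud = WordCloud(width=800, height=400, background_color="white")
cloud.generate_from_frequencies(frequencies)

plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```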
In Conclusion
The major concern throughout the process was balancing the efficiency and the accuracy of the results. The end result had to capture each individual's nuance without expending a large amount of resources on manually analysing the dataset. With Intellica's help, the management team at the firm was able to initiate key programs directly impacting employee engagement.
Intellica is helping a varied set of firms generate strategic value for their business using its expertise in Conversational AI, Computer Vision, NLP, Knowledge Discovery, Data Visualization, Predictive Analytics, and Recommendation Engines.
Want to learn how Intellica can help you transform your business or garner the deepest insights with ML? Explore our areas of excellence here or get in touch with our team at info@intellica.ai