How we use Pandas to build super analytics for our customers
Crossknowledge is treading new ground in the way it provides learning experiences and knowledge to its customers. This will to innovate and to always renew is also reflected in the way we develop our solutions and in our technical choices.
Last year, we developed for our customers a brand new analytics functionality. They now can see several pertinent indicators in a user-friendly way, with a very fast-loading.
It was the fast-loading constraint which was challenging. How can we serve fresh computed data from several millions of entries in less than 3 seconds without overload our databases? That’s how we come to Pandas. What is it? What problems does it solve? How we use it in Crossknowledge? You will find all the answers by reading this article.
What is Pandas?
Pandas is the Python Data Analysis Library. It’s an open source library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. In brief, it is made to aggregate and denormalize data in a very fast way:
What problems does it solve?
First, let’s review the challenges in the context of our analytics functionality:
- Aggregate and render millions of data in a very fast way;
- Do not overload our databases
As we need to update the data once a day, Pandas perfectly meets our problems. Indeed, it allows us to import data from a CSV file, so we only have to export the data of our databases in CSV file and let Pandas aggregate and denormalize them. Then, we just need to retrieve the Pandas data in the LMS and render them in our dashboards.
By treating our data with an external tool, we avoid database overloading and their aggregation makes them much more faster to retrieve.
How do we use it?
In our case, we want to aggregate daily data such as the number of connexions or the number of completed trainings for a given day.
So every day, in the middle of the night, we export the needed data from our database in a CSV file that we upload in an Amazon S3 bucket. Our Dataviz Pandas consumes the CSV file to aggregate and denormalize the data.
Once it is done, thanks to the REST API provided by the Dataviz, we can directly retrieve the Pandas data from our Learning Suite and display them. The following schema describes how it works:
This is how it works in a very simple way. In practice, we added a Varnish cash system between our Learning Suite and the Dataviz Pandas to reduce the data retrieving.
As you can see in the schema bellow, the only thing that change is that we make a purge request to the Varnish cache each time we upload a new CSV file in the Amazon bucket:
To be short, by aggregating a large amount of data and by making their reading fast, Pandas matches our expectations in the context of an analytics functionality that displays a lot a dashboards.