Apache Superset - A Swiss Army Knife for Analyzing Data
Apache Superset is a an open source visualization tool which provides out of the box integrations with a wide variety of databases and cloud platforms. It can be easily deployed on EC2 machines and has great features which meet the production grade requirements like
- POWERFUL YET EASY TO USE
Quickly and easily integrate and explore your data, using either our simple no-code viz builder or state of the art SQL IDE.
- INTEGRATES WITH MODERN DATABASES
Superset can connect to any SQL based data source through SQLAlchemy, including modern cloud native databases and engines at petabyte scale.
- MODERN ARCHITECTURE
Superset is lightweight and highly scalable, leveraging the power of your existing data infrastructure without requiring yet another ingestion layer.
- RICH VISUALIZATIONS AND DASHBOARDS
Superset ships with a wide array of beautiful visualizations. Our visualization plug-in architecture makes it easy to build custom visualizations that drop directly into Superset.
Why Superset is production ready and can be compared to various BI tools like Quicksight and Power BI and Metabase
- ROW LEVEL SECURITY:
Superset allows us to manage access to the different rows in a particular database or table by giving the flexibility of specifying a field or column for which access is to be denied.
USE CASE :
Assuming we have a table from a database containing branch level data (1 lakh records) as the unique key, and if we have to generate analytics for each branch then that would mean we have to generate 1 lakh dashboards. Superset helps us to easily manage this scenario by allowing us to specify the various roles and the access to the data, which means that we can create a single dashboard and create views to the different users based on the roles.
- GRANULARITY IN PERMISSIONS:
Superset offers us the flexibility of creating ROLES and specifying the permissions for each role. There are multiple drag and drop permissions which can be easily added to a CUSTOM role ensuring fine granularity in the permissions.
USE CASE:
When we have a large number of parties who want various views on a particular dashboard it becomes very difficult to create them. Hence Superset offers a granular approach based on ROLES which allows the admin to manage the groups and assign the users to the particular group as and when they are created.
- AUTO MAILING FEATURE AND SHARING OF DATA:
Superset allows us to configure mailing services which automates the entire process of creation to sharing of data to the respective stakeholder’s email.
USE CASE:
It is often difficult to share the dashboards in a standard format when there a large number of parties involved. Hence in Superset we can give the features of auto mailing which allows the admin to automate the sending and mailing of the dashboards.
- INTEGRATION WITH ACTIVE DIRECTORY
Superset allows us to integrate with various authentication services and integration with Active Directory using the LDAP protocol.
USE CASE:
If a company has > 100 employees, managing and creation of permissions with Row Level Security Manually for each of the records is a time consuming process and data entry errors are a pain to deal with
- DATA REDUNDANCY AND AVAILABILITY:
There are currently three ways of deploying Superset • DOCKER • KUBERNETES • PIP INSTALL
USE CASE:
Scalability is a major concern as the number of users increases which usually causes slow run times as and when the number of users increases. Hence the power of Kubernetes allows us to manage multiple deployments of docker images on the cloud and also allows up to orchestrate and spin up containers as the traffic increases, which ensures fast load times and improves the User Experience.
- SQL LAB
SQL Lab allows us to directly connect the data sources from a variety of data sources and this connection allows us to prototype the various tables in real time.
USE CASE:
When we have to create a visualizations, it is often done using Query logic. Hence Superset allows us to view the data, run the query and also visualize the prototype in a single window, this helps the developers to increase the productivity as all the information is available.
Installing Superset
There are mainly three methods of installing Superset according to the documentation as mentioned above and here we are installing it on and EC2 Instance from AWS on a DOCKER IMAGE.
- Let’s set up the EC2 Image on AWS
- Register with a free AWS account and navigate to the respective EC2 page and launch a free tier micro instance
- Let’s set up the EC2 Image on AWS
- After clicking on SELECT in the image above leave all the options as default and the CONFIGURE SECURITY GROUP TAB we need to open port 8088 as Superset runs here as shown in the image below.
- Login to the instance and proceed to install the docker image
- The next step is to install DOCKER COMPOSE and DOCKER ENGINE from the links below for an ubuntu instance
- Docker Engine : https://docs.docker.com/engine/install/ubuntu/
- Docker Compose: https://docs.docker.com/compose/install/
- Then we can install Superset according to the documentation below: https://superset.apache.org/docs/installation/installing-superset-using-docker-compose
Now if we want to add a connector to a specific database then before the command
$ docker-compose -f docker-compose-non-dev.yml up
We need to install the respective connector as given in the link below ,
https://superset.apache.org/docs/databases/dockeradddrivers Here in our use case we need to install the Athena driver hence in the command,
$echo “mysqlclient” » ./docker/requirements-local.txt
We need to replace “mysqlclient” with “pyathena” and the complete list of drivers can be found in the link below : https://superset.apache.org/docs/databases/installing-database-drivers
Once Superset is installed we can login using the default user name and password which is admin : admin
Row Level Security on User Level
Row level security provides an easy way to manage access to the various roles and permissions and access to the fine grained data. This can be implemented by going into the SETTINGS ICON and selecting the ROW LEVEL SECURITY as shown in the figure below
Then click on new (+) icon on the right side and select all the fields that need to be added as shown in the image below :
Then we can add a new user and specify this ROLE for them which ensures that the row level security is implemented and that they can see only those Records to which they have access.
Row Level Security on Data Set Level
USE CASE :
If we have a business in which there are about 10000 users and these users are a part of 100 branches and if we have to display a dashboard in such a way that only the current branch user can view their performance in the current quarter. i.e Each branch user must be able to view only their Branch performance and not other branches.
Solution:
We have to manually set the Row level security for each user based on their user name as shown in the image below. Not only this, but also we need to manually add all the tables in the ROLE section for a particular user. In the future if we have more tables then we need to manually add them to the row level security as well.
Although the documentation is not clear on how to solve this issue after a lot of research we found that there is a built in DYNAMIC FILTERING OPTION using the JINJA TEMPLATE. The line below uses the template to give the current username and this helps us to filter later in the DASHBORD LEVEL
‘{{current_username()}}’
This needs to be enabled,
-
We need to login to our EC2 instance and locate the superset docker folder.
-
Next we need to go to /superset/superset/config.py
-
If we open this file using nano and then search using ctrl+w ENABLE_TEMPLATE_PROCESSING, we can see that it is set to false
- Even though we set it to true it does not allow us to make any changes as superset is designed in such a way that we have to override this by making changes in another file called superset_config.py.
-
What ever changes we make in this file is overwritten in the file called config.py.
-
To enable template processing we need to check the config.py
- It tells us to change the FEATURE_FLAGS to TRUE. Open /superset/docker/pythonpathdev/superset_config.py and search for FEATURE_FLAGS . Then in a new line enable template processing and restart the instance.
- Once we restart the instance we can create a dataset in which we have a unique user name.
- Here we need to select edit dataset and then choose the legacy sql editor.
- Then paste the string based on your use case. Here we want only the logged in user to view the current dashboards containing his/her username and not the dashboards of other users.
Refreshing a Dataset:
Since we have enabled Athena Driver on the superset instance, if we have created dashboards on superset and if we change the source database or reload the data it causes issues in superset as old dashboard does not reflect the new dataset.
This issue can be solved by,
- Clicking on edit dataset and navigating to the columns part
- Now if we click on Sync Columns from source all the old dashboards will reflect the new dataset.
Increasing Storage on EC2 as the datasets increase:
Go to the console, and choose the volume, then go to modify and change the storage. This can only be increased and once done we cant downgrade.
We then need to manually allocate the storage on the ec2 instance.
$ Df -Th
This command is used to get the increased volume in this case it is XVDA
$growpart /dev/xvda 1
This will increase the partision
$resize2fs /dev/xvda1
User Analytics on Superset:
The analytics can be viewed in the SQL lab:
It is present under the PGSQL database, Schema: Public, table ab_user
Query : SELECT * from ab_user where EXTRACT(MONTH FROM last_login )=10
Embedding Dashboards on Another Application:
First, you need to update the PUBLIC ROLE under Settings with these options.
- can explore json on Superset
- can dashboard on Superset, all database access on all_database_access. Second, embed your dashboard in your HTML
iframe src=“localhost:8088/superset/dashboard/5/?standalone=true” iframe
Here ‘5’ in the URL specifies the dashboard number
Creation of Datasets:
To create a Dashboard we need to follow the steps below:
- Go to the DATA tab
- Select create a DATASET button as shown in the image below
-
Select the database, the schema, and the table name
-
Then choose the dataset and select the visualisation type as TABLE
- Add all the respective fields and run the query. After that save the result in a new DASHBOARD.
Backup and Restore Superset Via Docker
A docker image does not have persistent storage hence when the image is torn down or if the image crashes the entire persistent data stored in the DB will be terminated. We can find out more from the blog: https://docs.docker.com/storage/volumes/
Hence If we need to take backups of the data, it is crucial to identify the mounting point of the docker image on the Operating System and on Ubuntu this can be found in the /var/lib/docker. STEPS TO BACKUP AND RESTORE A DATA IN DOCKER
- Create a copy of the current volume
Cp -r [source] [destination] Here since we already have a backup in backup_vol we are copying it into /var/lib/docker
$ sudo cp -r /home/ubuntu/backup_vol/docker /var/lib/docker
- Stop the services as well:
$sudo service docker stop
- Stop docker containers by using the command
$ sudo docker stop $(sudo docker ps -q)
Make sure all the services show as exited and this takes a bit of time
Recheck by running $docker ps -a-non
- Stop containers and remove containers, networks, volumes, and images created by up.
$ sudo docker container down
-
For testing purposes remove the current docker volume which is present in the file system by running $ sudo su $ cd /var/lib $ rm -R docker
-
If the device is busy then we need to use umount command.
$umount overlay By doing this all the persistent data from the disk has been removed including the row level permissions that we had set.
Here there is no user with the row level security .
- Now we have a backup in the home folder, which needs to be copied to /var/lib/docker
sudo cp -r /home/ubuntu/backup_vol/docker /var/lib/
Here the name of the backup is docker in the folder backup_vol
- Now we need to restart it as it will cause errors as it will be reading the data from /var/lib/docker
Now use docker compose up and restart from scratch
$ sudo service docker restart
$ sudo docker-compose up
If we get errors ctrl+c then wait for the service to go down, then go to the main superset folder and run the main command to start superset.
$ cd superset $ sudo docker-compose -f docker-compose-non-dev.yml up
This ensures that the volumes will be read from the correct location from the backup folder.
- This data needs to be backed up to S3 layer as well this can be done by using the tar command. $cd /var/lib tar -zcvf name_of_file_to_be_saved folder_name
• -z : Compress archive using gzip program in Linux or Unix
• -c : Create archive on Linux • -v : Verbose i.e display progress while creating archive • -f : Archive File name
$tar -zcvf prev_backup.tar.gz /var/lib/docker
- Then we need to install the AWS CLI, https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2-linux.html After this we need to send the data to the S3 bucket. This can be done by copying the S3 bucket URI.
To copy the files from EC2 to S3
$ aws s3 cp
Hence if we have a file called prev_backup.tar.gz and s3 bucket URI then
$ aws s3 cp /home/ubuntu s3://
Now this data is sent to the respective bucket and is stored successfully.
- To restore this copy from S3
$aws s3 cp s3://<dataengineer-out/supersetbackup> <Fully Qualified Local filename/Directory>
Using tar unzip the file (USE SUDO SU and then UNZIP)
$ tar -zxvf prev_backup.tar.gz
$ sudo mv prev_backup/docker /var/lib
Du – display disk usage: $du -h foldername
Df- Display file system file size $df -f (Shows the full file system file size)
Superset Reset automation: https://gist.github.com/pajachiet/62eb85805cee55053d208521e0bdaf13/revisions
Automation of Superset backup to AWS.
1.Create a script called automate_backup.sh. And give necessary permissions.
$nano automate_backup.sh
This script copies the entire docker folder in /var/lib to the home directory, then compresses it
HOME=/root LOGNAME=root PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin LANG=en_US.UTF-8 SHELL=bin/sh PWD=/root #!/bin/bash sudo cp -r /var/lib/docker /home/ubuntu/prev_backup tar -zcvf /home/ubuntu/prev_backup.tar.gz /home/ubuntu/prev_backup/docker
$sudo chmod 777 automate_backup.sh
- Make sure that the aws cli is configured in the instance and the Access code and key is set. https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2-linux.html
$nano automate_s3_storage.sh
This script copies the zipped data from Ec2 to S3. We have to specify the s3 URI by creating a folder
aws s3 cp /home/ubuntu/prev_backup.tar.gz s3://bucket_url
$sudo chmod 777 automate_s3_storage.sh
-
Schedule the cron job to run the first script (automate_backup.sh)
$ cron tab –e
#this creates a local backup and it is stored as a tar file every day at 12 AM IST or 5:30am UTC
30 5 * * _ /home/ubuntu/automate_backup_test.sh
#this sends the backup to aws on every seventh day at 12 AM IST or 5:30 am UTC
30 5 _ * 0 /home/ubuntu/automate_s3_storage.sh
#check the status of the cron tab
$ cron tab -l
By following all these steps the data is successfully backed up into s3 on a periodic basis. Due to the compression the size of the file is reduced by a factor of 4. (15 GB file is reduced to 3.7Gb)