Data Science Workflow with Pandas, Matplotlib and Seaborn
A practical data science workflow is a brief account of how certain Python libraries can be used to extract insights from datasets, and from the information contained within them, with the help of visualizations and table summaries. Here is my project-based practical workflow with Pandas, Matplotlib and Seaborn, along with an explanation of how data science projects are initially carried out to understand the requirements of the project.
The very first step begins with installing and importing two of the most effective and widely used plotting libraries in Python, Matplotlib and Seaborn. These libraries not only give us proper visualizations but also help communicate the statistical measures and values behind the analysis and the project.
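A minimal setup sketch, assuming a standard pip environment (the exact install commands were not shown in the original):

```python
# Install once from the command line (outside Python):
#   pip install pandas matplotlib seaborn
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme()  # apply Seaborn's default styling to all Matplotlib plots
```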
A line chart works well here for visualizing the tons of apples yielded and their supply across different years, and from this statistical information we can infer that the yields increased at a roughly constant rate every year. But what if we want to compare the data for both apples and oranges over the yearly range from 2000 to 2012? In that case, both series can be drawn in a single line chart.
Two different sorts of markers are used, with a legend indicating whether a line represents the orange or the apple yield. Looking at the chart, we can say that the crop yield for apples increased over the years, while the yield for oranges decreased. This information helps us understand why the supply of oranges in the markets was lower than that of apples during those years.
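The original charts are not reproduced here, but a sketch of the combined line chart might look like the following; the yield values are illustrative placeholders, not the article's actual data:

```python
import matplotlib.pyplot as plt

years = list(range(2000, 2013))                  # 2000 through 2012
apples = [0.89 + 0.02 * i for i in range(13)]    # placeholder: steady growth
oranges = [0.96 - 0.02 * i for i in range(13)]   # placeholder: steady decline

# Distinct markers plus a legend distinguish the two crops
plt.plot(years, apples, marker="o", label="Apples")
plt.plot(years, oranges, marker="x", label="Oranges")
plt.xlabel("Year")
plt.ylabel("Yield (tons)")
plt.legend()
plt.show()
```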
Until now I was using manually entered data in the form of arrays with Matplotlib, but real surveyed/collected data is where we use Pandas, Matplotlib and Seaborn together to gain valuable insights. Here I have loaded a dataset from Seaborn describing three different flower species and the qualities (sepal length, sepal width, petal length, petal width) on which they are compared.
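This dataset ships with Seaborn, so loading it is a single call (a sketch; `load_dataset` downloads the file on first use):

```python
import seaborn as sns

# Seaborn's built-in iris dataset: 150 flowers, 3 species, 4 measurements
flowers = sns.load_dataset("iris")
print(flowers["species"].unique())  # the three species
print(flowers.head())
```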
The different species of flowers are the unique values in the dataset, and the other measurements given with them allow the flowers to be compared by size. The main question arising here is: "since there are multiple flowers of the same species that still have different dimensions, how can we distinguish the largest and the smallest of each kind, and which ones are of median/average size and value?"
The most understandable and appropriate statistical visualization here is a scatter plot, and it answers most of our questions. A detailed statistical plot with a legend and axis labels is sometimes enough to give valuable insights about the data. In the example, the three types of flowers are separated into three legend entries, and each shows where its flowers stand on the basis of their individual measurements. Before this stage the data was wrangled thoroughly and then plotted to answer the common questions. The best part here is the effectiveness of the scatter plot, which summarizes most of the complex questions in a single detailed view, with the title "Sepal Dimensions" and the x and y axes representing sepal length and sepal width respectively.
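A sketch of the scatter plot described above, using Seaborn's `scatterplot` with `hue` to split the species into legend entries:

```python
import matplotlib.pyplot as plt
import seaborn as sns

flowers = sns.load_dataset("iris")

# hue colors each species differently and builds the legend automatically
sns.scatterplot(data=flowers, x="sepal_length", y="sepal_width", hue="species")
plt.title("Sepal Dimensions")
plt.show()
```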
Coming down to our third form of visualization, the histogram. This visualization helps us understand the distribution of the data. It is easier to read than a scatter plot and is often regarded as an efficient, common and simple way of analyzing data. Using NumPy we can set the bin edges on one axis to the level of detail we want, without compromising on the legend and the flower species information.
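A sketch of such a histogram, with `np.arange` supplying the bin edges so one axis is divided at the level of detail we choose (the 0.25 bin width and the overlap styling are assumptions, not taken from the original):

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

flowers = sns.load_dataset("iris")

# Bin edges from 4.0 to 8.0 in steps of 0.25 (width chosen for illustration)
bins = np.arange(4, 8.25, 0.25)
for species, group in flowers.groupby("species"):
    # alpha makes the overlapping species distributions visible together
    plt.hist(group["sepal_length"], bins=bins, alpha=0.5, label=species)
plt.xlabel("Sepal Length")
plt.legend()
plt.show()
```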
Coming to the fourth and probably the most important visualization for any dataset that involves a three-way variable analysis: the heat map. A heat map has an immediate impact on the analysis made by any data scientist, business intelligence engineer, data analyst, or anyone drawing multiple insights from the data. To understand how a heat map works, the dataset used here is of flights: the year and month in which they took off and the number of passengers on every flight.
The heat map used here shows a three-way analysis of the number of passengers, the month and the year. The color scale from darker to lighter represents the range from the fewest to the most passengers on a flight. A proper heat map can turn the most complex and composite information into a readable visualization giving the most effective insights: black represents the fewest passengers on an aircraft and white the most. For a more effective analysis we can use the annot option of Seaborn's heatmap to print the numbers on the cells, so exact values can be read alongside the colors.
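A sketch of the heat map, assuming Seaborn's built-in flights dataset; `pivot` reshapes the long table into a month-by-year grid and `annot=True` prints each passenger count:

```python
import matplotlib.pyplot as plt
import seaborn as sns

flights = sns.load_dataset("flights")

# Reshape the long table into a 12 x 12 month-by-year matrix
matrix = flights.pivot(index="month", columns="year", values="passengers")

# "gray" maps low counts to black and high counts to white;
# annot=True writes the number in each cell, fmt="d" keeps them as integers
sns.heatmap(matrix, cmap="gray", annot=True, fmt="d")
plt.title("Passengers per Month and Year")
plt.show()
```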
Last but not least, a visualization that represents all the previous visualizations in one format, with all the variables and pairwise comparisons combined. To present this I return to the flowers dataset and the information gained from it. The main function used here is Seaborn's pairplot, which combines all the statistical visualizations in one place.
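A sketch of the pairplot call on the same iris dataset; one line produces the whole grid of pairwise plots:

```python
import seaborn as sns

flowers = sns.load_dataset("iris")

# Pairwise scatter plots for every pair of numeric columns,
# with per-variable distributions on the diagonal, split by species
grid = sns.pairplot(flowers, hue="species")
```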
In conclusion, dashboard-based visualizations are easy to handle and quick to produce, but statistical visualizations that involve multiple lines of code can yield insights that are sometimes more valuable, thanks to the range of functions that can be applied to them.
Project GitHub Link — advait27/datascience-project (github.com)