English Pages 122 Year 2024
1 Dataset 101 Visualizations Using Python Author Ahmed Abouraia
Table of Contents:
1. About the Author, Copyright and Abstract
• About the Author
• Copyright
• Abstract
2. Getting Started with Data Visualization
• Why Data Visualization Matters
• Effective Data Visualization
• Python Libraries for Data Visualization
• Installing Required Libraries
3. Generating Synthetic Dataset with Faker
• Installing Faker Library
• Generating Synthetic Sales Data
4. Visualizations with Python and Matplotlib
• 101 visualizations for the loaded synthetic dataset
5. Conclusion
6. Useful resources
About the Author
Ahmed Abouraia is a Data Architect, Writer, and Lecturer who has spent the last 15 years working at an international school in Cairo, Egypt, learning in the technology field and earning certifications from technology market leaders such as Microsoft, IBM, Oracle, AWS, VMware, Sophos, and others. He graduated at the top of his class from the Arab Academy for Science and Technology with a master's degree in E-Business in 2022, and he is keen to extend his academic record by pursuing a doctorate in data science. He couldn't have done it without the support of his family and his own constant motivation. Author email address: [email protected]
Copyright © 2023 Ahmed Abouraia, Egypt. All rights reserved. No part of this guidebook may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical methods, without the prior written permission of the author, except in the case of brief quotations embodied in critical reviews and certain other non commercial uses permitted by copyright law. This guidebook is intended for personal use and educational purposes only. The content provided herein, including data visualization techniques using Python, is based on the author's professional expertise and experience as a Data Architect and Lecturer. While efforts have been made to ensure the accuracy and reliability of the information presented, the author shall not be held liable for any errors, omissions, or damages arising from the use of this guidebook.
Any use of the materials and code examples provided in this guidebook should be made with proper acknowledgment and attribution to the author, Ahmed Abouraia. Unauthorized use, reproduction, or distribution of the content may be subject to legal action.
For inquiries or permission requests, please contact [email protected]. Thank you for respecting the author's intellectual property and for your interest in learning about data visualization using Python.
Disclaimer
The information provided in this guidebook is for educational and illustrative purposes only. The author of this guidebook holds no responsibility for any software or hardware damage that may occur as a result of using any part of this guide. Users are encouraged to exercise caution and use their discretion when applying any code or techniques described in this guide. It is essential to test the code in a safe and controlled environment before implementing it in any critical or production settings. Furthermore, while every effort has been made to ensure the accuracy and reliability of the information provided, the author does not guarantee the correctness or completeness of the content. The author shall not be held liable for any direct, indirect, incidental, consequential, or special damages arising out of or in any way connected with the use of this guidebook.
Users are advised to consult with appropriate experts and professionals in their respective fields before applying any concepts or practices described in this guidebook to ensure compliance with all relevant laws, regulations, and best practices. By using this guidebook, you acknowledge and agree to the terms of this disclaimer and assume all risks associated with its use. If you do not agree with the terms of this disclaimer, it is advised not to use the information provided in this guidebook.
Abstract
Are you looking to level up your data visualization skills in Python? Look no further! Introducing the ultimate guide to 101 visualizations using just one dataset. Whether you're a beginner or an experienced data scientist, this comprehensive guide will take you on a visual journey through various plotting techniques, insights, and patterns that can be explored with Python and a single dataset.
In this guide, we cover everything you need to know to create impactful visualizations and effectively communicate your data-driven stories. From basic bar charts to intricate heatmaps, we've got you covered. The dataset we'll be using comprises diverse attributes, including purchase details, customer demographics, product categories, and more. Highlights of the Guide:
• Basics of Data Visualization in Python
• Line Plots: Uncovering Trends Over Time
• Bar Charts: Comparing Categories and Quantities
• Pie Charts: Analyzing Proportions
• Scatter Plots: Identifying Relationships
• Box Plots: Understanding Data Distribution
• Heatmaps: Visualizing Correlations
• Word Clouds: Exploring Textual Data
... and many more!
So grab your Python toolkit and embark on this exciting data visualization adventure! By the end of this guide, you'll have mastered various visualization techniques and gained invaluable insights from a single dataset.
1- Getting Started with Data Visualization
• Why Data Visualization Matters
Data visualization matters because it is a powerful tool that allows us to comprehend complex data and extract meaningful insights quickly and effectively. Through the use of graphical representations, data visualization transforms raw numbers and statistics into visual patterns, trends, and relationships, making it easier for individuals to understand and interpret the information.
Here are the key reasons why data visualization matters and how it enhances our understanding of data:
• Enhanced Comprehension: Humans are visual creatures, and we process visual information more efficiently than raw data. Visualizations provide a clear and concise representation of data, making it easier for users to grasp the main message, spot patterns, and identify outliers.
• Patterns and Trends Identification: Visualizations help reveal patterns, trends, and correlations that may not be apparent in tabular data. By observing data visually, we can detect relationships and insights that might otherwise go unnoticed.
• Storytelling and Communication: Visualizations have the power to tell a compelling data-driven story. They enable data analysts and communicators to present findings in a captivating and persuasive manner, making complex information accessible to a broader audience.
• Decision-Making and Insights: Well-designed visualizations provide valuable insights that lead to informed decision-making. They help businesses identify opportunities, optimize processes, and address challenges by presenting data in a way that facilitates critical thinking.
• Data Validation and Quality Assessment: Data visualizations aid in data validation by allowing us to identify errors, anomalies, and inconsistencies in the dataset. Visualizations can act as a data quality check, ensuring that data used for analysis is accurate and reliable.
• Interactivity and Exploration: Interactive visualizations empower users to explore data from different angles, drill down into specific details, and customize views based on their interests. This hands-on exploration fosters a deeper understanding of the data.
• Identifying Outliers and Anomalies: Visualizations make it easier to spot outliers and anomalies that may require further investigation. These unexpected data points may hold crucial information or indicate potential errors in data collection.
• Comparison and Benchmarking: Visualizations facilitate easy comparison between different datasets, groups, or time periods. They enable benchmarking against previous performance or competitors, aiding in setting realistic goals and targets.
• Effective Reporting: Data visualizations are vital for creating engaging and informative reports. A well-crafted visualization can convey the key findings quickly, saving time and effort for both creators and readers.
• Public Understanding: In fields such as science, public health, and social issues, data visualizations play a crucial role in presenting complex information to the general public. They help bridge the gap between technical expertise and public understanding, fostering better-informed decisions and policies.
In conclusion, data visualization matters because it transforms data into actionable insights, fosters better decision-making, and enables effective communication of complex information. It empowers individuals and organizations to explore, understand, and leverage the power of data, driving innovation and progress across various domains.
• Effective Data Visualization: Choosing the Right Visualizations for Quantitative and Qualitative Data
Data visualization plays a critical role in understanding and communicating insights from data. With the vast amount of information available, choosing the right visualization techniques is essential to effectively represent quantitative and qualitative data. In this guide, we explore recommended visualization types for both quantitative and qualitative data, highlighting their strengths and best use cases. Whether you are analyzing numerical values or categorical labels, understanding the appropriate visualization techniques can significantly enhance the understanding and impact of your data analysis. Join us as we delve into the world of data visualization and discover the power of visual storytelling with data.
For quantitative data, which represents numerical values, there are several recommended visualization types depending on the specific characteristics of the data and the insights you want to convey. Here are some commonly used visualization types and the reasons for their recommendation:
Quantitative Data Visualization:
• Histograms: Histograms are useful for visualizing the distribution of a single quantitative variable. They display the frequency or count of data points in predefined bins or intervals. Histograms are great for identifying patterns such as skewness, central tendency, and the presence of outliers.
• Box Plots (Box-and-Whisker Plots): Box plots provide a concise summary of the distribution's central tendency, spread, and skewness. They show the median, quartiles, and possible outliers, making them ideal for comparing multiple quantitative variables or groups.
• Scatter Plots: Scatter plots are excellent for visualizing the relationship between two quantitative variables. They help identify correlations, clusters, and patterns in the data. Scatter plots are valuable for discovering any potential linear or nonlinear relationships.
• Line Charts: Line charts are commonly used to show trends and changes in data over time. They connect data points with straight lines, making them effective for visualizing time series data or any data with a continuous x-axis.
• Bar Charts: While often used for categorical data, bar charts can also display quantitative data when categories are grouped into intervals. This can be helpful for summarizing discrete quantitative data or comparing different ranges.
• Area Charts: Area charts are similar to line charts but represent the area under the line. They are useful for visualizing accumulated quantities over time or displaying stacked data.
• Heatmaps: Heatmaps are helpful for showing the intensity of a relationship between two quantitative variables. They use colors to represent data values and are effective for large datasets.
For qualitative data, which represents categories or labels, different visualization types are recommended to effectively communicate insights. Here are some commonly used visualization types and their advantages for qualitative data:
Qualitative Data Visualization:
• Bar Charts: Bar charts are one of the most common ways to display qualitative data. They show the frequency or count of each category, making it easy to compare different categories.
• Pie Charts: Pie charts are useful for showing the composition or proportion of different categories within a whole. However, they are best used when the number of categories is relatively small (typically less than 5-6) to avoid clutter.
• Stacked Bar Charts: Stacked bar charts display the composition of a single variable as a whole, showing how each category contributes to the total. They are effective for comparing multiple qualitative variables or categories.
• Donut Charts: Donut charts are a variation of pie charts with a hole in the center. They can be used to show the same information as pie charts while offering more space for annotations or additional data.
• Word Clouds: Word clouds visually represent the frequency of words or terms in a text dataset. They are often used to highlight the most common terms or topics.
• Stacked Area Charts: Stacked area charts show the evolution of different qualitative categories over time, displaying how each category contributes to the whole.
• Chord Diagrams: Chord diagrams are used to visualize relationships between different categories or groups. They are useful for demonstrating connections and flows between entities.
When choosing the right visualization type, it is essential to consider the nature of the data and the story you want to tell. Visualization should be clear, informative, and tailored to the audience to effectively communicate insights and patterns in the data.
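The guidance in this section can be condensed into a rough rule of thumb. The sketch below is illustrative only and is not part of the original guide: `suggest_chart` is a hypothetical helper that inspects a pandas Series and proposes a starting chart type, and the default cutoff of five categories for pie charts is an assumption based on the pie chart note in this section.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype, is_numeric_dtype


def suggest_chart(series: pd.Series, max_pie_slices: int = 5) -> str:
    """Suggest a starting chart type for one column (hypothetical helper)."""
    if is_datetime64_any_dtype(series):
        return "line chart"  # dates usually serve as the x-axis of a trend
    if is_numeric_dtype(series):
        return "histogram"  # distribution of one quantitative variable
    # Anything else is treated as qualitative (categories or labels).
    if series.nunique() <= max_pie_slices:
        return "pie chart or bar chart"
    return "bar chart"


# Tiny stand-in table using column names from the guide's dataset.
sample = pd.DataFrame({
    "Total_Sales": [120.5, 80.0, 310.2, 45.9],
    "Customer_Type": ["New Customer", "Returning Customer",
                      "New Customer", "New Customer"],
    "Purchase_Date": pd.to_datetime(
        ["2020-01-05", "2020-02-11", "2020-03-02", "2020-03-28"]),
})

for column in sample.columns:
    print(column, "->", suggest_chart(sample[column]))
```

A helper like this only narrows the choice; the story you want to tell and the audience still decide the final chart.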
• Python Libraries for Data Visualization
Python offers a variety of powerful libraries for data visualization that cater to different user needs and preferences. Each library has its strengths and weaknesses, making it important to choose the right one based on the specific visualization requirements. Below are some of the most popular Python libraries for data visualization:
• Matplotlib: Matplotlib is one of the oldest and most widely used data visualization libraries in Python. It provides a flexible and comprehensive set of tools for creating static, interactive, and animated visualizations. While it requires more code for complex plots, Matplotlib's versatility makes it suitable for a wide range of visualization tasks.
• Seaborn: Seaborn is built on top of Matplotlib and provides a high-level interface for creating attractive and informative statistical graphics. It simplifies the creation of complex visualizations, such as violin plots, pair plots, and correlation heatmaps, by providing convenient APIs. Seaborn is particularly useful for exploratory data analysis and works well with pandas DataFrames.
• Plotly: Plotly is a popular library for creating interactive and web-based visualizations. It supports a wide range of chart types, including line charts, bar charts, scatter plots, and more. Plotly visualizations can be embedded in web applications or shared as standalone HTML files. It also has APIs for JavaScript, R, and other programming languages.
• Pandas Plot: Pandas, a popular data manipulation library, also provides a simple plotting API for DataFrames and Series. While not as feature-rich as Matplotlib or Seaborn, it is convenient for quick exploratory visualizations directly from pandas data structures.
• Bokeh: Bokeh is another library focused on interactive visualizations for web applications. It allows the creation of interactive plots with smooth zooming and panning. Bokeh provides both low-level and high-level APIs, making it suitable for both beginners and advanced users.
• Altair: Altair is a declarative statistical visualization library based on the Vega-Lite specification. It enables the creation of visualizations using concise and intuitive Python code. Altair generates interactive visualizations and can be easily customized and extended.
• GeoPandas and Folium: GeoPandas and Folium are specialized libraries for geographic data visualization. GeoPandas allows working with geospatial data (e.g., shapefiles) and integrates with Matplotlib for visualizations. Folium is focused on creating interactive maps and works well with Jupyter Notebooks.
• WordCloud: WordCloud is used to create word clouds from text data. It is often employed for visualizing word frequency and popularity in textual datasets.
• HoloViews: HoloViews is a high-level data visualization library that allows creating complex visualizations with minimal code. It provides a wide range of visual elements and automatically handles aspects like axes, legends, and color bars.
These libraries, each with its unique strengths and characteristics, provide Python users with a broad range of options for creating compelling, insightful, and interactive data visualizations. The choice of library depends on the specific use case, the complexity of visualizations required, and personal preferences for coding style and interactivity.
• Installing Required Libraries
To install the required Python libraries for data visualization, you can use either pip or conda, depending on whether you use a standard Python distribution or Anaconda. Below are the detailed steps for installing libraries using both methods:

Using pip (Standard Python Distribution):
• Step 1: Open a command prompt or terminal on your computer.
• Step 2: Ensure that you have Python installed. You can check your Python version by running:
python --version
• Step 3: Update pip to the latest version (optional but recommended):
pip install --upgrade pip
• Step 4: Install the required libraries. For data visualization, you might want to install libraries like Matplotlib, Seaborn, Plotly, and others. For example, to install Matplotlib and Seaborn, run:
pip install matplotlib seaborn
Replace matplotlib seaborn with the names of other libraries you want to install.

Using conda (Anaconda Distribution):
• Step 1: Open Anaconda Navigator or Anaconda Prompt.
• Step 2: If you are using Anaconda Navigator, go to the "Environments" tab, select the desired environment, and click on "Open Terminal."
• Step 3: If you are using Anaconda Prompt, activate the desired environment by running:
conda activate your_environment_name
Replace your_environment_name with the name of your desired environment. If you want to install libraries in the base environment, skip this step.
• Step 4: Install the required libraries. For data visualization, you can use conda to install libraries like Matplotlib, Seaborn, Plotly, and others. For example, to install Matplotlib and Seaborn, run:
conda install matplotlib seaborn
Replace matplotlib seaborn with the names of other libraries you want to install.
• Step 5: If a library is not available through conda, you can use pip within your conda environment. For example, to install Plotly, run:
pip install plotly
After running the installation commands, the specified libraries and their dependencies will be downloaded and installed on your system. You can then use these libraries in your Python scripts or Jupyter Notebooks for data visualization and analysis. Note: If you are using Jupyter Notebooks, make sure to install the libraries within the same Python environment that your Jupyter Notebook is using to avoid compatibility issues. If you are using Anaconda, it is recommended to create a separate environment for each project to manage library dependencies effectively.
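To confirm that the installation landed in the environment your script or notebook actually uses, you can query package metadata from inside Python. The following is a minimal sketch using only the standard library; the helper name `installed_versions` and the list of packages checked are illustrative assumptions, not part of the guide.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Distribution names as used with pip; adjust the list to your project.
report = installed_versions(["matplotlib", "seaborn", "plotly", "pandas"])
for name, ver in report.items():
    print(f"{name}: {ver if ver else 'not installed'}")
```

Running this in a Jupyter cell is a quick way to spot the case where a library was installed into a different environment than the one backing the notebook kernel.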
2- Generating Synthetic Dataset with Faker
• Installing Faker Library
The following steps install the Faker library on Windows 10 with the Anaconda distribution:
• Open Anaconda Prompt: Click on the Windows Start button, type "Anaconda Prompt," and open the Anaconda Prompt application.
• Activate Environment (Optional): If you want to install Faker in a specific conda environment, activate that environment using the following command:
conda activate your_environment_name
Replace your_environment_name with the name of your desired environment.
• Install Faker: In the Anaconda Prompt, type the following command to install the Faker library:
pip install Faker
• Wait for Installation: The installation process will begin, and the required packages will be downloaded and installed.
• Verify Installation (Optional): To verify that Faker is installed correctly, you can open a Python interpreter or a Jupyter Notebook and try importing the library:
import faker
If there are no errors, the Faker library is successfully installed.
That's it! You have now installed the Faker library on your Windows 10 machine using the Anaconda distribution. You can use Faker to generate synthetic data for testing, prototyping, or learning purposes. Remember that Faker-generated data is not a substitute for real data, and it is essential to use real data for any serious analysis or application.
• Generating Synthetic Sales Data
To generate a synthetic dataset with the Faker library for the 101 visualization examples in this guide, we'll create a Python script that generates random data for the specified columns. Since Faker generates random data, keep in mind that this dataset will be artificial and not representative of any real-world data. First, make sure you have installed the Faker library. You can install it using pip:
bash code
pip install Faker
Let's generate the dataset with the required columns:
python code
import pandas as pd
import random
from faker import Faker
from datetime import datetime, timedelta

# Set random seeds for reproducibility (Faker keeps its own generator)
random.seed(42)
Faker.seed(42)

# Initialize Faker and other necessary variables
fake = Faker()
start_date = datetime(2020, 1, 1)
end_date = datetime(2023, 1, 1)  # last digit of the year is illegible in the source; 2023 is assumed

# Create empty lists to store the generated data
order_ids = []
customer_ids = []
product_ids = []
purchase_dates = []
product_categories = []
quantities = []
total_sales = []
genders = []
marital_statuses = []
price_per_unit = []
customer_types = []
ages = []  # New list to store ages

# Number of rows (data points) to generate
num_rows = 10000

# Generate the dataset
for _ in range(num_rows):
    # The order and customer ID lines are missing in the source; UUIDs are assumed, as for product IDs
    order_ids.append(fake.uuid4())
    customer_ids.append(fake.uuid4())
    product_ids.append(fake.uuid4())
    purchase_date = start_date + timedelta(days=random.randint(0, (end_date - start_date).days))
    purchase_dates.append(purchase_date)
    product_categories.append(fake.random_element(elements=('Electronics', 'Clothing', 'Books', 'Home', 'Beauty')))
    quantities.append(random.randint(1, 10))  # lower bound illegible in the source; 1 assumed
    total_sales.append(random.uniform(10, 500))  # bounds illegible in the source; 10 and 500 assumed
    genders.append(fake.random_element(elements=('Male', 'Female')))  # Only 'Male' and 'Female' will be added
    marital_statuses.append(fake.random_element(elements=('Single', 'Married', 'Divorced', 'Widowed')))
    price_per_unit.append(random.uniform(5, 500))  # bounds illegible in the source; 5 and 500 assumed
    customer_types.append(fake.random_element(elements=('New Customer', 'Returning Customer')))
    ages.append(random.randint(18, 80))  # Generate random ages between 18 and 80

# Create a DataFrame from the generated lists
df = pd.DataFrame({
    'Order_ID': order_ids,
    'Customer_ID': customer_ids,
    'Product_ID': product_ids,
    'Purchase_Date': purchase_dates,
    'Product_Category': product_categories,
    'Quantity': quantities,
    'Total_Sales': total_sales,
    'Gender': genders,
    'Marital_Status': marital_statuses,
    'Price_Per_Unit': price_per_unit,
    'Customer_Type': customer_types,
    'Age': ages  # Add the 'Age' column to the DataFrame
})

# Save the DataFrame to a CSV file
df.to_csv('ecommerce_sales.csv', index=False)

# Display the first few rows of the generated dataset
print(df.head())
This code will generate a DataFrame with the specified columns 'Order_ID', 'Customer_ID', 'Product_ID', 'Purchase_Date', 'Product_Category', 'Quantity', and 'Total_Sales', etc. You can now use this generated dataset for data visualization and analysis and apply the previous 101 visualization examples on it. Remember that this dataset is synthetic and should only be used for learning or testing purposes. For real-world analysis, it's essential to use genuine and representative data.
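One property of the generation script is worth keeping in mind when reading the later charts: the total sales, quantity, and price per unit values are each drawn independently, so a row's total is generally not equal to quantity times unit price. If you would rather have internally consistent rows, the following sketch shows one possible variation. It is not the author's method; the function name `make_consistent_row` and the value ranges are illustrative assumptions.

```python
import random


def make_consistent_row(rng: random.Random) -> dict:
    """Build one synthetic row whose total equals quantity * unit price."""
    quantity = rng.randint(1, 10)                    # assumed range
    price_per_unit = round(rng.uniform(5, 500), 2)   # assumed range
    return {
        "Quantity": quantity,
        "Price_Per_Unit": price_per_unit,
        "Total_Sales": round(quantity * price_per_unit, 2),
    }


rng = random.Random(42)  # a private generator keeps the example reproducible
rows = [make_consistent_row(rng) for _ in range(5)]
for row in rows:
    print(row)
```

With rows built this way, a scatter plot of total sales against quantity would show a visible upward relationship, whereas independently drawn values produce an unstructured cloud.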
3- Visualizations with Python and Matplotlib
• 101 different visualizations for the loaded dataset
The ecommerce sales dataset was generated with a synthetic data approach, using the Python Faker library:
• File name: ecommerce_sales.csv
• Number of entries: 10,000
Dataset description:
The Ecommerce Sales Dataset presented in this file, "ecommerce_sales.csv," has been generated using a Synthetic Data Approach in Python with the Faker library. This dataset serves as a simulated representation of an online retail store's sales transactions and customer interactions, intended solely for educational and illustrative purposes. With a total of 10,000 entries, each row in the dataset encapsulates a unique synthetic sales transaction. The data encompasses diverse attributes, including Order ID, Customer ID, Product ID, Purchase Date, Product Category, Quantity, Total Sales Amount, Gender, Age, Age Group, Marital Status, Price Per Unit, and Customer Type.
While the information bears resemblance to authentic ecommerce sales data, it is important to emphasize that this dataset does not originate from any real-world transactions or establishments. As such, it does not represent any actual customer behaviors, market trends, or business performance.
This synthetic dataset provides an invaluable resource for learners and data enthusiasts, offering an opportunity to practice data analysis techniques, explore various visualization methods, and develop Python programming skills. However, it is crucial to recognize that this data should not be used for making genuine business decisions or drawing conclusions that impact real-world scenarios.
As users engage with the Synthetic Ecommerce Sales Dataset, they are encouraged to apply their knowledge to real, reliable datasets from trusted sources when conducting actual data analysis and decision-making in professional settings. By doing so, learners can harness the power of data to drive meaningful insights and contribute to data-driven strategies and optimizations in the realm of ecommerce and beyond.
The synthetic data generated using the Faker library in this guidebook is intended for educational and illustrative purposes only. The data is entirely fictional and does not represent any real-world observations or trends. While Faker provides realistic-looking data, it is not sourced from actual observations or events. Therefore, any conclusions or inferences drawn from this synthetic data should be treated with caution and not be used as a basis for real decision-making or analysis.
The purpose of using Faker-generated data is to demonstrate data visualization techniques and provide a platform for practicing Python data analysis skills. As a learner, it is encouraged to explore and experiment with the data to understand the visualization concepts better. However, for practical and meaningful data analysis, it is essential to work with authentic, representative, and relevant datasets obtained from reliable sources. Always validate and use real data when conducting actual analyses and drawing conclusions in professional settings.
Remember, the synthetic data generated by Faker is a valuable tool for learning, experimentation, and practicing data analysis techniques. It serves as a stepping stone to building proficiency in Python data visualization and exploring various visualization libraries and tools. As you progress, apply your skills to analyze genuine datasets to gain valuable insights and contribute to data-driven decision-making in real-world scenarios.

101 Visualizations with Python: Code and Comprehensive Explanations for Every Visualization
In this comprehensive guidebook, embark on an illuminating journey through 101 data visualizations, each expertly crafted using Python code. Delve into the depths of data analysis and visualization as you unravel the intricacies of each visualization's purpose and design. With a rich assortment of visualizations at your disposal, learn how to manipulate data, identify patterns, and communicate insights effectively. From classic bar charts and line plots to sophisticated 3D visualizations and interactive dashboards, this guidebook equips you with the tools to become a master of data storytelling.
Each visualization is accompanied by a detailed explanation, offering valuable insights into data interpretation, visualization techniques, and best practices. Discover how to leverage Python's versatile libraries such as Matplotlib, Seaborn, Plotly, and more, to create captivating visual narratives. Whether you're a data enthusiast, aspiring analyst, or seasoned professional, "101 Distinctive Visualizations with Python" empowers you to wield the power of Python to distill complex data into actionable insights. Unlock the art of data visualization and elevate your data-driven decision-making with this immersive and educational guidebook.
IMPORTANT NOTE: Each visual is presented in the following order: Visual #, Title, explanation, Python code, and graph.
Python code
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
from wordcloud import WordCloud

# Load the dataset
df = pd.read_csv('ecommerce_sales.csv')
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype
 0   Order_ID          10000 non-null  object
 1   Customer_ID       10000 non-null  object
 2   Product_ID        10000 non-null  object
 3   Purchase_Date     10000 non-null  object
 4   Product_Category  10000 non-null  object
 5   Quantity          10000 non-null  int64
 6   Total_Sales       10000 non-null  float64
 7   Gender            10000 non-null  object
 8   Marital_Status    10000 non-null  object
 9   Price_Per_Unit    10000 non-null  float64
 10  Customer_Type     10000 non-null  object
 11  Age               10000 non-null  int64
 12  Age_Group         10000 non-null  category
dtypes: category(1), float64(2), int64(2), object(8)
memory usage: 947.6+ KB
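The summary lists an Age_Group column of dtype category that the generation script does not create, so it was presumably derived from Age after loading. The guide does not show that step. The following is a sketch of one way to do it with `pandas.cut`; the bin edges and labels are assumptions, not the author's values.

```python
import pandas as pd

# Small stand-in for the 'Age' column of the loaded dataset.
df = pd.DataFrame({"Age": [18, 25, 34, 47, 59, 63, 80]})

# Assumed bin edges and labels. pd.cut uses right-closed bins by default,
# so the first bin (17, 29] covers ages 18 through 29.
bins = [17, 29, 44, 59, 80]
labels = ["18-29", "30-44", "45-59", "60-80"]
df["Age_Group"] = pd.cut(df["Age"], bins=bins, labels=labels)

print(df)
print(df["Age_Group"].dtype)
```

Because `pd.cut` returns a categorical column, the groups keep their natural order in plots instead of being sorted alphabetically.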
Visualization 1: Bar Chart - Top 10 Product Categories by Sales:
Visualizing the total sales for the top 10 product categories using a bar chart.

# Visualization 1: Bar Chart - Top 10 Product Categories by Sales
top_10_categories = df.groupby('Product_Category')['Total_Sales'].sum().nlargest(10)
plt.bar(top_10_categories.index, top_10_categories.values)
plt.xticks(rotation=45)
plt.xlabel('Product Category')
plt.ylabel('Total Sales')
plt.title('Top 10 Product Categories by Sales')
plt.show()

[Figure: bar chart "Top 10 Product Categories by Sales"; x-axis: Product Category]
Visualization 2: Pie Chart - Sales Distribution by Customer Type:
Displaying the percentage distribution of sales among different customer types using a pie chart.

# Visualization 2: Pie Chart - Sales Distribution by Customer Type
plt.show()

[Figure: pie chart "Sales Distribution by Customer Type" with slices New Customer and Returning Customer; a 49.9% label is visible]
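Only the comment line and the closing plt.show() of the Visualization 2 listing are present above. The following is a minimal sketch of how such a pie chart might be built. It is an assumption, not the author's code, and it uses a tiny stand-in DataFrame instead of ecommerce_sales.csv so that it runs on its own.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line in Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Small stand-in for ecommerce_sales.csv (same column names as the guide).
df = pd.DataFrame({
    "Customer_Type": ["New Customer", "Returning Customer",
                      "New Customer", "Returning Customer"],
    "Total_Sales": [100.0, 150.0, 200.0, 50.0],
})

# Sum sales per customer type; the pie chart converts the sums to shares.
sales_by_type = df.groupby("Customer_Type")["Total_Sales"].sum()

fig, ax = plt.subplots(figsize=(6, 6))
ax.pie(sales_by_type.values, labels=sales_by_type.index, autopct="%1.1f%%")
ax.set_title("Sales Distribution by Customer Type")
```

In an interactive session, finish with plt.show(), and replace the stand-in DataFrame with the one loaded from ecommerce_sales.csv.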
Visualization 3: Histogram - Sales Amount Distribution:
Analyzing the distribution of total sales amounts using a histogram.

# Visualization 3: Histogram - Sales Amount Distribution
plt.show()
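The body of the Visualization 3 listing is likewise absent above. As a hedged sketch only, a histogram of sales amounts could be drawn as follows. The bin count of 20, the styling, and the stand-in data (uniform values between 10 and 500) are assumptions rather than the author's choices.

```python
import random

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line in Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in data: 200 synthetic sales amounts from a seeded generator.
rng = random.Random(0)
df = pd.DataFrame({"Total_Sales": [rng.uniform(10, 500) for _ in range(200)]})

fig, ax = plt.subplots(figsize=(10, 6))
# hist returns the count per bin and the bin edges alongside the drawn bars.
counts, bin_edges, _ = ax.hist(df["Total_Sales"], bins=20, edgecolor="black")
ax.set_xlabel("Total Sales")
ax.set_ylabel("Frequency")
ax.set_title("Sales Amount Distribution")

print(f"{len(counts)} bins covering {bin_edges[0]:.1f} to {bin_edges[-1]:.1f}")
```

Since the guide's totals are drawn from a uniform distribution, the real histogram should look roughly flat; any strong peak would point to a problem in the generated file.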
Visualization 4: Box Plot - Sales Amount by Product Category:
Comparing the sales distribution across different product categories using a box plot.

# Visualization 4: Box Plot - Sales Amount by Product Category
plt.figure(figsize=(10, 6))
sns.boxplot(x='Product_Category', y='Total_Sales', data=df)
plt.xticks(rotation=45)
plt.xlabel('Product Category')
plt.ylabel('Total Sales')
plt.title('Sales Amount by Product Category')
plt.show()

[Figure: box plot "Sales Amount by Product Category"; x-axis: Product Category; y-axis ticks from 0 to 500]
Visualization 5: Count Plot - Sales by Month
Visualizing the number of sales made in each month using a count plot.

# Visualization 5: Count Plot - Sales by Month
df['Purchase_Date'] = pd.to_datetime(df['Purchase_Date'])
df['Month'] = df['Purchase_Date'].dt.month
plt.figure(figsize=(10, 6))
sns.countplot(x='Month', data=df)
plt.xlabel('Month')
plt.ylabel('Sales Count')
plt.title('Sales by Month')
plt.show()
Visualization 6: Scatter Plot - Total Sales vs. Quantity
Exploring the relationship between total sales and quantity purchased using a scatter plot.

plt.show()
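A hedged sketch of a scatter plot in the spirit of Visualization 6. The data values are invented for illustration; `Quantity` and `Total_Sales` are the column names used elsewhere in the guide.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the guide's synthetic DataFrame `df`.
df = pd.DataFrame({
    "Quantity": [1, 2, 3, 4, 5],
    "Total_Sales": [20.0, 35.0, 90.0, 40.0, 125.0],
})

plt.figure(figsize=(8, 6))
# alpha < 1 keeps overlapping points distinguishable on dense data.
points = plt.scatter(df["Quantity"], df["Total_Sales"], alpha=0.6)
plt.xlabel("Quantity")
plt.ylabel("Total Sales")
plt.title("Total Sales vs. Quantity")
plt.show()
```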
Visualization 7: Line Plot - Monthly Sales Trend
Displaying the monthly sales trend over time using a line plot.

# Visualization 7: Line Plot - Monthly Sales Trend
# 'M' resamples to month-end totals (pandas 2.2+ prefers the alias 'ME')
monthly_sales = df.resample('M', on='Purchase_Date')['Total_Sales'].sum()
plt.plot(monthly_sales)
plt.xlabel('Date')
plt.ylabel('Monthly Sales')
plt.title('Monthly Sales Trend')
plt.show()
Visualization 8: Stacked Bar Chart - Sales by Customer Type and Product Category
Visualizing the sales breakdown by customer type and product category using a stacked bar chart.

# Visualization 8: Stacked Bar Chart - Sales by Customer Type and Product Category
customer_category_sales = df.pivot_table(index='Customer_Type',
                                         columns='Product_Category',
                                         values='Total_Sales', aggfunc='sum')
customer_category_sales.plot(kind='bar', stacked=True, figsize=(10, 6))
plt.xlabel('Customer Type')
plt.ylabel('Total Sales')
plt.title('Sales by Customer Type and Product Category')
plt.legend(title='Product Category', bbox_to_anchor=(1, 1))
plt.show()
Visualization 9: Pair Plot - Correlation Between Numeric Features
Plotting pairwise relationships between numeric features, such as total sales, quantity, and price per unit, using a pair plot.

# Visualization 9: Pair Plot - Correlation Between Numeric Features
sns.pairplot(df[['Total_Sales', 'Quantity', 'Price_Per_Unit']])
plt.suptitle('Correlation Between Numeric Features', y=1.02)
plt.show()
Visualization 10: Heatmap - Correlation Matrix
Displaying the correlation matrix between different numeric features using a heatmap.

# Visualization 10: Heatmap - Correlation Matrix
# numeric_only=True (pandas 1.5+) restricts the correlation to numeric columns;
# pandas 2.x raises an error if text columns are passed to corr().
correlation_matrix = df.corr(numeric_only=True)
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
Visualization 11: Violin Plot - Sales Distribution by Customer Type
Comparing the distribution of sales amounts for different customer types using a violin plot.

# Visualization 11: Violin Plot - Sales Distribution by Customer Type
plt.figure(figsize=(8, 6))
sns.violinplot(x='Customer_Type', y='Total_Sales', data=df)
plt.xlabel('Customer Type')
plt.ylabel('Total Sales')
plt.title('Sales Distribution by Customer Type')
plt.show()

[Figure: violin plot "Sales Distribution by Customer Type"; x-axis: Customer Type (Returning Customer, New Customer)]
Visualization 12: Joint Plot - Total Sales vs. Price Per Unit
Exploring the relationship between total sales and price per unit using a joint plot.

# Visualization 12: Joint Plot - Total Sales vs. Price Per Unit
sns.jointplot(x='Total_Sales', y='Price_Per_Unit', data=df)
plt.xlabel('Total Sales')
plt.ylabel('Price Per Unit')
plt.title('Total Sales vs. Price Per Unit')
plt.show()
Visualization 13: Stacked Area Chart - Sales Over Time
Visualizing the sales trend over time using a stacked area chart.

# Visualization 13: Stacked Area Chart - Sales Over Time
monthly_sales = df.resample('M', on='Purchase_Date')['Total_Sales'].sum()
monthly_sales.plot(kind='area', figsize=(10, 6))
plt.xlabel('Date')
plt.ylabel('Monthly Sales')
plt.title('Sales Over Time')
plt.show()
Visualization 14: Box Plot - Sales Amount by Customer Type
Comparing the sales distribution across different customer types using a box plot.

# Visualization 14: Box Plot - Sales Amount by Customer Type
plt.figure(figsize=(8, 6))
sns.boxplot(x='Customer_Type', y='Total_Sales', data=df)
plt.xlabel('Customer Type')
plt.ylabel('Total Sales')
plt.title('Sales Amount by Customer Type')
plt.show()
Visualization 15: KDE Plot - Sales Distribution
Displaying the kernel density estimate of the sales distribution using a KDE plot.

# Visualization 15: KDE Plot - Sales Distribution
# fill=True shades the area under the curve (it replaces the deprecated shade=True)
sns.kdeplot(df['Total_Sales'], fill=True)
plt.xlabel('Total Sales')
plt.ylabel('Density')
plt.title('Sales Distribution')
plt.show()
Visualization 16: Barh Plot - Top 10 Customers by Total Sales
Visualizing the total sales for the top 10 customers using a horizontal bar chart.

# Visualization 16: Barh Plot - Top 10 Customers by Total Sales
top_10_customers = df.groupby('Customer_ID')['Total_Sales'].sum().nlargest(10)
plt.show()

[Figure: horizontal bar chart "Top 10 Customers by Total Sales"]
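The plotting step for a horizontal bar chart like Visualization 16 might look as follows. This is a sketch under assumptions: the customer IDs and amounts are invented, `nlargest(3)` is used only because the sample is tiny, and sorting ascending before plotting is a design choice (it puts the largest bar at the top), not something the guide states.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the guide's synthetic DataFrame `df`.
df = pd.DataFrame({
    "Customer_ID": [101, 102, 101, 103, 102, 104],
    "Total_Sales": [50.0, 20.0, 25.0, 90.0, 10.0, 5.0],
})

# Total sales per customer, keep the top 3, then sort ascending
# so that barh draws the largest value at the top of the chart.
top_customers = (
    df.groupby("Customer_ID")["Total_Sales"].sum().nlargest(3).sort_values()
)

plt.figure(figsize=(10, 6))
# Cast IDs to strings so they are treated as category labels, not numeric positions.
plt.barh(top_customers.index.astype(str), top_customers.values)
plt.xlabel("Total Sales")
plt.ylabel("Customer ID")
plt.title("Top Customers by Total Sales")
plt.show()
```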
Visualization 17: Pairwise Scatter Plots - Multiple Features
Creating pairwise scatter plots to explore relationships between multiple features, colored by product category.

# Visualization 17: Pairwise Scatter Plots - Multiple Features
sns.pairplot(df[['Total_Sales', 'Quantity', 'Price_Per_Unit', 'Product_Category']],
             hue='Product_Category')
plt.suptitle('Pairwise Scatter Plots', y=0.95)  # Title near the top of the figure
plt.show()
Visualization 18: Clustered Bar Chart - Sales by Product Category and Customer Type
Displaying sales counts based on product category and customer type using a clustered bar chart.

# Visualization 18: Clustered Bar Chart - Sales by Product Category and Customer Type
plt.figure(figsize=(10, 6))
plt.show()

[Figure: clustered bar chart "Sales by Product Category and Customer Type"; x-axis: Product Category]
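A clustered (side-by-side) bar chart of sales counts can be sketched with `pd.crosstab`, as below. This is a swapped-in approach, not necessarily the one the guide used (a seaborn count plot with a `hue` argument would be another option), and the sample rows are invented.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the guide's synthetic DataFrame `df`.
df = pd.DataFrame({
    "Product_Category": ["Books", "Books", "Beauty", "Electronics", "Beauty", "Books"],
    "Customer_Type": ["New Customer", "Returning Customer", "New Customer",
                      "New Customer", "Returning Customer", "New Customer"],
})

# Count rows (sales) for every category / customer-type pair.
category_type_counts = pd.crosstab(df["Product_Category"], df["Customer_Type"])

# Without stacked=True, pandas draws one bar per column side by side (clustered).
ax = category_type_counts.plot(kind="bar", figsize=(10, 6))
ax.set_xlabel("Product Category")
ax.set_ylabel("Sales Count")
ax.set_title("Sales by Product Category and Customer Type")
plt.xticks(rotation=45)
plt.show()
```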
Visualization 19: Donut Chart - Sales Distribution by Product Category
Illustrating the percentage distribution of sales among different product categories using a donut chart.

# Visualization 19: Donut Chart - Sales Distribution by Product Category
product_category_sales = df.groupby('Product_Category')['Total_Sales'].sum()
plt.pie(product_category_sales, labels=product_category_sales.index,
        autopct='%1.1f%%', wedgeprops=dict(width=0.3))
plt.title('Sales Distribution by Product Category')
plt.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
plt.show()
Visualization 20: Stacked Percentage Bar Chart - Sales by Gender and Product Category
Visualizing the sales breakdown by gender and product category using a stacked percentage bar chart.

# Visualization 20: Stacked Percentage Bar Chart - Sales by Gender and Product Category
gender_category_sales = df.pivot_table(index='Gender', columns='Product_Category',
                                       values='Total_Sales', aggfunc='sum')
gender_category_sales.plot(kind='bar', stacked=True, figsize=(10, 6))
plt.xlabel('Gender')
plt.ylabel('Total Sales')
plt.title('Sales by Gender and Product Category')
plt.legend(title='Product Category', bbox_to_anchor=(1, 1))
plt.show()
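The listing for Visualization 20 plots absolute totals. If each bar should instead sum to 100%, one option is to divide every row of the pivot table by its row total before plotting. The sketch below illustrates that normalization on invented data; the variable name `gender_category_pct` is hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the guide's synthetic DataFrame `df`.
df = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Female"],
    "Product_Category": ["Books", "Beauty", "Books", "Electronics"],
    "Total_Sales": [30.0, 70.0, 50.0, 150.0],
})

gender_category_sales = df.pivot_table(index="Gender", columns="Product_Category",
                                       values="Total_Sales", aggfunc="sum",
                                       fill_value=0)

# Divide each row by its own total so every bar adds up to 100 percent.
row_totals = gender_category_sales.sum(axis=1)
gender_category_pct = gender_category_sales.div(row_totals, axis=0) * 100

gender_category_pct.plot(kind="bar", stacked=True, figsize=(10, 6))
plt.xlabel("Gender")
plt.ylabel("Share of Total Sales (%)")
plt.title("Sales by Gender and Product Category")
plt.legend(title="Product Category", bbox_to_anchor=(1, 1))
plt.show()
```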
Visualization 21: Stacked Bar Chart - Total Sales by Gender and Customer Type
This visualization uses the 'Gender' and 'Customer_Type' attributes to show the total sales for each gender, stacked by the customer type. The legend shows the different customer types represented by the different colors in the bars. This plot helps to understand how sales are distributed among different genders and customer types.

# Visualization 21: Stacked Bar Chart - Total Sales by Gender and Customer Type
plt.figure(figsize=(10, 6))
plt.show()

[Figure: stacked bar chart "Total Sales by Gender and Customer Type"; x-axis: Gender (Male, Female)]
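A possible way to build the stacked bars described for Visualization 21, following the pivot-table pattern used in Visualizations 8 and 20. The data values are invented and the variable name `gender_type_sales` is hypothetical; the guide's own listing may have used a different approach.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the guide's synthetic DataFrame `df`.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female", "Male"],
    "Customer_Type": ["New Customer", "New Customer", "Returning Customer",
                      "Returning Customer", "New Customer"],
    "Total_Sales": [100.0, 60.0, 40.0, 90.0, 10.0],
})

# One row per gender, one column per customer type, cells hold summed sales.
gender_type_sales = df.pivot_table(index="Gender", columns="Customer_Type",
                                   values="Total_Sales", aggfunc="sum")

gender_type_sales.plot(kind="bar", stacked=True, figsize=(10, 6))
plt.xlabel("Gender")
plt.ylabel("Total Sales")
plt.title("Total Sales by Gender and Customer Type")
plt.legend(title="Customer Type")
plt.show()
```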
Visualization 22: Swarm Plot - Sales by Multiple Categories
Displaying sales values for different product categories based on customer type using a swarm plot.

# Visualization 22: Swarm Plot - Sales by Multiple Categories
plt.figure(figsize=(12, 8))
sns.swarmplot(x='Product_Category', y='Total_Sales', hue='Customer_Type', data=df)
plt.xlabel('Product Category')
plt.ylabel('Total Sales')
plt.title('Sales by Multiple Categories')
plt.legend(title='Customer Type', bbox_to_anchor=(1, 1))
plt.show()

[Figure legend: Customer Type (Returning Customer, New Customer)]
Visualization 23: 3D Scatter Plot - Sales, Quantity, and Price Per Unit
Creating a 3D scatter plot to explore the relationships between total sales, quantity, and price per unit.

# Visualization 23: 3D Scatter Plot - Sales, Quantity, and Price Per Unit
fig = plt.figure(figsize=(10, 8))
plt.show()
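A sketch of how the 3D axes for a plot like Visualization 23 could be set up with Matplotlib's `projection="3d"`. The assignment of columns to axes and the sample values are assumptions made for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401  (registers the 3d projection on older Matplotlib)

# Hypothetical stand-in for the guide's synthetic DataFrame `df`.
df = pd.DataFrame({
    "Total_Sales": [20.0, 35.0, 90.0, 40.0, 125.0],
    "Quantity": [1, 2, 3, 4, 5],
    "Price_Per_Unit": [20.0, 17.5, 30.0, 10.0, 25.0],
})

fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection="3d")  # 3D axes instead of the default 2D axes

# One point per sale; each column is mapped to one spatial axis.
ax.scatter(df["Total_Sales"].to_numpy(),
           df["Quantity"].to_numpy(),
           df["Price_Per_Unit"].to_numpy())
ax.set_xlabel("Total Sales")
ax.set_ylabel("Quantity")
ax.set_zlabel("Price Per Unit")
ax.set_title("Sales, Quantity, and Price Per Unit")
plt.show()
```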
Visualization 24: Word Cloud - Most Frequent Product Categories
Generating a word cloud to visualize the most frequent product categories based on their occurrences.

# Visualization 24: Word Cloud - Most Frequent Product Categories
from wordcloud import WordCloud
wordcloud = WordCloud(width=800, height=400
plt.show()

[Figure: word cloud "Word Cloud - Most Frequent Product Categories"; visible words include Beauty, Books, and Electronics]
Visualization 25: Donut Chart - Sales by Age Group
Illustrating the percentage distribution of sales among different age groups using a donut chart.

# Visualization 25: Donut Chart - Sales by Age Group
plt.pie(age_group_sales, labels=age_group_sales.index, autopct='%1.1f%%',
        wedgeprops=dict(width=0.3))
plt.title('Sales by Age Group')
plt.show()
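The `age_group_sales` series used in Visualization 25 could be computed along these lines. The bin edges and the third label (`46-60`) are assumptions, chosen to be compatible with the `18-30` and `31-45` groups that appear in the guide's figures; the sample ages and amounts are invented.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the guide's synthetic DataFrame `df`.
df = pd.DataFrame({
    "Age": [19, 25, 33, 41, 52, 58],
    "Total_Sales": [30.0, 20.0, 45.0, 5.0, 60.0, 40.0],
})

# pd.cut assigns each age to a labeled interval: (17, 30], (30, 45], (45, 60].
df["Age_Group"] = pd.cut(df["Age"], bins=[17, 30, 45, 60],
                         labels=["18-30", "31-45", "46-60"])

# observed=False keeps every defined age group, even one with no sales.
age_group_sales = df.groupby("Age_Group", observed=False)["Total_Sales"].sum()

# wedgeprops width < 1 leaves a hole in the middle, turning the pie into a donut.
plt.pie(age_group_sales, labels=age_group_sales.index, autopct="%1.1f%%",
        wedgeprops=dict(width=0.3))
plt.title("Sales by Age Group")
plt.show()
```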
Visualization 26: Bar Chart - Sales by Country
This bar chart displays the total sales for the top 10 countries with the highest sales in the e-commerce dataset. It helps identify the countries that contribute the most to the company's overall sales.

# Visualization 26: Bar Chart - Sales by Country
country_sales = df.groupby('Age_Group')['Total_Sales'].sum().nlargest(10)
plt.bar(country_sales.index, country_sales.values)
plt.xticks(rotation=45)
plt.xlabel('Age_Group')
plt.ylabel('Total Sales')
plt.title('Top 10 Sales by Age_Group')
plt.show()
Visualization 27: Pie Chart - Sales Distribution by Gender
The pie chart illustrates the distribution of total sales between male and female customers. It provides a visual representation of the proportion of sales attributed to each gender.

# Visualization 27: Pie Chart - Sales Distribution by Gender
gender_sales = df.groupby('Gender')['Total_Sales'].sum()
plt.pie(gender_sales, labels=gender_sales.index, autopct='%1.1f%%')
plt.title('Sales Distribution by Gender')
plt.show()

[Figure: pie chart "Sales Distribution by Gender" with Female and Male slices]
Visualization 28: KDE Plot - Sales Distribution by Product Category
The kernel density estimate (KDE) plot visualizes the distribution of total sales among different product categories. It shows the density of sales values, with each product category represented by a different color.

# Visualization 28: KDE Plot - Sales Distribution by Product Category
plt.figure(figsize=(10, 6))
sns.kdeplot(data=df, x='Total_Sales', hue='Product_Category', fill=True)
plt.xlabel('Total Sales')
plt.ylabel('Density')
plt.title('Sales Distribution by Product Category')
plt.show()
Visualization 29: Box Plot - Sales Amount by Age Group
This box plot compares the distribution of total sales amounts for different age groups of customers. It helps identify if there are any significant differences in sales amounts across age groups.

# Visualization 29: Box Plot - Sales Amount by Age Group
plt.figure(figsize=(8, 6))
sns.boxplot(x='Age_Group', y='Total_Sales', data=df)
plt.xlabel('Age Group')
plt.ylabel('Total Sales')
plt.title('Sales Amount by Age Group')
plt.show()
Visualization 30: Stacked Bar Chart - Sales by Product Category and Payment Method
This stacked bar chart illustrates the sales breakdown by product category and payment method. Each bar represents the total sales for a specific payment method, and the segments within the bar show how those sales are distributed across product categories.

# Visualization 30: Stacked Bar Chart - Sales by Product Category and Payment Method
payment_method_category_sales = df.pivot_table(index='Payment_Method',
                                               columns='Product_Category',
                                               values='Total_Sales', aggfunc='sum')
payment_method_category_sales.plot(kind='bar', stacked=True, figsize=(10, 6))
plt.xlabel('Payment Method')
plt.ylabel('Total Sales')
plt.title('Sales by Product Category and Payment Method')
plt.legend(title='Product Category', bbox_to_anchor=(1, 1))
plt.show()
Visualization 31: Line Plot - Sales by Day of the Week
This line plot shows the total sales trend based on the day of the week. It helps analyze whether certain days of the week have higher or lower sales.

# Visualization 31: Line Plot - Sales by Day of the Week
df['Day_of_Week'] = df['Purchase_Date'].dt.day_name()
day_of_week_sales = df.groupby('Day_of_Week')['Total_Sales'].sum()
day_of_week_sales.plot(kind='line', marker='o', figsize=(10, 6))
plt.xticks(rotation=45)
plt.show()
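Because `groupby` sorts its keys, the day names in Visualization 31 come out in alphabetical order (Friday, Monday, Saturday, ...) rather than calendar order. One way to restore weekday order is to reindex the result, sketched below on invented data; the `weekday_order` list is introduced here for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the guide's synthetic DataFrame `df`.
df = pd.DataFrame({
    "Purchase_Date": pd.to_datetime(["2023-01-02", "2023-01-03", "2023-01-04",
                                     "2023-01-09", "2023-01-08"]),
    "Total_Sales": [10.0, 20.0, 30.0, 5.0, 50.0],
})
df["Day_of_Week"] = df["Purchase_Date"].dt.day_name()

weekday_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]

# reindex puts the days in calendar order; days without sales get 0.
day_of_week_sales = (
    df.groupby("Day_of_Week")["Total_Sales"].sum()
      .reindex(weekday_order, fill_value=0)
)

day_of_week_sales.plot(kind="line", marker="o", figsize=(10, 6))
plt.xlabel("Day of the Week")
plt.ylabel("Total Sales")
plt.title("Sales by Day of the Week")
plt.xticks(rotation=45)
plt.show()
```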
Visualization 32: Box Plot - Sales Amount by Customer Age Group
The box plot compares the distribution of total sales amounts for different customer age groups. It provides insights into potential variations in sales across age segments.

# Visualization 32: Box Plot - Sales Amount by Customer Age Group
plt.figure(figsize=(10, 6))
sns.boxplot(x='Age_Group', y='Total_Sales', data=df)
plt.xlabel('Customer Age Group')
plt.ylabel('Total Sales')
plt.title('Sales Amount by Customer Age Group')
plt.show()
Visualization 33: Donut Chart - Sales Distribution by Customer Age Group
The donut chart displays the percentage distribution of total sales among different customer age groups. It helps visualize the relative contribution of each age group to overall sales.

plt.show()

[Figure: donut chart "Sales Distribution by Customer Age Group"; visible labels include 31-45 and 18-30]