Posts by Pavlo (62)

Exploring Seaborn: A Comprehensive Guide to Statistical Data Visualization

Recently, I dove into the Seaborn library, an invaluable tool for Exploratory Data Analysis, designed to create visually appealing and informative statistical graphics in Python. Here are the main insights and skills I've gained:

1. **Data Structures for Seaborn**:
   - Learned to reshape Pandas data into the layout Seaborn works best with. Seaborn excels with "long-form" data, which makes it easy to map variables to `hue`, `col`, and `row` for more complex visualizations.
   - Demonstrated how to tidy data and transform "wide-form" data into "long-form" using Pandas' `melt()` function.
2. **Basic to Advanced Plotting**:
   - Explored both "axes-level" and "figure-level" plots and their different uses. Axes-level functions draw onto a single Matplotlib axes, which suits standalone plots, while figure-level functions manage a whole figure and can show relationships conditioned on additional variables.
   - Practiced creating various plots such as `jointplot()`, `pairplot()`, `heatmap()`, and more, to visualize different data relationships and structures.
3. **Relational Plots**:
   - Delved into relational plots, which focus on the relationship between two numerical variables. These include the figure-level function `relplot()` and the axes-level functions `scatterplot()` and `lineplot()`.
4. **Categorization and Distribution**:
   - Investigated categorical and distribution plots, which help in analyzing relationships between numerical and categorical variables and in observing how data is distributed across various dimensions.
5. **Regression and Statistical Analysis**:
   - Used Seaborn's regression plots for exploratory analysis to estimate relationships between variables. Deepened my understanding of statistical measures such as error bars, and used Seaborn's regression plots together with statsmodels for more detailed statistical insights.
6. **Customizing Aesthetics**:
   - Customized Seaborn plots with different themes, styles, and color palettes, tailoring them to be both informative and visually pleasing.

**Conclusions**: Seaborn has proven to be a versatile and powerful library for statistical visualization, providing an array of options for analyzing and presenting data effectively. The ability to easily switch between different plot types and integrate statistical analysis makes Seaborn an essential tool for any data scientist. This journey through Seaborn's capabilities, especially the relational plots, has greatly improved my data visualization skills and made data analysis a more insightful process.
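The wide-to-long reshaping mentioned in the post can be sketched as follows. This is a minimal illustration, not the code from the original notebook: the DataFrame, its column names (`month`, `city`, `temp_c`) and its values are all made up for the example.

```python
import pandas as pd

# Hypothetical wide-form data: one row per month, one column per city.
wide = pd.DataFrame({
    "month": ["Jan", "Feb", "Mar"],
    "Kyiv": [-3.0, -2.0, 3.0],
    "Lisbon": [12.0, 13.0, 15.0],
})

# melt() keeps "month" as an identifier and stacks the city columns
# into two new columns: the former column name and its value.
long_df = wide.melt(id_vars="month", var_name="city", value_name="temp_c")
print(long_df)
```

With the data in long form, each variable has its own column, so a call such as `sns.relplot(data=long_df, x="month", y="temp_c", hue="city", kind="line")` can map `city` to color directly.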

Exploratory and Descriptive Wine Reviews Analysis using pandas and seaborn

This project is based on "The Complete Pandas Bootcamp 2023 - Data Science with Python", a course offered by Udemy and taught by Alexander Hagmann. The dataset for this project comes from Kaggle - Wine Reviews https://www.kaggle.com/datasets/zynicide/wine-reviews. The dataset was scraped from Wineenthusiast.com (https://www.wineenthusiast.com/). This project focuses on Exploratory Data Analysis (EDA), descriptive analysis, and basic inferential analysis.

**The primary objectives of this project:**

o Data Inspection

Import the dataset winemag-data_first150k.csv and inspect it. winemag-data_first150k.csv contains 10 columns and 150k rows of wine reviews. One consideration is that this data is from one month in the summer (June 2017) and was collected from a website based in the United States. The reviewers are therefore most likely American and may lean toward American wines because of lower domestic prices.

o Data Cleaning
  • 1. Renaming all column names to title case and setting the index
  • 2. Dropping duplicated rows
  • 3. Handling missing values in the dataset
    - 3.1 Fill in the missing values in the 'Country' column
    - 3.2 Fill in the missing values in the 'Province' column
    - 3.3 Handling missing values in prices
      - 3.3.1 Analyze the distribution of 'Price'
      - 3.3.2 Can we simply drop rows with missing Price values?
      - 3.3.3 Remove Tunisia and Egypt because they have too few reviews and no available prices.
      - 3.3.4 Deciding on the approach for filling missing prices: Group Median Imputation vs. KNN Imputation vs. Hybrid approach.
      - 3.3.5 **Hybrid Approach** to fill in missing price values. Combines the Group Median Method for groups with low variability and the KNN Imputation Method for those with high variability, resulting in the reviews_hybrid dataset.
        - 3.3.5.1 Option One: split the dataset in two, use one part for the group median approach and the other for the KNN approach, then merge the two.
        - 3.3.5.2 Option Two (Best): perform in-place imputation for both the group median and KNN approaches.
      - 3.3.6 **Group Median Imputation Approach** to fill in missing prices. Replace the missing values with group-specific values (grouped by Country and Province) using the `.transform()` method. Created the reviews_gm dataset.
        - 3.3.6.1 Analyze whether the group-specific fill value should be the mean price or the median price.
        - 3.3.6.2 Having chosen the median price as the group-specific value, fill in the missing prices in the dataset.
      - 3.3.7 **K-nearest neighbors (KNN) Imputation Approach** to fill in missing prices. Created the reviews_knn dataset.
      - 3.3.8 Compare the Hybrid, K-nearest neighbors (KNN), and Group Median imputation approaches for filling in missing prices.

    For further analysis, we have chosen the Hybrid Approach (Section 3.3.5) and will therefore use the `reviews_hybrid` dataset.
  • 4. Handle outliers in the reviews_hybrid dataset.
    - 4.1 Analyze the distribution of 'Price'
    - 4.2 Handle type 2 outliers: values that are correct but extreme relative to the other data points. We have several options for handling these outliers, depending on our specific analysis goals:
      - 4.2.1 **Keep the outliers**: If the high-priced wines are of particular interest or if we're analyzing market segments that include premium wines, we might choose to retain these data points.
      - 4.2.2 **Exclude the outliers**: For analyses where extreme values might skew the results, such as when calculating average prices, we might consider excluding these outliers.
      - 4.2.3 **Cap and floor the values**: We can cap and floor prices at chosen thresholds to lessen the impact of extremely high or low prices. No rows are excluded; only the price values are overwritten.
      - 4.2.4 **Discretization and Binning of 'Points' into 'Rating'**: I will use this for my further analysis.
      - 4.2.5 **Discretization and Binning of 'Price' into 'Price_cat'**: I will use this for my further analysis.
        - 4.2.5.1 Discretizing Price into equal-width bins.
        - 4.2.5.2 Discretizing Price so that each bracket holds the same number of reviews.
        - 4.2.5.3 Discretizing Price using custom quantiles.
        - 4.2.5.4 Discretizing Price using custom quantiles while accounting for outliers.
        - 4.2.5.5 K-Means clustering for Price binning.

    For our further analysis, we chose the Discretization and Binning approaches from 4.2.4 and 4.2.5.5.
  • 5. Convert Rating and Price_Category to categorical data and set their order in the reviews_hybrid dataset.

o Pattern Discovery Part I - Data Aggregation
  • 1. Which country is dominant in wine production?
  • 2. What is the most common variety reviewed in each country?
  • 3. What are the most expensive and the cheapest wines?
  • 4. Which variety of wine was reviewed most often, and how many unique varieties are there?
  • 5. What is the most popular variety of wine by country?
  • 6. Plotting Points and Price grouped by Variety.
  • 7. What is the perfect score?
  • 8. Identify the top 10 provinces by number of wine reviews and find the most reviewed wine variety within each of them.
  • 9. Heatmap of the top 20 wine-producing countries based on the frequency of wines falling into different Rating categories.
  • 10. Heatmap of the top 20 wine-producing countries based on the frequency of wines falling into different Price categories.

o Pattern Discovery Part II - Exploratory and Descriptive Analysis
  • 1. Does Country have a significant effect on Price? Use an ANOVA test.
  • 2. Is a higher Point (Rating) associated with a higher Price, or vice versa? Use the Pearson correlation coefficient and linear regression analysis to measure the linear relationship.
  • 3. Descriptive Price Category analysis.
  • 4. Do Price Categories have a significant effect on Price? Use an ANOVA test.
  • 5. Compare Price and Points distributions. Standardize the Price and Points columns (mean 0, standard deviation 1) using Z-scores.
  • 6. How does each wine's quality rating compare to the average rating of wines from the same country? Use Z-score analysis.
  • 7. How does each wine's price compare to the average price of wines from the same country? Use Z-score analysis.
  • 8. Hypothesis: wines from France have higher average ratings than wines from Italy. Perform an independent-samples t-test.

During the data cleaning process, we developed several different methods for handling missing values and outliers. We chose only one of these methods, but for future projects it would be worthwhile to explore the other approaches as well. This will help us understand various strategies in exploratory, descriptive, and inferential analysis.
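The Group Median Imputation step (3.3.6) could look roughly like the sketch below. It follows the description in the post (grouping by Country and Province and using `.transform()`), but the rows are hypothetical and the real notebook may differ in detail.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the reviews data; values are made up.
reviews = pd.DataFrame({
    "Country": ["US", "US", "US", "France", "France", "France"],
    "Province": ["California"] * 3 + ["Bordeaux"] * 3,
    "Price": [20.0, np.nan, 40.0, 30.0, 50.0, np.nan],
})

# transform("median") returns a Series aligned with the original rows,
# holding each row's group median (NaN values are skipped by default).
group_median = reviews.groupby(["Country", "Province"])["Price"].transform("median")

# Fill only the missing prices; existing prices are left untouched.
reviews_gm = reviews.copy()
reviews_gm["Price"] = reviews_gm["Price"].fillna(group_median)
print(reviews_gm)
```

Because `.transform()` keeps the original row order and length, the result can be passed straight to `fillna()` without a merge. A group whose prices are all missing would still yield NaN, which is one reason a fallback such as KNN imputation can be useful.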

Data Aggregation Project using pandas: Clean Summer dataset according to Sport Experts and Aggregate the results

### Project Overview

- **Course:** "The Complete Pandas Bootcamp 2023 - Data Science with Python"
- **Platform:** Udemy
- **Instructor:** Alexander Hagmann
- **Project Title:** Summer Olympic Games Medal Tables Aggregation

### Introduction

This project tackles a data aggregation challenge commonly encountered in job applications and assessment centers within the Data Science field. The primary task involves manipulating and interpreting a vast dataset to generate the Medal Tables for the Summer Olympic Games spanning from 1896 to 2012.

### Project Goals and Objectives

Upon joining a Data Science advisory firm, your first assignment is to recreate the official Medal Tables for all editions of the Summer Olympic Games. This entails utilizing datasets like `summer.csv`, which includes over 31,000 medal entries, and aligning your results with the official Medal Tables from the 1996 and 1976 Olympics, extracted from Wikipedia (`wik_1996.csv`, `wik_1976.csv`).

**Challenge:** Aim to minimize the total absolute divergence between your aggregated Medal Tables and the official ones, with the goal of achieving an optimal score of 0. For example, if the official Gold Medal count for the United States in 1996 is 44, and your calculation gives 46, this results in an absolute divergence of 2.

### Key Insights

- **Team and Singles Events:** In team events, a medal won by any team counts as a single medal irrespective of the number of team members. In singles events, each awarded medal counts individually, even when medals are shared.
- **Event Categories:** Medals are differentiated into Men's, Women's, and Mixed Events. Specific criteria determine Mixed Events, including all "Equestrian" and "Sailing" events before 1988, as well as certain medals in Badminton mixed doubles.

### Valuable Perspectives

Incorporating insights from sports experts is crucial, particularly in understanding how medals are structured and distributed across different event types. This requires deep data analysis and strategic thinking about data structures.

### Conclusion

This project tests both coding proficiency and the ability to integrate expert knowledge and data interpretation to solve complex challenges in Data Science. The emphasis on "Thinking in Data Structures" is a key skill for any aspiring data scientist.
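The team-versus-singles rule and the divergence score described above can be sketched on toy data. Everything here is illustrative: the column names, the `is_team` flag, the rows, and the "official" counts are assumptions made for the example, not the contents of `summer.csv` or the Wikipedia tables.

```python
import pandas as pd

# Hypothetical rows: one row per medal-winning athlete.
medals = pd.DataFrame({
    "Year": [1996] * 6,
    "Country": ["USA", "USA", "USA", "USA", "USA", "GER"],
    "Gender": ["Men"] * 6,
    "Event": ["4x100m relay"] * 4 + ["100m", "100m"],
    "Athlete": ["A", "B", "C", "D", "E", "F"],
    "Medal": ["Gold"] * 5 + ["Silver"],
    "is_team": [True] * 4 + [False, False],  # assumed helper flag
})

# Team events: collapse the per-athlete rows into one medal per team.
team = medals[medals["is_team"]].drop_duplicates(
    subset=["Year", "Country", "Gender", "Event", "Medal"]
)
# Singles events: every row counts as its own medal.
singles = medals[~medals["is_team"]]

# Medal table: countries as rows, medal types as columns.
table = (
    pd.concat([team, singles])
    .groupby(["Country", "Medal"])
    .size()
    .unstack(fill_value=0)
)

# Made-up "official" table used only to demonstrate the score.
official = pd.DataFrame(
    {"Gold": [0, 2], "Silver": [2, 0]},
    index=pd.Index(["GER", "USA"], name="Country"),
)

# Total absolute divergence: sum of |computed - official| over all cells.
divergence = int((table - official).abs().sum().sum())
print(table)
print("divergence:", divergence)
```

In this toy case the four relay rows collapse into a single gold, so the USA ends up with two golds rather than five, and the only mismatch against the made-up official table is one silver for GER.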

Exploratory Data Analysis Project using pandas: Summer Olympics + Winter Olympics + Population + GDP Olympics

This project is based on "The Complete Pandas Bootcamp 2023 - Data Science with Python," a course offered by Udemy and taught by Alexander Hagmann. This project focuses on Exploratory Data Analysis (EDA). EDA is a crucial step in the data analysis process that involves summarizing, visualizing, and understanding the main characteristics and patterns within a dataset.

The primary objectives of EDA in this project:

o Data Inspection: Import the datasets Summer (summer.csv), Winter (winter.csv) and dictionary (dictionary.csv) and inspect them.

o Merge and Concatenate:
  1. Merge Summer and Winter (one row for each medal awarded in any Olympic Games) and save the merged DataFrame in olympics.
  2. An additional column (e.g. "Edition") shall indicate the Edition -> Summer or Winter.
  3. Add the full Country name from the dictionary to olympics (e.g. France for FRA).

o Data Cleaning:
  1. Remove spaces from column headers in dictionary.
  2. For some Country Codes, there is no corresponding full Country Name available (e.g. for "URS") -> missing values in olympics. Identify these Country Codes and search the Web for the full Country Names. Replace the missing values in the Country column.
  3. Remove rows from olympics where the Country code is unknown. (Make sure you reset the Index -> RangeIndex)
  4. Convert the column Medal into an ordered Categorical column ("Bronze" < "Silver" < "Gold")

o Exploratory Data Analysis:
  • Do GDP, Population, and Politics matter?
    1. Create an aggregated and merged DataFrame with the Top 50 Countries. The column Total_Games shows the number of participations (as an approximation: the number of Editions in which a country has won at least one medal).
    2. Convert the absolute values in the DataFrame into ranks and save the ranks DataFrame in a new variable.
  • Statistical Analysis and Hypothesis Testing with scipy: From here on, work with ranks. Check whether GDP (Standard of Living), Total_Games (Political Stability measure), and Population (Size) have an effect on Total Medals. Work with Spearman correlation, not with Pearson correlation. In this part, we test whether population, GDP per capita, and the number of participations influence a country's success in the Olympic Games with statistical significance.
  • Medals Heatmap by Gender and Edition: Create a Seaborn heatmap with Medal Ranks for the Top 50 Countries (Total Medals, Summer Games Medals, Winter Games Medals, Men, Women).
  • Summer Games vs. Winter Games - does Geographical Location matter? Identify countries that are equally successful in Summer and Winter Games, more successful in Summer Games, or more successful in Winter Games. What could be the reasons?
    1. First, compare summer athletes to winter athletes in the same country to see who won more medals.
    2. Second, compare athletes in the Summer Games in one country to athletes in the Summer Games in another country.
  • Men vs. Women - does Culture & Religion matter? Identify countries where men and women are equally successful, where men are more successful, and where women are more successful. What could be the reasons?
    1. First, compare men to women in the same country to see who won more medals.
    2. Second, compare the men in one country to the men in another country, and then the women in one country to the women in another country.
  • Do Traditions matter? Create a Seaborn heatmap that shows the Ranks of the Top 50 Countries by Sports. Identify traditional Sports / National Sports for e.g. UK and China!
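The "work with ranks" and Spearman steps can be illustrated with a small sketch. The five countries and all figures below are invented for the example. Spearman correlation is the same as Pearson correlation computed on ranks, which is why converting to ranks first fits the analysis. The sketch uses pandas only; `scipy.stats.spearmanr` would additionally return a p-value for the significance test.

```python
import pandas as pd

# Hypothetical figures for five made-up countries (A to E).
top = pd.DataFrame(
    {
        "Total_Medals": [100, 80, 60, 40, 20],
        "GDP": [50, 40, 45, 10, 20],
        "Population": [300, 80, 140, 60, 1300],
    },
    index=list("ABCDE"),
)

# Rank 1 = largest value in each column.
ranks = top.rank(ascending=False)

# Spearman correlation on the raw values ...
spearman = top.corr(method="spearman")
# ... equals Pearson correlation on the ranks.
pearson_on_ranks = ranks.corr(method="pearson")

print(ranks)
print(spearman.round(3))
```

In this toy data, medal rank and GDP rank move together (coefficient 0.8), while the ranking by population barely relates to the medal ranking.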

Certificate of Completion - 2023 Python Data Analysis & Visualization Masterclass

I earned a Certificate of Completion, which verifies that I successfully completed the '2023 Python Data Analysis & Visualization Masterclass' beginner's level course on 20/10/2023. The course was instructed by Colt Steele on Udemy. Colt Steele is both a developer and a teacher. The certificate confirms that the entire course was completed, as validated by the student. The course duration is equivalent to the total video hours at the time of my most recent completion, which is 20.50 hours.

In this course, I learned about various aspects of data analysis and visualization using pandas, matplotlib, and seaborn. Here's a summary of the topics covered:

o Pandas:
  • DataFrames & Series
  • Analyze dozens of real-world datasets
  • Parse various types of csv files during import
  • Perform basic statistics
  • Hierarchical Columns - Groupby and Aggregation. Hierarchical Indexing - MultiIndex
  • Working With Text
  • Apply, Map and Applymap
  • Combining, Merging Series & DataFrames
  • Pull Out Specific Column(s)
  • Pull Out Specific Row(s) Based on Row Index Label
  • Pull Out Desired Row(s) and Column(s), with Rows Selected by Row Index Label
  • Pull Out Specific Row(s) Based on Row Position
  • Pull Out Desired Row(s) and Column(s), with Rows Selected by Row Position
  • Deal with NaN, None and <NA>
  • Drop/Show/Fill Rows or Columns Where Values are NaN
  • Convert a dtype to a different type, getting rid of NaN values in the column before converting
  • Including NaN values in a DataFrame and deciding whether to ignore them during calculations
  • Working with Dates
  • Drop Column(s) or Row(s)
  • Creating New Columns/Rows
  • Save the changed DataFrame to a csv file
  • Count the Number of Identical Values in a Column (value counts)
  • Count the Unique Values in a Column
  • Sort Values/Index
  • Sort in Ascending or Descending Order using the nlargest() or nsmallest() methods
  • between / isin methods
  • Pull Out Specific Row Values from a Single Column
  • Pull Out Specific Row Values from Multiple Columns using the AND (&), OR (|) and NOT (~) operators
  • Renaming Columns and Index Labels
  • Replace Values in a Single Column
  • Accessing a Group of Rows and Columns by Index Label(s) or by Boolean Array
  • Pandas Plotting
  • Set the row display option globally. Set the Index Column to a different column after importing the data. Set the Index Column, the DateTime Column, and memory-efficiency options while reading the file.

o Matplotlib:
  • First plotting option - three plots on one axes in one figure
  • Second plotting option - three plots, each on its own axes, all grouped in one figure
  • Third plotting option - three plots, each in its own figure
  • OOP Approach using plt.subplots()
  • Functional Approach using plt.subplot()
  • Figure size and dpi, Saving/Exporting Figures With savefig(), Setting the Style, Customizing Line Styles, Colors, Widths, Markers, Changing X & Y Ticks (values) and Their Labels, Zooming in / narrowing down the plot, Adding Legends to Plots, customizing location (loc), fontsize, labelcolor, facecolor, shadow, frameon
  • Bar Plot, Histogram, Scatter Plots, Pie Chart

o Seaborn:
  • Seaborn Relational Plots - Relplot: Scatterplot and Lineplot
  • Seaborn Distribution Plots - Displot: Histogram, Kdeplot, Ecdfplot and Rugplot
  • Seaborn Categorical Plots - Catplot: countplot, stripplot, swarmplot, boxplot, violinplot, pointplot, and barplot
  • Seaborn Controlling Aesthetics

What I also liked about this course was that each section had a number of exercises and challenges. You can find this course at https://www.udemy.com/course/python-data-analysis-visualization/
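Several of the pandas selection topics listed above (label-based and position-based lookups, boolean conditions, `isin` and `between`) fit into one short sketch. The DataFrame, its index labels and its values are made up for illustration and are not taken from the course material.

```python
import pandas as pd

# Hypothetical data with string index labels.
df = pd.DataFrame(
    {
        "country": ["France", "Italy", "Spain", "Chile"],
        "points": [92, 88, 85, 90],
        "price": [45.0, 30.0, 12.0, 18.0],
    },
    index=["w1", "w2", "w3", "w4"],
)

# Label-based selection: row label "w2", column "points".
by_label = df.loc["w2", "points"]

# Position-based selection: first row, third column.
by_position = df.iloc[0, 2]

# Boolean conditions combined with & (AND) and ~ (NOT);
# each condition needs its own parentheses.
selected = df[(df["points"] >= 88) & ~(df["price"] > 40)]

# Membership and range filters (between() is inclusive by default).
europe = df[df["country"].isin(["France", "Italy", "Spain"])]
mid_priced = df[df["price"].between(15, 35)]

print(selected)
```

`.loc` and `.iloc` differ only in how rows and columns are addressed, by label or by integer position, while boolean masks select whichever rows satisfy the condition regardless of label or position.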

