我想分析一個資料集 他的data card 內容如下
About Dataset Context Typically e-commerce datasets are proprietary and consequently hard to find among publicly available data. However, The UCI Machine Learning Repository has made this dataset containing actual transactions from 2010 and 2011. The dataset is maintained on their site, where it can be found by the title "Online Retail". Content "This is a transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail.The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers." Acknowledgements Per the UCI Machine Learning Repository, this data was made available by Dr Daqing Chen, Director: Public Analytics group. chend '@' lsbu.ac.uk, School of Engineering, London South Bank University, London SE1 0AA, UK. Image from stocksnap.io. Inspiration Analyses for this dataset could include time series, clustering, classification and more.
資料集如附件,我要怎樣處理跟分析這個資料
----chatbot 回答
import pandas as pd
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
# Read the CSV file into a DataFrame
df = pd.read_csv('data.csv')
# Display the first 5 rows
print(df.head().to_markdown(index=False, numalign="left", stralign="left"))
# Print the column names and their data types
print(df.info())
Traceback (most recent call last):
File "<string>", line 7, in <module>
File "_decorators.py", line 211, in wrapper
return func(*args, **kwargs)
File "_decorators.py", line 331, in wrapper
return func(*args, **kwargs)
File "readers.py", line 950, in read_csv
return _read(filepath_or_buffer, kwds)
File "readers.py", line 611, in _read
return parser.read(nrows)
File "readers.py", line 1778, in read
) = self._engine.read( # type: ignore[attr-defined]
File "c_parser_wrapper.py", line 230, in read
chunks = self._reader.read_low_memory(nrows)
File "parsers.pyx", line 808, in pandas._libs.parsers.TextReader.read_low_memory
File "parsers.pyx", line 866, in pandas._libs.parsers.TextReader._read_rows
File "parsers.pyx", line 852, in pandas._libs.parsers.TextReader._tokenize_rows
File "parsers.pyx", line 1965, in pandas._libs.parsers.raise_parser_error
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa3 in position 79780: invalid start byte
It seems that the file has some encoding issue. We will try to read the file with 'latin1' encoding and then proceed with the standard exploration steps.
import pandas as pd
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
# Read the CSV file into a DataFrame, specifying the 'latin1' encoding
df = pd.read_csv('data.csv', encoding='latin1')
# Display the first 5 rows
print(df.head().to_markdown(index=False, numalign="left", stralign="left"))
# Print the column names and their data types
print(df.info())
| InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|:------------|:------------|:------------------------------------|:-----------|:---------------|:------------|:-------------|:---------------|
| 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 12/1/2010 8:26 | 2.55 | 17850 | United Kingdom |
| 536365 | 71053 | WHITE METAL LANTERN | 6 | 12/1/2010 8:26 | 3.39 | 17850 | United Kingdom |
| 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 12/1/2010 8:26 | 2.75 | 17850 | United Kingdom |
| 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 12/1/2010 8:26 | 3.39 | 17850 | United Kingdom |
| 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 12/1/2010 8:26 | 3.39 | 17850 | United Kingdom |
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 InvoiceNo 541909 non-null object
1 StockCode 541909 non-null object
2 Description 540455 non-null object
3 Quantity 541909 non-null int64
4 InvoiceDate 541909 non-null object
5 UnitPrice 541909 non-null float64
6 CustomerID 406829 non-null float64
7 Country 541909 non-null object
dtypes: float64(2), int64(1), object(5)
memory usage: 33.1+ MB
None
The dataset contains information about transactions from an online retail store. The columns include InvoiceNo
, StockCode
, Description
, Quantity
, InvoiceDate
, UnitPrice
, CustomerID
, and Country
. We will further explore the data, preprocess it, and conduct some analysis to gain insights into the sales patterns, customer behavior, and popular products.
We will start by preprocessing the data.
# Convert `InvoiceDate` to datetime
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'], format='%m/%d/%Y %H:%M')
# Check for missing values
print("\nMissing values count:\n")
print(df.isnull().sum().to_markdown(numalign="left", stralign="left"))
# Drop rows with missing `CustomerID`
df.dropna(subset=['CustomerID'], inplace=True)
# Calculate `TotalAmount`
df['TotalAmount'] = df['Quantity'] * df['UnitPrice']
# Filter out rows with `Quantity` <= 0
df = df[df['Quantity'] > 0]
# Display summary statistics of numerical columns
print("\nSummary statistics of numerical columns:\n")
print(df[['Quantity', 'UnitPrice', 'TotalAmount']].describe().to_markdown(numalign="left", stralign="left"))
Missing values count:
| | 0 |
|:------------|:-------|
| InvoiceNo | 0 |
| StockCode | 0 |
| Description | 1454 |
| Quantity | 0 |
| InvoiceDate | 0 |
| UnitPrice | 0 |
| CustomerID | 135080 |
| Country | 0 |
Summary statistics of numerical columns:
| | Quantity | UnitPrice | TotalAmount |
|:------|:-----------|:------------|:--------------|
| count | 397924 | 397924 | 397924 |
| mean | 13.0218 | 3.11617 | 22.3947 |
| std | 180.42 | 22.0968 | 309.056 |
| min | 1 | 0 | 0 |
| 25% | 2 | 1.25 | 4.68 |
| 50% | 6 | 1.95 | 11.8 |
| 75% | 12 | 3.75 | 19.8 |
| max | 80995 | 8142.75 | 168470 |
The data appears to be clean now with no missing values in the CustomerID
column and no negative quantities. We will now proceed with some exploratory data analysis to understand the sales patterns, customer behavior, and popular products.
We will start by analyzing the sales trend over time.
import altair as alt
# Extract month from `InvoiceDate`
df['InvoiceMonth'] = df['InvoiceDate'].dt.to_period('M')
# Group by `InvoiceMonth` and sum `TotalAmount`
monthly_sales = df.groupby('InvoiceMonth')['TotalAmount'].sum().reset_index()
# Create a line chart to visualize the trend of `TotalAmount` over time
chart = alt.Chart(monthly_sales).mark_line(point=True).encode(
x='InvoiceMonth:T',
y=alt.Y('TotalAmount:Q', title='Total Sales'),
tooltip = ['InvoiceMonth', 'TotalAmount']
).properties(
title='Monthly Sales Trend'
).interactive()
# Save the chart in json file
chart.save('monthly_sales_trend.json')
/usr/bin/entry/images/py_interpreter.runfiles/rules_python~0.31.0~pip~ice_deps_310_altair/site-packages/altair/utils/core.py:317: FutureWarning: iteritems is deprecated and will be removed in a future version. Use .items instead.
for col_name, dtype in df.dtypes.iteritems():
Traceback (most recent call last):
File "<string>", line 19, in <module>
File "library_overrides.py", line 49, in custom_altair_chart_save
ORIGINAL_ALTAIR_CHART_SAVE(*args, **kwargs)
File "api.py", line 488, in save
result = save(**kwds)
File "save.py", line 83, in save
spec = chart.to_dict()
File "api.py", line 2020, in to_dict
return super().to_dict(*args, **kwargs)
File "api.py", line 374, in to_dict
copy.data = _prepare_data(original_data, context)
File "api.py", line 89, in _prepare_data
data = _pipe(data, data_transformers.get())
File "functoolz.py", line 628, in pipe
data = func(data)
File "functoolz.py", line 304, in __call__
return self._partial(*args, **kwargs)
File "data.py", line 19, in default_data_transformer
return curried.pipe(data, limit_rows(max_rows=max_rows), to_values)
File "functoolz.py", line 628, in pipe
data = func(data)
File "functoolz.py", line 304, in __call__
return self._partial(*args, **kwargs)
File "data.py", line 149, in to_values
data = sanitize_dataframe(data)
File "core.py", line 372, in sanitize_dataframe
elif np.issubdtype(dtype, np.integer):
File "numerictypes.py", line 417, in issubdtype
arg1 = dtype(arg1).type
TypeError: Cannot interpret 'period[M]' as a data type
It seems that the InvoiceMonth
column, which is of type period[M]
, is not directly supported by Altair for plotting. We need to convert it to a string format before proceeding with the visualization.
# Convert `InvoiceMonth` to string
monthly_sales['InvoiceMonth'] = monthly_sales['InvoiceMonth'].astype(str)
# Create a line chart to visualize the trend of `TotalAmount` over time
chart = alt.Chart(monthly_sales).mark_line(point=True).encode(
x=alt.X('InvoiceMonth:O', title='Month'), # Use 'O' for ordinal scale for string months
y=alt.Y('TotalAmount:Q', title='Total Sales'),
tooltip = ['InvoiceMonth', 'TotalAmount']
).properties(
title='Monthly Sales Trend'
).interactive()
# Save the chart in json file
chart.save('monthly_sales_trend.json')
[json-tag: code-generated-json-19-0-1724597802251809603]
The line chart shows a general upward trend in sales over time, with a significant spike in November and December 2011. Next, we will analyze the top-selling products and customer behavior.
# Group by `Description` and sum `Quantity`
product_sales = df.groupby('Description')['Quantity'].sum().reset_index()
# Sort in descending order and select top 10
top_10_products = product_sales.sort_values(by='Quantity', ascending=False).head(10)
# Display top 10 products
print("\nTop 10 Products by Quantity Sold:\n")
print(top_10_products.to_markdown(index=False, numalign="left", stralign="left"))
# Group by `CustomerID` and count unique `InvoiceNo`
orders_per_customer = df.groupby('CustomerID')['InvoiceNo'].nunique().reset_index()
orders_per_customer.columns = ['CustomerID', 'NumberOrders']
# Calculate average order value per customer
avg_order_value_per_customer = df.groupby('CustomerID')['TotalAmount'].mean().reset_index()
avg_order_value_per_customer.columns = ['CustomerID', 'AvgOrderValue']
# Display summary statistics
print("\nSummary statistics of number of orders and average order value per customer:\n")
print(orders_per_customer['NumberOrders'].describe().to_markdown(numalign="left", stralign="left"))
print(avg_order_value_per_customer['AvgOrderValue'].describe().to_markdown(numalign="left", stralign="left"))
Top 10 Products by Quantity Sold:
| Description | Quantity |
|:-----------------------------------|:-----------|
| PAPER CRAFT , LITTLE BIRDIE | 80995 |
| MEDIUM CERAMIC TOP STORAGE JAR | 77916 |
| WORLD WAR 2 GLIDERS ASSTD DESIGNS | 54415 |
| JUMBO BAG RED RETROSPOT | 46181 |
| WHITE HANGING HEART T-LIGHT HOLDER | 36725 |
| ASSORTED COLOUR BIRD ORNAMENT | 35362 |
| PACK OF 72 RETROSPOT CAKE CASES | 33693 |
| POPCORN HOLDER | 30931 |
| RABBIT NIGHT LIGHT | 27202 |
| MINI PAINT SET VINTAGE | 26076 |
Summary statistics of number of orders and average order value per customer:
| | NumberOrders |
|:------|:---------------|
| count | 4339 |
| mean | 4.27195 |
| std | 7.70549 |
| min | 1 |
| 25% | 1 |
| 50% | 2 |
| 75% | 5 |
| max | 210 |
| | AvgOrderValue |
|:------|:----------------|
| count | 4339 |
| mean | 68.3037 |
| std | 1467.75 |
| min | 0 |
| 25% | 12.3531 |
| 50% | 17.7169 |
| 75% | 24.8484 |
| max | 77183.6 |
The analysis reveals that 'PAPER CRAFT , LITTLE BIRDIE' and 'MEDIUM CERAMIC TOP STORAGE JAR' are the top-selling products. The customer analysis shows a wide range in the number of orders and average order value per customer, suggesting varying customer behavior.
Next, we will delve deeper into customer segmentation based on their purchase behavior (RFM analysis) and visualize the distribution of customers across different countries.
from datetime import datetime
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
# Calculate Recency
max_date = df['InvoiceDate'].max()
df['Recency'] = (max_date - df['InvoiceDate']).dt.days
# Calculate Frequency and Monetary Value
rfm_data = df.groupby('CustomerID').agg({
'InvoiceNo': 'nunique',
'TotalAmount': 'sum',
'Recency': 'min'
})
rfm_data.columns = ['Frequency', 'MonetaryValue', 'Recency']
# Scale the RFM data
scaler = StandardScaler()
rfm_data_scaled = scaler.fit_transform(rfm_data)
# Apply K-means clustering
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(rfm_data_scaled)
rfm_data['Cluster'] = kmeans.labels_
# Merge cluster labels back to original dataframe
df_with_cluster = df.merge(rfm_data[['Cluster']], on='CustomerID')
# Display average RFM values for each cluster
print("\nAverage RFM values for each cluster:\n")
print(rfm_data.groupby('Cluster').mean().to_markdown(numalign="left", stralign="left"))
# Count the number of customers in each cluster
print("\nNumber of customers in each cluster:\n")
print(rfm_data['Cluster'].value_counts().reset_index().rename(columns={'index':'Cluster', 'Cluster':'Count'}).to_markdown(index=False, numalign="left", stralign="left"))
# Group by `Country` and count unique `CustomerID`
customers_per_country = df.groupby('Country')['CustomerID'].nunique().reset_index()
customers_per_country.columns = ['Country', 'NumCustomers']
# Sort in descending order and select top 10
top_10_countries = customers_per_country.sort_values(by='NumCustomers', ascending=False).head(10)
# Display top 10 countries
print("\nTop 10 Countries by Customer Count:\n")
print(top_10_countries.to_markdown(index=False, numalign="left", stralign="left"))
# Create a bar chart to visualize the distribution of customers across the top 10 countries
import altair as alt
chart = alt.Chart(top_10_countries).mark_bar().encode(
x=alt.X('Country:N', sort='-y', axis=alt.Axis(labelAngle=-45)),
y=alt.Y('NumCustomers:Q', title='Number of Customers'),
tooltip = ['Country', 'NumCustomers']
).properties(
title='Top 10 Countries by Customer Count'
).interactive()
chart.save('top_10_countries_by_customer_count.json')
Average RFM values for each cluster:
| Cluster | Frequency | MonetaryValue | Recency |
|:----------|:------------|:----------------|:----------|
| 0 | 4.67193 | 1855.37 | 40.4457 |
| 1 | 1.58226 | 631.424 | 246.106 |
| 2 | 66.5 | 85904.4 | 5.03846 |
Number of customers in each cluster:
| Cluster | Count |
|:----------|:--------|
| 0 | 3231 |
| 1 | 1082 |
| 2 | 26 |
Top 10 Countries by Customer Count:
| Country | NumCustomers |
|:---------------|:---------------|
| United Kingdom | 3921 |
| Germany | 94 |
| France | 87 |
| Spain | 30 |
| Belgium | 25 |
| Switzerland | 21 |
| Portugal | 19 |
| Italy | 14 |
| Finland | 12 |
| Austria | 11 |
[json-tag: code-generated-json-28-0-1724597840775815102]
The provided dataset contains transactional data from an online retail store between December 1st, 2010 and December 9th, 2011. The dataset includes information about the products purchased, the quantity, price, customer ID, and country of origin.
Data Preprocessing and Exploration
- The
InvoiceDate
column was converted to datetime format for subsequent time-based analysis. - Missing values were handled by dropping rows with null
CustomerID
, as these were deemed crucial for customer-centric analysis. - A new column
TotalAmount
was created to capture the total value of each transaction. - Transactions with negative quantities, potentially indicating returns or cancellations, were excluded to ensure the analysis focused on actual sales.
- Summary statistics for key numerical columns (
Quantity
,UnitPrice
,TotalAmount
) were computed to understand their distributions and identify any outliers.
Sales Trend Analysis
- A line chart visualized the monthly sales trend, revealing a general upward trajectory with a notable peak in November and December 2011, likely attributed to holiday season shopping.
Product and Customer Analysis
- The top 10 selling products were identified, with 'PAPER CRAFT , LITTLE BIRDIE' and 'MEDIUM CERAMIC TOP STORAGE JAR' leading the list.
- Customer segmentation using RFM analysis (Recency, Frequency, Monetary) and K-means clustering categorized customers into three distinct groups, each exhibiting different purchasing behaviors.
- The distribution of customers across countries was examined, highlighting the dominance of the United Kingdom, followed by Germany and France. A bar chart visualized this distribution for the top 10 countries.
Key Insights
- The dataset provides valuable insights into sales patterns, customer behavior, and popular products.
- The sales trend analysis indicates a strong seasonal pattern, with sales peaking during the holiday season.
- The identification of top-selling products can inform inventory management and marketing strategies.
- Customer segmentation using RFM analysis enables targeted marketing efforts tailored to different customer groups.
- The geographical distribution of customers highlights key markets and potential areas for expansion.
Further Analysis
- More in-depth analysis could be conducted on customer segments to understand their specific needs and preferences.
- Time series forecasting techniques could be applied to predict future sales trends.
- Market basket analysis could be performed to identify product associations and cross-selling opportunities.
- Customer churn analysis could be conducted to understand why customers stop purchasing and develop strategies to retain them.
Overall, this analysis provides a solid foundation for understanding the online retail business represented in the dataset. The insights gained can be leveraged to make informed decisions regarding marketing, inventory management, and customer relationship management, ultimately driving business growth and profitability.