我想分析一個資料集他的data card 內容如下

About Dataset Context Typically e-commerce datasets are proprietary and consequently hard to find among publicly available data. However, The UCI Machine Learning Repository has made this dataset containing actual transactions from 2010 and 2011. The dataset is maintained on their site, where it can be found by the title "Online Retail". Content "This is a transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail.The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers." Acknowledgements Per the UCI Machine Learning Repository, this data was made available by Dr Daqing Chen, Director: Public Analytics group. chend '@' lsbu.ac.uk, School of Engineering, London South Bank University, London SE1 0AA, UK. Image from stocksnap.io. Inspiration Analyses for this dataset could include time series, clustering, classification and more.

資料集如附件，我要怎樣處理跟分析這個資料

----chatbot 回答

import pandas as pd

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

# Read the CSV file into a DataFrame
df = pd.read_csv('data.csv')

# Display the first 5 rows
print(df.head().to_markdown(index=False, numalign="left", stralign="left"))

# Print the column names and their data types
print(df.info())

Traceback (most recent call last):
  File "<string>", line 7, in <module>
  File "_decorators.py", line 211, in wrapper
    return func(*args, **kwargs)
  File "_decorators.py", line 331, in wrapper
    return func(*args, **kwargs)
  File "readers.py", line 950, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "readers.py", line 611, in _read
    return parser.read(nrows)
  File "readers.py", line 1778, in read
    ) = self._engine.read(  # type: ignore[attr-defined]
  File "c_parser_wrapper.py", line 230, in read
    chunks = self._reader.read_low_memory(nrows)
  File "parsers.pyx", line 808, in pandas._libs.parsers.TextReader.read_low_memory
  File "parsers.pyx", line 866, in pandas._libs.parsers.TextReader._read_rows
  File "parsers.pyx", line 852, in pandas._libs.parsers.TextReader._tokenize_rows
  File "parsers.pyx", line 1965, in pandas._libs.parsers.raise_parser_error
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa3 in position 79780: invalid start byte

It seems that the file has some encoding issue. We will try to read the file with 'latin1' encoding and then proceed with the standard exploration steps.

import pandas as pd

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

# Read the CSV file into a DataFrame, specifying the 'latin1' encoding
df = pd.read_csv('data.csv', encoding='latin1')

# Display the first 5 rows
print(df.head().to_markdown(index=False, numalign="left", stralign="left"))

# Print the column names and their data types
print(df.info())

| InvoiceNo   | StockCode   | Description                         | Quantity   | InvoiceDate    | UnitPrice   | CustomerID   | Country        |
|:------------|:------------|:------------------------------------|:-----------|:---------------|:------------|:-------------|:---------------|
| 536365      | 85123A      | WHITE HANGING HEART T-LIGHT HOLDER  | 6          | 12/1/2010 8:26 | 2.55        | 17850        | United Kingdom |
| 536365      | 71053       | WHITE METAL LANTERN                 | 6          | 12/1/2010 8:26 | 3.39        | 17850        | United Kingdom |
| 536365      | 84406B      | CREAM CUPID HEARTS COAT HANGER      | 8          | 12/1/2010 8:26 | 2.75        | 17850        | United Kingdom |
| 536365      | 84029G      | KNITTED UNION FLAG HOT WATER BOTTLE | 6          | 12/1/2010 8:26 | 3.39        | 17850        | United Kingdom |
| 536365      | 84029E      | RED WOOLLY HOTTIE WHITE HEART.      | 6          | 12/1/2010 8:26 | 3.39        | 17850        | United Kingdom |
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   InvoiceNo    541909 non-null  object 
 1   StockCode    541909 non-null  object 
 2   Description  540455 non-null  object 
 3   Quantity     541909 non-null  int64  
 4   InvoiceDate  541909 non-null  object 
 5   UnitPrice    541909 non-null  float64
 6   CustomerID   406829 non-null  float64
 7   Country      541909 non-null  object 
dtypes: float64(2), int64(1), object(5)
memory usage: 33.1+ MB
None

The dataset contains information about transactions from an online retail store. The columns include InvoiceNo, StockCode, Description, Quantity, InvoiceDate, UnitPrice, CustomerID, and Country. We will further explore the data, preprocess it, and conduct some analysis to gain insights into the sales patterns, customer behavior, and popular products.

We will start by preprocessing the data.

# Convert `InvoiceDate` to datetime
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'], format='%m/%d/%Y %H:%M')

# Check for missing values
print("\nMissing values count:\n")
print(df.isnull().sum().to_markdown(numalign="left", stralign="left"))

# Drop rows with missing `CustomerID`
df.dropna(subset=['CustomerID'], inplace=True)

# Calculate `TotalAmount`
df['TotalAmount'] = df['Quantity'] * df['UnitPrice']

# Filter out rows with `Quantity` <= 0
df = df[df['Quantity'] > 0]

# Display summary statistics of numerical columns
print("\nSummary statistics of numerical columns:\n")
print(df[['Quantity', 'UnitPrice', 'TotalAmount']].describe().to_markdown(numalign="left", stralign="left"))


Missing values count:

|             | 0      |
|:------------|:-------|
| InvoiceNo   | 0      |
| StockCode   | 0      |
| Description | 1454   |
| Quantity    | 0      |
| InvoiceDate | 0      |
| UnitPrice   | 0      |
| CustomerID  | 135080 |
| Country     | 0      |

Summary statistics of numerical columns:

|       | Quantity   | UnitPrice   | TotalAmount   |
|:------|:-----------|:------------|:--------------|
| count | 397924     | 397924      | 397924        |
| mean  | 13.0218    | 3.11617     | 22.3947       |
| std   | 180.42     | 22.0968     | 309.056       |
| min   | 1          | 0           | 0             |
| 25%   | 2          | 1.25        | 4.68          |
| 50%   | 6          | 1.95        | 11.8          |
| 75%   | 12         | 3.75        | 19.8          |
| max   | 80995      | 8142.75     | 168470        |

The data appears to be clean now with no missing values in the CustomerID column and no negative quantities. We will now proceed with some exploratory data analysis to understand the sales patterns, customer behavior, and popular products.

We will start by analyzing the sales trend over time.

import altair as alt

# Extract month from `InvoiceDate`
df['InvoiceMonth'] = df['InvoiceDate'].dt.to_period('M')

# Group by `InvoiceMonth` and sum `TotalAmount`
monthly_sales = df.groupby('InvoiceMonth')['TotalAmount'].sum().reset_index()

# Create a line chart to visualize the trend of `TotalAmount` over time
chart = alt.Chart(monthly_sales).mark_line(point=True).encode(
    x='InvoiceMonth:T',
    y=alt.Y('TotalAmount:Q', title='Total Sales'),
    tooltip = ['InvoiceMonth', 'TotalAmount']
).properties(
    title='Monthly Sales Trend'
).interactive()

# Save the chart in json file
chart.save('monthly_sales_trend.json')

/usr/bin/entry/images/py_interpreter.runfiles/rules_python~0.31.0~pip~ice_deps_310_altair/site-packages/altair/utils/core.py:317: FutureWarning: iteritems is deprecated and will be removed in a future version. Use .items instead.
  for col_name, dtype in df.dtypes.iteritems():
Traceback (most recent call last):
  File "<string>", line 19, in <module>
  File "library_overrides.py", line 49, in custom_altair_chart_save
    ORIGINAL_ALTAIR_CHART_SAVE(*args, **kwargs)
  File "api.py", line 488, in save
    result = save(**kwds)
  File "save.py", line 83, in save
    spec = chart.to_dict()
  File "api.py", line 2020, in to_dict
    return super().to_dict(*args, **kwargs)
  File "api.py", line 374, in to_dict
    copy.data = _prepare_data(original_data, context)
  File "api.py", line 89, in _prepare_data
    data = _pipe(data, data_transformers.get())
  File "functoolz.py", line 628, in pipe
    data = func(data)
  File "functoolz.py", line 304, in __call__
    return self._partial(*args, **kwargs)
  File "data.py", line 19, in default_data_transformer
    return curried.pipe(data, limit_rows(max_rows=max_rows), to_values)
  File "functoolz.py", line 628, in pipe
    data = func(data)
  File "functoolz.py", line 304, in __call__
    return self._partial(*args, **kwargs)
  File "data.py", line 149, in to_values
    data = sanitize_dataframe(data)
  File "core.py", line 372, in sanitize_dataframe
    elif np.issubdtype(dtype, np.integer):
  File "numerictypes.py", line 417, in issubdtype
    arg1 = dtype(arg1).type
TypeError: Cannot interpret 'period[M]' as a data type

It seems that the InvoiceMonth column, which is of type period[M], is not directly supported by Altair for plotting. We need to convert it to a string format before proceeding with the visualization.

# Convert `InvoiceMonth` to string
monthly_sales['InvoiceMonth'] = monthly_sales['InvoiceMonth'].astype(str)

# Create a line chart to visualize the trend of `TotalAmount` over time
chart = alt.Chart(monthly_sales).mark_line(point=True).encode(
    x=alt.X('InvoiceMonth:O', title='Month'),  # Use 'O' for ordinal scale for string months
    y=alt.Y('TotalAmount:Q', title='Total Sales'),
    tooltip = ['InvoiceMonth', 'TotalAmount']
).properties(
    title='Monthly Sales Trend'
).interactive()

# Save the chart in json file
chart.save('monthly_sales_trend.json')

[json-tag: code-generated-json-19-0-1724597802251809603]

The line chart shows a general upward trend in sales over time, with a significant spike in November and December 2011. Next, we will analyze the top-selling products and customer behavior.

# Group by `Description` and sum `Quantity`
product_sales = df.groupby('Description')['Quantity'].sum().reset_index()

# Sort in descending order and select top 10
top_10_products = product_sales.sort_values(by='Quantity', ascending=False).head(10)

# Display top 10 products
print("\nTop 10 Products by Quantity Sold:\n")
print(top_10_products.to_markdown(index=False, numalign="left", stralign="left"))

# Group by `CustomerID` and count unique `InvoiceNo`
orders_per_customer = df.groupby('CustomerID')['InvoiceNo'].nunique().reset_index()
orders_per_customer.columns = ['CustomerID', 'NumberOrders']

# Calculate average order value per customer
avg_order_value_per_customer = df.groupby('CustomerID')['TotalAmount'].mean().reset_index()
avg_order_value_per_customer.columns = ['CustomerID', 'AvgOrderValue']

# Display summary statistics
print("\nSummary statistics of number of orders and average order value per customer:\n")
print(orders_per_customer['NumberOrders'].describe().to_markdown(numalign="left", stralign="left"))
print(avg_order_value_per_customer['AvgOrderValue'].describe().to_markdown(numalign="left", stralign="left"))


Top 10 Products by Quantity Sold:

| Description                        | Quantity   |
|:-----------------------------------|:-----------|
| PAPER CRAFT , LITTLE BIRDIE        | 80995      |
| MEDIUM CERAMIC TOP STORAGE JAR     | 77916      |
| WORLD WAR 2 GLIDERS ASSTD DESIGNS  | 54415      |
| JUMBO BAG RED RETROSPOT            | 46181      |
| WHITE HANGING HEART T-LIGHT HOLDER | 36725      |
| ASSORTED COLOUR BIRD ORNAMENT      | 35362      |
| PACK OF 72 RETROSPOT CAKE CASES    | 33693      |
| POPCORN HOLDER                     | 30931      |
| RABBIT NIGHT LIGHT                 | 27202      |
| MINI PAINT SET VINTAGE             | 26076      |

Summary statistics of number of orders and average order value per customer:

|       | NumberOrders   |
|:------|:---------------|
| count | 4339           |
| mean  | 4.27195        |
| std   | 7.70549        |
| min   | 1              |
| 25%   | 1              |
| 50%   | 2              |
| 75%   | 5              |
| max   | 210            |
|       | AvgOrderValue   |
|:------|:----------------|
| count | 4339            |
| mean  | 68.3037         |
| std   | 1467.75         |
| min   | 0               |
| 25%   | 12.3531         |
| 50%   | 17.7169         |
| 75%   | 24.8484         |
| max   | 77183.6         |

The analysis reveals that 'PAPER CRAFT , LITTLE BIRDIE' and 'MEDIUM CERAMIC TOP STORAGE JAR' are the top-selling products. The customer analysis shows a wide range in the number of orders and average order value per customer, suggesting varying customer behavior.

Next, we will delve deeper into customer segmentation based on their purchase behavior (RFM analysis) and visualize the distribution of customers across different countries.

from datetime import datetime
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Calculate Recency
max_date = df['InvoiceDate'].max()
df['Recency'] = (max_date - df['InvoiceDate']).dt.days

# Calculate Frequency and Monetary Value
rfm_data = df.groupby('CustomerID').agg({
    'InvoiceNo': 'nunique',
    'TotalAmount': 'sum',
    'Recency': 'min'
})
rfm_data.columns = ['Frequency', 'MonetaryValue', 'Recency']

# Scale the RFM data
scaler = StandardScaler()
rfm_data_scaled = scaler.fit_transform(rfm_data)

# Apply K-means clustering
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(rfm_data_scaled)
rfm_data['Cluster'] = kmeans.labels_

# Merge cluster labels back to original dataframe
df_with_cluster = df.merge(rfm_data[['Cluster']], on='CustomerID')

# Display average RFM values for each cluster
print("\nAverage RFM values for each cluster:\n")
print(rfm_data.groupby('Cluster').mean().to_markdown(numalign="left", stralign="left"))

# Count the number of customers in each cluster
print("\nNumber of customers in each cluster:\n")
print(rfm_data['Cluster'].value_counts().reset_index().rename(columns={'index':'Cluster', 'Cluster':'Count'}).to_markdown(index=False, numalign="left", stralign="left"))

# Group by `Country` and count unique `CustomerID`
customers_per_country = df.groupby('Country')['CustomerID'].nunique().reset_index()
customers_per_country.columns = ['Country', 'NumCustomers']

# Sort in descending order and select top 10
top_10_countries = customers_per_country.sort_values(by='NumCustomers', ascending=False).head(10)

# Display top 10 countries
print("\nTop 10 Countries by Customer Count:\n")
print(top_10_countries.to_markdown(index=False, numalign="left", stralign="left"))

# Create a bar chart to visualize the distribution of customers across the top 10 countries
import altair as alt

chart = alt.Chart(top_10_countries).mark_bar().encode(
    x=alt.X('Country:N', sort='-y', axis=alt.Axis(labelAngle=-45)),
    y=alt.Y('NumCustomers:Q', title='Number of Customers'),
    tooltip = ['Country', 'NumCustomers']
).properties(
    title='Top 10 Countries by Customer Count'
).interactive()

chart.save('top_10_countries_by_customer_count.json')


Average RFM values for each cluster:

| Cluster   | Frequency   | MonetaryValue   | Recency   |
|:----------|:------------|:----------------|:----------|
| 0         | 4.67193     | 1855.37         | 40.4457   |
| 1         | 1.58226     | 631.424         | 246.106   |
| 2         | 66.5        | 85904.4         | 5.03846   |

Number of customers in each cluster:

| Cluster   | Count   |
|:----------|:--------|
| 0         | 3231    |
| 1         | 1082    |
| 2         | 26      |

Top 10 Countries by Customer Count:

| Country        | NumCustomers   |
|:---------------|:---------------|
| United Kingdom | 3921           |
| Germany        | 94             |
| France         | 87             |
| Spain          | 30             |
| Belgium        | 25             |
| Switzerland    | 21             |
| Portugal       | 19             |
| Italy          | 14             |
| Finland        | 12             |
| Austria        | 11             |

[json-tag: code-generated-json-28-0-1724597840775815102]

The provided dataset contains transactional data from an online retail store between December 1st, 2010 and December 9th, 2011. The dataset includes information about the products purchased, the quantity, price, customer ID, and country of origin.

Data Preprocessing and Exploration

The InvoiceDate column was converted to datetime format for subsequent time-based analysis.
Missing values were handled by dropping rows with null CustomerID, as these were deemed crucial for customer-centric analysis.
A new column TotalAmount was created to capture the total value of each transaction.
Transactions with negative quantities, potentially indicating returns or cancellations, were excluded to ensure the analysis focused on actual sales.
Summary statistics for key numerical columns (Quantity, UnitPrice, TotalAmount) were computed to understand their distributions and identify any outliers.

Sales Trend Analysis

A line chart visualized the monthly sales trend, revealing a general upward trajectory with a notable peak in November and December 2011, likely attributed to holiday season shopping.

Product and Customer Analysis

The top 10 selling products were identified, with 'PAPER CRAFT , LITTLE BIRDIE' and 'MEDIUM CERAMIC TOP STORAGE JAR' leading the list.
Customer segmentation using RFM analysis (Recency, Frequency, Monetary) and K-means clustering categorized customers into three distinct groups, each exhibiting different purchasing behaviors.
The distribution of customers across countries was examined, highlighting the dominance of the United Kingdom, followed by Germany and France. A bar chart visualized this distribution for the top 10 countries.

Key Insights

The dataset provides valuable insights into sales patterns, customer behavior, and popular products.
The sales trend analysis indicates a strong seasonal pattern, with sales peaking during the holiday season.
The identification of top-selling products can inform inventory management and marketing strategies.
Customer segmentation using RFM analysis enables targeted marketing efforts tailored to different customer groups.
The geographical distribution of customers highlights key markets and potential areas for expansion.

Further Analysis

More in-depth analysis could be conducted on customer segments to understand their specific needs and preferences.
Time series forecasting techniques could be applied to predict future sales trends.
Market basket analysis could be performed to identify product associations and cross-selling opportunities.
Customer churn analysis could be conducted to understand why customers stop purchasing and develop strategies to retain them.

Overall, this analysis provides a solid foundation for understanding the online retail business represented in the dataset. The insights gained can be leveraged to make informed decisions regarding marketing, inventory management, and customer relationship management, ultimately driving business growth and profitability.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

2.顧客分群 -E-Commerce Data.md

2.顧客分群 -E-Commerce Data.md

Files

2.顧客分群 -E-Commerce Data.md

Latest commit

History

2.顧客分群 -E-Commerce Data.md

File metadata and controls