【Task 6】 Open source repository collaboration network and npm artifact library dependency network mapping dataset #84

Open · wants to merge 1 commit into base: main
README.md: 119 changes (58 additions & 61 deletions)
# OpenPerf

OpenPerf is a benchmarking suite tailored for the sustainable management of open-source projects. It assesses key metrics and standards vital for the successful development of open-source ecosystems.

## Features

- **Data Science Benchmarks**: Focus on analyzing and predicting behaviors that impact the sustainability of open-source projects, such as bot detection mechanisms.
- **Standard Benchmarks**: Includes a wide range of benchmarks that measure company, developer, and project impacts on open-source community health and growth.
- **Index Benchmarks**: Provides tools for evaluating and ranking different entities based on metrics critical to open-source sustainability, such as activity levels and influence.
- **Modular CLI**: A robust command-line interface that allows for straightforward interaction with all available benchmarks, facilitating ease of use and integration into other tools.
- **Extensible Framework**: Designed to be flexible and expandable, allowing researchers and developers to add new benchmarks and features as the field evolves.

## Installation

To get started with OpenPerf, clone the repository to your local machine:

```bash
git clone https://github.com/yourgithubusername/openperf.git
cd openperf
```

Install the required dependencies:

```bash
pip install -r requirements.txt
```

## Usage

OpenPerf is equipped with a CLI for easy execution of benchmarks. Here’s how you can run different types of benchmarks:

### Running Data Science Benchmarks
To run the bot detection benchmark, which helps understand automated interactions in project management:
```bash
openperf data_science bot_detection
```

### Running Standard Benchmarks
Evaluate the impact of companies, developers, and projects on open-source sustainability:
```bash
openperf standard company
openperf standard developer
openperf standard project
```

### Running Index Benchmarks
To assess activity and influence indices, crucial for understanding leadership and contributions in open-source projects:
```bash
openperf index activity
openperf index influence
```

### Extending OpenPerf
To add a new benchmark, create a new module under the appropriate directory and update `main.py` to register the benchmark in the CLI.
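The registration step can be sketched as follows. This is a minimal illustration, not OpenPerf's actual code: the `BENCHMARKS` registry, the `bot_detection` entry point, and the `argparse` dispatch are all assumed for the example.

```python
import argparse

# Hypothetical benchmark entry point; a real one would live in its own
# module under the category directory (e.g. data_science/bot_detection.py).
def bot_detection():
    return "running bot_detection"

# Hypothetical registry mapping category -> {benchmark name: callable}.
# Adding a benchmark means adding its entry point here.
BENCHMARKS = {
    "data_science": {"bot_detection": bot_detection},
}

def main(argv=None):
    parser = argparse.ArgumentParser(prog="openperf")
    parser.add_argument("category", choices=sorted(BENCHMARKS))
    parser.add_argument("benchmark")
    args = parser.parse_args(argv)
    # Dispatch to the registered benchmark callable.
    return BENCHMARKS[args.category][args.benchmark]()

print(main(["data_science", "bot_detection"]))
```

With this shape, a new category or benchmark is a one-line addition to the registry plus its module.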

## License
This project is licensed under the MIT License - see the LICENSE.md file for details.

## Acknowledgments
Thanks to all the contributors who have helped to expand and maintain OpenPerf.
Special thanks to the community for the continuous feedback that enriches the project's scope and functionality.
# Open-Source-Collaboration-Network
Open source repository collaboration network and npm artifact library dependency network mapping dataset

# Description of each file:
### Note: for ease of use and reading, every program file is saved in two copies, as both .py and .ipynb!
#### co-net_request.ipynb
- A test file for crawling data from the GitHub APIs. Because the crawl takes a long time, a small number of repositories were crawled first to make sure the process runs without mistakes.
#### get_repo_co.ipynb
- For every GitHub repository referenced by the npm data, crawls its contributor information (as a list), in preparation for building the repository collaboration network.
#### gen_npm_graph.ipynb
- Builds the network from the npm metadata and the npm dependency data.
#### npm_net.ipynb (deprecated)
- Same as above. This file lacks persistence of the graph data and parallel execution; the previous file adds both!
#### get_repo_co.ipynb
- Builds the repository collaboration network from the crawled contributor data.
#### reflection_of_npm&repo.ipynb
- Uses a join to produce the reflection_of_npm_and_repo dataset, which represents the mapping from repo to npm.
#### repo_graph.png and npm_graph.png
- Visualizations of the collaboration network and the npm network, respectively.
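The join behind reflection_of_npm_and_repo can be sketched with pandas. The tables and column names below (`package`, `repo_url`, `repo`, `contributors`) are illustrative, not the dataset's actual schema:

```python
import pandas as pd

# Toy stand-ins for the npm metadata and the crawled repository table.
npm = pd.DataFrame({
    "package": ["lodash", "left-pad"],
    "repo_url": ["https://github.com/lodash/lodash",
                 "https://github.com/left-pad/left-pad"],
})
repos = pd.DataFrame({
    "repo": ["lodash/lodash"],
    "contributors": [["jdalton"]],
})

# Normalize repo_url to an "owner/name" key shared by both tables.
npm["repo"] = npm["repo_url"].str.extract(
    r"github\.com/([^/]+/[^/]+)", expand=False)

# An inner join keeps only packages whose repository was actually crawled,
# i.e. the mappable subset of the two networks.
reflection = npm.merge(repos, on="repo", how="inner")
print(reflection[["repo", "package"]])
```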

# Task_Intro (Task Introduction)
### Goal
This dataset maps the relationships between the npm package registry and the corresponding open-source repositories. It addresses the challenge of incomplete or outdated metadata in the npm registry, caused by individual contributions and repository renames, and thereby enables accurate prediction and mapping across these networks.

### Content
The two networks cannot be mapped onto each other in full, but subsets of the two networks do correspond, and the mapping can be built from the repo_url field in each npm package's metadata.
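Repository URLs in npm metadata come in several forms (`git+https://`, `git://`, trailing `.git`), which is part of why the mapping is incomplete. A sketch of one normalization step, assuming GitHub-style `owner/name` paths:

```python
import re

def normalize_repo_url(url):
    """Reduce a repository URL from npm metadata to an 'owner/name' key.

    Returns None when no GitHub-style path can be recovered.
    """
    if not url:
        return None
    m = re.search(r"github\.com[:/]([^/]+)/([^/#?]+)", url)
    if not m:
        return None
    owner, name = m.group(1), m.group(2)
    # Strip the common ".git" suffix (removesuffix needs Python 3.9+).
    return f"{owner}/{name.removesuffix('.git')}"

print(normalize_repo_url("git+https://github.com/lodash/lodash.git"))  # lodash/lodash
print(normalize_repo_url("git://github.com/expressjs/express.git"))    # expressjs/express
```

URLs that fail to normalize are exactly the packages that fall outside the mappable subset.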

### Open-source repository collaboration network:
Nodes: individual repositories (repos), representing individual developers or teams.
Edges: collaboration relationships, covering contributions such as commits, reviews, and discussions.
Attributes: metrics such as the number of contributions, the nature of the contributions (code, documentation, etc.), and the duration of the collaboration.
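One plausible construction of this network from the crawled contributor lists is sketched below, using shared contributors as the collaboration signal; the actual edge definition used in get_repo_co may differ, and the data here is illustrative:

```python
from itertools import combinations

import networkx as nx

# Contributor sets per repository (toy data; the real lists come from
# the GitHub API crawl).
contributors = {
    "org/repo-a": {"alice", "bob"},
    "org/repo-b": {"bob", "carol"},
    "org/repo-c": {"dave"},
}

G = nx.Graph()
G.add_nodes_from(contributors)
# Connect two repositories when they share at least one contributor,
# weighting the edge by the number of shared contributors.
for (r1, c1), (r2, c2) in combinations(contributors.items(), 2):
    shared = c1 & c2
    if shared:
        G.add_edge(r1, r2, weight=len(shared))

print(G.number_of_nodes(), G.number_of_edges())  # 3 1
```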

### npm artifact library dependency network:
Nodes: individual npm packages.
Edges: dependency links, i.e. one package depending on another.
Attributes: version numbers, update frequency, and popularity metrics (download counts, descriptions).
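This network is naturally a directed graph built from the `dependencies` field of each package.json; a minimal sketch with toy data:

```python
import networkx as nx

# Dependency fields from a few package.json documents (illustrative).
packages = {
    "left-pad": {"version": "1.3.0", "dependencies": {}},
    "line-numbers": {"version": "0.3.0", "dependencies": {"left-pad": "^1.0.0"}},
}

D = nx.DiGraph()
for name, meta in packages.items():
    # Attach node attributes such as the version number.
    D.add_node(name, version=meta["version"])
    # One directed edge per dependency link: depender -> dependency.
    for dep in meta["dependencies"]:
        D.add_edge(name, dep)

print(list(D.edges()))  # [('line-numbers', 'left-pad')]
```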

### Data collection methodology:
1. Data for the collaboration network is collected through public APIs from popular source-hosting platforms such as GitHub, GitLab, and Bitbucket. You can also directly download the sample dataset provided by OpenDigger; comparing one year of behavioral data is recommended: https://github.com/X-lab2017/open-digger/blob/master/sample_data/README.md

2. Data for the npm artifact library dependency network is extracted from the npm registry's public API, focusing on package.json files to map the dependencies. You can crawl it from npm.org. The global npm packages and their dependencies are provided here:
npm dependencies: npm_dependencies.zip 7.15M
npm packages: npm_packages.zip 69.28M
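The npm registry exposes package documents at `https://registry.npmjs.org/<name>`, with dependency maps under each entry in `versions`. A sketch of the extraction step; the live fetch is shown but the example runs on a canned document so it needs no network access:

```python
import json
from urllib.request import urlopen

REGISTRY = "https://registry.npmjs.org"

def fetch_package(name):
    """Fetch a package document from the public npm registry."""
    with urlopen(f"{REGISTRY}/{name}") as resp:
        return json.load(resp)

def latest_dependencies(doc):
    """Extract the dependency map of the package's latest version."""
    latest = doc["dist-tags"]["latest"]
    return doc["versions"][latest].get("dependencies", {})

# Canned registry document (a live call would be fetch_package("left-pad")).
doc = {
    "dist-tags": {"latest": "1.0.0"},
    "versions": {"1.0.0": {"dependencies": {"ms": "^2.0.0"}}},
}
print(latest_dependencies(doc))  # {'ms': '^2.0.0'}
```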

### Potential use cases:
Compute metrics for both networks: degree, clustering coefficient, average path length, diameter, centrality, density, modularity, connected components, and so on.
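Most of these metrics are one networkx call each; a sketch on a small built-in graph (the same calls apply to the full networks, though diameter and average path length are only defined on connected graphs):

```python
import networkx as nx
from networkx.algorithms import community

# Small connected stand-in graph for demonstration.
G = nx.karate_club_graph()

metrics = {
    "density": nx.density(G),
    "average_clustering": nx.average_clustering(G),
    "connected_components": nx.number_connected_components(G),
    # Defined here because this particular graph is connected.
    "diameter": nx.diameter(G),
    "average_path_length": nx.average_shortest_path_length(G),
}
degree_centrality = nx.degree_centrality(G)

# Modularity of a greedy community partition.
communities = community.greedy_modularity_communities(G)
metrics["modularity"] = community.modularity(G, communities)

print(metrics["connected_components"])  # 1
```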

### Visualize the mapping between the two networks
Study the resilience of the software ecosystem by examining dependency chains and their impact on software reliability.
Assess trends in software development practices over time.

### Format
The dataset is provided in formats suitable for machine learning and network analysis, such as CSV for tabular data and JSON for structured metadata.
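A sketch of both export forms for a graph, using networkx's node-link JSON (which round-trips via `node_link_graph`, the format read by analysis_graph.py) and a plain edge-list CSV; file names here are illustrative:

```python
import csv
import json

import networkx as nx

G = nx.Graph()
G.add_edge("org/repo-a", "org/repo-b", weight=2)

# JSON: node-link form for structured metadata.
with open("repo_graph.json", "w") as f:
    json.dump(nx.node_link_data(G), f)

# CSV: one row per edge for tabular consumers.
with open("repo_edges.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["source", "target", "weight"])
    for u, v, d in G.edges(data=True):
        writer.writerow([u, v, d.get("weight", 1)])

# Round-trip check: the JSON form reconstructs the same graph.
with open("repo_graph.json") as f:
    G2 = nx.node_link_graph(json.load(f))
print(G2.number_of_edges())  # 1
```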

### Output
A complete dataset containing the open-source repository collaboration network and the npm artifact library dependency network.
Usage instructions for the dataset, detailing the data items, their sources, and the collection and processing methods.
A data analysis report summarizing key findings and insights.
analysis_graph.ipynb: 98 changes (98 additions & 0 deletions)
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import networkx as nx\n",
"import json\n",
"import matplotlib\n",
"matplotlib.use('Agg')\n",
"import matplotlib.pyplot as plt"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"with open('./repo_graph.json', 'r') as file:\n",
" data = json.load(file)\n",
"\n",
"G = nx.node_link_graph(data)\n",
"print(\"Number of nodes:\", G.number_of_nodes())\n",
"print(\"Number of edges:\", G.number_of_edges())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Compute degree centrality\n",
"degree_centrality = nx.degree_centrality(G)\n",
"# Find the top N nodes by degree centrality\n",
"top_n = 10\n",
"top_nodes = sorted(degree_centrality.items(), key=lambda item: item[1], reverse=True)[:top_n]\n",
"print(\"Top {} nodes by degree centrality:\".format(top_n))\n",
"for node, centrality in top_nodes:\n",
" print(f\"Node: {node}, Degree Centrality: {centrality}\")\n",
"\n",
"# Clustering coefficient\n",
"clustering_coefficient = nx.clustering(G)\n",
"average_clustering_coefficient = nx.average_clustering(G)\n",
"print(\"Average Clustering Coefficient:\", average_clustering_coefficient)\n",
"\n",
"# # Average path length and diameter (defined only for connected graphs)\n",
"# if nx.is_connected(G):\n",
"#     average_path_length = nx.average_shortest_path_length(G)\n",
"#     print(\"Average Path Length:\", average_path_length)\n",
"#     diameter = nx.diameter(G)\n",
"#     print(\"Diameter:\", diameter)\n",
"# else:\n",
"#     print(\"The graph is not connected; average path length and diameter are undefined.\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Lay out the graph\n",
"pos = nx.spring_layout(G)\n",
"\n",
"# Draw the graph with matplotlib\n",
"plt.figure(figsize=(12, 12))\n",
"nx.draw_networkx_nodes(G, pos, node_color='lightblue', edgecolors='k')\n",
"nx.draw_networkx_edges(G, pos, width=0.5, alpha=0.5, arrows=False)\n",
"nx.draw_networkx_labels(G, pos, font_size=8, font_family='sans-serif')\n",
"\n",
"# Highlight the top nodes by degree centrality\n",
"for node, centrality in top_nodes:\n",
"    nx.draw_networkx_nodes(G, pos, nodelist=[node], node_color='red', node_size=500)\n",
"\n",
"# Save the figure (the Agg backend renders off-screen)\n",
"plt.title('Repository Collaboration Network')\n",
"plt.axis('off')  # hide the axes\n",
"plt.savefig('./repo_graph.png')"
]
}
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
analysis_graph.py: 59 changes (59 additions & 0 deletions)
import networkx as nx
import json
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt

with open('./repo_graph.json', 'r') as file:
data = json.load(file)

G = nx.node_link_graph(data)

print("Number of nodes:", G.number_of_nodes())
print("Number of edges:", G.number_of_edges())

# Compute degree centrality
degree_centrality = nx.degree_centrality(G)
# Find the top N nodes by degree centrality
top_n = 10
top_nodes = sorted(degree_centrality.items(), key=lambda item: item[1], reverse=True)[:top_n]
print("Top {} nodes by degree centrality:".format(top_n))
for node, centrality in top_nodes:
print(f"Node: {node}, Degree Centrality: {centrality}")

# Clustering coefficient
clustering_coefficient = nx.clustering(G)
average_clustering_coefficient = nx.average_clustering(G)
print("Average Clustering Coefficient:", average_clustering_coefficient)

# # Average path length and diameter (defined only for connected graphs)
# if nx.is_connected(G):
#     average_path_length = nx.average_shortest_path_length(G)
#     print("Average Path Length:", average_path_length)
#     diameter = nx.diameter(G)
#     print("Diameter:", diameter)
# else:
#     print("The graph is not connected; average path length and diameter are undefined.")

# Lay out the graph
pos = nx.spring_layout(G)

# Draw the graph with matplotlib
plt.figure(figsize=(12, 12))
nx.draw_networkx_nodes(G, pos, node_color='lightblue', edgecolors='k')
nx.draw_networkx_edges(G, pos, width=0.5, alpha=0.5, arrows=False)
nx.draw_networkx_labels(G, pos, font_size=8, font_family='sans-serif')

# Highlight the top nodes by degree centrality
for node, centrality in top_nodes:
    nx.draw_networkx_nodes(G, pos, nodelist=[node], node_color='red', node_size=500)

# Save the figure (the Agg backend renders off-screen)
plt.title('Repository Collaboration Network')
plt.axis('off')  # hide the axes
plt.savefig('./repo_graph.png')