- Make sure that you don't have different case email duplicates in `src/cncf-config/email-map`: `cd src`, `./lower_unique.sh cncf-config/email-map`.
- If you generated new email-map using `./import_affs.sh`, then: `mv email-map cncf-config/email-map`
- To generate the `git.log` file and make sure it includes all orgs used by `devstats`, use `cncf/devstats`'s `GHA2DB_PROJECTS_OVERRIDE="+cncf,+opencontainers,+istio,+spinnaker,+knative,+linux,+zephyr" PG_PASS=... GHA2DB_EXTERNAL_INFO=1 GHA2DB_PROCESS_REPOS=1 ./get_repos` and then run the final command line it generates. Make it `uniq`.
- To get repos from CDF use: `PG_PASS=... GHA2DB_EXTERNAL_INFO=1 GHA2DB_PROCESS_REPOS=1 GHA2DB_PROJECTS_YAML=cdf_projects.yaml get_repos`.
- To get GraphQL repos use: `AWS_PROFILE=... KUBECONFIG=... helm install ./devstats-helm-graphql --set skipSecrets=1,skipPVs=1,skipProvisions=1,skipCrons=1,skipGrafanas=1,skipServices=1,skipPostgres=1,skipIngress=1,bootstrapPodName=debug,bootstrapCommand=sleep,bootstrapCommandArgs={36000s}`, `AWS_PROFILE=... KUBECONFIG=... ../devstats-k8s-lf/util/pod_shell.sh debug`, `GHA2DB_EXTERNAL_INFO=1 GHA2DB_PROCESS_REPOS=1 GHA2DB_PROJECTS_YAML=gql/projects.yaml GHA2DB_LOCAL=1 get_repos`, `AWS_PROFILE=... KUBECONFIG=... kubectl delete po debug`.
- To get LF repos use: `AWS_PROFILE=... KUBECONFIG=... helm install ./devstats-helm --set skipSecrets=1,skipPVs=1,skipProvisions=1,skipCrons=1,skipGrafanas=1,skipServices=1,bootstrapPodName=debug,bootstrapCommand=sleep,bootstrapCommandArgs={36000s}`, `AWS_PROFILE=... KUBECONFIG=... ../devstats-k8s-lf/util/pod_shell.sh debug`, `ONLY='iovisor mininet opennetworkinglab opensecuritycontroller openswitch p4lang openbmp tungstenfabric cord' GHA2DB_PROPAGATE_ONLY_VAR=1 GHA2DB_EXTERNAL_INFO=1 GHA2DB_PROCESS_REPOS=1 GHA2DB_PROJECTS_YAML=k8s/projects.yaml GHA2DB_LOCAL=1 get_repos`, `AWS_PROFILE=... KUBECONFIG=... kubectl delete po debug`.
- Update `repos.txt` to contain all repositories returned by the above commands. Update `all_repos.sh` to include data from CNCF, CDF, LF and GraphQL.
- To run `cncf/gitdm` on a generated `git.log` file run: `cd src/; ~/dev/alt/gitdm/src/cncfdm.py -i git.log -r "^vendor/|/vendor/|^Godeps/" -R -n -b ./ -t -z -d -D -A -U -u -o all.txt -x all.csv -a all_affs.csv > all.out`. The new approach is `./mtp`, but it doesn't (yet) have a way to deal with the same emails mapped into different user names from different per-thread buckets.
- To generate human-readable text affiliation files, first run `./enchance_all_affs.sh`, then: `SKIP_COMPANIES="(Unknown)" ./gen_aff_files.sh`.
- If updating via `ghusers.sh` or `ghusers_cached.sh` (step 6), run `generate_actors.sh` too. If you need LF actors, run: `AWS_PROFILE=... KUBECONFIG=... ./generate_actors_lf.sh`, `AWS_PROFILE=... KUBECONFIG=... ./generate_actors_gql.sh` prior to running `./generate_actors.sh` and `./generate_actors_cncf.sh`.
- Consider `./ghusers_cached.sh` or `./ghusers.sh` (if you run this, then copy the resulting JSON somewhere and get 0-committers from the previous version to save GH API points). Sometimes you should just run `./ghusers.sh` without cache.
- Recommended: `ghusers_partially_cached.sh 2> errors.txt` will refetch repos metadata and commits since the last fetch and get users data from `github_users.json`, so you can save a lot of API points. You can prepend `NCPUS=N` to override autodetection of the number of CPU cores available.
- To copy the source type from the previous JSON version, run `./copy_source.sh`
- Run `./company_names_mapping.sh` to fix typical company name spelling errors, lower/upper case etc. Update `company-names-mapping` before running this (with new typos/correlations data from the last 3 steps).
- To update (enhance) `github_users.json` with new affiliations, run `./enhance_json.sh`. If you ran `ghusers`, you may need to update `skip_github_logins.txt` with new broken GitHub logins found. This is optional if you already have an enhanced JSON. You can prepend `NCPUS=N` to override autodetection of the number of CPU cores available.
- To merge with the previous JSON use: `./merge_jsons.sh`.
- To merge multiple GitHub logins data (for example, to propagate a known affiliation to unknown or not-found entries on the same GitHub login) run: `./merge_github_logins.sh`.
- Because this can find new affiliations, you can now use `./import_from_github_users.sh` to import back from `github_users.json`, then run `./lower_unique.sh cncf-config/email-map` and restart from step 4. This uses the `company-names-mapping` file to import from the GitHub `company` field.
- Run `./correlations.sh` and examine its output `correlations.txt` to try to normalize company names and remove common suffixes like Ltd., Corp. and downcase/upcase differences.
- Run `./check_spell` to find fuzzy-match/spelling errors (uses Levenshtein distance to find bugs).
- Run `./lookup_json.sh` and examine its output JSONs. Those GitHub profiles have some useful data directly available, which will save you some manual research work.
- ALWAYS before any commit to GitHub run: `./handle_forbidden_data.sh` to remove any forbidden affiliations; please also see FORBIDDEN_DATA.md.
- You can use `./clear_affiliations_in_json.sh` to clear all affiliations on a generated `github_users.json`.
- To make the JSON unique, call `./unique_json.rb github_users.json`. To sort the JSON by commits, login, email use: `./sort_json.rb github_users.json`.
- You should run genderize/geousers (if needed) before the next step.
- You can create a smaller final JSON for `cncf/devstats` using `./delete_json_fields.sh github_users.json; ./check_source.rb github_users.json; ./strip_json.sh github_users.json stripped.json; cp stripped.json ~/dev/go/src/github.com/cncf/devstats/github_users.json`.
- To generate the final `unknowns.csv` manual research task file run: `./gen_aff_task.rb unknowns.txt`. You can also generate all actors: `./gen_aff_task.rb alldevs.txt`. You can prepend `ONLY_GH=1` to skip entries without GitHub. You can prepend `ONLY_EMP=1` to skip entries with any affiliation already set.
- To manually edit all affiliations-related files, edit: `cncf-config/email-map all.txt all.csv all_affs.csv github_users.json stripped.json ../developers_affiliations.txt ../company_developers.txt affiliations.csv`
- To add all possible entries from `github_users.json` to `cncf-config/email-map` use: `github_users_to_map.sh`. This is optional.
- Finally copy `github_users.json` to `github_users.old`. You can check if the JSON fields are correct via `./check_json_fields.sh github_users.json`, `./check_json_fields.sh stripped.json small`.
- If any file displays an error with 'Invalid UTF-8' encoding, scrub it using the Ruby tool: `./scrub.rb filename`.
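The first step in the list above guards against emails that differ only by letter case. As a rough illustration of what such a check looks for (this is not the contents of `lower_unique.sh`), the sketch below lists emails that appear with more than one spelling. The function name `case_duplicates` is hypothetical, and it assumes the email is the first whitespace-separated field on each line and that lines starting with `#` are comments.

```shell
# Hypothetical helper, not part of the repository.
# Assumption: the email is the first whitespace-separated field of each line.
case_duplicates() {
  awk '
    /^#/ || NF == 0 { next }                  # skip comments and blank lines
    { key = tolower($1) }                     # case-insensitive comparison key
    !(key in first) { first[key] = $1; next } # remember the first spelling seen
    first[key] != $1 && !(key in reported) {  # a different spelling of the same email
      reported[key] = 1
      print key
    }
  ' "$1"
}
```

Running `case_duplicates cncf-config/email-map` from `src` would print each affected email once, lower-cased; empty output means no case-only duplicates were found under the stated assumptions.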
`./all_repos_log.sh /root/devstats_repos/jenkins-x/* /root/devstats_repos/jenkinsci/* /root/devstats_repos/spinnaker/* /root/devstats_repos/tektoncd/* /root/devstats_repos/Azure/* /root/devstats_repos/BuoyantIO/* /root/devstats_repos/GoogleCloudPlatform/* /root/devstats_repos/OpenObservability/* /root/devstats_repos/RichiH/* /root/devstats_repos/Virtual-Kubelet/* /root/devstats_repos/alibaba/* /root/devstats_repos/apcera/* /root/devstats_repos/appc/* /root/devstats_repos/brigadecore/* /root/devstats_repos/buildpack/* /root/devstats_repos/cdfoundation/* /root/devstats_repos/cloudevents/* /root/devstats_repos/cncf/* /root/devstats_repos/containerd/* /root/devstats_repos/containernetworking/* /root/devstats_repos/coredns/* /root/devstats_repos/coreos/* /root/devstats_repos/cortexproject/* /root/devstats_repos/crosscloudci/* /root/devstats_repos/datawire/* /root/devstats_repos/docker/* /root/devstats_repos/dragonflyoss/* /root/devstats_repos/draios/* /root/devstats_repos/envoyproxy/* /root/devstats_repos/etcd-io/* /root/devstats_repos/falcosecurity/* /root/devstats_repos/fluent/* /root/devstats_repos/goharbor/* /root/devstats_repos/grpc/* /root/devstats_repos/helm/* /root/devstats_repos/istio/* /root/devstats_repos/jaegertracing/* /root/devstats_repos/knative/* /root/devstats_repos/kubeedge/* /root/devstats_repos/kubernetes/* /root/devstats_repos/kubernetes-client/* /root/devstats_repos/kubernetes-csi/* /root/devstats_repos/kubernetes-graveyard/* /root/devstats_repos/kubernetes-helm/* /root/devstats_repos/kubernetes-incubator/* /root/devstats_repos/kubernetes-incubator-retired/* /root/devstats_repos/kubernetes-retired/* /root/devstats_repos/kubernetes-security/* /root/devstats_repos/kubernetes-sig-testing/* /root/devstats_repos/kubernetes-sigs/* /root/devstats_repos/linkerd/* /root/devstats_repos/lyft/* /root/devstats_repos/miekg/* /root/devstats_repos/nats-io/* /root/devstats_repos/open-policy-agent/* /root/devstats_repos/opencontainers/* /root/devstats_repos/openeventing/* /root/devstats_repos/opentracing/* /root/devstats_repos/pingcap/* /root/devstats_repos/prometheus/* /root/devstats_repos/rkt/* /root/devstats_repos/rktproject/* /root/devstats_repos/rook/* /root/devstats_repos/spiffe/* /root/devstats_repos/telepresenceio/* /root/devstats_repos/theupdateframework/* /root/devstats_repos/tikv/* /root/devstats_repos/torvalds/* /root/devstats_repos/uber/* /root/devstats_repos/virtual-kubelet/* /root/devstats_repos/vitessio/* /root/devstats_repos/vmware/* /root/devstats_repos/weaveworks/* /root/devstats_repos/youtube/* /root/devstats_repos/zephyrproject-rtos/* /root/devstats_repos/iovisor/* /root/devstats_repos/mininet/* /root/devstats_repos/open-switch/* /root/devstats_repos/opencord/* /root/devstats_repos/opennetworkinglab/* /root/devstats_repos/opensecuritycontroller/* /root/devstats_repos/p4lang/* /root/devstats_repos/tungstenfabric/*`.
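When several `get_repos` runs are combined into one argument list like the one above, the same org pattern can appear more than once. One possible way to de-duplicate such a list while keeping the original order is sketched below; the function name `uniq_args` is hypothetical and it assumes no argument contains a newline.

```shell
# Hypothetical helper, not part of the repository.
# Prints its arguments on one line with repeated entries removed,
# keeping the first occurrence of each (order-preserving).
uniq_args() {
  printf '%s\n' "$@" | awk '!seen[$0]++' | paste -sd' ' -
}
```

Quote the patterns when calling it, for example `uniq_args '/root/devstats_repos/cncf/*' '/root/devstats_repos/cncf/*'`, so that the shell does not expand the globs before they are compared.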
- Open CNCF projects maintainers list
- Save "Name", "Company", "GitHub name" columns to a new sheet and download it as "maintainers.csv".
- Add "name,company,login" CSV header.
- Example file
- Run the `[ONLYNEW=1] ./maintainers.sh` script. Follow its instructions.
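The header step above can also be done from the command line instead of a spreadsheet editor. The sketch below prepends the `name,company,login` header to a downloaded CSV unless it is already the first line; the function name `add_header` is hypothetical, and it assumes the file holds exactly the three columns in that order.

```shell
# Hypothetical helper, not part of the repository.
# Prepends the CSV header unless the first line already is the header.
add_header() {
  local file="$1" header='name,company,login'
  if [ "$(head -n 1 "$file")" != "$header" ]; then
    { printf '%s\n' "$header"; cat "$file"; } > "$file.tmp" && mv "$file.tmp" "$file"
  fi
}
```

Calling `add_header maintainers.csv` twice is harmless, since the second call finds the header already in place.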
Please follow the instructions from ADD_PROJECT.md.
To add geo data (`country_id`, `tz`) and gender data (`sex`, `sex_prob`), do the following:
- Download the `allCountries.zip` file from the geonames server.
- Create the `geonames` database via: `sudo -u postgres createdb geonames`, `sudo -u postgres psql -f geonames.sql`. Table details in `geonames.info`
- Unzip `allCountries.zip` and run `PG_PASS=... ./geodata.sh allCountries.tsv` - this will populate the DB.
- Create indices on columns to speed up localization: `sudo -u postgres psql -f geonames_idx.sql`.
- If this is the first geousers run, create `geousers_cache.json` via `cp empty.json geousers_cache.json`.
- To use the cache it is best to have `stripped.json` from the previous run. See step 22.
- Enhance `github_users.json` via `PG_PASS=... ./geousers.sh github_users.json stripped.json geousers_cache.json 2000`. It will add `country_id` and `tz` fields.
- Go to store.genderize.io and get your `API_KEY`; the basic subscription ($9) allows 100,000 monthly gender lookups.
- If this is the first genderize run, create `genderize_cache.json` via `cp empty.json genderize_cache.json`.
- Enhance `github_users.json` via `API_KEY=... ./genderize.sh github_users.json stripped.json genderize_cache.json 2000`. It will add `sex` and `sex_prob` fields.
- You can skip `API_KEY=...`, but then only 1000 gender lookups/day are allowed.
- Copy the enhanced JSON to devstats: `./strip_json.sh github_users.json stripped.json; cp stripped.json ~/dev/go/src/devstats/github_users.json`
- Import the new JSON on devstats using the `./import_affs` tool.
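Both cache files above are created from `empty.json` only on a first run, and overwriting an existing cache would throw away earlier lookups. A minimal sketch of a guard that copies the template only when the cache is missing follows; the function name `init_cache` is hypothetical, and the optional second argument is an assumption added for illustration.

```shell
# Hypothetical helper, not part of the repository.
# Creates a cache file from a template only if the cache does not exist yet.
init_cache() {
  local cache="$1" template="${2:-empty.json}"
  if [ -e "$cache" ]; then
    echo "keeping existing $cache"
  else
    cp "$template" "$cache" && echo "created $cache"
  fi
}
```

For example, `init_cache geousers_cache.json` and `init_cache genderize_cache.json` could be run before every geousers or genderize pass without losing a populated cache.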
- To import manual affiliations from a Google sheet, save this sheet as `affiliations.csv` and then use the `./affiliations.sh` script.
- Prepend `UPDATE=1` to only import those marked as changed: column `changes='x'`.
- Prepend `DBG=1` to enable verbose output.
- After finishing the import, add a status line to the `affiliations_import.txt` file and update the online spreadsheet.
- After importing new data run `./src/burndown.sh` (from the src's parent directory). Do this after processing all data mentioned here, not after just importing a new CSV.
- Import the generated `csv/burndown.csv` data into https://docs.google.com/spreadsheets/d/1RxEbZNefBKkgo3sJ2UQz0OCA91LDOopacQjfFBRRqhQ/edit?usp=sharing.
- To calculate the CNCF/LF ratio, take the number of CNCF found from the last commit minus the number of CNCF found from some previous commit, divided by the same difference computed for all actors.
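The ratio in the last item can be worked through with a small calculation. The sketch below reads it as (CNCF found at the last commit minus CNCF found at a previous commit) divided by (all actors found at the last commit minus all actors found at that previous commit); this reading, the function name `cncf_lf_ratio`, and the numbers in the usage note are assumptions for illustration, not values from the burndown data.

```shell
# Hypothetical helper, not part of the repository.
# Arguments: CNCF found (last), CNCF found (previous),
#            all actors found (last), all actors found (previous).
cncf_lf_ratio() {
  awk -v cl="$1" -v cp="$2" -v al="$3" -v ap="$4" 'BEGIN {
    d = al - ap
    if (d == 0) { print "undefined"; exit 1 }  # no change for all actors
    printf "%.3f\n", (cl - cp) / d
  }'
}
```

With made-up counts, `cncf_lf_ratio 150 100 400 200` gives (150 - 100) / (400 - 200) and prints `0.250`.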