Commit

Add announcement CZI EOSS 6 (#190)
Co-authored-by: Reshama Shaikh <[email protected]>
Co-authored-by: Tim Head <[email protected]>
Co-authored-by: Lucy Liu <[email protected]>
4 people authored Aug 27, 2024
1 parent 251e034 commit 656e6a9
Showing 3 changed files with 110 additions and 0 deletions.
110 changes: 110 additions & 0 deletions _posts/2024-08-06-czi-eoss6-announcement.md
@@ -0,0 +1,110 @@
---
title: "Chan Zuckerberg Initiative considers scikit-learn an Essential Open Source Software"
date: August 6, 2024
categories:
- Funding
tags:
- Open Source
- Funding
- Internship
- Diversity
featured-image: sklearn_czi.png

postauthors:
  - name: Guillaume Lemaitre
    website: https://github.com/glemaitre
    image: guillaume-lemaitre.jpg
  - name: Lucy Liu
    website: https://github.com/lucyleeow
    image: lucyliu.jpeg
---
<div>
<img src="/assets/images/posts_images/{{ page.featured-image }}" alt="">
{% include postauthor.html %}
</div>

We are delighted to announce that `scikit-learn` has been awarded a grant from
the [Chan Zuckerberg Initiative (CZI)](https://chanzuckerberg.com/)'s [Essential Open
Source Software for Science
(EOSS)](https://chanzuckerberg.com/rfa/essential-open-source-software-for-science/)
program. This grant is funded by the [Wellcome Trust](https://wellcome.org/).
As in previous rounds, this cycle supports open-source software projects that are
essential to biomedical research. This is the third time that CZI EOSS has supported
`scikit-learn`.

In this new grant, we will focus on improving the [evaluation and inspection of
predictive
models](https://chanzuckerberg.com/eoss/proposals/predictive-models-evaluation-inspection-in-scikit-learn/).

## Predictive models evaluation & inspection

When building a machine learning pipeline for a specific research problem, two key
aspects are closely connected: (i) the design of the pipeline and (ii) its assessment,
analysis, and inspection. Researchers strive to identify the optimal pipeline, maximizing
specific evaluation metrics, while also seeking to explain the validity and rationale
behind the pipeline's predictions. This is the cornerstone of answering research
questions. With this proposal, we aim to improve and extend the available `scikit-learn`
tools.

`scikit-learn` provides building blocks for model evaluation and statistical analysis of
results. Originally, this information was presented in a raw format and required
expertise from scientists to create intuitive reports for outreach to peers and
outsiders. Recently, the `scikit-learn` community developed displays to easily generate
figures for communicating such results. However, these displays are still in
their early development stages and do not leverage all the statistical analysis
tools available in `scikit-learn` (e.g., cross-validation). We therefore aim to expand
these displays so that they use the right statistical tools, thereby promoting the
adoption of best practices when reporting results. We also intend to create new displays
to support common analysis tasks that are not yet covered in `scikit-learn`.
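
To make the "raw format" point concrete, here is a minimal sketch of what cross-validated evaluation returns today. The synthetic dataset, the choice of `LogisticRegression`, and the two metrics are assumptions made only for illustration; they are not part of the funded work.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic data stands in for a real research dataset (assumption).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Evaluate the same model on 5 folds with two metrics at once.
cv_results = cross_validate(
    LogisticRegression(max_iter=1000),
    X,
    y,
    cv=5,
    scoring=["accuracy", "roc_auc"],
)

# The raw output is a dict of per-fold arrays; turning it into a readable
# report (mean, spread, figure) is currently left to the user.
for metric in ("test_accuracy", "test_roc_auc"):
    scores = cv_results[metric]
    print(f"{metric}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Summarizing the per-fold scores by hand, as in the loop above, is the kind of step that a cross-validation-aware display could take care of.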

In the domain of model inspection, we aim to address several areas: (i) model inspection
during training, (ii) enhancing user experience through interactive inspection, and
(iii) model explainability. First, during the training of a pipeline, researchers are
interested in monitoring the internal characteristics of the model, a long-standing
issue that `scikit-learn` has not yet addressed. We want to build upon some initial work
by implementing a "callback" framework that allows users to track these internal
parameters. Next, researchers commonly use interactive tools such as Jupyter Notebook to
develop pipelines. `scikit-learn` has started some efforts to visually and interactively
display pipelines in these environments. However, there is room for improvement in terms
of user interaction and accessibility. Finally, as `scikit-learn` is widely used as a
reference package, it is crucial to improve the section of the library dedicated to
model explainability. We aim to improve the documentation and user experience of the
existing explainability tools, making sure that users pick the appropriate tool for their
use cases. In addition, we propose to work on a scikit-learn enhancement proposal (SLEP)
to define a common API for model explainability within scikit-learn. Ultimately, the
goal is to come to a consensus to provide scikit-learn end-users with a consistent
experience when using model explainability tools.
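
As an example of an existing explainability tool, the sketch below uses `permutation_importance` from `sklearn.inspection`. The synthetic regression data and the random forest model are assumptions chosen for illustration; the common API mentioned above does not exist yet and is not shown here.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data with 2 informative features out of 5 (assumption).
X, y = make_regression(
    n_samples=300, n_features=5, n_informative=2, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# Shuffle each feature 10 times on held-out data and record the score drop.
result = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=0
)

# Report features from most to least important.
for idx in result.importances_mean.argsort()[::-1]:
    mean = result.importances_mean[idx]
    std = result.importances_std[idx]
    print(f"feature {idx}: {mean:.3f} +/- {std:.3f}")
```

Computing the importances on a held-out set, rather than on the training data, reflects how much the model relies on each feature to generalize.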

On top of all these items, we intend to continue working on the general maintenance of
the project, addressing bug reports and performance regressions. As a community-driven
project, we also want to dedicate time to reviewing external contributions.

## Involved people

To execute this project, we plan the following hires:

- [Lucy Liu](https://github.com/lucyleeow) (Quansight Labs) will work about half-time on
the project, on topics related to displays and feature importance.
- We will hire full-time interns to work on the other parts of the project. The
initial plan is to hire two interns for a period of 6 months each and repeat this
process for the next 2 years. We want to provide opportunities to underrepresented
groups in the field of machine learning and data science, similarly to previous
initiatives (cf. [NumFOCUS Small Development
Grant](https://blog.scikit-learn.org/diversity/mentoring/)).

## Past CZI EOSS grants

In the past, `scikit-learn` has been awarded two grants from the CZI EOSS program:

- [CZI EOSS Cycle 1](https://chanzuckerberg.com/eoss/proposals/scikit-learn-maintenance-and-enhancement-for-gradient-boosting/)
helped create the
[`HistGradientBoostingClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.HistGradientBoostingClassifier.html) and
[`HistGradientBoostingRegressor`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.HistGradientBoostingRegressor.html) estimators.
These estimators are the equivalent of gradient boosting models implemented in
`LightGBM` and `XGBoost`.
- [CZI EOSS Cycle 4](https://chanzuckerberg.com/eoss/proposals/maintenance-extension-of-scikit-learn-machine-learning-in-python/)
extended `scikit-learn` to work better with missing values and categorical data in
several estimators.

Both grants allowed us to maintain and enhance `scikit-learn` to better serve the
community.
Binary file added assets/images/posts_images/sklearn_czi.png