location of the code for the active learning and coverage suggestions #2

Open
dkorenci opened this issue Aug 2, 2024 · 2 comments

dkorenci commented Aug 2, 2024

Hi, the MEGAnno paper mentions "Active suggestions" and "coverage" suggestions (Project.suggest_coverage).
I have looked for these functionalities both in the code and in the documentation, but I could not locate them.
Any pointers would be most helpful. I'm using v1.5.4 of the code. Thanks!

@rafaellichen rafaellichen added the question Further information is requested label Aug 2, 2024
Contributor

horseno commented Aug 2, 2024

Thanks for your interest.
MEGAnno doesn't directly support active learning; instead, it supports selecting data in an active manner.
Below are some old scripts that apply random forest active learning to the Tweet dataset.

# Load the initial training data and the held-out test data
import numpy as np
from modAL.models import ActiveLearner
from modAL.uncertainty import uncertainty_sampling
from sklearn.ensemble import RandomForestClassifier
from meganno_client import subset

npzfile = np.load('tweet_init_test.npz')
X_init, y_init = npzfile['X_init'], npzfile['y_init']
X_test, y_test = npzfile['X_test'], npzfile['y_test']

# Initialize the learner: a random forest queried by uncertainty sampling
acc_list = []
learner = ActiveLearner(
    estimator=RandomForestClassifier(),
    query_strategy=uncertainty_sampling,
    X_training=X_init, y_training=y_init
)

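# `demo` is the MEGAnno service object; its setup is not shown in this snippet
# Fetch a pool of candidates together with their stored BERT embeddings
s_pool = demo.search(keyword='', meta_names=['bert-embedding'], limit=100, start=0)
uuid_pool, X_pool = [], []
for item in s_pool.value():
    uuid_pool.append(item['uuid'])
    X_pool.append(item['metadata'][0]['value'])
# Convert to an array so the pool can be indexed by the query result
X_pool = np.array(X_pool)

# Active selection: let the model select the next batch
query_idx, query_inst = learner.query(X_pool, n_instances=3)
next_batch = subset.Subset(service=demo, data_uuids=list(np.array(uuid_pool)[query_idx]))
next_batch.show()  # -> annotate in the widget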

# Collect the new labels (y_new) once the batch has been annotated in the widget
y_new = []
for item in next_batch.value():
    labels = item['annotation_list'][0]['labels_record']
    for label in labels:
        if label['label_name'] == 'sentiment':
            y_new.append(label['label_value'][0])

# Remove the queried items from the pool
X_pool, uuid_pool = np.delete(X_pool, query_idx, axis=0), np.delete(uuid_pool, query_idx, axis=0)

# Update the learner with the new labels and record test accuracy
learner.teach(query_inst, np.array(y_new))
acc = learner.score(X_test, y_test)
acc_list.append(acc)
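
For readers unfamiliar with the query step, here is a rough NumPy-only sketch of what uncertainty sampling does conceptually: score each pool item by one minus its highest predicted class probability and pick the least confident ones. The function name `least_confident_indices` and the toy probabilities are made up for illustration; this is not modAL's or MEGAnno's code.

```python
import numpy as np

def least_confident_indices(proba, n_instances=3):
    """Return indices of the rows whose top class probability is lowest.

    `proba` is an (n_samples, n_classes) array of predicted class
    probabilities, e.g. the output of a classifier's predict_proba.
    (Hypothetical helper, for illustration only.)
    """
    proba = np.asarray(proba, dtype=float)
    uncertainty = 1.0 - proba.max(axis=1)  # high when the model is unsure
    # Sort by descending uncertainty; a stable sort keeps ties in input order
    return np.argsort(-uncertainty, kind="stable")[:n_instances]

# Toy example: four pool items, two classes
toy_proba = np.array([[0.90, 0.10],
                      [0.55, 0.45],
                      [0.60, 0.40],
                      [0.99, 0.01]])
picked = least_confident_indices(toy_proba, n_instances=2)
```

In the script above, the returned indices play the role of `query_idx`: they select which UUIDs from the pool go into the next subset for annotation.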

As for coverage suggestions, they could be implemented in various ways. Our previous implementation clustered all labeled data points and sampled the unlabeled data points with a large average distance from the cluster centroids. It was excluded from the release due to efficiency issues.
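
One possible reading of the coverage idea described above, as a minimal sketch and not the removed MEGAnno implementation: cluster the labeled embeddings, then rank unlabeled items by their average distance to the centroids. Everything here is an assumption for illustration: the names `kmeans_centroids` and `coverage_candidates`, the plain Lloyd's k-means, the Euclidean distance, and taking the top-n farthest points instead of sampling among them.

```python
import numpy as np

def kmeans_centroids(X, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means; returns the (k, d) centroid matrix."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from k distinct data points (fancy indexing returns a copy)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every centroid: shape (n, k)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        for j in range(k):
            members = X[assign == j]
            if len(members):  # leave an empty cluster's centroid unchanged
                centroids[j] = members.mean(axis=0)
    return centroids

def coverage_candidates(X_labeled, X_unlabeled, k=2, n_suggestions=3):
    """Indices of unlabeled points farthest, on average, from the centroids."""
    centroids = kmeans_centroids(X_labeled, k)
    X_unlabeled = np.asarray(X_unlabeled, dtype=float)
    dists = np.linalg.norm(
        X_unlabeled[:, None, :] - centroids[None, :, :], axis=2
    )
    avg_dist = dists.mean(axis=1)
    return np.argsort(-avg_dist, kind="stable")[:n_suggestions]

# Toy 2-D "embeddings": labeled points form two tight groups
X_lab = [[0, 0], [0.1, 0], [0, 0.1], [1, 1], [1.1, 1], [1, 1.1]]
X_unl = [[0.05, 0.05], [10, 10], [1, 1.05], [-4, 5]]
suggested = coverage_candidates(X_lab, X_unl, k=2, n_suggestions=2)
```

The pairwise distance matrix costs O(n·k·d) per pass over all labeled and unlabeled points, which hints at why an approach like this can become slow on large projects.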

Let me know if you need further clarification or are interested in contributing. Thanks!

Author

dkorenci commented Aug 8, 2024

OK, thank you for the clarifications and for the instructions on how to run AL.
