location of the code for the active learning and coverage suggestions #2

dkorenci · 2024-08-02T16:45:05Z

Hi, the MEGAnno paper mentions "Active suggestions" and "coverage" suggestions (Project.suggest_coverage).
I have looked for these functionalities both in the code and in the documentation, but I could not locate them.
Any pointers would be most helpful. I'm using v1.5.4 of the code. Thanks!

horseno · 2024-08-02T22:20:23Z

Thanks for your interest.
MEGAnno doesn't directly support active learning; instead, it provides support for selecting data in an active manner.
Below are some old scripts that apply random forest active learning to the Tweet dataset.

# Load the initial training set and the held-out test set
import numpy as np
from modAL.models import ActiveLearner
from modAL.uncertainty import uncertainty_sampling
from sklearn.ensemble import RandomForestClassifier
from meganno_client import subset

npzfile = np.load('tweet_init_test.npz')
X_init, y_init = npzfile['X_init'], npzfile['y_init']
X_test, y_test = npzfile['X_test'], npzfile['y_test']

# Initialize the learner with a random forest and uncertainty sampling
acc_list = []
learner = ActiveLearner(
    estimator=RandomForestClassifier(),
    query_strategy=uncertainty_sampling,
    X_training=X_init, y_training=y_init
)

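# `demo` is assumed to be an already-connected MEGAnno service object
# Fetch up to 100 pool items together with their BERT embeddings
s_pool = demo.search(keyword='', meta_names=['bert-embedding'], limit=100, start=0)
uuid_pool, X_pool = [], []
for item in s_pool.value():
    uuid_pool.append(item['uuid'])
    X_pool.append(item['metadata'][0]['value'])
X_pool = np.array(X_pool)  # array form so the queried rows can be indexed
# Active selection: let the model select the next batch
query_idx, query_inst = learner.query(X_pool, n_instances=3)
next_batch = subset.Subset(service=demo, data_uuids=list(np.array(uuid_pool)[query_idx]))
next_batch.show()  # -> annotate in the widget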

# Collect the new labels (y_new) from the annotated batch
y_new = []
for item in next_batch.value():
    # Takes the first entry of annotation_list for each item
    labels = item['annotation_list'][0]['labels_record']
    for l in labels:
        if l['label_name'] == 'sentiment':
            y_new.append(l['label_value'][0])
# Remove the queried items from the pool
X_pool, uuid_pool = np.delete(X_pool, query_idx, axis=0), np.delete(uuid_pool, query_idx, axis=0)

#update learner and compute accuracy
learner.teach(query_inst, np.array(y_new))
acc = learner.score(X_test, y_test)
acc_list.append(acc)
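
For readers unfamiliar with the query strategy used above: uncertainty sampling ranks pool items by how unsure the classifier is about its top prediction, i.e. 1 minus the highest predicted class probability. The snippet below is only a minimal NumPy sketch of that ranking step, not modAL's code; the function name `least_confident_indices` and the toy probabilities are made up for illustration.

```python
import numpy as np


def least_confident_indices(proba, n_instances=3):
    """Rank rows of a (n_samples, n_classes) probability matrix by uncertainty.

    Hypothetical helper: uncertainty is 1 - max class probability, and the
    indices of the n_instances most uncertain rows are returned.
    """
    proba = np.asarray(proba, dtype=float)
    uncertainty = 1.0 - proba.max(axis=1)
    # Stable sort on negated uncertainty: most uncertain first.
    return np.argsort(-uncertainty, kind="stable")[:n_instances]


# Toy predicted probabilities for four pool items and two classes.
toy_proba = [[0.90, 0.10],
             [0.55, 0.45],
             [0.60, 0.40],
             [0.99, 0.01]]
picked = least_confident_indices(toy_proba, n_instances=2)
print(picked)
```

In the script, `learner.query` plays this role, with the probabilities coming from the random forest's predictions on `X_pool`.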

As for coverage suggestion, it could be implemented in various ways. Our previous implementation clustered all labeled data points and sampled the unlabeled data points with a large average distance from the cluster centroids. It was excluded from the release due to efficiency issues.
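
To make the idea concrete, here is a rough sketch of that approach. It is not the excluded MEGAnno implementation: the function names (`kmeans_centroids`, `coverage_candidates`), the plain NumPy k-means, and the defaults for `k` and `n_suggestions` are all assumptions chosen for illustration.

```python
import numpy as np


def kmeans_centroids(X, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means (illustrative); returns a (k, d) centroid array."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Assign every point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        for j in range(k):
            members = X[assign == j]
            if len(members):  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids


def coverage_candidates(X_labeled, X_unlabeled, k=3, n_suggestions=5):
    """Indices into X_unlabeled, largest average centroid distance first."""
    X_unlabeled = np.asarray(X_unlabeled, dtype=float)
    centroids = kmeans_centroids(X_labeled, k)
    dists = np.linalg.norm(
        X_unlabeled[:, None, :] - centroids[None, :, :], axis=2
    )
    avg_dist = dists.mean(axis=1)
    return np.argsort(-avg_dist)[:n_suggestions]


# Toy data: labeled points around the origin, one far-away unlabeled point.
rng = np.random.default_rng(42)
X_lab = rng.normal(0.0, 1.0, size=(30, 2))
X_unl = np.array([[0.0, 0.0], [0.5, 0.5], [50.0, 50.0], [1.0, -1.0]])
suggested = coverage_candidates(X_lab, X_unl, k=3, n_suggestions=2)
print(suggested)
```

With real data, the embedding vectors fetched through `demo.search` could serve as the feature rows, and the UUIDs at the suggested indices could be passed to `subset.Subset` for annotation. The full pairwise distance matrix grows with pool size times `k`, which gives a feel for why efficiency becomes a concern on large projects.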

Let me know if you need further clarification or are interested in contributing. Thanks!

dkorenci · 2024-08-08T11:22:38Z

OK, thank you for the clarifications and for the instructions on how to run AL.

rafaellichen assigned rafaellichen and horseno and unassigned rafaellichen Aug 2, 2024

rafaellichen added the "question" label Aug 2, 2024
