Commit c3dfb65 (1 parent: 4d12f9f)
Showing 4 changed files with 106 additions and 76 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -11,17 +11,17 @@ | |
<div class="right"> | ||
<p>Hi! I'm Lj Miranda, and welcome to my website!</p> | ||
<p> | ||
I'm a <a href="https://allenai.org/predoctoral-young-investigators">predoctoral | ||
I'm currently a <a href="https://allenai.org/predoctoral-young-investigators">predoctoral | ||
researcher</a> at the <a href="https://allenai.org/allennlp">AllenNLP | ||
team</a> at <a href="https://allenai.org/">Ai2</a>. Previously, I was a | ||
machine learning engineer at <a href="https://explosion.ai">Explosion</a> working on <a | ||
href="https://spacy.io">spaCy</a>. | ||
team</a> at <a href="https://allenai.org/">Ai2</a>. In the past, I've worked as an <a | ||
href="https://storage.googleapis.com/ljvmiranda/cv.pdf">engineer, consultant, and | ||
researcher</a>, mostly in the field of NLP and AI. | ||
</p> | ||
<p> | ||
I'm broadly interested in building equitable language technologies in the presence of severe data and | ||
compute constraints. | ||
I'm broadly interested in data-centric approaches to building language technologies at scale. | ||
<!-- My research interests include efficient NLP, low-resource languages, and corpus linguistics. --> | ||
I'm happy to discuss research and collaborate, so feel free to <a href="mailto:[email protected]">reach | ||
I'm happy to discuss <a href="/research">research</a> and collaborate, so feel free to <a | ||
href="mailto:[email protected]">reach | ||
out</a>! | ||
</p> | ||
<!-- <p> | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -5,38 +5,91 @@ description: Research work of Lester James V. Miranda | |
permalink: /research/ | ||
--- | ||
|
||
I'm broadly interested in building equitable language technologies in the | ||
presence of severe constraints— such as the lack of data or compute. | ||
My current research interests are **low-resource and multilingual NLP**, | ||
**efficient NLP**, and **corpus linguistics**. | ||
<!-- I am interested in **how we can use data-centric techniques to improve the construction of datasets** for training and evaluating large language models. --> | ||
I'm broadly interested in **data-centric approaches to building language technologies at scale.** | ||
<!-- Focusing on data instead of models is crucial, as we face diminishing returns from model scaling and growing concerns about model reliability and fairness. --> | ||
My goal is to <u>develop systematic methodologies for efficiently constructing NLP resources</u> while actively <u>building new datasets and benchmarks</u> to enhance language model training and evaluation. | ||
More concretely, I'm interested in the following areas: | ||
|
||
- **Efficient approaches to annotation**: | ||
Human annotations are costly. How can we reduce this cost while preserving the nuance that human annotators provide? I'm currently exploring this question in the context of human preferences in LLM post-training (RLHF). | ||
|
||
- **Resources for multilingual NLP**: | ||
No language should be left behind, especially in data. | ||
I've worked on several datasets to improve the state of low-resource and multilingual NLP. | ||
These projects involve Filipino [datasets](https://aclanthology.org/2023.sealp-1.2/) & [tooling](https://aclanthology.org/2023.nlposs-1.1/), and [large-scale multilingual datasets](https://aclanthology.org/2024.naacl-long.243/). | ||
|
||
- **Faithful benchmarks of model capabilities**: | ||
How can we design benchmarks that accurately reflect the true capabilities and limitations of LLMs? | ||
I've explored this question in the context of [evaluating reward models (RewardBench)](https://arxiv.org/abs/2403.13787), and in assessing multilingual capabilities of LLMs on [Southeast Asian languages](https://arxiv.org/abs/2406.10118). | ||
|
||
If you are interested in these types of work, especially in improving the state of Filipino NLP, then do not hesitate to [reach out](mailto:[email protected]). | ||
I'm happy to discuss research and collaborate! | ||
|
||
| ||
|
||
--- | ||
|
||
| ||
|
||
## Selected Publications | ||
|
||
Below is a list of my publications. | ||
You can also check my [Google | ||
Scholar](https://scholar.google.co.jp/citations?user=2RtnNKEAAAAJ&hl=en) and | ||
[Semantic | ||
Scholar](https://www.semanticscholar.org/author/Lester-James-V.-Miranda/13614871) | ||
profiles for more info. | ||
profiles for more up-to-date information. | |
|
||
### 2024 | ||
|
||
- [RewardBench: Evaluating Reward Models](https://arxiv.org/abs/2403.13787) <br> Nathan Lambert, Valentina Pyatkin, Jacob Morrison, <u>LJ Miranda</u>, Bill Yuchen Lin, Khyathi Chandu, Nouha Dziri, Sachin Kumar, Tom Zick, Yejin Choi, Noah A. Smith, and Hannaneh Hajishirzi. Preprint. *arXiv:2403.13787 [cs.LG]*. March 2024. <br> [[Leaderboard](https://huggingface.co/spaces/allenai/reward-bench)] [[Code](https://github.com/allenai/reward-bench)] [[Blog](https://blog.allenai.org/rewardbench-the-first-benchmark-leaderboard-for-reward-models-used-in-rlhf-1d4d7d04a90b)] | ||
*At AI2, I'm working on various aspects of LM adaptation such as preference data collection and evaluation. I also expanded my work on the multilingual NLP front (SEACrowd, SIGTYP).* | |
|
||
- [SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages](https://arxiv.org/abs/2406.10118) | ||
<br>*arXiv preprint '24* | |
<br>Holy Lovenia\*, Rahmad Mahendra\*, Salsabil Maulana Akbar\*, <u>Lester James Miranda</u>\*, and 50+ other authors *(∗: major contributor)*. | ||
<br>[[Catalogue](https://seacrowd.github.io/seacrowd-catalogue)] [[Code](https://github.com/SEACrowd/seacrowd-datahub)] | ||
|
||
- [Consent in Crisis: The Rapid Decline of the AI Data Commons](https://www.dataprovenance.org/Consent_in_Crisis.pdf) | ||
<br>*Preprint '24* | ||
<br>Shayne Longpre, Robert Mahari, Ariel Lee, Campbell Lund,..., <u>Lester Miranda</u>, and 40+ other authors. I contributed to the annotation process design for Web Domain services and to the annotation quality review. | |
<br>[[Website](https://www.dataprovenance.org/)] [[Collection](https://github.com/Data-Provenance-Initiative/Data-Provenance-Collection)] [[New York Times Feature](https://www.nytimes.com/2024/07/19/technology/ai-data-restrictions.html)] | ||
|
||
- [Allen Institute for AI @ SIGTYP 2024 Shared Task on Word Embedding Evaluation for Ancient and Historical Languages](https://aclanthology.org/2024.sigtyp-1.18/). <u>Lester James V. Miranda</u>, *Proceedings of the EACL 6th Workshop on Research in Computational Linguistic Typology and Multilingual NLP*. ACL. St. Julian's, Malta. March 2024. <br> [[Code](https://github.com/ljvmiranda921/LiBERTus)] [[Video](https://www.youtube.com/watch?v=rjOw_G-Rv9I)] | ||
- [RewardBench: Evaluating Reward Models for Language Modelling](https://arxiv.org/abs/2403.13787) | ||
<br>*arXiv preprint '24* | |
<br> Nathan Lambert, Valentina Pyatkin, Jacob Morrison, <u>LJ Miranda</u>, Bill Yuchen Lin, Khyathi Chandu, Nouha Dziri, Sachin Kumar, Tom Zick, Yejin Choi, Noah A. Smith, and Hannaneh Hajishirzi <br> [[Leaderboard](https://huggingface.co/spaces/allenai/reward-bench)] [[Code](https://github.com/allenai/reward-bench)] [[Blog](https://blog.allenai.org/rewardbench-the-first-benchmark-leaderboard-for-reward-models-used-in-rlhf-1d4d7d04a90b)] | ||
|
||
|
||
- [Allen Institute for AI @ SIGTYP 2024 Shared Task on Word Embedding Evaluation for Ancient and Historical Languages](https://aclanthology.org/2024.sigtyp-1.18/) | ||
<br>*Special Interest Group on Typology (SIGTYP) Workshop @ EACL '24* | ||
<br><u>Lester James V. Miranda</u> <br> [[Code](https://github.com/ljvmiranda921/LiBERTus)] [[Video](https://www.youtube.com/watch?v=rjOw_G-Rv9I)] | ||
|
||
### 2023 | ||
|
||
- [calamanCy: a Tagalog Natural Language Processing Toolkit](https://aclanthology.org/2023.nlposs-1.1/) <br> <u>Lester James V. Miranda</u>, *Proceedings of the EMNLP 2023 Workshop on NLP Open Source Software (NLP-OSS)*. EMNLP. Singapore, Singapore. December 2023. | ||
*I spent the early parts of 2023 working on low-resource languages and multilinguality, especially Tagalog, my native language. I mostly focused on core NLP tasks: POS tagging, NER, dependency parsing, etc.* | ||
|
||
- [calamanCy: a Tagalog Natural Language Processing Toolkit](https://aclanthology.org/2023.nlposs-1.1/) | ||
<br>*NLP Open-Source Software (NLP-OSS) Workshop @ EMNLP '23* | ||
<br> <u>Lester James V. Miranda</u> | ||
<br> [[Code](https://github.com/ljvmiranda921/calamanCy)] [[Poster](/assets/png/calamancy/poster.png)] [[Video](https://youtu.be/2fbzs1KbFTQ?si=_vKEY11Z1Jzuaxeu)] | ||
|
||
- [Developing a Named Entity Recognition Dataset for Tagalog](https://aclanthology.org/2023.sealp-1.2/) <br> <u>Lester James V. Miranda</u>, *Proceedings of the IJCNLP-AACL 2023 Workshop on Southeast Asian Language Processing (SEALP)*. ACL. Nusa Dua, Bali, Indonesia. November 2023. | ||
- [Developing a Named Entity Recognition Dataset for Tagalog](https://aclanthology.org/2023.sealp-1.2/) | ||
<br>*Southeast Asian Language Processing (SEALP) Workshop @ IJCNLP-AACL '23* | ||
<br> <u>Lester James V. Miranda</u> | ||
<br> [[Code](https://github.com/ljvmiranda921/calamanCy/tree/master/reports/aacl2023/benchmark)] [[Dataset](https://huggingface.co/datasets/ljvmiranda921/tlunified-ner)] [[Video](https://www.youtube.com/watch?v=WAJ8IEIHuiM)] | ||
|
||
- [Universal NER: A Gold-Standard Multilingual Named Entity Recognition Benchmark](https://arxiv.org/abs/2311.09122) <br>Stephen Mayhew, Terra Blevins, Shuheng Liu, Marek Šuppa, Hila Gonen, Joseph Marvin Imperial, Börje F. Karlsson, Peiqin Lin, Nikola Ljubešić, <u>LJ Miranda</u>, Barbara Plank, Arij Riabi, Yuval Pinter. Preprint. *arXiv:2311.09122 [cs.CL]*. November 2023. | ||
- [Universal NER: A Gold-Standard Multilingual Named Entity Recognition Benchmark](https://aclanthology.org/2024.naacl-long.243/) | ||
<br>*NAACL '24, arXiv preprint '23* | |
<br>Stephen Mayhew, Terra Blevins, Shuheng Liu, Marek Šuppa, Hila Gonen, Joseph Marvin Imperial, Börje F. Karlsson, Peiqin Lin, Nikola Ljubešić, <u>LJ Miranda</u>, Barbara Plank, Arij Riabi, Yuval Pinter | ||
<br> [[Dataset](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/GQ8HDL)] [[Website](https://www.universalner.org/)] | ||
|
||
### 2022 | ||
|
||
- [Multi hash embeddings in spaCy](https://arxiv.org/abs/2212.09255) <br> <u>Lester James V. Miranda</u>\*, Ákos Kádár\*, Adriane Boyd, Sofie Van Landeghem, Anders Søgaard, and Matthew Honnibal. Preprint. *arXiv:2212.09255 [cs.CL]*. November 2022. <br> *(∗: equal contributions)* | ||
*My first foray into NLP research is a technical report on spaCy's hash embedding method. I'm lucky to have worked with established researchers in the field.* | |
|
||
- [Multi hash embeddings in spaCy](https://arxiv.org/abs/2212.09255) | ||
<br>*arXiv preprint '22* | |
<br> <u>Lester James V. Miranda</u>\*, Ákos Kádár\*, Adriane Boyd, Sofie Van Landeghem, Anders Søgaard, and Matthew Honnibal *(∗: equal contributions)*. | ||
<br> [[Code](https://github.com/explosion/projects/tree/v3/benchmarks/ner_embeddings)] | ||
|
||
<!-- | ||
|
@@ -53,12 +106,20 @@ profiles for more info. | |
|
||
I used to be a bioinformatics researcher at the [Furuzuki Neurocomputing Systems Laboratory](https://www.waseda.jp/sem-hflab/nclab/index.html), working on nature-inspired algorithms and proteomics. | ||
|
||
- [Feature Extraction using a Mutually-Competitive Autoencoder for Protein Function Prediction](https://ieeexplore.ieee.org/document/8616230). <u>Lester James V. Miranda</u> and Jinglu Hu, _IEEE International Conference on System, Man, and Cybernetics (SMC)_. IEEE. Miyazaki, Japan. October 2018. | ||
- [Feature Extraction using a Mutually-Competitive Autoencoder for Protein Function Prediction](https://ieeexplore.ieee.org/document/8616230). | ||
<br>*IEEE Systems, Man, and Cybernetics (SMC) '18* | ||
<br><u>Lester James V. Miranda</u> and Jinglu Hu | ||
|
||
- [A Deep Learning Approach based on Stacked Denoising Autoencoders for Protein Function Prediction](https://ieeexplore.ieee.org/document/8377699). <u>Lester James V. Miranda</u> and Jinglu Hu, _42nd IEEE Computer Society Signature Conference on Computers, Software, and Applications (COMPSAC)_. IEEE. Tokyo, Japan. July 2018. | ||
- [A Deep Learning Approach based on Stacked Denoising Autoencoder for Protein Function Prediction](https://ieeexplore.ieee.org/document/8377699). | ||
<br>*IEEE Computer, Software, and Applications (COMPSAC) '18* | ||
<br><u>Lester James V. Miranda</u> and Jinglu Hu | ||
|
||
- [PySwarms, a research-toolkit for Particle Swarm Optimization in Python](https://joss.theoj.org/papers/10.21105/joss.00433) <br> <u>Lester James V. Miranda</u>, _Journal of Open Source Software_, vol. 3, no. 433, 2018. | ||
- [PySwarms, a research-toolkit for Particle Swarm Optimization in Python](https://joss.theoj.org/papers/10.21105/joss.00433) | ||
<br>*Journal of Open Source Software (JOSS) '18, vol. 3, no. 433* | |
<br> <u>Lester James V. Miranda</u> | ||
|
||
I was also involved in research early on during my undergrad: | ||
|
||
- [Appliance Recognition using Hall-Effect Sensors and k-Nearest Neighbors for Power Management Systems](https://ieeexplore.ieee.org/document/7847947). <u>Lester James V. Miranda</u>\*, Marian Joice Gutierrez\*, Samuel Matthew Dumlao, and Rosula Reyes, _Proceedings of the 2016 IEEE Region 10 Conference 2016 (TENCON)_. IEEE. Singapore. November 2016. *(∗: equal contributions)* | ||
- [Appliance Recognition using Hall-Effect Sensors and k-Nearest Neighbors for Power Management Systems](https://ieeexplore.ieee.org/document/7847947) | ||
<br>*IEEE Region 10 Conference (TENCON) '16* | ||
<br><u>Lester James V. Miranda</u>\*, Marian Joice Gutierrez\*, Samuel Matthew Dumlao, and Rosula Reyes *(∗: equal contributions)*. |