diff --git a/_layouts/home.html b/_layouts/home.html
index 35c5d904..7060260b 100644
--- a/_layouts/home.html
+++ b/_layouts/home.html
@@ -11,17 +11,17 @@

Hi! I'm Lj Miranda, and welcome to my website!

- I'm a predoctoral
+ I'm currently a predoctoral
  researcher at the AllenNLP
- team at Ai2. Previously, I was a
- machine learning engineer at Explosion working on spaCy.
+ team at Ai2. In the past, I've worked as an engineer, consultant, and
+ researcher, mostly in the field of NLP and AI.

- I'm broadly interested in building equitable language technologies in the presence of severe data and
- compute constraints.
+ I'm broadly interested in data-centric approaches to building language technologies at scale.
- I'm happy to discuss research and collaborate, so feel free to reach
+ I'm happy to discuss research and collaborate, so feel free to reach
  out!

+【[Game Dev](https://ljvmiranda921.itch.io)】
+【[Game Boy Camera Photos](https://ljvmiranda921.github.io/gallery)】
+【[Curriculum Vitae](https://storage.googleapis.com/ljvmiranda/cv.pdf)】
+
-## Background
-
-I'm a [Predoctoral Young
-Investigator](https://allenai.org/predoctoral-young-investigators) at the [Allen
-Institute for Artificial Intelligence (AI2)](https://allenai.org) as part of the
-[AllenNLP team](https://allenai.org/allennlp). Previously, I've worked at the
-following places:
-
-* [Explosion](https://explosion.ai) (Berlin): a natural language processing
- startup where I worked on the
- open-source [spaCy](https://spacy.io) library and the
- [Prodigy](https://prodi.gy) annotation tool. I co-authored our [first technical
- report](https://arxiv.org/abs/2212.09255) and developed several features and
- projects for our software libraries.
-
-* [Thinking Machines Data Science](https://thinkingmachin.es) (Manila): a data
- science consultancy where I built multiple natural language processing
- products for large enterprises. I worked with several of our biggest clients
- in Southeast Asia and led teams in the Document AI space.
-
-* [Preferred Networks](https://www.preferred-networks.jp/en/) (Tokyo): as an
- intern, I implemented a training parallelization framework for
- [ChainerRL](https://github.com/chainer/chainerrl), an open-source
- reinforcement learning library.
-
-I obtained my master's degree from [Waseda
-University](https://www.waseda.jp/top/en) and my bachelor's in Electronics
-Engineering, minor in Philosophy from [Ateneo de Manila
-University](https://www.ateneo.edu). I used to be a bioinformatics researcher
-but moved on to language— text, like proteins, are sequences after all. My
-research interests include **low-resource and multilingual NLP**, **efficient
-NLP**, and **corpus linguistics**.
-
-Lastly, I'm well-involved in open-source and have authored [several
-projects](https://github.com/ljvmiranda921) of my own.
-[Pyswarms](https://github.com/ljvmiranda921/pyswarms) has been quite
-successful; I've seen it being used in [quantum
-physics](https://arxiv.org/abs/1801.07686),
-[chemistry](https://pubs.acs.org/doi/abs/10.1021/acscentsci.8b00307), and
-[teaching](https://www.gousios.gr/courses/algo-ds/optimizations.html), amongst
-[other
-things](https://scholar.google.com/scholar?oi=bibs&hl=en&cites=15267041073198929167).
-I love indie games and [dabble in game
-development](https://ljvmiranda921.itch.io) using [Pico-8](https://www.lexaloffle.com/pico-8.php) and
-[Godot](https://godotengine.org/).
-

Short background
Lj Miranda specializes in natural language processing with over five years of
@@ -75,16 +33,10 @@ maintain notable open-source libraries such as spaCy and Pyswarms. He dabbles in game development during his free time.

-
-
-
 ## Contact
 **Lester James V. Miranda**
-Seattle, Washington, USA
+Seattle, WA
 Email: ljvmiranda [at] gmail [dot] com
 [Curriculum Vitae (PDF)](https://storage.googleapis.com/ljvmiranda/cv.pdf)
diff --git a/notebook/index.md b/notebook/index.md
index 226af8bf..7263a645 100644
--- a/notebook/index.md
+++ b/notebook/index.md
@@ -16,3 +16,20 @@ hope my notebook helps you as much as it has helped me.
 {% endfor %}
+
+
diff --git a/research/index.md b/research/index.md
index f3fbf8ee..65b4f837 100644
--- a/research/index.md
+++ b/research/index.md
@@ -5,38 +5,91 @@ description: Research work of Lester James V. Miranda
 permalink: /research/
 ---
-I'm broadly interested in building equitable language technologies in the
-presence of severe constraints— such as the lack of data or compute.
-My current research interests are **low-resource and multilingual NLP**,
-**efficient NLP**, and **corpus linguistics**.
+
+I'm broadly interested in **data-centric approaches to building language technologies at scale.**
+
+My goal is to develop systematic methodologies for efficiently constructing NLP resources while actively building new datasets and benchmarks to enhance language model training and evaluation.
+More concretely, I'm interested in the following areas:
+- **Efficient approaches to annotation**:
+ Human annotations are costly. How can we reduce this cost while preserving the nuance that human annotators provide? I'm currently exploring this question in the context of human preferences in LLM post-training (RLHF).
+
+- **Resources for multilingual NLP**:
+ No language should be left behind, especially in data.
+ I've worked on several datasets to improve the state of low-resource and multilingual NLP.
+ These projects involve Filipino [datasets](https://aclanthology.org/2023.sealp-1.2/) & [tooling](https://aclanthology.org/2023.nlposs-1.1/), and [large-scale multilingual datasets](https://aclanthology.org/2024.naacl-long.243/).
+
+- **Faithful benchmarks of model capabilities**:
+ How can we design benchmarks that accurately reflect the true capabilities and limitations of LLMs?
+ I've explored this question in the context of [evaluating reward models (RewardBench)](https://arxiv.org/abs/2403.13787), and in assessing multilingual capabilities of LLMs on [Southeast Asian languages](https://arxiv.org/abs/2406.10118).
+
+If you are interested in these types of work, especially in improving the state of Filipino NLP, then do not hesitate to [reach out](mailto:ljvmiranda@gmail.com).
+I'm happy to discuss research and collaborate!
+
+ 
+
+---
+
+ 
+
+## Selected Publications
+
+Below is a list of my publications.
 You can also check my [Google Scholar](https://scholar.google.co.jp/citations?user=2RtnNKEAAAAJ&hl=en) and [Semantic Scholar](https://www.semanticscholar.org/author/Lester-James-V.-Miranda/13614871)
-profiles for more info.
+profiles for more updated information.
 ### 2024
-- [RewardBench: Evaluating Reward Models](https://arxiv.org/abs/2403.13787)
Nathan Lambert, Valentina Pyatkin, Jacob Morrison, LJ Miranda, Bill Yuchen Lin, Khyathi Chandu, Nouha Dziri, Sachin Kumar, Tom Zick, Yejin Choi, Noah A. Smith, and Hannaneh Hajishirzi. Preprint. *arXiv:2403.13787 [cs.LG]*. March 2024.
[[Leaderboard](https://huggingface.co/spaces/allenai/reward-bench)] [[Code](https://github.com/allenai/reward-bench)] [[Blog](https://blog.allenai.org/rewardbench-the-first-benchmark-leaderboard-for-reward-models-used-in-rlhf-1d4d7d04a90b)] +*At AI2, I'm working on various aspects of LM adaptation such as preference data collection and evaluation. I also expanded my work on the multilingual NLP front (SEACrowd, SIGTYP).* + +- [SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages](https://arxiv.org/abs/2406.10118) +
*arXiv preprint '24* +
Holy Lovenia\*, Rahmad Mahendra\*, Salsabil Maulana Akbar\*, Lester James Miranda\*, and 50+ other authors *(∗: major contributor)*. +
[[Catalogue](https://seacrowd.github.io/seacrowd-catalogue)] [[Code](https://github.com/SEACrowd/seacrowd-datahub)] +- [Consent in Crisis: The Rapid Decline of the AI Data Commons](https://www.dataprovenance.org/Consent_in_Crisis.pdf) +
*Preprint '24* +
Shayne Longpre, Robert Mahari, Ariel Lee, Campbell Lund, ..., Lester Miranda, and 40+ other authors. I contributed to the annotation process design for Web Domain services and to the annotation quality review. +
[[Website](https://www.dataprovenance.org/)] [[Collection](https://github.com/Data-Provenance-Initiative/Data-Provenance-Collection)] [[New York Times Feature](https://www.nytimes.com/2024/07/19/technology/ai-data-restrictions.html)] -- [Allen Institute for AI @ SIGTYP 2024 Shared Task on Word Embedding Evaluation for Ancient and Historical Languages](https://aclanthology.org/2024.sigtyp-1.18/). Lester James V. Miranda, *Proceedings of the EACL 6th Workshop on Research in Computational Linguistic Typology and Multilingual NLP*. ACL. St. Julian's, Malta. March 2024.
[[Code](https://github.com/ljvmiranda921/LiBERTus)] [[Video](https://www.youtube.com/watch?v=rjOw_G-Rv9I)] +- [RewardBench: Evaluating Reward Models for Language Modelling](https://arxiv.org/abs/2403.13787) +
*arXiv preprint '24* +
Nathan Lambert, Valentina Pyatkin, Jacob Morrison, LJ Miranda, Bill Yuchen Lin, Khyathi Chandu, Nouha Dziri, Sachin Kumar, Tom Zick, Yejin Choi, Noah A. Smith, and Hannaneh Hajishirzi
[[Leaderboard](https://huggingface.co/spaces/allenai/reward-bench)] [[Code](https://github.com/allenai/reward-bench)] [[Blog](https://blog.allenai.org/rewardbench-the-first-benchmark-leaderboard-for-reward-models-used-in-rlhf-1d4d7d04a90b)] + + +- [Allen Institute for AI @ SIGTYP 2024 Shared Task on Word Embedding Evaluation for Ancient and Historical Languages](https://aclanthology.org/2024.sigtyp-1.18/) +
*Special Interest Group on Typology (SIGTYP) Workshop @ EACL '24* +
Lester James V. Miranda
[[Code](https://github.com/ljvmiranda921/LiBERTus)] [[Video](https://www.youtube.com/watch?v=rjOw_G-Rv9I)] ### 2023 -- [calamanCy: a Tagalog Natural Language Processing Toolkit](https://aclanthology.org/2023.nlposs-1.1/)
Lester James V. Miranda, *Proceedings of the EMNLP 2023 Workshop on NLP Open Source Software (NLP-OSS)*. EMNLP. Singapore, Singapore. December 2023. +*I spent the early parts of 2023 working on low-resource languages and multilinguality, especially Tagalog, my native language. I mostly focused on core NLP tasks: POS tagging, NER, dependency parsing, etc.* + +- [calamanCy: a Tagalog Natural Language Processing Toolkit](https://aclanthology.org/2023.nlposs-1.1/) +
*NLP Open-Source Software (NLP-OSS) Workshop @ EMNLP '23* +
Lester James V. Miranda
[[Code](https://github.com/ljvmiranda921/calamanCy)] [[Poster](/assets/png/calamancy/poster.png)] [[Video](https://youtu.be/2fbzs1KbFTQ?si=_vKEY11Z1Jzuaxeu)] -- [Developing a Named Entity Recognition Dataset for Tagalog](https://aclanthology.org/2023.sealp-1.2/)
Lester James V. Miranda, *Proceedings of the IJCNLP-AACL 2023 Workshop on Southeast Asian Language Processing (SEALP)*. ACL. Nusa Dua, Bali, Indonesia. November 2023. +- [Developing a Named Entity Recognition Dataset for Tagalog](https://aclanthology.org/2023.sealp-1.2/) +
*Southeast Asian Language Processing (SEALP) Workshop @ IJCNLP-AACL '23* +
Lester James V. Miranda
[[Code](https://github.com/ljvmiranda921/calamanCy/tree/master/reports/aacl2023/benchmark)] [[Dataset](https://huggingface.co/datasets/ljvmiranda921/tlunified-ner)] [[Video](https://www.youtube.com/watch?v=WAJ8IEIHuiM)] -- [Universal NER: A Gold-Standard Multilingual Named Entity Recognition Benchmark](https://arxiv.org/abs/2311.09122)
Stephen Mayhew, Terra Blevins, Shuheng Liu, Marek Šuppa, Hila Gonen, Joseph Marvin Imperial, Börje F. Karlsson, Peiqin Lin, Nikola Ljubešić, LJ Miranda, Barbara Plank, Arij Riabi, Yuval Pinter. Preprint. *arXiv:2311.09122 [cs.CL]*. November 2023. +- [Universal NER: A Gold-Standard Multilingual Named Entity Recognition Benchmark](https://aclanthology.org/2024.naacl-long.243/) +
*NAACL '24, arXiv preprint '23* +
Stephen Mayhew, Terra Blevins, Shuheng Liu, Marek Šuppa, Hila Gonen, Joseph Marvin Imperial, Börje F. Karlsson, Peiqin Lin, Nikola Ljubešić, LJ Miranda, Barbara Plank, Arij Riabi, Yuval Pinter
[[Dataset](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/GQ8HDL)] [[Website](https://www.universalner.org/)] ### 2022 -- [Multi hash embeddings in spaCy](https://arxiv.org/abs/2212.09255)
Lester James V. Miranda\*, Ákos Kádár\*, Adriane Boyd, Sofie Van Landeghem, Anders Søgaard, and Matthew Honnibal. Preprint. *arXiv:2212.09255 [cs.CL]*. November 2022.
*(∗: equal contributions)* +*My first foray into NLP research was a technical report on spaCy's hash embedding method. I'm lucky to have worked with established researchers in the field.* + +- [Multi hash embeddings in spaCy](https://arxiv.org/abs/2212.09255) +
*arXiv preprint '22* +
Lester James V. Miranda\*, Ákos Kádár\*, Adriane Boyd, Sofie Van Landeghem, Anders Søgaard, and Matthew Honnibal *(∗: equal contributions)*.
[[Code](https://github.com/explosion/projects/tree/v3/benchmarks/ner_embeddings)]