Commit

Update research page (#373)

ljvmiranda921 committed Jul 22, 2024
1 parent 4d12f9f commit c3dfb65
Showing 4 changed files with 106 additions and 76 deletions.
14 changes: 7 additions & 7 deletions _layouts/home.html
@@ -11,17 +11,17 @@
<div class="right">
<p>Hi! I'm Lj Miranda, and welcome to my website!</p>
<p>
I'm a <a href="https://allenai.org/predoctoral-young-investigators">predoctoral
I'm currently a <a href="https://allenai.org/predoctoral-young-investigators">predoctoral
researcher</a> at the <a href="https://allenai.org/allennlp">AllenNLP
team</a> at <a href="https://allenai.org/">Ai2</a>. Previously, I was a
machine learning engineer at <a href="https://explosion.ai">Explosion</a> working on <a
href="https://spacy.io">spaCy</a>.
team</a> at <a href="https://allenai.org/">Ai2</a>. In the past, I've worked as an <a
href="https://storage.googleapis.com/ljvmiranda/cv.pdf">engineer, consultant, and
researcher</a>, mostly in the field of NLP and AI.
</p>
<p>
I'm broadly interested in building equitable language technologies in the presence of severe data and
compute constraints.
I'm broadly interested in data-centric approaches to building language technologies at scale.
<!-- My research interests include efficient NLP, low-resource languages, and corpus linguistics. -->
I'm happy to discuss research and collaborate, so feel free to <a href="mailto:[email protected]">reach
I'm happy to discuss <a href="/research">research</a> and collaborate, so feel free to <a
href="mailto:[email protected]">reach
out</a>!
</p>
<!-- <p>
60 changes: 6 additions & 54 deletions about/index.md
@@ -18,55 +18,13 @@ Here, you'll find some of my **thoughts, works, and notes** on software
development, machine learning, and research. I hope you'll spend a nice time
here, so go grab yourself a coffee and feel free to look around!

Other links: [[Game Dev](https://ljvmiranda921.itch.io)] [[Game Boy Camera Photos](https://ljvmiranda921.github.io/gallery)]
<!-- 【】 -->
[Game Dev](https://ljvmiranda921.itch.io)
[Game Boy Camera Photos](https://ljvmiranda921.github.io/gallery)
[Curriculum Vitae](https://storage.googleapis.com/ljvmiranda/cv.pdf)
<!-- Other links: [] [[Game Boy Camera Photos](https://ljvmiranda921.github.io/gallery)] -->


## Background

I'm a [Predoctoral Young
Investigator](https://allenai.org/predoctoral-young-investigators) at the [Allen
Institute for Artificial Intelligence (AI2)](https://allenai.org) as part of the
[AllenNLP team](https://allenai.org/allennlp). Previously, I've worked at the
following places:

* [Explosion](https://explosion.ai) (Berlin): a natural language processing
startup where I worked on the
open-source [spaCy](https://spacy.io) library and the
[Prodigy](https://prodi.gy) annotation tool. I co-authored our [first technical
report](https://arxiv.org/abs/2212.09255) and developed several features and
projects for our software libraries.

* [Thinking Machines Data Science](https://thinkingmachin.es) (Manila): a data
science consultancy where I built multiple natural language processing
products for large enterprises. I worked with several of our biggest clients
in Southeast Asia and led teams in the Document AI space.

* [Preferred Networks](https://www.preferred-networks.jp/en/) (Tokyo): as an
intern, I implemented a training parallelization framework for
[ChainerRL](https://github.com/chainer/chainerrl), an open-source
reinforcement learning library.

I obtained my master's degree from [Waseda
University](https://www.waseda.jp/top/en) and my bachelor's in Electronics
Engineering, minor in Philosophy from [Ateneo de Manila
University](https://www.ateneo.edu). I used to be a bioinformatics researcher
but moved on to language&mdash;text, like proteins, is a sequence after all. My
research interests include **low-resource and multilingual NLP**, **efficient
NLP**, and **corpus linguistics**.

Lastly, I'm heavily involved in open source and have authored [several
projects](https://github.com/ljvmiranda921) of my own.
[Pyswarms](https://github.com/ljvmiranda921/pyswarms) has been quite
successful; I've seen it used in [quantum
physics](https://arxiv.org/abs/1801.07686),
[chemistry](https://pubs.acs.org/doi/abs/10.1021/acscentsci.8b00307), and
[teaching](https://www.gousios.gr/courses/algo-ds/optimizations.html), amongst
[other
things](https://scholar.google.com/scholar?oi=bibs&hl=en&cites=15267041073198929167).
I love indie games and [dabble in game
development](https://ljvmiranda921.itch.io) using [Pico-8](https://www.lexaloffle.com/pico-8.php) and
[Godot](https://godotengine.org/).

<p style="border:3px; border-style:solid; border-color:#a00000; padding: 1em;">
<b>Short background</b><br>
Lj Miranda specializes in natural language processing with over five years of
@@ -75,16 +33,10 @@ maintain notable open-source libraries such as spaCy and Pyswarms. He dabbles
in game development during his free time.
</p>


<!--
![](/about/aws_community_builder.png){:width="100px"}
[![](/about/google_data_engineer.png){:width="100px"}](https://www.credential.net/d17f92a5-a21e-41d5-acb0-81d76e3f3e68)
-->

## Contact

**Lester James V. Miranda**
Seattle, Washington, USA
Seattle, WA
Email: ljvmiranda [at] gmail [dot] com
[Curriculum Vitae (PDF)](https://storage.googleapis.com/ljvmiranda/cv.pdf)

17 changes: 17 additions & 0 deletions notebook/index.md
@@ -16,3 +16,20 @@ hope my notebook helps you as much as it has helped me.
</li>
{% endfor %}
</ul>

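<!-- Note: the commented-out Liquid block below sketches an alternative archive layout that groups notebook posts by year via group_by_exp and renders an <h2> per year; presumably kept disabled for possible later use. -->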
<!-- {% assign posts_by_year = site.categories.notebook | group_by_exp:"post", "post.date | date: '%Y'" %}
{% for year in posts_by_year %}
<h2>{{ year.name }}</h2>
<ul>
{% for post in year.items %}
<li>
{{ post.date | date_to_string | split: " " | slice: 0, 2 | join: " " }} »
{% if post.highlight %}&starf; {% endif %}
<a href="{{ post.url }}" title="{{ post.title }}">
{{ post.title | truncate: 72 }}
</a>
</li>
{% endfor %}
</ul>
{% endfor %} -->
91 changes: 76 additions & 15 deletions research/index.md
@@ -5,38 +5,91 @@ description: Research work of Lester James V. Miranda
permalink: /research/
---

I'm broadly interested in building equitable language technologies in the
presence of severe constraints&mdash; such as the lack of data or compute.
My current research interests are **low-resource and multilingual NLP**,
**efficient NLP**, and **corpus linguistics**.
<!-- I am interested in **how we can use data-centric techniques to improve the construction of datasets** for training and evaluating large language models. -->
I'm broadly interested in **data-centric approaches to building language technologies at scale.**
<!-- Focusing on data instead of models is crucial, as we face diminishing returns from model scaling and growing concerns about model reliability and fairness. -->
My goal is to <u>develop systematic methodologies for efficiently constructing NLP resources</u> while actively <u>building new datasets and benchmarks</u> to enhance language model training and evaluation.
More concretely, I'm interested in the following areas:

- **Efficient approaches to annotation**:
Human annotations are costly. How can we reduce this cost while preserving the nuance that human annotators provide? I'm currently exploring this question in the context of human preferences in LLM post-training (RLHF).

- **Resources for multilingual NLP**:
No language should be left behind, especially in data.
I've worked on several datasets to improve the state of low-resource and multilingual NLP.
These projects involve Filipino [datasets](https://aclanthology.org/2023.sealp-1.2/) & [tooling](https://aclanthology.org/2023.nlposs-1.1/), and [large-scale multilingual datasets](https://aclanthology.org/2024.naacl-long.243/).

- **Faithful benchmarks of model capabilities**:
How can we design benchmarks that accurately reflect the true capabilities and limitations of LLMs?
I've explored this question in the context of [evaluating reward models (RewardBench)](https://arxiv.org/abs/2403.13787), and in assessing multilingual capabilities of LLMs on [Southeast Asian languages](https://arxiv.org/abs/2406.10118).

If you are interested in these types of work, especially in improving the state of Filipino NLP, then do not hesitate to [reach out](mailto:[email protected]).
I'm happy to discuss research and collaborate!

&nbsp;

---

&nbsp;

## Selected Publications

Below is a list of my publications.
You can also check my [Google
Scholar](https://scholar.google.co.jp/citations?user=2RtnNKEAAAAJ&hl=en) and
[Semantic
Scholar](https://www.semanticscholar.org/author/Lester-James-V.-Miranda/13614871)
profiles for more info.
profiles for more up-to-date information.

### 2024

- [RewardBench: Evaluating Reward Models](https://arxiv.org/abs/2403.13787) <br> Nathan Lambert, Valentina Pyatkin, Jacob Morrison, <u>LJ Miranda</u>, Bill Yuchen Lin, Khyathi Chandu, Nouha Dziri, Sachin Kumar, Tom Zick, Yejin Choi, Noah A. Smith, and Hannaneh Hajishirzi. Preprint. *arXiv:2403.13787 [cs.LG]*. March 2024. <br> [[Leaderboard](https://huggingface.co/spaces/allenai/reward-bench)] [[Code](https://github.com/allenai/reward-bench)] [[Blog](https://blog.allenai.org/rewardbench-the-first-benchmark-leaderboard-for-reward-models-used-in-rlhf-1d4d7d04a90b)]
*At AI2, I'm working on various aspects of LM adaptation, such as preference data collection and evaluation. I've also expanded my work on the multilingual NLP front (SEACrowd, SIGTYP).*

- [SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages](https://arxiv.org/abs/2406.10118)
<br>*arXiv preprint '24*
<br>Holy Lovenia\*, Rahmad Mahendra\*, Salsabil Maulana Akbar\*, <u>Lester James Miranda</u>\*, and 50+ other authors *(&lowast;: major contributor)*.
<br>[[Catalogue](https://seacrowd.github.io/seacrowd-catalogue)] [[Code](https://github.com/SEACrowd/seacrowd-datahub)]

- [Consent in Crisis: The Rapid Decline of the AI Data Commons](https://www.dataprovenance.org/Consent_in_Crisis.pdf)
<br>*Preprint '24*
<br>Shayne Longpre, Robert Mahari, Ariel Lee, Campbell Lund, ..., <u>Lester Miranda</u>, and 40+ other authors. I contributed to the annotation process design for web domain services and to annotation quality review.
<br>[[Website](https://www.dataprovenance.org/)] [[Collection](https://github.com/Data-Provenance-Initiative/Data-Provenance-Collection)] [[New York Times Feature](https://www.nytimes.com/2024/07/19/technology/ai-data-restrictions.html)]

- [Allen Institute for AI @ SIGTYP 2024 Shared Task on Word Embedding Evaluation for Ancient and Historical Languages](https://aclanthology.org/2024.sigtyp-1.18/). <u>Lester James V. Miranda</u>, *Proceedings of the EACL 6th Workshop on Research in Computational Linguistic Typology and Multilingual NLP*. ACL. St. Julian's, Malta. March 2024. <br> [[Code](https://github.com/ljvmiranda921/LiBERTus)] [[Video](https://www.youtube.com/watch?v=rjOw_G-Rv9I)]
- [RewardBench: Evaluating Reward Models for Language Modeling](https://arxiv.org/abs/2403.13787)
<br>*arXiv preprint '24*
<br> Nathan Lambert, Valentina Pyatkin, Jacob Morrison, <u>LJ Miranda</u>, Bill Yuchen Lin, Khyathi Chandu, Nouha Dziri, Sachin Kumar, Tom Zick, Yejin Choi, Noah A. Smith, and Hannaneh Hajishirzi <br> [[Leaderboard](https://huggingface.co/spaces/allenai/reward-bench)] [[Code](https://github.com/allenai/reward-bench)] [[Blog](https://blog.allenai.org/rewardbench-the-first-benchmark-leaderboard-for-reward-models-used-in-rlhf-1d4d7d04a90b)]


- [Allen Institute for AI @ SIGTYP 2024 Shared Task on Word Embedding Evaluation for Ancient and Historical Languages](https://aclanthology.org/2024.sigtyp-1.18/)
<br>*Special Interest Group on Typology (SIGTYP) Workshop @ EACL '24*
<br><u>Lester James V. Miranda</u> <br> [[Code](https://github.com/ljvmiranda921/LiBERTus)] [[Video](https://www.youtube.com/watch?v=rjOw_G-Rv9I)]

### 2023

- [calamanCy: a Tagalog Natural Language Processing Toolkit](https://aclanthology.org/2023.nlposs-1.1/) <br> <u>Lester James V. Miranda</u>, *Proceedings of the EMNLP 2023 Workshop on NLP Open Source Software (NLP-OSS)*. EMNLP. Singapore, Singapore. December 2023.
*I spent the early part of 2023 working on low-resource languages and multilinguality, especially Tagalog, my native language. I mostly focused on core NLP tasks: POS tagging, NER, dependency parsing, etc.*

- [calamanCy: a Tagalog Natural Language Processing Toolkit](https://aclanthology.org/2023.nlposs-1.1/)
<br>*NLP Open-Source Software (NLP-OSS) Workshop @ EMNLP '23*
<br> <u>Lester James V. Miranda</u>
<br> [[Code](https://github.com/ljvmiranda921/calamanCy)] [[Poster](/assets/png/calamancy/poster.png)] [[Video](https://youtu.be/2fbzs1KbFTQ?si=_vKEY11Z1Jzuaxeu)]

- [Developing a Named Entity Recognition Dataset for Tagalog](https://aclanthology.org/2023.sealp-1.2/) <br> <u>Lester James V. Miranda</u>, *Proceedings of the IJCNLP-AACL 2023 Workshop on Southeast Asian Language Processing (SEALP)*. ACL. Nusa Dua, Bali, Indonesia. November 2023.
- [Developing a Named Entity Recognition Dataset for Tagalog](https://aclanthology.org/2023.sealp-1.2/)
<br>*Southeast Asian Language Processing (SEALP) Workshop @ IJCNLP-AACL '23*
<br> <u>Lester James V. Miranda</u>
<br> [[Code](https://github.com/ljvmiranda921/calamanCy/tree/master/reports/aacl2023/benchmark)] [[Dataset](https://huggingface.co/datasets/ljvmiranda921/tlunified-ner)] [[Video](https://www.youtube.com/watch?v=WAJ8IEIHuiM)]

- [Universal NER: A Gold-Standard Multilingual Named Entity Recognition Benchmark](https://arxiv.org/abs/2311.09122) <br>Stephen Mayhew, Terra Blevins, Shuheng Liu, Marek &Scaron;uppa, Hila Gonen, Joseph Marvin Imperial, B&ouml;rje F. Karlsson, Peiqin Lin, Nikola Ljube&scaron;ic&#769;, <u>LJ Miranda</u>, Barbara Plank, Arij Riabi, Yuval Pinter. Preprint. *arXiv:2311.09122 [cs.CL]*. November 2023.
- [Universal NER: A Gold-Standard Multilingual Named Entity Recognition Benchmark](https://aclanthology.org/2024.naacl-long.243/)
<br>*NAACL '24, arXiv preprint '23*
<br>Stephen Mayhew, Terra Blevins, Shuheng Liu, Marek &Scaron;uppa, Hila Gonen, Joseph Marvin Imperial, B&ouml;rje F. Karlsson, Peiqin Lin, Nikola Ljube&scaron;ic&#769;, <u>LJ Miranda</u>, Barbara Plank, Arij Riabi, Yuval Pinter
<br> [[Dataset](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/GQ8HDL)] [[Website](https://www.universalner.org/)]

### 2022

- [Multi hash embeddings in spaCy](https://arxiv.org/abs/2212.09255) <br> <u>Lester James V. Miranda</u>\*, &Aacute;kos K&aacute;d&aacute;r\*, Adriane Boyd, Sofie Van Landeghem, Anders S&oslash;gaard, and Matthew Honnibal. Preprint. *arXiv:2212.09255 [cs.CL]*. November 2022. <br> *(&lowast;: equal contributions)*
*My first foray into NLP research was a technical report on spaCy's hash embedding method. I was lucky to have worked with established researchers in the field.*

- [Multi hash embeddings in spaCy](https://arxiv.org/abs/2212.09255)
<br>*arXiv preprint '22*
<br> <u>Lester James V. Miranda</u>\*, &Aacute;kos K&aacute;d&aacute;r\*, Adriane Boyd, Sofie Van Landeghem, Anders S&oslash;gaard, and Matthew Honnibal *(&lowast;: equal contributions)*.
<br> [[Code](https://github.com/explosion/projects/tree/v3/benchmarks/ner_embeddings)]

<!--
@@ -53,12 +106,20 @@ profiles for more info.

I used to be a bioinformatics researcher at the [Furuzuki Neurocomputing Systems Laboratory](https://www.waseda.jp/sem-hflab/nclab/index.html), working on nature-inspired algorithms and proteomics.

- [Feature Extraction using a Mutually-Competitive Autoencoder for Protein Function Prediction](https://ieeexplore.ieee.org/document/8616230). <u>Lester James V. Miranda</u> and Jinglu Hu, _IEEE International Conference on System, Man, and Cybernetics (SMC)_. IEEE. Miyazaki, Japan. October 2018.
- [Feature Extraction using a Mutually-Competitive Autoencoder for Protein Function Prediction](https://ieeexplore.ieee.org/document/8616230).
<br>*IEEE Systems, Man, and Cybernetics (SMC) '18*
<br><u>Lester James V. Miranda</u> and Jinglu Hu

- [A Deep Learning Approach based on Stacked Denoising Autoencoders for Protein Function Prediction](https://ieeexplore.ieee.org/document/8377699). <u>Lester James V. Miranda</u> and Jinglu Hu, _42nd IEEE Computer Society Signature Conference on Computers, Software, and Applications (COMPSAC)_. IEEE. Tokyo, Japan. July 2018.
- [A Deep Learning Approach based on Stacked Denoising Autoencoder for Protein Function Prediction](https://ieeexplore.ieee.org/document/8377699).
<br>*IEEE Computers, Software, and Applications (COMPSAC) '18*
<br><u>Lester James V. Miranda</u> and Jinglu Hu

- [PySwarms, a research-toolkit for Particle Swarm Optimization in Python](https://joss.theoj.org/papers/10.21105/joss.00433) <br> <u>Lester James V. Miranda</u>, _Journal of Open Source Software_, vol. 3, no. 433, 2018.
- [PySwarms, a research-toolkit for Particle Swarm Optimization in Python](https://joss.theoj.org/papers/10.21105/joss.00433)
<br>*Journal of Open Source Software (JOSS) '18, vol. 3, no. 433*
<br> <u>Lester James V. Miranda</u>

I was also involved in research early on during my undergrad:

- [Appliance Recognition using Hall-Effect Sensors and k-Nearest Neighbors for Power Management Systems](https://ieeexplore.ieee.org/document/7847947). <u>Lester James V. Miranda</u>\*, Marian Joice Gutierrez\*, Samuel Matthew Dumlao, and Rosula Reyes, _Proceedings of the 2016 IEEE Region 10 Conference 2016 (TENCON)_. IEEE. Singapore. November 2016. *(&lowast;: equal contributions)*
- [Appliance Recognition using Hall-Effect Sensors and k-Nearest Neighbors for Power Management Systems](https://ieeexplore.ieee.org/document/7847947)
<br>*IEEE Region 10 Conference (TENCON) '16*
<br><u>Lester James V. Miranda</u>\*, Marian Joice Gutierrez\*, Samuel Matthew Dumlao, and Rosula Reyes *(&lowast;: equal contributions)*.
