This project is related to the course "Subjects in Digital Humanities". In this project I wanted to learn how Israeli politicians are portrayed in news apps.
In order to do that, we'll need to understand the context of the articles about them.
We'll capture that context by extracting adjectives from the articles.
https://github.com/OnlpLab/yap
http://www.tsarfaty.com/pdfs/acl13.pdf
cd <your project dir>
git clone <git url>
cd DigitalHumanProject
python3.7 -m venv venv3
source venv3/bin/activate
pip3 install -r requirements.txt
export api_key=<YOUR_API_KEY>
(example: export api_key=40d61a5ed053486f8b3ef093551f4d40)
deactivate
Our target is to extract the adjectives that appear in articles about each politician.
For that, we'll apply 4 steps:
Prepare your research:
In this project I focused on Israeli politicians,
but it can be applied to any context you wish to research.
Get content into JSON files with the following structure:
{"status": "ok",
 "totalResults": 16,
 "articles": [
   {"source": {"id": "ynet", "name": "Ynet"},
    "author": "...",
    "description": "...",
    "url": "...",
    "publishedAt": "2020-01-01T20:00:00Z",
    "content": "..."},
   ...
 ]
}
- The files will be saved in content/json/name/
- In our project the files contain articles mentioning politicians' names, but this step can be applied to any name you want
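As a rough sketch of how a fetched response could be stored under content/json/name/ (the helper name `save_articles` and the date-based file naming are illustrative assumptions, not the project's actual code):

```python
import json
from pathlib import Path


def save_articles(name, response, base_dir="content/json"):
    """Save one NewsAPI-style response dict under content/json/<name>/.

    `name` is the search term (e.g. a politician's name). The file is
    named after the first article's publication date; this naming scheme
    is an assumption for illustration only.
    """
    out_dir = Path(base_dir) / name
    out_dir.mkdir(parents=True, exist_ok=True)
    articles = response.get("articles", [])
    stamp = articles[0]["publishedAt"][:10] if articles else "empty"
    out_path = out_dir / f"{stamp}.json"
    # ensure_ascii=False keeps Hebrew text readable in the saved file.
    out_path.write_text(json.dumps(response, ensure_ascii=False, indent=2),
                        encoding="utf-8")
    return out_path
```
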
Extracting tokens from the content.
Assuming the content is laid out as described above,
this phase will read the content and parse it into tokens.
An example of a token file:
חבר
הכנסת
מהליכוד
טען
באולפן
ynet
כי
למרות
שראש
הממשלה
נתניהו
יישב
על
ספסל
הנאשמים,
"הניסיון
שלו
כל
כך
משמעותי,
שאדם
עם
אפס
צרות
אחרות
לא
מסוגל
להיכנס
לנעליו".
הוא
גיבה
את
בנט
למרות
המתקפות
נגדו:
"שר
ביטחון
טוב".
על
גדעון
סער:
"נתניהו
מבין
היטב
את
מקומו
בהנהג…
.
- The files will be saved in tokens/name/
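A minimal sketch of this phase, assuming the JSON layout above and simple whitespace tokenization (the function names are hypothetical; the project's tokenizer may split text differently):

```python
import json
from pathlib import Path


def extract_tokens(article_json_path):
    """Read one saved NewsAPI JSON file and return the tokens of every
    article's "content" field (whitespace splitting as a stand-in for a
    real Hebrew tokenizer)."""
    data = json.loads(Path(article_json_path).read_text(encoding="utf-8"))
    tokens = []
    for article in data.get("articles", []):
        content = article.get("content") or ""
        tokens.extend(content.split())
    return tokens


def write_token_file(tokens, out_path):
    """Write tokens one per line, matching the token-file format above."""
    Path(out_path).write_text("\n".join(tokens) + "\n", encoding="utf-8")
```
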
Applying YAP utils to our tokens
In this step, we'll apply YAP utils to extract part-of-speech tags for the Hebrew tokens.
- The files will be saved in finalresults/name/
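Once YAP has tagged the tokens, the adjectives can be filtered out of its output. The sketch below assumes YAP's tab-separated CoNLL-style output, where the word form is in the second column and the coarse POS tag in the fourth, with Hebrew adjectives tagged "JJ"; check your YAP version's output format, as the column positions and tag set are assumptions here:

```python
def extract_adjectives(conll_output):
    """Filter adjectives out of YAP's CoNLL-style output.

    Assumes tab-separated lines where column 2 (index 1) is the word
    form and column 4 (index 3) is the coarse POS tag, and that Hebrew
    adjectives carry the tag "JJ". Adjust if your YAP version differs.
    """
    adjectives = []
    for line in conll_output.splitlines():
        fields = line.split("\t")
        if len(fields) > 3 and fields[3] == "JJ":
            adjectives.append(fields[1])
    return adjectives
```
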
make api-data
make extract-tokens
make apply-yap
make delete-results
make restart
make start
from newsapi import NewsApiClient

newsapi = NewsApiClient(api_key='<YOUR_API_KEY>')

# Note: NewsAPI does not allow mixing `sources` with `country`/`category`,
# so those parameters are omitted here.
top_headlines = newsapi.get_top_headlines(q='bitcoin',
                                          sources='bbc-news,the-verge',
                                          language='en')

all_articles = newsapi.get_everything(q='bitcoin',
                                      sources='bbc-news,the-verge',
                                      domains='bbc.co.uk,techcrunch.com',
                                      from_param='2017-12-01',
                                      to='2017-12-12',
                                      language='en',
                                      sort_by='relevancy',
                                      page=2)
JSON
Each item:
{ "source": {"id": "ynet", "name": "Ynet"},
  "author": "...",
  "description": "...",
  "url": "...",
  "urlToImage": "...",
  "publishedAt": "2020-01-25T16:51:00Z",
  "content": "..." }
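To show how one such item would be read in code (the values below are placeholders, not real data):

```python
import json

# A minimal article item in the structure shown above.
raw = '''{ "source": {"id": "ynet", "name": "Ynet"},
           "author": "...",
           "description": "...",
           "url": "...",
           "urlToImage": "...",
           "publishedAt": "2020-01-25T16:51:00Z",
           "content": "..." }'''

item = json.loads(raw)
# The fields used downstream are the source name, the date, and the content.
source_name = item["source"]["name"]
published_day = item["publishedAt"][:10]
content = item["content"]
```
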