This repository has been archived by the owner on Dec 14, 2023. It is now read-only.

Add sample ML-based topic modeling support #170

Open: wants to merge 107 commits into master from topic_modelling

Commits (107)
e24f3b7
Create token_pool.py
DonggeLiu Jun 29, 2017
9535b81
added the file created last time
DonggeLiu Jul 3, 2017
934da4b
Merge branch 'master' of github.com:berkmancenter/mediacloud into top…
DonggeLiu Jul 3, 2017
2a8a0f2
1. Two LDA model (with different package, not sure which one is bette…
DonggeLiu Jul 3, 2017
e888805
Merge branch 'topic_modelling' of github.com:berkmancenter/mediacloud…
DonggeLiu Jul 3, 2017
a23aa13
Merge branch 'master' of github.com:berkmancenter/mediacloud into top…
DonggeLiu Jul 3, 2017
bc462ba
General
DonggeLiu Jul 10, 2017
83a31a7
1. Define types for parameters and return values
DonggeLiu Jul 11, 2017
ced8bb4
Merge branch 'master' of github.com:berkmancenter/mediacloud into top…
DonggeLiu Jul 11, 2017
943c696
isolate import gensim to see if it causes failure #3839
DonggeLiu Jul 17, 2017
3db49ee
verifying the reason of errors
DonggeLiu Jul 17, 2017
06d1d37
reformat the output of model_gensim to make it in the same format as …
DonggeLiu Jul 17, 2017
e027dad
1. updated tests according to the changes I made in model_gensim.py
DonggeLiu Jul 17, 2017
336c0d8
added tests for model_lda.py
DonggeLiu Jul 17, 2017
178226b
trying to fix the 'module' object has no attribute 'plugin' problem
DonggeLiu Jul 18, 2017
ebc4715
reference topic_model module with full path
DonggeLiu Jul 18, 2017
39c5e8c
Merge branch 'master' into topic_modelling
pypt Jul 20, 2017
716fe91
added the requirement for sklearn, which supports the NMF algorithm
DonggeLiu Jul 24, 2017
f66ead6
Added msg for each assertion
DonggeLiu Jul 24, 2017
2d6c12d
added msg for each assertion
DonggeLiu Jul 24, 2017
6c50ed2
added model_nmf.py to model topics with the NMF algorithm
DonggeLiu Jul 24, 2017
679fef0
test cases for model_nmf.py
DonggeLiu Jul 24, 2017
3ab2124
Merge branch 'master' of github.com:berkmancenter/mediacloud into top…
DonggeLiu Jul 24, 2017
025dece
Merge branch 'topic_modelling' of github.com:berkmancenter/mediacloud…
DonggeLiu Jul 24, 2017
61517d1
sorted requirements.txt in alphabetical order
DonggeLiu Jul 24, 2017
36817b9
cache WordNet
DonggeLiu Jul 24, 2017
b5562ad
install the WordNet via NLTK
DonggeLiu Jul 24, 2017
e6b126c
relocate test files
DonggeLiu Jul 24, 2017
c93fe63
remove uncessary files after test suits relocation
DonggeLiu Jul 24, 2017
730a4e9
1. removed josn serialization after fetching sentences from database
DonggeLiu Jul 24, 2017
3b38dff
add .close to open file
DonggeLiu Jul 24, 2017
154f96d
add .close() to opened file
DonggeLiu Jul 24, 2017
5ea449a
suppress warning message caused by NLTK built-in method lemmatize()
DonggeLiu Jul 24, 2017
34fdcbc
restore the file (its content was mysteriously deleted)
DonggeLiu Jul 24, 2017
baca56c
removed path_helper.py and related codes
DonggeLiu Jul 24, 2017
fe78de8
add a file containing sample stories (can replace DB in tests)
DonggeLiu Jul 24, 2017
91d725e
1. Change the SQL query to be the same as suggested in previous PR re…
DonggeLiu Jul 24, 2017
0ca1eca
Seperated test cases for three models from db_connection
DonggeLiu Jul 24, 2017
dc0b73b
added explanation for each of the three modules used
DonggeLiu Jul 24, 2017
96f566c
removed redundant textblob in requirements
DonggeLiu Jul 24, 2017
c488c08
separate test_token_pool.py from database
DonggeLiu Jul 24, 2017
6d8555e
remove import path_helper
DonggeLiu Jul 26, 2017
6182c4f
Rearraged NLTK installation to make it system-wide
DonggeLiu Jul 26, 2017
9c68669
Use wget instead of nltk.download() to avoid 405 error
DonggeLiu Jul 26, 2017
0e04ff1
silent wget
DonggeLiu Jul 26, 2017
d995cb8
adding more echos and comments
DonggeLiu Jul 26, 2017
a361b01
turn on -n switch of unzip gh-pages.zip, preventing rewrite existing …
DonggeLiu Jul 27, 2017
db1c584
added COMMAND_PREFIX to use sudo on linux
DonggeLiu Jul 27, 2017
2a88eab
restore missing log4perl.conf
DonggeLiu Jul 27, 2017
b62e71d
Don't --force-reinstall stuff needlessly
pypt Jul 27, 2017
7922d3c
Install only WordNet data from NLTK data
pypt Jul 27, 2017
7ce27cc
Revert "added COMMAND_PREFIX to use sudo on linux"
pypt Jul 27, 2017
29d460c
Revert "turn on -n switch of unzip gh-pages.zip, preventing rewrite e…
pypt Jul 27, 2017
4008366
Revert "adding more echos and comments"
pypt Jul 27, 2017
c1da604
Revert "silent wget"
pypt Jul 27, 2017
7b6beaf
Revert "Use wget instead of nltk.download() to avoid 405 error"
pypt Jul 27, 2017
bf2c962
Install NLTK data from own mirror on S3
pypt Jul 27, 2017
482f01e
Install only WordNet data from NLTK data
pypt Jul 27, 2017
00633aa
Don't --force-reinstall stuff needlessly
pypt Jul 27, 2017
6f09e31
added punkt into nltk dependencies
DonggeLiu Aug 1, 2017
179da05
use sample handler to separate access to sample file from others
DonggeLiu Aug 7, 2017
1cf5601
1. make use of sample_handler.py to access sample file
DonggeLiu Aug 7, 2017
1d3ad5e
Merge branch 'master' of github.com:berkmancenter/mediacloud into top…
DonggeLiu Aug 7, 2017
81d6892
use full path of sample_handler.py
DonggeLiu Aug 7, 2017
8861d9e
Temporarily disable unit tests for Travis to cache dependencies
pypt Aug 8, 2017
c732a50
Revert "cache WordNet"
pypt Aug 8, 2017
65c505b
Revert "Temporarily disable unit tests for Travis to cache dependencies"
pypt Aug 8, 2017
73f7e2e
added a new abstract method for topic model classes to evaluate curre…
DonggeLiu Aug 9, 2017
ef35923
unify the name of models used in each class to self._model as in the …
DonggeLiu Aug 9, 2017
89882cd
implement the evaluation method based on the buit-in method likelihood()
DonggeLiu Aug 9, 2017
73e518c
Merge branch 'master' of github.com:berkmancenter/mediacloud into top…
DonggeLiu Aug 9, 2017
e2d6655
use the sample file instead of DB in Travis
DonggeLiu Aug 9, 2017
5289a85
Merge branch 'topic_modelling' of github.com:berkmancenter/mediacloud…
DonggeLiu Aug 9, 2017
00831af
edit the total number of topics
DonggeLiu Aug 9, 2017
59bcb50
Merge branch 'master' of github.com:berkmancenter/mediacloud into top…
DonggeLiu Aug 12, 2017
2c8e6eb
added tuning steps to find out the optimal topic number
DonggeLiu Aug 12, 2017
d1129a6
a finder that can identify the max/min points of a polynomial compute…
DonggeLiu Aug 13, 2017
4d5b9e4
added two methods tune_*() to find out the optimal number of topics
DonggeLiu Aug 13, 2017
8e77ed4
removed some print()s and rewrote evaluation()
DonggeLiu Aug 14, 2017
809aad7
added more test cases on checking the accuracy of the model via likel…
DonggeLiu Aug 14, 2017
f819366
improved polynomial tuning algorithm
DonggeLiu Aug 19, 2017
9869ca8
no longer test tune_with_iteration as polynomial has a sigificant bet…
DonggeLiu Aug 19, 2017
e185dd0
larger sample for Travis to test against
DonggeLiu Aug 19, 2017
3545e0e
modify tests accroding to change in sample_stories.txt
DonggeLiu Aug 19, 2017
7816ec8
use smaller sample size so that Travis will not fail
DonggeLiu Aug 20, 2017
94ebc24
do not test limit if limit is not specified
DonggeLiu Aug 20, 2017
c1c257e
improved tune with polynomial algorithm
DonggeLiu Aug 20, 2017
6d09265
removed uncessary tune_with_iteration as its advantage/feature has be…
DonggeLiu Aug 20, 2017
2479107
fixed the algorithm of optimal point finder
DonggeLiu Aug 20, 2017
51dd0ec
removed useless codes
DonggeLiu Aug 20, 2017
620afb4
Merge branch 'master' of github.com:berkmancenter/mediacloud into top…
DonggeLiu Aug 20, 2017
5ead4f2
Disable unit tests temporarily for Travis to have a chance to compile…
DonggeLiu Aug 20, 2017
0fb4e4a
Cache WordNet of NLTK
DonggeLiu Aug 20, 2017
87efd01
set test cases back
DonggeLiu Aug 20, 2017
6ea203b
revert the changes made on .travis.yml
DonggeLiu Aug 20, 2017
b675559
added more story samples
DonggeLiu Aug 21, 2017
8753442
new commits from git pull origin master
DonggeLiu Aug 21, 2017
e39415b
removed unnecessary code to keep higher level of accuracy
DonggeLiu Aug 21, 2017
a674d26
changed sample file name
DonggeLiu Aug 21, 2017
6267f72
this sample file has been replaced by 3 files with different size
DonggeLiu Aug 21, 2017
d4e9d48
use a smaller sample to test on Travis due to limit restriction
DonggeLiu Aug 21, 2017
0c3f7ee
1. break large block of codes up to more funcitons
DonggeLiu Aug 21, 2017
4c12748
remove uncessary code
DonggeLiu Aug 21, 2017
720dd7a
restructured tests to reduce running time
DonggeLiu Aug 21, 2017
97afc48
further improvements on the code structure
DonggeLiu Aug 22, 2017
016d01c
remove redudent code
DonggeLiu Aug 22, 2017
9ff15ff
Merge branch 'master' into topic_modelling
pypt Sep 1, 2017
Files changed
28 changes: 26 additions & 2 deletions install/install_python_dependencies.sh
@@ -41,8 +41,29 @@ echo "Installing (upgrading) Supervisor..."
 ( cd /tmp; $COMMAND_PREFIX pip2.7 install --upgrade supervisor )
 
 echo "Installing (upgrading) Virtualenv..."
-$COMMAND_PREFIX pip2.7 install --force-reinstall --upgrade virtualenv
-$COMMAND_PREFIX pip$PYTHON3_MAJOR_VERSION install --force-reinstall --upgrade virtualenv
+$COMMAND_PREFIX pip2.7 install --upgrade virtualenv
+$COMMAND_PREFIX pip$PYTHON3_MAJOR_VERSION install --upgrade virtualenv
 
+# Install system-wide NLTK because otherwise sudo is unable to find
+# NLTK installed in virtualenv on Travis
+
+echo "Installing (upgrading) NLTK to install NLTK's data afterwards..."
+$COMMAND_PREFIX pip$PYTHON3_MAJOR_VERSION install --upgrade nltk
+
+# Installing WordNet with NLTK
+# (installing from own mirror on S3 to avoid hitting GitHub: https://github.com/nltk/nltk/issues/1787)
+echo "Installing NLTK WordNet data..."
+if [ `uname` == 'Darwin' ]; then
+    NLTK_DATA_PATH=/usr/local/share/nltk_data
+else
+    NLTK_DATA_PATH=/usr/share/nltk_data
+fi
+
+$COMMAND_PREFIX python$PYTHON3_MAJOR_VERSION \
+    -m nltk.downloader \
+    -u https://s3.amazonaws.com/mediacloud-nltk-data/nltk_data/index.xml \
+    -d "$NLTK_DATA_PATH" \
+    wordnet punkt
+
 echo "Creating mc-venv virtualenv..."
 echo "$(which python$PYTHON3_MAJOR_VERSION)"
@@ -69,3 +90,6 @@ pip$PYTHON3_MAJOR_VERSION install --upgrade -r mediacloud/requirements.txt || {
 echo "'pip$PYTHON3_MAJOR_VERSION install' failed the first time, retrying..."
 pip$PYTHON3_MAJOR_VERSION install --upgrade -r mediacloud/requirements.txt
 }
+
+
+
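After the installer runs, the downloaded data can be smoke-tested. A minimal sketch, assuming only that nltk plus the wordnet and punkt packages installed above are present system-wide (the sample sentence is made up):

from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# punkt backs word_tokenize(); wordnet backs lemmatize()
lemmatizer = WordNetLemmatizer()
tokens = word_tokenize("Stories are tokenized and lemmatized before modeling.")
print([lemmatizer.lemmatize(token) for token in tokens if token.isalpha()])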
10 changes: 9 additions & 1 deletion lib/MediaWords/Job/AnnotateWithCoreNLP.pm
@@ -31,12 +31,20 @@ use MediaWords::Util::CoreNLP;
 use MediaWords::DBI::Stories;
 use Readonly;
 
+# Having a global database object should be safe because
+# job workers don't fork()
+my $db = undef;
+
 # Run CoreNLP job
 sub run($;$)
 {
     my ( $self, $args ) = @_;
 
-    my $db = MediaWords::DB::connect_to_db();
+    unless ( $db )
+    {
+        # Postpone connecting to the database so that compile test doesn't do that
+        $db = MediaWords::DB::connect_to_db();
+    }
 
     my $stories_id = $args->{ stories_id } + 0;
     unless ( $stories_id )
7 changes: 6 additions & 1 deletion lib/MediaWords/Job/Bitly/FetchStoryStats.pm
@@ -39,12 +39,17 @@ Readonly my $BITLY_RATE_LIMIT_SECONDS_TO_WAIT => 60 * 10; # every 10 minutes
 # How many times to try on rate limiting errors
 Readonly my $BITLY_RATE_LIMIT_TRIES => 7; # try fetching 7 times in total (70 minutes)
 
+# Having a global database object should be safe because
+# job workers don't fork()
+my $db = undef;
+
 # Run job
 sub run($;$)
 {
     my ( $self, $args ) = @_;
 
-    my $db = MediaWords::DB::connect_to_db();
+    # Postpone connecting to the database so that compile test doesn't do that
+    $db ||= MediaWords::DB::connect_to_db();
 
     my $stories_id = $args->{ stories_id } or die "'stories_id' is not set.";
     my $start_timestamp = $args->{ start_timestamp };
7 changes: 6 additions & 1 deletion lib/MediaWords/Job/Facebook/FetchStoryStats.pm
@@ -32,6 +32,10 @@ use MediaWords::Util::Process;
 use Readonly;
 use Data::Dumper;
 
+# Having a global database object should be safe because
+# job workers don't fork()
+my $db = undef;
+
 # Run job
 sub run($;$)
 {
@@ -43,7 +47,8 @@ sub run($;$)
         fatal_error( 'Facebook API processing is not enabled.' );
     }
 
-    my $db = MediaWords::DB::connect_to_db();
+    # Postpone connecting to the database so that compile test doesn't do that
+    $db ||= MediaWords::DB::connect_to_db();
 
     my $stories_id = $args->{ stories_id } or die "'stories_id' is not set.";
 
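The three Perl job modules above apply the same lazy-connection pattern. A compact Python rendering of the idea, illustrative only; connect_to_db() here is a hypothetical stand-in for MediaWords::DB::connect_to_db():

_db = None  # module-level handle; safe because job workers don't fork()


def connect_to_db():
    """Hypothetical stand-in for MediaWords::DB::connect_to_db()."""
    return object()


def run(args: dict) -> None:
    global _db
    if _db is None:
        # Postpone connecting so that a compile/import test never opens a connection
        _db = connect_to_db()
    # ... the job body then uses _db and args as before ...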
5 changes: 3 additions & 2 deletions mediacloud/mediawords/db/handler.py
@@ -239,8 +239,9 @@ def schema_is_up_to_date(self) -> bool:
             raise McSchemaIsUpToDateException("Current schema version is 0")
 
         # Target schema version
-        sql = open(mc_sql_schema_path(), 'r').read()
-        target_schema_version = schema_version_from_lines(sql)
+        sql = open(mc_sql_schema_path(), 'r')
+        target_schema_version = schema_version_from_lines(sql.read())
+        sql.close()
         if not target_schema_version:
             raise McSchemaIsUpToDateException("Invalid target schema version.")
 
5 changes: 3 additions & 2 deletions mediacloud/mediawords/util/config.py
@@ -43,8 +43,9 @@ def __parse_yaml(config_file: str) -> dict:
     if not os.path.isfile(config_file):
         raise McConfigException("Configuration file '%s' was not found." % config_file)
 
-    yaml_file = open(config_file, 'r').read()
-    yaml_data = yaml.load(yaml_file, Loader=Loader)
+    yaml_file = open(config_file, 'r')
+    yaml_data = yaml.load(yaml_file.read(), Loader=Loader)
+    yaml_file.close()
     return yaml_data
 
 
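Both .close() fixes above (handler.py and config.py) could equally be written with a context manager, which closes the file even when read() or the parser raises. An alternative sketch, not what this PR ships; yaml.SafeLoader stands in for the module's own Loader:

import yaml


def parse_yaml(config_file: str) -> dict:
    # the with block guarantees the file is closed on every exit path
    with open(config_file, 'r') as yaml_file:
        return yaml.load(yaml_file.read(), Loader=yaml.SafeLoader)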
Empty file.
101 changes: 101 additions & 0 deletions mediacloud/mediawords/util/topic_modeling/model_gensim.py
@@ -0,0 +1,101 @@
import gensim

# from mediawords.db import connect_to_db
from mediawords.util.topic_modeling.sample_handler import SampleHandler
from mediawords.util.topic_modeling.topic_model import BaseTopicModel
from mediawords.util.topic_modeling.token_pool import TokenPool
from typing import Dict, List


class ModelGensim(BaseTopicModel):
    """Generate the topics of each story with an LDA model.
    ModelGensim operates on a single story at a time,
    comparing the occurrence of each token across all sentences of that story.
    It does not consider the rest of the stories. The benefits of this approach include:
    1. The words in each story's topics all occur in that story
    2. There is a fixed number of topics for each story"""

    def __init__(self) -> None:
        self._story_number = 0
        self._stories_ids = []
        self._stories_tokens = []
        self._dictionary = None
        self._corpus = []
        self._model = None
        self._WORD_SPLITTER = ' + '
        self._COEFFICIENT_SPLITTER = '*'

    def add_stories(self, stories: Dict[int, List[List[str]]]) -> None:
        """
        Add new stories to the model
        :param stories: a dictionary of new stories
        """
        for story_id, story_tokens in stories.items():
            self._stories_ids.append(story_id)
            self._stories_tokens.append(story_tokens)

        self._story_number = len(self._stories_ids)

    def summarize_topic(self, topic_number: int = 1,
                        word_number: int = 4, passes: int = 100) -> Dict[int, list]:
        """
        Summarize the topic of each story based on the frequency of occurrence of each word
        :return: a dictionary mapping each story id to a list of topic_number topics
        (each topic contains word_number words)
        """

        story_topic = {}

        for i in range(len(self._stories_ids)):
            # turn our token documents into an id <-> term dictionary
            self._dictionary = gensim.corpora.Dictionary(self._stories_tokens[i])

            # convert token documents into a document-term matrix
            self._corpus = [self._dictionary.doc2bow(text) for text in self._stories_tokens[i]]

            # generate LDA model
            self._model = gensim.models.ldamodel.LdaModel(
                corpus=self._corpus, num_topics=topic_number,
                id2word=self._dictionary, passes=passes)

            raw_topics = self._model.print_topics(num_topics=topic_number, num_words=word_number)

            story_topic[self._stories_ids[i]] = self._format_topics(raw_topics=raw_topics)

        return story_topic

    def _format_topics(self, raw_topics: List[tuple]) -> List[List[str]]:
        """
        Return topics in the desired format
        :param raw_topics: un-formatted topics
        :return: formatted topics
        """
        formatted_topics = []
        for topic in raw_topics:
            words_str = topic[1]
            # change the format
            # from 'COEFFICIENT1*"WORD1" + COEFFICIENT2*"WORD2" + COEFFICIENT3*"WORD3"'
            # to [WORD1, WORD2, WORD3]
            words = [word_str.split(self._COEFFICIENT_SPLITTER)[1][1:-1]
                     for word_str in words_str.split(self._WORD_SPLITTER)]
            formatted_topics.append(words)

        return formatted_topics

    def evaluate(self):
        pass


# A sample output
if __name__ == '__main__':
    model = ModelGensim()

    # pool = TokenPool(connect_to_db())
    # model.add_stories(pool.output_tokens(1, 0))
    # model.add_stories(pool.output_tokens(5, 1))

    pool = TokenPool(SampleHandler())
    model.add_stories(pool.output_tokens())

    print(model.summarize_topic())
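For reference, the string-to-list conversion that _format_topics() performs, as a standalone sketch; the raw topic string below is a made-up example of gensim's print_topics() output format:

raw_topic = '0.052*"election" + 0.031*"vote" + 0.027*"senate" + 0.024*"campaign"'
words = [term.split('*')[1][1:-1] for term in raw_topic.split(' + ')]
print(words)  # ['election', 'vote', 'senate', 'campaign']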