topic_allocation; visualisation not found #5

Open
xiaowei-xw opened this issue Jan 30, 2020 · 11 comments

@xiaowei-xw

Show the top 5 words by cluster, it helps to make the topic_dict below

top_words(mgp.cluster_word_distribution, top_index, 5)

topic_allocation not found

@ilya-palachev

ilya-palachev commented Jun 11, 2020

It seems that visualization can be done (after the fitting is done) as follows:

import pyLDAvis
from tqdm import tqdm
vocabulary = list(vocab)
doc_topic_dists = [mgp.score(doc) for doc in tqdm(docs)]
doc_lengths = [len(doc) for doc in tqdm(docs)]
term_counts_map = {}
for doc in tqdm(docs):
    for term in doc:
        term_counts_map[term] = term_counts_map.get(term, 0) + 1
term_counts = [term_counts_map[term] for term in tqdm(vocabulary)]

matrix = []
for cluster in mgp.cluster_word_distribution:
    total = sum([occurance for word, occurance in cluster.items()])
    row = [cluster.get(term, 0) / total for term in vocabulary]
    matrix.append(row)

vis_data = pyLDAvis.prepare(matrix, doc_topic_dists, doc_lengths, vocabulary, term_counts)
pyLDAvis.enable_notebook()
pyLDAvis.display(vis_data)

@Felipehonorato1

Hey, I just tried to use this code to plot my pyLDAvis graph, but I got this error:

100%|██████████| 10000/10000 [00:10<00:00, 968.73it/s]
100%|██████████| 10000/10000 [00:00<00:00, 1697274.20it/s]
100%|██████████| 10000/10000 [00:00<00:00, 508566.92it/s]
100%|██████████| 5849/5849 [00:00<00:00, 1455588.23it/s]
---------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)
<ipython-input-14-d15fc6d7c170> in <module>()
     11 for cluster in mgp.cluster_word_distribution:
     12     total = sum([occurance for word, occurance in cluster.items()])
---> 13     row = [cluster.get(term, 0) / total for term in vocabulary]
     14     matrix.append(row)
     15 

<ipython-input-14-d15fc6d7c170> in <listcomp>(.0)
     11 for cluster in mgp.cluster_word_distribution:
     12     total = sum([occurance for word, occurance in cluster.items()])
---> 13     row = [cluster.get(term, 0) / total for term in vocabulary]
     14     matrix.append(row)
     15 

ZeroDivisionError: division by zero

It appears to occur whenever the number of topics drops. Do you know of a solution for that?

@ilya-palachev

Yes, sure. It happens when some topic becomes empty: the word counts of that cluster sum to zero, so total is 0 and the division fails. I have a workaround for that. There were also some other issues, so my final code looks as follows:

import pandas as pd
import pyLDAvis
import math

def prepare_data(mgp):
    vocabulary = list(vocab)
    doc_topic_dists = [mgp.score(doc) for doc in docs]
    for doc in doc_topic_dists:
        for f in doc:
            assert not isinstance(f, complex)

    doc_lengths = [len(doc) for doc in docs]
    term_counts_map = {}
    for doc in docs:
        for term in doc:
            term_counts_map[term] = term_counts_map.get(term, 0) + 1
    term_counts = [term_counts_map[term] for term in vocabulary]
    doc_topic_dists2 = [[v if not math.isnan(v) else 1/K for v in d] for d in doc_topic_dists]
    doc_topic_dists2 = [d if sum(d) > 0 else [1/K]*K for d in doc_topic_dists2]
    for doc in doc_topic_dists2:
        for f in doc:
            assert not isinstance(f, complex)
    
    assert (pd.DataFrame(doc_topic_dists2).sum(axis=1) < 0.999).sum() == 0
    matrix = []
    for cluster in mgp.cluster_word_distribution:
        total = sum(cluster.values())
        assert not math.isnan(total)
        # assert total > 0
        if total == 0:
            row = [(1 / len(vocabulary))] * len(vocabulary)   # <--- The discussed workaround is here
        else:
            row = [cluster.get(term, 0) / total for term in vocabulary]
        for f in row:
            assert not isinstance(f, complex)
        matrix.append(row)
    return matrix, doc_topic_dists2, doc_lengths, vocabulary, term_counts

def prepare_visualization_data(mgp):
    vis_data = pyLDAvis.prepare(*prepare_data(mgp), sort_topics=False)
    with open(f"gsdmm-pyldavis-{K}-{alpha}-{beta}-{n_iters}-{now}.html", "w") as f:
        pyLDAvis.save_html(vis_data, f)
    return vis_data

vis_data = prepare_visualization_data(mgp)

%matplotlib inline
pyLDAvis.enable_notebook()
pyLDAvis.display(vis_data)
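
The empty-topic workaround above can be isolated into a small helper. This is only a minimal sketch, assuming each cluster is a plain dict of word counts (as mgp.cluster_word_distribution appears to be in the code above). The name topic_term_row is hypothetical and is not part of gsdmm or pyLDAvis:

```python
def topic_term_row(cluster, vocabulary):
    """Turn one cluster's word counts into a probability row over `vocabulary`.

    `cluster` is assumed to be a dict mapping word -> count.
    An empty cluster (total count 0) gets a uniform row instead of
    raising ZeroDivisionError.
    """
    total = sum(cluster.values())
    if total == 0:
        # Uniform fallback: every term gets the same probability.
        return [1 / len(vocabulary)] * len(vocabulary)
    return [cluster.get(term, 0) / total for term in vocabulary]


vocabulary = ["cat", "dog", "fish"]
clusters = [{"cat": 3, "dog": 1}, {}]  # the second cluster is empty
matrix = [topic_term_row(c, vocabulary) for c in clusters]
```

The uniform row carries no information about the empty topic; it only keeps the matrix well-formed so that the remaining topics can still be plotted.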

@Felipehonorato1

Really appreciate your help dude. Nice job

@ilya-palachev

It would be great if you make a fork with this method implemented.

@Felipehonorato1

Felipehonorato1 commented Jul 16, 2020

How can I do that?
Also, what should 'now' be?

def prepare_visualization_data(mgp):
    vis_data = pyLDAvis.prepare(*prepare_data(mgp), sort_topics=False)
    with open(f"gsdmm-pyldavis-{K}-{alpha}-{beta}-{n_iters}-{now}.html", "w") as f:
        pyLDAvis.save_html(vis_data, f)
    return vis_data

@ilya-palachev

ilya-palachev commented Jul 16, 2020 via email

@ilya-palachev

It is only used in the name of the saved HTML file. now can be any string constant, or you can choose a file name without this suffix; I use

from datetime import datetime
now = str(datetime.now()).replace(' ', '_')

so that every version is saved to a different file.
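
One caveat with the snippet above: str(datetime.now()) contains colons, which are not allowed in file names on Windows. A minimal sketch of an alternative that formats the timestamp explicitly with strftime; the shortened file name here is only an illustration:

```python
from datetime import datetime

# strftime gives explicit control over which characters end up in the suffix.
now = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")

# Hypothetical, shortened file name; the original also embeds K, alpha, etc.
file_name = f"gsdmm-pyldavis-{now}.html"
```

This drops the microseconds, so two files saved within the same second would get the same name.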

@Felipehonorato1

Got it. Now I get an error about a complex number, I guess:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-23-6141749a10bc> in <module>()
     46     return abs(vis_data)
     47 
---> 48 vis_data = prepare_visualization_data(mgp)
     49 
     50 get_ipython().magic('matplotlib inline')

8 frames
/usr/lib/python3.6/json/encoder.py in default(self, o)
    178         """
    179         raise TypeError("Object of type '%s' is not JSON serializable" %
--> 180                         o.__class__.__name__)
    181 
    182     def encode(self, o):

TypeError: Object of type 'complex' is not JSON serializable
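
The traceback above says that a complex value reached the JSON encoder. A minimal sketch for locating such values, assuming the data is a nested list of numbers like the matrix built earlier; find_complex is a hypothetical helper and is not part of pyLDAvis. It only inspects data you pass to it, so it cannot tell where inside pyLDAvis a complex value might be produced:

```python
import json


def find_complex(rows):
    """Return (row, column) positions of complex values in a nested list."""
    return [
        (i, j)
        for i, row in enumerate(rows)
        for j, value in enumerate(row)
        if isinstance(value, complex)
    ]


rows = [[0.5, 0.5], [0.25, complex(0.75, 0.0)]]
positions = find_complex(rows)

# The standard json module rejects complex numbers, even when the
# imaginary part is zero, which matches the TypeError in the traceback.
try:
    json.dumps(rows)
    serializable = True
except TypeError:
    serializable = False
```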

@ilya-palachev

It would be easier to help you debug the issue with the full source code. Do you store it on GitHub, maybe in a notebook?

@ernests

ernests commented Jul 16, 2021

I spent some time solving the issues and getting this working.
Here is the final, working code with comments:

import math
from datetime import datetime

import pandas as pd
import pyLDAvis

now = str(datetime.now()).replace(' ', '_')

K = 40
alpha = 0.03
beta = 0.04
n_iters = 30

def prepare_data(mgp, docs, K):
    vocabulary = list(vocab)
    doc_topic_dists = [mgp.score(doc) for doc in docs]
    for doc in doc_topic_dists:
        for f in doc:
            assert not isinstance(f, complex)

    doc_lengths = [len(doc) for doc in docs]
    term_counts_map = {}
    for doc in docs:
        for term in doc:
            term_counts_map[term] = term_counts_map.get(term, 0) + 1
    term_counts = [term_counts_map[term] for term in vocabulary]
    doc_topic_dists2 = [[v if not math.isnan(v) else 1/K for v in d] for d in doc_topic_dists]
    doc_topic_dists2 = [d if sum(d) > 0 else [1/K]*K for d in doc_topic_dists2]
    for doc in doc_topic_dists2:
        for f in doc:
            assert not isinstance(f, complex)
    
    assert (pd.DataFrame(doc_topic_dists2).sum(axis=1) < 0.999).sum() == 0
    matrix = []
    for cluster in mgp.cluster_word_distribution:
        total = sum(cluster.values())
        assert not math.isnan(total)
        # assert total > 0
        if total == 0:
            row = [(1 / len(vocabulary))] * len(vocabulary)   # <--- The discussed workaround is here
        else:
            row = [cluster.get(term, 1) / total for term in vocabulary] # <--- changed from 0 to 1
        for f in row:
            assert not isinstance(f, complex)
        matrix.append(row)
    return matrix, doc_topic_dists2, doc_lengths, vocabulary, term_counts

def prepare_visualization_data(mgp, 
                               docs, 
                               K, 
                               alpha, 
                               beta, 
                               n_iters, 
                               now, 
                               save = False):
    vis_data = pyLDAvis.prepare(*prepare_data(mgp, docs, K), sort_topics=False, mds='mmds') # <--- mds is changed from default to mmds. 
    if save:
        with open(f"gsdmm-Clusters-{K}_Alpha-{alpha}_Beta-{beta}_Iterations-{n_iters}--------{now}.html", "w") as f:
            pyLDAvis.save_html(vis_data, f)
    return vis_data

vis_data = prepare_visualization_data(mgp, 
                                      trigrams, 
                                      K = K, 
                                      alpha = alpha,
                                      beta = beta,
                                      n_iters = n_iters,
                                      now = now)
%matplotlib inline

pyLDAvis.enable_notebook()
pyLDAvis.display(vis_data)
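
A note on the change from cluster.get(term, 0) to cluster.get(term, 1) above: it removes zeros from each row, but a row may then no longer sum to 1, because every term missing from the cluster adds an extra 1 / total. A minimal sketch of one alternative, assuming additive smoothing is acceptable for the visualization: add a small constant to every count and renormalize. The name smoothed_row and the parameter eps are hypothetical:

```python
def smoothed_row(cluster, vocabulary, eps=0.01):
    """Smoothed probability row for one cluster.

    `cluster` is assumed to be a dict mapping word -> count, and `eps`
    must be positive. Every term gets at least `eps` pseudo-counts, so
    no entry is zero and the row sums to 1. An empty cluster becomes a
    uniform row without a special case.
    """
    counts = [cluster.get(term, 0) + eps for term in vocabulary]
    total = sum(counts)
    return [c / total for c in counts]


vocabulary = ["cat", "dog", "fish"]
row = smoothed_row({"cat": 3, "dog": 1}, vocabulary)
empty_row = smoothed_row({}, vocabulary)
```

Whether smoothing changes the picture in a meaningful way depends on the corpus; a smaller eps stays closer to the raw counts.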
