
Implement service to return term counts #24

Open
lintool opened this issue Apr 17, 2013 · 6 comments

@lintool
Owner

lintool commented Apr 17, 2013

We need a service to return term counts within a certain time interval. We need to decide:

  1. Actual implementation (separate service? squeeze into current service?)
  2. Granularity?
  3. Just unigrams? Arbitrary n-grams as well?
  4. Impact on efficiency?
@lintool
Owner Author

lintool commented Apr 17, 2013

We might not even need a service:

If we stored term counts in this way: termid -> [ vector of counts... ]
we could post the file publicly and distribute the term-to-termid mapping separately.

The format would be pretty much identical to the Google Books Ngram datasets:
http://storage.googleapis.com/books/ngrams/books/datasetsv2.html
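A minimal sketch of what consuming such a flat file could look like. The tab-separated layout and the function names here are assumptions for illustration, not anything that exists in the repo:

```python
# Hypothetical file layout: one line per termid, tab-separated,
# first column the termid, remaining columns one count per time bucket.
# The termid -> term mapping would be distributed separately.

def parse_counts_line(line):
    """Parse a line like '42\t3\t0\t7' into (termid, [counts per bucket])."""
    fields = line.rstrip("\n").split("\t")
    termid = int(fields[0])
    counts = [int(c) for c in fields[1:]]
    return termid, counts

def total_count(counts, start_bucket, end_bucket):
    """Sum a term's counts over the half-open bucket interval [start, end)."""
    return sum(counts[start_bucket:end_bucket])
```

With a layout like this, interval queries reduce to slicing a count vector, so no server-side service would be needed at all.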

@amjedbj

amjedbj commented Apr 17, 2013

We need to return, for each term:

  • tf: frequency of the term in the document
  • df: number of documents that contain the term (time-aware)
  • cf: frequency of the term in the dataset (time-aware)

@stewhdcs

It would probably be useful to do this for unigrams and bigrams. The size of the file could be reduced by filtering out low-frequency terms, either over the whole collection or per 'bucket' period.

We can specify buckets every N hours from the start of the corpus; N = 4/6/12 hours would probably be more than enough. Even with a smaller-than-necessary interval, people can easily aggregate buckets together as needed using integer division on the bucket offset.
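The bucketing and aggregation idea above can be sketched briefly (the function names and the 4-hour default are assumptions for illustration):

```python
BUCKET_HOURS = 4  # assumed fine-grained granularity; the thread suggests N = 4/6/12

def bucket_index(timestamp_s, corpus_start_s, bucket_hours=BUCKET_HOURS):
    """Index of the bucket containing timestamp_s (seconds since epoch),
    counting buckets of bucket_hours from the start of the corpus."""
    return (timestamp_s - corpus_start_s) // (bucket_hours * 3600)

def coarsen(fine_bucket, factor):
    """Map a fine bucket index to a coarser one by integer division,
    e.g. factor=3 turns 4-hour buckets into 12-hour buckets."""
    return fine_bucket // factor
```

This is why shipping the finest useful granularity is enough: consumers who want 12-hour buckets just divide the 4-hour bucket offsets by 3.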

We would also need the background model of document frequencies in each bucket so we can compute term probabilities as well.

@amjedbj

amjedbj commented Apr 19, 2013

What about tweet and term statistics of the current index? Some IR baselines require collection statistics such as average tweet length (e.g., Okapi BM25). This is a non-exhaustive list of index stats:

  • tf(t,d): frequency of query term t in tweet d
  • pos(t,d): position of query term t in tweet d
  • len(d): number of terms in tweet d
  • N: number of tweets in the current index (time-aware)
  • N(s,e): number of published tweets in the time interval [s,e]
  • T: number of terms in the current index (time-aware)
  • df(t): number of tweets that contain term t (time-aware)
  • cf(t): number of occurrences of term t in the index (time-aware)
  • sum(len(d)): sum of tweet lengths in the current index (time-aware)
  • avg(len(d)): average tweet length in the current index (time-aware)
  • max(len(d)): maximum tweet length in the current index (time-aware)
  • max(tf(t,d)): maximum term frequency in the current index (time-aware)

Some of this data is reproducible on the client side, provided the same tokenizer and stemmer are used.
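As a sketch of why these statistics matter, here is a standard Okapi BM25 per-term score written against the quantities listed above (N, df(t), tf(t,d), len(d), avg(len(d))). The function name and the k1/b defaults are the usual textbook choices, not anything from this repo:

```python
import math

def bm25_term_score(tf_td, df_t, n_docs, len_d, avg_len, k1=1.2, b=0.75):
    """Okapi BM25 contribution of one query term to one tweet's score.

    tf_td   -- tf(t,d): frequency of term t in tweet d
    df_t    -- df(t): number of tweets containing t
    n_docs  -- N: number of tweets in the index
    len_d   -- len(d): number of terms in tweet d
    avg_len -- avg(len(d)): average tweet length
    """
    idf = math.log((n_docs - df_t + 0.5) / (df_t + 0.5) + 1.0)
    norm = tf_td * (k1 + 1) / (tf_td + k1 * (1 - b + b * len_d / avg_len))
    return idf * norm
```

Note that avg(len(d)) appears in the length normalization, which is exactly why a client cannot run BM25 without the collection statistics being exposed by the service.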

I defined some Thrift structs for data encoding; optional fields can be computed on the client side.
(see https://github.com/amjedbj/twitter-tools/blob/prototype-lintool/src/main/thrift/twittertools.thrift)

What do you think?

@Latifa-AlMarri

I went through the API in the Git repository and couldn't find code to obtain collection statistics (e.g., term tf, term idf, etc.).

Any help?

@milesefron
Collaborator

We are integrating these items into the API currently. They should be included soon.
-Miles

