Fetching status blocks for Tweets2011 - hits twitter api limit #13

Open
nigel-v-thomas opened this issue Feb 2, 2013 · 8 comments

Comments

@nigel-v-thomas

I have been trying to use this tool to download the blocks from the Tweets2011 collection. Unfortunately, the current implementation hits the Twitter API limit each time.

The Twitter limit on the read API is 180 hits per hour (see http://twitter4j.org/en/api-support.html), and 150 for unauthenticated requests.
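To stay under a limit like the one quoted above, a crawler can space its requests evenly. The following is only a minimal sketch of that idea, not code from this project: the `Throttle` class and its parameters are hypothetical names, and the 180-per-hour figure is taken from the comment as stated.

```python
import time


class Throttle:
    """Space calls evenly so at most `max_calls` start per `period` seconds."""

    def __init__(self, max_calls, period, clock=time.monotonic, sleep=time.sleep):
        self.interval = period / max_calls  # minimum gap between two calls
        self.clock = clock                  # injectable for testing
        self.sleep = sleep                  # injectable for testing
        self.next_allowed = None            # earliest start time of the next call

    def wait(self):
        """Block until the next call is allowed, then reserve the slot after it."""
        now = self.clock()
        if self.next_allowed is not None and now < self.next_allowed:
            self.sleep(self.next_allowed - now)
            now = self.next_allowed
        self.next_allowed = now + self.interval


# 180 calls per 3600 seconds means one call every 20 seconds.
throttle = Throttle(max_calls=180, period=3600)
```

Calling `throttle.wait()` before each request keeps the request rate at or below the configured limit, at the cost of a very slow crawl.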

I have tried:

  • creating authenticated requests to the Twitter 1.1 API (since the older API is deprecated and possibly removed from March 2013 onwards)
  • parsing the content out of the web pages directly (a brittle solution!); however, this doesn't work with protected accounts and missing pages

Given the number of requests generated by this solution, I am not sure how to build the Tweets2011 corpus.
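To make the scale of the problem concrete, here is a rough back-of-the-envelope sketch. The workload of one million requests is a hypothetical figure chosen for illustration, not the actual size of the corpus, and the hourly limits are the ones quoted in this comment.

```python
import math


def crawl_hours(num_requests, requests_per_hour):
    """Whole hours needed to issue `num_requests` at a fixed hourly limit."""
    return math.ceil(num_requests / requests_per_hour)


# Hypothetical workload: one request per tweet for one million tweets.
hours_auth = crawl_hours(1_000_000, 180)    # 5556 hours
hours_unauth = crawl_hours(1_000_000, 150)  # 6667 hours

print(hours_auth, "hours, about", round(hours_auth / 24, 1), "days")  # 231.5 days
```

Under these assumptions, even a modest fraction of a large collection would take months to fetch from a single account.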

@isoboroff
Collaborator

This is a longstanding issue that the API-based crawler can't fix. The answer is to use the HTML-based crawlers, which need updating to Twitter's current HTML layout.

@nigel-v-thomas
Author

Thank you for the feedback.

I have tried the HTML-based crawler and found that it will not work for users who have closed their accounts or made their tweets private since the original data was gathered.

I am resigned to believing that fetching status blocks from the API or directly from HTML will no longer work, due to API limits and expired or private accounts and tweets.

@isoboroff
Collaborator

We are working on updating the HTML crawler. Stay tuned, or alternatively patches accepted ;-)

@nigel-v-thomas
Author

I have submitted pull request #12, which uses the API to fetch.
I have not committed the code that fetches from HTML; it is a simple hack and suffers from the problems described earlier, i.e. expired or private tweets not being available. I can commit the code if needed.

@andrewyates

Could you commit the HTML scraping code somewhere for reference?

From what I understand, there's no way around the fact that private tweets and deleted tweets aren't available.
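Since unavailable tweets cannot be recovered, a crawler can at least record why each fetch failed and retry only the recoverable ones. The sketch below assumes a particular mapping from HTTP status codes to outcomes (403 for protected, 404 for deleted or missing, 429 for rate-limited); that mapping and the function names are illustrative assumptions, not taken from this project.

```python
from collections import Counter

# Assumed mapping from HTTP status code to crawl outcome (illustrative only).
OUTCOMES = {
    200: "ok",
    403: "protected",     # assumption: private account or tweet
    404: "missing",       # assumption: deleted tweet or closed account
    429: "rate_limited",  # too many requests; worth retrying later
}


def classify(status_code):
    """Map a status code to an outcome label; unknown codes count as errors."""
    return OUTCOMES.get(status_code, "error")


def summarize(results):
    """Count outcomes and collect the IDs worth retrying.

    `results` is an iterable of (tweet_id, status_code) pairs.
    """
    counts = Counter()
    retry = []
    for tweet_id, code in results:
        outcome = classify(code)
        counts[outcome] += 1
        if outcome in ("rate_limited", "error"):
            retry.append(tweet_id)
    return counts, retry
```

Protected and missing tweets are counted but never retried, which keeps the retry queue from growing with requests that cannot succeed.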

@nigel-v-thomas
Author

@andrewyates @isoboroff
FYI, below is the quick hack to scrape HTML content
nigel-v-thomas@5689ff7

@shakirak

Hi Nigel,

I was trying the link https://github.com/nigel-v-thomas/twitter-tools and am still getting the error "Unable to parse text from this, possible change in format". I am using HTML scraping.
Am I missing something?
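An error like the one quoted is what a scraper typically reports when the element it looks for is no longer in the page. The sketch below illustrates that failure mode with Python's standard `html.parser`; it is not the code from the linked repository, and the class name `tweet-text` is a hypothetical placeholder for whatever the page markup uses.

```python
from html.parser import HTMLParser


class TweetTextParser(HTMLParser):
    """Collect the text of the first element carrying a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.tag = None      # tag name of the matched element
        self.depth = 0       # > 0 while inside the matched element
        self.found = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == self.tag:
                self.depth += 1  # nested element with the same tag name
        elif not self.found:
            classes = (dict(attrs).get("class") or "").split()
            if self.css_class in classes:
                self.tag, self.depth, self.found = tag, 1, True

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_text(html, css_class="tweet-text"):
    """Return the tweet text, or raise if the expected element is absent."""
    parser = TweetTextParser(css_class)
    parser.feed(html)
    parser.close()
    if not parser.found:
        raise ValueError(f"no element with class {css_class!r}; markup may have changed")
    return "".join(parser.chunks).strip()
```

Any change to the class name or page structure makes `extract_text` raise, which is why this approach needs updating whenever the site layout changes.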

@nigel-v-thomas
Author

Hi Shakirak, it is quite likely that the HTML markup has changed since I updated the code. As I said, this solution is far from ideal; it would be much better to use the standard API.
Depending on what your end goal is:
