remove outdated examples from Readme.md
checkpoint newRows in PageRowRDD.explore
move driverInitialized metric before configuring driver
Peng Cheng committed Dec 21, 2014
1 parent b511e36 commit 3374277
Showing 3 changed files with 5 additions and 269 deletions.
268 changes: 0 additions & 268 deletions README.md
@@ -42,274 +42,6 @@ Examples

For a complete list of examples, please refer to the [source code page](https://github.com/tribbloid/spookystuff/tree/master/example/src/main/scala/org/tribbloid/spookystuff/example)

#### 1. Search on LinkedIn
- Goal: Find high-ranking professionals in your area on [http://www.linkedin.com/] whose first name is either 'Sanjay', 'Arun' or 'Hardik' and whose last name is either 'Gupta' or 'Krishnamurthy'; print out their full names, titles and lists of skills
- Query:
```
(sc.parallelize(Seq("Sanjay", "Arun", "Hardik"))
+> Visit("https://www.linkedin.com/")
+> TextInput("input#first", "#{_}")
+*> (TextInput("input#last", "Gupta") :: TextInput("input#last", "Krishnamurthy") :: Nil)
+> Submit("input[name=\"search\"]")
!=!())
.visitJoin("ol#result-set h2 a")()
.extract (
"name" -> (_.text1("span.full-name")),
"title" -> (_.text1("p.title")),
"skills" -> (_.text("div#profile-skills li"))
).asSchemaRDD()
```
- Result (truncated; finished in 1 minute on a laptop with a ~400k/s Wi-Fi connection):
```
Arun Arun Gupta Consultant , Global Canesugar services Pvt.Ltd. ArrayBuffer(Filtration, Project Planning, Energy, Renewable Energy, Engineering, Power Generation, Procurement, Biofuels, Gas, Instrumentation, DCS, Design & Developments, Boilers, Commissioning, Construction, EPC, Energy Management, Factory, ISO, Maintenance Management, Manufacturing, Materials, Negotiation, Operations Management, PLC, Petrochemical, Power Plants, Process Control, Process Engineering, Product Development, MS Project, Project Engineering, Project Management, Pumps, R&D, Refinery, SAP, Steam Turbines, Supply Chain Management, Turbines, Water Treatment)
Sanjay Sanjay Gupta Researcher at REC ltd ArrayBuffer(Internet Recruiting, Candidate Generation, Passive Candidate Generation, Research, Databases, Data Mining, Data Entry)
Arun Arun Krishnamurthy Global Incident Manager at IFF ArrayBuffer(Service Delivery Management, Management, Service Delivery, Transition Management, IT Service Management, Sla, ITIL, Vendor Management, Team Management, Business Analysis, Incident Management, Technical Support, Business Process, Resource Management)
... -------------------returned 116 rows------------------
```
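The `+*>` operator above fans each input row out over a list of alternative actions. Assuming it takes the cartesian product of rows and action branches (a reading of the query, not spookystuff's documented contract), the expansion can be sketched in plain Scala with illustrative names:

```scala
// Hypothetical sketch: each first name is paired with every alternative
// last-name action, yielding one independent form-filling chain per pair.
case class TextInput(selector: String, text: String)

def expand(firstNames: Seq[String], lastNames: Seq[String]): Seq[Seq[TextInput]] =
  for {
    first <- firstNames
    last  <- lastNames
  } yield Seq(
    TextInput("input#first", first), // bound from "#{_}"
    TextInput("input#last", last)    // one branch of the +*> list
  )

val chains = expand(Seq("Sanjay", "Arun", "Hardik"), Seq("Gupta", "Krishnamurthy"))
// 3 first names x 2 last names = 6 independent searches submitted in parallel
```

This is why the query above submits six searches from three input rows.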

#### 2. Query the Machine Parts Database of AppliancePartsPros
- Goal: Given the washing machine model 'A210S', search [http://www.appliancepartspros.com/] for the model's full name and a list of schematic descriptions (each describing a subsystem); for each schematic, collect the data of every enumerated machine part: description, manufacturer, OEM number, and a list of its substitutes. Join them all together and print them out.
- Query:
```
(sc.parallelize(Seq("A210S")) +>
Visit("http://www.appliancepartspros.com/") +>
TextInput("input.ac-input","#{_}") +>
Click("input[value=\"Search\"]") +>
Delay(10.seconds) !=!() //TODO: change to DelayFor to save time
)
.extract(
"model" -> ( _.text1("div.dgrm-lst div.header h2") )
)
.wgetJoin("div.inner li a:has(img)")()
.extract("schematic" -> {_.text1("div#ctl00_cphMain_up1 h1")})
.wgetJoin("tbody.m-bsc td.pdct-descr h2 a")()
.extract(
"name" -> (_.text1("div.m-pdct h1")),
"brand" -> (_.text1("div.m-pdct td[itemprop=brand]")),
"manufacturer" -> (_.text1("div.m-bsc div.mod ul li:contains(Manufacturer) strong")),
"replace" -> (_.text1("div.m-pdct div.m-chm p"))
)
.asSchemaRDD()
```
- Result (truncated, process finished in 2 minutes on one r3.large instance):
```
A210S A210S Washer-Top Loading  Parts for Maytag A210S: Top Cover\console\lid Switch Parts Moisture Barrier for Switch Whirlpool 214987 Part Number 214987 (AP4025452) replaces 2-14987, 438912, AH2018922, EA2018922, PS2018922.
A210S A210S Washer-Top Loading  Parts for Maytag A210S: Transmissions Parts Gear, Pwr Kit Whirlpool 204967 Part Number 204967 (AP4023771) replaces 2-13068, 2-13069, 2-4967, 213068, 213069, 214278, 435155, AH2017141, EA2017141, PS2017141.
A210S A210S Washer-Top Loading  Parts for Maytag A210S: Transmissions Parts Gear, Pwr Kit Whirlpool 204967 Part Number 204967 (AP4023771) replaces 2-13068, 2-13069, 2-4967, 213068, 213069, 214278, 435155, AH2017141, EA2017141, PS2017141.
... -------------------returned 293 rows------------------
```

#### 3. Download University Logos
- Goal: Search Google Images for the logos of all US universities (a list of US universities can be found at [http://www.utexas.edu/world/univ/alpha/]) and download them to one of your S3 buckets.
- You need to set up your S3 credentials through environment variables
- The following query will visit 4000+ pages and web resources, so it's better to test it on a cluster
- Query:
```
((noInput
+> Visit("http://www.utexas.edu/world/univ/alpha/")
!=!())
.sliceJoin("div.box2 a")()
.extract(
"name" -> (_.text1("*"))
)
.repartition(400)
+> Visit("http://images.google.com/")
+> DelayFor("form[action=\"/search\"]")
+> TextInput("input[name=\"q\"]","Logo #{name}")
+> Submit("input[name=\"btnG\"]")
+> DelayFor("div#search")
!=!())
.wgetJoin("div#search img","src")(limit = 1)
.saveContent(pageRow =>
"file://"+System.getProperty("user.home")+"/spooky/"+appName+"/images/"+pageRow("name"))
.extract(
"path" -> (_.saved)
)
.asSchemaRDD()
```
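Since the S3 credentials come from environment variables, a small pre-flight check before submitting the job can save a failed cluster run. The variable names below follow the common AWS convention and are an assumption, not a spookystuff requirement:

```scala
// Hypothetical pre-flight check: fail fast if AWS credentials are absent.
def missingCredentials(env: Map[String, String]): Seq[String] =
  Seq("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY").filterNot(env.contains)

// In the driver you would pass sys.env; taking any Map keeps it testable.
val missing = missingCredentials(sys.env)
if (missing.nonEmpty)
  println(s"Set these before running: ${missing.mkString(", ")}")
```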
- Result (process finished in 13 minutes on 4 r3.large instances; image files can be downloaded from S3 with any file transfer client that supports S3, e.g. the S3 web UI or CrossFTP):

![Imgur](http://i.imgur.com/ou6pCjO.png)

#### 4.a. ETL the product database of [http://www.iherb.com/]

- Goal: Generate a complete list of all products and their prices offered on [http://www.iherb.com/], load it into a designated S3 bucket as a tsv file, and save every product page visited as a reference.
- The following query will download 4000+ pages and extract 43000+ items from them, so it's better to test it on a cluster.
- Query:
```
(noInput
+> Wget("http://ca.iherb.com/")
!=!())
.wgetJoin("div.category a")()
.paginate("p.pagination a:contains(Next)")(indexKey = "page")
.sliceJoin("div.prodSlotWide")(indexKey = "row")
.extract(
"description" -> (_.text1("p.description")),
"price" -> (_.text1("div.price")),
"saved" -> (_.saved)
)
.asSchemaRDD()
```
- Result (process finished in 6.1 minutes on 4 r3.large instances)
```
http.ca.iherb.com.Food-Grocery-Items St. Dalfour, Wild Blueberry, Deluxe Wild Blueberry Spread, 10 oz (284 g) $4.49
http.ca.iherb.com.Food-Grocery-Items Eden Foods, Organic, Wild Berry Mix, Nuts, Seeds & Berries, 4 oz (113 g) $3.76
http.ca.iherb.com.Food-Grocery-Items St. Dalfour, Organic, Golden Mango Green Tea, 25 Tea Bags, 1.75 oz (50 g)) $3.32
... -------------------returned 42821 rows------------------
```

#### 4.b. Cross-website price comparison

- Goal: Use the product-price list generated in 4.a. as a reference and query [http://www.amazon.com] for possible alternative offers. Generate a table containing the first 10 matches for each product; the table's columns are:
```
(From left to right)
product name on iherb|product name on amazon|original price on iherb|price on amazon|shipping condition|user's rating|number of users that rated|"Do you mean" heuristic|exact match info|reference page
```
Save the table as a tsv file and keep all visited pages as a reference.
- This query will open 43000+ browser sessions, so it's recommended to deploy it on a cluster with 10+ nodes; alternatively, you can truncate the result of 4.a. for a dry run.

- Query:
```
(sc.textFile("iherb.tsv")
.tsvToMap("url\titem\tiherb-price")
+> Visit("http://www.amazon.com/")
+> TextInput("input#twotabsearchtextbox", "#{item}")
+> Submit("input.nav-submit-input")
+> DelayFor("div#resultsCol")
!=!())
.extract(
"DidYouMean" -> {_.text1("div#didYouMean a") },
"noResultsTitle" -> {_.text1("h1#noResultsTitle")},
"savePath" -> {_.saved}
)
.sliceJoin("div.prod[id^=result_]:not([id$=empty])")(limit = 10)
.extract(
"item_name" -> (page => Option(page.attr1("h3 span.bold", "title")).getOrElse(page.text1("h3 span.bold"))),
"price" -> (_.text1("span.bld")),
"shipping" -> (_.text1("li.sss2")),
"stars" -> (_.attr1("a[alt$=stars]", "alt")),
"num_rating" -> (_.text1("span.rvwCnt a"))
)
.asSchemaRDD()
```
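The `tsvToMap` helper above presumably splits each line on tabs and pairs the fields with the keys in its header string, so that `#{item}` can be substituted into the query. A plain-Scala approximation of that per-line behaviour (the real helper may differ):

```scala
// Hypothetical re-implementation of tsvToMap's per-line behaviour:
// split the header and each row on tabs, then zip them into a Map.
def tsvToMap(header: String)(line: String): Map[String, String] =
  header.split('\t').zip(line.split('\t')).toMap

// Illustrative row, not real data from the 4.a. output:
val row = tsvToMap("url\titem\tiherb-price")(
  "http://example.invalid/p1\tVitamin C\t$4.49")
// row("item") is what "#{item}" would resolve to in the Amazon query
```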
- Result (process finished in 2.1 hours on 11 r3.large instances)
```
MusclePharm Assault Fruit Punch Muscle Pharm Assault Pre-Workout System Fruit Punch, 0.96 Pound $33.11 FREE Shipping on orders over $35 3.8 out of 5 stars 484 muscle pharm assault fruit punch null http.www.amazon.com.s.ie=UTF8&page=1&rh=i%3Aaps%2Ck%3AMusclePharm%20Assault%20Fruit%20Punch
Paradise Herbs, L-Carnosine, 60 Veggie Caps Paradise Herbs L-Carnosine Cellular Rejuvenation, Veggie Caps 60 ea $50.00 null null null null null http.www.amazon.com.s.ie=UTF8&page=1&rh=i%3Aaps%2Ck%3AParadise%20Herbs%5Cc%20L-Carnosine%5Cc%2060%20Veggie%20Caps
Nature's Bounty, Acetyl L-Carnitine HCI, 400 mg, 30 Capsules Nature's Bounty Acetyl L-Carnitine 400mg, with Alpha Lipoic Acid 200mg, 30 capsules $15.99 FREE Shipping on orders over $35 3.6 out of 5 stars 7 null null http.www.amazon.com.s.ie=UTF8&page=1&rh=i%3Aaps%2Ck%3ANature%27s%20Bounty%5Cc%20Acetyl%20L-Carnitine%20HCI%5Cc%20400%20mg%5Cc%2030%20Capsules
Lansinoh, Breastmilk Storage Bags, 25 Pre-Sterilized Bags Lansinoh Breastmilk Storage Bags, 25-Count Boxes (Pack of 3) $13.49 FREE Shipping on orders over $35 4 out of 5 stars 727 null null http.www.amazon.com.s.ie=UTF8&page=1&rh=i%3Aaps%2Ck%3ALansinoh%5Cc%20Breastmilk%20Storage%20Bags%5Cc%2025%20Pre-Sterilized%20Bags
... -------------------returned 35737 rows------------------
```

#### 5 Download comments and ratings from [http://www.resellerratings.com/]

- Goal: Given a list of vendors, download their ratings and timestamped comments from customers.

- Query:
```
(sc.parallelize(Seq("Hewlett_Packard")) +>
Wget( "http://www.resellerratings.com/store/#{_}") !=!())
.paginate( "div#survey-header ul.pagination a:contains(next)")(indexKey = "page")
.sliceJoin("div.review")()
.extract(
"rating" -> (_.text1("div.rating strong")),
"date" -> (_.text1("div.date span")),
"body" -> (_.text1("p.review-body"))
)
.asSchemaRDD()
```

- Result:
```
1/5 2013-09-08 "Extremely pained by the service and behaviour of HP. My Envy touch screen Ultrabook crashed 3 weeks back, which I bought in January this year. It came with preloaded Windows 8 OS and they refuse to load the OS without charging me for it. The laptop is under warranty. Best was that the reason given for the crash by the service engineer is 'Monsoons'. Never buy an HP machine. "
1/5 2013-09-04 "I can not believe how bad the service was. I wanted our company to resell HP products. I have been calling HP for weeks. Every time I call I feel like they push me to tears. When I first called the women didn't know English, she had no one else to speak to me. I called another number, an American answered, but he had no understanding of how to make a new contract but suggested he could transfer me. He did and I got a live person, until he put me on hold and the call was disconnected. I called the number on the HP partners website and they said that they are a whole other company that works for HP and don't have any idea of what to do or who to call. This continued on. It was so stressful. I mean I am so sad and stressed. I may be better off finding another company to work with. I have wasted many valuable days of work and so has my assistant. It is time consuming and costly dealing with them."
1/5 2013-09-03 "I bought a HP CM-1015 multi-function printer/scanner/copier several years ago. It worked fine when connected to a WindowsXP print server. When I upgraded the XP to Windows 7 (about three years ago), I could not find a driver for it, so I waited, waited, and waited more. Today is September 3, 2013, I installed the newest posted driver that I downloaded from the HP’s website, and it still does not work: It only prints black-and-white, not color. I am not even asking to have all the functions of the machine to work, just the printer part, is this too much for HP? I am very disappointed at the HP product and its service. I bought the HP brand because I thought I would get a great product with great service. What I got is the opposite. Now, I am looking at this almost new machine (it has not been used much in the past years) and wondering: Should I get rid of it ($500+ when purchased) and get another brand? It is really a waste of money and time."
... -------------------returned 189 rows------------------
```
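`paginate` above keeps following the 'next' link until it disappears. The traversal it implies can be sketched over an in-memory site; this is a toy model of the semantics, not the real crawler:

```scala
// Toy model: pages keyed by URL, each optionally pointing at a "next" page.
case class Page(url: String, reviews: Seq[String], next: Option[String])

// Follow next-links until none remains, collecting every page in order.
// Assumes the next-links form a finite chain (no cycles).
def paginate(site: Map[String, Page], start: String): Seq[Page] =
  Iterator.iterate(site.get(start))(_.flatMap(_.next).flatMap(site.get))
    .takeWhile(_.isDefined).flatten.toSeq

val site = Map(
  "p1" -> Page("p1", Seq("5/5 great"), Some("p2")),
  "p2" -> Page("p2", Seq("1/5 bad"), None))
// paginate(site, "p1") visits p1, then p2, then stops at the missing link.
```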

#### 6 Download comments and ratings from [http://www.youtube.com/]

- Goal: Visit Metallica's YouTube channel and click the 'Show more' button repeatedly until all videos are displayed. Collect each video's title, description, date of publishing, number of views, number of positive votes, number of negative votes and number of comments. Finally, click the 'Show more comments' button repeatedly and collect all top-level comments (not replies); also save each fully expanded comment iframe for validation.

- Query:
```
(((sc.parallelize(Seq("MetallicaTV")) +>
Visit("http://www.youtube.com/user/#{_}/videos") +>
Loop(
Click("button.load-more-button span.load-more-text")
:: DelayFor("button.load-more-button span.hid.load-more-loading").in(10.seconds)
:: Nil
) !=!())
.sliceJoin("li.channels-content-item")()
.extract("title" -> (_.text1("h3.yt-lockup-title")))
.visit("h3.yt-lockup-title a.yt-uix-tile-link")()
.repartition(400) +>
ExeScript("window.scrollBy(0,500)") +>
Try(DelayFor("iframe[title^=Comment]").in(50.seconds) :: Nil)
!><()).extract(
"description" -> (_.text1("div#watch-description-text")),
"publish" -> (_.text1("p#watch-uploader-info")),
"total_view" -> (_.text1("div#watch7-views-info span.watch-view-count")),
"like_count" -> (_.text1("div#watch7-views-info span.likes-count")),
"dislike_count" -> (_.text1("div#watch7-views-info span.dislikes-count"))
)
.visit("iframe[title^=Comment]", attr = "abs:src")() +>
Loop(
Click("span[title^=Load]") :: DelayFor("span.PA[style^=display]").in(10.seconds) :: Nil
) !=!(joinType = LeftOuter))
.extract("num_comments" -> (_.text1("div.DJa")))
.sliceJoin("div[id^=update]")()
.extract(
"comment1" -> (_.text1("h3.Mpa")),
"comment2" -> (_.text1("div.Al"))
)
.asSchemaRDD()
```
- Result (finished in 9 minutes on 19 r3.large instances):
```
MetallicaTV Metallica: Welcome Home (Sanitarium) & Sad But True (MetOnTour - Asunción, Paraguay - 2014) Fly on the wall footage shot by the MetOnTour reporter on March 24, 2014 in Asunción, Paraguay. Footage includes the band warming up in the Tuning Room as well as both "Welcome Home (Sanitarium)" and "Sad But True" from the show. Download the full audio from the show at LiveMetallica.com: http://talli.ca/Asuncion2014 Follow Metallica: http://www.metallica.com http://www.livemetallica.com http://www.facebook.com/metallica http://www.twitter.com/metallica http://www.instagram.com/metallica http://www.youtube.com/metallicatv Published on Apr 1, 2014 null null null All comments (453) Max Galvan   james: what sound do u want? guy: SED BOT TRU!!!!! lol  Read more Show less
MetallicaTV Metallica: Welcome Home (Sanitarium) & Sad But True (MetOnTour - Asunción, Paraguay - 2014) Fly on the wall footage shot by the MetOnTour reporter on March 24, 2014 in Asunción, Paraguay. Footage includes the band warming up in the Tuning Room as well as both "Welcome Home (Sanitarium)" and "Sad But True" from the show. Download the full audio from the show at LiveMetallica.com: http://talli.ca/Asuncion2014 Follow Metallica: http://www.metallica.com http://www.livemetallica.com http://www.facebook.com/metallica http://www.twitter.com/metallica http://www.instagram.com/metallica http://www.youtube.com/metallicatv Published on Apr 1, 2014 null null null All comments (453) J   lol@ 14:41 James telling the fan not to bow before him XD Read more Show less
MetallicaTV Metallica: Welcome Home (Sanitarium) & Sad But True (MetOnTour - Asunción, Paraguay - 2014) Fly on the wall footage shot by the MetOnTour reporter on March 24, 2014 in Asunción, Paraguay. Footage includes the band warming up in the Tuning Room as well as both "Welcome Home (Sanitarium)" and "Sad But True" from the show. Download the full audio from the show at LiveMetallica.com: http://talli.ca/Asuncion2014 Follow Metallica: http://www.metallica.com http://www.livemetallica.com http://www.facebook.com/metallica http://www.twitter.com/metallica http://www.instagram.com/metallica http://www.youtube.com/metallicatv Published on Apr 1, 2014 null null null All comments (453) Damian Sobik   Lol Kirk at 3:30 hahah WTF ?! Read more Show less
... -------------------returned 47236 rows------------------
```
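The `Loop(...)` above repeats its click-and-wait steps until they stop succeeding. Assuming Loop's semantics are "repeat until a step fails", its effect on the 'Show more' button can be modelled as:

```scala
// Toy model of Loop: keep "clicking" (pulling the next batch of videos)
// until a click reveals nothing, mirroring repeat-until-failure semantics.
def showMore(batches: Iterator[Seq[String]]): Seq[String] = {
  val out = scala.collection.mutable.Buffer.empty[String]
  var more = true
  while (more) {
    val batch = if (batches.hasNext) batches.next() else Nil
    if (batch.isEmpty) more = false else out ++= batch
  }
  out.toSeq
}

// Three "Show more" clicks reveal batches; the fourth finds nothing and stops.
val videos = showMore(Iterator(Seq("v1", "v2"), Seq("v3"), Seq("v4"), Nil))
```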

#### 7 Download forward citations from [Google Scholar](http://scholar.google.com)

- Goal: Given a list of paper titles, search for them on Google Scholar; for each one, collect the first result along with the titles and abstracts of its forward citations (all publications that cite it), and download all of them.

- Query:
```
(sc.parallelize(Seq("Large scale distributed deep networks"))
+> Visit("http://scholar.google.com/")
+> DelayFor("form[role=search]")
+> TextInput("input[name=\"q\"]","#{_}")
+> Submit("button#gs_hp_tsb")
+> DelayFor("div[role=main]")
!=!())
.extract(
"title" -> (_.text1("div.gs_r h3.gs_rt a")),
"citation" -> (_.text1("div.gs_r div.gs_ri div.gs_fl a:contains(Cited)"))
)
.visitJoin("div.gs_r div.gs_ri div.gs_fl a:contains(Cited)")(limit = 1)
.paginate("div#gs_n td[align=left] a")()
.sliceJoin("div.gs_r")()
.extract(
"citation_title" -> (_.text1("h3.gs_rt a")),
"citation_abstract" -> (_.text1("div.gs_rs"))
)
.wgetJoin("div.gs_md_wp a")()
.saveContent(select = _("citation_title"))
.asSchemaRDD()
```

- Result (finished in 1.5 minutes on my laptop with a ~450k/s download speed):
```
Large scale distributed deep networks Large scale distributed deep networks Cited by 119 Good practice in large-scale learning for image classification Abstract—We benchmark several SVM objective functions for large-scale image classification. We consider one-versus-rest, multiclass, ranking, and weighted approximate ranking SVMs. A comparison of online and batch methods for optimizing the objectives ...
Large scale distributed deep networks Large scale distributed deep networks Cited by 119 Deep learning with cots hpc systems Abstract Scaling up deep learning algorithms has been shown to lead to increased performance in benchmark tasks and to enable discovery of complex high-level features. Recent efforts to train extremely large networks (with over 1 billion parameters) have ...
Large scale distributed deep networks Large scale distributed deep networks Cited by 119 A reliable effective terascale linear learning system Abstract: We present a system and a set of techniques for learning linear predictors with convex losses on terascale datasets, with trillions of features,{The number of features here refers to the number of non-zero entries in the data matrix.} billions of training examples ...
... -------------------returned 119 rows------------------
```

Performance
---------------
