ACM Data Mining Camp, March 20, 2010
The event was a success – close to 400 people attended and used the day to talk about data mining.
WHAT is an UNCONFERENCE or CAMP?
An unconference is an event where users suggest topics, get together and discuss them in detail. This camp is focused on Data Mining, Analytics, Cloud Computing and the various applications of these technologies. There is an option to join the SF Bay ACM for $20 per year. Our last Data Mining Camp had 225 participants.
DONORS:
- eBay, LinkedIn, KXEN, Revolution Computing, Salford Systems
- Details and links below, thank them for making the unconference FREE
LOCATION:
- eBay 2161 N First St, San Jose, CA 95131 (free parking)
- The auditorium seats 420 people and 10 other rooms are available for break out sessions.
- If you plan on connecting a Mac to a projector, please bring your connector cables.
REGISTER / RSVP / SEE WHO IS COMING:
- Click here to register for the $30 pre-camp training How to Win Data Mining Competitions by Mikhail Golovnya, Consulting Project Leader, Salford Systems
- Optional $20 Annual ACM Membership
RSVP at the LinkedIn event so we can print name tags and plan for food and coffee during the conference. Listing the event on your LinkedIn profile will also help promote the event.
We have 478 + participants who have said they are coming with a total of 560 RSVPs!
(Thursday, 3/18/2010)
SCHEDULE for Saturday, March 20th, 2010:
- Pre-Camp Training, by Mikhail Golovnya, Consulting Project Leader, Salford Systems
- 9:00 Pick up name tag, coffee, network
- 9:30 – 11:30 “How to Win Data Mining Competitions,” ($30 Registration)
- FREE Data Mining Camp (11:15 – 7:30pm) (beverages & snacks included with RSVP)
- 11:15-Noon Arrive, register, network, brainstorm session topics
- Box lunches provided by eBay
- Noon Unconference welcome, 5 min / donor, hiring announcements
- 12:50 Expert Panel Questions and Answers
- 1:40 Audience members suggest a topic, get show of hands for interest, select session room size by interest level, select time slot for session. We recommend sessions have a leader, blogger / note taker, and a timer so we can leave everybody 10 minutes to get to the next session. See the last SESSION MATRIX for example.
- 2:30 SESSION break out time slot 1
- 3:30 SESSION break out time slot 2
- 4:30 SESSION break out time slot 3
- 5:30 SESSION break out time slot 4
- 6:30 Share summary of sessions over pizza and salad (Thank you Donors!)
- Door prize drawings
- 7:30 Thank you and wrap up.
EXPERT PANEL:
Video
Moderator: Patricia Hoffman, Ph.D. Scientific Researcher, Aha Solutions !
- Ted Dunning, Ph.D. Chief Technology Officer at DeepDyve
DeepDyve provides technical literature to a wide audience. Dr. Dunning created MusicMatch’s recommendation system and revenue optimization system. He also created ID Analytics large scale identity fraud systems.
- Joseph B. Rickert Revolution Computing
Mr. Rickert performed statistical analysis of clinical trials and built economic models for Cedar Associates. He founded Scotts Valley Instruments. He started his career by building mathematical models of communication networks for NASA, CIA, and NSA.
- Giovanni Seni, Ph.D. Elder Research Inc. and Professor at Santa Clara University
Dr. Seni has lectured on “From Trees to Forest and Rule Sets – A Unified Overview of Ensemble Methods”. His text, “Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions”, is now available. He teaches Pattern Recognition and Data Mining at Santa Clara University. His research interests include statistical pattern recognition, data mining, and human-computer interaction applications. He holds five U.S. patents and has published over twenty conference and journal articles.
- Michael Walker, Ph.D. Professor at Stanford University and President of Walker Bioscience
Dr. Walker provides statistics consulting for pharmaceutical and biotechnology companies in the areas of drug development, medical devices, clinical trials, and molecular diagnostics. His work has led to FDA and CLIA approvals, articles in the New England Journal of Medicine and in Lancet, numerous patents, and new products to diagnose and treat disease. He consults for venture capital companies to evaluate investments in life sciences and works with life science start-ups to obtain funding and advise on corporate strategy and product development.
- Hugh Williams, Ph.D. Vice President, Search Engine Engineer, Buyer Experience Development eBay
Dr. Williams is an innovator, inspiring leader, and expert in search engines and web services. In the past he managed a large R&D team at Microsoft’s Bing. He has published 99 works including two books: “Web Database Applications with PHP and MySQL” and “Learning MySQL” for O’Reilly Media Inc. He holds 2 US patents and has 25 patent applications in the works.
- Mike Bowles, Ph.D. Seasoned in Startups, Data Mining and Quantitative Finance
Dr. Bowles was founder and Chairman of Board at iBeam Broadcasting, and founding CEO at Com21. He is experienced with quantitative finance and fully automated quantitative trading.
- Greg Makowski Principal Consultant at Golden Data Mining
Greg has deployed 70+ data mining models since 1992 over 3 continents, including architecting enterprise software or web systems with embedded data mining. He is experienced in financial services, targeted marketing, retail supply chain, internet advertising and start-ups. He is currently consulting for Wells Fargo working on fraud detection using tools like SAS Enterprise Miner.
Moderated by Dr. Patricia Hoffman
The expert panel includes: Dr. Ted Dunning, Mr. Joseph B. Rickert, Dr. Giovanni Seni,
Dr. Michael Walker, Dr. Hugh Williams, Dr. Mike Bowles, and Mr. Greg Makowski
SESSION TOPICS:
Sessions may include one or more focus areas, such as…
- Experience level: beginners to experts
- Algorithms: Forecasting, clustering, text mining, sentiment, network analysis, collaborative filtering, fraud
- Verticals: Internet advertising, social networks, targeted marketing, financial services, medical, genetics, green tech, space science, mobile devices, startups, Netflix $1,000,000 prize
- Tools/Processes: Commercial, public domain, libraries, in SQL, in cloud, project or product management, SalesForce.com plug ins, CRM software plug ins
- User Groups: R, SAS, Salford Systems, Hadoop, Mahout
- Help me: I am stuck on… I need guidance… How do you…? (but suggest topic of general interest)
- Participate in Our Data Mining Blog: Find birds of a feather, invite participants in a session, suggest or plan session ideas, update during the session, share in the summary of sessions or add a job posting
GOLD UNDERWRITERS:
- Founded in 1995, eBay Inc. is the worldwide leader in shopping and payments on the web. Every day, we connect hundreds of millions of buyers and sellers — and we continue to find new ways to help people do business around the world.
- Check out how our award-winning products can help you predict the future of your business NOW! Need to determine your ROI on your data mining project quickly and cost-effectively? Consider our Rapid Response Data Mining Center.
- Over 60 Million Professionals Use LinkedIn to exchange information, ideas, and opportunities.
- Increase Business Performance with Automated Customer Lifecycle Analytics with KXEN. KXEN is the leading provider of automated data mining software and customer analytic solutions for retail, communications, media, financial and marketing services companies to improve their customer insight and enhance corporate performance. KXEN solutions integrate predictive analytics and social network analysis into business processes to boost marketing campaign results and profitability.
- REvolution Computing offers open source products and services for high performance analytics, including REvolution R Enterprise which delivers 100% R and more—optimized, validated and supported.
- REvolution R Enterprise 3.0 Coming Soon. New R Integrated Development Environment (IDE) with visual debugging, enhanced script editor, and more! Click here to learn more.
PROMOTION AND PROMOTION PARTNERS:
- ACM email newsletter, opt-in list of 3,000+ people mostly in the N CA bay area
- Analytic Bridge is a data mining social network and newsletter to 20,000 people
- KD Nuggets website and newsletter, 12,000 people
- Predictive Analytics World is the cross-vendor conference covering commercial deployment, February 16-17, 2010 in San Francisco, CA. See also the Prediction Impact newsletter
- SD Forum (The Software Development Forum) has ~20 events / month, and 16 SIGs
- See also their seminar “The Analytics Revolution”, Fri April 9, 8:30am – 3:30 pm, Mountain View, $100
BLOG / TWITTER:
- Twitter Tag #DMCAMP
- You can contribute to the blog below: Let’s get started planning the sessions
- The blog is moderated to cut out spam, so your posts will appear after being approved.
- The Vokel web site of live video exchanges for people that can’t attend, but still want to watch and listen (a big thank-you to Irene Gabashvili, PhD)
VIDEOGRAPHER / PHOTOGRAPHER:
- Ron Fredericks will be recording and web hosting sessions
- Blog Post for this event
- Examples: Webinars, Internet TV, event capture, web hosting, and DVDs
- Tag line: “Video production for exceptional people, brands, and products”
- Web site and video hosting: www.LectureMaker.com
- Phone: 408-390-1895
- Here is the link to my video blog post for your event: http://www.lecturemaker.com/2010/03/acm-data-mining-camp/
- coordinate recording your session – ronf@lecturemaker.com
ONE PAGE EVENT ANNOUNCEMENT:
- Use this one page event announcement PDF to help share and publicize this event!
ORGANIZERS:
- Greg Makowski, www.LinkedIn.com/in/GregMakowski
Greg is the overall organizer, sponsor contact, co-marketing contact, DM SIG co-chair
- Patricia Hoffman, Ph.D., www.LinkedIn.com/in/PatriciaHoffmanPhd
Tricia is the Expert Panel Moderator, and web marketer
- Mike Bowles, Ph.D., www.LinkedIn.com/in/MikeBowles
Mike is the Volunteer Coordinator
CALL FOR VOLUNTEERS:
- In Advance:
- Contact Mike Bowles, and ask how you can help!
- marketing: Help announce and publicize to analytic crowds, groups, technical talks
- marketing: tweet about the event
- marketing: announce to analytic contacts at your company or in your network
- marketing: add the event web page to your email signature
Come to the ACM Data Mining Camp, Sat, March 20 in San Jose. See the current RSVP list
- feedback forms: Develop, get organizer review
- feedback forms: print
- session topics: suggest ideas for session topics on the moderated blog (below)
- video: provide video editing training (optional)
- video: receive video editing training, offer to help edit video
- video: technical plans on posting the video
- Early morning or day before:
- signs: put up maps of complex and signs at the facility
- Day of the Event:
- coffee: help setup
- registration: gather emails, pass out name tags
- registration: bring portable for on-line registration, to allow people to pay to join ACM
- sponsor support: help as requested
- food and beverage: distribution when delivery arrives, resupply, keep neat
- gofer: many last minute things come up, go-fer this, then go-fer that
- session matrix: record the matrix of session titles per time slot and room
- session matrix: post to blog, email to organizers
- session matrix: print ~dozen copies
- session matrix: distribute session matrix around the session rooms
- session content: take notes during sessions, add to the blog (below)
- session content: encourage others to cover sessions, try and get all covered
- tweet!
- session timing: help with timing of 50 minute sessions, giving 5 minute and 2 minute announcements
- video: bring and operate video equipment (optional)
- After the Event (same day):
- Help clean up food areas
- Collect feedback on improving the event in the future
- Pick up event specific signs
- After the Event (later):
- video: editing, organizing
- video: web posting
- tweet!
- blog: add to your blog or our blog, post links
- Help announce and publicize to analytic crowds
- Put up maps and signs at the facility
Session Notes:
The session on “Natural Language Processing” was attended by more than 20 people, and in fact related problems were discussed at other sessions, e.g., on “Sentiment Analysis”. Hopefully, next (un)conferences will bring more discussions and better understanding of the problems.
Parts of the discussion, as recorded by our volunteer scribe, Joan A. Hoenow, are below, and my own comments are at the end.
=====
Question raised: Can you do NLP in a reasonable amount of time? Suppose you have millions of documents.
Response: NLP will not be fast at first but can be improved.
Comment: Automatic speech recognition systems have been getting better. They are more able to deal with unrecognized words.
Response: Maybe speech recognition in phone answering systems won’t be the model that we want.
Comment: there is more interest in web search.
Comment: Dictionaries, and elaborate grammars are developed, but these are not part of “understanding the meaning”.
There are elaborate syntactic processors, but they are strictly syntactic, too slow for big amounts of text. They are based on a strict theoretical interpretation, and adding semantics was not included. Adding semantics would reduce the amount of work, but this is not in the academic tradition.
Question: What do you do about the inherent ambiguities? The machine doesn’t know about jaguar car vs. jaguar animals.
Comment: Machines now do statistical processing. Google has parallel “facts”.
Comment: What is natural language processing? Powerset, acquired by Microsoft, is developing a natural language search engine for the Internet. Will it be able to distinguish in a search for “children’s books” at least between the 3 categories: 1) books for children; 2) books owned by children; 3) books written by children?
The group discussed two approaches to natural language processing:
1) set up a structure, knowing the concepts, and program the understanding from human understanding.
2) machine self learning: feed bunches of text, find patterns and the significance of patterns. Set up a system that evolves understanding. Reference made to Monica L. Anderson of Syntience Inc. (not in attendance) as a proponent of machine learning. The Syntience Inc. web site proposes “Artificial Intuition”.
In discussion of either approach, the question was raised “What is understanding?”
Comment on machine learning approach: imagine intercepting signals from another civilization, or obtaining text from a former civilization. How would you recognize anything of significance? What quantities can I use?
Response: Why refuse to know the language?
Comment: Not being refused, but possibly I don’t have it. For example, you may be indexing geology and you may not know it. There are advantages to be able to do some of this work while being completely ignorant of the jargon. For example, ‘set’ means something different in math than in other fields, and the differing usage of the same word in different disciplines is common.
Comment: before such an interesting thing as discovering meaning from an unknown civilization, can we understand our own?
Other applications: Trying to make a computer do categorization, give an idea what the text is about. An example is a sales ad, perhaps on eBay. Can the ad be processed to be placed in the correct category?
Comment: Note the difference between two things in NLP
1) make decisions based on statistics
2) the ambition to understand the meaning completely even in a restricted domain.
Comment: Suggest this approach: if a compendium has a glossary, either you already have the word with a detailed description or, for a missing entry, one needs to be created.
Comment: Back to the two approaches. Are these really different problems or is it just a matter of how much computation is needed?
Comment: When a human learns, there is a lot of negative feedback. Would it be possible for the natural language processing to have an “I don’t know” and get human feedback?
Comment: There’s ambiguity, but you can have the computer report a confidence interval, for example in speech tagging. The problem is that the training set has to be very similar to the text.
Everyone is talking about using Wikipedia. Wikipedia knows lots of things that most humans don’t care about, obscure items. This is not a good training set. There is no common sense.
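The “I don’t know” idea and the eBay ad categorization example raised above can be combined in a few lines. The following is only a minimal sketch, not anything shown at the session: the keyword lists, the `categorize` function, and the 0.6 threshold are all hypothetical, and the score is a crude share of keyword hits rather than a statistical confidence interval.

```python
from collections import Counter

# Hypothetical keyword lists per category; a real system would learn these from data.
CATEGORY_KEYWORDS = {
    "cars": {"camaro", "engine", "mileage", "jaguar", "coupe"},
    "laptops": {"laptop", "ram", "screen", "battery", "keyboard"},
}

def categorize(text: str, min_confidence: float = 0.6):
    """Return (category, confidence), or ("I don't know", confidence) when unsure."""
    words = text.lower().split()
    hits = Counter()
    for category, keywords in CATEGORY_KEYWORDS.items():
        hits[category] = sum(1 for w in words if w in keywords)
    total = sum(hits.values())
    if total == 0:
        # No known keyword at all: abstain and leave the decision to a human.
        return "I don't know", 0.0
    best, count = hits.most_common(1)[0]
    confidence = count / total  # share of keyword hits won by the top category
    if confidence < min_confidence:
        return "I don't know", confidence
    return best, confidence

print(categorize("camaro coupe low mileage"))
print(categorize("jaguar laptop"))
```

An ad that mentions only car words is placed in "cars", while an ad that splits its keywords evenly between categories falls below the threshold and is handed back for human feedback.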
Comment: Maybe there could be an NLP Wikipedia.
Comment: The problem we have when we talk to linguists is that understanding and meaning are not things we can define and measure. If we can measure what is there, we can build graphs.
Comment: Semantic etymology can be done. There is some dispute whether word stemming is really useful, for example, ‘pant’ and ‘pants’ have the same stem but this is not helpful.
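The ‘pant’ / ‘pants’ objection is easy to reproduce with a toy suffix stripper. This is a sketch under stated assumptions: `naive_stem` and its three-suffix rule list are made up for illustration and are far cruder than any real stemmer, but they show how stemming can merge words whose meanings differ.

```python
def naive_stem(word: str) -> str:
    """Strip a few common English suffixes (a toy rule set, not a real stemmer)."""
    word = word.lower()
    for suffix in ("ing", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["pant", "pants", "book", "books"]:
    print(w, "->", naive_stem(w))
```

Merging "book" and "books" is usually what a search engine wants; merging "pant" (to breathe heavily) and "pants" (clothing) is the unhelpful case the comment describes.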
Comment: Is there intrinsic informational content about a sequence of words? “The quick brown fox jumps over the lazy dog” . Which words are significant? Can you recognize the subject, the predicate, the object of the predicate?
Comment: English syntax doesn’t show the exact roles for verbs with multiple objects “I give him the book”.
Comment: The question is “What do we want to do with it?” What is the context? Imagine this: we are 20 people, each going to sell a similar laptop or Camaro on eBay. It is likely we would have 20 different descriptions. If we have time, we can get a good categorization. eBay has some words that are unique to its environment. Some categorization works well, some not so well. In trying to connect the buyer and the seller, it is important that the seller can describe the item, but many sellers do not know how to do this. How can eBay help the seller? And how can the buyer’s search get the correct items?






We are privileged to announce that
Ted Dunning is willing to host a session on the current status of Mahout!
Mahout’s goal is to build scalable machine learning libraries. Mahout uses Hadoop to support scaling. By scalable we mean:
* Scalable to reasonably large data sets. Our core algorithms for clustering, classification and batch based collaborative filtering are implemented on top of Apache Hadoop using the map/reduce paradigm. However we do not restrict contributions to Hadoop based implementations: Contributions that run on a single node or on a non-Hadoop cluster are welcome as well. The core libraries are highly optimized to allow for good performance also for non-distributed algorithms.
Currently Mahout supports mainly four use cases: Recommendation mining takes users’ behavior and from that tries to find items users might like. Clustering takes for example text documents and groups them into groups of topically related documents. Classification learns from existing categorized documents what documents of a specific category look like and is able to assign unlabelled documents to the (hopefully) correct category. Frequent itemset mining takes a set of item groups (terms in a query session, shopping cart content) and identifies which individual items usually appear together.
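To make the last use case concrete, here is a small single-machine sketch of frequent itemset mining restricted to pairs. It is plain Python and does not use Mahout or Hadoop; the `frequent_pairs` function, the `min_support` parameter, and the shopping baskets are all hypothetical illustrations of the idea, not Mahout's API.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(baskets, min_support=2):
    """Count item pairs that occur together in at least `min_support` baskets."""
    counts = Counter()
    for basket in baskets:
        # Sort and de-duplicate so each pair is counted once per basket.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

baskets = [
    ["laptop", "mouse", "bag"],
    ["laptop", "mouse"],
    ["mouse", "bag"],
    ["laptop", "bag", "mouse"],
]
print(frequent_pairs(baskets, min_support=3))
```

Counting every pair in memory like this only works for small data; the per-basket counting step is the part that a map/reduce implementation would spread across many machines.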
Looking forward to attending – let me know if you need additional speakers or panelists.
Cheers,
Ilene
Dr. Sudhir Kshirsagar is interested in anomaly detection
Ben Gimpert says,
“There will hopefully be interest in another finance and trading track, for which I will suggest a few ad hoc talks on the “day of.” Perhaps a bit more technology-focused than the BATSIG meetings. * What domain-specific compromises can we make when crawling financial data? * Machine-learning techniques for a problem domain that is extremely messy, but tolerant to false positives? * Proxies for overall market sentiment. Does a forecast of a particular stock index do a “good enough” job of picking macro tops and bottoms? See you there!”
Is the event going to have any online activity? It will be great if something like proceedings of the event are published on the website or something like webinar which could be bought.
Posted by Pankaj Deshpande
Could we discuss machine learning using parallel processors?
Posted by Elinor Velasquez
We can offer a session:
“Automated Data Mining with KXEN — Bring your data and we will beat your model or we will eat our hat!” by Vincent Vikor, KXEN
Biomedical Data Mining: Successes, Failures, and Challenges.
In November, we proposed and discussed this topic: Biomedical Data Mining: Dimensionality, Noise, Applications
http://aurametrix.blogspot.com/2009/11/biomedical-data-mining-dimensionality.html
This time, let it be Biomedical Data Mining: Successes, Failures, and Challenges.
For those who can’t attend in person, please do so virtually:
http://vokle.com/events/1547
Irene Gabashvili, Entrepreneur, Innovator, and Educator
Thanks for Tweeting!
aurametrix says Mining Health-related Data: live online event on March 20 http://vokle.com/events/1547 (from ACM Data Mining Camp #DMCAMP)
WhatToDoBay says Biomedical Data Mining http://ow.ly/1pju2d webcast from San Jose #DMCAMP March 20 http://www.sfbayacm.org/?p=1341
TweetMeme says
erninthecity RT @rachelsegal: Lunch at digitalmediacamp was delicious now we’re recapping on the first sessions #dmcamp Agreed, lunch was quite good!
I’d like to see something concerning the applications of evolutionary algorithms.
Can we discuss sentiment analysis?
After posting information about the “camp”, a dialogue discussing the pros and cons of statistics vs. machine learning has started at:
http://www.analyticbridge.com/group/timeseries/forum/topics/association-of-computing-5?commentId=2004291%3AComment%3A62172&xg_source=msg_com_forum
It’d be nice to have a discussion about how to go about data mining for some engineers who have almost forgotten statistical models
Dynamic poll for DMC2010: http://urtak.com/u/719
I plan to attend the camp. I volunteer for a session on algorithms to optimize Cost per Click for events platform.
I am interested in data mining utilizing the cloud (Hadoop, MapReduce, Hive or Pig), for both structured and unstructured data. A real world example/demo with performance metrics would be excellent!
Data mining, machine learning, knowledge discovery in databases and statistical inference are my primary focus. I’m particularly interested in applications of evolutionary algorithms
I’m interested in methodologies (e.g. SVM/RVM, neural networks, LWPR) as they apply to market microstructure and high frequency data source
Check out http://twtvite.com/DMCAMP2010
For a poll on topics for the camp look at
http://urtak.com/u/719
So far there have been 130 responses to 24 questions which have been asked!
6 people want to discuss sentiment analysis or opinion mining. Is there anyone interested in facilitating this topic?
Another seemingly popular topic is visualizing data mining models. Any takers for facilitating this one?
From: liana.ydisg@juno.com
Hi, Greg,
I need to add another item to tagging on Financial products.
Thus
Meet 8: Structured Tags(How to tag an AI, data mining or financial methods)
Liana Ye
SAS will be there. We can provide a workshop on “A Tour of SAS Analytics”
interested in knowing from experts and professionals what programming tools are most commonly used in data mining industry.
Tag an AI, data mining or financial method
To me AI means human inquiries to self, to model self, to build Web03 to ensure peace on earth, thus I propose a session on how to turn Semantic Web http://en.wikipedia.org/wiki/Semantic_Web and Word-net http://wordnet.princeton.edu/ into Web03 (Just a name at this time).
Throwing out a rock to induce gems. I propose to build structured tags to index on Web content via URL and lend content handles to semantic web to establish finer relationship among contents.
Structured tags is just a way to put content into silos. What these tags represent is crucial for common reference.
After all the works from the past decade, I see Metadata agreement on its way, and a layered convention will soon to come for Web03 (as I have discussed with participants at Enterprise Data 2010 http://edw2010.wilshireconferences.com/).
1. An AI method is an ad hoc method. Anyone with programming skill should be able to create and play and test. The test may be as large scale as the selling of derivatives (http://www.nytimes.com/1996/09/30/news/30iht-deriva.t.html?pagewanted=1) to the world, which contributed to the recent financial meltdown.
Questions: What kinds of fields should be tagged on an AI method? When, where, why, who and how?
Should we be able to create a movie script out of AI tags?
Does hacking belong to AI?
2. Data mining methods are a part of AI, as they test out a theory in our mind and find proof of the theory that can be qualified as a standard method and be useful again and again. Thus they must have a well understood range and limits, such as a neural net.
Questions: Do we need the same set of fields like an AI method tag?
What are these fields?
How do we define an application when it is out of predefined range?
Do we need someone like a patent office to tell us the application is out of its predefined range, or will an expert system be fine?
3. A financial method is a legal issue. It assumes all the implication of a method is well understood, thus has been implemented into standards. Any violation of these rules will be punished either as a stolen ID, or default on shipping order, or a lost to the stock market meltdown. Our system used to be designed around these rules, and we do have an extended library of these rules as an insurance company has.
Questions: How can we insure high quality of data? Do we put hackers in jail? or some automatic monitoring scheme like a clock telling us something is wrong?
Do we need built-in data set independence within a private cloud?
Can we build a private cloud that puts control into local community hand, regardless it is a Red China or a Google China? http://en.wikipedia.org/wiki/Google_China
Do we need independent data monitoring over this legal system?
How can a rule be detected or reverse engineered, such as the subprime problem? http://en.wikipedia.org/wiki/Subprime_mortgage_crisis
4. These system rules will be in and out of libraries. Different systems from different communities will have conflicts of these rules once we can refer to these metadata and communicate and resolve the differences for an exchange to happen. Such a conflict resolution mechanism can be built into a private cloud, called Tag Voting.
Questions: Which voting schemes are suitable for Tag voting?
A voting process must be peaceful and not deprive anyone’s freedom of speech, with either words, images or even bullets.
Tag voting has to be an educational process.
Tag voting is a systematic way on conflict resolution, it is a must feature on Web03.
5. A tree structured tag system is on http://www.PeaceNames.com for discussion.
Questions: Is OWL (http://www.w3.org/TR/owl-features/) sufficient to expand into structured tags?
Can OWL lend to parallel processing?
What other tag structures exist or not exist?
For those who can’t attend in person, watch live streaming or stream your live video by RSVP’ing at http://vokle.com/events/1547 (biomedical data mining) and http://vokle.com/events/1728 (other live & on-demand topics). Vote for topics or add your own at: http://urtak.com/u/719
Irene Gabashvili, Entrepreneur, Innovator, and Educator
I’m interested in methodologies (e.g. SVM/RVM, neural networks, LWPR) as they apply to market microstructure and high frequency data sources.
SESSION PROPOSAL:
TITLE: Forecasting unbalanced data sets (targets 1% or less) for fraud, internet advertising or other industries.
FORMAT: Round Table Discussion
LEVEL: Experienced+ (you know what stratified sampling is, you are experienced on several predictive algorithms)
SESSION LEADER: I can lead, unless someone else wants to.
(Greg Makowski)
BACKGROUND SUBJECT TAXONOMY:
Predictive
issues of forecasting 1 in 10,000
(in fraud, not all fraud records are labeled correctly)
(semi-supervised learning)
(train one model
Outlier Detection
think of a progression
multivariate outlier detection
go from one cluster to many clusters
go from many clusters to granular segments w/ diff fields
I can post paper links later.
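For readers new to the rare-target problem in this proposal, one common starting point is to keep every rare positive and down-sample the majority class, remembering the sampling weight so that predicted rates can be corrected afterwards. The following is only a minimal sketch of that idea, not the session leader's method; the record layout, the `is_fraud` field name and the 10-to-1 ratio are assumptions made for illustration.

```python
import random

def downsample_majority(records, label_key="is_fraud", neg_per_pos=10, seed=0):
    """Keep every rare positive; sample negatives down to neg_per_pos per positive.

    Returns (sample, neg_weight), where neg_weight is the number of original
    negatives that each kept negative stands in for.
    """
    rng = random.Random(seed)
    positives = [r for r in records if r[label_key] == 1]
    negatives = [r for r in records if r[label_key] == 0]
    n_keep = min(len(negatives), neg_per_pos * len(positives))
    kept = rng.sample(negatives, n_keep)
    neg_weight = len(negatives) / n_keep if n_keep else 1.0
    sample = positives + kept
    rng.shuffle(sample)
    return sample, neg_weight

# Hypothetical toy data: 1 labeled fraud record per 1,000 rows.
records = [{"id": i, "is_fraud": 1 if i % 1000 == 0 else 0} for i in range(20000)]
sample, neg_weight = downsample_majority(records)
print(len(sample), neg_weight)  # 20 positives + 200 sampled negatives
```

A model trained on `sample` sees a target rate of roughly 9% instead of 0.1%; `neg_weight` is what you would use to scale predictions back to the original base rate.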
SESSION PROPOSAL:
TITLE: Intro to Data Mining
FORMAT: Teaching from PPT, lots of Q&A
LEVEL: New to data mining, numerically literate
SESSION LEADER: Greg Makowski
OUTLINE:
Objectives for 50 Minutes
From Scatter Plots to Analysis
Families of Analysis Engines (games & golf clubs)
How to recognize and assess a project
Knowledge Discovery in Databases (swing)
Exploratory Data Analysis & Preprocessing
Model Building, Evaluation, Description
Start Today – Next Actions!
SESSION PROPOSAL:
TITLE: Collaborative Filtering Review
FORMAT: Teaching from PPT, lots of Q&A
LEVEL: intermediate+ (see reading list and judge)
(We are sharing a summary of readings)
SESSION LEADERS: Greg Makowski, Tricia Hoffman
OUTLINE:
1 Netflix $1,000,000 competition was won
2 References & Reading
3 History & Basics of Collaborative Filtering
4 Newer Matrix Methods, Singular Value Decomposition (SVD)
5 SVD Variations
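As a rough illustration of the matrix methods in items 4 and 5, here is a minimal sketch of latent-factor matrix factorization trained by stochastic gradient descent, the approach often called "SVD" in the Netflix Prize literature. It is not the presenters' code; the toy ratings, function names and hyperparameters are all assumptions chosen for illustration.

```python
import random

def factorize(ratings, n_users, n_items, k=2, lr=0.02, reg=0.02, epochs=500, seed=0):
    """Learn user factors P and item factors Q so that P[u] . Q[i] approximates r."""
    rng = random.Random(seed)
    P = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - sum(P[u][f] * Q[i][f] for f in range(k))
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step on squared error with L2 regularization.
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q

def predict(P, Q, u, i):
    return sum(pf * qf for pf, qf in zip(P[u], Q[i]))

def rmse(ratings, P, Q):
    return (sum((r - predict(P, Q, u, i)) ** 2 for u, i, r in ratings) / len(ratings)) ** 0.5

# Hypothetical (user, item, rating) triples; only observed ratings are listed.
ratings = [
    (0, 0, 5), (0, 1, 4), (0, 2, 1),
    (1, 0, 4), (1, 1, 5), (1, 3, 1),
    (2, 2, 5), (2, 3, 4), (2, 0, 1),
    (3, 2, 4), (3, 3, 5), (3, 1, 1),
]
P, Q = factorize(ratings, n_users=4, n_items=4)
print("training RMSE:", rmse(ratings, P, Q))
print("predicted rating for user 0, item 3 (unobserved):", predict(P, Q, 0, 3))
```

Unlike a classical SVD, this only fits the observed entries, which is why it suits sparse rating matrices; the Koren papers in the reading list add biases, implicit feedback and temporal terms on top of this basic form.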
RELATED READING:
High Level Reading
Programming Collective Intelligence by Toby Segaran. The 2nd chapter gives a good introduction to collaborative filtering with Python examples (non-SVD).
Matrix Factorization Techniques for Recommender Systems, Yehuda Koren; Robert Bell; Chris Volinsky, IEEE Computer, 42(8), 2009
Singular Value Decomposition (SVD) Reading
The Singular Value Decomposition, by Jody Hourigan and Lynn McIndoo, Linear Algebra – Math 45. http://online.redwoods.edu/INSTRUCT/darnold/LAPROJ/Fall98/JodLynn/report2.pdf w/ Matlab & image examples
Numerical Recipes, 3rd Edition, Press et al., 2007, pp. 65-75.
Collaborative Filtering Reading
See papers on research.yahoo.com/Yehuda_Koren
Collaborative Filtering for Implicit Feedback Datasets, Yifan Hu; Yehuda Koren; Chris Volinsky, IEEE International Conference on Data Mining (ICDM 2008), IEEE, 2008
Factorization Meets the Neighborhood: a Multifaceted Collaborative Filtering Model, Yehuda Koren, ACM Int. Conference on Knowledge Discovery and Data Mining (KDD’08), 2008
Collaborative Filtering with Temporal Dynamics, Yehuda Koren, KDD 2009, ACM, 2009
[...] now, on a Saturday, I’m off to the ACM Data Mining Camp, hosted at eBay’s north [...]
Thanks for attending the DIMRED sessions. Here is the information I captured during the session; please use it as a reference. I have also added my LinkedIn profile if you want to connect for more information.
.luca
–
ME: http://www.linkedin.com/in/lucarigazio
HLDA / HDA:
http://www.clsp.jhu.edu/ws97/acoustic/reports/SpCo98HDA.pdf
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.26.7481&rep=rep1&type=pdf
LDPP: http://prhlt.iti.es/seminars/2008/Villegas08c.pdf
CORE VECTOR MACHINES: http://www.cse.ust.hk/~ivor/cvm.html
SPARSE PROJ SQ: http://www.public.asu.edu/~jye02/Software/SLEP
Mike Mahoney's page (information about random projection)
http://cs.stanford.edu/people/mmahoney
[...] was my second time ACM SF Bay Data Mining camp. The first thing to notice – it was big, 400 people or so hosted by eBay in San [...]
Hi,
I posted my presentation on the feature selection problem on my website. It has links to the two software packages I mentioned.
The slides are at:
http://www.junlinghu.com/feature_selection.pdf
Irene Gabashvili, PhD, has posted a live Vokle event on Biomedical Data Mining capturing this event! Thanks, Irene.
http://vokle.com/events/1547
Check it out!
Andraz created a Zemanta Blog Post
http://www.zemanta.com/fruitblog/data-mining-camp-report/
Here is a quote,
“Initial expert panel was highly acclaimed. The guys were really direct, honest and interesting. A few quotes:
* Joseph B. Rickert: “Size of your data and time you have to analyze it are reversely proportional”
* @hughewilliams: “Data miner vs Statistician skills – ability to write code”
* Ted Dunning’s preconditions for data mining: 1) data exists 2) someone benefits in a concrete way”
Thank you Andraz for the compliment!
Thanks to everyone who attended and made this event possible. We at eBay were glad to have you.
Steve
Some of the buzz from Twitter
mateuszb @PatriciaHoffman it was really nice to be there. Good job and looking forward to attending these more often!
DataJunkie @PatriciaHoffman Sure! Will post an entry on my blog as well. Thanks for a great event! Will be back!
ajlopez RT @patriciahoffman @NextGenCMO: Big data is red hot@andraz: jobs… – NetFlix, Ebay Hadoop/data mining jockey can land a job in seconds.
andraz @patriciahoffman I don’t know any experts in sentiment analysis personally. they are elusive. Would like to ask them many things
#dmcamp
RT @Psyllo: @PatriciaHoffman Loved the panel and the un-conference format. #DMCAMP was a success IMO
@littleidea: @PatriciaHoffman Days like #DMCAMP I wish I lived in the Bay area, I won’t make this one, but will follow along at home.
DeeMcCrorey @patriciahoffman Happy 2 help out! Thx 4 the share…I’ll check out Avinash Kaushik on web analytics. Another gd find is @massimopaolini
aurametrix @PatriciaHoffman thank you. #DMCAMP should be a success http://twtvite.com/DMCAMP2010
@sailur: Latest dicovery – RHIPE using R over HADOOP #DMCAMP
@andraz: Clem Wang formerly Yahoo spam guy now at Bing annotated most of the learning sets by himself, not trusting crowdsourcing #dmcamp
@mateuszb: #DMCAMP Interesting. Facebook dropped Cassandra for inbox search and hired HBase person to switch. (as reported by @cwensel on stage)
@NextGenCMO: Hot topics at #dmcamp include #mahout, #hadoop, #cloud, click stream analytics for tab browsing, bio informatics … Surprise, surprise.
DataJunkie When to start using Cascading? “Write one map/reduce application, throw it away, and then start using cascading.” Via @cwensel #DMcamp
sailur Done! Nice survey of state of practice! Thank you ACM, EBay and all other gold sponsers!! #DMCAMP
@andraz: Ted Dunning’s preconditions for data mining: 1) data exists 2) someone benefits in a concrete way #dmcamp
@bobpage: This expert panel at (free) #dmcamp is better than many big (paid) conference panels. No hawking, just good expert info.
raymondmccauley Lots of good genomics questions at ACM Data Mining Camp – not just for web metrics anymore #dmcamp
dws Refreshingly high quality, agenda-free panel at #dmcamp. Kudos to whoever who put the panel together. But please fix the mic feedback.
#DMCAMP Who benefits from data from smart meter OR from rare event?
#DMCAMP Expert Panel answering lots of questions including one on “smart meters” coming on electric meters
#DMCAMP Stanford Professor Dr.Michael Walker speaking about personalized medicine and using data mining
#DMCAMP Greg Makowski talking on fraud detection…Break out groups coming up
#DMCAMP Dr Mike Bowles speaking at Data Mining Camp ion how to use data mining and opportunities
Here at eBay for Data Mining Camp … Expert panel up Dr.Tricia Hoffman speaking
Here at eBay for Data Mining Camp … Lots of people here … Excitement building
cwensel @mateuszb lurking clickstream. Leading hadoop next.
mateuszb At large text classification session at #dmcamp
Biomedical #DMCamp is online now: http://vokle.com/events/1547
@andraz – great comments. What session are you attending now? could you join us for semantic discussions?
Now at #dmcamp: Semantic Technologies & Data Mining onside: Fireside C, online http://vokle.com/events/1547
Discussing Collective Intelligence in Action Book http://is.gd/aQKol at #dmcamp, Fireside B
Another difference between statistics and data mining: grant size – thousands vs millions @ted_dunning at #dmcamp
Data miner vs Statistician skills – ability to write code (@hughewilliams at #dmcamp Expert Panel)
http://www.flickr.com/photos/marstein/sets/72157623536442495/ Check out the photos!
My Experience at ACM Data Mining Camp #DMcamp
http://www.bytemining.com/2010/03/acm-data-mining-camp-dmcamp/
More reports on the conference:
Biomedical
1)http://aurametrix.blogspot.com/2009/11/biomedical-data-mining-dimensionality.html
R revolution
2)http://blog.revolution-computing.com/2009/11/thoughts-on-the-sf-bay-data-mining-camp.html
3)http://blog.revolution-computing.com/2010/03/acm-data-mining-camp-march-20.html
KDnuggets
4)http://www.kdnuggets.com/2009/11/b-sf-data-mining-camp-report.html
Andraz
5)http://www.zemanta.com/fruitblog/acm-data-mining-camp-silicon-valley-report/
Analytic Bridge
6)http://www.analyticbridge.com/main/search/search?q=Data+mining+camp
Mining My Data Mining Camp impressions:
http://aurametrix.blogspot.com/2010/03/mining-data-mining-camp-impressions.html
Hello,
Thanks for the interest in slides. Sorry I didn’t post some things sooner. Other priorities came up and I had to travel on short notice to help my mom. To see what was up, follow this link – any support is appreciated.
http://pages.teamintraining.org/los/rnr10/jwilsonbli
http://www.theoaklandpress.com/articles/2010/03/26/obituaries/1561629.txt
Here are some of the presentations from the Data Mining Camp on 3/20/2010:
* Intro To Data Mining (… or what is it all about anyway?)
https://docs.google.com/fileview?id=0B9w6dFrP3862ODYxZjk4ODktM2ZlOS00MDQxLWExNTYtYmZlODQxNTViMWI3&hl=en
* Collaborative Filtering – Review of Koren Papers
https://docs.google.com/fileview?id=0B9w6dFrP3862NzQxYjYxMGEtYTI2MS00NTkxLWEwMzQtZThmYzY4NmQ5NzU0&hl=en
* Salford Training – How to Win Data Mining Competitions
https://docs.google.com/fileview?id=0B9w6dFrP3862MDQ5NTk0ZGUtZTk3NS00NTM5LWI2NTktMDkzMmEwYzhjMTU3&hl=en
* Data Mining Camp 2010 03 20 – Session Matrix
https://docs.google.com/fileview?id=0B9w6dFrP3862NzY1ZjcxYTItMDM1Mi00OTNkLWI1MDMtMjg0MDAwMzJiZGYy&hl=en
Thanks for your interest – we hope to start discussing plans for our next Data Mining Camp.
Greg Makowski
First, thanks for inviting me to the huge event on Saturday.
It was better than SXSW in Austin!
The ACM crowd was smarter and funnier.
The session on “Natural Language Processing” was attended by more than 20 people, and related problems were also discussed at other sessions, e.g., on “Sentiment Analysis”. Hopefully, future (un)conferences will bring more discussion and a better understanding of the problems.
Parts of the discussion, as recorded by our volunteer scribe, Joan A. Hoenow, are below, and my own comments are at the end.
=====
Question raised: Can you do NLP in a reasonable amount of time? Suppose you have millions of documents.
Response: NLP will not be fast at first but can be improved.
Comment: Automatic speech recognition systems have been getting better. They are more able to deal with unrecognized words.
Response: Maybe speech recognition in phone answering systems won’t be the model that we want.
Comment: there is more interest in web search.
Comment: Dictionaries, and elaborate grammars are developed, but these are not part of “understanding the meaning”.
There are elaborate syntactic processors, but they are strictly syntactic and too slow for large amounts of text. They are based on a strict theoretical interpretation that did not include semantics. Adding semantics would reduce the amount of work, but this is not in the academic tradition.
Question: What do you do about the inherent ambiguities? The machine doesn’t know about the Jaguar car vs. the jaguar animal.
Comment: Machines now do statistical processing. Google has parallel “facts”.
Comment: What is natural language processing? Powerset, acquired by Microsoft, is developing a natural language search engine for the Internet. Will it be able to distinguish in a search for “children’s books” at least between the 3 categories: 1) books for children; 2) books owned by children; 3) books written by children?
The group discussed two approaches to natural language processing:
1) set up a structure, knowing the concepts, and program the understanding from human understanding.
2) machine self learning, feed bunches of text, find patterns, significance of patterns. Set up a system that evolves understanding. Reference made to Monica L. Anderson of Syntience Inc. (not in attendance) as a proponent of machine learning. The Syntience Inc. web site proposes “Artificial Intuition”.
In discussion of either approach, the question was raised “What is understanding?”
Comment on machine learning approach: imagine intercepting signals from another civilization, or obtaining text from a former civilization. How would you recognize anything of significance? What quantities can I use?
Response: Why refuse to know the language?
Comment: It is not being refused, but possibly I don’t have it. For example, you may be indexing geology and you may not know it. There are advantages to being able to do some of this work while being completely ignorant of the jargon. For example, ‘set’ means something different in math than in other fields, and differing usage of the same word in different disciplines is common.
Comment: before such an interesting thing as discovering meaning from an unknown civilization, can we understand our own?
Other applications: Trying to make a computer do categorization, give an idea what the text is about. An example is a sales ad, perhaps on eBay. Can the ad be processed to be placed in the correct category?
Comment: Note the difference between two things in NLP
1) make decisions based on statistics
2) the ambition to understand the meaning completely even in a restricted domain.
Comment: Suggest this approach. If a compendium has a glossary, you either already have the word with a detailed description or, for a missing entry, one needs to be created.
Comment: Back to the two approaches. Are these really different problems or is it just a matter of how much computation is needed?
Comment: When a human learns, there is a lot of negative feedback. Would it be possible for the natural language processing to have an “I don’t know” and get human feedback?
Comment: There’s ambiguity, but you can have the computer report a confidence interval, for example in part-of-speech tagging. The problem is that the training set has to be very similar to the text.
Everyone is talking about using Wikipedia. Wikipedia knows lots of things that most humans don’t care about, obscure items. This is not a good training set. There is no common sense.
Comment: Maybe there could be an NLP Wikipedia.
Comment: The problem we have when we talk to linguists is that understanding and meaning are not things we can define and measure. If we can measure what is there, we can build graphs.
Comment: Semantic etymology can be done. There is some dispute whether word stemming is really useful, for example, ‘pant’ and ‘pants’ have the same stem but this is not helpful.
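The ‘pant’ / ‘pants’ point can be reproduced with even a very crude stemmer. The sketch below is a deliberately naive suffix stripper written for illustration only (it is not the Porter stemmer or any tool mentioned in the session); it shows two words with unrelated meanings collapsing to the same stem.

```python
def naive_stem(word):
    """Strip a few common English suffixes; a crude stand-in for a real stemmer."""
    word = word.lower()
    for suffix in ("ing", "es", "s"):
        # Only strip when at least three characters of stem remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# "pant" (to breathe heavily) and "pants" (trousers) end up indistinguishable.
print(naive_stem("pant"), naive_stem("pants"))
# Here the merge is helpful: singular and plural of the same noun.
print(naive_stem("book"), naive_stem("books"))
```

Whether the merge helps or hurts depends on the application, which is the dispute the comment refers to.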
Comment: Is there intrinsic informational content in a sequence of words? “The quick brown fox jumps over the lazy dog”. Which words are significant? Can you recognize the subject, the predicate, the object of the predicate?
Comment: English syntax doesn’t show the exact roles for verbs with multiple objects, e.g. “I give him the book”.
Comment: The question is “What do we want to do with it?” What is the context? Imagine this: we are 20 people, each going to sell a similar laptop or Camaro on eBay. It is likely we would have 20 different descriptions. If we have time, we can get a good categorization. eBay has some words that are unique to its environment. Some categorization works well, some not so well. In trying to connect the buyer and the seller, it is important that the seller can describe the item, and many sellers do not know how to do this. How can eBay help the seller? And how can the buyer’s search get the correct items?
Comment: eBay does have the advantage of some feedback, times when a buyer searched, located an item, and purchased it. Perhaps these can be used in some way.
Further comments: There is a tradeoff with having metadata. It captures the most relevant features; however, each category would have different relevant features.
Comment: Some clickthrough analysis can be useful.
Comment: This is just one case. We know that we want understanding. How can we make more progress? In general, it seems to require a technique for understanding relationships between concepts in various texts and then applying it to more domains.
Comment: For people really trying to solve a problem, besides just talking about general understanding, what do you do in the real world where it is so messy? And there are issues of scaling.
Comment: Another application is trying to understand content on the blogs. Zemanta Ltd. is trying to understand blog content to add appropriate images. Can we improve the experience of grazing content?
Comment: Search is good if you know what you are looking for. With a large number of documents, it may not be that useful. With large data, and electronic discovery systems, how do you determine “what is the content?”.
Then some of the participants mentioned their own work and interests; some of these are listed below.
-multiple context in the word. stemming.
-Adobe optimization for serving content. using probability.
-adding semantic filtering after syntactic parsing to an online tutoring system analyzing school students’ production in the course of “Language Arts and Writing”.
-how do you learn as you go, and optimize as you go, and how it scales, and how it can be extended to other languages?
-twitter analysis, web spam (at Yahoo!).
=======================
My comments (Gregory S. Tseytin)
Many commercial applications need to process “unstructured” contents presented in natural language. Clearly what they need is the meaning of the texts, but, paradoxically, they avoid by all means the language’s methods of representation of meaning and instead rely on various statistical approaches (which, of course, look more accessible from the start). As a result, NLP has acquired a reputation of being inherently inaccurate. Indeed, I saw a blurb from a company maintaining a database of facts in cell biology. They were saying that their database was very well curated because they DIDN’T USE NLP. To overcome this perception, should we come up with a different name for the field?
Is it possible to access the meaning of natural language texts by analyzing the sentences, discovering the relationships between the words and the underlying concepts? It is not an easy task, but in academia there are elaborate formalized grammars and extensive dictionaries showing semantic relationships. The problem, I think, is that academic research follows its own traditions, striving more to verify neatly presented theoretical ideas (even if they don’t cover all of the available linguistic material) than to satisfy practical needs. Also, given the scarcity of research grants, it is safer to adhere to well-established approaches. One obvious example of self-imposed theoretical limitations is the development of syntax processing without reference to semantics, even though using semantics would dramatically reduce the number of variants to be considered.
Well, semantics probably cannot be defined as a single universal concept, and each application has its own requirements. But this is what language is like: beyond the morphology and syntax, it is more like a loose collection of various application fields sharing common mechanisms from syntax, rather than a well-defined single system. We should learn to live with it, and learn to apply the knowledge of syntactic mechanisms to the various applications. Can we find a way to overcome this disconnect between academic studies and business applications?