ACM Data Mining Camp, March 20, 2010

Posted March 20th, 2010 by GregMakowski and filed in ACM Meeting, Conference, DM SIG Meeting

The event was a success – close to 400 people attended and used the day to talk about data mining.

WHAT is an UNCONFERENCE or CAMP?

An unconference is an event where users suggest topics, get together and discuss them in detail. This camp is focused on Data Mining, Analytics, Cloud Computing and the various applications of these technologies. There is an option to join the SF Bay ACM for $20 per year. Our last Data Mining Camp had 225 participants.

DONORS:

LOCATION:

  • eBay, 2161 N First St, San Jose, CA 95131 (free parking)
  • The auditorium seats 420 people and 10 other rooms are available for break out sessions.
  • If you plan on connecting a Mac to a projector, please bring your connector cables.

REGISTER / RSVP / SEE WHO IS COMING:

We have 478+ participants who have said they are coming, with a total of 560 RSVPs!

(Thursday, 3/18/2010)

SCHEDULE for Saturday, March 20th, 2010:


  • FREE Data Mining Camp (11:15 – 7:30pm) (beverages & snacks included with RSVP)
    • 11:15-Noon Arrive, register, network, brainstorm session topics
    • Box lunches provided by eBay
    • Noon Unconference welcome, 5 min / donor, hiring announcements
    • 12:50 Expert Panel Questions and Answers
    • 1:40 Audience members suggest a topic, get show of hands for interest, select session room size by interest level, select time slot for session. We recommend sessions have a leader, blogger / note taker, and a timer so we can leave everybody 10 minutes to get to the next session. See the last SESSION MATRIX for example.
    • 2:30 SESSION break out time slot 1
    • 3:30 SESSION break out time slot 2
    • 4:30 SESSION break out time slot 3
    • 5:30 SESSION break out time slot 4
    • 6:30 Share summary of sessions over pizza and salad (Thank you Donors!)
      • Door prize drawings
    • 7:30 Thank you and wrap up.

EXPERT PANEL:

Video
Moderator: Patricia Hoffman, Ph.D. Scientific Researcher, Aha Solutions !

  • Ted Dunning, Ph.D. Chief Technology Officer at DeepDyve

    DeepDyve provides technical literature to a wide audience. Dr. Dunning created MusicMatch’s recommendation system and revenue optimization system. He also created ID Analytics’ large-scale identity fraud systems.

  • Joseph B. Rickert, Revolution Computing

    Mr. Rickert performed statistical analysis of clinical trials and built economic models for Cedar Associates. He founded Scotts Valley Instruments. He started his career by building mathematical models of communication networks for NASA, CIA, and NSA.

  • Giovanni Seni, Ph.D. Elder Research Inc. and Professor at Santa Clara University

    Dr. Seni has lectured on “From Trees to Forest and Rule Sets – A Unified Overview of Ensemble Methods”. His text, “Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions”, is now available. He teaches Pattern Recognition and Data Mining at Santa Clara University. His research interests include statistical pattern recognition, data mining, and human-computer interaction applications. He holds five U.S. patents and has published over twenty conference and journal articles.

  • Michael Walker, Ph.D. Professor at Stanford University and President of Walker Bioscience

    Dr. Walker provides statistics consulting for pharmaceutical and biotechnology companies in the areas of drug development, medical devices, clinical trials, and molecular diagnostics. His work has led to FDA and CLIA approvals, articles in the New England Journal of Medicine and in Lancet, numerous patents, and new products to diagnose and treat disease. He consults for venture capital companies to evaluate investments in life sciences and works with life science start-ups to obtain funding and advise on corporate strategy and product development.

  • Hugh Williams, Ph.D. Vice President, Search Engine Engineer, Buyer Experience Development eBay

    Dr. Williams is an innovator, inspiring leader, and expert in search engines and web services. In the past he managed a large R&D team at Microsoft’s Bing. He has published 99 works including two books: “Web Database Applications with PHP and MySQL” and “Learning MySQL” for O’Reilly Media Inc. He holds 2 US patents and has 25 patent applications in the works.

  • Mike Bowles, Ph.D. Seasoned in Startups, Data Mining and Quantitative Finance

    Dr. Bowles was founder and Chairman of the Board at iBeam Broadcasting, and founding CEO at Com21. He is experienced with quantitative finance and fully automated quantitative trading.

  • Greg Makowski, Principal Consultant at Golden Data Mining

    Greg has deployed 70+ data mining models since 1992 over 3 continents, including architecting enterprise software or web systems with embedded data mining.  He is experienced in financial services, targeted marketing, retail supply chain, internet advertising and start-ups. He is currently consulting for Wells Fargo working on fraud detection using tools like SAS Enterprise Miner.

ACM Data Mining Camp: Expert Panel Discussion with Q&A
Moderated by Dr. Patricia Hoffman

The expert panel includes: Dr. Ted Dunning, Mr. Joseph B. Rickert, Dr. Giovanni Seni,
Dr. Michael Walker, Dr. Hugh Williams, Dr. Mike Bowles, and Mr. Greg Makowski

Video Link

SESSION TOPICS:

Sessions may include one or more focus areas, such as…

  • Experience level: beginners to experts
  • Algorithms: Forecasting, clustering, text mining, sentiment, network analysis, collaborative filtering, fraud
  • Verticals: Internet advertising, social networks, targeted marketing, financial services, medical, genetics, green tech, space science, mobile devices, startups, Netflix $1,000,000 prize
  • Tools/Processes: Commercial, public domain, libraries, in SQL, in cloud, project or product management, SalesForce.com plug-ins, CRM software plug-ins
  • User Groups: R, SAS, Salford Systems, Hadoop, Mahout
  • Help me: I am stuck on… I need guidance… How do you…? (but suggest topic of general interest)
  • Participate in Our Data Mining Blog: Find birds of a feather, invite participants in a session, suggest or plan session ideas, update during the session, share in the summary of sessions or add a job posting

GOLD UNDERWRITERS:

eBay

  • Founded in 1995, eBay Inc. is the worldwide leader in shopping and payments on the web. Every day, we connect hundreds of millions of buyers and sellers — and we continue to find new ways to help people do business around the world.

Salford Systems

  • Check out how our award-winning products can help you predict the future of your business NOW! Need to determine your ROI on your data mining project quickly and cost-effectively? Consider our Rapid Response Data Mining Center.

LinkedIn

  • Over 60 million professionals use LinkedIn to exchange information, ideas, and opportunities.

KXEN

  • Increase Business Performance with Automated Customer Lifecycle Analytics with KXEN. KXEN is the leading provider of automated data mining software and customer analytic solutions for retail, communications, media, financial and marketing services companies to improve their customer insight and enhance corporate performance. KXEN solutions integrate predictive analytics and social network analysis into business processes to boost marketing campaign results and profitability.

REvolution Computing

PROMOTION AND PROMOTION PARTNERS:

BLOG / TWITTER:

VIDEOGRAPHER / PHOTOGRAPHER:

  • Ron Fredericks will be recording and web hosting sessions
  • Blog Post for this event
  • Examples: Webinars, Internet TV, event capture, web hosting, and DVDs
  • Tag line: “Video production for exceptional people, brands, and products”
  • Web site and video hosting: www.LectureMaker.com
  • Phone: 408-390-1895
  • Here is the link to my video blog post for your event: http://www.lecturemaker.com/2010/03/acm-data-mining-camp/
  • coordinate recording your session – ronf@lecturemaker.com

ONE PAGE EVENT ANNOUNCEMENT:

ORGANIZERS:

CALL FOR VOLUNTEERS:

  • In Advance:
    • Contact Mike Bowles, and ask how you can help!
    • marketing:  Help announce and publicize to analytic crowds, groups, technical talks
    • marketing:  tweet about the event
    • marketing:  announce to analytic contacts at your company or in your network
    • marketing:  add the event web page to your email signature
      Come to the ACM Data Mining Camp, Sat, March 20 in San Jose. See the current RSVP list
    • feedback forms:  Develop, get organizer review
    • feedback forms:  print
    • session topics:  suggest ideas for session topics on the moderated blog (below)
    • video:  provide video editing training (optional)
    • video:  receive video editing training, offer to help edit video
    • video:  technical plans on posting the video
  • Early morning or day before:
    • signs:  put up maps of complex and signs at the facility
  • Day of the Event:
    • coffee: help setup
    • registration:  gather emails, pass out name tags
    • registration:  bring portable for on-line registration, to allow people to pay to join ACM
    • sponsor support:  help as requested
    • food and beverage:  distribution when delivery arrives, resupply, keep neat
    • gofer:  many last minute things come up, go-fer this, then go-fer that
    • session matrix:  record the matrix of session titles per time slot and room
    • session matrix:  post to blog, email to organizers
    • session matrix: print ~dozen copies
    • session matrix: distribute session matrix around the session rooms
    • session content: take notes during sessions, add to the blog (below)
    • session content: encourage others to cover sessions, try and get all covered
    • tweet!
    • session timing: help with timing of 50-minute sessions, giving 5-minute and 2-minute announcements
    • video:  bring and operate video equipment (optional)
  • After the Event (same day):
    • Help clean up food areas
    • Collect feedback on improving the event in the future
    • Pick up event specific signs
  • After the Event (later):
    • video:  editing, organizing
    • video:  web posting
    • tweet!
    • blog:  add to your blog or our blog, post links
    • Help announce and publicize to analytic crowds
    • Put up maps and signs at the facility

Session Notes:
The session on “Natural Language Processing” was attended by more than 20 people, and related problems were also discussed at other sessions, e.g., on “Sentiment Analysis”. Hopefully, future (un)conferences will bring more discussions and a better understanding of the problems.

Parts of the discussion, as recorded by our volunteer scribe, Joan A. Hoenow, are below, and my own comments are at the end.

=====

Question raised: Can you do NLP in a reasonable amount of time? Suppose you have millions of documents.
Response: NLP will not be fast at first but can be improved.

Comment: Automatic speech recognition systems have been getting better. They are more able to deal with unrecognized words.
Response: Maybe speech recognition in phone answering systems won’t be the model that we want.

Comment: there is more interest in web search.
Comment: Dictionaries, and elaborate grammars are developed, but these are not part of “understanding the meaning”.
There are elaborate syntactic processors, but they are strictly syntactic, too slow for big amounts of text. They are based on a strict theoretical interpretation, and adding semantics was not included. Adding semantics would reduce the amount of work, but this is not in the academic tradition.

Question: What do you do about the inherent ambiguities? The machine doesn’t know about the Jaguar car vs. the jaguar animal.

Comment: Machines now do statistical processing. Google has parallel “facts”.
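
As a rough illustration of the statistical processing mentioned above, the jaguar ambiguity can be approached by counting which context words appear near the ambiguous term. The sketch below is only a toy: the cue-word lists and the function name `disambiguate` are made up for this example, and a real system would learn such associations from a large corpus rather than from hand-written sets.

```python
# Hypothetical cue words for each sense of "jaguar" (hand-picked for illustration).
SENSE_CUES = {
    "car": {"drive", "engine", "dealer", "sedan", "mileage"},
    "animal": {"jungle", "prey", "cat", "spotted", "habitat"},
}


def disambiguate(sentence, cues=SENSE_CUES):
    """Return the sense whose cue words overlap most with the sentence.

    Returns None when no cue word is present, i.e. the context gives
    no evidence either way.
    """
    words = set(sentence.lower().split())
    counts = {sense: len(words & vocab) for sense, vocab in cues.items()}
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None


print(disambiguate("the jaguar has a powerful engine and low mileage"))
print(disambiguate("a jaguar stalks prey in the jungle"))
print(disambiguate("I saw a jaguar"))
```

The third call has no cue words at all, so the function declines to pick a sense instead of guessing.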

Comment: What is natural language processing? Powerset, acquired by Microsoft, is developing a natural language search engine for the Internet. Will it be able to distinguish in a search for “children’s books” at least between the 3 categories: 1) books for children; 2) books owned by children; 3) books written by children?

The group discussed two approaches to natural language processing:
1) Set up a structure, knowing the concepts, and program the understanding from human understanding.
2) Machine self-learning: feed in large amounts of text, find patterns and the significance of those patterns. Set up a system that evolves understanding. Reference was made to Monica L. Anderson of Syntience Inc. (not in attendance) as a proponent of machine learning. The Syntience Inc. web site proposes “Artificial Intuition”.

In discussion of either approach, the question was raised “What is understanding?”

Comment on machine learning approach: imagine intercepting signals from another civilization, or obtaining text from a former civilization. How would you recognize anything of significance? What quantities can I use?
Response: Why refuse to know the language?
Comment: Not being refused, but possibly I don’t have it. For example, you may be indexing geology and you may not know it. There are advantages to be able to do some of this work while being completely ignorant of the jargon. For example, ‘set’ means something different in math than in other fields, and the differing usage of the same word in different disciplines is common.
Comment: before such an interesting thing as discovering meaning from an unknown civilization, can we understand our own?

Other applications: Trying to make a computer do categorization, give an idea what the text is about. An example is a sales ad, perhaps on eBay. Can the ad be processed to be placed in the correct category?

Comment: Note the difference between two things in NLP
1) make decisions based on statistics
2) the ambition to understand the meaning completely even in a restricted domain.

Comment: Suggest this approach. If a compendium has a glossary, you either already have the word with a detailed description or, for a missing entry, one needs to be created.

Comment: Back to the two approaches. Are these really different problems or is it just a matter of how much computation is needed?
Comment: When a human learns, there is a lot of negative feedback. Would it be possible for the natural language processing to have an “I don’t know” and get human feedback?

Comment: There’s ambiguity but you can have a computer confidence interval. For example, to process speech tagging. The problem is that the training set has to be very similar to the text.
Everyone is talking about using Wikipedia. Wikipedia knows lots of things that most humans don’t care about, such as obscure items. This is not a good training set. There is no common sense.
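
The ideas of an “I don’t know” answer and a machine confidence level can be combined in a simple way: let the classifier abstain whenever its best guess is not confident enough, and route those cases to a human. The sketch below assumes the classifier has already produced a non-negative score per label; the function name `classify_or_abstain` and the 0.7 threshold are illustrative choices, not something proposed in the session.

```python
def classify_or_abstain(scores, threshold=0.7):
    """Pick the top-scoring label, or return None ("I don't know").

    `scores` maps each candidate label to a non-negative score.
    The top score is normalized by the total; if that share is below
    `threshold`, the function abstains so a human can give feedback.
    """
    total = sum(scores.values())
    if total <= 0:
        return None  # no evidence at all
    label, best = max(scores.items(), key=lambda kv: kv[1])
    return label if best / total >= threshold else None


print(classify_or_abstain({"car": 9, "animal": 1}))   # confident case
print(classify_or_abstain({"car": 5, "animal": 5}))   # ambiguous case, abstains
```

Raising the threshold sends more items to human review; lowering it accepts more automatic decisions at the cost of more errors.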

Comment: Maybe there could be an NLP Wikipedia.

Comment: The problem we have when we talk to linguists is that understanding and meaning are not things we can define and measure. If we can measure what is there, we can build graphs.

Comment: Semantic etymology can be done. There is some dispute whether word stemming is really useful, for example, ‘pant’ and ‘pants’ have the same stem but this is not helpful.
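
The ‘pant’ / ‘pants’ objection is easy to reproduce with even the crudest stemmer. The sketch below uses a deliberately naive plural-stripping rule (it is not the Porter algorithm or any library stemmer, and the name `naive_stem` is made up for this example) to show how two words with different meanings collapse to one stem.

```python
def naive_stem(word):
    """Strip a trailing plural 's' (a crude rule, for illustration only)."""
    w = word.lower()
    # Leave short words and words ending in 'ss' (e.g. "glass") alone.
    if len(w) > 3 and w.endswith("s") and not w.endswith("ss"):
        return w[:-1]
    return w


for word in ["pant", "pants", "book", "books", "glass"]:
    print(word, "->", naive_stem(word))
```

Merging “book” and “books” is usually helpful for search, while merging “pant” (to breathe heavily) and “pants” (clothing) is not, which is the dispute raised in the comment.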

Comment: Is there intrinsic informational content in a sequence of words, such as “The quick brown fox jumps over the lazy dog”? Which words are significant? Can you recognize the subject, the predicate, the object of the predicate?
Comment: English syntax doesn’t show the exact roles for verbs with multiple objects, e.g., “I give him the book”.

Comment: The question is “What do we want to do with it?” What is the context? Imagine this: we are 20 people, each going to sell a similar laptop or Camaro on eBay. It is likely we would have 20 different descriptions. If we have time, we can get a good categorization. eBay has some words that are unique to its environment. Some categorization works well, some not so well. In trying to connect the buyer and the seller, it is important that the seller can describe the item, and many sellers do not know how to do this. How can eBay help the seller? And how can the buyer’s search get the correct items?

Comment: eBay does have the advantage of some feedback, times when a buyer searched, located an item, and purchased it. Perhaps these can be used in some way.

Further comments: There is a tradeoff in having metadata on the most relevant features. However, each category would have different relevant features (in cars, the model year is important; in antiques, it may be the decade or century that is important, and not as important as other attributes).

Comment: Some clickthrough analysis can be useful.
Comment: This is just one case. We know that we want understanding. How can we make more progress? In general, it seems to require a technique for understanding relationships between concepts in various texts and then applying it to more domains.

Comment: For people really trying to solve a problem, besides just talking about general understanding, what do you do in the real world where it is so messy? And there are issues of scaling.

Comment: Another application is trying to understand content on the blogs. Zemanta Ltd. is trying to understand blog content to add appropriate images. Can we improve the experience of grazing content?

Comment: Search is good if you know what you are looking for. With a large number of documents, it may not be that useful. With large data, and electronic discovery systems, how do you determine “what is the content?”.

Then some of the participants mentioned their own work and interests; some of these are listed below.
-multiple context in the word. stemming.
-Adobe optimization for serving content. using probability.
-adding semantic filtering after syntactic parsing to an online tutoring system analyzing school students’ production in the course of “Language Arts and Writing”.
-how do you learn as you go, and optimize as you go, and how it scales, and how it can be extended to other languages?
-twitter analysis, web spam (at Yahoo!).

=======================

My comments (Gregory S. Tseytin)

Many commercial applications need to process “unstructured” contents presented in natural language. Clearly what they need is the meaning of the texts, but, paradoxically, they avoid by all means the language’s methods of representation of meaning and instead rely on various statistical approaches (which, of course, look more accessible from the start). As a result, NLP has acquired a reputation of being inherently inaccurate. Indeed, I saw a blurb from a company maintaining a database of facts in cell biology. They were saying that their database was very well curated because they DIDN’T USE NLP. To overcome this perception, should we come up with a different name for the field?

Is it possible to access the meaning of natural language texts by analyzing the sentences, discovering the relationships between the words and the underlying concepts? It is not an easy task, but in academia there are elaborate formalized grammars and extensive dictionaries showing semantic relationships. The problem, as I think, is that academic research follows its own traditions, striving more for verification of neatly presented theoretical ideas (even if they don’t cover all of the available linguistic material) than for satisfying practical needs. Also, given the scarcity of research grants, it is safer to adhere to well-established approaches. One obvious example of self-imposed theoretical limitations is the development of syntax processing without reference to semantics, even though its use would dramatically reduce the number of variants to be considered.

Well, semantics probably cannot be defined as a single universal concept, and each application has its own requirements. But this is what language is like: beyond the morphology and syntax, it is more like a loose collection of various application fields sharing common mechanisms from syntax, rather than a well-defined single system. We should learn to live with it, and learn to apply the knowledge of syntactic mechanisms to the various applications. Can we find a way to overcome this disconnect between academic studies and business applications?

50 Responses to “ACM Data Mining Camp, March 20, 2010”

  1. We are privileged to announce that
    Ted Dunning is willing to host a session on the current status of Mahout!

    Mahout’s goal is to build scalable machine learning libraries. Mahout uses Hadoop to support scaling. By scalable we mean:

    * Scalable to reasonably large data sets. Our core algorithms for clustering, classification and batch based collaborative filtering are implemented on top of Apache Hadoop using the map/reduce paradigm. However we do not restrict contributions to Hadoop based implementations: Contributions that run on a single node or on a non-Hadoop cluster are welcome as well. The core libraries are highly optimized to allow for good performance also for non-distributed algorithms.

    Currently Mahout supports mainly four use cases: Recommendation mining takes users’ behavior and from that tries to find items users might like. Clustering takes, for example, text documents and groups them into groups of topically related documents. Classification learns from existing categorized documents what documents of a specific category look like and is able to assign unlabelled documents to the (hopefully) correct category. Frequent itemset mining takes a set of item groups (terms in a query session, shopping cart content) and identifies which individual items usually appear together.

  2. Looking forward to attending – let me know if you need additional speakers or panelists.

    Cheers,
    Ilene

  3. Patricia Hoffman says:

    Dr. Sudhir Kshirsagar is interested in anomaly detection

  4. Patricia Hoffman says:

    Ben Gimpert says,

    “There will hopefully be interest in another finance and trading track, for which I will suggest a few ad hoc talks on the “day of.” Perhaps a bit more technology-focused than the BATSIG meetings. * What domain-specific compromises can we make when crawling financial data? * Machine-learning techniques for a problem domain that is extremely messy, but tolerant to false positives? * Proxies for overall market sentiment. Does a forecast of a particular stock index do a “good enough” job of picking macro tops and bottoms? See you there!”

  5. Aurametrix says:

    Biomedical Data Mining: Successes, Failures, and Challenges.

    In November, we proposed and discussed this topic: Biomedical Data Mining: Dimensionality, Noise, Applications
    http://aurametrix.blogspot.com/2009/11/biomedical-data-mining-dimensionality.html

    This time, let it be Biomedical Data Mining: Successes, Failures, and Challenges.

    For those who can’t attend in person, please do so virtually:
    http://vokle.com/events/1547

  6. Pankaj Deshpande says:

    Is the event going to have any online activity? It will be great if something like proceedings of the event are published on the website or something like webinar which could be bought.

  7. Elinor Velasquez says:

    Could we discuss machine learning using parallel processors?

  8. We can offer a session:

    “Automated Data Mining with KXEN — Bring your data and we will beat your model or we will eat our hat!” by Vincent Vikor, KXEN

  9. Irene Gabashvili says:

    Biomedical Data Mining: Successes, Failures, and Challenges. In November, we proposed and discussed this topic: Biomedical Data Mining: Dimensionality, Noise, Applications http://aurametrix.blogspot.com/2009/11/biomedical-data-mining-dimensionality.html This time, let it be Biomedical Data Mining: Successes, Failures, and Challenges. For those who can’t attend in person, please do so virtually: http://vokle.com/events/1547

    Irene Gabashvili, Entrepreneur, Innovator, and Educator

  10. WhatToDoBay says:

    Thanks for Tweeting!

    aurametrix says Mining Health-related Data: live online event on March 20 http://vokle.com/events/1547 (from ACM Data Mining Camp #DMCAMP)

    WhatToDoBay says Biomedical Data Mining http://ow.ly/1pju2d webcast from San Jose #DMCAMP March 20 http://www.sfbayacm.org/?p=1341

    TweetMeme says
    erninthecity RT @rachelsegal: Lunch at digitalmediacamp was delicious now we’re recapping on the first sessions #dmcamp Agreed, lunch was quite good!

  11. I’d like to see something concerning the applications of evolutionary algorithms.

  12. Ji Fang says:

    Can we discuss sentiment analysis?

  13. After posting information about the “camp”, a dialogue discussing the pros and cons of statistics vs. machine learning has started at:

    http://www.analyticbridge.com/group/timeseries/forum/topics/association-of-computing-5?commentId=2004291%3AComment%3A62172&xg_source=msg_com_forum

  14. Amlan Chatterjee says:

    It’d be nice to have a discussion about how to go about data mining for some engineers who have almost forgotten statistical models :)

  15. [...] SF Bay Area ACM event web page [...]

  16. I plan to attend the camp. I volunteer for a session on algorithms to optimize Cost per Click for events platform.

  17. I am interested in data mining utilizing the cloud (Hadoop, MapReduce, Hive or Pig), for both structured and unstructured data. A real world example/demo with performance metrics would be excellent! :-)

  18. Data mining, machine learning, knowledge discovery in databases and statistical inference are my primary focus. I’m particularly interested in applications of evolutionary algorithms

  19. I’m interested in methodologies (e.g. SVM/RVM, neural networks, LWPR) as they apply to market microstructure and high frequency data source

  20. For a poll on topics for the camp look at
    http://urtak.com/u/719

    So far there have been 130 responses to 24 questions which have been asked!

    6 people want to discuss sentiment analysis or opinion mining. Is there anyone interested in facilitating this topic?

    Another seemingly popular topic is visualizing data mining models. Any takers for facilitating this one?

  21. From: liana.ydisg@juno.com

    Hi, Greg,
    I need to add another item to tagging on Financial products.

    Thus
    Meet 8: Structured Tags(How to tag an AI, data mining or financial methods)

    Liana Ye

  22. Roland Chow says:

    SAS will be there. We can provide a workshop on “A Tour of SAS Analytics”

  23. Neeral Beladia says:

    interested in knowing from experts and professionals what programming tools are most commonly used in data mining industry.

  24. liana Ye says:

    Tag an AI, data mining or financial method

    To me AI means human inquiries to self, to model self, to build Web03 to ensure peace on earth, thus I propose a session on how to turn Semantic Web and Word-net into Web03 (Just a name at this time).

    Throwing out a rock to induce gems. I propose to build structured tags to index on Web content via URL and lend content handles to semantic web to establish finer relationship among contents.

    Structured tags is just a way to put content into silos. What these tags represent is crucial for common reference.
    After all the works from the past decade, I see Metadata agreement on its way, and a layered convention will soon to come for Web03 (as I have discussed with participants at Enterprise Data 2010).

    1. An AI method is an ad Hoc method. Anyone with programming skill should be able to create and play and test. The test may be as large scale as selling of derivatives to the world which contributed to recent financial meltdown.
    Questions: What kinds of fields should be tagged on an AI method? When, where, why, who and how?
    Should we be able to create a movie script out of AI tags?
    Does hacking belong to AI?

    2. Data mining methods is a part of AI. As it tests out a theory in our mind and find proof of the theory that can be qualified as a standard method, be useful again and again. Thus it must have well understood range and limits, such as neural net.
    Questions: Do we need the same set of fields like an AI method tag?
    What are these fields?
    How do we define an application when it is out of predefined range?
    Do we need someone like patent office tells us the application is out of it predefined range, or an expert system will be fine?

    3. A financial method is a legal issue. It assumes all the implication of a method is well understood, thus has been implemented into standards. Any violation of these rules will be punished either as a stolen ID, or default on shipping order, or a lost to the stock market meltdown. Our system used to be designed around these rules, and we do have an extended library of these rules as an insurance company has.
    Questions: How can we insure high quality of data? Do we put hackers in jail? or some automatic monitoring scheme like a clock telling us something is wrong?
    Do we need built-in data set independence within a private cloud?
    Can we build a private cloud that puts control into local community hand, regardless it is a Red China or a Google China?
    Do we need independent data monitoring over this legal system?
    How can a rule be detected or reverse engineered, such as the subprime problem?

    4. These system rules will be in and out of libraries. Different systems from different communities will have conflicts of these rules once we can refer to these metadata and communicate and resolve the differences for an exchange to happen. Such a conflict resolution mechanism can be built into a private cloud, called Tag Voting.
    Questions: Which voting schemes are suitable for Tag voting?
    A voting process must be peaceful and not deprive anyone’s freedom of speech, with either words, images or even bullets.
    Tag voting has to be an educational process.
    Tag voting is a systematic way on conflict resolution, it is a must feature on Web03.

    5. A tree structured tag system is on http://www.PeaceNames.com for discussion.
    Questions: Is OWL sufficient to expand into structured tags?
    Can OWL lend to parallel processing?
    What other tag structures exist or not exist?

  25. liana Ye says:

    Tag an AI, data mining or financial method

    To me AI means human inquiries to self, to model self, to build Web03 to ensure peace on earth, thus I propose a session on how to turn Semantic Web http://en.wikipedia.org/wiki/Semantic_Web and Word-net http://wordnet.princeton.edu/ into Web03 (Just a name at this time).

    Throwing out a rock to induce gems. I propose to build structured tags to index on Web content via URL and lend content handles to semantic web to establish finer relationship among contents.

    Structured tags is just a way to put content into silos. What these tags represent is crucial for common reference.
    After all the works from the past decade, I see Metadata agreement on its way, and a layered convention will soon to come for Web03 (as I have discussed with participants at Enterprise Data 2010 http://edw2010.wilshireconferences.com/).

    1. An AI method is an ad Hoc method. Anyone with programming skill should be able to create and play and test. The test may be as large scale as selling of derivatives ( http://www.nytimes.com/1996/09/30/news/30iht-deriva.t.html?pagewanted=1 )to the world which contributed to recent financial meltdown.
    Questions: What kinds of fields should be tagged on an AI method? When, where, why, who and how?
    Should we be able to create a movie script out of AI tags?
    Does hacking belong to AI?

    2. Data mining methods is a part of AI. As it tests out a theory in our mind and find proof of the theory that can be qualified as a standard method, be useful again and again. Thus it must have well understood range and limits, such as neural net.
    Questions: Do we need the same set of fields like an AI method tag?
    What are these fields?
    How do we define an application when it is out of predefined range?
    Do we need someone like a patent office to tell us the application is out of its predefined range, or will an expert system be fine?

    3. A financial method is a legal issue. It assumes all the implications of a method are well understood and thus have been implemented into standards. Any violation of these rules will be punished, either as a stolen ID, a default on a shipping order, or a loss in the stock market meltdown. Our systems used to be designed around these rules, and we do have an extensive library of these rules, as an insurance company has.
    Questions: How can we ensure high quality of data? Do we put hackers in jail, or use some automatic monitoring scheme, like a clock telling us something is wrong?
    Do we need built-in data set independence within a private cloud?
    Can we build a private cloud that puts control into the local community’s hands, regardless of whether it is a Red China or a Google China? http://en.wikipedia.org/wiki/Google_China
    Do we need independent data monitoring over this legal system?
    How can a rule be detected or reverse engineered, such as the subprime problem? http://en.wikipedia.org/wiki/Subprime_mortgage_crisis

    4. These system rules will be in and out of libraries. Systems from different communities will have conflicting rules; once we can refer to this metadata, we can communicate and resolve the differences so that an exchange can happen. Such a conflict resolution mechanism, called Tag Voting, can be built into a private cloud.
    Questions: Which voting schemes are suitable for Tag Voting?
    A voting process must be peaceful and must not deprive anyone of freedom of speech, whether with words, images or even bullets.
    Tag voting has to be an educational process.
    Tag voting is a systematic approach to conflict resolution; it is a must-have feature of Web03.

    5. A tree structured tag system is on http://www.PeaceNames.com for discussion.
    Questions: Is OWL (http://www.w3.org/TR/owl-features/) sufficient to expand into structured tags?
    Does OWL lend itself to parallel processing?
    What other tag structures exist or do not exist?

  26. For those who can’t attend in person, watch live streaming or stream your live video by RSVP’ing at http://vokle.com/events/1547 (biomedical data mining) and http://vokle.com/events/1728 (other live & on-demand topics). Vote for topics or add your own at: http://urtak.com/u/719

    Irene Gabashvili, Entrepreneur, Innovator, and Educator

  27. I’m interested in methodologies (e.g. SVM/RVM, neural networks, LWPR) as they apply to market microstructure and high frequency data sources.

  28. Isobel Chey says:

    This is the most useful information I have found so far, thank you for this.

  29. SESSION PROPOSAL:
    TITLE: Forecasting unbalanced data sets (targets 1% or less) for fraud, internet advertising or other industries.
    FORMAT: Round Table Discussion
    LEVEL: Experienced+ (you know what stratified sampling is, you are experienced with several predictive algorithms)
    SESSION LEADER: I can lead, unless someone else wants to.
    (Greg Makowski)

    BACKGROUND SUBJECT TAXONOMY:
    Predictive
    issues of forecasting 1 in 10,000
    (in fraud, not all fraud records are labeled correctly)
    (semi-supervised learning)
    (train one model
    Outlier Detection
    think of a progression
    multivariate outlier detection
    go from one cluster to many clusters
    go from many clusters to granular segments w/ diff fields

    I can post paper links later..
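
    As a rough illustration of the stratified sampling mentioned in the LEVEL line, here is a minimal sketch of a stratified train/test split for a target that occurs about 1% of the time. The function name `stratified_split`, the 20% test fraction and the toy labels are assumptions made for this example; they do not come from the session material.

```python
import random


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split record indices so each class keeps the same share in both parts.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    # Group record indices by their class label.
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction from every class, but at least one record.
        n_test = max(1, round(len(indices) * test_fraction))
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Toy data: 10 fraud records among 1,000 (a 1% target rate).
labels = [1] * 10 + [0] * 990
train_idx, test_idx = stratified_split(labels)

train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(train_rate, test_rate)
```

    A plain random split of the same data could easily leave the test set with zero or four fraud records; splitting within each class keeps the rare class represented at its original rate in both parts.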

  30. SESSION PROPOSAL:
    TITLE: Intro to Data Mining
    FORMAT: Teaching from PPT, lots of Q&A
    LEVEL: New to data mining, numerically literate
    SESSION LEADER: Greg Makowski

    OUTLINE:
    Objectives for 50 Minutes
    From Scatter Plots to Analysis
    Families of Analysis Engines (games & golf clubs)
    How to recognize and assess a project
    Knowledge Discovery in Databases (swing)
    Exploratory Data Analysis & Preprocessing
    Model Building, Evaluation, Description
    Start Today – Next Actions!

  31. SESSION PROPOSAL:
    TITLE: Collaborative Filtering Review
    FORMAT: Teaching from PPT, lots of Q&A
    LEVEL: intermediate+ (see reading list and judge)
    (We are sharing a summary of readings)
    SESSION LEADERS: Greg Makowski, Tricia Hoffman

    OUTLINE:
    1 Netflix $1,000,000 competition was won
    2 References & Reading
    3 History & Basics of Collaborative Filtering
    4 Newer Matrix Methods, Singular Value Decomposition (SVD)
    5 SVD Variations
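
    To make outline items 4 and 5 a little more concrete, here is a minimal sketch of the regularized matrix factorization trained by stochastic gradient descent, in the style described in the Koren, Bell and Volinsky article listed under RELATED READING. It is often called "SVD-style" factorization but it is not a classical SVD. The toy ratings, the learning rate, the regularization weight and all function names are assumptions chosen for this example, not values from the session slides.

```python
import math
import random


def predict(P, Q, user, item):
    """Predicted rating is the dot product of user and item factor vectors."""
    return sum(p * q for p, q in zip(P[user], Q[item]))


def rmse(ratings, P, Q):
    """Root mean squared error over the observed ratings."""
    total = sum((r - predict(P, Q, u, i)) ** 2 for u, i, r in ratings)
    return math.sqrt(total / len(ratings))


def factorize(ratings, n_users, n_items, n_factors=2,
              lr=0.02, reg=0.02, epochs=1000, seed=0):
    """Learn user factors P and item factors Q from (user, item, rating)."""
    rng = random.Random(seed)
    P = [[rng.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_users)]
    Q = [[rng.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - predict(P, Q, u, i)
            for f in range(n_factors):
                p_uf, q_if = P[u][f], Q[i][f]
                # Gradient step on squared error with L2 regularization.
                P[u][f] += lr * (err * q_if - reg * p_uf)
                Q[i][f] += lr * (err * p_uf - reg * q_if)
    return P, Q


# Toy data: users 0-1 like items 0-1, users 2-3 like items 2-3.
ratings = [
    (0, 0, 5), (0, 1, 4), (0, 2, 1),
    (1, 0, 4), (1, 1, 5), (1, 3, 1),
    (2, 2, 5), (2, 3, 4), (2, 0, 1),
    (3, 2, 4), (3, 3, 5), (3, 1, 1),
]

P0, Q0 = factorize(ratings, 4, 4, epochs=0)   # untrained baseline
P, Q = factorize(ratings, 4, 4)

print("RMSE before training:", rmse(ratings, P0, Q0))
print("RMSE after training: ", rmse(ratings, P, Q))
print("Predicted rating for user 0, item 3:", predict(P, Q, 0, 3))
```

    Only the observed ratings enter the loss, which is the main practical difference from running a plain SVD on a matrix with missing entries filled in.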

    RELATED READING:
    High Level Reading
    Programming Collective Intelligence by Toby Segaran. The 2nd chapter gives a good introduction to collaborative filtering with Python examples (non-SVD).

    Matrix Factorization Techniques for Recommender Systems, Yehuda Koren; Robert Bell; Chris Volinsky, IEEE Computer 42(8), 2009

    Singular Value Decomposition (SVD) Reading
    The Singular Value Decomposition, by Jody Hourigan and Lynn McIndoo, Linear Algebra – Math 45. http://online.redwoods.edu/INSTRUCT/darnold/LAPROJ/Fall98/JodLynn/report2.pdf w/ Matlab & image examples

    Numerical Recipes, 3rd Edition, Press et al., 2007, pp. 65-75.

    Collaborative Filtering Reading
    See papers on research.yahoo.com/Yehuda_Koren

    Collaborative Filtering for Implicit Feedback Datasets, Yifan Hu; Yehuda Koren; Chris Volinsky, IEEE International Conference on Data Mining (ICDM 2008), IEEE, 2008

    Factorization Meets the Neighborhood: a Multifaceted Collaborative Filtering Model, Yehuda Koren, ACM Int. Conference on Knowledge Discovery and Data Mining (KDD’08), 2008

    Collaborative Filtering with Temporal Dynamics, Yehuda Koren, KDD 2009, ACM, 2009

  32. Hello eBay! says:

    [...] now, on a Saturday, I’m off to the ACM Data Mining Camp, hosted at eBay’s north [...]

  33. Luca RIGAZIO says:

    Thanks for attending the DIMRED sessions. Here is the information I captured during the session; please use it as a reference. I have also added my LinkedIn profile if you want to connect for more information.

    .luca

    ME: http://www.linkedin.com/in/lucarigazio

    HLDA / HDA:

    http://www.clsp.jhu.edu/ws97/acoustic/reports/SpCo98HDA.pdf
    http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.26.7481&rep=rep1&type=pdf

    LDPP: http://prhlt.iti.es/seminars/2008/Villegas08c.pdf

    CORE VECTOR MACHINES: http://www.cse.ust.hk/~ivor/cvm.html

    SPARSE PROJ SQ: http://www.public.asu.edu/~jye02/Software/SLEP

    Mike Mahoney's page (information about random projection)

    http://cs.stanford.edu/people/mmahoney

  34. [...] was my second time ACM SF Bay Data Mining camp. The first thing to notice – it was big, 400 people or so hosted by eBay in San [...]

  35. Junling Hu says:

    Hi,

    I posted my presentation on the feature selection problem on my website. It has links to the two software packages I mentioned.
    The slides are at:
    http://www.junlinghu.com/feature_selection.pdf

  36. Irene Gabashvili, PhD has posted a live Vokle event on Biomedical Data Mining capturing this event! Thanks, Irene!

    http://vokle.com/events/1547

    Check it out!

  37. Andraz created a Zemanta Blog Post

    http://www.zemanta.com/fruitblog/data-mining-camp-report/

    Here is a quote,

    “Initial expert panel was highly acclaimed. The guys were really direct, honest and interesting. A few quotes:

    * Joseph B. Rickert: “Size of your data and time you have to analyze it are reversely proportional”
    * @hughewilliams: “Data miner vs Statistician skills – ability to write code”
    * Ted Dunning’s preconditions for data mining: 1) data exists 2) someone benefits in a concrete way”

    Thank you Andraz for the compliment!

  38. Thanks to everyone who attended and made this event. We at eBay were glad to have you.

    Steve

  39. Some of the buzz from Twitter

    mateuszb @PatriciaHoffman it was really nice to be there. Good job and looking forward to attending these more often!

    DataJunkie @PatriciaHoffman Sure! Will post an entry on my blog as well. Thanks for a great event! Will be back!

    ajlopez RT @patriciahoffman @NextGenCMO: Big data is red hot@andraz: jobs… – NetFlix, Ebay Hadoop/data mining jockey can land a job in seconds.

    andraz @patriciahoffman I don’t know any experts in sentiment analysis personally. they are elusive. Would like to ask them many things :) #dmcamp

    RT @Psyllo: @PatriciaHoffman Loved the panel and the un-conference format. #DMCAMP was a success IMO

    @littleidea: @PatriciaHoffman Days like #DMCAMP I wish I lived in the Bay area, I won’t make this one, but will follow along at home. :)

    DeeMcCrorey @patriciahoffman Happy 2 help out! Thx 4 the share…I’ll check out Avinash Kaushik on web analytics. Another gd find is @massimopaolini

    aurametrix @PatriciaHoffman thank you. #DMCAMP should be a success http://twtvite.com/DMCAMP2010
    @sailur: Latest dicovery – RHIPE using R over HADOOP #DMCAMP

    @andraz: Clem Wang formerly Yahoo spam guy now at Bing annotated most of the learning sets by himself, not trusting crowdsourcing #dmcamp

    @mateuszb: #DMCAMP Interesting. Facebook dropped Cassandra for inbox search and hired HBase person to switch. (as reported by @cwensel on stage)

    @NextGenCMO: Hot topics at #dmcamp include #mahout, #hadoop, #cloud, click stream analytics for tab browsing, bio informatics … Surprise, surprise.

    DataJunkie When to start using Cascading? “Write one map/reduce application, throw it away, and then start using cascading.” Via @cwensel #DMcamp

    sailur Done! Nice survey of state of practice! Thank you ACM, EBay and all other gold sponsers!! #DMCAMP

    @andraz: Ted Dunning’s preconditions for data mining: 1) data exists 2) someone benefits in a concrete way #dmcamp about 17 hours ago from web

    @bobpage: This expert panel at (free) #dmcamp is better than many big (paid) conference panels. No hawking, just good expert info.

    raymondmccauley

    Lots of good genomics questions at ACM Data Mining Camp – not just for web metrics anymore #dmcamp

    dws

    Refreshingly high quality, agenda-free panel at #dmcamp. Kudos to whoever who put the panel together. But please fix the mic feedback.

    #DMCAMP Who benefits from data from smart meter OR from rare event?

    #DMCAMP Expert Panel answering lots of questions including one on “smart meters” coming on electric meters

    #DMCAMP Stanford Professor Dr.Michael Walker speaking about personalized medicine and using data mining

    #DMCAMP Greg Makowski talking on fraud detection…Break out groups coming up

    #DMCAMP Dr Mike Bowles speaking at Data Mining Camp ion how to use data mining and opportunities

    Here at eBay for Data Mining Camp … Expert panel up Dr.Tricia Hoffman speaking

    Here at eBay for Data Mining Camp … Lots of people here … Excitement building

    cwensel @mateuszb lurking clickstream. Leading hadoop next.

    mateuszb At large text classification session at #dmcamp

    Biomedical #DMCamp is online now: http://vokle.com/events/1547

    @andraz – great comments. What session are you attending now? could you join us for semantic discussions?

    Now at #dmcamp: Semantic Technologies & Data Mining onside: Fireside C, online http://vokle.com/events/1547

    Discussing Collective Intelligence in Action Book http://is.gd/aQKol at #dmcamp, Fireside B

    Another difference between statistics and data mining: grant size – thousands vs millions @ted_dunning at #dmcamp

    Data miner vs Statistician skills – ability to write code (@hughewilliams at #dmcamp Expert Panel)

  40. More reports on the conference:
    Biomedical
    1) http://aurametrix.blogspot.com/2009/11/biomedical-data-mining-dimensionality.html
    R revolution
    2) http://blog.revolution-computing.com/2009/11/thoughts-on-the-sf-bay-data-mining-camp.html
    3) http://blog.revolution-computing.com/2010/03/acm-data-mining-camp-march-20.html
    KDnuggets
    4) http://www.kdnuggets.com/2009/11/b-sf-data-mining-camp-report.html
    Andraz
    5) http://www.zemanta.com/fruitblog/acm-data-mining-camp-silicon-valley-report/
    Analytic Bridge
    6) http://www.analyticbridge.com/main/search/search?q=Data+mining+camp

  41. Hello,
    Thanks for the interest in slides. Sorry I didn’t post some things sooner. Other priorities came up and I had to travel on short notice to help my mom. To see what was up, follow this link – any support is appreciated.
    http://pages.teamintraining.org/los/rnr10/jwilsonbli
    http://www.theoaklandpress.com/articles/2010/03/26/obituaries/1561629.txt

    Here are some of the presentations from the Data Mining Camp on 3/20/2010:
    * Intro To Data Mining (… or what is it all about anyway?)
    https://docs.google.com/fileview?id=0B9w6dFrP3862ODYxZjk4ODktM2ZlOS00MDQxLWExNTYtYmZlODQxNTViMWI3&hl=en

    * Collaborative Filtering – Review of Koren Papers
    https://docs.google.com/fileview?id=0B9w6dFrP3862NzQxYjYxMGEtYTI2MS00NTkxLWEwMzQtZThmYzY4NmQ5NzU0&hl=en

    * Salford Training – How to Win Data Mining Competitions
    https://docs.google.com/fileview?id=0B9w6dFrP3862MDQ5NTk0ZGUtZTk3NS00NTM5LWI2NTktMDkzMmEwYzhjMTU3&hl=en

    * Data Mining Camp 2010 03 20 – Session Matrix
    https://docs.google.com/fileview?id=0B9w6dFrP3862NzY1ZjcxYTItMDM1Mi00OTNkLWI1MDMtMjg0MDAwMzJiZGYy&hl=en

    Thanks for your interest – we hope to start discussing plans for our next Data Mining Camp.

    Greg Makowski

  42. DJ Cline says:

    First, thanks for inviting me to the huge event on Saturday.
    It was better than SXSW in Austin!
    The ACM crowd was smarter and funnier.

  43. The session on “Natural Language Processing” was attended by more than 20 people, and related problems were also discussed at other sessions, e.g., on “Sentiment Analysis”. Hopefully, future (un)conferences will bring more discussion and a better understanding of the problems.

    Parts of the discussion, as recorded by our volunteer scribe, Joan A. Hoenow, are below, and my own comments are at the end.

    =====

    Question raised: Can you do NLP in a reasonable amount of time? Suppose you have millions of documents.
    Response: NLP will not be fast at first but can be improved.

    Comment: Automatic speech recognition systems have been getting better. They are more able to deal with unrecognized words.
    Response: Maybe speech recognition in phone answering systems won’t be the model that we want.

    Comment: there is more interest in web search.
    Comment: Dictionaries, and elaborate grammars are developed, but these are not part of “understanding the meaning”.
    There are elaborate syntactic processors, but they are strictly syntactic and too slow for large amounts of text. They are based on a strict theoretical interpretation, and semantics was not included. Adding semantics would reduce the amount of work, but this is not in the academic tradition.

    Question: What do you do about the inherent ambiguities? The machine doesn’t know about the jaguar car vs. the jaguar animal.

    Comment: Machines now do statistical processing. Google has parallel “facts”.

    Comment: What is natural language processing? Powerset, acquired by Microsoft, is developing a natural language search engine for the Internet. Will it be able to distinguish, in a search for “children’s books”, at least between the 3 categories: 1) books for children; 2) books owned by children; 3) books written by children?

    The group discussed two approaches to natural language processing:
    1) set up a structure, knowing the concepts, and program the understanding from human understanding.
    2) machine self-learning: feed bunches of text, find patterns and the significance of patterns. Set up a system that evolves understanding. Reference was made to Monica L. Anderson of Syntience Inc. (not in attendance) as a proponent of machine learning. The Syntience Inc. web site proposes “Artificial Intuition”.

    In discussion of either approach, the question was raised “What is understanding?”

    Comment on machine learning approach: imagine intercepting signals from another civilization, or obtaining text from a former civilization. How would you recognize anything of significance? What quantities can I use?
    Response: Why refuse to know the language?
    Comment: Not being refused, but possibly I don’t have it. For example, you may be indexing geology and you may not know it. There are advantages to being able to do some of this work while being completely ignorant of the jargon. For example, ‘set’ means something different in math than in other fields, and the differing usage of the same word in different disciplines is common.
    Comment: before such an interesting thing as discovering meaning from an unknown civilization, can we understand our own?

    Other applications: Trying to make a computer do categorization, give an idea what the text is about. An example is a sales ad, perhaps on eBay. Can the ad be processed to be placed in the correct category?

    Comment: Note the difference between two things in NLP
    1) make decisions based on statistics
    2) the ambition to understand the meaning completely even in a restricted domain.

    Comment: Suggested approach: if a compendium has a glossary, you either already have the word with a detailed description, or the entry is missing and one needs to be created.

    Comment: Back to the two approaches. Are these really different problems or is it just a matter of how much computation is needed?
    Comment: When a human learns, there is a lot of negative feedback. Would it be possible for the natural language processing to have an “I don’t know” and get human feedback?

    Comment: There’s ambiguity, but the computer can report a confidence interval, for example in speech tagging. The problem is that the training set has to be very similar to the text.
    Everyone is talking about using Wikipedia. Wikipedia knows lots of things that most humans don’t care about, obscure items. This is not a good training set. There is no common sense.
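
    The idea of a statistical system that can answer “I don’t know” and hand the case to a human can be sketched with a small naive Bayes text categorizer that abstains below a confidence threshold. Everything here is an assumption for illustration: the four training ads, the category names, the 0.8 threshold and the function names were not part of the discussion.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Count words per category from (text, category) pairs."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, category in examples:
        doc_counts[category] += 1
        word_counts[category].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocab


def classify(text, model, min_confidence=0.8):
    """Return (category, confidence), or ("I don't know", confidence)."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    log_scores = {}
    for category in doc_counts:
        total_words = sum(word_counts[category].values())
        score = math.log(doc_counts[category] / total_docs)
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are ignored
                # Laplace (add-one) smoothing avoids zero probabilities.
                count = word_counts[category][word] + 1
                score += math.log(count / (total_words + len(vocab)))
        log_scores[category] = score

    # Turn log scores into normalized posterior probabilities.
    top = max(log_scores.values())
    weights = {c: math.exp(s - top) for c, s in log_scores.items()}
    best = max(weights, key=weights.get)
    confidence = weights[best] / sum(weights.values())
    if confidence < min_confidence:
        return "I don't know", confidence
    return best, confidence


# Hypothetical training ads, two per category.
model = train([
    ("jaguar xj sedan leather seats low mileage", "cars"),
    ("camaro v8 engine low mileage", "cars"),
    ("jaguar plush toy animal", "toys"),
    ("stuffed animal plush bear", "toys"),
])

print(classify("camaro engine mileage", model))
print(classify("jaguar", model))
```

    With this toy data the word “jaguar” alone appears in both categories, so the classifier abstains; those abstentions are the cases that would be routed to a person for feedback.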

    Comment: Maybe there could be an NLP Wikipedia.

    Comment: The problem we have when we talk to linguists is that understanding and meaning are not things we can define and measure. If we can measure what is there, we can build graphs.

    Comment: Semantic etymology can be done. There is some dispute whether word stemming is really useful, for example, ‘pant’ and ‘pants’ have the same stem but this is not helpful.
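
    The ‘pant’ / ‘pants’ objection is easy to reproduce with a toy suffix-stripping stemmer. The sketch below is a deliberately naive stand-in written for this example; it is not the Porter stemmer or any other published algorithm, and the suffix list and minimum stem length are arbitrary assumptions.

```python
def naive_stem(word):
    """Strip one common English suffix, keeping at least three letters."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for word in ("pant", "pants", "book", "books"):
    print(word, "->", naive_stem(word))
```

    Merging ‘book’ and ‘books’ is usually what a search index wants, while merging ‘pant’ and ‘pants’ loses a real distinction in meaning; a stemmer that looks only at spelling cannot tell the two cases apart.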

    Comment: Is there intrinsic informational content in a sequence of words? “The quick brown fox jumps over the lazy dog”. Which words are significant? Can you recognize the subject, the predicate, the object of the predicate?
    Comment: English syntax doesn’t show the exact roles for verbs with multiple objects, as in “I give him the book”.

    Comment: The question is “What do we want to do with it?” What is the context? Imagine this: we are 20 people, each going to sell a similar laptop or Camaro on eBay. It is likely we would have 20 different descriptions. If we have time, we can get a good categorization. eBay has some words that are unique to its environment. Some categorization works well, some not so well. In trying to connect the buyer and the seller, it is important that the seller can describe the item, and many sellers do not know how to do this. How can eBay help the seller? And how can the buyer’s search get the correct items?

    Comment: eBay does have the advantage of some feedback, times when a buyer searched, located an item, and purchased it. Perhaps these can be used in some way.

    Further comments: There is a tradeoff with having metadata. It should capture the most relevant features; however, each category would have different relevant features.

    Comment: Some clickthrough analysis can be useful.
    Comment: This is just one case. We know that we want understanding. How can we make more progress? In general, it seems to require a technique for understanding relationships between concepts in various texts and then applying it to more domains.

    Comment: For people really trying to solve a problem, besides just talking about general understanding, what do you do in the real world where it is so messy? And there are issues of scaling.

    Comment: Another application is trying to understand content on the blogs. Zemanta Ltd. is trying to understand blog content to add appropriate images. Can we improve the experience of grazing content?

    Comment: Search is good if you know what you are looking for. With a large number of documents, it may not be that useful. With large data, and electronic discovery systems, how do you determine “what is the content?”.

    Then some of the participants mentioned their own work and interests; some of these are listed below.
    -multiple context in the word. stemming.
    -Adobe optimization for serving content. using probability.
    -adding semantic filtering after syntactic parsing to an online tutoring system analyzing school students’ production in the course of “Language Arts and Writing”.
    -how do you learn as you go, and optimize as you go, and how it scales, and how it can be extended to other languages?
    -twitter analysis, web spam (at Yahoo!).

    =======================

    My comments (Gregory S. Tseytin)

    Many commercial applications need to process “unstructured” contents presented in natural language. Clearly what they need is the meaning of the texts, but, paradoxically, they avoid by all means the language’s methods of representation of meaning and instead rely on various statistical approaches (which, of course, look more accessible from the start). As a result, NLP has acquired a reputation of being inherently inaccurate. Indeed, I saw a blurb from a company maintaining a database of facts in cell biology. They were saying that their database was very well curated because they DIDN’T USE NLP. To overcome this perception, should we come up with a different name for the field?

    Is it possible to access the meaning of natural language texts by analyzing the sentences, discovering the relationships between the words and the underlying concepts? It is not an easy task, but in academia there are elaborate formalized grammars and extensive dictionaries showing semantic relationships. The problem, I think, is that academic research follows its own traditions, striving more to verify neatly presented theoretical ideas (even if they don’t cover all of the available linguistic material) than to satisfy practical needs. Also, given the scarcity of research grants, it is safer to adhere to well-established approaches. One obvious example of self-imposed theoretical limitations is the development of syntax processing without reference to semantics, even though its use would dramatically reduce the number of variants to be considered.

    Well, semantics probably cannot be defined as a single universal concept, and each application has its own requirements. But this is what language is like: beyond the morphology and syntax, it is more like a loose collection of various application fields sharing common mechanisms from syntax, rather than a well-defined single system. We should learn to live with it, and learn to apply the knowledge of syntactic mechanisms to the various applications. Can we find a way to overcome this disconnect between academic studies and business applications?

  44. I must be getting really old! From when I first got into this stuff compared to now, putting a setup together has gotten pretty simple. Man, I am blown away at what you can do now.

  45. Supply Chain – IT Sourcing Consultant Ability to validate and derive insight from large amounts of data from diverse sources . Strong experience in collecting, manipulating, interpreting, and cleansing data from multiple and diverse sources . Ability to perform trend analysis on large ….
