DM SIG – ACM Silicon Valley Data Mining Camp on November 1, 2009
2009 Silicon Valley Data Mining Camp
Unconference Free Event (optional $20 for annual ACM membership)
We had 225 participants, 9 in our panel of experts and 3 sponsors.
We are starting to plan the next Data Mining Camp for Sat March 6 or March 20.
LOGISTICS:
- Date: Sunday, November 1 (Daylight Savings TIME CHANGE)
- Time: 12:00 to 7:30 pm (Starting earlier)
- Location: Hacker’s Dojo 140A South Whisman Rd Mountain View, CA 94041
- WiFi is available, so you can tweet
- Please add content to this blog from your sessions during the Data Mining Camp!
- Hacker’s Dojo in the San Jose Mercury news, 10/15/09
- Free Parking around the Dojo (Check the map for areas to avoid parking)
- Food and Beverages included with RSVP
- RSVP: at this LinkedIn Event. If you are not on LinkedIn, then RSVP to Greg_Makowski at yahoo.com with an email containing the subject line “RSVP to Data Mining Camp.”
- When you RSVP, please include topics of interest to you on the ACM site. Click on the ACM event title if you don’t see the blog entry form at the bottom. See below for ideas.
- Bring: curiosity, business cards, a portable computer (for notes, blogging, mining data)
- Questions, feedback, coordination, click on a name to email a contact:
GOLD SPONSORS:
REvolution Computing offers open source products and services for high performance analytics, including REvolution R Enterprise which delivers 100% R and more—optimized, validated and supported.
Increase Business Performance with Automated Customer Lifecycle Analytics with KXEN. KXEN is the leading provider of automated data mining software and customer analytic solutions for retail, communications, media, financial and marketing services companies to improve their customer insight and enhance corporate performance. KXEN solutions integrate predictive analytics and social network analysis into business processes to boost marketing campaign results and profitability.
- Over 50 Million Professionals Use LinkedIn to exchange information, ideas, and opportunities.
CO-ADVERTISER SPONSORS:
- Analytic Bridge is a data mining social network and newsletter going out to 20,000 people
- KD Nuggets website and newsletter, 12,000 people
- ACM email newsletter, 3,000 people
- Predictive Analytics World is the cross-vendor conference covering commercial deployment, February 16-17, 2010 in San Francisco, CA; 2,200 people in Prediction Impact newsletter
- SD Forum (The Software Development Forum) has ~20 events / month, and 16 SIGs
- www.CloudCamp.com Dave Nielsen has organized around 30 Cloud Camps internationally in the last year.
POSSIBLE TOPICS
….that may be proposed at the unconference (see the blog below for details, please add to the blog your ideas):
- Bring your data to model with KXEN
- Introduction to data mining, how to get started for techies, FAQ
- Introduction to data mining for business leaders, product managers, marketers and profit center owners
- Bring your project challenges to brainstorm on current projects with experts (discussion)
- Bring your business problems and data for a “data mining coopertition hackathon” (data analysis)
- Challenges in vertical market X (internet advertising, medical, green tech, finance, marketing, retail, …)
- Discuss algorithm X (Support Vector Machines, TreeNet, NaïveBayes, Clustering, outlier detection, text mining)
- Netflix $1,000,000 data mining competition, presentation by Greg Makowski of collaborative filtering papers by Yehuda Koren
- Data Mining forecasting of stock market data (led by Mike Bowles and Steve Umfleet among others) i.e. SVD and time variation. Also see Fast SVD open source code
- PLANET – Parallel Learner for Assembling Numerous Ensemble Trees – developed by the Google AdWords group
- Using Cascading to organize Map Reduce (Hadoop) jobs in the Cloud (Inviting a 3rd party to lead)
- Cloud computing, Hadoop + data mining: Mahout, Mahout talk Fri 11/6 in Oakland
- Challenges and architectures to automate and embed data mining in software or web applications
- Semantic Web, Silicon Valley Semantic Technology (SVST) (Inviting a 3rd party to lead)
- Artificial Intelligence, AI Meetup (Inviting a 3rd party to lead)
- Data mining software standards, PMML and DMG, (Inviting a 3rd party to lead)
APPROXIMATE SCHEDULE
- 12:00 Arrive, name tags, network, brainstorm discussion topics with others
- Subway subs, snacks, soda, muffins (thank the sponsors!)
- Coffee at the event from Red Rock Coffee
- 12:45 Main session starts, seating in main area, overview to the day
- 1:00 Panel of industry experts answering questions from the audience, moderated by Dr. Patricia Hoffman
- Omid Razavi: President of KXEN
- DJ Patil: Chief Scientist — Product Analytics at LinkedIn
- Joshua Koran: VP, Targeting and Optimization at ValueClick
- Dr. Rajan Patel: Stanford Visiting Professor, Sr Statistician at Google, Emory University
- Greg Makowski: Golden Data Mining, applied data mining since 1992
- Mike Bowles: Seasoned in Startups and Quantitative Finance
- Contact us if you are interested in joining the panel
- 1:30 Gold Sponsor presentation from REvolution Computing
- 1:35 Gold Sponsor presentation from KXEN
- 1:40 Gold Sponsor presentation from LinkedIn
- 1:45 Introduction to Hacker’s Dojo
- 1:50 Audience members line up to suggest discussion topics to the room
If a minimum threshold of people are interested in the topic, then it gets a discussion slot. We can have up to 6+ concurrent discussion slots per time slot (depending on audience size). We recommend for each discussion a primary facilitator and a note taker to report at the end. The note taker would be encouraged to add content and web links to the blog section of this event – so information could be shared with all.
- 2:30 Time Slot 1 (many concurrent sessions)
- 3:30 Time Slot 2 ( “ “ “ )
- 4:30 Time Slot 3
- 5:30 Time Slot 4
- 6:30 Report summary of sessions over food & drinks in the main area, networking
- A variety of pizzas for dinner (thank the sponsors!)
- 7:30 Camp organizers invite any help in picking up after the free unconference for the Dojo
PREPARED DEMOS
Topics may be prepared presentations or ad-hoc discussion topics. With Hacker’s Dojo, we have one computer projector. If you want to propose a computer presentation or demo, please coordinate in advance and/or bring a computer projector.
CALL FOR ADDITIONAL VOLUNTEERS, HELP and SUPPORT
- FACILITY: Can some people help with setting up the facility in the beginning, at around 11:15 to 11:30? Can we get some other help picking up after the event? This may involve setting up folding chairs and moving tables around, or helping to move chairs after the main session to the break out session locations. We could also use some help setting up the unconference schedule on the wall.
- BRING DATA FOR ANALYTIC HACKATHON: Bring your business problem and data for preliminary analysis. Bring problem specs, data specs, business metrics, copies on multiple USB memory sticks. Please coordinate in advance. Share info on the ACM blog, below. The analytic hackathon can cover one or more sessions, based on interest in participation.
- CALTRAIN TRANSPORT: Some people have contacted us, and are coming in from CalTrain, which is about a 15 minute walk to Hacker’s Dojo. Can volunteers start a blog posting below, offering to help with transportation to and/or from Mountain View CalTrain and the Hacker’s Dojo?
- OTHER PROJECTORS: We have one computer projector, but for demos or other presentations, it would be helpful to bring other projectors. Please let me know if you can bring one.
- ITEMS FOR RAFFLE: Any items you want to donate to the non-profit ACM that could be used at the event in a raffle would be appreciated. We would prefer you check in advance.
- VIDEO RECORDING: Monica Anderson of Syntience has offered to bring two video cameras to cover the event – we need camera operator volunteers. If we have enough volunteers, then each volunteer would cover less time and can participate more. Please coordinate directly with Monica, and keep Greg in the loop.
- VIDEO EDITING AND POST-PRODUCTION: If we can find one or more people who would like to work in Final Cut (Monica Anderson will teach), then we can post better videos. Post-production is a decent profession to learn. A newbie would need about 10 hours of instruction from the trainer volunteer, or intense study on their own, to learn to edit a video. Editing then takes about 4-8 times the running time of the video, so a 1 hr talk takes 4-8 hrs for someone who has done it before.

OTHER EVENTS OF INTEREST
- SDForum, Cloud Services SIG: The Federal Government and Cloud Computing, Tue, Oct 27, 6:30-9pm, Palo Alto
- SDForum’s Collaboration 2.0 Conference Friday, Oct 30, Santa Clara, $115
- RightScale User Meetup with evening cocktail party, Monday, Nov 2, 7:30am – 2pm, at the Cloud Computing Expo
- Cloud Computing Conference & Expo, Nov 2-4, Santa Clara Convention Center
- ApacheCon, Nov 2-6, Oakland Marriott (Mahout Fri 10am)
- Real time Social Media Monitoring and Marketing Wed, Nov 4, Palo Alto
- ACM Python Professional Development Seminar, Sat Nov 7, 8:30 – 5pm, Cupertino
- Are you interested in another Data Mining Camp, in Jan or Feb?
- Predictive Analytics World February 16-17, 2010 in San Francisco, CA
WHY “CAMP” ?
Q) Why use the term “Camp” to describe this unconference event?
A) Originally, in 2005, O’Reilly Media had a Friends Of O’Reilly (Foo Camp) unconference event, where some people actually camped out that weekend.
DAY OF THE EVENT REGISTRATION
Go to http://tinyurl.com/acmdmsig and enter your name, email along with any comments or feedback.




Please be advised that there is also
NASA’s Conference on Intelligent Data Understanding (Oct 14-16, 2009)
https://dashlink.arc.nasa.gov/group/ciduaisrp/
Our proposed topic is: E-commerce Data Mining.
How much time is available to us?
Thank you very much for organizing the Data Mining Camp.
Scott,
E-commerce Data Mining sounds fine for a subject. There are 4 time slots, with about 50 minutes per time slot and 10 minutes for break and to switch to the next time slot. Each time slot starts on the hour.
If you are interested in proposing more than one time slot, then I would suggest you propose variations on a theme. For example,
* “E-commerce Data Mining: analysis for banner ad selection”
* “E-commerce Data Mining: clustering customers to understand the best communication messages”
Propose each topic, and if the threshold of people is interested in each session, then you can hold multiple sessions in the unconference format.
Thanks,
Greg_Makowski at yahoo.com
TOPIC: Bring your project challenges to brainstorm on current projects with experts (discussion)
OBJECTIVE: Provide the project owner with new insights and approaches to their problem.
DETAIL:
* Do you want to get some help assessing if a business problem would be a good fit for analysis?
* Do you want to discuss how to structure a problem, the business value metrics, how to explore or prepare the data?
* Are you working on an existing project where the improvements in results have leveled off, and you want to brainstorm alternative approaches?
Contact us in advance, or blog, if you have questions on how to best articulate your challenges. Providing detail in advance to read can help others think and possibly prepare.
TOPIC: Bring your business problems and data for a “data mining coopertition hackathon” (data analysis)
OBJECTIVE:
Advance the problem owner’s insight into their problem, provide analysts with an interesting and useful challenge.
DETAIL:
This session would need two groups of prepared people to be successful:
1) PROBLEM OWNER(S): People with business problems, data and some articulation of the objective or success. Bring the data and specification on multiple memory sticks – or blog in advance and leave any web links to the data and/or specification. I would suggest comma-delimited (or otherwise delimited) data with field names in the first row if it is a structured data problem. See also http://www.kdnuggets.com/datasets/index.html for ideas and formats.
2) ANALYSTS: …with portables and analytic software willing to dive into a problem on a brief basis.
One way this session could proceed:
a) intro to the overall problem(s) and context
b) if a choice exists among problems, pick one or divide efforts
c) problem owner goes into details
d) analysts self-organize on how to cooperatively or competitively partition working on the problem (Exploratory Data Analysis, preprocessing, running it through various software, …)
e) iteratively work, regroup and share insights
If it is of interest to both groups (analytic and problem owner), this could become a track, spanning one or all time slots, allowing a more significant “cloudsourcing” of investigation effort.
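For problem owners preparing data in the suggested layout (delimited text with field names in the first row), here is a minimal sketch in Python of how an analyst might load such a file and run a first sanity check. The sample data, the field names and the helper names `load_delimited` and `count_filled` are all hypothetical, made up for illustration; only the standard-library `csv` module is used.

```python
import csv
import io


def load_delimited(text, delimiter=","):
    """Parse delimited text whose first row holds the field names."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    rows = list(reader)
    return reader.fieldnames, rows


def count_filled(fieldnames, rows):
    """Count non-empty values per field: a quick check before any analysis."""
    return {name: sum(1 for row in rows if row[name]) for name in fieldnames}


# Hypothetical sample; a real file would be read with open(path, newline="").
sample = "customer_id,age,purchased\n1,34,yes\n2,,no\n3,51,yes\n"
fields, rows = load_delimited(sample)
filled = count_filled(fields, rows)
print(fields)
print(filled)
```

For tab-delimited data, the same helper could be called with `delimiter="\t"`. Counting filled values per field is only one possible first check; it simply shows why a header row makes the data easy to explore.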
TOPIC: Netflix $1,000,000 data mining competition, presentation by Greg Makowski of collaborative filtering papers by Yehuda Koren
OBJECTIVE: Review some of the extensive literature on collaborative filtering, discuss as a group, relate to the next Netflix data mining competition.
DETAIL:
We can have a general discussion on the subjects covered by the following reading:
* High Level Reading
Matrix Factorization Techniques for Recommender Systems Yehuda Koren; Robert Bell; Chris Volinsky, IEEE Computer, 2009, 8
Programming Collective Intelligence by Toby Segaran. The 2nd chapter gives a good introduction to collaborative filtering with Python examples.
* Detailed Reading
See papers on research.yahoo.com/Yehuda_Koren
Collaborative Filtering with Temporal Dynamics, Yehuda Koren, KDD 2009, ACM, 2009
Collaborative Filtering for Implicit Feedback Datasets, Yifan Hu; Yehuda Koren; Chris Volinsky, IEEE International Conference on Data Mining (ICDM 2008), IEEE, 2008
Factorization Meets the Neighborhood: a Multifaceted Collaborative Filtering Model, Yehuda Koren, ACM Int. Conference on Knowledge Discovery and Data Mining (KDD’08), 2008
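As a warm-up for the reading list above, here is a minimal sketch of the basic idea behind matrix factorization for collaborative filtering: each user and each item gets a short vector of latent factors, and a rating is predicted as the global mean plus the dot product of the two vectors. This is a simplified illustration, not the method from any of the listed papers (it omits the user and item bias terms and the temporal effects, for example). The toy ratings, the function names and the hyperparameter values are assumptions chosen for the example.

```python
import math
import random


def predict(mu, P, Q, u, i):
    """Predicted rating: global mean plus dot product of user and item factors."""
    return mu + sum(pf * qf for pf, qf in zip(P[u], Q[i]))


def factorize(ratings, n_users, n_items, n_factors=2,
              lr=0.02, reg=0.02, epochs=500, seed=0):
    """Fit latent factors with stochastic gradient descent on squared error."""
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for _ in range(n_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - predict(mu, P, Q, u, i)
            for f in range(n_factors):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step with L2 regularization on both factor vectors.
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return mu, P, Q


def rmse(ratings, mu, P, Q):
    """Root mean squared error over a list of (user, item, rating) triples."""
    total = sum((r - predict(mu, P, Q, u, i)) ** 2 for u, i, r in ratings)
    return math.sqrt(total / len(ratings))


# Hypothetical toy data: (user, item, rating). Users 0-1 prefer items 0-1,
# users 2-3 prefer items 2-3.
ratings = [
    (0, 0, 5), (0, 1, 5), (0, 2, 1), (1, 0, 5), (1, 1, 4), (1, 3, 1),
    (2, 0, 1), (2, 2, 5), (2, 3, 5), (3, 1, 1), (3, 2, 5), (3, 3, 5),
]
mu, P, Q = factorize(ratings, n_users=4, n_items=4)
baseline = math.sqrt(sum((r - mu) ** 2 for _, _, r in ratings) / len(ratings))
fitted = rmse(ratings, mu, P, Q)
print("baseline RMSE (predict the mean):", round(baseline, 3))
print("training RMSE after factorization:", round(fitted, 3))
```

The errors printed here are measured on the training ratings only, so they say nothing about generalization. A real evaluation would hold out ratings, as the Netflix competition did.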
Proposed topic: Biomedical Data Mining: Dimensionality, Noise, Applications
I’d like to propose a topic on semantic web:
“Build structured tags from Wordnet”
the main idea is discussed on
http://lianaye.mysite.com
Yes, I’ll attend.
TOPIC: Finance & Trading meets Data Mining & Machine Learning
STRUCTURE:
- How do we identify bull, bear and sideways markets, without the benefits of hindsight, and without resorting to chartist rhetoric? Consider market and macro factors that define attractors in a continuous state space.
- Discuss typical pitfalls when technologists present data mining trading strategies to fund managers.
- How are machine learning models trained on data mined trading and market information, from a practitioner’s perspective?
READING:
Singular Spectrum Analysis, Elsner & Tsonis
Inefficient Markets, Shleifer
TOPIC: Financial Track – Dealing with the changing character of financial markets.
Financial markets change character in ways that are difficult to characterize. Sometimes they move quickly, sometimes slowly, sometimes up, sometimes down, sometimes sideways etc. Sometimes markets change character in response to identifiable events, sometimes for no apparent reason.
Can data mining extract useable information in this setting? How?
TOPIC: Financial Track – Approaches to posing financial data mining problems.
The character of financial markets makes it difficult to formulate well-posed problems. For example, problem statements like “give buy and sell signals” or “is action A better or is B” seem well-posed, but it turns out to be complicated to even generate the truth sets for training.
What are the alternatives?
TOPIC: Open Source Data Mining Tools
OBJECTIVES: Identify (list) open source tools for data mining such as Weka and Mahout. Describe and evaluate them: what do they contain? What do they do and which applications are they best suited for?
TOPIC: Open Data Sources
OBJECTIVES: List sources of data that are openly available and describe their uses.
DETAILS: In the old days, it was difficult to get access to large collections of data for various reasons, but mostly because they were proprietary. Since the advent of the internet, more and more data is becoming available online. Toby Segaran’s book “Programming Collective Intelligence” has a great set of examples. He shows a set of simple applications and how to get data from the web to develop and test them. The objective here would be to build on this list, create a larger list of data sources, and talk about their uses.
It would be remiss not to include the viewpoint of Nick Taleb when discussing data mining, non?
Perhaps a mention (by some other credible speaker) of these two articles to which Taleb refers might be of use:
Halbert White 2000, A Reality Check for Data Snooping, Econometrica, Vol. 68 (Sep.), No. 5., pp. 1097-1126.
R. Sullivan, A. Timmermann, and H. White: “Data Snooping, Technical Trading Rule Performance, and the Bootstrap,” Journal of Finance, 54, 1647-1692 (1999).
Financial Track – Financial Data Variability
I put together a short white paper discussing the character of variability in financial data. I think it’s an easy read with lots of pictures. If you find any mistakes or lack of clarity, let me know. Here’s the link:
http://docs.google.com/Doc?docid=0ARinKxOPitdhZGhycWJ6OWdfMWZzdGg3OWZn&hl=en
Hello, Mike,
I have tried your link and get a nice intro page on Google Docs, but not your article. Do you need to set up invitation email addresses for others to read?
Topic proposal:
Using game strategy, Bubble Shooter (Free download from DeadWhale.com ) for tag voting or organizing data mining software standards.
Financial Track – Republished Data Variability
I have the permissions right this time, I believe. Here’s the new link:
http://docs.google.com/View?id=dhrqbz9g_8d4fjqdcc
The Financial Track seems to have attracted a lot of interest. There are tons of questions to address, and this doc (I hope the permissions are correct) poses many of them, and provides some definitions and motivation for trying to answer them.
http://docs.google.com/present/edit?id=0AS789QYRolhSZGZoand2eF8xMGdoOXBya2Rj&hl=en
TOPIC: Elastic web mining using open source (Hadoop/Cascading/Bixo) in Amazon’s EC2 cloud.
OBJECTIVES: Explain how to leverage on-demand/cloud computing for fast, cost-effective web mining tasks.
Cover common issues around web crawling, scaling, and reliable workflow.
DETAILS: I’ll be describing the problem space in general, and one viable solution (Hadoop/Cascading/Bixo in EC2) in detail, using an example of search engine optimization (SEO) as the concrete use case.
[...] days for the dojo, but a lot is happening there. While we were touring the folks organizing the 2009 Silicon Valley Data Mining Camp that will be held at the dojo were also there making preparations. It’s an interesting [...]
I would also be interested in
TOPIC: Open Source Data Mining Tools
I could give a presentation of the new version of KNIME (http://www.knime.org)
—————
Nicolas,
Yes, that would be of interest as a subject. We can discuss Sunday if it makes sense to join a session or be a separate session. If you are able to bring a computer projector, that could be helpful.
Thanks, Greg Makowski
[...] I’ll be presenting at some point during the day – since it’s an unconference, you don’t really know who’s going to be talking about what/when. My topic is “Elastic web mining using open source (Hadoop/Cascading/Bixo) in Amazon’s EC2 cloud“. [...]
For all of you that have been suggesting topics, I have created sheets with your name and topic title. These can be used at the beginning of the conference to help organize which topics will be held in which rooms at what time slot.
Others that haven’t blogged yet are still welcome to suggest topics. Paper will be available for you to create sheets when you arrive.
If you are facilitating a topic, it would be great if you could find a person to help be a scribe and another person to help with the time keeping of your session.
Let the excitement begin … Patricia Hoffman
=============================
Open Source Data Mining Tools
=============================
Paul O’Rorke talked about Weka, a collection of machine learning algorithms for data mining tasks. Concerns about whether it’s still viable. One person said that pieces of it are still viable for clustering, feature selection.
An attendee mentioned MOA. MOA is a framework for data stream mining. Includes tools for evaluation and a collection of machine learning algorithms. Related to the WEKA project, also written in Java, while scaling to more demanding problems.
David talked about R. Possible to quickly get results by using building blocks from other users. Often data is prepared before processing by R. On the back end are presentation tools. Sweave is a report generation backup that works well with R. Lots of research going on for out-of-memory modeling, to handle larger data sets. Also lots of work in parallel processing. BigMemory is a package for large models. Paul mentioned that R has a steep learning curve. David agreed that R is quirky, especially in terms of memory usage.
Attendee asked about comparing Matlab & R, with respect to viability in a production environment. He’d run into memory problems with Matlab. David said that it was similar, and recommended doing scoring outside of R. He estimates 3-6x more memory is required for R vs. C++.
Attendee said many people use R for prototyping and generating models, but production uses something else. Examples would be Numpy and SciPy.
Paul mentioned that R provides a very compact representation of data mining tasks. (Ken – so it’s the APL of data mining?)
Nicolas Cebron talked about KNIME (pronounced “naim”), a modular data exploration platform. Started in 2004. knime.org has full details. He demonstrated the KNIME application, which has a nice GUI for working with data sets. The model can be output as PMML.
Attendee asked about long-term viability of KNIME. Nicolas said that it’s been around for 4 years, has a vibrant community, and there are commercial companies creating modules.
Ted Dunning talked about Mahout, an Apache open source project with the goal of scalable machine learning/data mining. Java is main language, Hadoop & Lucene are foundation technologies. Currently has good algorithms for clustering, kmeans. Reasonably good classifiers. Supervised learning algorithm. Also recommendation framework called TASTE. Very young project. Has support for sparse matrix math – might pool efforts with Apache commons math project. Mahout is mature enough for some types of machine learning problems.
Chris Wensel from Concurrent talked about the Hadoop distributed file system, and how it differs from Sun’s distributed file system – HDFS is very specialized, optimized for streaming reads. Can’t do random updates to files. Scales to 1000s of servers. Very fault tolerant.
Ken Krugler (your faithful scribe) talked about the HECB (Hadoop, EC2, Cascading, Bixo) stack for web mining. Focus is on the collection and initial processing/reduction of the data, not hard core machine learning & data mining.
[...] here’s the report from today’s ACM Data Mining Camp Silicon Valley. This is not exactly live blogging, but it is neither the deep thinking, so do further research and [...]
Here’s my report from the camp [meant as internal company report initially]:
http://www.zemanta.com/fruitblog/acm-data-mining-camp-sillicon-valley-report/
It was fun, great organizers!
bye
Andraz Tori
I organized a session on Real Time Modeling challenges. There was a total of 12 attendees. Discussion was focused on machine learning problems that require real time modeling. Discussion was divided into the following parts:
1) Data-
i) Collection: it was agreed that this is less of a problem, given the vast amount of real time data available from various sources.
ii) Cleaning: the bigger challenge is to clean noise from the data in near real time.
2) Mining:
i) Automated feature generation: the need to automatically generate features in near real time is challenging. Unsupervised techniques or clustering, along with techniques like SVD, are used. An attendee suggested using SVD before clustering to reduce clustering cost.
ii) Real time model training was also discussed. The difference between active training and real time training was discussed. Model validation, including checking error rates on unseen data and checking for overfitting, was also discussed as an important step.
DJ Patil from LinkedIn questioned whether retraining the models will be more beneficial than extracting new features. It was mentioned that in some cases classification errors are due more to a lack of new features than to decay of the model. Spam was discussed as a specific problem, and an attendee from Google discussed the click fraud problem in this regard.
DJ Patil also discussed that buying SLA time, by showing users that the system is trying to find the best solution, helps.
Project http://hunch.net/~vw/ was mentioned.
Other key discussion topics were K-means clustering and how to find “K”, canopy clustering, and SVD/PCA. An attendee mentioned that for their problem ensemble techniques gave the best results.
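To make the “how to find K” question concrete, here is a minimal sketch in plain Python of one common heuristic: run K-means for several values of K and look at where the within-cluster sum of squares (often called inertia) stops dropping sharply. The notes do not record which approach attendees favored, so this is only an illustration. The toy points, the deterministic farthest-point initialization and all function names are assumptions made for the example, and the SVD step suggested in the session is not shown.

```python
def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def init_centers(points, k):
    """Deterministic start: first point, then repeatedly the farthest point."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist2(p, c) for c in centers)))
    return centers


def kmeans(points, k, iters=100):
    """Lloyd's algorithm; returns the final centers and the inertia."""
    centers = init_centers(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: dist2(p, centers[j]))
            clusters[nearest].append(p)
        # Keep the old center if a cluster ends up empty.
        new_centers = [mean_point(c) if c else centers[j]
                       for j, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    inertia = sum(min(dist2(p, c) for c in centers) for p in points)
    return centers, inertia


# Hypothetical toy data with three well-separated groups.
points = [(0, 0), (0, 1), (1, 0),
          (10, 10), (10, 11), (11, 10),
          (0, 10), (1, 10), (0, 11)]
inertia_by_k = {k: kmeans(points, k)[1] for k in range(1, 5)}
for k, inertia in inertia_by_k.items():
    print(k, round(inertia, 3))
```

On data like this the inertia falls steeply up to K = 3 and only slightly afterwards, which is the “elbow” the heuristic looks for. On real, noisy data the elbow is often much less clear.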
Attendees that I can remember are:
Vipul Sharma, Proofpoint (sharmavipul AT gmail DOT com)
Uri Roduy (varancon AT gmail DOT com)
DJ Patil, LinkedIn
Prayrana Khadye (pkhadye AT gmail DOT com)
Pandu Rudraraju (pandur AT gmail DOT com)
Roy Kamimura CODEXIS (roy DOT Kamimura AT codexis DOT com)
Patrick R Nicolas, Semantic Web (patrick AT pnexpert DOT com)
Ben Gimpert (ben AT somethingmodern DOT com)
I thought the event was a big hit. Lots of people from academia and industry showed up. Companies like Twitter, LinkedIn, Google, IM Shopping etc. were present. Lots of discussion on R. Large scale and near real time machine learning is obviously becoming a louder topic of discussion, and hence Hadoop, Mahout, Bixo etc. were mentioned a lot.
Kudos to organizers!
–
Vipul Sharma
And here’s my report on the open/public dataset discussion:
http://bixolabs.com/datasets/public-datasets/
I agree with Andraz – a very useful event, thanks!
– Ken
Also here’s a better version of my open source data mining tools report (posted above), with some links and cleaned up refs.
http://bixolabs.com/oss/open-source-data-mining-tools/
– Ken
I also posted the Powerpoint of my talk on elastic web mining:
http://www.slideshare.net/kkrugler/elastic-web-mining
Same, but in PDF so you get the notes:
http://www.slideshare.net/kkrugler/elastic-web-mining-2407818
– Ken
Report on biomedical/health-care data mining & Discussion transcript:
http://aurametrix.blogspot.com/2009/11/biomedical-data-mining-dimensionality.html
I’ve posted notes on Ken Krugler’s session on elastic web mining using the HECB open source stack (Bixo, Cascading, Hadoop & EC2) at http://ororke.com/paul/blog/?p=261
I don’t remember any controversial exchanges during the Q&A but if anyone remembers something important that I missed please feel free to post it there. Thx!
Press and Buzz
Thank you all for coming. I appreciate everyone for helping out. It was a great conference. I hope that everyone had a great time.
There are blogs about the camp:
General impression: http://www.zemanta.com/fruitblog/acm-data-mining-camp-silicon-valley-report/
Focus on R: http://bit.ly/3sM5kQ
BioMedical: http://aurametrix.blogspot.com/2009/11/biomedical-data-mining-dimensionality.html
There were many “tweets” about the camp
People even put up twitpics
http://twitpic.com/nx57f
http://tweetphoto.com/xbs1abje
http://img682.yfrog.com/i/b07.jpg/
Here are some slides about the event! http://www.slideshare.net/clibou/datacamp
Summary of presentation by Monica Anderson
Model-Free Methods
2:30
The presentation was based on http://artificial-intuition.com/
Model Free methods
Chaos theory, complexity theory, complex systems all cover some of the same ideas.
Chaotic:
Deep complexity
Nonlinear responses
State (memory)
Irreducible systems:
Open systems
High dimensionality
Time variance: interact with environment, internally modified
Ambiguous input:
Incomplete input
Incorrect input
Multiple opinions
Strange references
Emergence:
Quality of life, testing, maintenance
Bizarre domains: have all the characteristics noted above re: chaotic, irreducible, ambiguous input, emergence
World models, partial worlds models, weather, climate
Life is bizarre: organisms genomics, human interactome
Reductionist models don’t work in bizarre systems.
People: societies, consumers,
Brains, minds, intelligence,
Language is bizarre: has all of the above characteristics
AI Complete: once you solve any one of these problems you have in theory solved all of them.
Chess: not bizarre: it’s a well-contained, well-defined system.
Model-free methods: Not everything in science is done by models or methods.
Reductionist science and everyday life are very distinct. Life sciences are part reductionist, part non-reductionist.
Models are derived from low level observation –> scientific intuition –> scientific model.
Leibniz:
Engineers use models: low level observation –> engineer’s intuition –> scientific model (preconditions) –> scientific predictions.
Model free methods: subscientific intuition
Low level observation –> experience & intuition –> intuitive prediction. In a bizarre domain you cannot create or generate a model.
Easy to bizarre problems.
Newtonian mechanics
GPS navigation
…
language semantics
Full scientific models
Partial models (perilous)
Simulations
Statistics (depends on large numbers, distribution)
Non-parametric models to figure out or determine the distribution
Pseudo models
Model free models
The harder your problem the weaker your model and the more data you need.
Required understanding: full domain understanding, decent data, know nothing.
Evolution is a model free method.
Model free method zoo:
Language, evolution, adaptation, abstraction, learning, recognition, discovery, trial and error, repeat success, table lookup, consultation (very powerful).
Memory helps for progression. Sensory input.
Movie recommendations:
Do not attempt to model the movies
Don’t attempt to model the customers
Big data: use weaker models: non-parametric models, pseudomodels, model free methods.
Holism is opposite of reductionist methods.
Simplify problems by reduction, reduce a whole to its parts, reduce context dependencies, reduce to next lower discipline, reduce your data, reduce wasted effort.
A revolutionary capability:
Reductionists are masters of methods
Holists have superior ontology
Holists: work on these important problems.
Big data makes holistic methods possible
Model free methods are holistic.
If the problem is too hard, get MORE data.
Requires large computers.
Weaker models: let data speak for itself
Used in genomics and other life sciences
Reductionism: optimality, completeness, repeatability, timeliness, parsimony, economy, transparency, scrutability. Correctness, infallibility, portability, incrementality, idealism
Holism: discovery, learning, self-organizing systems, saliency, abstraction. Robustness, flexibility, diversity, synergy, pragmatism.
R: deduction, deduced novelty
H: induction, abduction, true novelty.
Summary:
In many important but difficult problem domains, building reductionist models is impossible. We call these Bizarre Domains.
In general the harder your domain the weaker your models, and the more data you need.
Big data makes model free methods possible.
Notes (thanks to Greg) on Automated Trading Systems: What are the questions?
3:30 What are the questions in financial trading? Steve Umfleet
Slides at:
http://docs.google.com/present/view?id=0AS789QYRolhSZGZoand2eF8xMGdoOXBya2Rj&hl=en
Fundamental analysis (price should be)
Technical analysis (price will be)
Effect of news
What does Thorp do:
Liquidate position; don’t trade anything that has a news event
Trade off volatility: don’t necessarily care whether the news is favorable/unfavorable
Internet has sped up recognition of bullish news on equities
Less-well followed companies may have longer response periods
A Random Walk Down Wall Street
Sharpe: transaction costs make market effectively random
Shiller: Irrational Exuberance
What about seasonal effects? Momentum effects?
Knowledge about how to exploit complexity in market diffuses
Pairs trading worked for a while
Index trading worked for a while
Database scans: pattern matches. Brittle over time.
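For readers who have not seen pairs trading, a rough sketch of the usual idea follows: watch the spread between two related stocks and bet on it reverting when it strays far from its recent average. This is only an illustration under stated assumptions, not anything presented in the session. The function name, the plain price-difference spread (no hedge ratio), the window length, and the entry threshold are all hypothetical, and transaction costs are ignored.

```python
from statistics import mean, stdev


def pairs_signal(prices_a, prices_b, window=5, entry_z=2.0):
    """Return 'short_a_long_b', 'long_a_short_b', or 'flat' for the latest bar.

    Assumptions for illustration: the spread is a plain price difference and
    the thresholds are arbitrary.
    """
    if len(prices_a) != len(prices_b) or len(prices_a) < window + 1:
        raise ValueError("need two equal-length series longer than the window")
    spread = [a - b for a, b in zip(prices_a, prices_b)]
    history = spread[-window - 1:-1]  # the window just before the latest bar
    sigma = stdev(history)
    if sigma == 0:
        return "flat"  # no variation in the window, so no z-score to act on
    z = (spread[-1] - mean(history)) / sigma
    if z > entry_z:
        return "short_a_long_b"  # spread unusually wide: bet on it narrowing
    if z < -entry_z:
        return "long_a_short_b"  # spread unusually narrow: bet on it widening
    return "flat"


if __name__ == "__main__":
    a = [10, 11, 10, 11, 10, 15]
    b = [9, 10, 9.5, 10, 9.5, 9.5]
    print(pairs_signal(a, b))
```

A fixed rule like this is exactly the kind of pattern the notes call brittle over time: once enough traders run it, the mispricing it looks for tends to disappear.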
Trading system: when to enter or exit stock
Stock recommender system: identify a market situation and stocks to go long/short in. Diagnosing market state.
Portfolio management system: cash, stocks, bonds.
The moon
Portfolio allocation theory: academic studies that say you do better by focusing on allocation rather than picking specific stocks or temporal trading
Another academic study: Black-Scholes assumptions say that technical trading doesn’t work
Sharpe assumes that there is a correct price for a stock. Question: is there a single correct price (chaotic attractor)?
Regime switching, timing, shifts in market timing: map to concepts in chaos, non-linear dynamics
Individual investors need to work on smaller stocks
Crossy
Paradigms
Neural networks
SVM
Decision trees
SVD (see empirical orthogonal functions)
Genetic algorithms
Statistical filtering
Fuzzy logic
Production systems
Will need to use multiple paradigms
What are you looking for? Need some hypothesis about what creates mispricing.
Tremendously high dimensionality and volume
Awash in data
Machine skills complement human skills
Psychology of human sense systems
Mirages exist in all sorts of human systems for perception
Emotionality
Tradestation (use forum, not doc), Neuroshell, SVMLight, libSVM, NeuralWare
TreeNet
NinjaTrader
Opinion: trading system should be completely automatic, no room for emotion
More feedback from the camp can be found here: http://www.ai-meetup.org/calendar/11592442/
Web Mining in the Cloud
Ken Krugler’s presentation on elastic web data mining is discussed on Paul O’Rorke’s blog (http://ororke.com/paul/blog/?p=261). Ken is the founder of Bixo Labs, Inc.
The Bixo site (http://bixolabs.com/2009/11/04/paul-ororke-summary-of-elastic-web-mining-talk/) shows how Bixo and Cascading do much of the work for you.
Ken Krugler’s talk can be found here
http://bixolabs.com/2009/11/02/elastic-web-mining-talk/
Shamod Lacoul gave a Semantic Web Talk which can be seen here http://www.cliveboulton.com/post/231943312/semantic-web-overview-via-shamod-dmcamp
Semantic Web demos can be seen here
http://www.hewettresearch.com/svcc2009/
A great text for beginners looking into the Semantic Web …
“Explorer’s Guide to the Semantic Web” by Thomas B. Passin
Sign up to be notified of the next Data Mining Camp
http://tinyurl.com/acmdmsig
Hello,
I am starting to organize another Data Mining Camp, perhaps Sat 3/6 or 3/20.
While Hacker’s Dojo is a great facility, with 225 people participating last time, and expectations of more people next time, I am investigating options for other locations. I am seeking contacts at large companies or universities, to discuss holding the next free Data Mining Camp.
I am also seeking constructive suggestions for improving the next DM Camp – please send to my email (listed at the top of this posting).
Thanks, Greg
To view videos of the conference look at the following web sites:
video.syntience.com and vimeo.com
Thanks to Syntience and Monica Anderson!
Comments posted praising the November Conference!
Patrick Nicolas
“I do not think that you can gather so many experts in this field on a Sunday afternoon anywhere in the world except Silicon Valley. The organization of the camp was just awesome. I was able to share both technical issues and potential solutions with some very smart engineers. Greg and Tricia did a remarkable job.”
Bill Tang
“This is a great camp. The organizer did a fantastic job. The panel and the people leading the sessions are some of the best people in this field. I wish I could have attended all the parallel sessions.”
Bob Blum
“Fantastic! Great lectures, bright participants, fascinating subject.”
Here is the SESSION MATRIX from the 11/1/2009 Data Mining Camp.
Please send any updates or corrections, and I will revise.
Thoughts on the SF Bay Data Mining Camp
I had a great time at the Data Mining Camp hosted by the ACM yesterday. The event was full of energy — despite original projections of around 100 attendees, in fact more than 200 people from around the region turned up to meet and discuss various aspects of data mining. This was an “unconference” – except for a 30-minute panel discussion (great discussion from folks at Google and LinkedIn, amongst others) there was no pre-set agenda. Instead, people proposed topics for discussion, participants expressed interest with a show of hands, and the talks were allocated to timeslots and rooms accordingly.
I wish I’d taken a photo of the papers stuck to the wall making up the final schedule — I’d guess in the end there were well over 20 talks on various topics related to data mining. Large-scale data processing with Hadoop and its machine-learning cousin Mahout was a hot topic. So was financial data mining (which was surprising to me, for a West Coast event). There were also talks on the semantic web, natural language processing, and many other topics I can’t remember now.
I proposed a topic on Data Mining and Machine Learning with R, and by a show of hands about 50 people were interested. Then someone proposed a topic on Basic R, and more than 80 people signed up for that one. So there was a lot of interest in R at the conference: lots of people had heard about R but hadn’t yet used it. I participated in both sessions. For the Basics session, I gave an introduction to R resources for beginners and to the R syntax. For the Machine Learning session, I relied heavily on Josh Reich’s machine learning script and other blog posts about predictive analytics. There was also a general session on open-source data mining tools. In addition to R, there was an interesting demo of KNIME (a workflow-based data mining tool that reminded me of Insightful Miner). It was interesting to see that KNIME can run algorithms from R and Weka, too.
All in all, a really invigorating event. Many thanks to the folks from the ACM for organizing it.
SF Bay ACM: 2009 Silicon Valley Data Mining Camp
Here are the slides from my talk
Recommended Link: the public terabyte dataset project page (http://bixolabs.com/datasets/public-terabyte-dataset-project/).