Principal Data Scientist
- San Francisco, California (San Francisco Bay Area)
Peter Skomoroch's Overview
- Equity Partner at Data Collective
Peter Skomoroch's Summary
I'm a data scientist and entrepreneur focused on building intelligent systems to automate tasks and enable better decisions. I specialize in solving hard algorithmic problems, leading cross-functional teams, and developing engaging products powered by data and machine learning.
Most recently, I applied my skills to the consumer internet space at LinkedIn, the world's largest professional network, where I was an early member of the data science team. As Principal Data Scientist, I led data science teams focused on reputation, search, inferred identity and building data products. I was also the creator of LinkedIn Skills & LinkedIn Endorsements. Endorsements was one of the fastest growing new product features in LinkedIn's history with over 3 billion endorsements of more than 70 million members within the first year after launch.
Before joining LinkedIn, I was Director of Analytics at Juice Analytics and a Senior Research Engineer at AOL Search. In a previous life, I developed price optimization models for Fortune 500 retailers, studied machine learning at MIT, and worked on biodefense projects for DARPA and the Department of Defense. I have a B.S. in Mathematics and Physics from Brandeis University and research experience in Biology and Neuroscience.
Peter Skomoroch's Skills & Expertise
- Machine Learning
- Data Science
- Big Data
- Information Retrieval
- Product Management
- Text Mining
- Recommender Systems
- Data Mining
- Statistical Learning
- Collaborative Filtering
- Natural Language Processing
- Artificial Intelligence
- Data Analysis
- Apache Pig
- Amazon Web Services (AWS)
- Public Speaking
- Data Visualization
- Ruby on Rails
- Text Classification
- Information Extraction
- Amazon EC2
- Cloud Computing
- Amazon Mechanical Turk
- LinkedIn Endorsements
- Rapid Prototyping
- Neural Networks
- Putting Out Fires
- Open Source
- Soul Retrieval
- Distributed Systems
- Yacht Racing
Peter Skomoroch's Experience
August 2011 – Present (3 years 2 months)
Helping Matt Ocko and Zachary Bogue with due diligence and advising Data Collective portfolio companies. Data Collective (DCVC) invests in data startups focused on infrastructure, analytics, and applications that leverage data, including verticals like lending, travel, customer service, and medical research. DCVC is an investor in companies like Kaggle, Parse, Trifacta, MemSQL, Interana, PlanetLabs, Platfora, ZenPayroll, FlipTop, Freshplum, MongoHQ, LendUp and Moleculo.
Principal Data Scientist
Public Company; 5001-10,000 employees; LNKD; Internet industry
September 2009 – October 2013 (4 years 2 months) Mountain View, CA
Led teams of Data Scientists focused on Reputation, Inferred Identity and Data Products. Was lead Data Scientist and creator of LinkedIn Skills & Endorsements, one of the fastest growing new products in LinkedIn's history. By October 2013, one year after launch, we had reached over 3 billion member endorsements, adding rich skill data and reputation signals to over 60 million member profiles.
Our projects included features like LinkedIn Skills, Suggested Skills, PeopleRank, Endorsements, and InMaps. Our team's specialties include entity extraction & discovery, recommendation algorithms, economic insights, network intelligence & dynamics.
In late 2009, as Sr. Data Scientist, I built the original prototype of LinkedIn Skills using Hadoop & Rails, then worked with a talented team of engineers & designers to build and ship Skills on LinkedIn.com. I served in a dual role as Product Manager and Sr. Data Scientist for 6 months following the launch of Skills before moving into management roles.
We worked on a number of other efforts that mined information from LinkedIn profile content, the social graph, and external data sources to build data driven products and surface actionable insights for members. Our tool set included things like Hadoop, Pig, Hive, Voldemort, Mechanical Turk, Java, Python, NLTK, along with various machine learning and numerical libraries.
Nonprofit; 1-10 employees; Internet industry
November 2011 – September 2013 (1 year 11 months) San Francisco Bay Area
Common Crawl is a non-profit foundation dedicated to building and maintaining an open crawl of the web, thereby enabling a new wave of innovation, education, and research.
2006 – October 2009 (3 years) Mountain View, CA
Lead consultant at Data Wrangling http://datawrangling.com offering software development services for clients in need of scalable data mining or search applications.
Built http://trendingtopics.org an open source Rails application that identifies trends on the web by using Hadoop, Hive, and Python to process Wikipedia log files on Amazon EC2.
Wrote articles and documentation for companies such as Cloudera and Amazon Web Services to demonstrate scalable processing of Netflix ratings, Last.FM listening data, and Wikipedia logs.
Designed and built the backend of an on-demand proteomic search system for a bioinformatics client. Released core code as the “ec2cluster” project on GitHub: a Rails web console, including a REST API, that launches temporary MPI clusters on Amazon EC2 for scalable parallel processing.
Provided basic consulting services for clients running MPI on EC2. Released Elasticwulf on Google Code: Python command line tools to launch and configure a distributed cluster on Amazon.
Machine learning consultant to a small investment fund. Mined commercial financial data, SEC filings, and alternative information sources on the web using machine learning and Mechanical Turk.
Director of Analytics
Privately Held; 11-50 employees; Information Technology and Services industry
2008 – 2009 (1 year)
Developed a Django-based web analytics application called Concentrate http://www.concentrateme.com/ that discovers and visualizes patterns in search query data. Built backend infrastructure for text mining on Amazon EC2, SQS, and S3 using boto. Data processing was mainly implemented with SciPy, C, and the Python Natural Language Toolkit. Automated continuous integration on EC2 with Selenium, Hudson, and PyUnit. The payment system used Satchmo; deployment was done via Capistrano and Puppet.
Developed a scalable pattern clustering algorithm for Concentrate that automatically discovers patterns in large amounts of search data and clusters long tail queries into manageable groups. http://www.datawrangling.com/search-map-interactive-visualization-of-query-clusters
Represented Juice at several conferences including giving a talk at PyCon 2008 on processing data with Amazon EC2 http://www.datawrangling.com/pycon-2008-elasticwulf-slides
Consulted on several client projects, including processing marketing survey data for a media company and analyzing spatial vehicle usage patterns in customer data for FlexCar.
Sr. Research Engineer
Public Company; 5001-10,000 employees; AOL; Internet industry
2006 – 2007 (1 year)
Member of the Search Analytics team at AOL
Developed search referral prediction system that applied machine learning techniques to query logs, web crawl data, and internal server logs to recommend site improvements and measure external competition in multiple content areas. Implemented using Nutch and Hadoop, along with Python NLTK, NumPy, and SciPy.
Lead engineer on a project building a web-based search analytics tool used to track the timing of bot activity in web logs, identify uncrawled sections of web properties, and improve the crawlability of large websites. The system included a Ruby on Rails front end and a REST API to serve graph data and metrics. The backend used Python logfile parsers and a Hadoop cluster to build link graphs and summarize page content.
Educational Institution; 1001-5000 employees; Defense & Space industry
April 2004 – July 2006 (2 years 4 months)
Designed and implemented a prototype web-based decision support system and sensor data warehouse using Python & Oracle. Role included direction and training of junior staff members, design of underlying data models, system interfaces, & data visualization components.
Principal software & algorithm engineer developing dense, low-cost chemical and biological sensing networks using wireless sensor motes. Wrote sensor network detection algorithms, designed network data warehouse, constructed web-based front end for the system, and wrote embedded nesC code for Mica2 Crossbow wireless sensor boards. Performed simulation & analysis of the system in Matlab.
Applied machine learning techniques to significantly improve the accuracy of a prototype sensor that detects pathogens in time-resolved environmental measurements.
Implemented prototype system to collect health information via cell phones and display population data in real-time via the web. Java architecture included a Quartz job scheduler, Sprint Location SOAP Web Services, PKCS12 security, request throttling logic, Oracle Spatial, & AJAX to display live results on Google maps.
Developed embedded TCP/IP socket layer code in C for a TI DSP based biosensor. Implemented an embedded web server in C on the sensor with SOAP access for automatic sensor discovery.
Designed data warehouse and web service infrastructure for the integration of streaming real-time sensor data. Wrote object-oriented C++ hardware drivers to process and upload large amounts of streaming data to Oracle in real time.
Developed Matlab simulations analyzing U.S. Census data in combination with environmental spatial datasets to study the effects of air particulate deposition under multiple weather conditions.
Constructed performance models of an indoor biological sensor system for the protection of buildings. The models evaluated technical performance, cost, and simulated operations of the system to optimize sensor layout.
Privately Held; 10,001+ employees; Financial Services industry
October 2003 – April 2004 (7 months)
Built Oracle PL/SQL logic for brokerage applications to analyze campaign effectiveness, report trends, and track customer interactions. Constructed Java servlets and SOAP web services to process XML database requests. Performance tuned slow running applications and optimized SQL statements.
Worked with offshore development teams in India and Ireland to develop Oracle and Siebel applications. Prototyped new database error handling and debugging approaches along with an automated build/test/deployment process for database code using Ant, JUnit, and DbUnit.
Privately Held; 201-500 employees; Computer Software industry
November 2002 – October 2003 (1 year)
Part of the Calc Engine team: backend system processed historical time series of retail transaction data to estimate prediction model parameters. Designed and loaded database schemas containing these parameters for use by the price optimization engine. Wrote and tuned SQL queries used by the forecast engine. Ported Oracle PL/SQL code to Java Stored Procedures for Oracle/DB2 dual platform product release.
Worked with ProfitLogic clients (Fortune 500 retailers) to refine business requirements for our products and rapidly fix performance issues / bugs. Often obtained performance improvements of 5-10x in slow SQL queries. Commended by clients and management for immediate resolution of issues.
Assisted R&D group with projects involving maximum likelihood estimation, Bayesian parameter estimation, genetic algorithms, seasonality, and clustering.
Privately Held; 201-500 employees; Computer Software industry
June 2000 – November 2002 (2 years 6 months)
Responsible for running the weekly forecast and price optimization model of our first major client (JCPenney). On call to resolve issues with the client and algorithm recommendations, ensuring that we met service level agreements.
Surfaced model accuracy issues with senior management and was allocated resources to construct an out-of-sample forecast testing system using Oracle, Mathematica, and Python. Worked with R&D to develop an improved model that became part of the standard software release. Developed empirical methodology for results measurement that was used to demonstrate up to 15% improvement in profits for clients.
Analyzed retail transaction data stored in Oracle and Teradata using Mathematica & Python to characterize the influence of climate, price, promotional events, holidays, store-performance, and other demand drivers on sales of a wide range of merchandise types. Developed production forecast model parameter estimation code in PL/SQL.
Ran forecast tests on data from prospective clients, came up with ROI and value propositions, and developed compelling information visualizations for PowerPoint decks and sales pitches.
Peter Skomoroch's Patents
Methods and Systems for Exploring Career Options
- United States Patent US20120226623 A1
- United States Patent Application 13/672,377
- Filed November 8, 2012
- United States Patent US8650177
- Issued February 11, 2014
Machine automated method of identifying a set of skills
Skill Ranking System
- United States Patent Application 13/357,302
- Filed January 24, 2012
Skill Customization System
- United States Patent Application 13/357,360
- Filed January 24, 2012
Inferring and Suggesting Attribute Values For a Social Network Service
- United States Patent Application 13/629,241
- Filed September 27, 2012
Methods & Systems for Recommending Decision Makers in an Organization
- United States Patent Application 3080.132PRV
- Filed September 30, 2013
- United States Patent Application 14/292,779
Peter Skomoroch's Education
Nondegree Student, Machine Learning
2004 – 2005
B.S., Mathematics, Physics
1996 – 2000
Operations Intern, Cignal Global Communications - Cambridge, MA, 1999-2000
Physics Research Assistant, Bucknell University - Lewisburg, PA, Summer 1999
Anatomy & Cell Biology Research Assistant, SUNY Health Science Center- Syracuse, NY, 1997-1998
Biophysics Research Assistant, SUNY Health Science Center - Syracuse, NY, 1995-1996
Neuroscience Research Assistant, Institute For Sensory Research, Syracuse, NY, 1994-1995
Campus jobs included: Undergraduate Physics TA, Electronics Technician for Physics Department, Calculus Grader, Physics Tutor, Calculus Tutor
Peter Skomoroch's Courses
Nondegree Student, Machine Learning
Massachusetts Institute of Technology
- Machine Learning (6.867)
- Neural Networks (9.641J)
- Real Analysis (18.100B)
Peter Skomoroch's Projects
- October 2009 to Present
LinkedIn Skills & Expertise is a set of tens of thousands of topic pages automatically constructed from LinkedIn profiles and external data sources. Using a variety of signals, we identify the most relevant people, places, and companies for each topic, track trends, and suggest skills users may want to add to their profiles.
- November 2011 to November 2011
Organized LinkedIn's first Veterans Hackday in conjunction with the White House to encourage hackers all over the country to build projects that benefit veterans. We had 44 projects submitted from around the country, 11 awesome finalists, and the celebrity judges (Tim O'Reilly, Sumit Agarwal, Jeff Weiner, Chris Vein) picked 3 amazing winners.
- September 2011 to Present
DataFu is a collection of user-defined functions for working with large-scale data in Hadoop and Pig. This library was born out of the need for a stable, well-tested library of UDFs for data mining and statistics. It is used at LinkedIn in many of our offline workflows for data-derived products like “People You May Know” and “Skills”.
- June 2012 to Present
Interface design adding social proof and a lightweight endorsement action to Profile Skills. This feature leveraged earlier work on Profile Guided Editing, and used the same guided UI to suggest skill endorsements to profile viewers. Recipients of an endorsement receive an email and on-site notification, with a landing experience that suggests they endorse people they know, creating a feel-good viral loop.
Peter Skomoroch's Volunteer Experience & Causes
Causes I care about:
- Economic Empowerment
- Science and Technology
Organizations I support:
- Code for America
Peter Skomoroch's Additional Information
Machine learning, Information Retrieval, Search, Data Mining, Physics, Embedded Systems, Wireless Sensor Networks, Computational Neuroscience, Mathematical Finance, Optimization Algorithms, Prediction Markets, Collaborative Filtering, Parallel Programming and Cluster Computing, Python, Ruby, Web Frameworks, Mashups, General software engineering, Analytics, Data Visualization
- Groups and Associations:
Data Drinking Group
- Honors and Awards:
Westinghouse Science Talent Search Semifinalist, Brandeis University Presidential Scholarship, ...
Contact Peter for:
- expertise requests
- getting back in touch