In our practice, we often hear this from clients. Sometimes they don’t reach out to us until they’ve invested heavily in data engineering. Unfortunately, misconceptions like this too often cost startups money and time. Let’s look at a couple of examples.

One client asked us for a statistical analysis of an intervention study. The end goal was to determine the efficacy of the intervention sessions on several cohorts over time. They had a group of interns create a simple web app to collect feedback pre- and post-intervention. They had 2 years’ worth of data on 5-10 cohort groups with about 15-30 people per group. The developers were told to make sure individual responses would remain anonymous.

Unfortunately, the developers did not consider that a proper statistical analysis still requires linking responses to each individual. We don’t need to know the person’s details, just that response A and response B are the pre- and post-intervention responses of the same person. Because of this oversight, we could not perform any substantial analysis. This was very disappointing, considering that a couple of hours of a statistician’s time early in development would have prevented 2 years of study from going to waste.
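To make the linkage idea concrete, here is a minimal sketch in Python of one way a feedback app could keep responses linkable without storing anyone’s details. It is not what the client’s app did. The names `pseudonym`, `pair_responses`, `study_key`, and the example tokens and scores are all hypothetical. It assumes each participant has some stable token (such as a login email) that the app can hash with a secret key before anything is saved. Strictly speaking this is pseudonymous rather than anonymous, so the key has to be kept away from whoever analyzes the data.

```python
import hashlib
import hmac
import secrets
from collections import defaultdict


def pseudonym(participant_token: str, study_key: bytes) -> str:
    """Return a stable ID that cannot be reversed without the secret key."""
    digest = hmac.new(study_key, participant_token.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def pair_responses(responses):
    """Group responses by pseudonym and keep people with both pre and post."""
    by_person = defaultdict(dict)
    for r in responses:
        by_person[r["pid"]][r["phase"]] = r["score"]
    return {
        pid: (phases["pre"], phases["post"])
        for pid, phases in by_person.items()
        if "pre" in phases and "post" in phases
    }


# Hypothetical secret, generated once per study and never stored with the data.
study_key = secrets.token_bytes(32)

# Only the pseudonym, the phase, and the score are stored; the token is not.
responses = [
    {"pid": pseudonym("alice@example.com", study_key), "phase": "pre", "score": 3},
    {"pid": pseudonym("bob@example.com", study_key), "phase": "pre", "score": 4},
    {"pid": pseudonym("alice@example.com", study_key), "phase": "post", "score": 5},
]

pairs = pair_responses(responses)  # only the first participant has both phases
```

With pairs like these, a statistician can run a paired comparison of pre and post scores, which is exactly what the unlinked data could not support.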

Let’s look at another example.  An early-stage startup in the e-commerce space asked us to develop an AI system that would continuously look for product development opportunities.  For the input data, they wanted to create a big data web scraping system to collect product data.  We cautioned them against this approach for several reasons.  First, this effort was not aligned with their core business function.  Second, there were already established companies in this space and it would be cheaper to pay for access to that data.  Third, they did not know what data to grab and what the value of the information was.  

Unfortunately, they did not heed our warning and went ahead with the project. One year later, they were spending several tens of thousands a month to support the project’s infrastructure, and the vast majority of the data was irrelevant to any analysis. Eventually, they ended up deleting a year’s worth of scraped data.

Here’s the takeaway. Don’t go through a big data engineering effort to collect data and then expect a data scientist to come in at the end and work magic. It’s very important to have data scientists, statisticians, or data analysts take part in the initial planning of any data engineering effort. Just a few hours of their time can prevent a large waste of effort and money.

Here’s an extra insight for startups. Don’t just collect data without thinking through its value. What are you going to do with it? You can’t use the same approaches that Google or Facebook use. You need to focus on efficient testing of ideas or models and on actionable analysis.