Why it is important to know what your end goal is…
When I set out on a two-month journey to complete my capstone project for Flatiron School, effectively ending the first chapter of my story of becoming a data scientist, I knew that I wanted to have something more tangible than a business presentation to show for the work. It is my belief that the best data scientists are those who can put the information they wrangle into the hands of a client or concerned party and allow them to make their own discoveries. Data scientists are gatekeepers of a hidden world of information that surrounds us, and it is our job to guide people toward key insights.
With that goal in mind, I knew I wanted to build a Dash app. It sounded like an exciting way to put ML into production and would be my first attempt at bringing ML to life outside of a Jupyter Notebook. I also felt it would suit the data I was planning to work with: bike rides taken by users of the Citibike bike-share program. My aim was to develop a model that could accurately predict whether there would be bikes available at a given bike station at a given time in the future. With a little work, I knew I could make an interactive dashboard in Dash that would let a user select a station and a future date and get a prediction of whether bike availability would be High, Moderate, or Low.
Not having used Dash before, I wanted to collect and clean my data and build out my classification models before really wrapping my head around what this new process would be. My sole focus at the outset was making progress on the data. This ended up being somewhat costly. Without a clear understanding of what the final product would be, I did not know which features I needed to select. Not knowing which features I needed meant that my first round of model tuning was largely wasted, as I had to re-tune the models every time my inputs changed.
My first step toward Dash was getting familiar with PyCharm. Dash does not require it, but I had only been coding in Jupyter Notebooks, and PyCharm was widely recommended as an IDE for Dash work. For those who are unfamiliar, PyCharm is an IDE, or Integrated Development Environment: an application for writing, running, and debugging code. From my basic use of PyCharm, I can say it offers some great features, like high-quality code completion and support for code written in other languages, specifically SQL and other database languages. That said, it is a hefty, memory-intensive program and can take some time to set up initially.
Once you have the program installed, you might want to consider setting up a virtual environment. If you already have an environment set up for your Jupyter use, you can clone that in a virtual environment. I plan to write another post on that process soon, as it can be rather confusing. This is an important step because the environment is what the Project Interpreter references, and it is where your installed libraries will be pulled from. You can adjust which virtual environment you reference by navigating to File → Settings → Project Interpreter (the exact menu labels vary between PyCharm versions and operating systems).
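If you are ever unsure which environment PyCharm is actually using, you can ask Python directly from a script run inside the IDE. This is a minimal sketch using only the standard library; the helper name `describe_interpreter` is my own and not part of PyCharm or Dash.

```python
import sys


def describe_interpreter():
    """Return details about the Python interpreter running this script."""
    return {
        # Full path to the python executable that was launched
        "executable": sys.executable,
        # Root folder of the active environment
        "prefix": sys.prefix,
        # True inside a venv/virtualenv environment; conda environments
        # report False here because their prefix and base prefix match
        "in_virtualenv": sys.prefix != sys.base_prefix,
    }


for key, value in describe_interpreter().items():
    print(f"{key}: {value}")
```

If the printed executable path does not sit inside the environment you expected, the Project Interpreter setting is the first place to look.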
If you don’t have a library you want to work with, it is easy enough to pip install the package or you can find it in a list of available items to bring in directly through PyCharm. Pretty cool stuff!
Here is where I ran into some trouble. Because I already had a completed dataset and models that had been trained, tuned, and tested, once I was up and running with Dash I began to design the application around the model inputs, NOT around what an end user could realistically provide through the interface. Once I “put pen to paper” to map out what information a user would be able to provide, I realized that a few of the things I was planning to give my model would not be known by the user.
For example, I had used historic ride information to generate a field holding the number of bicycles in a station at a given day and time. I was feeding in the exact bike count at the station for each event, when really I needed a generalized or average count of bikes for each unique combination of station, day of the week, and hour of the day, which could be fed into the model without user input. I was also using temperature as an input, which is not something a user would want to provide themselves. Instead, I should have set up an integration with a weather API that would supply an hourly forecast for the next 7 days. Some of these problems I was able to fix; others I am still working through.
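To make the averaged bike count idea concrete, here is a minimal sketch of a lookup keyed on station, day of the week, and hour of the day. The sample rows, station names, and function names are invented for illustration and are not taken from my dataset; with a DataFrame, the same idea maps onto a pandas groupby over those three columns.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical historical snapshots: (station, timestamp, bikes docked)
history = [
    ("Station A", datetime(2020, 3, 2, 8), 10),  # a Monday, 8am
    ("Station A", datetime(2020, 3, 9, 8), 14),  # the following Monday, 8am
    ("Station A", datetime(2020, 3, 3, 8), 4),   # a Tuesday, 8am
    ("Station B", datetime(2020, 3, 2, 8), 20),  # a Monday, 8am
]


def build_average_lookup(rows):
    """Average bike count for each (station, weekday, hour) combination."""
    totals = defaultdict(lambda: [0, 0])  # key -> [sum of counts, row count]
    for station, timestamp, count in rows:
        key = (station, timestamp.weekday(), timestamp.hour)
        totals[key][0] += count
        totals[key][1] += 1
    return {key: total / n for key, (total, n) in totals.items()}


def expected_bikes(lookup, station, when):
    """Typical count for a future datetime, or None if never observed."""
    return lookup.get((station, when.weekday(), when.hour))


lookup = build_average_lookup(history)

# A future Monday at 8am falls back on that station's Monday 8am average
future = datetime(2020, 3, 16, 8)
print(expected_bikes(lookup, "Station A", future))  # 12.0
```

Because the lookup only needs a station and a datetime, the app can fill in this feature itself and the user never has to know how many bikes are docked.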
So here are my takeaways from the project:
By underthinking at the outset, I ended up building the plane while I was flying it. The real cost of that was time: time spent writing code to structure the initial round of data that was eventually rewritten; time spent waiting while models iterated through grid search cross-validation for the nth round after I’d made changes; and time spent being more than a little exasperated when I found myself looking at the first build of the application and realizing it didn’t make much sense.
I hope the questions below will help guide any readers looking to embark on a similar project where a comfortable, consumable product is the aim. I know they are the questions I’ll start with next time!
1. What is the consumer supposed to be able to take away from their interaction with your product?
a. For me, this was something I knew early on: I wanted the user to learn how likely it was that a bike would be available at a given date and time.
2. How much information will the consumer be able to bring to the table in their interaction with your application? How much will you as the data scientist need to provide for them?
a. This is where I jumped the gun by assuming the count of bikes in a station at a given time would be common knowledge, an approach I later changed.
3. What is the structure of the interface, and how will the choices made by the user be assembled into the input you pass to model.predict(input)?
a. Working through this problem was probably some of the most rewarding work on the project. It required learning dynamic dropdown input lists in Dash and was a great practice exercise structuring dictionaries.
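As a rough illustration of question 3, here is one way the pieces can fit together in Dash: the first dropdown filters the options of the second, and the selections are packed into a dictionary of features. This is a simplified sketch rather than the code from my app. The area-to-station grouping, component ids, feature names, and the `predict` callable are all placeholders.

```python
from datetime import date

# Hypothetical structure: area -> stations in that area
STATIONS_BY_AREA = {
    "Area 1": ["Station A", "Station B"],
    "Area 2": ["Station C"],
}


def station_options(area):
    """Dropdown options for the stations in the chosen area."""
    return [{"label": s, "value": s} for s in STATIONS_BY_AREA.get(area, [])]


def build_model_input(station, date_string, hour):
    """Turn the user's selections into one row of model features."""
    # The date picker supplies an ISO string; keep only the YYYY-MM-DD part
    chosen = date.fromisoformat(date_string[:10])
    return {
        "station": station,
        "day_of_week": chosen.weekday(),  # Monday is 0
        "month": chosen.month,
        "hour": hour,
    }


def create_app(predict):
    """Wire the helpers into a Dash app.

    `predict` is any callable mapping a feature dict to a label; it stands
    in for encoding the row and calling a trained model's predict method.
    """
    # Imported here so the helpers above can be used without Dash installed
    from dash import Dash, Input, Output, dcc, html

    app = Dash(__name__)
    app.layout = html.Div([
        dcc.Dropdown(
            id="area",
            options=[{"label": a, "value": a} for a in STATIONS_BY_AREA],
        ),
        dcc.Dropdown(id="station"),  # options are filled in by the callback
        dcc.DatePickerSingle(id="ride-date"),
        dcc.Slider(id="hour", min=0, max=23, step=1, value=8),
        html.Div(id="prediction"),
    ])

    @app.callback(Output("station", "options"), Input("area", "value"))
    def update_stations(area):
        return station_options(area)

    @app.callback(
        Output("prediction", "children"),
        Input("station", "value"),
        Input("ride-date", "date"),
        Input("hour", "value"),
    )
    def update_prediction(station, date_string, hour):
        if not (station and date_string):
            return "Pick a station and a date."
        return predict(build_model_input(station, date_string, hour))

    return app
```

To try it locally, call something like `create_app(lambda features: str(features)).run(debug=True)` and swap the lambda for a function that wraps your own model. Keeping `build_model_input` separate from the callbacks makes it easy to check that the dictionary matches the features the model was trained on.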
You can find a gif of the app in the Readme of the project here (my next step is figuring out how to deploy a Dash app on Heroku!).
Best of luck in your Dash learnings!