João Falanga
- Jan 11, 2023
- 4 min read

What is Data Science?

Understanding Data Science and applications

These days, "data science" is an umbrella term that covers many things. So I'm going to give you an example of one part of what we usually call data science.

Think of a company that sells ice cream or other cold items that we like to have, mainly on hot days. If you start analyzing the monthly or daily sales and put the temperature of the day, or the average temperature of the month, next to those numbers, you will notice a relationship or, more specifically, a correlation between these two sequences of values.

When the temperature is higher, more ice cream is consumed; when it is lower, less is consumed. The two values rise and fall together.

You can also look at other variables, such as "the weather was cloudy" or "it rained", and find that the correlation goes the opposite way: when it rains, consumption is lower. This kind of behavior is not necessarily a cause-and-effect relationship, but it is a correlation between the numbers.
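To make the idea concrete, here is a minimal sketch in plain Python of how you could measure those two correlations. The daily figures are invented for illustration, and the `pearson` helper is just a hand-written version of the standard Pearson coefficient, which ranges from -1 (values move in opposite directions) to +1 (values move together).

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Invented daily figures, for illustration only.
temperature_c = [18, 21, 24, 27, 30, 33, 35]
rained = [1, 1, 0, 1, 0, 0, 0]  # 1 = it rained that day
ice_creams_sold = [40, 48, 75, 70, 110, 125, 140]

temp_corr = pearson(temperature_c, ice_creams_sold)
rain_corr = pearson(rained, ice_creams_sold)

print(f"temperature vs sales: {temp_corr:+.2f}")  # positive: they grow together
print(f"rain vs sales:        {rain_corr:+.2f}")  # negative: the opposite direction
```

If you already have the data in a pandas DataFrame, `DataFrame.corr()` computes the same Pearson coefficient by default.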

What is exploratory analysis?

So it looks like the data scientist is grappling with a question without quite knowing what they are looking for. Does this have to do with a term that is used a lot, exploratory analysis?

That term was coined quite recently, by the way, you know? Less than a century ago, around 50 years ago. It was coined to separate two parts of studying data. One part is testing a theory I already have, a hypothesis I want to check. If my theory is that we sell more when it's hot, I can run a test for that. Or, for example, if the theory is that a medicine cures the flu, I can run a test for that.

That is one phase I can work on. But before it there can be another phase, which is simply looking at the data and seeing what you find there: this is the exploratory analysis phase. In it you can find a lot of things you didn't expect and raise questions based on the data you've looked at. Once you have raised those questions, intuitions and hypotheses, you put them to the test, build models and so on.
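The two phases can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sales numbers are invented, and the test shown is an exact permutation test, which is one of several ways to check the "hot days sell more" hypothesis, not necessarily the one you would pick in practice.

```python
from itertools import combinations
from statistics import mean

# Invented daily sales, split by whether the day was hot. Illustration only.
hot_days = [110, 125, 140, 118, 132]
cold_days = [40, 48, 75, 70, 62]

# Phase 1, exploratory: just look at simple summaries and see what stands out.
print("mean on hot days: ", mean(hot_days))
print("mean on cold days:", mean(cold_days))

# Phase 2, testing the hypothesis "hot days sell more".
# If temperature did not matter, the hot/cold labels would be arbitrary, so we
# try every way of picking 5 "hot" days out of the 10 and count how often the
# relabeled hot group sells at least as much as the real one did.
pooled = hot_days + cold_days
observed_hot_total = sum(hot_days)

count_all = 0
count_as_extreme = 0
for indices in combinations(range(len(pooled)), len(hot_days)):
    relabeled_hot_total = sum(pooled[i] for i in indices)
    count_all += 1
    if relabeled_hot_total >= observed_hot_total:
        count_as_extreme += 1

p_value = count_as_extreme / count_all
print(f"{count_as_extreme} of {count_all} relabelings, p = {p_value:.4f}")
```

A small p-value means the observed gap would be rare if temperature played no role, which supports the hypothesis without proving a cause on its own.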

Real example of exploratory analysis

Another example is an online school with several courses and many different people.

People take courses and study, but because the platform is online and can be used at any time, there are people who study once a week, twice a week, three times a week, or only once in a while, with or without a rhythm. You see everything. And if you look at the data, one of the things you notice is that those who visit the platform twice a week or more have a course completion rate, in the short, medium and long term, that is totally different from those who visit less often, whose rate is lower.

This reminds me a little of the English courses we take as kids: you go on Mondays and Wednesdays, or Tuesdays and Thursdays. You don't just show up whenever you want. Having a rhythm helps you keep up the work in the medium to long term, so we see a correlation between having a rhythm and reaching a medium-to-long-term goal such as completing several courses.

Now, whether this is a causal relationship or just a correlation, we have to run some tests to find out.
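Here is a rough sketch of what that look at the data could be in Python. The student records, the field layout and the `rhythm_group` helper are all hypothetical, made up only to show the shape of the comparison: split students by visit frequency and compare completion rates.

```python
from collections import defaultdict

# Hypothetical records: (student id, average visits per week, completed the course?)
students = [
    ("s01", 0.5, False),
    ("s02", 1.0, False),
    ("s03", 1.0, True),
    ("s04", 2.0, True),
    ("s05", 2.5, True),
    ("s06", 3.0, False),
    ("s07", 3.0, True),
    ("s08", 0.2, False),
]

def rhythm_group(visits_per_week):
    """Bucket a student by the twice-a-week threshold mentioned in the text."""
    return "2+ visits/week" if visits_per_week >= 2 else "under 2 visits/week"

totals = defaultdict(int)
completed = defaultdict(int)
for _student_id, visits, done in students:
    group = rhythm_group(visits)
    totals[group] += 1
    completed[group] += int(done)

completion_rate = {group: completed[group] / totals[group] for group in totals}
for group, rate in completion_rate.items():
    print(f"{group}: {rate:.0%} completed")
```

A gap between the two groups is the kind of pattern exploratory analysis surfaces. It shows a correlation, not yet a cause.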

What is Cause and Correlation?

Sometimes the correlation alone is a good enough sign for us to think: "Look, as an educational institution, let's try to engage people to use the platform more, because they will finish their courses and get more out of them." Or at least that is what we think. It's a hypothesis.

Of course, there is a famous saying that correlation is not causation, but it is still a good sign. That said, if you look it up, there is a site called Spurious Correlations.

The site also has a book, I think under the same name, with super cool examples of fascinating correlations, for example, the years in which the US stock market rose compared with the years in which Nicolas Cage released a film.
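One common way such correlations appear is when two unrelated things simply both trend upward over the years. The sketch below uses two invented yearly series (they are not the stock market or film data) and compares the correlation of the raw levels with the correlation of the year-over-year changes. The `pearson` and `yearly_changes` helpers are written here only for the illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def yearly_changes(values):
    """Difference between each year and the previous one."""
    return [later - earlier for earlier, later in zip(values, values[1:])]

# Two invented yearly series that both happen to drift upward.
series_a = [10, 12, 15, 17, 20, 22, 25, 27]
series_b = [3, 5, 4, 6, 5, 7, 6, 8]

level_corr = pearson(series_a, series_b)
change_corr = pearson(yearly_changes(series_a), yearly_changes(series_b))

print(f"correlation of levels:  {level_corr:+.2f}")
print(f"correlation of changes: {change_corr:+.2f}")
```

In this made-up case the levels look strongly related, while the changes move in opposite directions, which is a hint that the shared upward trend is doing the work rather than any real link.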

Is Python the new Excel?

Excel is friendlier to the end user. Writing formulas nested one inside the other is not at all trivial, and neither is Excel's functional, reactive way of working, but I still think Excel is easier to learn the first time.

Data Science languages and libraries: R, Python, Pandas

And where do these names and keywords, Python, R and Pandas, which are so present in the data scientist's daily life, come in?

So, there is Excel, which lets us work with data in a spreadsheet and which I think is super cool because it makes the information easy to visualize. But we can also describe the work as instructions that tell the computer what to do, in an imperative way, which is generally the style we end up using in these other languages. That gives you Python, R and other languages as alternatives to Excel. There are other tools too, of course.

With them you get perhaps more control and perhaps the ability to go deeper more easily, at the cost of struggling a little with the language. But these are alternatives: Excel, R and Python each have their advantages and disadvantages.
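To show what "telling the computer what to do" looks like, here is a small sketch with invented rows. In a spreadsheet with temperatures in column B and sales in column C, you might get the same number with a single formula such as `=AVERAGEIF(B2:B8,">=25",C2:C8)`. The cell ranges and the 25-degree threshold are assumptions for this example.

```python
# Rows as they might appear in a spreadsheet: (day, temperature in C, units sold).
# Invented values, for illustration only.
rows = [
    ("Mon", 18, 40),
    ("Tue", 21, 48),
    ("Wed", 24, 75),
    ("Thu", 27, 70),
    ("Fri", 30, 110),
    ("Sat", 33, 125),
    ("Sun", 35, 140),
]

HOT_THRESHOLD_C = 25  # assumed cut-off for a "hot" day

# Imperative style: spell out each step the computer should take.
total_sold = 0
hot_day_count = 0
for _day, temperature, sold in rows:
    if temperature >= HOT_THRESHOLD_C:
        total_sold += sold
        hot_day_count += 1

average_on_hot_days = total_sold / hot_day_count
print(f"average units sold on hot days: {average_on_hot_days}")
```

The spreadsheet version is shorter, while the code version makes every step explicit and easy to change, which is the trade-off between the two approaches.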

Within the Python world you have, for example, Pandas, a library that basically everyone uses, and Jupyter, which is basically a space for exploration and testing. But if you look around, you'll see people using that same Jupyter not just for testing but to run things for real, like Netflix, which uses a Jupyter cluster to run its machine learning algorithms.

So you can use these tools for other things too. Pandas is probably the main Python library for this work. There is NumPy as well, but it is more numerical. Pandas is the main library in one of these languages.