Let’s use a simple and free dataset to simulate what analyzing a real-world dataset will be like from start to finish.
Hopefully, this is something you can model a portfolio piece off of or even use as a portfolio piece if you honestly understand the code and what’s happening.
Also as a disclaimer about the article’s title, you can start in 10 minutes, but good data analysis takes time to complete.
Why use R to analyze data?
In my career so far, I have used C++, Python, MATLAB, JavaScript, SQL, and of course, R.
While I’ve never actually used JavaScript to analyze data, I have used the other languages I mentioned to do so, and I prefer R by far for analyzing, manipulating, and creating visuals from a raw dataset.
There are some instances where R wouldn’t be appropriate or necessary, but the same could be said for any programming language, and we’re talking about ANALYZING DATA, not making web apps.
When it comes to importing and manipulating raw datasets, R is the most intuitive to use and has the most useful native functions. R centers on the idea of working with data frames (and tibbles, their tidyverse counterpart), which are essentially large tables of data. Basic filtering, sorting, data mapping, and merging can each be as simple as writing a single line of code. The same goes for creating visually pleasing graphics and reports. Using ggplot2 and RMarkdown, which can produce PDFs through LaTeX, makes fantastic-looking reports and graphics with very little effort.
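To make the "single line of code" idea concrete, here is a minimal sketch using only base R. The `sales` and `stores` data frames and all of their values are made up for illustration; they are not from any real dataset.

```r
# Hypothetical sales data, invented purely for illustration
sales <- data.frame(
  store_id = c(1, 2, 3, 1, 2),
  units    = c(10, 4, 7, 12, 9),
  price    = c(2.5, 3.0, 2.5, 2.5, 3.0)
)
stores <- data.frame(
  store_id = c(1, 2, 3),
  city     = c("Austin", "Boston", "Chicago")
)

# Filtering: keep rows with more than 5 units sold
big_sales <- sales[sales$units > 5, ]

# Sorting: order rows by units, largest first
sorted_sales <- sales[order(sales$units, decreasing = TRUE), ]

# Mapping: derive a new column from existing ones
sales$revenue <- sales$units * sales$price

# Merging: attach each store's city using the shared store_id key
sales_with_city <- merge(sales, stores, by = "store_id")

print(sales_with_city)
```

Packages such as dplyr offer their own verbs for the same operations. I stuck to base R here only so the sketch runs without installing anything.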
I could write about why R is the superior choice all day, but that will be a different article. Let’s go ahead and start using R to work with some data.
If you are a complete beginner and need help installing and setting up R, check out my visual guide here.
Analyzing Data Overview: 5 major steps
In this section, I will describe each step of the data analysis lifecycle in general terms. If you want to skip to the reproducible example, go to the section titled “Analyzing Data: Step-by-Step Example”.
1. Import and cleanse your data
The first thing you will need to do is import your data into the R environment. Common file types you might encounter will be CSV, XLSX, or maybe JSON. In all honesty, don’t be intimidated by different file types and think you can’t manipulate them.
Ultimately, every text-based data file is just data plus some delimiting characters. Delimiting characters are characters that mark the boundary between different pieces of data. For CSV files, these are commas and line breaks. You can open a text-based file such as a CSV or JSON in the Notepad app to see which delimiting characters it uses.
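You can also peek at the delimiters from inside R. The sketch below writes a tiny, made-up CSV to a temporary file, reads it back as raw text, and splits it on the delimiters by hand. The file contents are an assumption for illustration only.

```r
# Write a tiny, made-up CSV file to a temporary location
csv_path <- tempfile(fileext = ".csv")
writeLines(c("name,age", "Ana,34", "Ben,28"), csv_path)

# Read the file as raw text: one element per line, so the line break
# has already acted as the record delimiter
raw_lines <- readLines(csv_path)
print(raw_lines)

# Split each line on commas, the field delimiter
fields <- strsplit(raw_lines, ",", fixed = TRUE)
print(fields[[2]])
```

In practice you would let an import function do this splitting for you. Doing it by hand once is just a way to see that there is no magic in the file format.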
For the most part, there will be a specific function in R designed to read a specific file type. You will need to choose the right one for the job. See the table below to help you find the right package and function.
Data Import with R Summary Table
File Type | Package Name(s) | Import Function | Import Documentation | Export Function | Export Documentation |
.xlsb | readxlsb | read_xlsb() | docs | N/A | N/A |
.xls/.xlsx | readxl, writexl | read_xls(), read_excel() | docs | write_xlsx() | docs |
.csv | readr, utils | read_csv(), read.csv() | docs | write_csv(), write.csv() | docs |
.xml | XML, methods | xmlParseDoc() | docs | saveXML() | docs |
.json | rjson | fromJSON() | docs | toJSON(), write() | docs |
.txt | utils | read.table() | docs | write.table() | docs |
I wrote more in-depth about importing specific data file types that you can see here.
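As a quick illustration of the import and export functions in the table, here is a minimal round trip using only the utils functions that ship with R. The `orders` data frame is hypothetical, and the files go to a temporary folder so nothing on your machine is overwritten.

```r
# Hypothetical data frame standing in for a real dataset
orders <- data.frame(
  order_id = 1:3,
  amount   = c(19.99, 5.50, 42.00)
)

# CSV: export, then import again
csv_path <- tempfile(fileext = ".csv")
write.csv(orders, csv_path, row.names = FALSE)
orders_csv <- read.csv(csv_path)

# Tab-delimited text: export, then import again
txt_path <- tempfile(fileext = ".txt")
write.table(orders, txt_path, sep = "\t", row.names = FALSE)
orders_txt <- read.table(txt_path, header = TRUE, sep = "\t")

# Inspect the column names and types that R inferred on import
str(orders_csv)
str(orders_txt)
```

Checking the result with `str()` right after importing is a cheap way to catch columns that were read with the wrong type.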
2. Define what you would like to understand from the data
Here is where the real art of data analysis comes in and this step separates the amateurs from quality analysts.
This step is a bit tricky to describe generally and is really best illustrated with an example so definitely keep reading below to see the example I provide. Generally speaking, however, you want to understand the key analytics needed to make a business decision and you need to understand the limitations of the dataset you have.
So I encourage you to stop what you’re doing and ask yourself each one of the following questions. I also want you to write out your answers. This step is the absolute most important part of data analysis so please don’t rush through this.
- What business decision is my client trying to make?
- What are some statistics that could best inform my client’s business decision?
- What statistics can I generate from my dataset?
- Is there any overlap between what I can generate and the statistics that would best inform my client?
- What graphics can I show that might best convey my statistics?
Again I am going to emphasize that I want you to take your time here. Carefully considering how to do the analysis and understanding the business situation is more important than knowing how to get a correlation between two values.
Your job is NOT to generate a bunch of statistics. Your job is to identify the relevant statistics to inform your client’s business decision. The relevant statistic could be something as simple as an average value. Do not think you need to use the most complicated statistics you know, because more often than not these statistics are less intuitive and can be confusing for your client.
3. Produce statistics from your data
NOTE: I was tempted to make “Exploratory Analysis” an additional step, but will just put it into this step for simplicity. I will spare exact details of implementation for the example in Part 2 of this miniseries.
Ready. Aim. Fire.
Ok so now you have imported your data and decided which statistics to produce that might be relevant to your analysis. You are ready to start producing your statistics.
Not every statistic you produce will be valuable. Sometimes it pays to check for correlations and trends in your data whether you expect them to exist or not. Sometimes you will find the real gems in a dataset by checking something that you expected to be obvious.
This could mean checking for differences in sales by day of the week or time of day. This could mean seeing if the user demographics have any significant impact on user engagement and app feature use.
You want to look at many different aspects of the data to “sniff out” where the actionable data insights are. While you are producing statistics try and keep the relevant business decisions your client needs to make in mind.
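Here is a rough sketch of what this kind of exploration might look like in base R. The `sales` data frame, its column names, and every number in it are invented for illustration.

```r
# Hypothetical daily sales figures, invented for illustration only
sales <- data.frame(
  day     = c("Mon", "Tue", "Mon", "Tue", "Wed", "Wed"),
  revenue = c(100, 150, 120, 130, 90, 110),
  visits  = c(20, 30, 25, 27, 18, 21)
)

# Average revenue for each day of the week
avg_by_day <- aggregate(revenue ~ day, data = sales, FUN = mean)
print(avg_by_day)

# Pearson correlation between store visits and revenue
visit_revenue_cor <- cor(sales$visits, sales$revenue)
print(visit_revenue_cor)

# Quick overview of every column
print(summary(sales))
```

With only six rows, none of these numbers would mean much on a real project. The point is only the shape of the workflow: group, summarize, and check relationships.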
4. Interpret the Statistics
Alright!
So now you have some statistics: averages, correlations, percentages, and maybe some retention rates thrown in there.
Now we need to craft a story that we can communicate to the client. Don’t exaggerate or say more than the statistics show. You will need to be careful with your language here, because any claim that you make must have numbers to back it up.
Ignore any obvious values or statistics that don’t help your client with their business goals. Your client doesn’t need to hear the noise.
Try and write out in words what the statistics are telling you. Use the statistics you produced in the sentences you write.
Can even a non-tech or non-math person understand your explanation?
If they can’t, you need to go back and edit your sentences.
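One way to keep your sentences tied to your numbers is to build them directly from the values you computed. This is only a sketch, and the two averages below are hypothetical stand-ins for statistics you would have produced in the previous step.

```r
# Hypothetical numbers standing in for statistics computed earlier
avg_weekday <- 112.5
avg_weekend <- 148.0

# Relative difference between the two averages, as a percentage
pct_diff <- (avg_weekend - avg_weekday) / avg_weekday * 100

# Build the sentence from the numbers so the text cannot drift from the data
finding <- sprintf(
  "Average weekend revenue was $%.2f, about %.0f%% higher than the weekday average of $%.2f.",
  avg_weekend, pct_diff, avg_weekday
)
cat(finding, "\n")
```

If the data changes, the sentence updates with it, which makes it harder to accidentally claim more than the statistics show.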
5. Produce relevant visualizations and communicate what you find
The final step of analysis is communicating what you find back to your client.
It doesn’t matter how innovative or interesting your findings are if you can’t communicate this in a way your client understands.
You should have a much better idea now of what relevant statistics you have and want to show your client. So how can you present this visually?
Humans for the most part are very visual. Telling someone about a tragic car accident doesn’t evoke the same response as seeing a picture of the carnage at the scene. This example might be extreme, but the same idea applies to presenting your analysis.
Try and create at least one graphic for each distinct finding of your analysis. It is sometimes tricky to know which type of graph or plot to use. Trends are typically best shown with line charts, and distributions pair well with histograms or bar plots. Choosing the right plot can be more of an art than a science, so don’t worry about making mistakes early on. You’ll get a feel for it.
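As a small sketch of the two most common cases, the code below draws a line chart for a trend and a histogram for a distribution. I used base R graphics so it runs without extra packages (ggplot2 would work just as well), and both datasets are made up for illustration.

```r
# Hypothetical data, invented for illustration only
monthly_sales <- c(120, 135, 150, 145, 170, 190)
set.seed(42)  # make the random sample reproducible
order_values <- rnorm(200, mean = 50, sd = 10)

# Draw both plots into a temporary PDF file, one plot per page
plot_path <- tempfile(fileext = ".pdf")
pdf(plot_path)

# Trend over time: line chart
plot(seq_along(monthly_sales), monthly_sales, type = "l",
     xlab = "Month", ylab = "Sales", main = "Monthly sales trend")

# Distribution of a single variable: histogram
hist(order_values, xlab = "Order value",
     main = "Distribution of order values")

# Close the device so the file is written to disk
dev.off()
```

If you run this interactively and just want to see the plots, you can drop the `pdf()` and `dev.off()` lines.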
You might be worn out by now, but the job is not yet complete. Don’t get lazy. As a data professional, taking the time to create clean, illustrative graphics is part of your job.
Is R too hard to learn for a non-tech person? Check out my breakdown of the time commitment required to develop employable skills with R here.
Tips
1. Compile your report with RMarkdown
This is something that I like to do with my reports as I already use R to analyze the data and produce graphics.
RMarkdown is an R package, built on Markdown, that allows you to create presentation-ready reports that will impress any client. When you generate PDF output, it relies on LaTeX behind the scenes. LaTeX is a document preparation system created in the 1980s by Leslie Lamport after he was frustrated with the options available at the time. LaTeX produces publication-quality documents and is a joy to use.
RMarkdown allows you to write code and generate graphics within an RMarkdown file and create reports in PDF, Word, or standalone HTML format.
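To give you a feel for it, here is a minimal sketch that writes a tiny RMarkdown file and renders it to HTML. The report title and file name are hypothetical. The rendering step assumes you have the rmarkdown package and Pandoc installed, so the code checks for both and skips rendering if either is missing.

```r
# Build the three-backtick chunk delimiter programmatically so it does not
# clash with the code fence around this example
fence <- strrep("`", 3)

# Lines of a minimal, hypothetical RMarkdown document
rmd_lines <- c(
  "---",
  "title: \"Sales Report\"",
  "output: html_document",
  "---",
  "",
  "Average speed in the built-in cars dataset: `r mean(cars$speed)`.",
  "",
  paste0(fence, "{r speed-plot, echo=FALSE}"),
  "plot(cars)",
  fence
)

# Save the document to a temporary folder
rmd_path <- file.path(tempdir(), "report.Rmd")
writeLines(rmd_lines, rmd_path)

# Render only if the rmarkdown package and Pandoc are both available
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  rmarkdown::render(rmd_path, quiet = TRUE)
}
```

Normally you would write the .Rmd file by hand in your editor and click Knit in RStudio. Generating it from a script here just keeps the example self-contained.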
Can you make a full-time income with only R? The answer is yes, and check out my article on earning an income with R to see how it can realistically be done.
Conclusion
The process of data analysis is much more than just mindlessly computing statistics. Use this general process to successfully tackle any project that might come your way 🙂
Also, look out for part 2 of this miniseries where I apply these steps to a real-life example. You can follow along with the example and perhaps even use it for your own portfolio afterward.
Thanks for reading, and look out for more content from RTL Coding. I hope this article was useful. If you would like to support this blog, feel free to donate via PayPal to help fund more helpful content.