“I don’t Like Cricket…I love it”
Web Scraping Meets The Tidyverse
Introduction
Cricket is a bat-and-ball game played between two teams of eleven players on a field at the center of which is a 20-meter (22-yard) pitch with a wicket at each end, each comprising two bails balanced on three stumps. The batting side scores runs by striking the ball bowled at the wicket with the bat, while the bowling and fielding side tries to prevent this and dismiss each player (so they are “out”) (Wikipedia).

Motivation
Cricket’s quadrennial showpiece, the ICC Cricket World Cup 2019, began in the ‘home’ of cricket, England, earlier this month. There was a lot of hype about the latest edition of the World Cup before it even began. ESPNcricinfo has become a popular website for cricket fans to access data on all cricket games and players. The motivation behind this blog is to show how to scrape useful information off a website and generate some basic insights from it with the help of R. In this first blog I present an extensive exploration of the “International cricket results from 1971 to 2019” data set. This data set lists all the international cricket matches (as in games between opposing countries or nations) from 1971 to early 2019.
More specifically, this blog will cover the following:
- We’ll first learn how you can scrape ESPNCricinfo.com to gather different teams’ records to date
- Then, we’ll see some basic techniques to extract information off of one page: we’ll extract the playing teams, winner, margin, ground, match date and scoreboard for all One Day International (ODI) team records by year
- And the individual team scores of all the ODIs on a subpage for all matches by year
- With these tools at hand, you’re ready to step up your game and compare the matches of different cricket teams (of our own choice): we’ll see how you can make use of tidyverse packages such as ggplot2 for plotting and dplyr, in combination with stringr, to inspect the data further and to formulate a hypothesis for further investigation and statistical inference that follows the philosophy of the tidyverse.
Web Scraping ESPNCricinfo.com: rvest
Step 1: Preparations
Step 2: Scrape The Team Records
Step 3: Extract Scorecard URLs
Step 4: Scrape The Scorecards
Step 1: Preparations
To begin with, I made a vector of the years I wanted to scrape.
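A minimal sketch of this preparation step might look like the following. The URL pattern and its query parameters (`class=2` for ODIs, `type=year`) are assumptions about how the site addresses its yearly results pages, and the names `years`, `base_url` and `year_urls` are illustrative.

```r
library(tibble)

# Years covered by the data set described above.
years <- 1971:2019

# Assumed URL pattern for the yearly ODI match-results page.
base_url <- "http://stats.espncricinfo.com/ci/engine/records/team/match_results.html"

# One row per year, with the page to scrape for that year.
year_urls <- tibble(
  year = years,
  url  = paste0(base_url, "?class=2;id=", years, ";type=year")
)
```

Keeping the year next to its URL in one data frame makes it easy to carry the year through the later scraping steps.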
Step 2: Scrape The Team Records
In the next step, I applied the scraping functions to the list of URLs I generated earlier. To do this, I used the map() function from the purrr package to apply the `rvest` functions to each row of the year-URL data frame.
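As a rough sketch of this step, the snippet below maps a small table-reading helper over a list of pages. The helper name `read_results` is hypothetical, and the inline HTML is a one-row stand-in so that the example runs offline; in practice each page would come from `read_html(url)`, and the real page layout may need a more specific selector than the first `table`.

```r
library(rvest)
library(purrr)
library(dplyr)

# Hypothetical helper: read the first table on a results page into a tibble.
read_results <- function(page) {
  page %>%
    html_element("table") %>%
    html_table()
}

# Stand-in page (illustrative row) so the sketch runs without network access.
sample_page <- read_html('<table>
  <tr><th>Team 1</th><th>Team 2</th><th>Winner</th><th>Margin</th>
      <th>Ground</th><th>Match Date</th><th>Scorecard</th></tr>
  <tr><td>Australia</td><td>England</td><td>Australia</td><td>5 wickets</td>
      <td>Melbourne</td><td>Jan 5, 1971</td><td>ODI # 1</td></tr>
</table>')

# One parsed page per year; the list names become the "year" column.
pages <- list(`1971` = sample_page)
team_records <- map_dfr(pages, read_results, .id = "year")
```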
Step 3: Extract Scorecard URLs
Extract the URLs of the scorecards from the href attribute. Then take that list of URLs, scrape the data I was looking for, and stick it into a data frame after some preprocessing.
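Pulling the links out of a page could be sketched as below. The `/engine/match/` path fragment, the host name and the sample links are assumptions used for illustration; the selector would need to be checked against the real page.

```r
library(rvest)

# Stand-in page with one team link and one scorecard link (illustrative hrefs).
sample_page <- read_html('<table>
  <tr><td><a href="/ci/content/team/2.html">Australia</a></td>
      <td><a href="/ci/engine/match/64148.html">ODI # 1</a></td></tr>
</table>')

# Keep only anchors whose href contains the assumed scorecard path fragment.
scorecard_paths <- sample_page %>%
  html_elements("a[href*='/engine/match/']") %>%
  html_attr("href")

# Turn the relative paths into absolute URLs (assumed host).
scorecard_urls <- paste0("http://stats.espncricinfo.com", scorecard_paths)
```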
Step 4: Scrape The Scorecards
I mapped the rvest functions over the scorecard URLs. Since this was a large number of URLs, I used a progress bar from the progress package.
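One way to combine `map()` with a progress bar is sketched here. The wrapper name `scrape_with_progress` and its `scrape_one` argument are hypothetical; the stand-in scraper only upper-cases its input so that the example runs offline.

```r
library(purrr)
library(progress)

# Hypothetical wrapper: apply `scrape_one` to every URL and tick a progress bar.
scrape_with_progress <- function(urls, scrape_one) {
  pb <- progress_bar$new(
    total  = length(urls),
    format = "[:bar] :current/:total"
  )
  map(urls, function(u) {
    pb$tick()      # advance the bar by one step
    scrape_one(u)  # do the actual work for this URL
  })
}

# Stand-in scraper; a real one would call read_html() and parse the scorecard.
results <- scrape_with_progress(c("a", "b", "c"), function(u) toupper(u))
```

For a real run it is usually worth adding a short `Sys.sleep()` inside the loop so the site is not hit with requests too quickly.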
Data Cleaning and Wrangling With Tidyverse
One of the big issues when working with data in any context is cleaning and merging datasets, and that was the case here because I collated data from several ESPNCricinfo.com pages. There are myriad ways in which R can be used for data wrangling, but I relied heavily on the tidyverse. I used tidyr and dplyr for different data wrangling and reshaping tasks. Finally, I wrote a convenience function that takes the scoreboard data frame as input, extracts the scores of all games and binds them into one tibble. I then applied the map function to get a data frame with the needed information for the team records. The resulting data frames were joined after some processing.
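The cleaning and joining could look roughly like this sketch. The column names (`scorecard`, `team`, `score`), the `runs/wickets` score format and the helper name `clean_scores` are all assumptions, and the two small tibbles are illustrative stand-ins for the scraped data.

```r
library(dplyr)
library(stringr)
library(tibble)

# Illustrative stand-ins for the scraped team records and scorecard scores.
team_records <- tibble(scorecard = "ODI # 1", winner = "Australia")
scores <- tibble(
  scorecard = c("ODI # 1", "ODI # 1"),
  team      = c("England", "Australia"),
  score     = c("190", "191/5")
)

# Hypothetical helper: split a raw score string into runs and wickets.
clean_scores <- function(scores) {
  scores %>%
    mutate(
      runs    = as.integer(str_extract(score, "^\\d+")),
      # Wickets stay NA when the score has no "/n" part.
      wickets = as.integer(str_extract(score, "(?<=/)\\d+"))
    )
}

# Join the cleaned scores onto the match-level records.
match_data <- team_records %>%
  left_join(clean_scores(scores), by = "scorecard")
```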
Visual Data Exploration With Tidyverse
Data visualisation is a critical tool in the data analysis process. Visualisation tasks can range from generating fundamental distribution plots to understanding the interplay of complex influential variables in machine learning algorithms. With the dataset created, I will visualise the distribution of the ODI matches played over the years. I will use ggplot2 to create the main graphic, along with some plots looking at trends in losing and winning for the top-ranked cricket teams.
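A minimal sketch of the matches-per-year plot, assuming a data frame with one row per match and a `year` column, might be the following. The toy years below are made up purely so the example runs; they are not real match counts.

```r
library(dplyr)
library(ggplot2)
library(tibble)

# Toy data: one row per match (made-up years, for illustration only).
matches <- tibble(year = c(1971, 1972, 1972, 1972, 1973, 1973))

# Count the matches played in each year.
matches_per_year <- matches %>%
  count(year, name = "matches")

# Bar chart of matches per year.
p <- ggplot(matches_per_year, aes(x = year, y = matches)) +
  geom_col() +
  labs(x = "Year", y = "Number of ODIs", title = "ODI matches played per year")
p
```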
Number Of Matches Per Year
The International Cricket Council (ICC) gives points to all the teams based on their performances in different tournaments and bilateral series. These points are then used to rank the teams. The ranking helps maintain healthy competition among the countries, which keep fighting for victories. Based on the ranking, I have taken only the top 10 teams into account for exploration and analysis.

Number Of Matches For Top Teams

World Cup 2019 Playing Teams
The British Empire had been instrumental in spreading cricket overseas, and by the middle of the 19th century it had become well established in Australia, the Caribbean, India, New Zealand, North America and South Africa. However, I am going to focus only on the teams which have qualified for the World Cup 2019: Afghanistan, Australia, Bangladesh, England, India, New Zealand, Pakistan, South Africa, Sri Lanka and West Indies.
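Restricting the data to these teams could be sketched as below. The `team_1` and `team_2` column names and the three toy fixtures are assumptions for illustration; only the list of qualified teams comes from the text above.

```r
library(dplyr)
library(tibble)

# The ten teams that qualified for the World Cup 2019.
world_cup_teams <- c(
  "Afghanistan", "Australia", "Bangladesh", "England", "India",
  "New Zealand", "Pakistan", "South Africa", "Sri Lanka", "West Indies"
)

# Toy fixtures (illustrative only).
matches <- tribble(
  ~team_1,   ~team_2,
  "India",   "Pakistan",
  "Ireland", "Zimbabwe",
  "England", "Ireland"
)

# Keep matches involving at least one World Cup team.
wc_matches <- matches %>%
  filter(team_1 %in% world_cup_teams | team_2 %in% world_cup_teams)
```

Whether to keep matches with at least one qualified team or only matches between two qualified teams is a design choice; swapping `|` for `&` gives the stricter version.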

Which World Cup Playing Team Has The Best Win Ratio
In ODIs, it is India who lead the pack. West Indies, South Africa, England and Bangladesh have also recorded wins in the format and, together with India, make up the top 5 of the list. Every team has now played at least once, but predicting which country is going to win based on the performances we have seen thus far is still far-fetched. However, both India and England seem good contenders for the title, provided they can keep up their current playing form.
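A win ratio of this kind can be computed as wins divided by matches played. The sketch below assumes one row per match with `team_1`, `team_2` and `winner` columns; the team names and results are placeholders, not real outcomes.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Placeholder results: one row per match.
results <- tribble(
  ~team_1,  ~team_2,  ~winner,
  "Team A", "Team B", "Team A",
  "Team A", "Team C", "Team C",
  "Team B", "Team C", "Team C"
)

win_ratio <- results %>%
  # Reshape to one row per team per match.
  pivot_longer(c(team_1, team_2), names_to = "side", values_to = "team") %>%
  group_by(team) %>%
  summarise(
    played    = n(),
    won       = sum(winner == team),  # ties and no-results count as not won
    win_ratio = won / played
  ) %>%
  arrange(desc(win_ratio))
```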

Conclusions:
We identified how web scraping can be split into different phases, each with its own challenges to be tackled: the site analysis phase, the data analysis and design phase and the production phase. In each of these phases we mentioned a number of activities to be carried out and questions to be answered before going to the next phase. In this blog we have seen how the rvest package of R was applied to web scraping for statistics. We have shown how web scraping can be used to explore background variables and to retrieve metadata, and how well it combines with the tidyverse packages of R. We have looked at the ICC’s best ODI cricket teams, which are the ones with the highest win ratios. We have provided plausible insight for predicting the likely winner of the `ICC Cricket World Cup 2019`, which I am going to take up in the next blog. Stay tuned.
References:
“I don’t Like Cricket…I love it!”: 10cc — Dreadlock Holiday
“Cricket”. Wikipedia, International Cricket Council (ICC)
“Statistics and records”: ESPNcricinfo.com
“ggplot2”: H Wickham — elegant graphics for data analysis
“Rvest”: H Wickham — Easily harvest (scrape) web pages
“The tidyverse”: H Wickham — R package
“gist-syntax-themes”: https://github.com/lonekorean/gist-syntax-themes