No matter what your interests are, you will most likely find a subreddit with a thriving community for each of them. This also means that the information on some subreddits can be quite valuable, whether for marketing analysis, sentiment analysis or just archival purposes.

Today, we will walk through the process of using an automated web scraping tool to extract all kinds of information from any subreddit. This includes links, comments, images, usernames and more. To achieve this, we will use ParseHub, a powerful and free web scraper that can deal with any sort of dynamic website.

Want to learn more about web scraping? Check out our guide on web scraping and what it is used for.

Reddit and Web Scraping

For this example, we will scrape the r/deals subreddit. We will assume that we want to scrape the posts into a simple spreadsheet for us to analyze.

Additionally, we will scrape using the old reddit layout, mainly because the way links work on that layout makes the page easier to scrape.

1. Download and open ParseHub. This is the web scraper we will use for our project.
2. In ParseHub, click on New Project and submit the URL of the subreddit you’d like to scrape. Make sure you are using the old reddit version of the site.
3. Once submitted, the URL will render inside ParseHub and you will be able to make your first selection.
4. Start by clicking on the title of the first post on the page. It will be highlighted in green to indicate that it has been selected. The rest of the post titles on the page will be highlighted in yellow. Click on the second post title on the page to select them all.
5. On the left sidebar, rename your selection to posts. We have now told ParseHub to extract both the title and link URL for every post on the page.
6. Now, use the PLUS (+) sign next to the posts selection and select the Relative Select command.
7. Using Relative Select, click on the title of the first post on the page and then on the timestamp for the post. An arrow will appear to show the selection.
8. You will notice that this new selection is pulling the relative timestamp (“2 hours ago”) and not the actual time and date on which the post was made. To change this, go to the left sidebar, expand your date selection and click on the extract command. Here, use the drop-down menu to change the extract command to “title Attribute”.
9. Now, repeat the two Relative Select steps above to create new Relative Select commands that extract the posts’ usernames, flairs, number of comments and number of votes.

You might be interested in scraping data from an image-focused subreddit. The method in this guide will extract the URL for each image post. You can then follow the steps in our guide “How to Scrape and Download images from any Website” to download the images to your hard drive.

ParseHub is now set up to scrape the first page of posts of the subreddit we’ve chosen. But we might want to scrape more than just the first page, so we will now tell ParseHub to navigate to the next couple of pages and scrape more posts.

1. Click the PLUS (+) sign next to your page selection and choose the Select command.
2. Using the Select command, click on the “next” link at the bottom of the subreddit page.
3. Expand the next selection and remove the 2 Extract commands created by default.
4. Now click on the PLUS (+) sign on the next selection and choose the Click command.
5. A pop-up will appear asking you if this is a “next page” button. Click “Yes” and enter the number of times you’d like ParseHub to click on it. In this case, we will input 2, which results in 3 full pages of posts being scraped (the first page plus two clicks on “next”). Finally, click on “Repeat Current Template” to confirm.

Scraping reddit comments works in a very similar way. First, we will choose a specific post we’d like to scrape, ideally a thread with a lot of comments. In this case, we will scrape comments from a thread on r/technology that is currently at the top of the subreddit, with over 1000 comments.

1. First, start a new project on ParseHub and enter the URL you will be scraping comments from. (Note: ParseHub will only be able to scrape comments that are actually displayed on the page.)
2. Once the site is rendered, you can use the Select command to click on the first username from the comment thread.
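This guide assumes the scraped posts end up in a simple spreadsheet for analysis. As a rough sketch of that last step, the Python below flattens a JSON export into CSV rows. Everything about the data shape is an assumption made for illustration: the `posts` key mirrors the selection name used in this guide, the field names (`name`, `url`, `username`, `comments`) depend on how you name your own selections, and the sample records are made up. The helper names `comment_count`, `posts_to_rows` and `rows_to_csv` are hypothetical and are not part of ParseHub.

```python
import csv
import io
import json
import re

# Made-up sample standing in for a JSON export of a scraping run.
# The "posts" key mirrors the selection name used in this guide; the
# field names are assumptions and depend on how you named each selection.
SAMPLE_EXPORT = """
{"posts": [
  {"name": "Example deal A", "url": "https://example.com/a",
   "username": "user_one", "comments": "15 comments"},
  {"name": "Example deal B", "url": "https://example.com/b",
   "username": "user_two", "comments": "comment"}
]}
"""

FIELDS = ["title", "url", "username", "comments"]


def comment_count(text):
    """Return the first integer found in text such as '15 comments', else 0."""
    match = re.search(r"\d+", text or "")
    return int(match.group()) if match else 0


def posts_to_rows(export):
    """Flatten the exported posts into one dict per spreadsheet row."""
    rows = []
    for post in export.get("posts", []):
        rows.append({
            "title": post.get("name", ""),
            "url": post.get("url", ""),
            "username": post.get("username", ""),
            "comments": comment_count(post.get("comments")),
        })
    return rows


def rows_to_csv(rows):
    """Serialize the rows to CSV text, starting with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = posts_to_rows(json.loads(SAMPLE_EXPORT))
print(rows_to_csv(rows))
```

Turning the comment label into a number up front makes the column sortable in a spreadsheet. If your export uses different keys, only `posts_to_rows` needs to change.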