Finding data for testing in your own Hadoop projects doesn’t have to be hard!
There are many places to find free data sets to run in your development environments. Check out this video to find out my top 4 places to find Big Data. Spoiler alert: you can also find small data in these places.
YouTube Video
—
Transcript
Hi, and welcome back to thomashenson.com. Have you ever been working in your big data environment and thought, "If we could just have more data to test with, it would be great"? If I could have more data sets, I could test out this new open source tool, or maybe just this new function that you want to run. Today I’m going to talk about my four favorite places to find big data.
Number four on the list is Yahoo, specifically the Yahoo Finance section. You can go in here and look up your favorite stock, or even your favorite mutual fund, and find historic information. What I like to do is come in here and get historic information that will give you daily values on the stock. You can take that data and insert it into HDFS or a database, however you want; there are a lot of different options. This data exports to CSV. It’s really accurate data, but it is limited in scope because you’re only looking at stock values. But if you need a quick fix to get some data, this is where I come first.
Coming in at number three is weather data from NOAA. This data is very accurate, but one of the drawbacks to getting the data, and the reason it’s only number three on the list, is that you have to open an account and make a request: "Hey, for this geographic area, I would like to compile the weather data." So if you’re looking for accurate data, this is a very good site that I would use, but if you’re looking for something quick, this is not going to be something you want to use. Typically you’ll receive the data in less than 24 hours, but just know that it could take a lot longer. That’s why weather data is number three on the list.
Coming in at number two, and a really close favorite to number one, is Tableau’s public website and their sample data sets. This is relatively new to me, but they have a lot of different data sets in a lot of different categories, like government, lifestyle, health, and one of my favorites, sports. The data comes back in Excel or CSV format, so it’s really easy. Another cool thing is you don’t have to log in, so you can just come in, download these data sets, upload them into HDFS, and start playing away. That’s why this is number two on my list: Tableau’s public data sets.
And now for number one on my list of favorite places to find data: Kaggle’s website. Kaggle started off on the scene as just a contest site for data scientists, or amateur data scientists, to test out and solve problems. One of the famous examples was Netflix: there was a contest out there to see if you could beat Netflix’s data scientists at recommending better videos for people. It’s really cool; I think they gave out like a million dollars for the contest. But now this website is more than just a contest site. It actually has data sets, and it’s really a one-stop shop for data scientists, so it’s one of those websites you want to come in and check.
For me, I really like the data sets. Right now you do have to log in to access the data, but you have a vast number of data sets, and if I were stuck on an island and could only have one of these, it would be the Kaggle website, because they’re always updating a lot of different data sets. They have something small and something large. You can see here that you can go through and search, and you can see the latest data that’s been updated. You can search by different features, and like I said, it’s community-driven, so there are always new data sets available. This is why it’s number one on my list.
So, just to recap my four top favorite places: number four was Yahoo’s finance section, number three was the weather data at NOAA, number two and a close favorite was Tableau’s public website, where they have the sample data sets, and number one, the best place, was Kaggle’s data sets. Thanks for tuning in, and be sure to
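Several of the sources in the transcript hand you a CSV file that you then upload into HDFS. As a rough illustration, here is a minimal Python sketch of that workflow: it sanity-checks a CSV and builds the `hdfs dfs -put` command you would run to copy the file into HDFS. The sample rows, column names, file name, and HDFS directory are all made up for the example and are not taken from any of the sites above.

```python
import csv
import io
import shlex

# Hypothetical sample shaped like a daily stock-price export.
# The column names and values are assumptions for illustration only.
SAMPLE_CSV = """Date,Open,High,Low,Close,Volume
2016-01-04,10.00,10.50,9.80,10.20,1000
2016-01-05,10.20,10.80,10.10,10.70,1500
2016-01-06,10.70,10.90,10.30,10.40,1200
"""


def summarize_csv(text):
    """Return (row_count, column_names) for CSV text that has a header row."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    return len(rows), reader.fieldnames


def hdfs_put_command(local_path, hdfs_dir):
    """Build the argument list for copying a local file into an HDFS directory."""
    return ["hdfs", "dfs", "-put", local_path, hdfs_dir]


# Quick check of the download before loading it anywhere.
count, columns = summarize_csv(SAMPLE_CSV)
print(f"{count} rows, columns: {columns}")

# "stock.csv" and "/user/dev/testdata/" are placeholder paths.
print(shlex.join(hdfs_put_command("stock.csv", "/user/dev/testdata/")))
```

On a machine with a Hadoop client installed, the printed command can be pasted into a shell, or the argument list can be passed to `subprocess.run`. Checking the row count and header first is just a cheap way to catch a truncated or mislabeled download before it lands in the cluster.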