Pages

Monday, 2 February 2015

Loading Twitter Data into Hive and processing

In this post, we will load our tweets into Hive and query them to learn about our little world.
To load our tweet-JSON into Hive, we’ll use the rcongiu Hive-JSON-Serde. Download and build it via:
wget http://www.datanucleus.org/downloads/maven2/javax/jdo/jdo2-api/2.3-ec/jdo2-api-2.3-ec.jar
mvn install:install-file -DgroupId=javax.jdo -DartifactId=jdo2-api \
  -Dversion=2.3-ec -Dpackaging=jar -Dfile=jdo2-api-2.3-ec.jar
mvn package
Find the jar it generated via:
find .|grep jar
./target/json-serde-1.1.4-jar-with-dependencies.jar
./target/json-serde-1.1.4.jar
Run hive, and create our table with the following commands:
add jar /path/to/my/Hive-Json-Serde/target/json-serde-1.1.4-jar-with-dependencies.jar;

create table tweets (
   created_at string,
   entities struct <
      hashtags: array ,
            text: string>>,
      media: array ,
            media_url: string,
            media_url_https: string,
            sizes: array >,
            url: string>>,
      urls: array ,
            url: string>>,
      user_mentions: array ,
            name: string,
            screen_name: string>>>,
   geo struct <
      coordinates: array ,
      type: string>,
   id bigint,
   id_str string,
   in_reply_to_screen_name string,
   in_reply_to_status_id bigint,
   in_reply_to_status_id_str string,
   in_reply_to_user_id int,
   in_reply_to_user_id_str string,
   retweeted_status struct <
      created_at: string,
      entities: struct <
         hashtags: array ,
               text: string>>,
         media: array ,
               media_url: string,
               media_url_https: string,
               sizes: array >,
               url: string>>,
         urls: array ,
               url: string>>,
         user_mentions: array ,
               name: string,
               screen_name: string>>>,
      geo: struct <
         coordinates: array ,
         type: string>,
      id: bigint,
      id_str: string,
      in_reply_to_screen_name: string,
      in_reply_to_status_id: bigint,
      in_reply_to_status_id_str: string,
      in_reply_to_user_id: int,
      in_reply_to_user_id_str: string,
      source: string,
      text: string,
      user: struct <
         id: int,
         id_str: string,
         name: string,
         profile_image_url_https: string,
         protected: boolean,
         screen_name: string,
         verified: boolean>>,
   source string,
   text string,
   user struct <
      id: int,
      id_str: binary,
      name: string,
      profile_image_url_https: string,
      protected: boolean,
      screen_name: string,
      verified: boolean>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE;
Load it full of data from the tweet JSON file we created last tutorial:
LOAD DATA LOCAL INPATH '/path/to/all_tweets.json' OVERWRITE INTO TABLE tweets;
Verify our data loaded with a count:
SELECT COUNT(*) from tweets;
OK
24655
Our tweets are loaded! Some fun queries to run:
    • Sample some tweets
SELECT text from tweets limit 5
Which gets us:
OK
Paddled out, tried to psyche myself into wave for 30 minutes...
Waves twice as tall as me are scary
No waves here yet, nap time
Doin 80 on i10w
Gustav and panama city beach here I come
    • Top people we reply to:
SELECT in_reply_to_screen_name, 
  COUNT(*) as total from tweets 
  GROUP BY in_reply_to_screen_name 
  ORDER BY total DESC 
  LIMIT 30;
Which gets us the top N people I reply to:
OK
NULL 13447
sanjay 356
Urvaksh 282
ChrisDiehl 268
pfreet 230
mikeschinkel 222
mmealling 193
keithmcgreggor 191
peteskomoroch 183
semil 183
...
Hive has some builtin n-gram analysis utilities, documented here that we can use. For example:
SELECT sentences(lower(text)) FROM tweets;
[["dear","twitter","send","me","my","tweets","plz","you","promised","me"]]
[["pig","eye","for","the","sql","guy","http","t.co","vjx4rcugix","via","sharethis"]]
[["rt","hortonworks","pig","eye","for","the","sql","guy","with","mortardata","http","t.co","vnkwsswnkv","hadoop"]]
We can use these to do n-gram analysis:
SELECT ngrams(sentences(lower(text), 3, 10) FROM tweets;
Which is kind of amusing:
[{"ngram":["http","instagr.am","p"],"estfrequency":136.0},
{"ngram":["i","want","to"],"estfrequency":100.0},
{"ngram":["on","hacker","news"],"estfrequency":92.0},
{"ngram":["you","have","to"],"estfrequency":66.0},
{"ngram":["a","lot","of"],"estfrequency":65.0},
{"ngram":["i","need","to"],"estfrequency":63.0},
{"ngram":["is","looking","for"],"estfrequency":59.0},
{"ngram":["hortonworks","is","looking"],"estfrequency":59.0},
{"ngram":["there","is","no"],"estfrequency":56.0},{"ngram":["is","there","a"],"estfrequency":53.0}]
You can see common phrases, as well as hortonworks job offerings that are auto-tweeted, and of course – ‘on hacker news’ – talking about Hacker News :)
We can also check out our tweets that are RTs.
SELECT retweeted_status.user.screen_name, COUNT(*) as total 
  FROM tweets 
  WHERE retweeted_status.user is not null 
  GROUP BY retweeted_status.user.screen_name 
  ORDER BY total desc 
  LIMIT 20;
This gets me:
OK
peteskomoroch 99
hortonworks 97
ChrisDiehl 56
newsycombinator 55
newsyc20 38
adamnash 31
bradfordcross 29
kjurney 29
Once we have our tweets in Hive, there’s no limit to what we can do to them! This is what Hive excels at.

Note:- Objective to copy this blog is to make central repo for hadoop developer.